Re: [HACKERS] jsonb generator functions
Andrew Dunstan wrote:
> OK, here is a new patch version that
>
> * uses find_coercion_path() to find the cast function if any, as
>   discussed elsewhere
> * removes calls to getTypeOutputInfo() except where required
> * honors a cast to json only for rendering both json and jsonb
> * adds processing for the date type that was previously missing in
>   datum_to_jsonb

Did this go anywhere?

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Compression of full-page-writes
On 8 December 2014 at 11:46, Michael Paquier <michael.paqu...@gmail.com> wrote:

>> * ideally we'd like to be able to differentiate the types of usage,
>> which then allows the user to control the level of compression
>> depending upon the type of action. My first cut at what those
>> settings should be are ALL > LOGICAL > PHYSICAL > VACUUM.
>>
>> VACUUM - only compress while running vacuum commands
>> PHYSICAL - only compress while running physical DDL commands (ALTER
>> TABLE set tablespace, CREATE INDEX), i.e. those that wouldn't
>> typically be used for logical decoding
>> LOGICAL - compress FPIs for record types that change tables
>> ALL - all user commands
>> (each level includes all prior levels)
>
> Well, that's clearly an optimization so I don't think this should be
> done for a first shot, but those are interesting fresh ideas.

It is important that we offer an option that retains user performance.
I don't see that as an optimisation, but as an essential item. The
current feature will reduce WAL volume, at the expense of foreground
user performance. Worse, that will all happen around the time of a new
checkpoint, so I expect this will have a large impact. Presumably
testing has been done to show the impact on user response times? If
not, we need that.

The most important distinction is between foreground and background
tasks. If you think the above is too complex, then we should make the
parameter into a USET, but set it to on in VACUUM, CLUSTER and
autovacuum.

> Technically speaking, note that we would need to support such things
> with a new API to switch a new context flag in registered_buffers of
> xloginsert.c for each block, and decide if the block is compressed
> based on this context flag, and the compression level wanted.

>> * name should not be wal_compression - we're not compressing all wal
>> records, just FPIs. There is no evidence that we even want to
>> compress other record types, nor that our compression mechanism is
>> effective at doing so.
>> Simple = keep name as compress_full_page_writes
>> Though perhaps we should have it called wal_compression_level
>
> I don't really like those new names, but I'd prefer
> wal_compression_level if we go down that road with 'none' as default
> value. We may still decide in the future to support compression at
> the record level instead of context level, particularly if we have an
> API able to do palloc_return_null_at_oom, so the idea of WAL
> compression is not related only to FPIs IMHO.

We may yet decide, but the pglz implementation is not effective on
smaller record lengths. Nor has any testing been done to show that it
is even desirable.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training Services
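Simon's proposed levels form a strict hierarchy ("each level includes all prior levels"), which makes the decision logic trivial to express. Here is a hedged sketch in plain C — the enum names, the `wal_compression_level` variable, and the op classification are all hypothetical illustrations, not anything from the patch under discussion:

```c
#include <stdbool.h>

/* Hypothetical settings, ordered so each level includes all prior ones. */
typedef enum WalCompressionLevel
{
	WAL_COMPRESS_NONE = 0,
	WAL_COMPRESS_VACUUM,	/* only vacuum commands */
	WAL_COMPRESS_PHYSICAL,	/* + physical DDL (ALTER TABLE SET TABLESPACE, CREATE INDEX) */
	WAL_COMPRESS_LOGICAL,	/* + record types that change tables */
	WAL_COMPRESS_ALL		/* all user commands */
} WalCompressionLevel;

/* Classification of the action currently generating WAL (also hypothetical). */
typedef enum WalOpClass
{
	WAL_OP_VACUUM,
	WAL_OP_PHYSICAL_DDL,
	WAL_OP_LOGICAL_CHANGE,
	WAL_OP_OTHER
} WalOpClass;

static WalCompressionLevel wal_compression_level = WAL_COMPRESS_VACUUM;

/*
 * Decide whether a full-page image should be compressed. Because the
 * enum is ordered from most to least restrictive, "each level includes
 * all prior levels" reduces to a simple comparison.
 */
static bool
fpi_should_compress(WalOpClass op)
{
	switch (op)
	{
		case WAL_OP_VACUUM:
			return wal_compression_level >= WAL_COMPRESS_VACUUM;
		case WAL_OP_PHYSICAL_DDL:
			return wal_compression_level >= WAL_COMPRESS_PHYSICAL;
		case WAL_OP_LOGICAL_CHANGE:
			return wal_compression_level >= WAL_COMPRESS_LOGICAL;
		default:
			return wal_compression_level >= WAL_COMPRESS_ALL;
	}
}
```

With the level set to PHYSICAL, for example, vacuum and physical DDL blocks are compressed while logical changes and other commands are not — which is the inclusion behaviour the proposal describes.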
Re: [HACKERS] advance local xmin more aggressively
On 12/05/2014 05:05 PM, Robert Haas wrote:
> [ reviving a very old thread ]
>
> On Tue, Feb 10, 2009 at 4:11 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>> Alvaro Herrera <alvhe...@commandprompt.com> writes:
>>> For example, maybe we could keep track of counts of snapshots
>>> removed since the last xmin calculation, and only run this routine
>>> if the number is different from zero (or some small positive
>>> integer).
>>
>> I think most of the callers of SnapshotResetXmin already know they
>> removed something. It might be interesting for FreeSnapshot or
>> something nearby to note whether the snapshot being killed has xmin =
>> proc's xmin, and only do the update calculation if so.
>>
>> I still dislike the assumption that all resource owners are children
>> of a known owner. I suspect in fact that it's demonstrably wrong
>> right now, let alone in future (cf comments in PortalRun). If we're
>> going to do this then snapmgr.c needs to track the snapshots for
>> itself. Of course that's going to make the "is it worth it" question
>> even more pressing.
>
> I've run into essentially the same problem Jeff originally complained
> about with a large customer who has long-running transactions that
> make extensive use of cursors. Cursors are opened and closed over
> time, but it is rare for the number open to reach exactly zero, so
> what ends up happening is that the backend xmin does not advance. As
> you can imagine, that doesn't work out well.
>
> The approach I came up with initially was similar to Jeff's: start at
> the topmost resource owner and traverse them all, visiting every
> snapshot along the way. But then I found this thread and saw the
> comment that this might be "demonstrably wrong", referring to the
> comments in PortalRun. Having reviewed those comments, which don't
> seem to have changed much in the last five years, I can't understand
> how they relate to this issue. It's true that the TopTransaction
> resource owner could get swapped out under us during an internal
> commit, but why should SnapshotResetXmin() have to care? It just
> traverses the one that is in effect at the time it gets called. The
> only real danger I see here is that there could be *more than one*
> toplevel resource owner. I wonder if we could solve that problem by
> adding a registry of active toplevel resource owners, so that if we
> have a forest rather than a tree we can still find everything.

I don't immediately see the problem either, but I have to say that
grovelling through all the resource owners seems ugly anyway. Resource
owners are not meant to be traversed like that. And there could be a
lot of them, and you have to visit every one of them. That could get
expensive if there are a lot of resource owners.

BTW, you could easily detect that you haven't seen all the registered
snapshots after traversing the resource owners, as we keep a counter
of them. So you could just fall back to not advancing the xmin if that
happens.

> The problem I see with having snapmgr.c track the snapshots for
> itself is that it is mostly a duplication of bookkeeping which is
> already being done. Since this problem doesn't affect the majority of
> users, it's not desirable to add a lot of extra bookkeeping to cater
> to it - but even if it did affect a lot of users, we still want it to
> be as cheap as possible, and reusing the tracking that resource
> owners are already doing seems like the way to get there.

I would prefer doing separate bookkeeping in snapmgr.c. It seems like
it wouldn't be difficult to do. It has to be cheap, but I don't see
any reason to believe that it'd be more expensive than traversing
through all resource owners. A binary heap or some other priority
queue implementation should work pretty well for this.

> I think there are a couple of things we can do to make this cheaper.
> Suppose we keep track of the oldest xmin of any registered snapshot
> and the number of registered snapshots that have that particular
> xmin. Every time we deregister a snapshot, we check whether it is one
> of the ones with the minimum xmin; if it is, we decrement the count.
> When the count reaches 0, we know that a traversal of all registered
> snapshots is guaranteed to find a newer value to advertise in
> MyProc->xmin; that way, we can arrange to do the work only when it's
> likely to pay off. In most cases this won't happen until the last
> snapshot is unregistered, because our snapshots normally form a
> stack, with newer snapshots having been taken later. But if somebody
> unregisters the oldest snapshot we'll immediately notice and
> recalculate.

Yeah, that's a reasonable optimization. It's a reasonable optimization
even if you do the bookkeeping in snapmgr.c. And that optimization
won't save you in the cases where it doesn't apply. For example, what
if snapshots happen to form a queue, rather than a stack:

DECLARE c1 CURSOR FOR ...;
DECLARE c2 CURSOR FOR ...;
DECLARE c3 CURSOR FOR ...;
...
DECLARE c1000 CURSOR FOR ...;

CLOSE c1;
CLOSE c2;
CLOSE c3;
...
CLOSE c1000;

It's not hard to imagine an application doing that.
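The min-xmin counting scheme Robert describes can be sketched in a few lines. This is a hedged illustration only — `register_snapshot_xmin` and `deregister_snapshot_xmin` are hypothetical names, not snapmgr.c code, and a real implementation would also update the tracked minimum after a recompute:

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/*
 * Hypothetical bookkeeping: track the oldest xmin among registered
 * snapshots and how many snapshots share it.
 */
static TransactionId oldest_xmin = 0;
static int oldest_count = 0;	/* 0 = no snapshots registered */

static void
register_snapshot_xmin(TransactionId xmin)
{
	if (oldest_count == 0 || xmin < oldest_xmin)
	{
		oldest_xmin = xmin;
		oldest_count = 1;
	}
	else if (xmin == oldest_xmin)
		oldest_count++;
	/* xmin > oldest_xmin: no effect on the tracked minimum */
}

/*
 * Returns true when the caller should rescan the remaining snapshots
 * and possibly advance the advertised xmin: only when the last
 * snapshot holding the minimum goes away can the minimum move.
 */
static bool
deregister_snapshot_xmin(TransactionId xmin)
{
	if (xmin == oldest_xmin && --oldest_count == 0)
		return true;			/* recompute oldest_xmin from survivors */
	return false;
}
```

In the stack-like common case the count only hits zero when the last snapshot goes away, so the expensive rescan happens rarely; in Heikki's queue-of-cursors case every CLOSE of the current oldest snapshot triggers a rescan, which is the O(n^2) risk discussed above.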
[HACKERS] Casting issues with domains
Hello all,

We experienced some casting issues with domains. We hit the problem
while querying the information_schema, btw, but here is a simpler test
case:

postgres=# create table test1 (a text);
CREATE TABLE
postgres=# insert into test1 select generate_series(1,10);
INSERT 0 10
postgres=# create index idx1 on test1(a);
CREATE INDEX
postgres=# analyze test1 ;
ANALYZE
postgres=# explain select * from test1 where a = 'toto';
                              QUERY PLAN
-----------------------------------------------------------------------
 Index Only Scan using idx1 on test1  (cost=0.29..8.31 rows=1 width=5)
   Index Cond: (a = 'toto'::text)
(2 lignes)

Now we create a tstdom domain and cast the a column to tstdom in the
view definition:

postgres=# create domain tstdom as text;
CREATE DOMAIN
postgres=# create view test2 as select a::tstdom from test1 ;
CREATE VIEW
postgres=# explain select * from test2 where a='toto';
                        QUERY PLAN
----------------------------------------------------------
 Seq Scan on test1  (cost=0.00..1693.00 rows=500 width=5)
   Filter: (((a)::tstdom)::text = 'toto'::text)
(2 lignes)

As you can see, a is cast to tstdom and then back to text. This double
cast prevents the optimizer from choosing an index scan to retrieve
the data. The casts are strictly equivalent, however, and should not
prevent the optimizer from using indexes.

The same problem appears in the information_schema views, as every
object name is cast to information_schema.sql_identifier. Even though
this domain is declared as name, no index will be used because of this
cast.

Shouldn't the planner simplify the casts when it's possible?

Regards,
Thomas
Re: [HACKERS] INSERT ... ON CONFLICT {UPDATE | IGNORE}
On Fri, 2014-12-05 at 10:00 -0800, Josh Berkus wrote:
> I thought the point of INSERT ... ON CONFLICT UPDATE was so that you
> didn't have to care if it was a new row or not? If you do care, it
> seems like it makes more sense to do your own INSERTs and UPDATEs, as
> Django currently does.

Django tries to update the object if it already exists in the
database. If it doesn't, then Django does an insert. This is
suboptimal from a concurrency standpoint, and does two round trips to
the database instead of just one. For Django, both insert and update
are OK when saving an object to the database, but Django needs to know
which one was done.

I too agree that this doesn't need to be handled in the first version
of the patch.

 - Anssi
Re: [HACKERS] [Windows,PATCH] Use faster, higher precision timer API
On 12/05/2014 08:03 PM, David Rowley wrote:
> On 2 December 2014 at 15:36, Craig Ringer <cr...@2ndquadrant.com> wrote:
>> On 12/01/2014 09:51 PM, Marco Nenciarini wrote:
>>> I think this is a leftover, as you don't use elog afterwards.
>>
>> Good catch, fixed.
>
> I've looked over this again and tested it on a Windows 8.1 machine. I
> cannot find any problems. The only comments about the code I have
> would maybe be to use some constants like:
>
> #define FILETIME_PER_SEC	10000000L
> #define FILETIME_PER_USEC	10
>
> [snip]
>
> I'll leave it up to the committer to decide if it's better with or
> without the attached patch.

I think it's more readable, and that's pretty much always a good
thing. Patches with your changes attached. I used
FILETIME_UNITS_PER_SEC though.

-- 
Craig Ringer                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training Services

From 9c4e7f5539f0d518b0fe94d12bc562d95967f6a6 Mon Sep 17 00:00:00 2001
From: Craig Ringer <cr...@2ndquadrant.com>
Date: Fri, 12 Sep 2014 12:41:35 +0800
Subject: [PATCH 1/2] Use GetSystemTimeAsFileTime directly in win32
 gettimeofday
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

PostgreSQL was calling GetSystemTime followed by SystemTimeToFileTime
in the win32 port gettimeofday function. This is not necessary and
limits the reported precision to the 1ms granularity that the
SYSTEMTIME struct can represent. By using GetSystemTimeAsFileTime we
avoid unnecessary conversions and capture timestamps at 100ns
granularity, which is then rounded to 1µs granularity for storage in a
PostgreSQL timestamp.

On most Windows systems this change will actually have no significant
effect on timestamp resolution, as the system timer tick is typically
between 1ms and 15ms depending on what timer resolution currently
running applications have requested. You can check this with
clockres.exe from sysinternals.
Despite the platform limitation this change still permits capture of
finer timestamps where the system is capable of producing them, and it
gets rid of an unnecessary syscall.

The higher resolution GetSystemTimePreciseAsFileTime call available on
Windows 8 and Windows Server 2012 has the same interface as
GetSystemTimeAsFileTime, so switching to GetSystemTimeAsFileTime makes
it easier to use the Precise variant later.
---
 src/port/gettimeofday.c | 18 ++++++++++++------
 1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/src/port/gettimeofday.c b/src/port/gettimeofday.c
index 75a9199..ecc0b4d 100644
--- a/src/port/gettimeofday.c
+++ b/src/port/gettimeofday.c
@@ -31,10 +31,17 @@
 
 #include <sys/time.h>
 
-/* FILETIME of Jan 1 1970 00:00:00. */
+/* FILETIME of Jan 1 1970 00:00:00, the PostgreSQL epoch */
 static const unsigned __int64 epoch = UINT64CONST(116444736000000000);
 
 /*
+ * FILETIME represents the number of 100-nanosecond intervals since
+ * January 1, 1601 (UTC).
+ */
+#define FILETIME_UNITS_PER_SEC	10000000L
+#define FILETIME_UNITS_PER_USEC	10
+
+/*
  * timezone information is stored outside the kernel so tzp isn't used anymore.
  *
  * Note: this function is not for Win32 high precision timing purpose.
 * See elapsed_time().
@@ -44,16 +51,15 @@
 int
 gettimeofday(struct timeval * tp, struct timezone * tzp)
 {
 	FILETIME	file_time;
-	SYSTEMTIME	system_time;
 	ULARGE_INTEGER ularge;
 
-	GetSystemTime(&system_time);
-	SystemTimeToFileTime(&system_time, &file_time);
+	GetSystemTimeAsFileTime(&file_time);
 	ularge.LowPart = file_time.dwLowDateTime;
 	ularge.HighPart = file_time.dwHighDateTime;
 
-	tp->tv_sec = (long) ((ularge.QuadPart - epoch) / 10000000L);
-	tp->tv_usec = (long) (system_time.wMilliseconds * 1000);
+	tp->tv_sec = (long) ((ularge.QuadPart - epoch) / FILETIME_UNITS_PER_SEC);
+	tp->tv_usec = (long) (((ularge.QuadPart - epoch) % FILETIME_UNITS_PER_SEC)
+						  / FILETIME_UNITS_PER_USEC);
 
 	return 0;
 }
-- 
1.9.3

From 30cfd48f05bee68602c3088f49c1e30c2251a84f Mon Sep 17 00:00:00 2001
From: Craig Ringer <cr...@2ndquadrant.com>
Date: Thu, 18 Sep 2014 23:02:14 +0800
Subject: [PATCH 2/2] On Windows, use GetSystemTimePreciseAsFileTime when
 available

PostgreSQL on Windows 8 or Windows Server 2012 will now obtain
high-resolution timestamps by dynamically loading the
GetSystemTimePreciseAsFileTime function. It'll fall back to
GetSystemTimeAsFileTime if the higher precision variant isn't found,
so the same binaries run without problems on older Windows releases.

No attempt is made to detect the Windows version. Only the presence or
absence of the desired function is considered.
---
 src/backend/main/main.c |  6 ++
 src/include/port.h      |  2 ++
 src/port/gettimeofday.c | 53 +++--
 3 files changed, 59 insertions(+), 2 deletions(-)

diff --git a/src/backend/main/main.c b/src/backend/main/main.c
index c51b391..73c30c5 100644
--- a/src/backend/main/main.c
+++ b/src/backend/main/main.c
@@ -260,6 +260,12 @@ startup_hacks(const char
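The conversion arithmetic in the patch is easy to sanity-check in isolation. Below is a hedged sketch in plain C, not the win32 port itself — the constants mirror the patch, but `filetime_to_timeval` and `my_timeval` are illustrative stand-ins:

```c
#include <stdint.h>

/* 100-ns intervals between 1601-01-01 and 1970-01-01 (the Unix epoch). */
#define EPOCH_OFFSET	UINT64_C(116444736000000000)
#define UNITS_PER_SEC	10000000	/* 100 ns units per second */
#define UNITS_PER_USEC	10			/* 100 ns units per microsecond */

struct my_timeval
{
	long tv_sec;
	long tv_usec;
};

/*
 * Illustrative helper converting a raw FILETIME tick count to a
 * Unix-epoch timeval, using the same arithmetic as the patched
 * gettimeofday: shift the epoch, then split the 100 ns remainder
 * into whole seconds and microseconds.
 */
static struct my_timeval
filetime_to_timeval(uint64_t filetime_100ns)
{
	struct my_timeval tv;
	uint64_t since_epoch = filetime_100ns - EPOCH_OFFSET;

	tv.tv_sec = (long) (since_epoch / UNITS_PER_SEC);
	tv.tv_usec = (long) ((since_epoch % UNITS_PER_SEC) / UNITS_PER_USEC);
	return tv;
}
```

The key difference from the old code is visible here: the previous implementation computed tv_usec from SYSTEMTIME's whole milliseconds (`wMilliseconds * 1000`), while the new code derives microseconds from the same 100 ns tick count, so no sub-millisecond information is discarded.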
Re: [HACKERS] inherit support for foreign tables
(2014/12/08 15:17), Ashutosh Bapat wrote:
> On Sat, Dec 6, 2014 at 9:21 PM, Noah Misch <n...@leadboat.com> wrote:
>> Does this inheritance patch add any atomicity problem that goes away
>> when one breaks up the inheritance hierarchy and UPDATEs each table
>> separately? If not, this limitation is okay.
>
> If the UPDATEs crafted after breaking up the inheritance hierarchy
> need to be run within the same transaction (as the UPDATE on the
> inheritance hierarchy would do), yes, there is an atomicity problem.

ISTM that your concern is basically a known problem. Consider the
following transaction.

BEGIN TRANSACTION;
UPDATE foo SET a = 100; -- updates on table foo in remote server1
UPDATE bar SET a = 100; -- updates on table bar in remote server2
COMMIT TRANSACTION;

This transaction would cause the atomicity problem if
pgfdw_xact_callback() for XACT_EVENT_PRE_COMMIT for foo succeeded and
then that for bar failed during CommitTransaction().

Thanks,

Best regards,
Etsuro Fujita
Re: [HACKERS] advance local xmin more aggressively
On Mon, Dec 8, 2014 at 4:56 AM, Heikki Linnakangas <hlinnakan...@vmware.com> wrote:
> I don't immediately see the problem either, but I have to say that
> grovelling through all the resource owners seems ugly anyway. Resource
> owners are not meant to be traversed like that. And there could be a
> lot of them, and you have to visit every one of them. That could get
> expensive if there are a lot of resource owners.

1. I don't really see why resource owners shouldn't be traversed like
that. They are clearly intended to form a hierarchy, and there's
existing code that recurses through the hierarchy from a given level
downward. What's ugly about that?

2. If you have a lot of resource owners, you probably have a lot of
snapshots, so walking a list will be expensive, too. It will be
disproportionately expensive to walk the resource owner tree only if
there are lots of resource owners but very few of them have any
snapshots. But I don't think that can really happen. If you've got
lots of resource owners and each one has a snapshot, you'll traverse
~3 pointers per snapshot: ~1 to find the next ResourceOwner, 1 to find
the snapshot array, and 1 to reach the snapshot itself. A non-inlined
list would traverse only 2 pointers per snapshot, but that doesn't
seem like enough of a difference to get excited about.

> BTW, you could easily detect that you haven't seen all the registered
> snapshots, after traversing the resource owner, as we keep the
> counter of them. So you could just fall back to not advancing the
> xmin if it happens.

Not a bad idea. Or we could elog(FATAL) or fail an assertion if we
don't see them all, and then if it happens we call it a bug and fix
it.

> I would prefer doing separate bookkeeping in snapmgr.c. It seems like
> it wouldn't be difficult to do. It has to be cheap, but I don't see
> any reason to believe that it'd be more expensive than traversing
> through all resource owners. A binary heap or some other priority
> queue implementation should work pretty well for this.

That's optimizing for making the xmin recomputation cheap, but we
don't expect xmin recomputation to happen very often, so I'm not sure
that's the right trade-off.

> And that optimization won't save you in the cases where it doesn't
> apply. For example, what if snapshots happen to form a queue, rather
> than a stack:
>
> DECLARE c1 CURSOR FOR ...;
> DECLARE c2 CURSOR FOR ...;
> DECLARE c3 CURSOR FOR ...;
> ...
> DECLARE c1000 CURSOR FOR ...;
>
> CLOSE c1;
> CLOSE c2;
> CLOSE c3;
> ...
> CLOSE c1000;
>
> It's not hard to imagine an application doing that. Sure, you could
> rewrite it to close the cursors in different order, but on the face
> of it that's not an unreasonable thing for an application to do. I
> think we should avoid the O(n^2) behaviour in that case.

You won't actually get O(n^2) behavior here unless those DECLARE
CURSOR statements all get snapshots with different xmins; if many of
them have snapshots that share an xmin, then the optimization of
recomputing only when there are no snaps with a given xmin will save
you. That's a bit pathological, but maybe we should try to cater to
it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
[HACKERS] pg_recvlogical description
Hi,

The pg_recvlogical docs were rewritten, but someone forgot to tweak
the --help description. It is a bit late in the 9.4 cycle, but let's
be consistent.

Regards,

-- 
Euler Taveira                   Timbira - http://www.timbira.com.br/
PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento

From 4c6c146e100575fbf02c6160bd919fbd92505708 Mon Sep 17 00:00:00 2001
From: Euler Taveira <eu...@timbira.com>
Date: Mon, 8 Dec 2014 11:25:29 -0300
Subject: [PATCH] Let's be consistent with the docs.

The new sentence describes better ('control' instead of 'receive')
what pg_recvlogical is for.
---
 src/bin/pg_basebackup/pg_recvlogical.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/bin/pg_basebackup/pg_recvlogical.c b/src/bin/pg_basebackup/pg_recvlogical.c
index 0d97638..9966cd7 100644
--- a/src/bin/pg_basebackup/pg_recvlogical.c
+++ b/src/bin/pg_basebackup/pg_recvlogical.c
@@ -62,7 +62,7 @@ static void disconnect_and_exit(int code);
 static void
 usage(void)
 {
-	printf(_("%s receives PostgreSQL logical change streams.\n\n"),
+	printf(_("%s controls PostgreSQL logical decoding streams.\n\n"),
 		   progname);
 	printf(_("Usage:\n"));
 	printf(_("  %s [OPTION]...\n"), progname);
-- 
2.1.3
Re: [HACKERS] Role Attribute Bitmask Catalog Representation
Michael,

>> This work will certainly continue to be pursued. If a simple move is
>> possible/acceptable, then I think that would be the best option. I
>> can handle that, as it would appear that I am capable of moving it,
>> if that is acceptable.
>
> Yes please. Actually I could have done it, just found the option to
> do so. Let's see what shows up with your work.

I have moved it to commitfest 2014-12 and marked it as "Waiting on
Author", if that is acceptable.

Thanks,
Adam

-- 
Adam Brightwell - adam.brightw...@crunchydatasolutions.com
Database Engineer - www.crunchydatasolutions.com
Re: [HACKERS] Casting issues with domains
Thomas Reiss <thomas.re...@dalibo.com> writes:
> postgres=# explain select * from test2 where a='toto';
>                         QUERY PLAN
> ----------------------------------------------------------
>  Seq Scan on test1  (cost=0.00..1693.00 rows=500 width=5)
>    Filter: (((a)::tstdom)::text = 'toto'::text)
> (2 lignes)

> As you can see, a is casted to tstdom then again to text. This casts
> prevents the optimizer to choose an index scan to retrieve the data.
> The casts are however strictly equivalent and should be not prevent
> the optimizer to use indexes.

No, they are not equivalent. The optimizer can't simply drop the
cast-to-domain, because that cast might result in a runtime error due
to a domain CHECK constraint violation. (This is true even if no such
constraint exists at planning time, unfortunately. If we had a
mechanism to force replanning at ALTER DOMAIN ADD CONSTRAINT, maybe
the no-constraints case could be handled better, but we don't; and
adding one would also imply adding more locks around domain usage, so
it's not all that attractive to do it.)

The short answer is that SQL domains are not zero-cost type aliases.
Perhaps there would be value in having a feature that *is* a zero-cost
alias, but it wouldn't be a domain.

			regards, tom lane
[HACKERS] Dumping database creation options and ACLs
Hello.

As of now, the only way to restore database options and ACLs is to use
pg_dumpall without the globals options. The often-recommended
"pg_dumpall -g + individual dumps of the target databases" doesn't
restore those.

Since pg_dump/pg_restore offer the ability to create the database,
they should do so with the correct owner, options and database ACLs.

There was some discussion about those issues a while ago (see
http://www.postgresql.org/message-id/11646.1272814...@sss.pgh.pa.us
for example). As I understand it, the best way to handle that would be
to push these modifications into pg_dump, but it is unclear how it
should be done with regard to restoring to a different database.

In the meantime, it would be great to add an option to pg_dumpall that
allows dumping this information. We could add the db creation to the
output of pg_dumpall -g, and add a specific --createdb-only option
(similar to --roles-only and --tablespaces-only).

Would such a patch be welcome?

PS: this email was originally sent to the pgsql-bugs mailing list

-- 
Ronan Dunklau
http://dalibo.com - http://dalibo.org
Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()
On Fri, Dec 5, 2014 at 3:45 PM, Peter Geoghegan <p...@heroku.com> wrote:
> On Fri, Dec 5, 2014 at 12:33 PM, Robert Haas <robertmh...@gmail.com> wrote:
>>> Well, if an alias is used, and you refer to an attribute using a
>>> non-alias name (i.e. the original table name), then you'll already
>>> get an error suggesting that the alias be used instead -- of course,
>>> that's nothing new. It doesn't matter to the existing hinting
>>> mechanism if the attribute name is otherwise wrong. Once you fix the
>>> code to use the alias suggested, you'll then get this new
>>> Levenshtein-based hint.
>>
>> In that case, I think I favor giving no hint at all when the RTE name
>> is specified but doesn't match exactly.
>
> I don't follow. The existing mechanism only concerns what to do when
> the original table name was used when an alias should have been used
> instead. What does that have to do with this patch?

Just that that's the case in which it seems useful to give a hint.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] alter user set local_preload_libraries.
On Sun, Dec 7, 2014 at 9:54 AM, Peter Eisentraut <pete...@gmx.net> wrote:
> My radical proposal therefore would have been to embrace this
> inconsistency and get rid of PGC_BACKEND and PGC_SU_BACKEND
> altogether, relying on users interpreting the parameter names to
> indicate that changing them later may or may not have an effect.
> Unfortunately, the concerns about ignore_system_indexes prevent that.

What exactly are those concerns? Do you have a link to previous
discussion?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Testing DDL deparsing support
On Sat, Dec 6, 2014 at 10:43 PM, Bruce Momjian <br...@momjian.us> wrote:
> This causes creation DDL to be checked if it is used in the
> regression database, but what about ALTER and DROP? pg_dump doesn't
> issue those, except in special cases like inheritance.

The proposed testing mechanism should cover any ALTER commands that
are in the regression tests, provided that those objects are not
subsequently dropped -- because if the ALTER commands aren't replayed
properly, then the later pg_dump won't produce the same output. There
probably are some gaps in our current regression tests in this area,
but that's probably a good thing to fix regardless of this.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()
On Mon, Dec 8, 2014 at 9:31 AM, Robert Haas <robertmh...@gmail.com> wrote:
> Just that that's the case in which it seems useful to give a hint.

I think it's very possible that the wrong alias may be provided by the
user, and that we should consider that when providing a hint. Besides,
considering every visible RTE (while penalizing non-exact alias names
iff the user provided an alias name) is actually going to make bad
hints less likely, by increasing the number of equidistant low-quality
matches in a way that swamps the mechanism into providing no actual
match at all. That's an important additional protection against
low-quality matches.

What do other people think here?

-- 
Peter Geoghegan
Re: [HACKERS] Parallel Seq Scan
On Sat, Dec 6, 2014 at 12:13 AM, David Rowley <dgrowle...@gmail.com> wrote:
> It's bare-bones core support for allowing aggregate states to be
> merged together with another aggregate state. I would imagine that if
> a query such as:
>
> SELECT MAX(value) FROM bigtable;
>
> was run, then a series of parallel workers could go off and each find
> the max value from their portion of the table, and then perhaps some
> other node type would take all the intermediate results from the
> workers, once they're finished, and join all of the aggregate states
> into one and return that. Naturally, you'd need to check that all
> aggregates used in the targetlist had a merge function first.

I think this is great infrastructure and could also be useful for
pushing down aggregates in cases involving foreign data wrappers. But
I suggest we discuss it on a separate thread, because it's not related
to parallel seq scan per se.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
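The merge step David describes can be shown in miniature. This is a hedged illustration in plain C, nothing to do with the actual executor — `MaxState`, `partial_max`, and `merge_max` are hypothetical stand-ins for an aggregate's transition state and its proposed merge function:

```c
#include <stddef.h>

/* Hypothetical aggregate state for MAX(int): valid flag + running max. */
typedef struct MaxState
{
	int has_value;
	int max;
} MaxState;

/* Transition function: fold one input value into a worker's local state. */
static void
partial_max(MaxState *st, int value)
{
	if (!st->has_value || value > st->max)
	{
		st->has_value = 1;
		st->max = value;
	}
}

/*
 * Merge function: combine another worker's finished state into ours.
 * This is the extra per-aggregate function the proposal says each
 * aggregate would need to provide before it could be parallelized.
 */
static void
merge_max(MaxState *dst, const MaxState *src)
{
	if (src->has_value)
		partial_max(dst, src->max);
}

/*
 * Simulate N workers each scanning a slice of the table, with the
 * master merging their states at the end.
 */
static int
parallel_max(const int *values, size_t n, size_t nworkers)
{
	MaxState	final = {0, 0};

	for (size_t w = 0; w < nworkers; w++)
	{
		MaxState	local = {0, 0};

		for (size_t i = w; i < n; i += nworkers)	/* striped "portion" */
			partial_max(&local, values[i]);
		merge_max(&final, &local);
	}
	return final.max;
}
```

Because MAX is associative and commutative, the result is identical regardless of how the rows are partitioned across workers — which is exactly the property a merge function has to certify.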
Re: [HACKERS] Parallel Seq Scan
On Sat, Dec 6, 2014 at 1:50 AM, Amit Kapila <amit.kapil...@gmail.com> wrote:
> I think we have access to this information in the planner
> (RelOptInfo->pages); if we want, we can use that to eliminate small
> relations from parallelism, but the question is how big a relation we
> want to consider for parallelism. One way is to check via tests,
> which I am planning to follow. Do you think we have any heuristic
> which we can use to decide how big a relation should be considered
> for parallelism?

Surely the Path machinery needs to decide this in particular cases
based on cost. We should assign some cost to starting a parallel
worker via some new GUC, like parallel_startup_cost = 100,000. And
then we should also assign a cost to the act of relaying a tuple from
the parallel worker to the master, maybe cpu_tuple_cost (or some new
GUC). For a small relation, or a query with a LIMIT clause, the
parallel startup cost will make starting a lot of workers look
unattractive, but for bigger relations it will make sense from a cost
perspective, which is exactly what we want.

There are probably other important considerations based on goals for
overall resource utilization, and also because at a certain point
adding more workers won't help because the disk will be saturated. I
don't know exactly what we should do about those issues yet, but the
steps described in the previous paragraph seem like a good place to
start anyway.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
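Robert's costing rule can be sketched as a toy model. This is a hedged illustration only, not the actual PostgreSQL costing code — the function names and the assumption that scan work divides evenly across workers (with the master also participating) are illustrative:

```c
/* Per-worker startup cost and per-tuple relay cost, as proposed above. */
static double parallel_startup_cost = 100000.0;
static double cpu_tuple_cost = 0.01;

/*
 * Toy cost model: the scan work divides across (nworkers + 1)
 * participants, but the startup cost is paid once per worker and every
 * result tuple pays a relay cost to reach the master.
 */
static double
parallel_scan_cost(double serial_cost, double ntuples, int nworkers)
{
	if (nworkers <= 0)
		return serial_cost;
	return serial_cost / (nworkers + 1)
		+ parallel_startup_cost * nworkers
		+ cpu_tuple_cost * ntuples;
}

/* Pick the cheapest worker count in a small range (0 = stay serial). */
static int
best_worker_count(double serial_cost, double ntuples, int max_workers)
{
	int		best = 0;
	double	best_cost = parallel_scan_cost(serial_cost, ntuples, 0);

	for (int w = 1; w <= max_workers; w++)
	{
		double	c = parallel_scan_cost(serial_cost, ntuples, w);

		if (c < best_cost)
		{
			best_cost = c;
			best = w;
		}
	}
	return best;
}
```

This captures the behaviour Robert wants: with a small serial cost the per-worker startup charge dominates and the model picks zero workers, while for a large scan the divided work outweighs the startup and relay charges and more workers win.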
Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()
On Mon, Dec 8, 2014 at 9:43 AM, Peter Geoghegan p...@heroku.com wrote: I think it's very possible that the wrong alias may be provided by the user, and that we should consider that when providing a hint. Note that the existing mechanism (the mechanism that I'm trying to improve) only ever shows this error message: There is a column named \%s\ in table \%s\, but it cannot be referenced from this part of the query. I think it's pretty clear that this general class of user error is common. -- Peter Geoghegan
Re: [HACKERS] Parallel Seq Scan
On Sat, Dec 6, 2014 at 7:07 AM, Stephen Frost sfr...@snowman.net wrote: For my 2c, I'd like to see it support exactly what the SeqScan node supports and then also what Foreign Scan supports. That would mean we'd then be able to push filtering down to the workers which would be great. Even better would be figuring out how to parallelize an Append node (perhaps only possible when the nodes underneath are all SeqScan or ForeignScan nodes) since that would allow us to then parallelize the work across multiple tables and remote servers. I don't see how we can support the stuff ForeignScan does; presumably any parallelism there is up to the FDW to implement, using whatever in-core tools we provide. I do agree that parallelizing Append nodes is useful; but let's get one thing done first before we start trying to do thing #2. I'm not entirely following this. How can the worker be responsible for its own plan when the information passed to it (per the above paragraph..) is pretty minimal? In general, I don't think we need to have specifics like this worker is going to do exactly X because we will eventually need some communication to happen between the worker and the master process where the worker can ask for more work because it's finished what it was tasked with and the master will need to give it another chunk of work to do. I don't think we want exactly what each worker process will do to be fully formed at the outset because, even with the best information available, given concurrent load on the system, it's not going to be perfect and we'll end up starving workers. The plan, as formed by the master, should be more along the lines of this is what I'm gonna have my workers do along w/ how many workers, etc, and then it goes and does it. Perhaps for an 'explain analyze' we return information about what workers actually *did* what, but that's a whole different discussion. I agree with this. 
For a first version, I think it's OK to start a worker up for a particular sequential scan and have it help with that sequential scan until the scan is completed, and then exit. It should not, as the present version of the patch does, assign a fixed block range to each worker; instead, workers should allocate a block or chunk of blocks to work on until no blocks remain. That way, even if every worker but one gets stuck, the rest of the scan can still finish. Eventually, we will want to be smarter about sharing workers between multiple parts of the plan, but I think it is just fine to leave that as a future enhancement for now. - Master backend is just responsible for coordination among workers. It shares the required information with the workers and then fetches the data processed by each worker; with some more logic, we might be able to make the master backend also fetch data from the heap rather than doing just coordination among workers. I don't think this is really necessary... I think it would be an awfully good idea to make this work. The master backend may be significantly faster than any of the others because it has no IPC costs. We don't want to leave our best resource sitting on the bench. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] jsonb generator functions
On 12/08/2014 04:21 AM, Alvaro Herrera wrote: Andrew Dunstan wrote: OK, here is a new patch version that * uses find_coercion_path() to find the cast function if any, as discussed elsewhere * removes calls to getTypeOutputInfo() except where required * honors a cast to json only for rendering both json and jsonb * adds processing for the date type that was previously missing in datum_to_jsonb Did this go anywhere? Not yet. I hope to get to it this week. cheers andrew
Re: [HACKERS] On partitioning
All, Pardon me for jumping into this late. In general, I like Alvaro's approach. However, I wanted to list the major shortcomings of the existing partitioning system (based on complaints by PGX's users and on IRC) and compare them to Alvaro's proposed implementation to make sure that enough of them are addressed, and that the ones which aren't addressed are not being addressed as a clear decision. We can't address *all* of the limitations of the current system, but let's make sure that we're addressing enough of them to make implementing a 2nd partitioning system worthwhile. Where I have a ?, it's because I'm not clear from Alvaro's proposal whether they're addressed or not. 1. The Trigger Problem the need to write triggers for INSERT/UPDATE/DELETE. Addressed. 2. The Clutter Problem cluttering up system views and dumps with hundreds of partitioned tables. Addressed. 3. Creation Problem the need to write triggers and/or cron jobs to create new partitions. Addressed. 4. Creation Locking Problem high probability of lock pile-ups whenever a new partition is created on demand due to multiple backends trying to create the partition at the same time. Not Addressed? 5. Constant Problem Since current partitioned query planning happens before the rewrite phase, SELECTs do not use partition logic to evaluate even simple expressions, let alone IMMUTABLE or STABLE functions. Addressed?? 6. Unique Index Problem Cannot create a unique index across multiple partitions, which prevents the partitioned table from being FK'd. Not Addressed (but could be addressed in the future) 7. JOIN Problem Two partitioned tables being JOINed need to append and materialize before the join, causing a very slow join under some circumstances, even if both tables are partitioned on the same ranges. Not Addressed? (but could be addressed in the future) 8. COPY Problem Cannot bulk-load into the Master, just into individual partitions. Addressed. 9.
Hibernate Problem When using the trigger method, inserts into the master partition return 0, which Hibernate and some other ORMs regard as an insert failure. Addressed. 10. Scaling Problem Inheritance partitioning becomes prohibitively slow for the planner at somewhere between 100 and 500 partitions depending on various factors. No idea? 11. Hash Partitioning Some users would prefer to partition into a fixed number of hash-allocated partitions. Not Addressed. 12. Extra Constraint Evaluation Inheritance partitioning evaluates *all* constraints on the partitions, whether they are part of the partitioning scheme or not. This is way expensive if those are, say, polygon comparisons. Addressed. Additionally, I believe that Alvaro's proposal will make the following activities which are supported by partition-by-inheritance more difficult or impossible. Again, these are probably acceptable because inheritance partitioning isn't going away. However, we should consciously decide that: A. COPY/ETL then attach In inheritance partitioning, you can easily build a partition outside the master and then attach it, allowing for minimal disturbance of concurrent users. Could be addressed in the future. B. Catchall Partition Many partitioning schemes currently contain a catchall partition which accepts rows outside of the range of the partitioning scheme, due to bad input data. Probably not handled on purpose; Alvaro is proposing that we reject these instead, or create the partitions on demand, which is a legitimate approach. C. Asymmetric Partitioning / NULLs in partition column This is the classic Active/Inactive By Month setup for partitions. Could be addressed via special handling for NULL/infinity in the partitioned column. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] On partitioning
On Sat, Dec 6, 2014 at 2:59 AM, Amit Kapila amit.kapil...@gmail.com wrote: I guess you could list or hash partition on multiple columns, too. How would you distinguish values in list partition for multiple columns? I mean for range partition, we are sure there will be either one value for each column, but for list it could be multiple and not fixed for each partition, so I think it will not be easy to support the multicolumn partition key for list partitions. I don't understand. If you want to range partition on columns (a, b), you say that, say, tuples with (a, b) values less than (100, 200) go here and the rest go elsewhere. For list partitioning, you say that, say, tuples with (a, b) values of EXACTLY (100, 200) go here and the rest go elsewhere. I'm not sure how useful that is but it's not illogical. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] On partitioning
On Mon, Dec 8, 2014 at 12:13 AM, Amit Langote langote_amit...@lab.ntt.co.jp wrote: So just to clarify, first and last destinations are considered defined if you have something like: ... PARTITION p1 VALUES LESS THAN 10 PARTITION p2 VALUES BETWEEN 10 AND 20 PARTITION p3 VALUES GREATER THAN 20 ... And not defined if: ... PARTITION p1 VALUES BETWEEN 10 AND 20 ... Yes. For pg_dump --binary-upgrade, you need a statement like SELECT binary_upgrade.set_next_toast_pg_class_oid('%d'::pg_catalog.oid) for each pg_class entry. So you can't easily have a single SQL statement creating multiple such entries. Hmm, do you mean pg_dump cannot emit such a SQL or there shouldn't be one in the first place? I mean that the binary upgrade script needs to set the OID for every pg_class object being restored, and it does that by stashing away up to one (1) pg_class OID before each CREATE statement. If a single CREATE statement generates multiple pg_class entries, this method doesn't work. Makes sense. This would double as a way to create subpartitions too? And that would have to play well with any choice we end up making about how we treat subpartitioning key (one of the points discussed above) Yes, I think so. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] On partitioning
On Sat, Dec 6, 2014 at 3:06 AM, Amit Kapila amit.kapil...@gmail.com wrote: Sure, I don't feel we shouldn't provide any way to take a dump of an individual partition, just not at the level of an independent table. Maybe something like --table table_name --partition partition_name. In general, I think we should try to avoid exposing that partitions are individual tables, as that might hinder any future enhancement in that area (for example, if someone finds a different and better way to arrange the partition data, then due to the currently exposed syntax we might feel blocked). I guess I'm in disagreement with you - and, perhaps - the majority on this point. I think that ship has already sailed: partitions ARE tables. We can try to make it less necessary for users to ever look at those tables as separate objects, and I think that's a good idea. But trying to go from a system where partitions are tables, which is what we have today, to a system where they are not seems like a bad idea to me. If we make a major break from how things work today, we're going to end up having to reimplement stuff that already works. Besides, I haven't really seen anyone propose something that sounds like a credible alternative. If we could make partition objects things that the storage layer needs to know about but the query planner doesn't need to understand, that'd be maybe worth considering. But I don't see any way that that's remotely feasible. There are lots of places that we assume that a heap consists of blocks numbered 0 up through N: CTID pointers, index-to-heap pointers, nodeSeqScan, bits and pieces of the way index vacuuming is handled, which in turn bleeds into Hot Standby. You can't just decide that now block numbers are going to be replaced by some more complex structure, or even that they're now going to be nonlinear, without breaking a huge amount of stuff.
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Compression of full-page-writes
On Sun, Dec 7, 2014 at 9:30 PM, Simon Riggs si...@2ndquadrant.com wrote: * parameter should be SUSET - it doesn't *need* to be set only at server start since all records are independent of each other Why not USERSET? There's no point in trying to prohibit users from doing things that will cause bad performance because they can do that anyway. * ideally we'd like to be able to differentiate the types of usage. which then allows the user to control the level of compression depending upon the type of action. My first cut at what those settings should be are ALL LOGICAL PHYSICAL VACUUM. VACUUM - only compress while running vacuum commands PHYSICAL - only compress while running physical DDL commands (ALTER TABLE set tablespace, CREATE INDEX), i.e. those that wouldn't typically be used for logical decoding LOGICAL - compress FPIs for record types that change tables ALL - all user commands (each level includes all prior levels) Interesting idea, but what evidence do we have that a simple on/off switch isn't good enough? * name should not be wal_compression - we're not compressing all wal records, just fpis. There is no evidence that we even want to compress other record types, nor that our compression mechanism is effective at doing so. Simple = keep name as compress_full_page_writes Quite right. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Compression of full-page-writes
On 2014-12-08 14:09:19 -0500, Robert Haas wrote: records, just fpis. There is no evidence that we even want to compress other record types, nor that our compression mechanism is effective at doing so. Simple = keep name as compress_full_page_writes Quite right. I don't really agree with this. There's lots of records which can be quite big where compression could help a fair bit. Most prominently HEAP2_MULTI_INSERT + INIT_PAGE. During initial COPY that's the biggest chunk of WAL. And these are big and repetitive enough that compression is very likely to be beneficial. I still think that just compressing the whole record if it's above a certain size is going to be better than compressing individual parts. Michael argued that that'd be complicated because of the varying size of the required 'scratch space'. I don't buy that argument though. It's easy enough to simply compress all the data in some fixed chunk size. I.e. always compress 64kb in one go. If there's more, compress that independently. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] On partitioning
On 12/08/2014 11:05 AM, Robert Haas wrote: I guess I'm in disagreement with you - and, perhaps - the majority on this point. I think that ship has already sailed: partitions ARE tables. We can try to make it less necessary for users to ever look at those tables as separate objects, and I think that's a good idea. But trying to go from a system where partitions are tables, which is what we have today, to a system where they are not seems like a bad idea to me. If we make a major break from how things work today, we're going to end up having to reimplement stuff that already works. I don't think it's feasible to drop inheritance partitioning at this point; too many users exploit a lot of peculiarities of that system which wouldn't be supported by any other system. So any new partitioning system we're talking about would be *in addition* to the existing system. Hence my prior email trying to make sure that a new proposed system is sufficiently different from the existing one to be worthwhile. Besides, I haven't really seen anyone propose something that sounds like a credible alternative. If we could make partition objects things that the storage layer needs to know about but the query planner doesn't need to understand, that'd be maybe worth considering. But I don't see any way that that's remotely feasible. On the other hand, as long as partitions exist exclusively at the planner layer, we can't fix the existing major shortcomings of inheritance partitioning, such as its inability to handle expressions. Again, see previous. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Re: [HACKERS] Compression of full-page-writes
On Mon, Dec 8, 2014 at 2:21 PM, Andres Freund and...@2ndquadrant.com wrote: On 2014-12-08 14:09:19 -0500, Robert Haas wrote: records, just fpis. There is no evidence that we even want to compress other record types, nor that our compression mechanism is effective at doing so. Simple = keep name as compress_full_page_writes Quite right. I don't really agree with this. There's lots of records which can be quite big where compression could help a fair bit. Most prominently HEAP2_MULTI_INSERT + INIT_PAGE. During initial COPY that's the biggest chunk of WAL. And these are big and repetitive enough that compression is very likely to be beneficial. I still think that just compressing the whole record if it's above a certain size is going to be better than compressing individual parts. Michael argued that that'd be complicated because of the varying size of the required 'scratch space'. I don't buy that argument though. It's easy enough to simply compress all the data in some fixed chunk size. I.e. always compress 64kb in one go. If there's more, compress that independently. I agree that idea is worth considering. But I think we should decide which way is better and then do just one or the other. I can't see the point in adding wal_compress=full_pages now and then offering an alternative wal_compress=big_records in 9.5. I think it's also quite likely that there may be cases where context-aware compression strategies can be employed. For example, the prefix/suffix compression of updates that Amit did last cycle exploits the likely commonality between the old and new tuples. We might have cases like that where there are meaningful trade-offs to be made between CPU and I/O, or other reasons to have user-exposed knobs. I think we'll be much happier if those are completely separate GUCs, so we can say things like compress_gin_wal=true and compress_brin_effort=3.14 rather than trying to have a single wal_compress GUC and assuming that we can shoehorn all future needs into it.
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] On partitioning
On 2014-12-08 14:05:52 -0500, Robert Haas wrote: On Sat, Dec 6, 2014 at 3:06 AM, Amit Kapila amit.kapil...@gmail.com wrote: Sure, I don't feel we shouldn't provide any way to take a dump of an individual partition, just not at the level of an independent table. Maybe something like --table table_name --partition partition_name. In general, I think we should try to avoid exposing that partitions are individual tables, as that might hinder any future enhancement in that area (for example, if someone finds a different and better way to arrange the partition data, then due to the currently exposed syntax we might feel blocked). I guess I'm in disagreement with you - and, perhaps - the majority on this point. I think that ship has already sailed: partitions ARE tables. We can try to make it less necessary for users to ever look at those tables as separate objects, and I think that's a good idea. But trying to go from a system where partitions are tables, which is what we have today, to a system where they are not seems like a bad idea to me. If we make a major break from how things work today, we're going to end up having to reimplement stuff that already works. I don't think this makes much sense. That'd severely restrict our ability to do stuff for a long time. Unless we can absolutely rely on the fact that partitions have the same schema and such we'll rob ourselves of significant optimization opportunities. Besides, I haven't really seen anyone propose something that sounds like a credible alternative. If we could make partition objects things that the storage layer needs to know about but the query planner doesn't need to understand, that'd be maybe worth considering. But I don't see any way that that's remotely feasible. There are lots of places that we assume that a heap consists of blocks numbered 0 up through N: CTID pointers, index-to-heap pointers, nodeSeqScan, bits and pieces of the way index vacuuming is handled, which in turn bleeds into Hot Standby.
You can't just decide that now block numbers are going to be replaced by some more complex structure, or even that they're now going to be nonlinear, without breaking a huge amount of stuff. I think you're making a wrong fundamental assumption here. Just because we define partitions to not be full relations doesn't mean we have to treat them entirely separate. I don't see why a pg_class.relkind = 'p' entry would be something actually problematic. That'd easily allow to treat them differently in all the relevant places (all of ALTER TABLE, DML et al) and still allow all of the current planner/executor infrastructure. We can even allow direct SELECTs from individual partitions if we want to - that's trivial to achieve. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] On partitioning
On Mon, Dec 8, 2014 at 2:30 PM, Josh Berkus j...@agliodbs.com wrote: On 12/08/2014 11:05 AM, Robert Haas wrote: I guess I'm in disagreement with you - and, perhaps - the majority on this point. I think that ship has already sailed: partitions ARE tables. We can try to make it less necessary for users to ever look at those tables as separate objects, and I think that's a good idea. But trying to go from a system where partitions are tables, which is what we have today, to a system where they are not seems like a bad idea to me. If we make a major break from how things work today, we're going to end up having to reimplement stuff that already works. I don't think it's feasible to drop inheritance partitioning at this point; too many users exploit a lot of peculiarities of that system which wouldn't be supported by any other system. So any new partitioning system we're talking about would be *in addition* to the existing system. Hence my prior email trying to make sure that a new proposed system is sufficiently different from the existing one to be worthwhile. I think any new partitioning system should keep the good things about the existing system, of which there are some, and not try to reinvent the wheel. The yardstick for a new system shouldn't be is this different enough? but does this solve the problems without creating new ones?. Besides, I haven't really seen anyone propose something that sounds like a credible alternative. If we could make partition objects things that the storage layer needs to know about but the query planner doesn't need to understand, that'd be maybe worth considering. But I don't see any way that that's remotely feasible. On the other hand, as long as partitions exist exclusively at the planner layer, we can't fix the existing major shortcomings of inheritance partitioning, such as its inability to handle expressions. Again, see previous. Huh?
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
[HACKERS] compiler warnings under MinGW for 9.4
In the past, building under MinGW produced so many warnings that I never bothered to read them. Now most of them have been removed, so the ones that are left might be worth reporting. Using gcc.exe (GCC) 4.6.2 on REL9_4_STABLE eadd80c08ddfc485db84b9af7cca54a0d50ebe6d I get: mingwcompat.c:60:1: warning: 'RegisterWaitForSingleObject' redeclared without dllimport attribute: previous dllimport ignored [-Wattributes] input.c:382:1: warning: 'saveHistory' defined but not used [-Wunused-function] Does anyone have opinions on how to address these? Cheers, Jeff
Re: [HACKERS] On partitioning
On Mon, Dec 8, 2014 at 2:39 PM, Andres Freund and...@2ndquadrant.com wrote: I guess I'm in disagreement with you - and, perhaps - the majority on this point. I think that ship has already sailed: partitions ARE tables. We can try to make it less necessary for users to ever look at those tables as separate objects, and I think that's a good idea. But trying to go from a system where partitions are tables, which is what we have today, to a system where they are not seems like a bad idea to me. If we make a major break from how things work today, we're going to end up having to reimplement stuff that already works. I don't think this makes much sense. That'd severely restrict our ability to do stuff for a long time. Unless we can absolutely rely on the fact that partitions have the same schema and such we'll rob ourselves of significant optimization opportunities. I don't think that's mutually exclusive with the idea of partitions-as-tables. I mean, you can add code to the ALTER TABLE path that says if (i_am_not_the_partitioning_root) ereport(ERROR, ...) wherever you want. Besides, I haven't really seen anyone propose something that sounds like a credible alternative. If we could make partition objects things that the storage layer needs to know about but the query planner doesn't need to understand, that'd be maybe worth considering. But I don't see any way that that's remotely feasible. There are lots of places that we assume that a heap consists of blocks number 0 up through N: CTID pointers, index-to-heap pointers, nodeSeqScan, bits and pieces of the way index vacuuming is handled, which in turn bleeds into Hot Standby. You can't just decide that now block numbers are going to be replaced by some more complex structure, or even that they're now going to be nonlinear, without breaking a huge amount of stuff. I think you're making a wrong fundamental assumption here. 
Just because we define partitions to not be full relations doesn't mean we have to treat them entirely separate. I don't see why a pg_class.relkind = 'p' entry would be something actually problematic. That'd easily allow to treat them differently in all the relevant places (all of ALTER TABLE, DML et al) and still allow all of the current planner/executor infrastructure. We can even allow direct SELECTs from individual partitions if we want to - that's trivial to achieve. We may just be using different words to talk about more-or-less the same thing, then. What I'm saying is that I want these things to keep working: - Indexes. - Merge append and any other inheritance-aware query planning techniques. - Direct access to individual partitions to bypass tuple-routing/query-planning overhead. I am not necessarily saying that I have a problem with putting other restrictions on partitions, like requiring them to have the same tuple descriptor or the same ACLs as their parents. Those kinds of details bear discussion, but I'm not intrinsically opposed. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Lockless StrategyGetBuffer() clock sweep
On 2014-10-30 07:55:08 -0400, Robert Haas wrote: On Wed, Oct 29, 2014 at 3:09 PM, Andres Freund and...@2ndquadrant.com wrote: But if it is, then how about adding a flag that is 4 bytes wide or less alongside bgwriterLatch, and just checking the flag instead of checking bgwriterLatch itself? Yea, that'd be nicer. I didn't do it because it made the patch slightly more invasive... The complexity really is only needed because we're not guaranteed that 64bit reads are atomic. Although we actually can be sure, because there's no platform with nonatomic intptr_t reads - so 64bit platforms actually *do* have atomic 64bit reads/writes. So if we just have an integer 'setBgwriterLatch' somewhere we're good. We don't even need to take a lock to set it. Afaics the worst that can happen is that several processes set the latch... OK, that design is fine with me. There's a slight problem with this, namely restarts of the bgwriter. If it crashes, the reference to the relevant latch will currently be reset via StrategyNotifyBgWriter(). In reality that's not a problem because sizeof(void*) writes are always atomic, but currently we don't assume that. We'd sometimes write to an old latch, but that's harmless because they're never deallocated. So I can see several solutions right now: 1) Redefine our requirements so that aligned sizeof(void*) writes are always atomic. That doesn't affect any currently supported platform and won't ever affect any future platform either. E.g. linux has had that requirement for a long time. 2) Use an explicitly defined latch for the bgwriter instead of using the PGPROC-procLatch. That way it never has to be reset and all SetLatch()s will eventually go to the right process. 3) Continue requiring the spinlock when setting the latch.
Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] On partitioning
On 2014-12-08 14:48:50 -0500, Robert Haas wrote: On Mon, Dec 8, 2014 at 2:39 PM, Andres Freund and...@2ndquadrant.com wrote: I guess I'm in disagreement with you - and, perhaps - the majority on this point. I think that ship has already sailed: partitions ARE tables. We can try to make it less necessary for users to ever look at those tables as separate objects, and I think that's a good idea. But trying to go from a system where partitions are tables, which is what we have today, to a system where they are not seems like a bad idea to me. If we make a major break from how things work today, we're going to end up having to reimplement stuff that already works. I don't think this makes much sense. That'd severely restrict our ability to do stuff for a long time. Unless we can absolutely rely on the fact that partitions have the same schema and such we'll rob ourselves of significant optimization opportunities. I don't think that's mutually exclusive with the idea of partitions-as-tables. I mean, you can add code to the ALTER TABLE path that says if (i_am_not_the_partitioning_root) ereport(ERROR, ...) wherever you want. That'll be a lot of places you'll need to touch. More fundamentally: Why should we name something a table that's not one? Besides, I haven't really seen anyone propose something that sounds like a credible alternative. If we could make partition objects things that the storage layer needs to know about but the query planner doesn't need to understand, that'd be maybe worth considering. But I don't see any way that that's remotely feasible. There are lots of places that we assume that a heap consists of blocks number 0 up through N: CTID pointers, index-to-heap pointers, nodeSeqScan, bits and pieces of the way index vacuuming is handled, which in turn bleeds into Hot Standby. 
You can't just decide that now block numbers are going to be replaced by some more complex structure, or even that they're now going to be nonlinear, without breaking a huge amount of stuff. I think you're making a wrong fundamental assumption here. Just because we define partitions to not be full relations doesn't mean we have to treat them entirely separately. I don't see why a pg_class.relkind = 'p' entry would be something actually problematic. That'd easily allow us to treat them differently in all the relevant places (all of ALTER TABLE, DML et al) and still allow all of the current planner/executor infrastructure. We can even allow direct SELECTs from individual partitions if we want to - that's trivial to achieve. We may just be using different words to talk about more-or-less the same thing, then. That might be. What I'm saying is that I want these things to keep working: - Indexes. Nobody argued against that, I think. - Merge append and any other inheritance-aware query planning techniques. Same here. - Direct access to individual partitions to bypass tuple-routing/query-planning overhead. I think that might be ok in some cases, but in general I'd be very wary of allowing that. I think it might be ok to allow direct read access, but to everything else I'd be opposed. I'd much rather go the route of allowing too few things and then gradually opening up if required than the other way round (as that pretty much will never happen because it'll break deployed systems). Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
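The pg_class.relkind = 'p' idea above boils down to one early gate in ALTER TABLE instead of scattered checks. A minimal sketch of that decision, with invented names (RELKIND_PARTITION and both predicates are illustrations, not PostgreSQL code):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical relkind letters modeled on pg_class.relkind;
 * 'p' is the new partition kind suggested above. */
#define RELKIND_RELATION  'r'
#define RELKIND_PARTITION 'p'

/* One early check in the ALTER TABLE path replaces scattered
 * if (i_am_not_the_partitioning_root) ereport(ERROR, ...) tests:
 * schema changes are only accepted on the partitioning root. */
static bool
allows_schema_change(char relkind)
{
    return relkind != RELKIND_PARTITION;
}

/* Direct read access to an individual partition stays permitted,
 * matching the "allow direct SELECTs" point above. */
static bool
allows_direct_select(char relkind)
{
    (void) relkind;              /* readable for every relkind */
    return true;
}
```

The point of the sketch is only that a distinct relkind centralizes the policy; which operations to allow is exactly what's being debated.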
Re: [HACKERS] On partitioning
On 12/08/2014 11:40 AM, Robert Haas wrote: I don't think it's feasible to drop inheritance partitioning at this point; too many users exploit a lot of peculiarities of that system which wouldn't be supported by any other system. So any new partitioning system we're talking about would be *in addition* to the existing system. Hence my prior email trying to make sure that a new proposed system is sufficiently different from the existing one to be worthwhile. I think any new partitioning system should keep the good things about the existing system, of which there are some, and not try to reinvent the wheel. The yardstick for a new system shouldn't be "is this different enough?" but "does this solve the problems without creating new ones?". It's unrealistic to assume that a new system would support all of the features of the existing inheritance partitioning without restriction. In fact, I'd say that such a requirement amounts to saying "don't bother trying". For example, inheritance allows us to have different indexes, constraints, and even columns on partitions. We can have overlapping partitions, and heterogeneous multilevel partitioning (partition this customer by month but partition that customer by week). We can even add triggers on individual partitions to reroute data away from a specific partition. A requirement to support all of these peculiar uses of inheritance partitioning would doom any new partitioning project. Besides, I haven't really seen anyone propose something that sounds like a credible alternative. If we could make partition objects things that the storage layer needs to know about but the query planner doesn't need to understand, that'd be maybe worth considering. But I don't see any way that that's remotely feasible. On the other hand, as long as partitions exist exclusively at the planner layer, we can't fix the existing major shortcomings of inheritance partitioning, such as its inability to handle expressions. Again, see previous. Huh? 
Explained in the other email I posted on this thread. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Re: [HACKERS] On partitioning
On Mon, Dec 8, 2014 at 2:56 PM, Andres Freund and...@2ndquadrant.com wrote: I don't think that's mutually exclusive with the idea of partitions-as-tables. I mean, you can add code to the ALTER TABLE path that says if (i_am_not_the_partitioning_root) ereport(ERROR, ...) wherever you want. That'll be a lot of places you'll need to touch. More fundamentally: Why should we name something a table that's not one? Well, I'm not convinced that it isn't one. And adding a new relkind will involve a bunch of code churn, too. But I don't much care to pre-litigate this: when someone has got a patch, we can either agree that the approach is OK or argue that it is problematic because X. I think we need to hammer down the design in broad strokes first, and I'm not sure we're totally there yet. - Direct access to individual partitions to bypass tuple-routing/query-planning overhead. I think that might be ok in some cases, but in general I'd be very wary of allowing that. I think it might be ok to allow direct read access, but to everything else I'd be opposed. I'd much rather go the route of allowing too few things and then gradually opening up if required than the other way round (as that pretty much will never happen because it'll break deployed systems). Why? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] On partitioning
On Mon, Dec 8, 2014 at 2:58 PM, Josh Berkus j...@agliodbs.com wrote: I think any new partitioning system should keep the good things about the existing system, of which there are some, and not try to reinvent the wheel. The yardstick for a new system shouldn't be "is this different enough?" but "does this solve the problems without creating new ones?". It's unrealistic to assume that a new system would support all of the features of the existing inheritance partitioning without restriction. In fact, I'd say that such a requirement amounts to saying "don't bother trying". For example, inheritance allows us to have different indexes, constraints, and even columns on partitions. We can have overlapping partitions, and heterogeneous multilevel partitioning (partition this customer by month but partition that customer by week). We can even add triggers on individual partitions to reroute data away from a specific partition. A requirement to support all of these peculiar uses of inheritance partitioning would doom any new partitioning project. I don't think it has to be possible to support every use case that we can support today; clearly, a part of the goal here is to be LESS general so that we can be more performant. But I think the urge to change too many things at once had better be tempered by a clear-eyed vision of what can reasonably be accomplished in one patch. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Compression of full-page-writes
On 12/08/2014 09:21 PM, Andres Freund wrote: I still think that just compressing the whole record if it's above a certain size is going to be better than compressing individual parts. Michael argued that that'd be complicated because of the varying size of the required 'scratch space'. I don't buy that argument though. It's easy enough to simply compress all the data in some fixed chunk size. I.e. always compress 64kb in one go. If there's more, compress that independently. Doing it in fixed-size chunks doesn't help - you have to hold onto the compressed data until it's written to the WAL buffers. But you could just allocate a large enough scratch buffer, and give up if it doesn't fit. If the compressed data doesn't fit in e.g. 3 * 8kb, it didn't compress very well, so there's probably no point in compressing it anyway. Now, an exception to that might be a record that contains something other than page data, like a commit record with millions of subxids, but I think we could live with not compressing those, even though it would be beneficial to do so. - Heikki
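Heikki's give-up-if-it-doesn't-fit rule is easy to model. In this sketch a toy run-length encoder stands in for the real WAL compressor, and all names are inventions for illustration:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define BLCKSZ 8192
/* Fixed scratch buffer of 3 blocks, per the suggestion above. */
#define COMPRESS_SCRATCH_SIZE (3 * BLCKSZ)

/* Toy run-length encoder standing in for the real compressor:
 * emits (byte, runlen) pairs and returns 0 when the output
 * would overflow 'avail'. */
static size_t
toy_compress(const char *src, size_t srclen, char *dst, size_t avail)
{
    size_t out = 0, i = 0;

    while (i < srclen)
    {
        size_t run = 1;

        while (i + run < srclen && src[i + run] == src[i] && run < 255)
            run++;
        if (out + 2 > avail)
            return 0;            /* doesn't fit: caller gives up */
        dst[out++] = src[i];
        dst[out++] = (char) run;
        i += run;
    }
    return out;
}

/* The decision itself: compress into the fixed scratch buffer and
 * store the record uncompressed unless the result both fits and
 * actually shrinks. */
static bool
try_compress_record(const char *rec, size_t reclen, char *scratch)
{
    size_t clen = toy_compress(rec, reclen, scratch,
                               COMPRESS_SCRATCH_SIZE);

    return clen > 0 && clen < reclen;
}

/* Self-check helper: a record of zeroes compresses well; an
 * alternating byte pattern blows past 3 * BLCKSZ and is left
 * uncompressed. */
static bool
demo(bool compressible)
{
    static char rec[2 * BLCKSZ];
    static char scratch[COMPRESS_SCRATCH_SIZE];
    size_t i;

    for (i = 0; i < sizeof(rec); i++)
        rec[i] = compressible ? 0 : (char) (i & 1 ? 0x55 : 0xAA);
    return try_compress_record(rec, sizeof(rec), scratch);
}
```

The commit-record-with-millions-of-subxids case falls out naturally: it overflows the scratch buffer and simply isn't compressed.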
Re: [HACKERS] [v9.5] Custom Plan API
On 12/6/14, 5:21 PM, Kouhei Kaigai wrote: Yes please. We have other contrib modules that exist as tests, so this seems reasonable to me. I can't improve the docs without the example code. Is that available now? Please wait for a few days. The ctidscan module is not adjusted for the latest interface yet. I've made some minor edits, with an emphasis on not changing original intent. Each section was saved as a separate edit, so if anyone objects to something just revert the relevant change. Once the code is available more editing can be done. -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com
Re: [HACKERS] Proposal: Log inability to lock pages during vacuum
On 12/7/14, 6:16 PM, Simon Riggs wrote: On 20 October 2014 at 10:57, Jim Nasby jim.na...@bluetreble.com wrote: Currently, a non-freeze vacuum will punt on any page it can't get a cleanup lock on, with no retry. Presumably this should be a rare occurrence, but I think it's bad that we just assume that and won't warn the user if something bad is going on. (I'm having email problems, so I can't see later mails on this thread, so replying here.) Logging patch looks fine, but I would rather not add a line of text for each VACUUM, just in case this is non-zero. I think we should add that log line only if the blocks skipped > 0. I thought about doing that, but I'm loath to duplicate a rather large ereport call. Happy to make the change if that's the consensus though. What I'm more interested in is what you plan to do with the information once we get it? The assumption that skipping blocks is something bad is strange. I added it because VACUUM could and did regularly hang on busy tables, which resulted in bloat because other blocks that needed cleaning didn't get any attention. Which is better, spend time obsessively trying to vacuum particular blocks, or to spend the time on other blocks that are in need of cleaning and are available to be cleaned? Which is better, have autovacuum or system wide vacuum progress on to other tables that need cleaning, or spend lots of effort retrying? How do we know what is the best next action? I'd really want to see some analysis of those things before we spend even more cycles on this. That's the entire point of logging this information. There is an underlying assumption that we won't actually skip many pages, but there's no data to back that up, nor is there currently any way to get that data. My hope is that the logging shows that there isn't anything more that needs to be done here. 
If this is something that causes problems, at least now DBAs will be aware of it and hopefully we'll be able to identify specific problem scenarios and find a solution. BTW, my initial proposal[1] was strictly logging. The only difference was raising it to a warning if a significant portion of the table was skipped. I only investigated retrying locks at the suggestion of others. I never intended this to become a big time sink. [1]: Currently, a non-freeze vacuum will punt on any page it can't get a cleanup lock on, with no retry. Presumably this should be a rare occurrence, but I think it's bad that we just assume that and won't warn the user if something bad is going on. My thought is that if we skip any pages elog(LOG) how many we skipped. If we skip more than 1% of the pages we visited (not relpages) then elog(WARNING) instead. -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com
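The proposal in [1] reduces to a tiny decision rule. A sketch with invented names (the level constants only mimic elog()'s LOG/WARNING):

```c
#include <assert.h>

/* Stand-ins for elog() message levels. */
#define SKIP_SILENT  0
#define SKIP_LOG     1
#define SKIP_WARNING 2

/* Say nothing when no pages were skipped, elog(LOG) when any were,
 * and escalate to elog(WARNING) when more than 1% of the pages
 * visited (not relpages) were skipped. */
static int
skip_log_level(long pages_visited, long pages_skipped)
{
    if (pages_skipped == 0)
        return SKIP_SILENT;
    if (pages_skipped * 100 > pages_visited)
        return SKIP_WARNING;
    return SKIP_LOG;
}
```

This also answers Simon's objection about a line per VACUUM: with skipped == 0 nothing is emitted at all.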
Re: [HACKERS] Casting issues with domains
On 12/8/14, 9:18 AM, Tom Lane wrote: The short answer is that SQL domains are not zero-cost type aliases. Perhaps there would be value in having a feature that *is* a zero-cost alias, but it wouldn't be a domain. Note that you can actually re-use the support functions of one data type to create a new one. So if you wanted a special type called document that actually behaved the same as text you could do that fairly easily (though not as easily as creating a domain). If we were going to expend energy here, I suspect it would be more useful to look at ways of creating new types without requiring C. C isn't an option in many (even most) environments in today's cloud world, aside from the intimidation factor. There are comments in the code that hypothesize about making cstring a full type; that might be all that's needed. -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com
Re: [HACKERS] Compression of full-page-writes
On Tue, Dec 9, 2014 at 5:33 AM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 12/08/2014 09:21 PM, Andres Freund wrote: I still think that just compressing the whole record if it's above a certain size is going to be better than compressing individual parts. Michael argued that that'd be complicated because of the varying size of the required 'scratch space'. I don't buy that argument though. It's easy enough to simply compress all the data in some fixed chunk size. I.e. always compress 64kb in one go. If there's more, compress that independently. Doing it in fixed-size chunks doesn't help - you have to hold onto the compressed data until it's written to the WAL buffers. But you could just allocate a large enough scratch buffer, and give up if it doesn't fit. If the compressed data doesn't fit in e.g. 3 * 8kb, it didn't compress very well, so there's probably no point in compressing it anyway. Now, an exception to that might be a record that contains something other than page data, like a commit record with millions of subxids, but I think we could live with not compressing those, even though it would be beneficial to do so. Another thing to consider is the possibility of controlling at GUC level the maximum size of a record we allow to compress. -- Michael
Re: [HACKERS] Role Attribute Bitmask Catalog Representation
On Tue, Dec 9, 2014 at 12:10 AM, Adam Brightwell adam.brightw...@crunchydatasolutions.com wrote: Michael, This work will certainly continue to be pursued. If a simple move is possible/acceptable, then I think that would be the best option. I can handle that as it would appear that I am capable of moving it, if that is acceptable. Yes please. Actually I could have done it, just found the option to do so. Let's see what shows up with your work. I have moved it to commitfest 2014-12 and marked as Waiting on Author if that is acceptable. Thanks! I guess that's fine. -- Michael
Re: [HACKERS] On partitioning
On 12/8/14, 1:05 PM, Robert Haas wrote: Besides, I haven't really seen anyone propose something that sounds like a credible alternative. If we could make partition objects things that the storage layer needs to know about but the query planner doesn't need to understand, that'd be maybe worth considering. But I don't see any way that that's remotely feasible. There are lots of places that we assume that a heap consists of blocks number 0 up through N: CTID pointers, index-to-heap pointers, nodeSeqScan, bits and pieces of the way index vacuuming is handled, which in turn bleeds into Hot Standby. You can't just decide that now block numbers are going to be replaced by some more complex structure, or even that they're now going to be nonlinear, without breaking a huge amount of stuff. Agreed, but it's possible to keep a block/CTID interface while doing something different on the disk. If you think about it, partitioning is really a hack anyway. It clutters up your logical set implementation with a bunch of physical details. What most people really want when they implement partitioning is simply data locality. -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com
Re: [HACKERS] On partitioning
On 12/8/14, 12:26 PM, Josh Berkus wrote: 4. Creation Locking Problem high probability of lock pile-ups whenever a new partition is created on demand due to multiple backends trying to create the partition at the same time. Not Addressed? Do users actually try and create new partitions during DML? That sounds doomed to failure in pretty much any system... 6. Unique Index Problem Cannot create a unique index across multiple partitions, which prevents the partitioned table from being FK'd. Not Addressed (but could be addressed in the future) And would be extremely useful even with simple inheritance, let alone partitioning... 9. Hibernate Problem When using the trigger method, inserts into the master partition return 0, which Hibernate and some other ORMs regard as an insert failure. Addressed. It would be really nice to address this with regular inheritance too... 11. Hash Partitioning Some users would prefer to partition into a fixed number of hash-allocated partitions. Not Addressed. Though, you should be able to do that in either system if you bother to define your own hash in a BEFORE trigger... A. COPY/ETL then attach In inheritance partitioning, you can easily build a partition outside the master and then attach it, allowing for minimal disturbance of concurrent users. Could be addressed in the future. How much of the desire for this is because our current row routing solutions are very slow? I suspect that's the biggest reason, and hopefully Alvaro's proposal mostly eliminates it. B. Catchall Partition Many partitioning schemes currently contain a catchall partition which accepts rows outside of the range of the partitioning scheme, due to bad input data. Probably not handled on purpose; Alvaro is proposing that we reject these instead, or create the partitions on demand, which is a legitimate approach. C. Asymmetric Partitioning / NULLs in partition column This is the classic Active/Inactive By Month setup for partitions. 
Could be addressed via special handling for NULL/infinity in the partitioned column. If we allowed for a catchall partition and supported normal inheritance/triggers on that partition then users could continue to do whatever they needed with data that didn't fit the normal partitioning pattern. -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com
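The "define your own hash in a BEFORE trigger" workaround for item 11 comes down to this routing computation. A sketch with invented names (FNV-1a is only an arbitrary stand-in; a real trigger would use PostgreSQL's own hash support):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_PARTITIONS 16        /* the fixed partition count */

/* Arbitrary stand-in hash (FNV-1a), used here just so the sketch
 * is self-contained. */
static uint32_t
fnv1a(const unsigned char *key, size_t len)
{
    uint32_t h = 2166136261u;
    size_t i;

    for (i = 0; i < len; i++)
    {
        h ^= key[i];
        h *= 16777619u;
    }
    return h;
}

/* What the BEFORE trigger would compute to pick the child table
 * for a row with this partitioning key. */
static int
hash_partition_for(int64_t key)
{
    return (int) (fnv1a((const unsigned char *) &key,
                        sizeof(key)) % NUM_PARTITIONS);
}
```

As Josh points out below, this only helps row routing: SELECTs would still need the same hash(column) qualifier added to every query to get partition pruning.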
Re: [HACKERS] Compression of full-page-writes
On 9 December 2014 at 04:09, Robert Haas robertmh...@gmail.com wrote: On Sun, Dec 7, 2014 at 9:30 PM, Simon Riggs si...@2ndquadrant.com wrote: * parameter should be SUSET - it doesn't *need* to be set only at server start since all records are independent of each other Why not USERSET? There's no point in trying to prohibit users from doing things that will cause bad performance because they can do that anyway. Yes, I think USERSET would work fine for this. * ideally we'd like to be able to differentiate the types of usage. which then allows the user to control the level of compression depending upon the type of action. My first cut at what those settings should be are ALL LOGICAL PHYSICAL VACUUM. VACUUM - only compress while running vacuum commands PHYSICAL - only compress while running physical DDL commands (ALTER TABLE set tablespace, CREATE INDEX), i.e. those that wouldn't typically be used for logical decoding LOGICAL - compress FPIs for record types that change tables ALL - all user commands (each level includes all prior levels) Interesting idea, but what evidence do we have that a simple on/off switch isn't good enough? Yes, I think that was overcooked. What I'm thinking is that in the long run we might have groups of parameters attached to different types of action, so we wouldn't need, for example, two parameters for work_mem and maintenance_work_mem. We'd just have work_mem and then a scheme that has different values of work_mem for different action types. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Compression of full-page-writes
On 9 December 2014 at 04:21, Andres Freund and...@2ndquadrant.com wrote: On 2014-12-08 14:09:19 -0500, Robert Haas wrote: records, just fpis. There is no evidence that we even want to compress other record types, nor that our compression mechanism is effective at doing so. Simple = keep name as compress_full_page_writes Quite right. I don't really agree with this. There's lots of records which can be quite big where compression could help a fair bit. Most prominently HEAP2_MULTI_INSERT + INIT_PAGE. During initial COPY that's the biggest chunk of WAL. And these are big and repetitive enough that compression is very likely to be beneficial. Yes, you're right there. I was forgetting those aren't FPIs. However they are close enough that it wouldn't necessarily affect the naming of a parameter that controls such compression. I still think that just compressing the whole record if it's above a certain size is going to be better than compressing individual parts. I think it's OK to think it, but we should measure it. For now then, I remove my objection to a commit of this patch based upon parameter naming/rethinking. We have a fine tradition of changing the names after the release is mostly wrapped, so let's pick a name in a few months time when the dust has settled on what's in. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Proposal: Log inability to lock pages during vacuum
On 9 December 2014 at 06:28, Jim Nasby jim.na...@bluetreble.com wrote: On 12/7/14, 6:16 PM, Simon Riggs wrote: What I'm more interested in is what you plan to do with the information once we get it? The assumption that skipping blocks is something bad is strange. I added it because VACUUM could and did regularly hang on busy tables, which resulted in bloat because other blocks that needed cleaning didn't get any attention. Which is better, spend time obsessively trying to vacuum particular blocks, or to spend the time on other blocks that are in need of cleaning and are available to be cleaned? Which is better, have autovacuum or system wide vacuum progress on to other tables that need cleaning, or spend lots of effort retrying? How do we know what is the best next action? I'd really want to see some analysis of those things before we spend even more cycles on this. That's the entire point of logging this information. There is an underlying assumption that we won't actually skip many pages, but there's no data to back that up, nor is there currently any way to get that data. There is no such underlying assumption. You assumed there was one, but there isn't one. All I can say for certain is that waiting on a lock for long periods was literally a waste of time. Now it no longer wastes time, it gets on with vacuuming the pages it can. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] On partitioning
On 12/08/2014 02:12 PM, Jim Nasby wrote: On 12/8/14, 12:26 PM, Josh Berkus wrote: 4. Creation Locking Problem high probability of lock pile-ups whenever a new partition is created on demand due to multiple backends trying to create the partition at the same time. Not Addressed? Do users actually try and create new partitions during DML? That sounds doomed to failure in pretty much any system... There is no question that it would be easier for users to create partitions on demand automatically. Particularly if you're partitioning by something other than time. For a particular case, consider users on RDS, which has no cron jobs for creating new partitons; it's on demand or manually. It's quite possible that there is no good way to work out the locking for on-demand partitions though, but *if* we're going to have a 2nd partition system, I think it's important to at least discuss the problems with on-demand creation. 11. Hash Partitioning Some users would prefer to partition into a fixed number of hash-allocated partitions. Not Addressed. Though, you should be able to do that in either system if you bother to define your own hash in a BEFORE trigger... That doesn't do you any good with the SELECT query, unless you change your middleware to add a hash(column) to every query. Which would be really hard to do for joins. A. COPY/ETL then attach In inheritance partitioning, you can easily build a partition outside the master and then attach it, allowing for minimal disturbance of concurrent users. Could be addressed in the future. How much of the desire for this is because our current row routing solutions are very slow? I suspect that's the biggest reason, and hopefully Alvaro's proposal mostly eliminates it. That doesn't always work, though. In some cases the partition is being built using some fairly complex logic (think of partitions which are based on matviews) and there's no fast way to create the new data. 
Again, this is an acceptable casualty of an improved design, but if it will be so, we should consciously decide that. B. Catchall Partition Many partitioning schemes currently contain a catchall partition which accepts rows outside of the range of the partitioning scheme, due to bad input data. Probably not handled on purpose; Alvaro is proposing that we reject these instead, or create the partitions on demand, which is a legitimate approach. C. Asymmetric Partitioning / NULLs in partition column This is the classic Active/Inactive By Month setup for partitions. Could be addressed via special handling for NULL/infinity in the partitioned column. If we allowed for a catchall partition and supported normal inheritance/triggers on that partition then users could continue to do whatever they needed with data that didn't fit the normal partitioning pattern. That sounds to me like it would fall under the heading of impossible levels of backwards-compatibility. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Re: [HACKERS] Proposal : REINDEX SCHEMA
On Tue, Dec 2, 2014 at 3:42 PM, Michael Paquier michael.paqu...@gmail.com wrote: Adding on top of that a couple of things cleaned up, like docs and typos, and I got the patch attached. Let's have a committer have a look at it now, I am marking that as Ready for Committer. For the archives, this has been committed as fe263d1. Thanks Simon for looking and the final push. And sorry that I didn't spot the issue with tap tests when reviewing; check-world passed but my dev VM missed necessary perl packages. Regards, -- Michael
Re: [HACKERS] pg_recvlogical description
On Mon, Dec 8, 2014 at 11:53 PM, Euler Taveira eu...@timbira.com.br wrote: Hi, The pg_recvlogical docs were rewritten but someone forgot to tweak the help description. It is a bit late in the 9.4 cycle but let's be consistent. Yeah, that makes sense. +1 for making docs consistent on master. For 9.4 this is a bit too late IMO. -- Michael
Re: [HACKERS] alter user set local_preload_libraries.
On 12/8/14 12:39 PM, Robert Haas wrote: On Sun, Dec 7, 2014 at 9:54 AM, Peter Eisentraut pete...@gmx.net wrote: My radical proposal therefore would have been to embrace this inconsistency and get rid of PGC_BACKEND and PGC_SU_BACKEND altogether, relying on users interpreting the parameter names to indicate that changing them later may or may not have an effect. Unfortunately, the concerns about ignore_system_indexes prevent that. What exactly are those concerns? Do you have a link to previous discussion? Earlier in the thread: http://www.postgresql.org/message-id/20108.1415120...@sss.pgh.pa.us
Re: [HACKERS] Status of Commit fest 2014-10
On 12/08/2014 09:19 AM, Michael Paquier wrote: - Use faster, higher precision timer API GetSystemTimeAsFileTime on windows This is now committed. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] alter user set local_preload_libraries.
Peter Eisentraut pete...@gmx.net writes: On 12/8/14 12:39 PM, Robert Haas wrote: On Sun, Dec 7, 2014 at 9:54 AM, Peter Eisentraut pete...@gmx.net wrote: My radical proposal therefore would have been to embrace this inconsistency and get rid of PGC_BACKEND and PGC_SU_BACKEND altogether, relying on users interpreting the parameter names to indicate that changing them later may or may not have an effect. Unfortunately, the concerns about ignore_system_indexes prevent that. What exactly are those concerns? Do you have a link to previous discussion? Earlier in the thread: http://www.postgresql.org/message-id/20108.1415120...@sss.pgh.pa.us The core of the mentioned issues is that catalog searches done via the systable_beginscan/systable_getnext API will ordinarily visit catalog entries in the order of the specified index. However, if ignore_system_indexes is set, you get a seqscan that will return the same tuples in heap order (effectively, random order). There are known cases where this results in minor planner inefficiencies, and I'm worried that there might be outright bugs we don't know about, since that whole operating mode can best be described as entirely untested outside of the bootstrap sequence. Barring someone committing to spend the time to improve that situation (time that would be poorly invested IMO), I don't think that we want to open up ignore_system_indexes as USERSET, or do anything else to encourage its use. If we're intent on removing PGC_BACKEND then I'd be okay with reclassifying ignore_system_indexes as SUSET; but I'm not exactly convinced that we should be trying to get rid of PGC_BACKEND. regards, tom lane
Re: [HACKERS] On partitioning
From: Robert Haas [mailto:robertmh...@gmail.com]
> On Sat, Dec 6, 2014 at 2:59 AM, Amit Kapila <amit.kapil...@gmail.com> wrote:
>>> I guess you could list or hash partition on multiple columns, too.
>> How would you distinguish values in list partition for multiple columns?
>> I mean for range partition, we are sure there will be either one value
>> for each column, but for list it could be multiple and not fixed for
>> each partition, so I think it will not be easy to support the
>> multicolumn partition key for list partitions.
> I don't understand.  If you want to range partition on columns (a, b),
> you say that, say, tuples with (a, b) values less than (100, 200) go
> here and the rest go elsewhere.  For list partitioning, you say that,
> say, tuples with (a, b) values of EXACTLY (100, 200) go here and the
> rest go elsewhere.  I'm not sure how useful that is but it's not
> illogical.

In case of list partitioning, 100 and 200 would respectively be one of
the values in lists of allowed values for a and b.  I thought his concern
is whether this list of values for each column in partkey is as
convenient to store and manipulate as range partvalues.

Thanks,
Amit
[HACKERS] moving from contrib to bin
Let's take another crack at moving stuff out of contrib.  Nobody likes
contrib.  The task is only finding something that most people like better.

Last time this was attempted, the discussion got lost in exactly which
extensions are worthy enough to be considered official or something like
that.  I want to dodge that here by starting at the opposite end:

1. move programs to src/bin/
2. move test things to src/test/ (I had that in my notes, but someone
   already did that, so the stars must be aligned.)
3. deal with extensions later

Here are the contrib programs:

oid2name
pg_archivecleanup
pg_standby
pg_test_fsync
pg_test_timing
pg_upgrade
pg_xlogdump
pgbench
vacuumlo

The proposal would basically be to mv contrib/$x src/bin/$x and also move
the reference pages in the documentation.  We could consider alternative
arrangements, if there is interest, such as moving vacuumlo to scripts or
moving pg_archivecleanup and pg_standby into one directory.  It doesn't
matter very much to me.  There is precedent for this: some time ago we
moved reindexdb from contrib to scripts.

Besides moving things out of contrib, there is also a practical
motivation for this.  Putting both client and server programs into
contrib creates issues for packagers.  One would have to create a
separate -contrib-client package to package this properly.  Not to
mention packaging a bunch of unrelated extensions with these programs.
If we make these normal programs in src/bin/, packagers can put them into
the normal -client and -server packages, and everything will fall into
place.  Besides, a number of packagers have been treating pg_upgrade
specially and moved it out of contrib, so this would just be catching up
with reality a bit.

Comments?
Re: [HACKERS] moving from contrib to bin
Peter Eisentraut <pete...@gmx.net> writes:
> Last time this was attempted, the discussion got lost in exactly which
> extensions are worthy enough to be considered official or something like
> that.  I want to dodge that here by starting at the opposite end:
> 1. move programs to src/bin/

> Here are the contrib programs:
> oid2name
> pg_archivecleanup
> pg_standby
> pg_test_fsync
> pg_test_timing
> pg_upgrade
> pg_xlogdump
> pgbench
> vacuumlo

> The proposal would basically be to mv contrib/$x src/bin/$x and also
> move the reference pages in the documentation.

Personally, I'm good with moving pg_archivecleanup, pg_standby,
pg_upgrade, pg_xlogdump, and pgbench this way.  (Although wasn't there
just some discussion about pg_standby being obsolete?  If so, shouldn't
we remove it instead of promoting it?)

As for the others: I'm not exactly convinced that we want to encourage
packagers to include either pg_test_fsync or pg_test_timing in standard
packages.  They are not all that useful to ordinary users.

oid2name and vacuumlo, besides being of very dubious general utility, are
fails from a namespacing standpoint.  If we were to promote them into
standard install components I think a minimum requirement should be to
rename them to pg_something.  (oid2name is an entirely bogus name for
what it does, anyway.)  That would also be a good opportunity to revisit
their rather ad-hoc APIs.

			regards, tom lane
Re: [HACKERS] On partitioning
On Tue, Dec 9, 2014 at 8:08 AM, Amit Langote <langote_amit...@lab.ntt.co.jp> wrote:
> From: Robert Haas [mailto:robertmh...@gmail.com]
>> On Sat, Dec 6, 2014 at 2:59 AM, Amit Kapila <amit.kapil...@gmail.com> wrote:
>>>> I guess you could list or hash partition on multiple columns, too.
>>> How would you distinguish values in list partition for multiple
>>> columns?  I mean for range partition, we are sure there will be either
>>> one value for each column, but for list it could be multiple and not
>>> fixed for each partition, so I think it will not be easy to support
>>> the multicolumn partition key for list partitions.
>> I don't understand.  If you want to range partition on columns (a, b),
>> you say that, say, tuples with (a, b) values less than (100, 200) go
>> here and the rest go elsewhere.  For list partitioning, you say that,
>> say, tuples with (a, b) values of EXACTLY (100, 200) go here and the
>> rest go elsewhere.  I'm not sure how useful that is but it's not
>> illogical.
> In case of list partitioning, 100 and 200 would respectively be one of
> the values in lists of allowed values for a and b.  I thought his
> concern is whether this list of values for each column in partkey is as
> convenient to store and manipulate as range partvalues.

Yeah, and also how would the user specify the values?  As an example,
assume that a table is partitioned on monthly_salary, so the partition
definition would look like:

PARTITION BY LIST(monthly_salary)
(
  PARTITION salary_less_than_thousand VALUES(300, 900),
  PARTITION salary_less_than_two_thousand VALUES (500, 1000, 1500),
  ...
)

Now if the user wants to define a multi-column partition based on
monthly_salary and annual_salary, how do we want him to specify the
values?  Basically, how to distinguish which values belong to the first
column key and which ones belong to the second column key.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
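One way to resolve the ambiguity Amit raises is to bind each partition to
a list of column-value *tuples* rather than a flat list of scalars, so
each value is unambiguously tied to its column.  A minimal sketch of that
routing idea (plain Python, purely illustrative; the partition names,
salary figures, and tuple-based bound representation are all made up for
this example and are not any proposed PostgreSQL syntax):

```python
# Hypothetical multi-column list partitioning: each partition is bound to
# a set of (monthly_salary, annual_salary) tuples, so there is never a
# question of which value belongs to which column.
partitions = {
    "salary_low": {(300, 3600), (900, 10800)},
    "salary_mid": {(500, 6000), (1000, 12000), (1500, 18000)},
}

def route(row):
    """Return the partition whose bound-tuple set contains this row's key."""
    key = (row["monthly_salary"], row["annual_salary"])
    for name, bound_tuples in partitions.items():
        if key in bound_tuples:
            return name
    return None  # no matching partition: overflow/error in a real system

print(route({"monthly_salary": 900, "annual_salary": 10800}))  # salary_low
print(route({"monthly_salary": 42, "annual_salary": 999}))     # None
```

The same tuple-list shape could back whatever on-disk representation the
catalog ends up using; the point is only that grouping per-row values
into tuples removes the column-assignment ambiguity.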
Re: [HACKERS] On partitioning
On Tue, Dec 9, 2014 at 1:42 AM, Robert Haas <robertmh...@gmail.com> wrote:
> On Mon, Dec 8, 2014 at 2:56 PM, Andres Freund <and...@2ndquadrant.com> wrote:
>>> I don't think that's mutually exclusive with the idea of
>>> partitions-as-tables.  I mean, you can add code to the ALTER TABLE
>>> path that says if (i_am_not_the_partitioning_root) ereport(ERROR, ...)
>>> wherever you want.
>> That'll be a lot of places you'll need to touch.  More fundamentally:
>> Why should we name something a table that's not one?
> Well, I'm not convinced that it isn't one.  And adding a new relkind
> will involve a bunch of code churn, too.  But I don't much care to
> pre-litigate this: when someone has got a patch, we can either agree
> that the approach is OK or argue that it is problematic because X.  I
> think we need to hammer down the design in broad strokes first, and I'm
> not sure we're totally there yet.

That's right.  I think at this point defining the top-level
behaviour/design is very important to proceed; we can decide about the
better implementation approach afterwards (maybe once an initial patch is
ready, because it might not be major work to do it either way).

So here's where we are on this point till now, as per my understanding:
I think that direct operations should be prohibited on partitions, you
think that they should be allowed, and Andres thinks that it might be
better to allow direct operations on partitions for reads only.

>>> - Direct access to individual partitions to bypass
>>>   tuple-routing/query-planning overhead.  I think that might be ok in
>>>   some cases, but in general I'd be very wary to allow that.
>> I think it might be ok to allow direct read access, but everything else
>> I'd be opposed to.  I'd much rather go the route of allowing too few
>> things and then gradually opening up if required than the other way
>> round (as that pretty much will never happen because it'll break
>> deployed systems).
> Why?

Because I think it will be difficult for users to write and maintain more
of such code, which is one of the complaints with the previous system
where the user needs to write triggers to route the tuple to the
appropriate partition.  I think in the first step we should try to
improve the tuple routing algorithm so that it is not a pain for users,
or at least so that it is at par with some of the other competitive
database systems; if we are not able to come up with such an
implementation, then maybe we can think of providing direct access as a
special way for users to improve performance.

Another reason is that, fundamentally, partitions are managed internally
to divide the user data in a way that makes access cheaper, and we take
the specifications for defining the partitions from users; allowing
operations on internally managed objects could lead to the user writing
quite some code to do what the database actually does internally.  As an
analogy, TOAST tables are internally used to manage large tuples, yet we
don't want users to directly perform DML on those tables.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] moving from contrib to bin
On 2014-12-08 22:50:30 -0500, Tom Lane wrote:
> I'm not exactly convinced that we want to encourage packagers to include
> either pg_test_fsync or pg_test_timing in standard packages.  They are
> not all that useful to ordinary users.

I actually think both are quite useful when setting up new systems to
quickly screen for problems.  There still is a fairly large number of
virtualized systems with pretty much broken timing functions; and
checking whether fsync actually takes some time is also a good thing to
do in virtualized environments - it's not an infrequent thing to see
fsyncs taking unrealistically little time.

Neither is likely to be harmful.  So it doesn't seem harmful to move
them.

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Compression of full-page-writes
On Mon, Dec 8, 2014 at 3:17 PM, Simon Riggs <si...@2ndquadrant.com> wrote:
> On 8 December 2014 at 11:46, Michael Paquier <michael.paqu...@gmail.com> wrote:
>> I don't really like those new names, but I'd prefer
>> wal_compression_level if we go down that road with 'none' as default
>> value.  We may still decide in the future to support compression at the
>> record level instead of context level, particularly if we have an API
>> able to do palloc_return_null_at_oom, so the idea of WAL compression is
>> not related only to FPIs IMHO.
> We may yet decide, but the pglz implementation is not effective on
> smaller record lengths.  Nor has any testing been done to show that is
> even desirable.

It's even much worse for non-compressible (or less-compressible) WAL
data.  I am not clear how a simple on/off switch could address such
cases, because the benefit can depend on which tables the user is
operating on (the schema or data in some tables may be more amenable to
compression, in which case it can give us benefits).  I think maybe we
should consider something along the lines of what Robert touched on in
one of his e-mails (a context-aware compression strategy).

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
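Simon's point that pglz is ineffective on short records, and Amit's that
it is worse still on non-compressible data, are easy to demonstrate.  A
rough sketch using zlib as a stand-in for pglz (an assumption: the fixed
per-call overhead and the inability to compress random bytes behave
analogously, even though the formats differ):

```python
import os
import zlib

# A short, WAL-record-sized payload of incompressible (random) bytes:
small_random = os.urandom(64)

# A full-page-image-sized payload (8 kB) with plenty of repetition,
# like the zeroed free space typical of a heap page:
page = b"tuple data" * 100 + b"\x00" * (8192 - 1000)

compressed_small = zlib.compress(small_random)
compressed_page = zlib.compress(page)

# Header/checksum overhead makes the small random record *larger*
# after "compression"...
print(len(small_random), len(compressed_small))
# ...while the repetitive 8 kB page shrinks dramatically.
print(len(page), len(compressed_page))
```

This is why compressing only FPIs (large, often highly compressible) is
a much safer default than compressing every WAL record.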
Re: [HACKERS] moving from contrib to bin
On Mon, Dec 8, 2014 at 9:00 PM, Andres Freund <and...@2ndquadrant.com> wrote:
> I actually think both are quite useful when setting up new systems to
> quickly screen for problems.  There still is a fairly large number of
> virtualized systems with pretty much broken timing functions; and
> checking whether fsync actually takes some time is also a good thing to
> do in virtualized environments - it's not an infrequent thing to see
> fsyncs taking unrealistically low time.
>
> Neither is likely to be harmful.  So it doesn't seem harmful to move
> them.

+1

--
Peter Geoghegan
Re: [HACKERS] Parallel Seq Scan
On Mon, Dec 8, 2014 at 11:21 PM, Robert Haas <robertmh...@gmail.com> wrote:
> On Sat, Dec 6, 2014 at 1:50 AM, Amit Kapila <amit.kapil...@gmail.com> wrote:
>> I think we have access to this information in the planner
>> (RelOptInfo->pages); if we want, we can use that to eliminate small
>> relations from parallelism, but the question is how big a relation we
>> want to consider for parallelism.  One way is to check via tests, which
>> I am planning to follow; do you think we have any heuristic which we
>> can use to decide how big a relation should be to be considered for
>> parallelism?
> Surely the Path machinery needs to decide this in particular cases based
> on cost.  We should assign some cost to starting a parallel worker via
> some new GUC, like parallel_startup_cost = 100,000.  And then we should
> also assign a cost to the act of relaying a tuple from the parallel
> worker to the master, maybe cpu_tuple_cost (or some new GUC).  For a
> small relation, or a query with a LIMIT clause, the parallel startup
> cost will make starting a lot of workers look unattractive, but for
> bigger relations it will make sense from a cost perspective, which is
> exactly what we want.

Sounds sensible.  cpu_tuple_cost is already used for another purpose, so
I am not sure it is the right thing to override that parameter; how about
cpu_tuple_communication_cost or cpu_tuple_comm_cost?

> There are probably other important considerations based on goals for
> overall resource utilization, and also because at a certain point adding
> more workers won't help because the disk will be saturated.  I don't
> know exactly what we should do about those issues yet, but the steps
> described in the previous paragraph seem like a good place to start
> anyway.

Agreed.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
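Robert's proposed costing can be sketched in a few lines.  Heavily
hedged: the setup cost, per-tuple costs, and selectivity below are
made-up illustrative numbers, not actual planner GUCs, and the formula is
a deliberately naive caricature of what a real cost model would do:

```python
def serial_scan_cost(tuples, cpu_tuple_cost=0.01):
    """Naive serial seq scan: pay a per-tuple processing cost."""
    return tuples * cpu_tuple_cost

def parallel_scan_cost(tuples, workers, selectivity=0.01,
                       parallel_setup_cost=100_000.0,
                       cpu_tuple_cost=0.01, cpu_tuple_comm_cost=0.1):
    """Naive parallel seq scan: pay a startup cost per worker, split the
    per-tuple scan work across workers, then pay to relay each qualifying
    tuple (here 1% of the table) back to the master."""
    scan = tuples * cpu_tuple_cost / workers
    relay = tuples * selectivity * cpu_tuple_comm_cost
    return parallel_setup_cost * workers + scan + relay

small, big = 10_000, 100_000_000
# The startup cost makes parallelism unattractive for a small relation...
print(serial_scan_cost(small) < parallel_scan_cost(small, workers=2))  # True
# ...but for a big relation the parallel plan wins.
print(parallel_scan_cost(big, workers=2) < serial_scan_cost(big))      # True
```

This is exactly the behavior Robert describes: the fixed startup cost
screens out small relations without any hard size threshold.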
Re: [HACKERS] Parallel Seq Scan
On Mon, Dec 8, 2014 at 11:27 PM, Robert Haas <robertmh...@gmail.com> wrote:
> On Sat, Dec 6, 2014 at 7:07 AM, Stephen Frost <sfr...@snowman.net> wrote:
>> For my 2c, I'd like to see it support exactly what the SeqScan node
>> supports and then also what Foreign Scan supports.  That would mean
>> we'd then be able to push filtering down to the workers, which would be
>> great.  Even better would be figuring out how to parallelize an Append
>> node (perhaps only possible when the nodes underneath are all SeqScan
>> or ForeignScan nodes) since that would allow us to then parallelize the
>> work across multiple tables and remote servers.
> I don't see how we can support the stuff ForeignScan does; presumably
> any parallelism there is up to the FDW to implement, using whatever
> in-core tools we provide.  I do agree that parallelizing Append nodes is
> useful; but let's get one thing done first before we start trying to do
> thing #2.
>> I'm not entirely following this.  How can the worker be responsible for
>> its own plan when the information passed to it (per the above
>> paragraph..) is pretty minimal?  In general, I don't think we need to
>> have specifics like "this worker is going to do exactly X" because we
>> will eventually need some communication to happen between the worker
>> and the master process where the worker can ask for more work because
>> it's finished what it was tasked with, and the master will need to give
>> it another chunk of work to do.  I don't think we want exactly what
>> each worker process will do to be fully formed at the outset because,
>> even with the best information available, given concurrent load on the
>> system, it's not going to be perfect and we'll end up starving workers.
>> The plan, as formed by the master, should be more along the lines of
>> "this is what I'm gonna have my workers do", along with how many
>> workers, etc., and then it goes and does it.  Perhaps for an 'explain
>> analyze' we return information about which workers actually *did* what,
>> but that's a whole different discussion.
> I agree with this.  For a first version, I think it's OK to start a
> worker up for a particular sequential scan and have it help with that
> sequential scan until the scan is completed, and then exit.  It should
> not, as the present version of the patch does, assign a fixed block
> range to each worker; instead, workers should allocate a block or chunk
> of blocks to work on until no blocks remain.  That way, even if every
> worker but one gets stuck, the rest of the scan can still finish.

I will check on this point and see if it is feasible to do something
along those lines.  Basically, currently at Executor initialization phase
we set the scan limits and then during Executor run phase use
heap_getnext to fetch the tuples accordingly, but doing it dynamically
means at ExecutorRun phase we need to reset the scan limit for which
page/pages to scan; still I have to check if there is any problem with
such an idea.  Do you have any different idea in mind?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
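The block-at-a-time allocation Robert describes amounts to workers
claiming chunks from a shared counter instead of being handed fixed
ranges up front.  A toy sketch (Python threads standing in for background
workers; the chunk size and counter scheme are illustrative assumptions,
not PostgreSQL's actual implementation):

```python
import threading

TOTAL_BLOCKS = 1000   # blocks in the relation being scanned
CHUNK = 16            # blocks claimed per allocation

next_block = 0
alloc_lock = threading.Lock()
scanned = []          # (block, worker_id) pairs, for verification
scanned_lock = threading.Lock()

def claim_chunk():
    """Atomically claim the next chunk of blocks; None when scan is done."""
    global next_block
    with alloc_lock:
        if next_block >= TOTAL_BLOCKS:
            return None
        start = next_block
        next_block = min(next_block + CHUNK, TOTAL_BLOCKS)
        return range(start, next_block)

def worker(wid):
    # Keep claiming chunks until the shared counter is exhausted; a slow
    # worker simply claims fewer chunks rather than stalling the scan.
    while (chunk := claim_chunk()) is not None:
        for blk in chunk:
            with scanned_lock:
                scanned.append((blk, wid))  # "scan" the block

threads = [threading.Thread(target=worker, args=(w,)) for w in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Every block is scanned exactly once, however unevenly workers progress.
print(len(scanned), len({b for b, _ in scanned}))  # 1000 1000
```

Contrast with fixed ranges: if worker 0 owned blocks 0-249 and got stuck,
those blocks would never finish; here another worker just claims the
remaining chunks.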
Re: [HACKERS] On partitioning
On Tue, Dec 9, 2014 at 12:59 PM, Amit Kapila <amit.kapil...@gmail.com> wrote:
> On Tue, Dec 9, 2014 at 8:08 AM, Amit Langote <langote_amit...@lab.ntt.co.jp> wrote:
>> From: Robert Haas [mailto:robertmh...@gmail.com]
>>> I don't understand.  If you want to range partition on columns (a, b),
>>> you say that, say, tuples with (a, b) values less than (100, 200) go
>>> here and the rest go elsewhere.  For list partitioning, you say that,
>>> say, tuples with (a, b) values of EXACTLY (100, 200) go here and the
>>> rest go elsewhere.  I'm not sure how useful that is but it's not
>>> illogical.
>> In case of list partitioning, 100 and 200 would respectively be one of
>> the values in lists of allowed values for a and b.  I thought his
>> concern is whether this list of values for each column in partkey is as
>> convenient to store and manipulate as range partvalues.
> Yeah and also how would user specify the values, as an example assume
> that table is partitioned on monthly_salary, so partition definition
> would look:
>
> PARTITION BY LIST(monthly_salary)
> (
>   PARTITION salary_less_than_thousand VALUES(300, 900),
>   PARTITION salary_less_than_two_thousand VALUES (500, 1000, 1500),
>   ...
> )
>
> Now if user wants to define multi-column Partition based on
> monthly_salary and annual_salary, how do we want him to specify the
> values.  Basically how to distinguish which values belong to first
> column key and which ones belong to second column key.

Amit, in one of my earlier replies to your question of why we may not
want to implement multi-column list partitioning (lack of user interest
in the feature or possible complexity of the code), I tried to explain
how that may work if we do choose to go that way.  Basically, something
we may call PartitionColumnValue should be such that the above issue can
be suitably sorted out.

For example, a partition defining/bounding value would be a pg_node_tree
representation of a List of one of the (say) following parse nodes as
appropriate:

typedef struct PartitionColumnValue
{
	NodeTag		type;
	Oid		   *partitionid;
	char	   *partcolname;
	char		partkind;
	Node	   *partrangelower;
	Node	   *partrangeupper;
	List	   *partlistvalues;
};

OR separately,

typedef struct RangePartitionColumnValue
{
	NodeTag		type;
	Oid		   *partitionid;
	char	   *partcolname;
	Node	   *partrangelower;
	Node	   *partrangeupper;
};

typedef struct ListPartitionColumnValue
{
	NodeTag		type;
	Oid		   *partitionid;
	char	   *partcolname;
	List	   *partlistvalues;
};

Where a partition definition would look like

typedef struct PartitionDef
{
	NodeTag		type;
	RangeVar	partition;
	RangeVar	parentrel;
	char	   *kind;
	Node	   *values;
	List	   *options;
	char	   *tablespacename;
};

PartitionDef.values is an (ordered) List of PartitionColumnValue, each of
which corresponds to one column in the partition key in that order.

We should be able to devise a way to load the pg_node_tree representation
of PartitionDef.values (on-disk pg_partition_def.partvalues) into the
relcache using a suitable data structure so that it becomes readily
usable in the variety of contexts in which we are interested in using
this information.

Regards,
Amit
Re: [HACKERS] Misunderstanding on the FSM README file
2014-12-07 15:07 GMT+01:00 Heikki Linnakangas <hlinnakan...@vmware.com>:
> On 12/07/2014 02:03 PM, Guillaume Lelarge wrote:
>> Hi,
>>
>> I've been reading the FSM README file lately
>> (src/backend/storage/freespace/README), and I'm puzzled by one of the
>> graphs (the binary tree structure of an FSM file).  Here it is:
>>
>>        4
>>     4     2
>>   3  4  0  2    <- This level represents heap pages
>>
>> Shouldn't the last line be:
>>
>>   4  3  2  0
>>
>> (ie, highest number of free space on the left node, lowest on the
>> right one)
>>
>> Probably just nitpicking, but still, I'm wondering if I missed
>> something out.
> No, that's not how it works.  Each number at the bottom level
> corresponds to a particular heap page.  The first number would be heap
> page #0 (which has 3 units of free space), the second heap page #1 (with
> 4 units of free space) and so forth.  Each node on the upper levels
> stores the maximum of its two children.

Oh OK.  Thanks Heikki, that makes perfect sense.

--
Guillaume.
  http://blog.guillaume.lelarge.info
  http://www.dalibo.com
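Heikki's description - the bottom level holds per-page free space in heap
page order, and each upper node stores the maximum of its two children -
can be checked in a few lines (illustrative only; the real FSM packs this
tree into 8 kB pages rather than Python lists):

```python
def build_fsm_levels(leaf):
    """Build the FSM binary tree bottom-up: each parent = max(children)."""
    levels = [leaf]
    while len(levels[-1]) > 1:
        row = levels[-1]
        # Pair up adjacent entries; each parent keeps the larger child.
        levels.append([max(row[i:i + 2]) for i in range(0, len(row), 2)])
    return levels[::-1]  # root level first

# Free space for heap pages #0..#3, as in the README example:
print(build_fsm_levels([3, 4, 0, 2]))  # [[4], [4, 2], [3, 4, 0, 2]]
```

The leaves stay in heap-page order (3, 4, 0, 2); only the upper levels
aggregate, which is why the README's diagram is correct as written.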
Re: [HACKERS] Proposal : REINDEX SCHEMA
On Tue, Dec 9, 2014 at 10:10 AM, Michael Paquier <michael.paqu...@gmail.com> wrote:
> On Tue, Dec 2, 2014 at 3:42 PM, Michael Paquier <michael.paqu...@gmail.com> wrote:
>> Adding on top of that a couple of things cleaned up, like docs and
>> typos, and I got the patch attached.  Let's have a committer have a
>> look at it now; I am marking that as Ready for Committer.
> For the archives, this has been committed as fe263d1.  Thanks Simon for
> looking and the final push.  And sorry that I didn't spot the issue with
> TAP tests when reviewing; check-world passed but my dev VM missed
> necessary perl packages.

While re-looking at that, I just found that when selecting the relations
that are reindexed for a schema we ignore materialized views, as the key
scan is done using only 'r' as relkind.  The patch attached fixes that.
Thanks,
--
Michael

From ae2b1b8c426698bb7142f9f02e4cf08295e9dd73 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@otacoo.com>
Date: Tue, 9 Dec 2014 16:40:39 +0900
Subject: [PATCH] Fix REINDEX SCHEMA ignoring matviews

The key scan used was using a filter on relation relkind, but that's not
actually necessary as a filter is applied when building the list of OIDs
reindexed.
---
 src/backend/commands/indexcmds.c           |  8 ++--
 src/test/regress/expected/create_index.out | 18 +-
 src/test/regress/sql/create_index.sql      |  6 +++---
 3 files changed, 14 insertions(+), 18 deletions(-)

diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index a3e8a15..9b07216 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1867,16 +1867,12 @@ ReindexObject(const char *objectName, ReindexObjectType objectKind)
 	 */
 	if (objectKind == REINDEX_OBJECT_SCHEMA)
 	{
-		scan_keys = palloc(sizeof(ScanKeyData) * 2);
+		scan_keys = palloc(sizeof(ScanKeyData));
 		ScanKeyInit(&scan_keys[0],
 					Anum_pg_class_relnamespace,
 					BTEqualStrategyNumber, F_OIDEQ,
 					ObjectIdGetDatum(objectOid));
-		ScanKeyInit(&scan_keys[1],
-					Anum_pg_class_relkind,
-					BTEqualStrategyNumber, F_CHAREQ,
-					'r');
-		num_keys = 2;
+		num_keys = 1;
 	}
 	else
 		num_keys = 0;
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index ebac939..abffe65 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -2837,12 +2837,12 @@ explain (costs off)
 REINDEX SCHEMA schema_to_reindex; -- failure, schema does not exist
 ERROR:  schema "schema_to_reindex" does not exist
 CREATE SCHEMA schema_to_reindex;
-CREATE TABLE schema_to_reindex.table1(col1 SERIAL PRIMARY KEY);
-CREATE TABLE schema_to_reindex.table2(col1 SERIAL PRIMARY KEY, col2 VARCHAR(100) NOT NULL);
-CREATE INDEX ON schema_to_reindex.table2(col2);
+CREATE TABLE schema_to_reindex.table(col1 SERIAL PRIMARY KEY);
+CREATE MATERIALIZED VIEW schema_to_reindex.matview AS SELECT col1 FROM schema_to_reindex.table;
+CREATE INDEX ON schema_to_reindex.matview(col1);
 REINDEX SCHEMA schema_to_reindex;
-NOTICE:  table "schema_to_reindex.table1" was reindexed
-NOTICE:  table "schema_to_reindex.table2" was reindexed
+NOTICE:  table "schema_to_reindex.table" was reindexed
+NOTICE:  table "schema_to_reindex.matview" was reindexed
 BEGIN;
 REINDEX SCHEMA schema_to_reindex; -- failure, cannot run in a transaction
 ERROR:  REINDEX SCHEMA cannot run inside a transaction block
@@ -2852,13 +2852,13 @@ CREATE ROLE reindexuser login;
 SET SESSION ROLE user_reindex;
 ERROR:  role "user_reindex" does not exist
 REINDEX SCHEMA schema_to_reindex;
-NOTICE:  table "schema_to_reindex.table1" was reindexed
-NOTICE:  table "schema_to_reindex.table2" was reindexed
+NOTICE:  table "schema_to_reindex.table" was reindexed
+NOTICE:  table "schema_to_reindex.matview" was reindexed
 -- Clean up
 RESET ROLE;
 DROP ROLE user_reindex;
 ERROR:  role "user_reindex" does not exist
 DROP SCHEMA schema_to_reindex CASCADE;
 NOTICE:  drop cascades to 2 other objects
-DETAIL:  drop cascades to table schema_to_reindex.table1
-drop cascades to table schema_to_reindex.table2
+DETAIL:  drop cascades to table schema_to_reindex.table
+drop cascades to materialized view schema_to_reindex.matview
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index 1cd57da..a5b3403 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -970,9 +970,9 @@ explain (costs off)
 --
 REINDEX SCHEMA schema_to_reindex; -- failure, schema does not exist
 CREATE SCHEMA schema_to_reindex;
-CREATE TABLE schema_to_reindex.table1(col1 SERIAL PRIMARY KEY);
-CREATE TABLE schema_to_reindex.table2(col1 SERIAL PRIMARY KEY, col2 VARCHAR(100) NOT NULL);
-CREATE INDEX ON schema_to_reindex.table2(col2);
+CREATE TABLE schema_to_reindex.table(col1 SERIAL PRIMARY KEY);
+CREATE MATERIALIZED VIEW schema_to_reindex.matview AS SELECT col1 FROM schema_to_reindex.table;
+CREATE INDEX ON schema_to_reindex.matview(col1);
 REINDEX SCHEMA