Re: [HACKERS] [PATCH] introduce XLogLockBlockRangeForCleanup()
At 2014-07-03 11:15:53 +0530, amit.khande...@enterprisedb.com wrote: In GetBufferWithoutRelcache(), I was wondering, rather than calling PinBuffer(), if we do this: LockBufHdr(buf); PinBuffer_Locked(buf); valid = ((buf->flags & BM_VALID) != 0); then we can avoid having the new buffer access strategy BAS_DISCARD that is introduced in this patch, and so the code changes in freelist.c would not be necessary. Thanks for the suggestion. It sounds sensible at first glance. I'll think about it a little more, then try it and see. I will follow up with some performance numbers soon. Yes, that would be nice. I'm afraid I haven't had a chance to do this yet. -- Abhijit
Re: [HACKERS] TODO : Allow parallel cores to be used by vacuumdb [ WIP ]
On Wed, Jul 2, 2014 at 11:45 PM, Alvaro Herrera alvhe...@2ndquadrant.com wrote: Jeff Janes wrote: I would only envision using the parallel feature for vacuumdb after a pg_upgrade or some other major maintenance window (that is the only time I ever envision using vacuumdb at all). I don't think autovacuum can be expected to handle such situations well, as it is designed to be a smooth background process. That's a fair point. One thing that would be pretty neat, but that I don't think I could get anyone to implement, is having the user control the autovacuum launcher in some way: for instance, "please vacuum this set of tables as quickly as possible", and it would launch as many workers as are configured. It would take months to get a UI settled for this, however. This sounds like a better way to have multiple workers working on vacuuming tables. For vacuum, as we already have some sort of infrastructure (vacuum workers) to perform tasks in parallel, why not leverage that instead of inventing a new one, even if we assume that we can reduce the duplicate code. I don't know how to calibrate the number of lines that is worthwhile. If you write in C and need cross-platform compatibility and robust error handling, it seems to take hundreds of lines to do much of anything. The code duplication is a problem, but I don't think raw line count is, especially since it has already been written. Well, there are (at least) two types of duplicate code: first you have these common routines such as pgpipe that are duplicates for no good reason. Just move them to src/port or something and it's all good. But the OP said there is code that cannot be shared even though it's very similar in both incarnations. That means we cannot (or it's difficult to) have just one copy, which means as they fix bugs in one copy we need to update the other. I briefly checked the duplicate code in both versions and I think we might be able to reduce it by a significant amount by making common functions and passing AH where needed (as an example, I have checked the function ParallelBackupStart(), which is more than 100 lines). If you see code duplication as a major reason not to prefer this patch, then I think that can be ameliorated, or at least it is worth a try to do so. However, I think it might be better to achieve this in the way you suggested, using the autovacuum launcher. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
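To make the first kind of duplication concrete, here is a minimal sketch of the sort of helper that could move to src/port. The name pgpipe matches the duplicated routine mentioned above, but the body is illustrative only; the real Windows emulation is socket-based and is elided here:

#include <stdio.h>
#include <unistd.h>

/* Illustrative stand-in for the duplicated pgpipe helper. */
static int
pgpipe(int handles[2])
{
#ifndef WIN32
    return pipe(handles);
#else
    return -1;   /* socket-based emulation elided in this sketch */
#endif
}

int
main(void)
{
    int  fds[2];
    char buf[6] = {0};

    if (pgpipe(fds) != 0 ||
        write(fds[1], "hello", 5) != 5 ||
        read(fds[0], buf, 5) != 5)
        return 1;
    printf("%s\n", buf);   /* prints "hello" */
    return 0;
}

Once such a helper lives in one place, both pg_dump's parallel mode and a parallel vacuumdb could link against it instead of carrying private copies.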
Re: [HACKERS] Autonomous Transaction (WIP)
On 01 July 2014 12:00, Amit Kapila Wrote: On Tue, Jul 1, 2014 at 11:46 AM, Rajeev rastogi rajeev.rast...@huawei.com wrote: On 30 June 2014 22:50, Pavel Stehule Wrote: I didn't find a related message. I think there has been some confusion; the design idea was never rejected, but yes, there were a few pieces of feedback/concern, which I had clarified. Also, some of the other concerns are already fixed in the latest patch. Simon has mentioned that exactly this idea was rejected at PGCon 2 years back. Please refer to that in the below mail: http://www.postgresql.org/message-id/bf2827dcce55594c8d7a8f7ffd3ab7713dde1...@szxeml508-mbx.china.huawei.com As far as I can see, you never came back with a different solution. Yeah, right. So for this I tried to search the archived mails to get the details about the discussion, but I could not find anything regarding the design. So I am not sure how I shall make my solution different from the earlier one, as the earlier solution is not accessible to me. Any help regarding this would be really great. Also, from the current autonomous transaction discussion thread (including http://www.postgresql.org/message-id/ca+u5nmkeum4abrqbndlyt5ledektae8rbiyw3977yhmeowq...@mail.gmail.com), I have summarized all the important feedback below, along with the resolutions suggested:

1. Pavel Stehule (07-04-2014): -1 for Oracle syntax - it is hardly inconsistent with Postgres. Changed the syntax to “START AUTONOMOUS TRANSACTION”.

2. Pavan (10-04-2014): Making autonomous transaction properties independent of the main transaction. Made all properties of the autonomous transaction (including read-only) independent from the main transaction except the isolation level, which I did not find very useful to keep different. But if others' opinion is different, we can make this property independent as well.

3. Alvaro Herrera (09-04-2014): Autonomous transactions should have their own separate proc entry. It was concluded not to have a separate proc entry for autonomous transactions.

4. Tom Lane (09-04-2014): The point being that you need to change both pg_subtrans and pg_clog to make that state transition. This is handled for autonomous transactions.

5. Robert Haas (09-04-2014): Not in favour of the current design related to maintaining a lockmask for autonomous transactions. I had replied to this mail explaining why this design was kept, but if the design for this part is still not acceptable, then I can rework it to make it better. In order to do so I would be very happy to have more discussion to get concrete feedback and direction to improve this.

6. Tom Lane (09-04-2014): No justification for distinguishing normal and autonomous transactions at this level (locking level). I had also replied to this earlier. The reason for distinguishing at this level is to handle any kind of deadlock possibility between the main and autonomous transactions. Deadlock handling between the main and autonomous transaction was one of the requirements discussed at PGCon 2012 as part of the autonomous transaction discussion. Please let me know if I am missing something here.

All of the above mentioned changes are included in the latest patch shared. Please let me know if I have missed any other important points from the earlier discussion; I would like to address those as well. Have you checked the discussion in the developer meeting notes?
Please check the same at the below link: http://wiki.postgresql.org/wiki/PgCon_2012_Developer_Meeting#Autonomous_Transactions From the discussion, I am able to make out two important points:

1. The main transaction and the autonomous transaction should be independent and can conflict. This is already included in our latest patch.

2. Utility commands like VACUUM and CREATE INDEX CONCURRENTLY should be able to work from an autonomous transaction. Both of the above mentioned utility commands are not supported even inside the main transaction, so they do not work within an autonomous transaction either. Any opinion about this?

Please let me know if I have missed any points from the link given. So I wanted to have this patch in the commitfest application, so that we can have a healthy discussion and rectify all the issues. But now I see that this patch has already been moved to the rejected category, which will put a brake on further review. I believe ideally this patch should have been marked as Returned with feedback, as the feedback was given long back and no different solution ever came back. Since this patch is very big and complex, it is better that we continue discussing from the first CommitFest itself, so that we can get ample time to share everyone’s opinion and then address all possible issues. Any opinions/suggestions are welcome. Also let me know if I have missed something. Thanks and Regards, Kumar Rajeev Rastogi
Re: [HACKERS] WAL replay bugs
Michael Paquier michael.paqu...@gmail.com writes: On Wed, Jul 2, 2014 at 5:32 PM, Kyotaro HORIGUCHI horiguchi.kyot...@lab.ntt.co.jp wrote: Perhaps this is showing that there is no tidy or generalized way to tell what a page is used for. Many of the modules which have their own page format have a magic value, and the values seem to be selected carefully. But there is no one-for-all method to retrieve that. You have a point here. Yeah, it's a bit messy, but I believe it's currently always possible to tell which access method a PG page belongs to. Look at pg_filedump. The last couple times we added index access methods, we took pains to make sure pg_filedump could figure out what their pages were. (IIRC, it's a combination of the special-space size and contents, but I'm too tired to go check the details right now.) For gin, I'll investigate if it is possible to add an identifier like GIN_PAGE_ID; it would make the page analysis more consistent with the others. I am not sure what the 8 bytes allocated for the special area are used for now, though. There is exactly zero chance that anyone will accept an on-disk format change just to make this prettier. regards, tom lane
Re: [HACKERS] WAL replay bugs
On Thu, Jul 3, 2014 at 3:38 PM, Tom Lane t...@sss.pgh.pa.us wrote: Michael Paquier michael.paqu...@gmail.com writes: On Wed, Jul 2, 2014 at 5:32 PM, Kyotaro HORIGUCHI horiguchi.kyot...@lab.ntt.co.jp wrote: Perhaps this is showing that there is no tidy or generalized way to tell what a page is used for. Many of the modules which have their own page format have a magic value, and the values seem to be selected carefully. But there is no one-for-all method to retrieve that. You have a point here. Yeah, it's a bit messy, but I believe it's currently always possible to tell which access method a PG page belongs to. Look at pg_filedump. The last couple times we added index access methods, we took pains to make sure pg_filedump could figure out what their pages were. (IIRC, it's a combination of the special-space size and contents, but I'm too tired to go check the details right now.) Yes, that's what the current code does btw, in this *messy* way. For gin, I'll investigate if it is possible to add an identifier like GIN_PAGE_ID; it would make the page analysis more consistent with the others. I am not sure what the 8 bytes allocated for the special area are used for now, though. There is exactly zero chance that anyone will accept an on-disk format change just to make this prettier. Yeah, thought so. -- Michael
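For readers following along: the special-space trick Tom alludes to is that several index access methods stamp a distinctive 16-bit page id into the last two bytes of the special area, which is how pg_filedump-style tools classify pages. A minimal sketch of the idea (the constants are quoted from memory of the backend headers, so treat them as assumptions; btree and gin need further checks on the special-space size and flags):

#include <stdint.h>
#include <stdio.h>

#define HASHO_PAGE_ID  0xFF80
#define GIST_PAGE_ID   0xFF81
#define SPGIST_PAGE_ID 0xFF82

/* Classify a page from the 16-bit id stored at the end of its special area. */
static const char *
identify_by_page_id(uint16_t page_id)
{
    switch (page_id)
    {
        case HASHO_PAGE_ID:  return "hash";
        case GIST_PAGE_ID:   return "gist";
        case SPGIST_PAGE_ID: return "spgist";
        default:             return "unknown (btree/gin need further checks)";
    }
}

int
main(void)
{
    printf("%s\n", identify_by_page_id(0xFF81));   /* gist */
    return 0;
}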
Re: [HACKERS] Autonomous Transaction (WIP)
On Thu, Jul 3, 2014 at 12:03 PM, Rajeev rastogi rajeev.rast...@huawei.com wrote: On 01 July 2014 12:00, Amit Kapila Wrote: Simon has mentioned that exactly this idea was rejected at PGCon 2 years back. Please refer to that in the below mail: http://www.postgresql.org/message-id/bf2827dcce55594c8d7a8f7ffd3ab7713dde1...@szxeml508-mbx.china.huawei.com As far as I can see, you never came back with a different solution. Yeah, right. So for this I tried to search the archived mails to get the details about the discussion, but I could not find anything regarding the design. So I am not sure how I shall make my solution different from the earlier one, as the earlier solution is not accessible to me. I haven't read your idea/patch in any detail, so I can't comment on whether it is good or bad. However, I think if one of the committers has already mentioned that the same idea has been rejected previously, then it makes little sense to further review or update the patch unless you know the reason for the rejection and handle it in an acceptable way. Now, as far as I can understand, the problem seems to be in the way you have defined autonomous transaction storage, which can lead to consumption of additional client slots; this is just what I could make out from the above mail, but I am not completely sure on this matter. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] gaussian distribution pgbench
Hello Gavin, decile percents: 69.9% 21.0% 6.3% 1.9% 0.6% 0.2% 0.1% 0.0% 0.0% 0.0% probability of first/last percent of the range: 11.3% 0.0% I would suggest that probabilities should NEVER be expressed in percentages! As a percentage, a probability looks weird, and is never used for serious statistical work - in my experience at least. I think probabilities should be expressed in the range 0 ... 1 - i.e. 0.35 rather than 35%. I could agree about the mathematics, but ISTM that 11.5% is more readable and intuitive than 0.115. I could change probability and replace it with frequency or maybe occurrence; what would you think about that? -- Fabien.
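As an aside, the quoted figures are consistent with a truncated exponential over the unit range with a threshold of about 12 (my inference from the output, not a value stated in the thread). A small sketch that reproduces them:

#include <math.h>
#include <stdio.h>

int
main(void)
{
    double t = 12.0;                 /* assumed threshold */
    double total = 1.0 - exp(-t);    /* normalization for the truncation */
    int    i;

    printf("decile percents:");
    for (i = 0; i < 10; i++)
        printf(" %.1f%%",
               100.0 * (exp(-t * i / 10.0) - exp(-t * (i + 1) / 10.0)) / total);
    printf("\nprobability of first/last percent of the range: %.1f%% %.1f%%\n",
           100.0 * (1.0 - exp(-t * 0.01)) / total,
           100.0 * (exp(-t * 0.99) - exp(-t)) / total);
    return 0;
}

Compiled with -lm, this prints 69.9% 21.0% 6.3% ... and 11.3% 0.0%, matching the output quoted above.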
[HACKERS] Setting PG-version without recompiling
Hi. I'm up for testing 9.4 but my JDBC-driver fails to connect due to PG's minor-version string: 4beta1. Is it possible to set this somewhere without recompiling PG? Thanks. -- Andreas Joseph Krogh CTO / Partner - Visena AS Mobile: +47 909 56 963 andr...@visena.com www.visena.com
Re: [HACKERS] RETURNING PRIMARY KEY syntax extension
On Thu, Jul 3, 2014 at 6:56 AM, Ian Barwick i...@2ndquadrant.com wrote: On 01/07/14 21:00, Rushabh Lathia wrote: I spent some more time on the patch and here are my review comments. .) Patch gets applied through patch -p1 (git apply fails) .) Trailing whitespace in the patch at various places Not sure where you see this, unless it's in the tests, where it's required. Please ignore the whitespace comment. Don't know how, but somehow that was due to my local changes in the branch. .) Unnecessary new line + and - in the patch. (src/backend/rewrite/rewriteManip.c::getInsertSelectQuery()) (src/include/rewrite/rewriteManip.h) Fixed. .) Patch has proper test coverage and regression runs cleanly. .) In map_primary_key_to_list() the patch uses the INDEX_ATTR_BITMAP_IDENTITY_KEY bitmap to get the keycols. In IndexAttrBitmapKind there is also INDEX_ATTR_BITMAP_KEY, so I was confused between INDEX_ATTR_BITMAP_KEY and INDEX_ATTR_BITMAP_IDENTITY_KEY, and also haven't found related comments in the code. Later, with the use of a test case and debugging, I confirmed that INDEX_ATTR_BITMAP_IDENTITY_KEY only returns the primary key. The revised patch version (see other mail) fixes this by introducing INDEX_ATTR_BITMAP_PRIMARY_KEY. Looks good. .) At present the patch throws an error when RETURNING PRIMARY KEY is used on a table having no primary key. If I am not wrong, JDBC will be using this in getGeneratedKeys(). Currently this feature is implemented in the JDBC driver by appending RETURNING * to the supplied statement. With this implementation it will replace RETURNING * with RETURNING PRIMARY KEY, right? So I am just wondering what JDBC expects getGeneratedKeys() to return when the query doesn't have a primary key and the user called executeUpdate() with Statement.RETURN_GENERATED_KEYS. I looked at the JDBC specification but it's not clear what it will return when the table doesn't have keys. Can you please let us know your findings on this? The spec [*] is indeed frustratingly vague: The method Statement.getGeneratedKeys, which can be called to retrieve the generated value, returns a ResultSet object with a column for each automatically generated value. The methods execute, executeUpdate or Connection.prepareStatement accept an optional parameter, which can be used to indicate that any auto generated values should be returned when the statement is executed or prepared. [*] http://download.oracle.com/otn-pub/jcp/jdbc-4_1-mrel-spec/jdbc4.1-fr-spec.pdf I understand this to mean that no rows will be returned if no auto-generated values are present. As-is, the patch will raise an error if the target table does not have a primary key, which makes sense from the point of view of the proposed syntax, but which will make it impossible for the JDBC driver to implement the above understanding of the spec (i.e. return nothing if no primary key exists). The basic reason for this patch is to allow JDBC to use this syntax for getGeneratedKeys(). Now, because this patch throws an error when the table doesn't have a primary key, the patch won't be useful for the getGeneratedKeys() implementation. It would be simple enough not to raise an error in this case, but that means the query would effectively be failing silently, and I don't think that's desirable behaviour. Yes, not raising an error wouldn't be desirable behavior. A better solution would be to have an optional IF EXISTS clause: RETURNING PRIMARY KEY [ IF EXISTS ] which would be easy enough to implement. Thoughts? Hmm, I liked the idea of adding [ IF EXISTS ].
Looking at the grammar, so far we have the [ IF EXISTS ] option only for DDL commands (mainly DROP); not sure whether it's ok to use such syntax for DML commands. Others, please share your thoughts/comments. Ian Barwick -- Ian Barwick http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Rushabh Lathia
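For illustration, this is how a libpq client could exercise the proposed syntax; it assumes a server built with the patch, and the table and column names are made up:

#include <stdio.h>
#include <libpq-fe.h>

int
main(void)
{
    PGconn   *conn = PQconnectdb("dbname=postgres");
    PGresult *res;

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "%s", PQerrorMessage(conn));
        return 1;
    }

    /* hypothetical table with a serial primary key "id" */
    res = PQexec(conn,
                 "INSERT INTO demo_tbl (val) VALUES (42) RETURNING PRIMARY KEY");
    if (PQresultStatus(res) == PGRES_TUPLES_OK)
        printf("generated key: %s\n", PQgetvalue(res, 0, 0));
    else
        fprintf(stderr, "%s", PQerrorMessage(conn));  /* e.g. table has no primary key */

    PQclear(res);
    PQfinish(conn);
    return 0;
}

Under the IF EXISTS variant discussed above, the error branch would instead see zero returned rows.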
Re: [HACKERS] Setting PG-version without recompiling
At 2014-07-03 11:02:34 +0200, andr...@visena.com wrote: Is it possible to set this somewhere without recompiling PG? I'm afraid not. -- Abhijit
Re: [HACKERS] Setting PG-version without recompiling
On Thursday 3 July 2014 at 11:13:44, Abhijit Menon-Sen a...@2ndquadrant.com wrote: At 2014-07-03 11:02:34 +0200, andr...@visena.com wrote: Is it possible to set this somewhere without recompiling PG? I'm afraid not. Ok -- Andreas Joseph Krogh CTO / Partner - Visena AS Mobile: +47 909 56 963 andr...@visena.com www.visena.com
Re: [HACKERS] gaussian distribution pgbench
On 03/07/14 20:58, Fabien COELHO wrote: Hello Gavin, decile percents: 69.9% 21.0% 6.3% 1.9% 0.6% 0.2% 0.1% 0.0% 0.0% 0.0% probability of first/last percent of the range: 11.3% 0.0% I would suggest that probabilities should NEVER be expressed in percentages! As a percentage, a probability looks weird, and is never used for serious statistical work - in my experience at least. I think probabilities should be expressed in the range 0 ... 1 - i.e. 0.35 rather than 35%. I could agree about the mathematics, but ISTM that 11.5% is more readable and intuitive than 0.115. I could change probability and replace it with frequency or maybe occurrence; what would you think about that? You may well be hitting a situation where you meet opposition whatever you do! :-) frequency implies a positive integer (though relative frequency might be okay) - and if you use occurrence, someone else is bound to complain... Though if %'s are used and you can't use values in the range 0 ... 1 for probabilities, I'd opt for relative frequency - so long as it does not generate a flame war. I suspect it may not be worth the grief to change. Cheers, Gavin
Re: [HACKERS] Perfomance degradation 9.3 (vs 9.2) for FreeBSD
Hi, Hi, Attached you can find a short (compile tested only ) patch implementing a 'shared_memory_type' GUC, akin to 'dynamic_shared_memory_type'. Will only apply to 9.4, not 9.3, but it should be easy to convert for it. Greetings, Andres Freund I have rebased Andres's patch against 9.3-STABLE tree. Please take a look at attached patches and let me know if you find anything strange. I am going to test the patch on a huge HP machine: DL980G7/64 cores/2TB mem. With the patch I would be able to report back if using SysV shared mem fixes the 9.3 performance problem. Best regards, -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese:http://www.sraoss.co.jp diff --git a/src/backend/port/sysv_shmem.c b/src/backend/port/sysv_shmem.c index f746c81..e82054a 100644 --- a/src/backend/port/sysv_shmem.c +++ b/src/backend/port/sysv_shmem.c @@ -72,6 +72,7 @@ static void IpcMemoryDetach(int status, Datum shmaddr); static void IpcMemoryDelete(int status, Datum shmId); static PGShmemHeader *PGSharedMemoryAttach(IpcMemoryKey key, IpcMemoryId *shmid); +static void *CreateAnonymousSegment(Size *size); /* @@ -389,49 +390,19 @@ PGSharedMemoryCreate(Size size, bool makePrivate, int port) * developer use, this shouldn't be a big problem. */ #ifndef EXEC_BACKEND + if (shared_memory_type == SHMEM_TYPE_MMAP) { - long pagesize = sysconf(_SC_PAGE_SIZE); - - /* - * Ensure request size is a multiple of pagesize. - * - * pagesize will, for practical purposes, always be a power of two. - * But just in case it isn't, we do it this way instead of using - * TYPEALIGN(). - */ - if (pagesize 0 size % pagesize != 0) - size += pagesize - (size % pagesize); - - /* - * We assume that no one will attempt to run PostgreSQL 9.3 or later - * on systems that are ancient enough that anonymous shared memory is - * not supported, such as pre-2.4 versions of Linux. If that turns - * out to be false, we might need to add a run-time test here and do - * this only if the running kernel supports it. - */ - AnonymousShmem = mmap(NULL, size, PROT_READ | PROT_WRITE, PG_MMAP_FLAGS, - -1, 0); - if (AnonymousShmem == MAP_FAILED) - { - int saved_errno = errno; - - ereport(FATAL, - (errmsg(could not map anonymous shared memory: %m), - (saved_errno == ENOMEM) ? -errhint(This error usually means that PostgreSQL's request - for a shared memory segment exceeded available memory - or swap space. To reduce the request size (currently - %lu bytes), reduce PostgreSQL's shared memory usage, - perhaps by reducing shared_buffers or - max_connections., - (unsigned long) size) : 0)); - } + AnonymousShmem = CreateAnonymousSegment(size); AnonymousShmemSize = size; - /* Now we need only allocate a minimal-sized SysV shmem block. */ sysvsize = sizeof(PGShmemHeader); } + else #endif + { + Assert(shared_memory_type == SHMEM_TYPE_SYSV); + sysvsize = size; + } /* Make sure PGSharedMemoryAttach doesn't fail without need */ UsedShmemSegAddr = NULL; @@ -631,3 +602,47 @@ PGSharedMemoryAttach(IpcMemoryKey key, IpcMemoryId *shmid) return hdr; } + +/* + * Creates an anonymous mmap()ed shared memory segment. + * + * Pass the requested size in *size. This function will modify *size to the + * actual size of the allocation, if it ends up allocating a segment that is + * larger than requested. + */ +#ifndef EXEC_BACKEND +static void * +CreateAnonymousSegment(Size *size) +{ + Size allocsize = *size; + void *ptr = MAP_FAILED; + int mmap_errno = 0; + + /* + * use the original size, not the rounded up value, when falling back + * to non-huge pages. 
+ */ + allocsize = *size; + ptr = mmap(NULL, allocsize, PROT_READ | PROT_WRITE, + PG_MMAP_FLAGS, -1, 0); + mmap_errno = errno; + + if (ptr == MAP_FAILED) + { + errno = mmap_errno; + ereport(FATAL, +(errmsg(could not map anonymous shared memory: %m), + (mmap_errno == ENOMEM) ? + errhint(This error usually means that PostgreSQL's request + for a shared memory segment exceeded available memory, + swap space or huge pages. To reduce the request size + (currently %zu bytes), reduce PostgreSQL's shared + memory usage, perhaps by reducing shared_buffers or + max_connections., + *size) : 0)); + } + + *size = allocsize; + return ptr; +} +#endif diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c index 918ac51..51dccdc 100644 --- a/src/backend/storage/ipc/ipci.c +++ b/src/backend/storage/ipc/ipci.c @@ -39,6 +39,8 @@ #include storage/sinvaladt.h #include storage/spin.h +/* GUCs */ +int shared_memory_type = DEFAULT_SHARED_MEMORY_TYPE; shmem_startup_hook_type shmem_startup_hook = NULL; diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index 2b6527f..2945a68 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -61,6 +61,7
Re: [HACKERS] WAL replay bugs
Hello, here are additional comments for the other parts. I haven't seen any significant defect in the code so far.

= bufcapt.c:

- buffer_capture_remember() or so: Pages of unlogged tables avoid being written, taking advantage of the fact that the LSN for such pages stays 0/0. I'd like to see a comment mentioning this in, say, buffer_capture_is_changed() or buffer_capture_forget() or somewhere.

- buffer_capture_forget(): Although this error is not likely to occur, in the error message could not find image..., the buffer id seems to bring no information. The LSN, or the quadruplet of LSN, rnode, forknum and blockno, seems more informative.

- buffer_capture_is_changed(): The name of the function seems misleading. What this function does is compare LSNs between the stored page image and the current page. lsn_is_changed(BufferImage) or something like that would be clearer.

= bufmgr.c:

- ConditionalLockBuffer(): Sorry for a trivial comment, but the variable 'res' conceals the meaning. acquired seems preferable, doesn't it?

- LockBuffer(): There is an 'XXX' comment. The discussion written there seems right to me. If you mind that peeking into the lock internals is bad behavior, adding a macro into lwlock.h would help:p

lwlock.h: #define LWLockHoldedExclusive(l) ((l)->exclusive > 0)
lwlock.h: #define LWLockHoldedShared(l) ((l)->shared > 0)

# I don't see the second one being used so much..

bufmgr.c: if (LWLockHoldedExclusive(buf->content_lock))

If there isn't any particular concern, the 'XXX:' should be removed.

= bufpage.c:

- #include storage/bufcapt.h: The include seems to be unnecessary.

= bufcapt.h:

- Header comment: The file name is misspelled as 'bufcaptr.h'. Is a copyright notice of UC needed? (Other files are the same.)

- The includes in this header, except for buf.h, seem not to be necessary.

= init_env.sh:

- pg_get_test_port(): It determines the server port using PG_VERSION_NUM, but is that necessary? In addition, the theoretical maximum initial PGPORT seems to be 65535, but this script searches for a free port up to a number larger by 16 from the start, which would be beyond that edge.

- pg_get_test_port(): It stops with an error after 16 failures, but the message complains only about the final attempt. If you want to mention the port numbers, it should perhaps be 'port $PGPORTSTART-$PGPORT ..', or 'All 16 ports attempted failed' or something.

regards, -- Kyotaro Horiguchi NTT Open Source Software Center
Re: [HACKERS] [PATCH] introduce XLogLockBlockRangeForCleanup()
FYI, I've attached a patch that does what you suggested. I haven't done anything else (i.e. testing) with it yet. -- Abhijit diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c index 5f9fc49..dc90c02 100644 --- a/src/backend/access/nbtree/nbtxlog.c +++ b/src/backend/access/nbtree/nbtxlog.c @@ -501,33 +501,9 @@ btree_xlog_vacuum(XLogRecPtr lsn, XLogRecord *record) * here during crash recovery. */ if (HotStandbyActiveInReplay()) - { - BlockNumber blkno; - - for (blkno = xlrec-lastBlockVacuumed + 1; blkno xlrec-block; blkno++) - { - /* - * We use RBM_NORMAL_NO_LOG mode because it's not an error - * condition to see all-zero pages. The original btvacuumpage - * scan would have skipped over all-zero pages, noting them in FSM - * but not bothering to initialize them just yet; so we mustn't - * throw an error here. (We could skip acquiring the cleanup lock - * if PageIsNew, but it's probably not worth the cycles to test.) - * - * XXX we don't actually need to read the block, we just need to - * confirm it is unpinned. If we had a special call into the - * buffer manager we could optimise this so that if the block is - * not in shared_buffers we confirm it as unpinned. - */ - buffer = XLogReadBufferExtended(xlrec-node, MAIN_FORKNUM, blkno, - RBM_NORMAL_NO_LOG); - if (BufferIsValid(buffer)) - { -LockBufferForCleanup(buffer); -UnlockReleaseBuffer(buffer); - } - } - } + XLogLockBlockRangeForCleanup(xlrec-node, MAIN_FORKNUM, + xlrec-lastBlockVacuumed + 1, + xlrec-block); /* * If we have a full-page image, restore it (using a cleanup lock) and diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c index b7829ff..d84f83a 100644 --- a/src/backend/access/transam/xlogutils.c +++ b/src/backend/access/transam/xlogutils.c @@ -287,10 +287,6 @@ XLogReadBuffer(RelFileNode rnode, BlockNumber blkno, bool init) * * In RBM_ZERO and RBM_ZERO_ON_ERROR modes, if the page doesn't exist, the * relation is extended with all-zeroes pages up to the given block number. - * - * In RBM_NORMAL_NO_LOG mode, we return InvalidBuffer if the page doesn't - * exist, and we don't check for all-zeroes. Thus, no log entry is made - * to imply that the page should be dropped or truncated later. */ Buffer XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum, @@ -331,8 +327,6 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum, log_invalid_page(rnode, forknum, blkno, false); return InvalidBuffer; } - if (mode == RBM_NORMAL_NO_LOG) - return InvalidBuffer; /* OK to extend the file */ /* we do this in recovery only - no rel-extension lock needed */ Assert(InRecovery); @@ -375,6 +369,62 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum, return buffer; } +/* + * XLogBlockRangeForCleanup is used in Hot Standby mode to emulate the + * locking effects imposed by VACUUM for nbtrees. + */ +void +XLogLockBlockRangeForCleanup(RelFileNode rnode, ForkNumber forkNum, + BlockNumber startBlkno, + BlockNumber uptoBlkno) +{ + BlockNumber blkno; + BlockNumber lastblock; + BlockNumber endingBlkno; + Buffer buffer; + SMgrRelation smgr; + + Assert(startBlkno != P_NEW); + Assert(uptoBlkno != P_NEW); + + /* Open the relation at smgr level */ + smgr = smgropen(rnode, InvalidBackendId); + + /* + * Create the target file if it doesn't already exist. This lets us cope + * if the replay sequence contains writes to a relation that is later + * deleted. 
(The original coding of this routine would instead suppress + * the writes, but that seems like it risks losing valuable data if the + * filesystem loses an inode during a crash. Better to write the data + * until we are actually told to delete the file.) + */ + smgrcreate(smgr, forkNum, true); + + lastblock = smgrnblocks(smgr, forkNum); + + endingBlkno = uptoBlkno; + if (lastblock endingBlkno) + endingBlkno = lastblock; + + for (blkno = startBlkno; blkno endingBlkno; blkno++) + { + /* + * All we need to do here is prove that we can lock each block + * with a cleanup lock. It's not an error to see all-zero pages + * here because the original btvacuumpage would not have thrown + * an error either. + * + * We don't actually need to read the block, we just need to + * confirm it is unpinned. + */ + buffer = GetBufferWithoutRelcache(rnode, forkNum, blkno); + if (BufferIsValid(buffer)) + { + LockBufferForCleanup(buffer); + UnlockReleaseBuffer(buffer); + } + } +} /* * Struct actually returned by XLogFakeRelcacheEntry, though the declared diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c index 07ea665..d3c7eb7 100644 --- a/src/backend/storage/buffer/bufmgr.c +++ b/src/backend/storage/buffer/bufmgr.c @@ -222,8 +222,6 @@ ReadBuffer(Relation reln, BlockNumber blockNum) * current physical EOF; that is
Re: [HACKERS] [PATCH] introduce XLogLockBlockRangeForCleanup()
On 3 July 2014 06:45, Amit Khandekar amit.khande...@enterprisedb.com wrote: In GetBufferWithoutRelcache(), I was wondering, rather than calling PinBuffer(), if we do this: LockBufHdr(buf); PinBuffer_Locked(buf); valid = ((buf->flags & BM_VALID) != 0); then we can avoid having the new buffer access strategy BAS_DISCARD that is introduced in this patch. And so the code changes in freelist.c would not be necessary. That looks like a good idea, thanks. I think we should say this though:

LockBufHdr(buf);
valid = ((buf->flags & BM_VALID) != 0);
if (valid)
    PinBuffer_Locked(buf);
else
    UnlockBufHdr(buf);

since otherwise we would access the buffer flags without the spinlock, and we would leak a pin if the buffer was not valid. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
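To see why the reordering matters, here is a toy model (stub types and stub locks, not backend code) of the sequence Simon describes: the flags are read only while the header lock is held, and an invalid buffer is never pinned, so no pin can leak:

#include <stdbool.h>
#include <stdio.h>

#define BM_VALID (1 << 0)

typedef struct
{
    int  flags;
    int  refcount;
    bool hdr_locked;
} BufDesc;

/* Stubs standing in for the backend's buffer-header spinlock primitives. */
static void LockBufHdr(BufDesc *buf)   { buf->hdr_locked = true; }
static void UnlockBufHdr(BufDesc *buf) { buf->hdr_locked = false; }

/* Stub PinBuffer_Locked: pins, then releases the header lock. */
static void
PinBuffer_Locked(BufDesc *buf)
{
    buf->refcount++;
    UnlockBufHdr(buf);
}

static bool
pin_if_valid(BufDesc *buf)
{
    bool valid;

    LockBufHdr(buf);
    valid = ((buf->flags & BM_VALID) != 0);  /* flags read under the lock */
    if (valid)
        PinBuffer_Locked(buf);               /* releases the header lock */
    else
        UnlockBufHdr(buf);                   /* no pin taken, none leaked */
    return valid;
}

int
main(void)
{
    BufDesc buf = {BM_VALID, 0, false};
    bool    ok = pin_if_valid(&buf);

    printf("valid=%d refcount=%d\n", ok, buf.refcount);  /* valid=1 refcount=1 */
    return 0;
}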
Re: [HACKERS] gaussian distribution pgbench
On Wed, Jul 2, 2014 at 6:05 PM, Fabien COELHO coe...@cri.ensmp.fr wrote: Hello Mitsumasa-san, And I'm also interested in your decile percents output, like the following: decile percents: 39.6% 24.0% 14.6% 8.8% 5.4% 3.3% 2.0% 1.2% 0.7% 0.4% Sure, I'm really fine with that. I think that it is easier than before. The sum of the decile percents is just 100%. That's a good property:-) However, I don't prefer the highest/lowest percentage because it will be confused with the decile percentage by users, and no one can understand these digits. I could not understand 4.9%, 0.0% when I saw them the first time. Then I checked the source code and understood it:( It's not a good design... #Why does this parameter use 100? What else? People have ten fingers and like powers of 10, and are used to percents? So I'd like to remove it, if you like. It will be simpler. I think that for the exponential distribution it helps, especially for a high threshold, to have the lowest/highest percent density. For low thresholds, the decile is also definitely useful. So I'm fine with both outputs as you have put them. I have just updated the wording so that it may be clearer: decile percents: 69.9% 21.0% 6.3% 1.9% 0.6% 0.2% 0.1% 0.0% 0.0% 0.0% probability of first/last percent of the range: 11.3% 0.0% Attached patch is a fixed version; please confirm it. Attached is a v15 which just fixes a typo and the above wording update. I'm validating it for committers. #Of course, the World Cup is being held now. I'm not in a hurry at all. I'm not a soccer kind of person, so it does not influence my availability.:-) Suggested commit message: Add drawing random integers with a Gaussian or truncated exponential distribution to pgbench. Test variants with these distributions are also provided and triggered with options --gaussian=... and --exponential= IIRC we've not reached consensus about whether we should support such options in pgbench. Several hackers disagreed with supporting them. OTOH, we've almost reached consensus on supporting the gaussian and exponential options in \setrandom. So I think that you should separate those two features into two patches, and we should apply the \setrandom one first. Then we can discuss whether the other patch should be applied or not. Regards, -- Fujii Masao
Re: [HACKERS] gaussian distribution pgbench
On 2014-07-03 21:27:53 +0900, Fujii Masao wrote: Add drawing random integers with a Gaussian or truncated exponential distribution to pgbench. Test variants with these distributions are also provided and triggered with options --gaussian=... and --exponential= IIRC we've not reached consensus about whether we should support such options in pgbench. Several hackers disagreed with supporting them. Yea. I certainly disagree with the patch in its current state because it copies the same 15 lines several times with a two-word difference. Independent of whether we want those options, I don't think that's going to fly. OTOH, we've almost reached consensus on supporting the gaussian and exponential options in \setrandom. So I think that you should separate those two features into two patches, and we should apply the \setrandom one first. Then we can discuss whether the other patch should be applied or not. Sounds like a good plan. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Should extension--unpackaged-$ver.sql's \echo use FROM unpackaged;?
On Thu, Jul 3, 2014 at 4:26 AM, Andres Freund and...@2ndquadrant.com wrote: Hi, Fujii noticed that I'd used \echo Use CREATE EXTENSION pg_buffercache to load this file. \quit in an extension upgrade script. That's obviously wrong, because it should say ALTER EXTENSION. The reason it's that way is that I copied the line from the unpackaged--1.0.sql file. I noticed all unpackaged--$version transition files miss the FROM unpackaged in that echo. I guess that's just an oversight which we should correct? Maybe, for user-friendliness. But I think that fix is not important enough to backpatch. Regards, -- Fujii Masao
Re: [HACKERS] docs: additional subsection for page-level locks in explicit-locking section
On Thu, Jul 3, 2014 at 4:34 AM, Michael Banck michael.ba...@credativ.de wrote: Hi, While reading through the Explicit Locking section of the manual today, I felt the last paragraph of section 13.3.2 (Row-level Locks) might merit its own subsection. It talks about page-level locks as distinct from table- and row-level locks. Then again, it is just one paragraph, so maybe this was deliberate and/or rejected before (though I couldn't find prior discussion off-hand). Proposed patch attached. This seems to make sense. Barring objection, I will commit this only in HEAD. Regards, -- Fujii Masao
Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins
On Tue, Jul 1, 2014 at 4:54 AM, Tomas Vondra t...@fuzzy.cz wrote: On 30.6.2014 23:12, Tomas Vondra wrote: Hi, attached is v5 of the patch. The main change is that scaling the number of buckets is done only once, after the initial hash table is built. The main advantage of this is the lower price. This also allowed some cleanup of unnecessary code. However, this new patch causes warnings like this: WARNING: temporary file leak: File 231 still referenced I've been eyeballing the code for a while now, but I still fail to see where this comes from :-( Any ideas? Meh, the patches are wrong - I hadn't realized the tight coupling between buckets/batches in ExecHashGetBucketAndBatch: *bucketno = hashvalue & (nbuckets - 1); *batchno = (hashvalue >> hashtable->log2_nbuckets) & (nbatch - 1); The previous patches worked mostly by pure luck (the nbuckets was set only in the first batch), but once I moved the code to the end, it started failing. And by worked I mean didn't throw an error, but probably returned bogus results with (nbatch > 1). However, ISTM this nbuckets-nbatch coupling is not really necessary; it's only constructed this way to assign independent batch/bucket bits. So I went and changed the code like this: *bucketno = hashvalue & (nbuckets - 1); *batchno = (hashvalue >> (32 - hashtable->log2_nbatch)); I.e. using the top bits for batchno, low bits for bucketno (as before). Hopefully I got it right this time. At least it seems to be working for cases that failed before (no file leaks, proper rowcounts so far). Hi, I had a look at the patch and I was wondering if there is a way we can set a cap on the resized size, and instead increase the number of batches? Since this patch evaluates the growth of tuples vs. the increase of space, we could look at increasing the number of batches itself instead of the number of buckets, if the resized number of buckets exceeds a threshold. It shouldn't be too hard; AIUI it would involve a call in MultiExecHash right after the new cost evaluation the patch adds. It isn't a target of the current patch, but I think it is a logical extension to it. Thoughts/Comments? Regards, Atri -- Regards, Atri *l'apprenant*
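A standalone sketch of the revised derivation may make the bit layout easier to see: the low bits pick the bucket and the topmost log2(nbatch) bits pick the batch, so the two no longer overlap and nbuckets can grow independently of nbatch. Names mirror the PostgreSQL ones, but this is not the actual executor code:

#include <stdint.h>
#include <stdio.h>

/* nbuckets and nbatch must both be powers of two. */
static void
get_bucket_and_batch(uint32_t hashvalue, uint32_t nbuckets, int log2_nbatch,
                     uint32_t *bucketno, uint32_t *batchno)
{
    *bucketno = hashvalue & (nbuckets - 1);        /* low bits */
    *batchno = (log2_nbatch == 0) ? 0
        : (hashvalue >> (32 - log2_nbatch));       /* top bits */
}

int
main(void)
{
    uint32_t bucketno, batchno;

    get_bucket_and_batch(0xDEADBEEF, 1024, 2, &bucketno, &batchno);
    printf("bucket=%u batch=%u\n", bucketno, batchno);
    return 0;
}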
Re: [HACKERS] [v9.5] Custom Plan API
Kaigai-san, Sorry for the lagged response. Here are my random thoughts about the patch. I couldn't understand the patch fully, because some of the APIs are not used by ctidscan.

Custom Scan patch v2 review

* Custom plan class comparison
In backend/optimizer/util/pathnode.c, custclass is compared by bit-and with 's'. Do you plan to use custclass as a bit field? If so, values for the custom plan class should not be a character. Otherwise, custclass should be compared with the == operator.

* Purpose of GetSpecialCustomVar()
The reason why the FinalizeCustomPlan callback is necessary is not clear to me. Could you show a case where the API would be useful?

* Purpose of FinalizeCustomPlan()
The reason why the FinalizeCustomPlan callback is necessary is not clear to me, because ctidscan just calls finalize_primnode() and bms_add_members() with the given information. Could you show a case where the API would be useful?

* Is it ok to call set_cheapest() for all relkinds?
Now set_cheapest() is called not only for relations and foreign tables but also for custom plans, and other relations such as subqueries, functions, and values. Calling call_custom_scan_provider() and set_cheapest() in the case of RTE_RELATION seems similar to the old construct; what do you think about this?

* Is it hard to get rid of CopyCustomPlan()?
Copying a ForeignScan node doesn't need a per-FDW copy function, by limiting fdw_private to hold only copyable objects. Can't we use the same approach for CustomPlan? Letting authors call NodeSetTag or copyObject() sounds uncomfortable to me. This would apply to TextOutCustomPlan() and TextOutCustomPath() too.

* Is MultiExec support appropriate for the first version?
The cases needing MultiExec seem a little complex for the first version of custom scan. What kind of plan do you have in mind for this feature?

* Does SupportBackwardScan() have enough information?
Other scans check the target list with TargetListSupportsBackwardScan(). Isn't it necessary to check it for CustomPlan too in ExecSupportsBackwardScan()?

* Place to call the custom plan provider
Is it necessary to call the provider when relkind != RELKIND_RELATION? If yes, isn't it necessary to call it for append relations? I know that we are concentrating on replacing scans in the initial version, so it would not be a serious problem, but it would be good to consider an extensible design.

* Custom Plan Provider is addpath?
Passing an addpath handler as the only attribute of CUSTOM PLAN PROVIDER seems a little odd. Would using a handler like FDW does make the design too complex and/or messy?

* Superclass of CustomPlanState
CustomPlanState derives from ScanState, instead of deriving from PlanState directly. I worry about the case of non-heap-scan custom plans, but it might be ok to postpone consideration of that for the first cut.

* Naming and granularity of objects related to custom plan
I'm not sure the current naming is appropriate, especially the difference between custom plan, provider, and handler. In the context of the CREATE CUSTOM PLAN statement, what does the term custom plan mean? My impression is that a custom plan is an alternative plan type, e.g. ctidscan or pg_strom_scan. Then what does the term provider mean? My impression for that is an extension, such as ctidscan or pg_strom. The grammar allows users to pass a function via the PROVIDER clause of CREATE CUSTOM SCAN, so the function would be the provider of the custom plan created by the statement.

* enable_customscan
A GUC parameter enable_customscan would be useful for users who want to disable the custom plan feature temporarily.
In the case of pg_strom, using the GPU for limited sessions for analytic or batch applications seems handy.

* Adding a pg_custom_plan catalog
Using cust as the prefix for pg_custom_plan causes an ambiguity which makes it difficult to choose a catalog prefix for a feature named Custom Foo in the future. How about using cusp (CUStom Plan)? Or is it better to use pg_custom_plan_provider as the catalog relation name, as the document says that CREATE CUSTOM PLAN defines a custom plan provider? Then the prefix could be cpp (Custom Plan Provider). This seems to match the wording used for pg_foreign_data_wrapper.

* CREATE CUSTOM PLAN statement
This is just a question: we need to emit CREATE CUSTOM PLAN if we want to use I wonder how it is extended when supporting join as a custom class.

* New operators for TID comparison
IMO this portion should be a separate patch, because it adds OID definitions of existing operators such as tidgt and tidle. Is there any (explicit or implicit) rule about defining a macro for the OID of an operator?

* Prototype of get_custom_plan_oid()
custname (or cppname if we use the rule I proposed above) seems appropriate as the parameter name of get_custom_plan_oid(), because similar functions use catalog column names in such cases.

* Coding conventions
Some lines are indented with whitespace. Will a future pgindent run fix this issue?

* Unnecessary struct forward declarations
Forward declarations of CustomPathMethods, Plan, and CustomPlan in
Re: [HACKERS] Proposing pg_hibernate
Amit Kapila amit.kapil...@gmail.com wrote: Overall I agree that following Robert's idea will increase the time to bring the database server up and reach a state where apps can connect and start operations, I agree that warming the cache before beginning to apply WAL would be best. but I think at least with such an approach we can claim that after warming buffers with pg_hibernator, apps will always have performance greater than or equal to the case when there is no pg_hibernator. I would be careful with this claim in a NUMA environment. The whole reason I investigated NUMA issues and wrote the NUMA patch is that a customer had terrible performance which turned out to be caused by cache warming using a single connection (and thus a single process, and thus a single NUMA memory node). To ensure that this doesn't trash performance on machines with many cores and memory nodes, it would be best to try to spread the data around among nodes. One simple way of doing this would be to find some way to use multiple processes, in hopes that they would run on different CPU packages. -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
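As a rough illustration of the multi-process idea (not pg_hibernator code; the worker count and data file are placeholders), each forked child below reads a disjoint slice of a file, so the page-cache pages it faults in tend to be allocated on the memory node of whichever CPU runs that child:

#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/wait.h>
#include <unistd.h>

#define NWORKERS 4   /* placeholder; ideally one worker per NUMA node */

int
main(int argc, char **argv)
{
    struct stat st;
    int         i;

    if (argc != 2 || stat(argv[1], &st) != 0)
    {
        fprintf(stderr, "usage: %s datafile\n", argv[0]);
        return 1;
    }

    for (i = 0; i < NWORKERS; i++)
    {
        if (fork() == 0)
        {
            /* each child warms one slice of the file */
            char  buf[8192];
            off_t slice = (st.st_size + NWORKERS - 1) / NWORKERS;
            off_t off = (off_t) i * slice;
            off_t end = off + slice;
            int   fd = open(argv[1], O_RDONLY);

            if (fd < 0)
                _exit(1);
            while (off < end && pread(fd, buf, sizeof(buf), off) > 0)
                off += sizeof(buf);
            _exit(0);
        }
    }
    while (wait(NULL) > 0)
        ;
    return 0;
}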
Re: [HACKERS] Setting PG-version without recompiling
Andreas Joseph Krogh andr...@visena.com writes: Hi. I'm up for testing 9.4 but my JDBC-driver fails to connect due to PG's minor-version string: 4beta1. Is it possible to set this somewhere without recompiling PG? No, and even if you could, that would be the wrong approach. The right approach is to fix the JDBC driver to not complain about such version strings. I'm a bit surprised they haven't done so long since, considering how long PG beta versions have been tagged like that. For that matter, they really ought not complain about strings like 9.5devel or 9.5alpha2 either. regards, tom lane
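A minimal sketch (not the actual driver code) of the tolerant parsing Tom is suggesting: take the leading integer of each dot-separated component and ignore trailing suffixes such as beta1, devel, or alpha2:

#include <stdio.h>
#include <stdlib.h>

static void
parse_server_version(const char *s, int *major, int *minor)
{
    char *end;

    /* strtol stops at the first non-digit, so suffixes are ignored */
    *major = (int) strtol(s, &end, 10);
    *minor = (*end == '.') ? (int) strtol(end + 1, NULL, 10) : 0;
}

int
main(void)
{
    int maj, min;

    parse_server_version("9.4beta1", &maj, &min);
    printf("%d.%d\n", maj, min);   /* 9.4 */
    parse_server_version("9.5devel", &maj, &min);
    printf("%d.%d\n", maj, min);   /* 9.5 */
    return 0;
}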
Re: [HACKERS] RETURNING PRIMARY KEY syntax extension
Rushabh Lathia rushabh.lat...@gmail.com writes: On Thu, Jul 3, 2014 at 6:56 AM, Ian Barwick i...@2ndquadrant.com wrote: A better solution would be to have an optional IF EXISTS clause: RETURNING PRIMARY KEY [ IF EXISTS ] Hmm, I liked the idea of adding [ IF EXISTS ]. Looking at the grammar, so far we have the [ IF EXISTS ] option only for DDL commands (mainly DROP); not sure whether it's ok to use such syntax for DML commands. Others, please share your thoughts/comments. TBH, I thought that RETURNING PRIMARY KEY was already fairly iffy in that the application would have little idea what it was getting back. IF EXISTS would make it so spongy as to be useless, IMO. It sounds to me like designing this for JDBC's getGeneratedKeys method is a mistake. There isn't going to be any way that the driver can support that without having looked at the table's metadata for itself, and if it's going to do that then it doesn't need a crutch that only partially solves the problem anyhow. regards, tom lane
Re: [HACKERS] docs: additional subsection for page-level locks in explicit-locking section
Fujii Masao masao.fu...@gmail.com wrote: This seems to make sense. Barring objection, I will commit this only in HEAD. I'm inclined to think this is a slight improvement, just for the sake of consistency with peer level information. You probably already noticed, but the patch as submitted neglects to close the prior sect2 block before opening the new one. -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Setting PG-version without recompiling
On Thursday 3 July 2014 at 16:16:01, Tom Lane t...@sss.pgh.pa.us wrote: Andreas Joseph Krogh andr...@visena.com writes: Hi. I'm up for testing 9.4 but my JDBC-driver fails to connect due to PG's minor-version string: 4beta1. Is it possible to set this somewhere without recompiling PG? No, and even if you could, that would be the wrong approach. The right approach is to fix the JDBC driver to not complain about such version strings. I'm a bit surprised they haven't done so long since, considering how long PG beta versions have been tagged like that. For that matter, they really ought not complain about strings like 9.5devel or 9.5alpha2 either. regards, tom lane I'm using the pgjdbc-ng driver; I'm sure the official driver handles this. The driver will be fixed (the issue has a pull-request), I just wondered if there was a magic knob I could use until the PR is merged. -- Andreas Joseph Krogh CTO / Partner - Visena AS Mobile: +47 909 56 963 andr...@visena.com www.visena.com
Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins
On 1.7.2014 01:24, Tomas Vondra wrote: On 30.6.2014 23:12, Tomas Vondra wrote: Hi, Hopefully I got it right this time. At least it seems to be working for cases that failed before (no file leaks, proper rowcounts so far). Attached v7, fixing nbatch/ntuples in an assert. regards Tomas diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c index 0d9663c..db3a953 100644 --- a/src/backend/commands/explain.c +++ b/src/backend/commands/explain.c @@ -1900,18 +1900,20 @@ show_hash_info(HashState *hashstate, ExplainState *es) if (es-format != EXPLAIN_FORMAT_TEXT) { ExplainPropertyLong(Hash Buckets, hashtable-nbuckets, es); + ExplainPropertyLong(Original Hash Buckets, +hashtable-nbuckets_original, es); ExplainPropertyLong(Hash Batches, hashtable-nbatch, es); ExplainPropertyLong(Original Hash Batches, hashtable-nbatch_original, es); ExplainPropertyLong(Peak Memory Usage, spacePeakKb, es); } - else if (hashtable-nbatch_original != hashtable-nbatch) + else if ((hashtable-nbatch_original != hashtable-nbatch) || (hashtable-nbuckets_original != hashtable-nbuckets)) { appendStringInfoSpaces(es-str, es-indent * 2); appendStringInfo(es-str, - Buckets: %d Batches: %d (originally %d) Memory Usage: %ldkB\n, - hashtable-nbuckets, hashtable-nbatch, - hashtable-nbatch_original, spacePeakKb); + Buckets: %d (originally %d) Batches: %d (originally %d) Memory Usage: %ldkB\n, + hashtable-nbuckets, hashtable-nbuckets_original, + hashtable-nbatch, hashtable-nbatch_original, spacePeakKb); } else { diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c index 589b2f1..53642d1 100644 --- a/src/backend/executor/nodeHash.c +++ b/src/backend/executor/nodeHash.c @@ -37,8 +37,10 @@ #include utils/lsyscache.h #include utils/syscache.h +bool enable_hashjoin_bucket = true; static void ExecHashIncreaseNumBatches(HashJoinTable hashtable); +static void ExecHashIncreaseNumBuckets(HashJoinTable hashtable); static void ExecHashBuildSkewHash(HashJoinTable hashtable, Hash *node, int mcvsToUse); static void ExecHashSkewTableInsert(HashJoinTable hashtable, @@ -48,6 +50,33 @@ static void ExecHashSkewTableInsert(HashJoinTable hashtable, static void ExecHashRemoveNextSkewBucket(HashJoinTable hashtable); +/* + * Compute appropriate size for hashtable given the estimated size of the + * relation to be hashed (number of rows and average row width). + * + * This is exported so that the planner's costsize.c can use it. + */ + +/* Target bucket loading (tuples per bucket) */ +#define NTUP_PER_BUCKET 10 + +/* Multiple of NTUP_PER_BUCKET triggering the increase of nbuckets. + * + * Once we reach the threshold we double the number of buckets, and we + * want to get 1.0 on average (to get NTUP_PER_BUCKET on average). That + * means these two equations should hold + * + * b = 2a (growth) + * (a + b)/2 = 1 (oscillate around NTUP_PER_BUCKET) + * + * which means b=1. (a = b/2). If we wanted higher threshold, we + * could grow the nbuckets to (4*nbuckets), thus using (b=4a) for + * growth, leading to (b=1.6). Or (b=8a) giving 1. etc. + * + * Let's start with doubling the bucket count, i.e. 1.333. 
*/ +#define NTUP_GROW_COEFFICIENT 1.333 +#define NTUP_GROW_THRESHOLD (NTUP_PER_BUCKET * NTUP_GROW_COEFFICIENT) + /* * ExecHash * @@ -116,6 +145,7 @@ MultiExecHash(HashState *node) /* It's a skew tuple, so put it into that hash table */ ExecHashSkewTableInsert(hashtable, slot, hashvalue, bucketNumber); +hashtable-skewTuples += 1; } else { @@ -126,6 +156,23 @@ MultiExecHash(HashState *node) } } + /* If average number of tuples per bucket is over the defined threshold, + * increase the number of buckets to get below it. */ + if (enable_hashjoin_bucket) { + + /* consider only tuples in the non-skew buckets */ + double nonSkewTuples = (hashtable-totalTuples - hashtable-skewTuples); + + if ((nonSkewTuples / hashtable-nbatch) (hashtable-nbuckets * NTUP_GROW_THRESHOLD)) { + +#ifdef HJDEBUG + printf(Increasing nbucket to %d (average per bucket = %.1f)\n, + nbuckets, (batchTuples / hashtable-nbuckets)); +#endif + ExecHashIncreaseNumBuckets(hashtable); + } + } + /* must provide our own instrumentation support */ if (node-ps.instrument) InstrStopNode(node-ps.instrument, hashtable-totalTuples); @@ -239,6 +286,7 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls) int nbatch; int num_skew_mcvs; int log2_nbuckets; + int log2_nbatch; int nkeys; int i; ListCell *ho; @@ -263,6 +311,10 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls) log2_nbuckets = my_log2(nbuckets); Assert(nbuckets == (1 log2_nbuckets)); + /* nbatch must be a power of 2 */ + log2_nbatch = my_log2(nbatch); + Assert(nbatch
Re: [HACKERS] tweaking NTUP_PER_BUCKET
On 3.7.2014 02:13, Tomas Vondra wrote: Hi, while hacking on the 'dynamic nbucket' patch, scheduled for the next CF (https://commitfest.postgresql.org/action/patch_view?id=1494) I was repeatedly stumbling over NTUP_PER_BUCKET. I'd like to propose a change in how we handle it. TL;DR; version -- I propose dynamic increase of the nbuckets (up to NTUP_PER_BUCKET=1) once the table is built and there's free space in work_mem. The patch mentioned above makes implementing this possible / rather simple. Attached is v1 of this experimental patch. It's supposed to be applied on top of v7 of this patch http://www.postgresql.org/message-id/53b59498.3000...@fuzzy.cz as it shared most of the implementation. So apply it like this: patch -p1 hashjoin-nbuckets-growth-v7.patch patch -p1 hashjoin-dynamic-ntup-v1.patch It implements the ideas outlined in the previous message, most importantly: (a) decreases NTUP_PER_BUCKET to 4 (b) checks free work_mem when deciding whether to add batch (c) after building the batches, increases the number of buckets as much as possible, i.e. up to the number of batch tuples, but not exceeding work_mem The improvements for the queries I tried so far are quite significant (partially due to the NTUP_PER_BUCKET decrease, partially due to the additional bucket count increase). The rebuild is quite fast - the patch measures and reports this as a WARNING, and the timings I've seen are ~12ms per 7MB (i.e. ~120ms for 70MB and ~1200ms for 700MB). Of course, this only makes sense when compared to how much time it saved, but for the queries I tested so far this was a good investment. However it's likely there are queries where this may not be the case, i.e. where rebuilding the hash table is not worth it. Let me know if you can construct such query (I wasn't). regards Tomas diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c index 53642d1..a4623dc 100644 --- a/src/backend/executor/nodeHash.c +++ b/src/backend/executor/nodeHash.c @@ -58,7 +58,7 @@ static void ExecHashRemoveNextSkewBucket(HashJoinTable hashtable); */ /* Target bucket loading (tuples per bucket) */ -#define NTUP_PER_BUCKET 10 +#define NTUP_PER_BUCKET 4 /* Multiple of NTUP_PER_BUCKET triggering the increase of nbuckets. * @@ -77,6 +77,8 @@ static void ExecHashRemoveNextSkewBucket(HashJoinTable hashtable); #define NTUP_GROW_COEFFICIENT 1.333 #define NTUP_GROW_THRESHOLD (NTUP_PER_BUCKET * NTUP_GROW_COEFFICIENT) +#define SPACE_USED(hashtable, nbuckets) ((hashtable)-spaceUsed + (nbuckets) * sizeof(void*)) + /* * ExecHash * @@ -156,21 +158,33 @@ MultiExecHash(HashState *node) } } - /* If average number of tuples per bucket is over the defined threshold, - * increase the number of buckets to get below it. */ +/* + * Consider resizing the hash table (number of buckets) for better + * lookup performance. The code in ExecHashTableInsert guarantees + * we have enough memory to reach NTUP_PER_BUCKET, but maybe we can + * do better - getting lower number of tuples per bucket (up to + * NTUP_PER_BUCKET=1). 
+	 */
 	if (enable_hashjoin_bucket) {

-		/* consider only tuples in the non-skew buckets */
-		double nonSkewTuples = (hashtable->totalTuples - hashtable->skewTuples);
-
-		if ((nonSkewTuples / hashtable->nbatch) > (hashtable->nbuckets * NTUP_GROW_THRESHOLD)) {
+		instr_time	start_time, end_time;

 #ifdef HJDEBUG
-		printf("Increasing nbucket to %d (average per bucket = %.1f)\n",
-			   nbuckets, (batchTuples / hashtable->nbuckets));
+		printf("Increasing nbucket to %d (average per bucket = %.1f)\n",
+			   nbuckets, (batchTuples / hashtable->nbuckets));
 #endif
-			ExecHashIncreaseNumBuckets(hashtable);
-		}
+
+		elog(WARNING, "hash resize (start) : nbuckets=%d", hashtable->nbuckets);
+
+		INSTR_TIME_SET_CURRENT(start_time);
+
+		ExecHashIncreaseNumBuckets(hashtable);
+
+		INSTR_TIME_SET_CURRENT(end_time);
+		INSTR_TIME_SUBTRACT(end_time, start_time);
+
+		elog(WARNING, "hash resize (end) : nbuckets=%d duration=%.3f",
+			 hashtable->nbuckets, INSTR_TIME_GET_MILLISEC(end_time));
 	}

 	/* must provide our own instrumentation support */
@@ -738,35 +752,34 @@ ExecHashIncreaseNumBuckets(HashJoinTable hashtable)
 	int			oldnbuckets = hashtable->nbuckets;
 	HashJoinTuple *oldbuckets = hashtable->buckets;
 	MemoryContext oldcxt;
-	double		batchTuples = (hashtable->totalTuples / hashtable->nbatch);
+
+	/* average number of tuples per batch */
+	double		batchTuples = (hashtable->totalTuples - hashtable->skewTuples) / hashtable->nbatch;
+
+	/* memory available for buckets */
+	Size		freeMemory = (hashtable->spaceAllowed - hashtable->spaceUsed);

 	/*
-	 * Determine the proper number of buckets, i.e. stop once the average
-	 * per bucket gets below the threshold (1.33 * NTUP_PER_BUCKET).
-
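[Editor's note: as a rough illustration of the sizing rule the patch describes - grow the bucket array toward one tuple per bucket, but never beyond what the remaining work_mem budget can pay for, given that each bucket costs sizeof(void *) - here is a self-contained sketch. It is not code from the patch; the function and variable names are made up for the illustration.]

	#include <stddef.h>

	/* Round n up to the next power of two (sketch; assumes n > 0). */
	static size_t
	next_pow2(size_t n)
	{
		size_t		p = 1;

		while (p < n)
			p <<= 1;
		return p;
	}

	/*
	 * Choose a new bucket count: aim for ~1 tuple per bucket (the
	 * NTUP_PER_BUCKET=1 target mentioned above), capped by the number of
	 * bucket pointers the free work_mem budget can hold. Only ever grows.
	 */
	static size_t
	choose_nbuckets(double batchTuples, size_t freeMemory, size_t nbuckets)
	{
		size_t		ideal = next_pow2((size_t) batchTuples);
		size_t		maxbuckets = freeMemory / sizeof(void *);

		/* largest power of two that still fits in the memory budget */
		size_t		affordable = (maxbuckets > 0) ? next_pow2(maxbuckets + 1) / 2 : 1;

		if (ideal > affordable)
			ideal = affordable;
		return (ideal > nbuckets) ? ideal : nbuckets;
	}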
Re: [HACKERS] tweaking NTUP_PER_BUCKET
Tomas, * Tomas Vondra (t...@fuzzy.cz) wrote: However it's likely there are queries where this may not be the case, i.e. where rebuilding the hash table is not worth it. Let me know if you can construct such a query (I wasn't able to). Thanks for working on this! I've been thinking on this for a while and this seems like it may be a good approach. Have you considered a bloom filter over the buckets..? Also, I'd suggest you check the archives from about this time last year for test cases that I was using which showed cases where hashing the larger table was a better choice- those same cases may also show regression here (or at least would be something good to test). Have you tried to work out what a 'worst case' regression for this change would look like? Also, how does the planning around this change? Are we more likely now to hash the smaller table (I'd guess 'yes' just based on the reduction in NTUP_PER_BUCKET, but did you make any changes due to the rehashing cost?)? Thanks, Stephen
Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins
On 3.7.2014 15:42, Atri Sharma wrote: On Tue, Jul 1, 2014 at 4:54 AM, Tomas Vondra t...@fuzzy.cz mailto:t...@fuzzy.cz wrote: On 30.6.2014 23:12, Tomas Vondra wrote: Hi, attached is v5 of the patch. The main change is that scaling the number of buckets is done only once, after the initial hash table is built. The main advantage of this is lower price. This also allowed some cleanup of unnecessary code. However, this new patch causes warnings like this:

WARNING: temporary file leak: File 231 still referenced

I've been eyeballing the code for a while now, but I still fail to see where this comes from :-( Any ideas? Meh, the patches are wrong - I hadn't realized the tight coupling between buckets/batches in ExecHashGetBucketAndBatch:

*bucketno = hashvalue & (nbuckets - 1);
*batchno = (hashvalue >> hashtable->log2_nbuckets) & (nbatch - 1);

The previous patches worked mostly by pure luck (the nbuckets was set only in the first batch), but once I moved the code to the end, it started failing. And by worked I mean didn't throw an error, but probably returned bogus results with (nbatch > 1). However, ISTM this nbuckets-nbatch coupling is not really necessary, it's only constructed this way to assign independent batch/bucket bits. So I went and changed the code like this:

*bucketno = hashvalue & (nbuckets - 1);
*batchno = (hashvalue >> (32 - hashtable->log2_nbatch));

I.e. using the top bits for batchno, low bits for bucketno (as before). Hopefully I got it right this time. At least it seems to be working for cases that failed before (no file leaks, proper rowcounts so far).

Hi, I had a look at the patch and I was wondering if there is a way we can set a cap on the resized size, and increase the number of batches instead? Since this patch evaluates the growth of tuples vs increase of space, we could look at increasing the number of batches itself instead of the number of buckets, if the resized number of buckets exceeds a threshold. It shouldn't be too hard; AIUI it would involve a call in MultiExecHash right after the new cost evaluation the patch adds.

So you propose to fill the hash table, and when hitting NTUP_PER_BUCKET (i.e. after adding 'nbuckets * NTUP_PER_BUCKET' tuples) increase the number of batches? Hmmm, that'd be easy to implement. I don't think it's possible to do that in MultiExecHash (that's too late). But it's a matter of changing this condition in ExecHashTableInsert:

if (hashtable->spaceUsed > hashtable->spaceAllowed)
    ExecHashIncreaseNumBatches(hashtable);

to something like this

if ((hashtable->spaceUsed > hashtable->spaceAllowed) ||
    (hashtable->totalTuples / hashtable->nbatch == NTUP_PER_BUCKET * hashtable->nbuckets))
    ExecHashIncreaseNumBatches(hashtable);

But I don't think increasing the number of batches is a good approach, as it means more rescans. I think using memory is more efficient (after all, that's what work_mem is for, right?).

regards Tomas -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
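[Editor's note: to make the two bit layouts above concrete, here is a small self-contained sketch of the old and new schemes (an illustration, not code from the patch). With the new layout, batchno comes from the top bits of the hash value, so growing nbuckets never changes a tuple's batch assignment - exactly the property the resize-after-build approach needs.]

	#include <stdint.h>

	/* Old scheme: batchno taken from the bits just above the bucket bits,
	 * so changing nbuckets silently changes batch assignment. */
	static void
	old_split(uint32_t hashvalue, int nbuckets, int log2_nbuckets, int nbatch,
			  int *bucketno, int *batchno)
	{
		*bucketno = hashvalue & (nbuckets - 1);
		*batchno = (hashvalue >> log2_nbuckets) & (nbatch - 1);
	}

	/* New scheme: batchno from the top bits, bucketno from the low bits;
	 * nbuckets can grow without moving tuples between batches.
	 * (log2_nbatch == 0 means a single batch; shifting by 32 would be
	 * undefined behavior, so handle that case explicitly.) */
	static void
	new_split(uint32_t hashvalue, int nbuckets, int log2_nbatch,
			  int *bucketno, int *batchno)
	{
		*bucketno = hashvalue & (nbuckets - 1);
		*batchno = (log2_nbatch == 0) ? 0 : (int) (hashvalue >> (32 - log2_nbatch));
	}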
Re: [HACKERS] tweaking NTUP_PER_BUCKET
On Thu, Jul 3, 2014 at 11:40 PM, Stephen Frost sfr...@snowman.net wrote: Tomas, * Tomas Vondra (t...@fuzzy.cz) wrote: However it's likely there are queries where this may not be the case, i.e. where rebuilding the hash table is not worth it. Let me know if you can construct such a query (I wasn't able to). Thanks for working on this! I've been thinking on this for a while and this seems like it may be a good approach. Have you considered a bloom filter? IIRC, last time we tried doing bloom filters, I was short of real-world, useful hash functions that we could use for building the bloom filter. If we are restarting experiments on this, I would be glad to assist. Regards, Atri
Re: [HACKERS] tweaking NTUP_PER_BUCKET
Hi Stephen, On 3.7.2014 20:10, Stephen Frost wrote: Tomas, * Tomas Vondra (t...@fuzzy.cz) wrote: However it's likely there are queries where this may not be the case, i.e. where rebuilding the hash table is not worth it. Let me know if you can construct such a query (I wasn't able to). Thanks for working on this! I've been thinking on this for a while and this seems like it may be a good approach. Have you considered a bloom filter over the buckets..? Also, I'd suggest you check the

I know you've experimented with it, but I haven't looked into that yet.

archives from about this time last year for test cases that I was using which showed cases where hashing the larger table was a better choice- those same cases may also show regression here (or at least would be something good to test).

Good idea, I'll look at the test cases - thanks.

Have you tried to work out what a 'worst case' regression for this change would look like? Also, how does the planning around this change? Are we more likely now to hash the smaller table (I'd guess 'yes' just based on the reduction in NTUP_PER_BUCKET, but did you make any changes due to the rehashing cost?)?

The case I was thinking about is underestimated cardinality of the inner table and a small outer table. That'd lead to a large hash table and very few lookups (thus making the rehash inefficient). I.e. something like this:

Hash Join
  ->  Seq Scan on small_table (rows=100) (actual rows=100)
  ->  Hash
        ->  Seq Scan on bad_estimate (rows=100) (actual rows=10)
              Filter: ((a < 100) AND (b < 100))

But I wasn't able to reproduce this reasonably, because in practice that'd lead to a nested loop or something like that (which is a planning issue, impossible to fix in hashjoin code).

Tomas -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] tweaking NTUP_PER_BUCKET
On Thu, Jul 3, 2014 at 11:40 AM, Atri Sharma atri.j...@gmail.com wrote: IIRC, last time we tried doing bloom filters, I was short of real-world, useful hash functions that we could use for building the bloom filter. Last time, we wanted to use bloom filters in hash joins to filter out tuples that won't match any of the future hash batches, to reduce the amount of tuples that need to be spilled to disk. However the problem was that it was unclear, for a given amount of memory usage, how to pick the right size of bloom filter and how to model how much it would save versus how much it would cost in reduced hash table size. I think it just required some good empirical tests and hash-join-heavy workloads to come up with some reasonable guesses. We don't need a perfect model, just some reasonable bloom filter size that we're pretty sure will usually help more than it hurts. -- greg -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
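[Editor's note: for context on the sizing question raised here, the textbook Bloom filter formulas relate memory to false-positive rate. This is a sketch of that standard math, not anything from a PostgreSQL patch: m = -n * ln(p) / (ln 2)^2 bits and k = (m/n) * ln 2 hash functions for n entries at false-positive rate p.]

	#include <math.h>

	/* Textbook Bloom filter sizing for n expected entries and target
	 * false-positive rate p (0 < p < 1). */
	static void
	bloom_size(double n, double p, double *m_bits, int *k_hashes)
	{
		*m_bits = -n * log(p) / (M_LN2 * M_LN2);
		*k_hashes = (int) round((*m_bits / n) * M_LN2);
	}

For example, n = 1e6 entries at p = 0.01 works out to roughly 9.6 bits per entry and k = 7 hash functions, which is the kind of memory-versus-benefit tradeoff being discussed.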
Re: [HACKERS] RETURNING PRIMARY KEY syntax extension
On Thu, Jul 3, 2014 at 7:37 AM, Tom Lane t...@sss.pgh.pa.us wrote: TBH, I thought that RETURNING PRIMARY KEY was already fairly iffy in that the application would have little idea what it was getting back. IF EXISTS would make it so spongy as to be useless, IMO. Seems easy enough to look at the count of columns in the result set. But it seems like noise words -- if you don't put IF EXISTS then surely you'll get the same behaviour anyway, no? Fwiw, when I wrote ORM-like layers I used to describe the output of the query, including sometimes issuing WHERE 1<0 queries and looking at the output column types when I needed that before executing the query. Using table metadata would have required a much more in-depth understanding of how the database worked and also wouldn't handle expressions, joins, set operations, etc. -- greg -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
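[Editor's note: the describe trick Greg mentions works because a provably false predicate returns zero rows while still reporting the result set's column names and types. An illustrative example with a hypothetical table:]

	-- Returns no rows, but the result-set metadata still describes
	-- every output column, including expressions and joins:
	SELECT * FROM some_table WHERE 1 < 0;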
Re: [HACKERS] Aggregate function API versus grouping sets
Tom == Tom Lane t...@sss.pgh.pa.us writes:

 Tom If we're going to do that, I think it needs to be in 9.4. Are
 Tom you going to work up a patch?

How's this? (applies and passes regression on 9.4 and master)

-- Andrew (irc:RhodiumToad)

diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 09ff035..0388735 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -2199,44 +2199,56 @@ AggGetAggref(FunctionCallInfo fcinfo)
 }

 /*
- * AggGetPerTupleEContext - fetch per-input-tuple ExprContext
+ * AggGetTempMemoryContext - fetch short-term memory context
  *
- * This is useful in agg final functions; the econtext returned is the
- * same per-tuple context that the transfn was called in (which can
- * safely get reset during the final function).
+ * This is useful in agg final functions; the context returned is one that the
+ * final function can safely reset as desired. This isn't useful for transition
+ * functions, since the context returned MAY (we don't promise) be the same as
+ * the context those are called in.
  *
  * As above, this is currently not useful for aggs called as window functions.
  */
-ExprContext *
-AggGetPerTupleEContext(FunctionCallInfo fcinfo)
+MemoryContext
+AggGetTempMemoryContext(FunctionCallInfo fcinfo)
 {
 	if (fcinfo->context && IsA(fcinfo->context, AggState))
 	{
 		AggState   *aggstate = (AggState *) fcinfo->context;

-		return aggstate->tmpcontext;
+		return aggstate->tmpcontext->ecxt_per_tuple_memory;
 	}
 	return NULL;
 }

 /*
- * AggGetPerAggEContext - fetch per-output-tuple ExprContext
+ * AggRegisterCallback - register a cleanup callback linked to aggcontext
  *
  * This is useful for aggs to register shutdown callbacks, which will ensure
- * that non-memory resources are freed.
+ * that non-memory resources are freed. These callbacks will occur just before
+ * the associated aggcontext (as returned by AggCheckCallContext) is reset,
+ * either between groups or as a result of rescanning the query. They will NOT
+ * be called on error paths. The primary intent is to allow for freeing of
+ * tuplestores or tuplesorts maintained in aggcontext, or pins held by slots
+ * created by the agg functions. (They will not be called until after the
+ * result of the finalfn is no longer needed, so it's safe for the finalfn to
+ * return data that will be freed by the callback.)
 *
 * As above, this is currently not useful for aggs called as window functions.
 */
-ExprContext *
-AggGetPerAggEContext(FunctionCallInfo fcinfo)
+void
+AggRegisterCallback(FunctionCallInfo fcinfo,
+					ExprContextCallbackFunction func,
+					Datum arg)
 {
 	if (fcinfo->context && IsA(fcinfo->context, AggState))
 	{
 		AggState   *aggstate = (AggState *) fcinfo->context;

-		return aggstate->ss.ps.ps_ExprContext;
+		RegisterExprContextCallback(aggstate->ss.ps.ps_ExprContext, func, arg);
+
+		return;
 	}
-	return NULL;
+	elog(ERROR, "Aggregate function cannot register a callback in this context");
 }
diff --git a/src/backend/utils/adt/orderedsetaggs.c b/src/backend/utils/adt/orderedsetaggs.c
index efb0411..d116a63 100644
--- a/src/backend/utils/adt/orderedsetaggs.c
+++ b/src/backend/utils/adt/orderedsetaggs.c
@@ -49,8 +49,6 @@ typedef struct OSAPerQueryState
 	MemoryContext qcontext;
 	/* Memory context containing per-group data: */
 	MemoryContext gcontext;
-	/* Agg plan node's output econtext: */
-	ExprContext *peraggecontext;

 	/* These fields are used only when accumulating tuples: */

@@ -117,7 +115,6 @@ ordered_set_startup(FunctionCallInfo fcinfo, bool use_tuples)
 	Aggref	   *aggref;
 	MemoryContext qcontext;
 	MemoryContext gcontext;
-	ExprContext *peraggecontext;
 	List	   *sortlist;
 	int			numSortCols;

@@ -133,10 +130,6 @@ ordered_set_startup(FunctionCallInfo fcinfo, bool use_tuples)
 		elog(ERROR, "ordered-set aggregate called in non-aggregate context");
 	if (!AGGKIND_IS_ORDERED_SET(aggref->aggkind))
 		elog(ERROR, "ordered-set aggregate support function called for non-ordered-set aggregate");
-	/* Also get output exprcontext so we can register shutdown callback */
-	peraggecontext = AggGetPerAggEContext(fcinfo);
-	if (!peraggecontext)
-		elog(ERROR, "ordered-set aggregate called in non-aggregate context");

 	/*
 	 * Prepare per-query structures in the fn_mcxt, which we assume is the
@@ -150,7 +143,6 @@ ordered_set_startup(FunctionCallInfo fcinfo, bool use_tuples)
 	qstate->aggref = aggref;
 	qstate->qcontext = qcontext;
 	qstate->gcontext = gcontext;
-	qstate->peraggecontext = peraggecontext;

 	/* Extract the sort information */
 	sortlist = aggref->aggorder;
@@ -291,9 +283,9 @@ ordered_set_startup(FunctionCallInfo fcinfo, bool use_tuples)
 	osastate->number_of_rows = 0;

 	/* Now register a shutdown callback to clean things up */
-	RegisterExprContextCallback(qstate->peraggecontext,
-								ordered_set_shutdown,
-								PointerGetDatum(osastate));
+	AggRegisterCallback(fcinfo,
+						ordered_set_shutdown,
+
[HACKERS] Pg_upgrade and toast tables bug discovered
There have been periodic reports of pg_upgrade errors related to toast tables. The most recent one was from May of this year: http://www.postgresql.org/message-id/flat/20140520202223.gb3...@momjian.us#20140520202223.gb3...@momjian.us

The error was:

Copying user relation files /var/lib/postgresql/8.4/main/base/4275487/4278965
Mismatch of relation OID in database "FNBooking": old OID 4279499, new OID 19792
Failure, exiting

and the fix is to add a dummy TEXT column to the table on the old cluster to force a toast table, then drop the dummy column. I have had trouble getting a table schema that is causing problems, but received a report via EDB support recently that had a simple schema (anonymized):

CREATE TABLE pg_upgrade_toast_test (
    x1 numeric(15,0),
    x2 numeric(15,0),
    x3 character varying(15),
    x4 character varying(60),
    x5 numeric(15,0),
    x6 numeric(15,0),
    x7 character varying(15),
    x8 character varying(60),
    x9 numeric(15,0),
    x10 character varying(15),
    x11 character varying(60),
    x12 numeric(15,0),
    x13 numeric(15,0),
    x14 character varying(15),
    x15 character varying(60),
    x16 numeric(15,0),
    x17 character varying(15),
    x18 character varying(60),
    x19 numeric(15,0),
    x20 character varying(15),
    x21 character varying(60)
) WITH (OIDS=FALSE);

needs_toast_table() computes the length of this table as 2024 bytes in 9.0, and 2064 bytes on 9.1, with the TOAST threshold being 2032 bytes. It turns out it is this commit that causes the difference:

commit 97f38001acc61449f7ac09c539ccc29e40fecd26
Author: Robert Haas rh...@postgresql.org
Date: Wed Aug 4 17:33:09 2010 +

    Fix numeric_maximum_size() calculation.

    The old computation can sometimes underestimate the necessary space
    by 2 bytes; however we're not back-patching this, because this result
    isn't used for anything critical. Per discussion with Tom Lane, make
    the typmod test in this function match the ones in numeric() and
    apply_typmod() exactly.

It seems the impact of this patch on pg_upgrade wasn't considered, or even realized until now. Suggestions on a fix? My initial idea is to allow for toast tables in the new cluster that aren't in the old cluster by skipping over the extra toast tables. This would only be for pre-9.1 old clusters. It would not involve adding toast tables to the old cluster, as pg_upgrade never modifies the old cluster. We already handle cases where the old cluster had toast tables and the new cluster wouldn't ordinarily have them. -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
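[Editor's note: the workaround described above can be scripted on the old cluster before running pg_upgrade. Adding any varlena column such as TEXT forces creation of a toast table, and the toast table remains after the column is dropped. Shown with the example table; the column name is arbitrary:]

	ALTER TABLE pg_upgrade_toast_test ADD COLUMN dummy_toast_forcer TEXT;
	ALTER TABLE pg_upgrade_toast_test DROP COLUMN dummy_toast_forcer;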
Re: [HACKERS] Aggregate function API versus grouping sets
Here are two more, cumulative with the previous patch and each other: The first removes the assumption in ordered_set_startup that the aggcontext can be cached in fn_extra, and puts it in the transvalue instead. The second exposes the OSA* structures in a header file, so that user-defined ordered-set aggs can use ordered_set_transition[_multi] directly rather than having to cargo-cult it into their own code.

-- Andrew (irc:RhodiumToad)

diff --git a/src/backend/utils/adt/orderedsetaggs.c b/src/backend/utils/adt/orderedsetaggs.c
index d116a63..92fa9f3 100644
--- a/src/backend/utils/adt/orderedsetaggs.c
+++ b/src/backend/utils/adt/orderedsetaggs.c
@@ -47,8 +47,6 @@ typedef struct OSAPerQueryState
 	Aggref	   *aggref;
 	/* Memory context containing this struct and other per-query data: */
 	MemoryContext qcontext;
-	/* Memory context containing per-group data: */
-	MemoryContext gcontext;

 	/* These fields are used only when accumulating tuples: */

@@ -86,6 +84,8 @@ typedef struct OSAPerGroupState
 {
 	/* Link to the per-query state for this aggregate: */
 	OSAPerQueryState *qstate;
+	/* MemoryContext for per-group data */
+	MemoryContext gcontext;
 	/* Sort object we're accumulating data in: */
 	Tuplesortstate *sortstate;
 	/* Number of normal rows inserted into sortstate: */
@@ -104,6 +104,15 @@ ordered_set_startup(FunctionCallInfo fcinfo, bool use_tuples)
 	OSAPerGroupState *osastate;
 	OSAPerQueryState *qstate;
 	MemoryContext oldcontext;
+	MemoryContext gcontext;
+
+	/*
+	 * Check we're called as aggregate (and not a window function), and
+	 * get the Agg node's group-lifespan context (which might not be the
+	 * same throughout the query, but will be the same for each transval)
+	 */
+	if (AggCheckCallContext(fcinfo, &gcontext) != AGG_CONTEXT_AGGREGATE)
+		elog(ERROR, "ordered-set aggregate called in non-aggregate context");

 	/*
 	 * We keep a link to the per-query state in fn_extra; if it's not there,
@@ -114,16 +123,9 @@ ordered_set_startup(FunctionCallInfo fcinfo, bool use_tuples)
 	{
 		Aggref	   *aggref;
 		MemoryContext qcontext;
-		MemoryContext gcontext;
 		List	   *sortlist;
 		int			numSortCols;

-		/*
-		 * Check we're called as aggregate (and not a window function), and
-		 * get the Agg node's group-lifespan context
-		 */
-		if (AggCheckCallContext(fcinfo, &gcontext) != AGG_CONTEXT_AGGREGATE)
-			elog(ERROR, "ordered-set aggregate called in non-aggregate context");
 		/* Need the Aggref as well */
 		aggref = AggGetAggref(fcinfo);
 		if (!aggref)
@@ -142,7 +144,6 @@ ordered_set_startup(FunctionCallInfo fcinfo, bool use_tuples)
 		qstate = (OSAPerQueryState *) palloc0(sizeof(OSAPerQueryState));
 		qstate->aggref = aggref;
 		qstate->qcontext = qcontext;
-		qstate->gcontext = gcontext;

 		/* Extract the sort information */
 		sortlist = aggref->aggorder;
@@ -259,10 +260,11 @@ ordered_set_startup(FunctionCallInfo fcinfo, bool use_tuples)
 	}

 	/* Now build the stuff we need in group-lifespan context */
-	oldcontext = MemoryContextSwitchTo(qstate->gcontext);
+	oldcontext = MemoryContextSwitchTo(gcontext);

 	osastate = (OSAPerGroupState *) palloc(sizeof(OSAPerGroupState));
 	osastate->qstate = qstate;
+	osastate->gcontext = gcontext;

 	/* Initialize tuplesort object */
 	if (use_tuples)
diff --git a/src/backend/utils/adt/orderedsetaggs.c b/src/backend/utils/adt/orderedsetaggs.c
index 92fa9f3..bbad0e5 100644
--- a/src/backend/utils/adt/orderedsetaggs.c
+++ b/src/backend/utils/adt/orderedsetaggs.c
@@ -16,6 +16,8 @@

 #include <math.h>

+#include "utils/orderedset.h"
+
 #include "catalog/pg_aggregate.h"
 #include "catalog/pg_operator.h"
 #include "catalog/pg_type.h"
@@ -30,71 +32,8 @@
 #include "utils/tuplesort.h"


-/*
- * Generic support for ordered-set aggregates
- *
- * The state for an ordered-set aggregate is divided into a per-group struct
- * (which is the internal-type transition state datum returned to nodeAgg.c)
- * and a per-query struct, which contains data and sub-objects that we can
- * create just once per query because they will not change across groups.
- * The per-query struct and subsidiary data live in the executor's per-query
- * memory context, and go away implicitly at ExecutorEnd().
- */
-
-typedef struct OSAPerQueryState
-{
-	/* Aggref for this aggregate: */
-	Aggref	   *aggref;
-	/* Memory context containing this struct and other per-query data: */
-	MemoryContext qcontext;
-
-	/* These fields are used only when accumulating tuples: */
-
-	/* Tuple descriptor for tuples inserted into sortstate: */
-	TupleDesc	tupdesc;
-	/* Tuple slot we can use for inserting/extracting tuples: */
-	TupleTableSlot *tupslot;
-	/* Per-sort-column sorting information */
-	int			numSortCols;
-	AttrNumber *sortColIdx;
-	Oid		   *sortOperators;
-	Oid		   *eqOperators;
-	Oid		   *sortCollations;
-	bool	   *sortNullsFirsts;
-	/* Equality operator call info, created only if needed: */
-	FmgrInfo   *equalfns;
-
-	/* These fields are used only when accumulating datums: */
-
-	/* Info about datatype
Re: [HACKERS] Array of composite types returned from python
I wrote: Ronan Dunklau ronan.dunk...@dalibo.com writes: I don't know how to do that without implementing the cache itself. I don't either, but my thought was that we could hack up a simple one-element cache pretty trivially, eg static info and desc variables in PLyObject_ToComposite that are initialized the first time through. You could only test one composite-array type per session with that sort of kluge, but that would be good enough for doing some simple performance testing. I did that, and found that building and returning a million-element composite array took about 4.2 seconds without any optimization, and 3.2 seconds with the hacked-up cache (as of HEAD, asserts off). I'd say that means we might want to do something about it eventually, but it's hardly the first order of business. I've committed the patch with a bit of additional cleanup. I credited Ronan and Ed equally as authors, since I'd say the fix for plpy.prepare was at least as complex as the original patch. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
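[Editor's note: for readers curious what the quick hack Tom describes looks like, a one-element cache of this kind is just a pair of statics keyed on the type. This is a sketch with hypothetical variable names, not the committed code; lookup_rowtype_tupdesc_copy() is the real typcache lookup:]

	/* one-element cache: remembers the last composite type seen */
	static Oid			cached_typoid = InvalidOid;
	static TupleDesc	cached_desc = NULL;

	if (cached_desc == NULL || cached_typoid != typoid)
	{
		/* first call, or a different composite type: refill the cache */
		cached_desc = lookup_rowtype_tupdesc_copy(typoid, -1);
		cached_typoid = typoid;
	}
	/* reuse cached_desc for every element of the array */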
Re: [HACKERS] RETURNING PRIMARY KEY syntax extension
Greg Stark st...@mit.edu writes: On Thu, Jul 3, 2014 at 7:37 AM, Tom Lane t...@sss.pgh.pa.us wrote: TBH, I thought that RETURNING PRIMARY KEY was already fairly iffy in that the application would have little idea what it was getting back. IF EXISTS would make it so spongy as to be useless, IMO. Seems easy enough to look at the count of columns in the result set. Sure, if you know how many columns are in the table to begin with; but that goes back to having table metadata. If you don't, you're probably going to be issuing RETURNING *, and AFAICS RETURNING *, PRIMARY KEY would be entirely useless (even without IF EXISTS :-(). I'm still unconvinced that there's a robust use-case here. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Pg_upgrade and toast tables bug discovered
Bruce Momjian br...@momjian.us writes: I have had trouble getting a table schema that is causing problems, but received a report via EDB support recently that had a simple schema (anonymized): ... needs_toast_table() computes the length of this table as 2024 bytes in 9.0, and 2064 bytes on 9.1, with the TOAST threshold being 2032 bytes. My initial idea is to to allow for toast tables in the new cluster that aren't in the old cluster by skipping over the extra toast tables. This would only be for pre-9.1 old clusters. TBH, it has never been more than the shakiest of assumptions that the new version could not create toast tables where the old one hadn't. I think you should just allow for that case, independently of specific PG versions. Fortunately it seems easy enough, since you certainly don't need to put any data into the new toast tables. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] tweaking NTUP_PER_BUCKET
On 3.7.2014 20:50, Tomas Vondra wrote: Hi Stephen, On 3.7.2014 20:10, Stephen Frost wrote: Tomas, * Tomas Vondra (t...@fuzzy.cz) wrote: However it's likely there are queries where this may not be the case, i.e. where rebuilding the hash table is not worth it. Let me know if you can construct such query (I wasn't). Thanks for working on this! I've been thinking on this for a while and this seems like it may be a good approach. Have you considered a bloom filter over the buckets..? Also, I'd suggest you check the archives from about this time last year for test cases that I was using which showed cases where hashing the larger table was a better choice- those same cases may also show regression here (or at least would be something good to test). Good idea, I'll look at the test cases - thanks. I can't find the thread / test cases in the archives. I've found this thread in hackers: http://www.postgresql.org/message-id/caoezvif-r-ilf966weipk5by-khzvloqpwqurpak3p5fyw-...@mail.gmail.com Can you point me to the right one, please? regards Tomas -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Pg_upgrade and toast tables bug discovered
On 2014-07-03 17:09:41 -0400, Tom Lane wrote: Bruce Momjian br...@momjian.us writes: I have had trouble getting a table schema that is causing problems, but received a report via EDB support recently that had a simple schema (anonymized): ... needs_toast_table() computes the length of this table as 2024 bytes in 9.0, and 2064 bytes on 9.1, with the TOAST threshold being 2032 bytes. My initial idea is to to allow for toast tables in the new cluster that aren't in the old cluster by skipping over the extra toast tables. This would only be for pre-9.1 old clusters. TBH, it has never been more than the shakiest of assumptions that the new version could not create toast tables where the old one hadn't. I think you should just allow for that case, independently of specific PG versions. Fortunately it seems easy enough, since you certainly don't need to put any data into the new toast tables. I don't think it's just that simple unfortunately. If pg_class entries get created that didn't exist on the old server there's a chance for oid conflicts. Consider SELECT binary_upgrade.set_next_heap_pg_class_oid('17094'::pg_catalog.oid); CREATE TABLE table_without_toast_in_old_server(...); -- heap oid 17094, toast oid 16384 SELECT binary_upgrade.set_next_heap_pg_class_oid('16384'::pg_catalog.oid); CREATE TABLE another_table(...); ERROR: could not create file ...: File exists I think we even had reports of such a problem. The most robust, but not trivial, approach seems to be to prevent toast table creation if there wasn't a set_next_toast_pg_class_oid(). Then, after all relations are created, iterate over all pg_class entries that possibly need toast tables and recheck if they now might need one. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Aggregate function API versus grouping sets
Andrew Gierth and...@tao11.riddles.org.uk writes: Tom == Tom Lane t...@sss.pgh.pa.us writes: Tom If we're going to do that, I think it needs to be in 9.4. Are Tom you going to work up a patch? How's this? (applies and passes regression on 9.4 and master) Pushed with minor cosmetic adjustments. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Aggregate function API versus grouping sets
Tom == Tom Lane t...@sss.pgh.pa.us writes: How's this? (applies and passes regression on 9.4 and master) Tom Pushed with minor cosmetic adjustments. Thanks! -- Andrew (irc:RhodiumToad) -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Aggregate function API versus grouping sets
Andrew Gierth and...@tao11.riddles.org.uk writes: Here are two more, cumulative with the previous patch and each other: The first removes the assumption in ordered_set_startup that the aggcontext can be cached in fn_extra, and puts it in the transvalue instead. OK, done. (ATM it seems like we wouldn't need gcontext in the transvalue either, but I left it there in case we want it later.) The second exposes the OSA* structures in a header file, so that user-defined ordered-set aggs can use ordered_set_transition[_multi] directly rather than having to cargo-cult it into their own code. I'm not very happy about this one; we intentionally did not expose these structs, so that we could freely make fixes like, for example, your immediately preceding patch. I think it'd be worth considering whether there's a way to allow third-party ordered-set aggs to use the infrastructure in orderedsetaggs.c while still treating these structs as opaque. In any case we'd need a more carefully thought-through API than just decreeing that some private structs are now public. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] RETURNING PRIMARY KEY syntax extension
On 4 July 2014 00:07, Tom Lane t...@sss.pgh.pa.us wrote: TBH, I thought that RETURNING PRIMARY KEY was already fairly iffy in that the application would have little idea what it was getting back. IF EXISTS would make it so spongy as to be useless, IMO.

IF EXISTS is pretty pointless - while the behaviour of getGeneratedKeys() isn't defined for cases where there aren't any, it's only meaningful if the caller has previously asked for the keys to be returned, and someone asking to do that where it doesn't make sense can get an error as far as I'm concerned. No-one does this in practice. Looking at the feature as a more general SQL construct, ISTM that if someone requests RETURNING PRIMARY KEY where there isn't one, an error is appropriate. And for the IF EXISTS case, when on earth will someone request a primary key if they're not even sure one exists?

It sounds to me like designing this for JDBC's getGeneratedKeys method is a mistake. There isn't going to be any way that the driver can support that without having looked at the table's metadata for itself, and if it's going to do that then it doesn't need a crutch that only partially solves the problem anyhow.

Sure it can - it appends RETURNING PRIMARY KEY and hands back a ResultSet from whatever was returned. It's CURRENTLY doing that, but it's appending RETURNING * and leaving it up to the caller of getGeneratedKeys() to work out which columns the caller is interested in. Turns out that's actually ok - most Java-based ORMs have more than enough metadata about the tables they manage to know which columns are the primary key. It's the driver that doesn't have that information, which is where RETURNING PRIMARY KEY can help, by not forcing the driver to request that every single column is returned.

The only downside that I see is cases where someone requests the keys to be returned but already has a RETURNING clause in their insert statement - what if the requested columns don't include the primary key? IMO it's probably enough to document that if the caller wants to issue a RETURNING clause then they'd better make sure it contains the columns they're interested in. This is already way outside anything that standard ORMs will do (they don't know about RETURNING), so anyone doing this is hand-crafting anyway.

Cheers Tom
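[Editor's note: to make the driver-side rewriting concrete, the difference being discussed looks like this - illustrative SQL only; the table and columns are hypothetical:]

	-- What the JDBC driver sends today when generated keys are requested:
	INSERT INTO orders (customer, total) VALUES ('acme', 42.0) RETURNING *;

	-- What it could send with this patch, avoiding the full-row round trip:
	INSERT INTO orders (customer, total) VALUES ('acme', 42.0) RETURNING PRIMARY KEY;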
Re: [HACKERS] RETURNING PRIMARY KEY syntax extension
Tom Dunstan pg...@tomd.cc writes: On 4 July 2014 00:07, Tom Lane t...@sss.pgh.pa.us wrote: It sounds to me like designing this for JDBC's getGeneratedKeys method is a mistake. There isn't going to be any way that the driver can support that without having looked at the table's metadata for itself, and if it's going to do that then it doesn't need a crutch that only partially solves the problem anyhow. Sure it can - it append RETURNING PRIMARY KEY and hand back a ResultSet from whatever was returned. It's CURRENTLY doing that, but it's appending RETURNING * and leaving it up to the caller of getGeneratedKeys() to work out which columns the caller is interested in. I might be mistaken, but as I read the spec for getGeneratedKeys, whether or not a column is in a primary key has got nothing to do with whether that column should be returned. It looks to me like what you're supposed to return is columns with computed default values, eg, serial columns. (Whether we should define it as being *exactly* columns created by the SERIAL macro is an interesting question, but for sure those ought to be included.) Now, the pkey might be a serial column ... but it equally well might not be; it might not have any default expression at all. And there certainly could be serial columns that weren't in the pkey. The fact that the spec is kinda fuzzy doesn't entitle us to ignore the plain meaning of the term generated key. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] RETURNING PRIMARY KEY syntax extension
On 4 July 2014 11:37, Tom Lane t...@sss.pgh.pa.us wrote: Sure it can - it append RETURNING PRIMARY KEY and hand back a ResultSet from whatever was returned. It's CURRENTLY doing that, but it's appending RETURNING * and leaving it up to the caller of getGeneratedKeys() to work out which columns the caller is interested in. I might be mistaken, but as I read the spec for getGeneratedKeys, whether or not a column is in a primary key has got nothing to do with whether that column should be returned. It looks to me like what you're supposed to return is columns with computed default values, eg, serial columns. (Whether we should define it as being *exactly* columns created by the SERIAL macro is an interesting question, but for sure those ought to be included.) Now, the pkey might be a serial column ... but it equally well might not be; it might not have any default expression at all. And there certainly could be serial columns that weren't in the pkey. The fact that the spec is kinda fuzzy doesn't entitle us to ignore the plain meaning of the term generated key. Well, strictly speaking no, but in reality it doesn't look like other databases that support this feature support anything other than returning auto-increment fields on primary keys, often only supporting a single column (see [1]). Oracle doesn't support even that - callers have to call the variant that specifies a column list, since historically there was no real support for recognising such columns, oracle being sequence based. As such, there can't be any callers in the wild that will expect this to return anything other than the primary key, because there's no support for doing anything else. And if the caller really DOES want other columns returned, they can always call the variant that allows them to specify those columns, or just issue a normal query with a returning clause. This feature really exists to support the default case where those columns aren't specified by the caller, and given current driver support (and sanity) the only thing any caller will expect from such a result is the primary key. Cheers Tom [1] http://www.postgresql.org/message-id/cappfruwqy0z66trv4xmdqnyv0prjky+38x+p4vkhmrrw5rb...@mail.gmail.com
Re: [HACKERS] docs: additional subsection for page-level locks in explicit-locking section
On Fri, Jul 4, 2014 at 12:51 AM, Kevin Grittner kgri...@ymail.com wrote: Fujii Masao masao.fu...@gmail.com wrote: This seems to make sense. Barring objection, I will commit this only in HEAD. Committed. I'm inclined to think this is a slight improvement, just for the sake of consistency with peer level information. You probably already noticed, but the patch as submitted neglects to close the prior sect2 block before opening the new one. Yes, thanks for pointing out that! Regards, -- Fujii Masao -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Can simplify 'limit 1' with slow function?
slow query (8531 ms):

SELECT ST_Distance_Sphere(shape,ST_GeomFromText('POINT(116.41386186784513 40.12211338311868)')) FROM road order by id LIMIT 1;

explain output:

Limit  (cost=4653.48..4653.48 rows=1 width=3612)
  ->  Sort  (cost=4653.48..4683.06 rows=11832 width=3612)
        Sort Key: id
        ->  Seq Scan on road  (cost=0.00..4594.32 rows=11832 width=3612)

fast query (16 ms):

select ST_Distance_Sphere(shape,ST_GeomFromText('POINT(116.41386186784513 40.12211338311868)')) from (SELECT shape FROM road order by id LIMIT 1) a

explain output:

Subquery Scan on a  (cost=1695.48..1695.74 rows=1 width=3608)
  ->  Limit  (cost=1695.48..1695.48 rows=1 width=3612)
        ->  Sort  (cost=1695.48..1725.06 rows=11832 width=3612)
              Sort Key: road.id
              ->  Seq Scan on road  (cost=0.00..1636.32 rows=11832 width=3612)

CREATE TABLE road (
    shape geometry,
    id integer
) WITH (OIDS=FALSE);

Are there redundant calls when sorting?

On Tue, Jul 1, 2014 at 2:16 PM, Martijn van Oosterhout klep...@svana.org wrote: On Sun, Jun 29, 2014 at 10:05:50PM +0800, gotoschool6g wrote: The simplified scene: select slowfunction(s) from a order by b limit 1; is slower than select slowfunction(s) from (select s from a order by b limit 1) as z; if there are many records in table 'a'. The real scene: Function ST_Distance_Sphere is slow, so the query: SELECT ST_Distance_Sphere(s, ST_GeomFromText('POINT(1 1)')) from road order by c limit 1; is slower than: select ST_Distance_Sphere(s, ST_GeomFromText('POINT(1 1)')) from (SELECT s from road order by c limit 1) as a; There are about 7000 records in 'road'.

I think to help here we need the EXPLAIN ANALYSE output for both queries.

Well, I think the problem is a well-understood one: there is no guarantee that functions-in-select-list are called exactly once per output row. This is documented -- for example see here: http://www.postgresql.org/docs/9.1/static/explicit-locking.html#ADVISORY-LOCKS. In short, if you want very precise control of function evaluation use a subquery, or, if you're really paranoid, a CTE.

I'm probably dense, but I'm not sure I understand. Or is it that the slowfunction() is called prior to the sort? That seems insane.

Have a nice day, -- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/ He who writes carelessly confesses thereby at the very outset that he does not attach much importance to his own thoughts. -- Arthur Schopenhauer -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
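[Editor's note: following the subquery-or-CTE advice above, the CTE form of the fast query would look like this (same example table). In the 9.x releases under discussion a CTE acts as an optimization fence, so the sort-plus-limit runs to completion before the expensive function is evaluated, on exactly one row:]

	WITH nearest AS (
	    SELECT shape FROM road ORDER BY id LIMIT 1
	)
	SELECT ST_Distance_Sphere(shape,
	           ST_GeomFromText('POINT(116.41386186784513 40.12211338311868)'))
	FROM nearest;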
Re: [HACKERS] pg_receivexlog add synchronous mode
On Tue, Jul 1, 2014 at 10:11 PM, Fujii Masao masao.fu...@gmail.com wrote: On Mon, Jun 30, 2014 at 7:09 PM, furu...@pm.nttdata.co.jp wrote: Thanks for the review!

+		if (secs <= 0)
+			secs = 1;	/* Always sleep at least 1 sec */
+
+		sleeptime = secs * 1000 + usecs / 1000;

The above is the code which caused that problem. 'usecs' should have been reset to zero when 'secs' is rounded up to 1 second. But it was not. Attached is the updated version of the patch.

Thank you for the refactoring v2 patch. I did a review of the patch.

1. applied cleanly, and compilation was without warnings and errors
2. all regression tests passed ok
3. sleeptime is ok when --status-interval is set to 1

Thanks for reviewing the patch! I think that this refactoring patch is useful for improving source code readability and making the future patches simpler, whether we adopt your patch or not. So, barring any objections, I'm thinking to commit this refactoring patch.

Committed!

Regards, -- Fujii Masao -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
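[Editor's note: for clarity, the fix described above amounts to resetting the leftover microseconds whenever the seconds are clamped. A sketch of the corrected logic, not the committed code:]

	if (secs <= 0)
	{
		secs = 1;		/* always sleep at least 1 second ... */
		usecs = 0;		/* ... and drop the leftover microseconds */
	}
	sleeptime = secs * 1000 + usecs / 1000;		/* milliseconds */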
Re: [HACKERS] Pg_upgrade and toast tables bug discovered
On Thu, Jul 3, 2014 at 05:09:41PM -0400, Tom Lane wrote: Bruce Momjian br...@momjian.us writes: I have had trouble getting a table schema that is causing problems, but received a report via EDB support recently that had a simple schema (anonymized): ... needs_toast_table() computes the length of this table as 2024 bytes in 9.0, and 2064 bytes on 9.1, with the TOAST threshold being 2032 bytes. My initial idea is to allow for toast tables in the new cluster that aren't in the old cluster by skipping over the extra toast tables. This would only be for pre-9.1 old clusters.

TBH, it has never been more than the shakiest of assumptions that the new version could not create toast tables where the old one hadn't. I think you should just allow for that case, independently of specific PG versions. Fortunately it seems easy enough, since you certainly don't need to put any data into the new toast tables.

OK, patch attached.

-- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +

diff --git a/contrib/pg_upgrade/info.c b/contrib/pg_upgrade/info.c
new file mode 100644
index 6205c74..17e4d5b
*** a/contrib/pg_upgrade/info.c
--- b/contrib/pg_upgrade/info.c
*** gen_db_file_maps(DbInfo *old_db, DbInfo
*** 38,58 ****
  			   int *nmaps, const char *old_pgdata, const char *new_pgdata)
  {
  	FileNameMap *maps;
! 	int			relnum;
  	int			num_maps = 0;
  
  	maps = (FileNameMap *) pg_malloc(sizeof(FileNameMap) * old_db->rel_arr.nrels);
  
! 	for (relnum = 0; relnum < Min(old_db->rel_arr.nrels, new_db->rel_arr.nrels);
! 		 relnum++)
  	{
! 		RelInfo    *old_rel = &old_db->rel_arr.rels[relnum];
! 		RelInfo    *new_rel = &new_db->rel_arr.rels[relnum];
  
  		if (old_rel->reloid != new_rel->reloid)
! 			pg_fatal("Mismatch of relation OID in database \"%s\": old OID %d, new OID %d\n",
! 					 old_db->db_name, old_rel->reloid, new_rel->reloid);
  
  		/*
  		 * TOAST table names initially match the heap pg_class oid. In
--- 38,85 ----
  			   int *nmaps, const char *old_pgdata, const char *new_pgdata)
  {
  	FileNameMap *maps;
! 	int			old_relnum, new_relnum;
  	int			num_maps = 0;
  
  	maps = (FileNameMap *) pg_malloc(sizeof(FileNameMap) * old_db->rel_arr.nrels);
  
! 	/*
! 	 * The old database shouldn't have more relations than the new one.
! 	 * We force the new cluster to have a TOAST table if the old table
! 	 * had one.
! 	 */
! 	if (old_db->rel_arr.nrels > new_db->rel_arr.nrels)
! 		pg_fatal("old and new databases \"%s\" have a mismatched number of relations\n",
! 				 old_db->db_name);
! 
! 	/* Drive the loop using new_relnum, which might be higher. */
! 	for (old_relnum = new_relnum = 0; new_relnum < new_db->rel_arr.nrels;
! 		 new_relnum++)
  	{
! 		RelInfo    *old_rel = &old_db->rel_arr.rels[old_relnum];
! 		RelInfo    *new_rel = &new_db->rel_arr.rels[new_relnum];
  
! 		if (old_rel->reloid != new_rel->reloid)
! 		{
! 			if (strcmp(new_rel->nspname, "pg_toast") == 0)
! 				/*
! 				 * It is possible that the new cluster has a TOAST table
! 				 * for a table that didn't need one in the old cluster,
! 				 * e.g. 9.0 to 9.1 changed the NUMERIC length computation.
! 				 * Therefore, if we have a TOAST table in the new cluster
! 				 * that doesn't match, skip over it and continue processing.
! 				 * It is possible this TOAST table used an OID that was
! 				 * reserved in the old cluster, but we have no way of
! 				 * testing that, and we would have already gotten an error
! 				 * at the new cluster schema creation stage.
! 				 */
! 				continue;
! 			else
! 				pg_fatal("Mismatch of relation OID in database \"%s\": old OID %d, new OID %d\n",
! 						 old_db->db_name, old_rel->reloid, new_rel->reloid);
! 		}
  
  		/*
  		 * TOAST table names initially match the heap pg_class oid. In
*** gen_db_file_maps(DbInfo *old_db, DbInfo
*** 76,89 ****
  		create_rel_filename_map(old_pgdata, new_pgdata, old_db, new_db,
  								old_rel, new_rel, maps + num_maps);
  		num_maps++;
  	}
  
! 	/*
! 	 * Do this check after the loop so hopefully we will produce a clearer
! 	 * error above
! 	 */
! 	if (old_db->rel_arr.nrels != new_db->rel_arr.nrels)
! 		pg_fatal("old and new databases \"%s\" have a different number of relations\n",
  				 old_db->db_name);
  
  	*nmaps = num_maps;
--- 103,114 ----
  		create_rel_filename_map(old_pgdata, new_pgdata, old_db, new_db,
  								old_rel, new_rel, maps + num_maps);
  		num_maps++;
+ 		old_relnum++;
  	}
  
! 	/* Did we fail to exhaust the old array? */
! 	if (old_relnum != old_db->rel_arr.nrels)
! 		pg_fatal("old and new databases \"%s\" have a mismatched number of relations\n",
  				 old_db->db_name);
  
  	*nmaps = num_maps;

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Pg_upgrade and toast tables bug discovered
On Thu, Jul 3, 2014 at 11:55:40PM +0200, Andres Freund wrote: I don't think it's just that simple unfortunately. If pg_class entries get created that didn't exist on the old server there's a chance for oid conflicts. Consider SELECT binary_upgrade.set_next_heap_pg_class_oid('17094'::pg_catalog.oid); CREATE TABLE table_without_toast_in_old_server(...); -- heap oid 17094, toast oid 16384 SELECT binary_upgrade.set_next_heap_pg_class_oid('16384'::pg_catalog.oid); CREATE TABLE another_table(...); ERROR: could not create file ...: File exists I think we even had reports of such a problem.

I had not considered this. I don't remember ever seeing such a report. We have had oid mismatch reports, but we now know the cause of those.

The most robust, but not trivial, approach seems to be to prevent toast table creation if there wasn't a set_next_toast_pg_class_oid(). Then, after all relations are created, iterate over all pg_class entries that possibly need toast tables and recheck if they now might need one.

Wow, that is going to be kind of odd, in that there really isn't a good way to create toast tables except perhaps to add a dummy TEXT column and remove it. There also isn't an easy way to not create a toast table but also find out that one was needed. I suppose we would have to insert some dummy value into the toast pg_class column and come back later to clean it up. I am wondering what the probability is of having a table that didn't need a toast table in the old cluster but needs one in the new cluster, combined with an oid collision. I think the big question is whether we want to go down that route. -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [v9.5] Custom Plan API
Hanada-san, Thanks for your dedicated reviewing. It's a very long message, so let me summarize the things I shall do in the next patch:

* fix bug: custom-plan class comparison
* fix up naming convention and syntax: CREATE CUSTOM PLAN PROVIDER, rather than CREATE CUSTOM PLAN. Prefix shall be cpp_.
* fix up: definition of get_custom_plan_oid()
* fix up: unexpected white spaces, to be tabs.
* fix up: remove unnecessary forward declarations.
* fix up: revert replace_nestloop_params() to static
* make SetCustomPlanRef an optional callback
* fix up: typos in various points
* add documentation to explain the custom-plan interface.

Also, I want a committer's opinion about the issues below:

* whether set_cheapest() is called for all relkinds?
* how the argument of the add_path handler shall be derived?

Individual comments are put below:

Kaigai-san, Sorry for the lagged response. Here are my random thoughts about the patch. I couldn't understand the patch fully, because some of the APIs are not used by ctidscan. If

Custom Scan patch v2 review

* Custom plan class comparison

In backend/optimizer/util/pathnode.c, custclass is compared by bit-and with 's'. Do you plan to use custclass as a bit field? If so, values for the custom plan class should not be a character. Otherwise, custclass should be compared with the == operator.

Sorry, it is a bug that came from the previous design. I had an idea that allows a custom plan provider to support multiple kinds of exec nodes; however, I concluded it does not make much sense. (we can define multiple CPPs for each)

* Purpose of GetSpecialCustomVar()

The reason why the FinalizeCustomPlan callback is necessary is not clear to me. Could you show a case where the API would be useful?

It is a feature needed to replace a built-in join by a custom scan, though it might be unclear for scan workloads. Let me explain why join replacement needs it. A join node has two input slots (inner and outer); its expression nodes, including Var nodes, reference either slot according to their varno (INNER_VAR or OUTER_VAR). In case a CPP replaced a join, it has to generate an equivalent result, but it may not be the best choice to use two input streams. (Please remember that when we construct a remote join in postgres_fdw, all the materialization is done on the remote side, thus we have one input stream to generate the local equivalent of the join.) On the other hand, the EXPLAIN command has to understand which column is the source of the Var nodes in the targetlist of the custom node, even if it is rewritten to use just one slot. For example, which label shall be shown in case the 3rd item of the targetlist originally came from the 2nd item of the inner slot, but all the materialized results are stored in the outer slot? Only the CPP can track the relationship between the original varnode and the newer one. This interface provides a way to resolve what a varnode actually references.

* Purpose of FinalizeCustomPlan()

The reason why the FinalizeCustomPlan callback is necessary is not clear to me, because ctidscan just calls finalize_primnode() and bms_add_members() with the given information. Could you show a case where the API would be useful?

The main purpose of this callback is to give the extension a chance to apply finalize_primnode() if the custom node holds an expression tree in its private fields. In case the CPP picked up a part of the clauses to run in its own way, they shall be attached on neither plan->targetlist nor plan->qual; only the CPP knows where they are attached. So, these orphan expression nodes have to be treated by the CPP.

* Is it ok to call set_cheapest() for all relkinds?
Now set_cheapest() is called not only for plain relations and foreign tables but also for custom plans, and for other relations such as subqueries, functions, and values. Calling call_custom_scan_provider() and set_cheapest() in the case of RTE_RELATION seems similar to the old construct; what do you think about this?

I don't think we would actually be able to have useful custom scan logic on these special relation forms; however, I also don't have a special reason why custom plans need not support these special relations. I'd like to see a committer's opinion here.

* Is it hard to get rid of CopyCustomPlan()?

Copying a ForeignScan node doesn't need a per-FDW copy function, by limiting fdw_private to have only copyable objects. Can't we use the same way for CustomPlan? Letting authors call NodeSetTag or copyObject() sounds uncomfortable to me. This could apply to TextOutCustomPlan() and TextOutCustomPath() too.

The FDW-like design was the original one, but the latest design was a suggestion by Tom in the v9.4 development cycle, because some data types, like Bitmapset, are not compliant with copyObject.

* Is MultiExec support appropriate for the first version?

The cases that need MultiExec seem a little complex for the first version of custom scan. What kind of plan do you imagine for this feature?

It is definitely necessary to exchange multiple rows in custom format with the upper level if
Re: [HACKERS] RETURNING PRIMARY KEY syntax extension
On Fri, Jul 4, 2014 at 7:37 AM, Tom Lane t...@sss.pgh.pa.us wrote: Tom Dunstan pg...@tomd.cc writes: On 4 July 2014 00:07, Tom Lane t...@sss.pgh.pa.us wrote: It sounds to me like designing this for JDBC's getGeneratedKeys method is a mistake. There isn't going to be any way that the driver can support that without having looked at the table's metadata for itself, and if it's going to do that then it doesn't need a crutch that only partially solves the problem anyhow. Sure it can - it append RETURNING PRIMARY KEY and hand back a ResultSet from whatever was returned. It's CURRENTLY doing that, but it's appending RETURNING * and leaving it up to the caller of getGeneratedKeys() to work out which columns the caller is interested in. I might be mistaken, but as I read the spec for getGeneratedKeys, whether or not a column is in a primary key has got nothing to do with whether that column should be returned. It looks to me like what you're supposed to return is columns with computed default values, eg, serial columns. (Whether we should define it as being *exactly* columns created by the SERIAL macro is an interesting question, but for sure those ought to be included.) Now, the pkey might be a serial column ... but it equally well might not be; it might not have any default expression at all. And there certainly could be serial columns that weren't in the pkey. 100% agree with Tom. The fact that the spec is kinda fuzzy doesn't entitle us to ignore the plain meaning of the term generated key. regards, tom lane -- Rushabh Lathia
[HACKERS] Issue while calling new PostgreSQL command from a Java Application
Hi,

I have defined a new command my_command in PostgreSQL. This command takes the code path of ANALYZE, and inside analyze.c I have a function that does some operations if it's my_command. The command takes these input arguments: table name, column name, and an input string.

my_command nation (n_nationkey) 'input string';

When I run this command from command-line psql, it works as expected. But when I call the same command from a Java application, the variable that stores the input string is NULL. I printed the value of the input string in the gram.y file where I have defined my_command:

fprintf(stderr, "I am inside gram.y %s\n", n->inp_str);

and the input string is printed correctly. But when I print stmt->inp_str in the function standard_ProcessUtility() of utility.c for the case T_VacuumStmt, I get the value as NULL. This is as far as I could trace back from analyze.c. I am not sure how executing the same command from an application can make a difference.

gram.y content gist:

MyStmt:
	my_keyword qualified_name name_list my_inp_str
		{
			VacuumStmt *n = makeNode(VacuumStmt);
			n->options = VACOPT_ANALYZE;
			n->freeze_min_age = -1;
			n->freeze_table_age = -1;
			n->relation = $2;
			n->va_cols = $3;
			n->inp_str = $4;
			fprintf(stderr, "I am inside gram.y %s\n", n->inp_str);
			$$ = (Node *)n;
		};

char *inp_str is added to the struct VacuumStmt in parsenodes.h.

Only the newly added char *inp_str (the part that differs from ANALYZE) is NULL; I was able to retrieve the column name from va_cols. Any help is appreciated. Thanks!

-- Regards, Ashoke
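[Editor's note: a likely cause, offered as a guess rather than something established in this thread: statements arriving through a JDBC driver usually go through the extended query protocol, where the parse tree is duplicated with copyObject(). A field added to VacuumStmt must therefore also be handled in _copyVacuumStmt() in src/backend/nodes/copyfuncs.c (and ideally in equalfuncs.c), or the copied statement carries a NULL pointer - which matches the symptom of va_cols surviving while the new field is NULL. A sketch, with the field list abbreviated:]

	/* In _copyVacuumStmt() in src/backend/nodes/copyfuncs.c: */
	static VacuumStmt *
	_copyVacuumStmt(const VacuumStmt *from)
	{
		VacuumStmt *newnode = makeNode(VacuumStmt);

		COPY_SCALAR_FIELD(options);
		COPY_SCALAR_FIELD(freeze_min_age);
		COPY_SCALAR_FIELD(freeze_table_age);
		COPY_NODE_FIELD(relation);
		COPY_NODE_FIELD(va_cols);
		COPY_STRING_FIELD(inp_str);		/* the new field; without this line
										 * the copy gets a NULL pointer */

		return newnode;
	}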
Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes
On Fri, Jul 4, 2014 at 4:58 AM, Rahila Syed rahilasye...@gmail.com wrote: Hello, Updated versions of the patches are attached. Changes are as follows:

1. Improved readability of the code as per the review comments.
2. Addition of block_compression field in the BkpBlock structure to store information about compression of the block. This provides for switching compression on/off and changing the compression algorithm as required.
3. Handling of OOM in a critical section by checking the return value of malloc and proceeding without compression of the FPW if the return value is NULL.

Thanks for updating the patches! But 0002-CompressBackupBlock_snappy_lz4_pglz-2.patch doesn't seem to be able to apply to HEAD cleanly.

---
$ git am ~/Desktop/0001-Support-for-LZ4-and-Snappy-2.patch
Applying: Support for LZ4 and Snappy-2
$ git am ~/Desktop/0002-CompressBackupBlock_snappy_lz4_pglz-2.patch
Applying: CompressBackupBlock_snappy_lz4_pglz-2
/home/postgres/pgsql/git/.git/rebase-apply/patch:42: indent with spaces.
    /*Allocates memory for compressed backup blocks according to the compression algorithm used. Once per session at the time of insertion of first XLOG record.
/home/postgres/pgsql/git/.git/rebase-apply/patch:43: indent with spaces.
    This memory stays till the end of session. OOM is handled by making the code proceed without FPW compression*/
/home/postgres/pgsql/git/.git/rebase-apply/patch:58: indent with spaces.
    if(compressed_pages[j] == NULL)
/home/postgres/pgsql/git/.git/rebase-apply/patch:59: space before tab in indent.
    {
/home/postgres/pgsql/git/.git/rebase-apply/patch:60: space before tab in indent.
    compress_backup_block=BACKUP_BLOCK_COMPRESSION_OFF;
error: patch failed: src/backend/access/transam/xlog.c:60
error: src/backend/access/transam/xlog.c: patch does not apply
Patch failed at 0001 CompressBackupBlock_snappy_lz4_pglz-2
When you have resolved this problem run "git am --resolved".
If you would prefer to skip this patch, instead run "git am --skip".
To restore the original branch and stop patching run "git am --abort".
---

Regards, -- Fujii Masao -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers