Re: [HACKERS] [GENERAL] Insert result does not match record count
On 02/01/2014 02:26 AM, Bruce Momjian wrote: On Sat, Feb 1, 2014 at 02:25:16AM +0100, Vik Fearing wrote: OK, thanks for the feedback. I understand now. The contents of the string will potentially have a larger integer, but the byte length of the string in the wire protocol doesn't change. Let's wait for Vik to reply and I think we can move forward.

Unfortunately, I just did some cleanup last week and removed that branch. Had I waited a bit more I still would have had all the work I had done. I'll see how quickly I can redo it to get to the part where I got scared of what I was doing. It will have to wait until next week though; I am currently at FOSDEM.

OK, thanks. I thought it only required passing the int64 around until it got into the string passed to the client. The original patch is in the email archives if you want it.

The original patch didn't have much in the way of actual work done, unfortunately. Without re-doing the work, my IRC logs show that I was bothered by this in src/backend/tcop/postgres.c:

            case 'E':           /* execute */
                {
                    const char *portal_name;
                    int         max_rows;

                    forbidden_in_wal_sender(firstchar);

                    /* Set statement_timestamp() */
                    SetCurrentStatementStartTimestamp();

                    portal_name = pq_getmsgstring(&input_message);
                    max_rows = pq_getmsgint(&input_message, 4);
                    pq_getmsgend(&input_message);

                    exec_execute_message(portal_name, max_rows);
                }
                break;

I needed to change max_rows to int64, which meant I had to change pq_getmsgint to pq_getmsgint64, which made me a little worried. I was already overwhelmed by how much code I was changing and this one made me reconsider. If it's just a n00b thing, I guess I can pick this back up for 9.5.

--
Vik
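For context, a minimal sketch of the widening being described (not the original patch): pq_getmsgint64() is the existing 64-bit counterpart of pq_getmsgint(), but as Tom points out downthread, the Execute message's row-limit field is fixed at 4 bytes on the wire, so this is exactly the point where the change collides with the protocol.

            case 'E':           /* execute */
                {
                    const char *portal_name;
                    int64       max_rows;   /* was: int */

                    forbidden_in_wal_sender(firstchar);

                    /* Set statement_timestamp() */
                    SetCurrentStatementStartTimestamp();

                    portal_name = pq_getmsgstring(&input_message);
                    /* would require a protocol change: the Execute message
                     * currently carries a 4-byte row limit */
                    max_rows = pq_getmsgint64(&input_message);
                    pq_getmsgend(&input_message);

                    /* exec_execute_message() and ExecutorRun()'s count
                     * parameter would have to be widened as well */
                    exec_execute_message(portal_name, max_rows);
                }
                break;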
Re: [HACKERS] [PATCH] pg_sleep(interval)
Hi It seems like pg_sleep_until() has issues if used within a transaction, as it uses now() and not clock_timestamp(). Please find attached a patch that solves this issue. For consistency reasons, I also modified pg_sleep_for() to also use clock_timestamp. Regards On Fri, Jan 31, 2014 at 2:12 AM, Vik Fearing vik.fear...@dalibo.com wrote: On 01/30/2014 09:48 PM, Robert Haas wrote: On Thu, Oct 17, 2013 at 9:11 AM, Vik Fearing vik.fear...@dalibo.com wrote: On 10/17/2013 02:42 PM, Robert Haas wrote: On Thu, Oct 17, 2013 at 8:26 AM, Vik Fearing vik.fear...@dalibo.com wrote: On 10/17/2013 10:03 AM, Fabien COELHO wrote: My guess is that it won't be committed if there is a single but it might break one code or surprise one user somewhere in the universe, but I wish I'll be proven wrong. IMO, returned with feedback on a 1 liner is really akin to rejected. I have attached here an entirely new patch (new documentation and everything) that should please everyone. It no longer overloads pg_sleep(double precision) but instead add two new functions: * pg_sleep_for(interval) * pg_sleep_until(timestamp with time zone) Because it's no longer overloading the original pg_sleep, Robert's ambiguity objection is no more. Also, I like how it reads aloud: SELECT pg_sleep_for('5 minutes'); If people like this, I'll reject the current patch and add this one to the next commitfest. I find that naming relatively elegant. However, you've got to schema-qualify every function and operator used in the definitions, or you're creating a search-path security vulnerability. Good catch. Updated patch attached. Committed. Thanks! -- Vik -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers *** a/src/include/catalog/pg_proc.h --- b/src/include/catalog/pg_proc.h *** *** 3034,3042 DATA(insert OID = 2625 ( pg_ls_dir PGNSP PGUID 12 1 1000 0 0 f f f f t t v 1 0 DESCR(list all files in a directory); DATA(insert OID = 2626 ( pg_sleep PGNSP PGUID 12 1 0 0 0 f f f f t f v 1 0 2278 701 _null_ _null_ _null_ _null_ pg_sleep _null_ _null_ _null_ )); DESCR(sleep for the specified time in seconds); ! DATA(insert OID = 3935 ( pg_sleep_for PGNSP PGUID 14 1 0 0 0 f f f f t f v 1 0 2278 1186 _null_ _null_ _null_ _null_ select pg_catalog.pg_sleep(extract(epoch from pg_catalog.now() operator(pg_catalog.+) $1) operator(pg_catalog.-) extract(epoch from pg_catalog.now())) _null_ _null_ _null_ )); DESCR(sleep for the specified interval); ! DATA(insert OID = 3936 ( pg_sleep_until PGNSP PGUID 14 1 0 0 0 f f f f t f v 1 0 2278 1184 _null_ _null_ _null_ _null_ select pg_catalog.pg_sleep(extract(epoch from $1) operator(pg_catalog.-) extract(epoch from pg_catalog.now())) _null_ _null_ _null_ )); DESCR(sleep until the specified time); DATA(insert OID = 2971 ( textPGNSP PGUID 12 1 0 0 0 f f f f t f i 1 0 25 16 _null_ _null_ _null_ _null_ booltext _null_ _null_ _null_ )); --- 3034,3042 DESCR(list all files in a directory); DATA(insert OID = 2626 ( pg_sleep PGNSP PGUID 12 1 0 0 0 f f f f t f v 1 0 2278 701 _null_ _null_ _null_ _null_ pg_sleep _null_ _null_ _null_ )); DESCR(sleep for the specified time in seconds); ! 
DATA(insert OID = 3935 ( pg_sleep_for PGNSP PGUID 14 1 0 0 0 f f f f t f v 1 0 2278 1186 _null_ _null_ _null_ _null_ select pg_catalog.pg_sleep(extract(epoch from pg_catalog.clock_timestamp() operator(pg_catalog.+) $1) operator(pg_catalog.-) extract(epoch from pg_catalog.clock_timestamp())) _null_ _null_ _null_ )); DESCR(sleep for the specified interval); ! DATA(insert OID = 3936 ( pg_sleep_until PGNSP PGUID 14 1 0 0 0 f f f f t f v 1 0 2278 1184 _null_ _null_ _null_ _null_ select pg_catalog.pg_sleep(extract(epoch from $1) operator(pg_catalog.-) extract(epoch from pg_catalog.clock_timestamp())) _null_ _null_ _null_ )); DESCR(sleep until the specified time); DATA(insert OID = 2971 ( textPGNSP PGUID 12 1 0 0 0 f f f f t f i 1 0 25 16 _null_ _null_ _null_ _null_ booltext _null_ _null_ _null_ )); -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
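The behavioural difference only shows up inside a transaction block, where now() stays frozen at the transaction start while clock_timestamp() keeps advancing; a minimal illustration:

    BEGIN;
    SELECT pg_sleep(2);
    SELECT now() = transaction_timestamp() AS now_is_frozen,     -- true
           clock_timestamp() > now()       AS wall_clock_moved;  -- true
    -- A pg_sleep_until() defined in terms of now() would therefore sleep too
    -- long, by however much time has already elapsed in the transaction.
    COMMIT;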
Re: [HACKERS] GIN improvements part2: fast scan
On 01/30/2014 01:53 AM, Tomas Vondra wrote: (3) A file with explain plans for 4 queries suffering ~2x slowdown, and explain plans with 9.4 master and Heikki's patches is available here: http://www.fuzzy.cz/tmp/gin/queries.txt All the queries have 6 common words, and the explain plans look just fine to me - exactly like the plans for other queries. Two things now caught my eye. First some of these queries actually have words repeated - either exactly like database database or in negated form like !anything anything. Second, while generating the queries, I use dumb frequency, where only exact matches count. I.e. write != written etc. But the actual number of hits may be much higher - for example write matches exactly just 5% documents, but using @@ it matches more than 20%. I don't know if that's the actual cause though. Ok, here's another variant of these patches. Compared to git master, it does three things: 1. It adds the concept of ternary consistent function internally, but no catalog changes. It's implemented by calling the regular boolean consistent function both ways. 2. Use a binary heap to get the next item from the entries in a scan. I'm pretty sure this makes sense, because arguably it makes the code more readable, and reduces the number of item pointer comparisons significantly for queries with a lot of entries. 3. Only perform the pre-consistent check to try skipping entries, if we don't already have the next item from the entry loaded in the array. This is a tradeoff, you will lose some of the performance gain you might get from pre-consistent checks, but it also limits the performance loss you might get from doing useless pre-consistent checks. So taken together, I would expect this patch to make some of the performance gains less impressive, but also limit the loss we saw with some of the other patches. Tomas, could you run your test suite with this patch, please? - Heikki diff --git a/src/backend/access/gin/Makefile b/src/backend/access/gin/Makefile index aabc62f..db4f496 100644 --- a/src/backend/access/gin/Makefile +++ b/src/backend/access/gin/Makefile @@ -14,6 +14,6 @@ include $(top_builddir)/src/Makefile.global OBJS = ginutil.o gininsert.o ginxlog.o ginentrypage.o gindatapage.o \ ginbtree.o ginscan.o ginget.o ginvacuum.o ginarrayproc.o \ - ginbulk.o ginfast.o ginpostinglist.o + ginbulk.o ginfast.o ginpostinglist.o ginlogic.o include $(top_srcdir)/src/backend/common.mk diff --git a/src/backend/access/gin/ginget.c b/src/backend/access/gin/ginget.c index a45d722..f2ea962 100644 --- a/src/backend/access/gin/ginget.c +++ b/src/backend/access/gin/ginget.c @@ -32,41 +32,6 @@ typedef struct pendingPosition bool *hasMatchKey; } pendingPosition; - -/* - * Convenience function for invoking a key's consistentFn - */ -static bool -callConsistentFn(GinState *ginstate, GinScanKey key) -{ - /* - * If we're dealing with a dummy EVERYTHING key, we don't want to call the - * consistentFn; just claim it matches. - */ - if (key-searchMode == GIN_SEARCH_MODE_EVERYTHING) - { - key-recheckCurItem = false; - return true; - } - - /* - * Initialize recheckCurItem in case the consistentFn doesn't know it - * should set it. The safe assumption in that case is to force recheck. 
- */ - key-recheckCurItem = true; - - return DatumGetBool(FunctionCall8Coll(ginstate-consistentFn[key-attnum - 1], - ginstate-supportCollation[key-attnum - 1], - PointerGetDatum(key-entryRes), - UInt16GetDatum(key-strategy), - key-query, - UInt32GetDatum(key-nuserentries), - PointerGetDatum(key-extra_data), - PointerGetDatum(key-recheckCurItem), - PointerGetDatum(key-queryValues), - PointerGetDatum(key-queryCategories))); -} - /* * Goes to the next page if current offset is outside of bounds */ @@ -453,13 +418,31 @@ restartScanEntry: freeGinBtreeStack(stackEntry); } +static int +entryCmp(Datum a, Datum b, void *arg) +{ + GinScanEntry ea = (GinScanEntry) DatumGetPointer(a); + GinScanEntry eb = (GinScanEntry) DatumGetPointer(b); + return -ginCompareItemPointers(ea-curItem, eb-curItem); +} + static void startScanKey(GinState *ginstate, GinScanKey key) { + int i; ItemPointerSetMin(key-curItem); key-curItemMatches = false; key-recheckCurItem = false; key-isFinished = false; + + GinInitConsistentMethod(ginstate, key); + + key-entryHeap = binaryheap_allocate(key-nentries, entryCmp, NULL); + for (i = 0; i key-nentries; i++) + binaryheap_add(key-entryHeap, PointerGetDatum(key-scanEntry[i])); + + key-nunloaded = 0; + key-unloadedEntries = palloc(key-nentries * sizeof(GinScanEntry)); } static void @@ -649,6 +632,11 @@ entryLoadMoreItems(GinState *ginstate, GinScanEntry entry, ItemPointerData advan * * Item pointers are returned in ascending order. * + * If 'ifcheap' is passed as TRUE, the function only
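As a rough illustration of point (2) above, here is a hypothetical sketch (not code from the attached patch) of consuming items via the heap. It assumes the binaryheap API from src/include/lib/binaryheap.h plus the entryHeap/entryCmp additions shown in the diff; since entryCmp negates ginCompareItemPointers, the heap root is always the entry with the smallest current item pointer. The function name keyGetNextItem is invented here, and the caller is assumed to check that the heap is non-empty.

    static ItemPointerData
    keyGetNextItem(GinState *ginstate, GinScanKey key)
    {
        GinScanEntry    entry;
        ItemPointerData item;

        /* entry with the lowest curItem across all entries of this key */
        entry = (GinScanEntry) DatumGetPointer(binaryheap_first(key->entryHeap));
        item = entry->curItem;

        /* advance that entry and restore the heap invariant */
        entryGetItem(ginstate, entry);
        if (entry->isFinished)
            (void) binaryheap_remove_first(key->entryHeap);
        else
            binaryheap_replace_first(key->entryHeap, PointerGetDatum(entry));

        return item;
    }

Compared with scanning all entries linearly for the minimum, replacing the heap root and sifting down costs O(log n) comparisons per returned item, which is where the reduction in item pointer comparisons for many-entry queries comes from.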
Re: [HACKERS] SSL: better default ciphersuite
On Thu, Dec 12, 2013 at 04:32:07PM +0200, Marko Kreen wrote: Attached patch changes default ciphersuite to HIGH:MEDIUM:+3DES:!aNULL and also adds documentation about reasoning for it. This is the last pending SSL cleanup related patch: https://commitfest.postgresql.org/action/patch_view?id=1310 Peter, you have claimed it as committer, do you see any remaining issues with it? -- marko -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Wait free LW_SHARED acquisition - v0.2
Hi,

On 2014-02-01 19:47:29 -0800, Peter Geoghegan wrote: Here are the results of a benchmark on Nathan Boley's 64-core, 4 socket server: http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/amd-4-socket-rwlocks/

That's interesting. The maximum number of what you see here (~293125) is markedly lower than what I can get. ... poke around ... Hm, that's partially because you're using pgbench without -M prepared if I see that correctly. The bottleneck in that case is primarily memory allocation. But even after that I am getting higher numbers: ~342497. Trying to nail down the difference, it oddly seems to be your max_connections=80 vs my 100. The profile in both cases is markedly different, with way more spinlock contention with 80, all in Pin/UnpinBuffer(). I think =80 has to lead to some data being badly aligned. I can reproduce that =91 has *much* better performance than =90: 170841.844938 vs 368490.268577 in a 10s test. Reproducible both with and without the test. That's certainly worth some investigation. This is *not* reproducible on the Intel machine, so it might be the associativity of the L1/L2 cache on the AMD.

Perhaps I should have gone past 64 clients, because in the document Lock Scaling Analysis on Intel Xeon Processors [1], Intel write: This implies that with oversubscription (more threads running than available logical CPUs), the performance of spinlocks can depend heavily on the exact OS scheduler behavior, and may change drastically with operating system or VM updates. I haven't bothered with higher client counts though, because Andres noted it's the same with 90 clients on this AMD system. Andres: Do you see big problems when # clients > # logical cores on the affected Intel systems? There's some slowdown with the patch applied, but it's not big. Without it, the slowdown is much earlier.

There is only a marginal improvement in performance on this big 4 socket system. Andres informs me privately that he has reproduced the problem on multiple new 4-socket Intel servers, so it seems reasonable to suppose that it's more or less an Intel thing.

I've just poked around, it's not just 4 socket, but also 2 socket systems.

Some background: The setups that triggered me into working on the patchset didn't really have a pgbench-like workload, the individual queries were/are more complicated even though it's still a high-throughput OLTP workload. And the contention was *much* higher than what I can reproduce with pgbench -S; there was often nearly all time spent in the lwlock's spinlock, and it was primarily the buffer mapping lwlocks, being locked in shared mode. The difference is that instead of locking very few buffers per query like pgbench does, they touched many more.

If you look at a profile of a pgbench -S -cj64 workload, it's pretty much all bottlenecked by GetSnapshotData():

unpatched: - 40.91% postgres_plainl postgres_plainlw[.] s_lock - s_lock - 51.34% LWLockAcquire GetSnapshotData - GetTransactionSnapshot + 55.23% PortalStart + 44.77% PostgresMain - 48.66% LWLockRelease GetSnapshotData - GetTransactionSnapshot + 53.64% PostgresMain + 46.36% PortalStart + 2.65% pgbench [kernel.kallsyms] [k] try_to_wake_up + 2.61% postgres_plainl postgres_plainlw[.] LWLockRelease + 1.36% postgres_plainl postgres_plainlw[.] LWLockAcquire + 1.25% postgres_plainl postgres_plainlw[.] GetSnapshotData patched: - 2.94% postgres postgres [.]
LWLockAcquire - LWLockAcquire + 26.61% ReadBuffer_common + 17.08% GetSnapshotData + 10.71% _bt_relandgetbuf + 10.24% LockAcquireExtended + 10.11% VirtualXactLockTableInsert + 9.17% VirtualXactLockTableCleanup + 7.55% _bt_getbuf + 3.40% index_fetch_heap + 1.51% LockReleaseAll + 0.89% StartTransactionCommand + 0.83% _bt_next + 0.75% LockRelationOid + 0.52% ReadBufferExtended - 2.75% postgres postgres [.] _bt_compare - _bt_compare + 96.72% _bt_binsrch + 1.71% _bt_moveright + 1.29% _bt_search - 2.67% postgres postgres [.] GetSnapshotData - GetSnapshotData + 97.03% GetTransactionSnapshot + 2.16% PostgresMain + 0.81% PortalStart So now the profile looks much saner/less contended which immediately is visible in transaction rates: 192584.218530 vs 552834.002573. But if you try to attack the contention from the other end, by setting default_transaction_isolation='repeatable read' to reduce the number of snapshots taken, its suddenly 536789.807391 vs 566839.328922. A *much* smaller benefit. The reason the patch doesn't help that much with that setting is that there simply isn't as much actual contention there: + 2.77% postgres postgres[.] _bt_compare - 2.72% postgres postgres[.]
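A minimal SQL-level illustration of the snapshot-reuse effect mentioned above (table and columns are from the standard pgbench schema): under repeatable read, the snapshot taken by the first query is kept for the whole transaction, so GetSnapshotData() - and the ProcArrayLock traffic it causes - is entered far less often than under read committed, where every statement takes a fresh snapshot.

    SET default_transaction_isolation = 'repeatable read';
    BEGIN;
    SELECT abalance FROM pgbench_accounts WHERE aid = 1;  -- takes the transaction's snapshot
    SELECT abalance FROM pgbench_accounts WHERE aid = 2;  -- reuses it; no new snapshot
    COMMIT;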
Re: [HACKERS] [9.3 bug] disk space in pg_xlog increases during archive recovery
On Sun, Feb 2, 2014 at 5:49 AM, Andres Freund and...@2ndquadrant.com wrote: On 2014-01-24 22:31:17 +0900, MauMau wrote: From: Fujii Masao masao.fu...@gmail.com On Wed, Jan 22, 2014 at 6:37 AM, Heikki Linnakangas Thanks! The patch looks good to me. Attached is the updated version of the patch. I added the comments. Thank you very much. Your comment looks great. I tested some recovery situations, and confirmed that WAL segments were kept/removed as intended. I'll update the CommitFest entry with this patch.

You haven't updated the patch's status so far, so I've now marked it as returned with feedback, as several people voiced serious doubts about the approach. If that's not accurate please speak up.

The problem is that we might not be able to perform restartpoints more aggressively even if we reduce checkpoint_timeout on the server during archive recovery, because the frequency of restartpoints depends not only on that checkpoint_timeout but also on the checkpoints which happened while the server was running.

I haven't tried reducing checkpoint_timeout. Did you try reducing checkpoint_segments? As I pointed out, at least if standby_mode is enabled, it will also trigger checkpoints, independently from checkpoint_timeout.

Right. If standby_mode is enabled, checkpoint_segments can trigger a restartpoint. But the problem is that the timing of restartpoints depends not only on the checkpoint parameters (i.e., checkpoint_timeout and checkpoint_segments) that are used during archive recovery but also on the checkpoint WAL that was generated by the master.

For example, imagine the case where the master generated only one checkpoint WAL record since the last backup and then crashed with database corruption, and the DBA decided to perform normal archive recovery using the last backup. In this case, even if the DBA reduces both checkpoint_timeout and checkpoint_segments, only one restartpoint can occur during recovery. This low frequency of restartpoints might fill up the disk with WAL files. That would be harmless if the server we are performing recovery on has enough disk space. But I can imagine that some users want to recover the database and restart the service temporarily on a modest server without much disk space until they can purchase a sufficient one. In this case, accumulating lots of WAL files in pg_xlog might be harmful.

If the issue is that you're not using standby_mode (if so, why?), then maybe the fix is to make that apply to a wider range of situations.

I guess that he is not using standby_mode because, according to his first email in this thread, he said he would like to prevent WAL from accumulating in pg_xlog during normal archive recovery (i.e., PITR).

Regards,

--
Fujii Masao
[HACKERS] Misaligned BufferDescriptors causing major performance problems on AMD
Hi,

In the nearby thread at http://archives.postgresql.org/message-id/20140202140014.GM5930%40awork2.anarazel.de Peter and I discovered that there is a large performance difference between different max_connections settings on a larger machine (4x Opteron 6272, 64 cores together) in a read-only pgbench test... Just as a reference, we're talking about a performance degradation from 475963.613865 tps to 197744.913556 in a pgbench -S -cj64, just by setting max_connections to 90, from 91...

On 2014-02-02 15:00:14 +0100, Andres Freund wrote: On 2014-02-01 19:47:29 -0800, Peter Geoghegan wrote: Here are the results of a benchmark on Nathan Boley's 64-core, 4 socket server: http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/amd-4-socket-rwlocks/

That's interesting. The maximum number of what you see here (~293125) is markedly lower than what I can get. ... poke around ... Hm, that's partially because you're using pgbench without -M prepared if I see that correctly. The bottleneck in that case is primarily memory allocation. But even after that I am getting higher numbers: ~342497. Trying to nail down the difference, it oddly seems to be your max_connections=80 vs my 100. The profile in both cases is markedly different, with way more spinlock contention with 80, all in Pin/UnpinBuffer(). I think =80 has to lead to some data being badly aligned. I can reproduce that =91 has *much* better performance than =90: 170841.844938 vs 368490.268577 in a 10s test. Reproducible both with and without the test. That's certainly worth some investigation. This is *not* reproducible on the Intel machine, so it might be the associativity of the L1/L2 cache on the AMD.

So, I looked into this, and I am fairly certain it's because of the (mis-)alignment of the buffer descriptors. With certain max_connections settings InitBufferPool() happens to get 64-byte aligned addresses, with others not. I checked the alignment with gdb to confirm that. A quick hack (attached) making BufferDescriptors 64-byte aligned indeed restored performance across all max_connections settings. It's not surprising that a misaligned buffer descriptor causes problems - there'll be plenty of false sharing of the spinlocks otherwise. Curious that the Intel machine isn't hurt much by this.

Now all this hinges on the fact that by a mere accident BufferDescriptors are 64 bytes in size:

struct sbufdesc {
        BufferTag                  tag;                   /*     0    20 */
        BufFlags                   flags;                 /*    20     2 */
        uint16                     usage_count;           /*    22     2 */
        unsigned int               refcount;              /*    24     4 */
        int                        wait_backend_pid;      /*    28     4 */
        slock_t                    buf_hdr_lock;          /*    32     1 */

        /* XXX 3 bytes hole, try to pack */

        int                        buf_id;                /*    36     4 */
        int                        freeNext;              /*    40     4 */

        /* XXX 4 bytes hole, try to pack */

        LWLock *                   io_in_progress_lock;   /*    48     8 */
        LWLock *                   content_lock;          /*    56     8 */

        /* --- cacheline 1 boundary (64 bytes) --- */

        /* size: 64, cachelines: 1, members: 10 */
        /* sum members: 57, holes: 2, sum holes: 7 */
};

We could polish up the attached patch and apply it to all the branches, the cost in memory is minimal. But I wonder if we shouldn't instead make ShmemInitStruct() always return cacheline-aligned addresses. That will require some fiddling, but it might be a good idea nonetheless?

I think we should also consider some more reliable measures to have BufferDescriptors cacheline sized, rather than relying on the happy accident. Debugging alignment issues isn't fun, too much of a guessing game...

Thoughts?
Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c index e187242..96b4eea 100644 --- a/src/backend/storage/buffer/buf_init.c +++ b/src/backend/storage/buffer/buf_init.c @@ -78,10 +78,14 @@ InitBufferPool(void) BufferDescriptors = (BufferDesc *) ShmemInitStruct(Buffer Descriptors, NBuffers * sizeof(BufferDesc), foundDescs); + BufferDescriptors = (BufferDesc *)( + TYPEALIGN(64, BufferDescriptors)); BufferBlocks = (char *) ShmemInitStruct(Buffer Blocks, NBuffers * (Size) BLCKSZ, foundBufs); + BufferBlocks = (char *) ( + TYPEALIGN(64, BufferBlocks)); if (foundDescs || foundBufs) { @@ -167,9 +171,11 @@ BufferShmemSize(void) /* size of buffer descriptors */ size = add_size(size, mul_size(NBuffers, sizeof(BufferDesc))); + size = add_size(size, 64); /* size of data pages */ size = add_size(size, mul_size(NBuffers, BLCKSZ)); +
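A hypothetical sketch (not the attached patch) of one way to stop relying on the happy accident: pad each descriptor to a full cache line so its size is explicit, and keep the array itself cache-line aligned. The names BufferDescPadded, GetBufferDescriptor and CACHE_LINE_SIZE are invented here for illustration, and 64 is an assumption about the cache line size.

    #define CACHE_LINE_SIZE 64

    typedef union BufferDescPadded
    {
        BufferDesc  bufferdesc;
        char        pad[CACHE_LINE_SIZE];
    } BufferDescPadded;

    /* A StaticAssertStmt(sizeof(BufferDesc) <= CACHE_LINE_SIZE, "...") in the
     * buffer-pool init code would guard against the descriptor outgrowing the
     * pad.  The shared array then becomes an array of padded elements: */
    /* BufferDescPadded *BufferDescriptors; */
    #define GetBufferDescriptor(id) (&BufferDescriptors[(id)].bufferdesc)

Padding only fixes the size, not the placement: the start of the array still has to land on a 64-byte boundary, e.g. via the TYPEALIGN() hack in the attached patch or by teaching ShmemInitStruct() to hand out cacheline-aligned addresses as suggested above.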
Re: [HACKERS] [9.3 bug] disk space in pg_xlog increases during archive recovery
On 2014-02-02 23:50:40 +0900, Fujii Masao wrote: On Sun, Feb 2, 2014 at 5:49 AM, Andres Freund and...@2ndquadrant.com wrote: On 2014-01-24 22:31:17 +0900, MauMau wrote: I haven't tried reducing checkpoint_timeout. Did you try reducing checkpoint_segments? As I pointed out, at least if standby_mode is enabled, it will also trigger checkpoints, independently from checkpoint_timeout. Right. If standby_mode is enabled, checkpoint_segment can trigger the restartpoint. But the problem is that the timing of restartpoint depends on not only the checkpoint parameters (i.e., checkpoint_timeout and checkpoint_segments) that are used during archive recovery but also the checkpoint WAL that was generated by the master. Sure. But we really *need* all the WAL since the last checkpoint's redo location locally to be safe. For example, could you imagine the case where the master generated only one checkpoint WAL since the last backup and it crashed with database corruption. Then DBA decided to perform normal archive recovery by using the last backup. In this case, even if DBA reduces both checkpoint_timeout and checkpoint_segments, only one restartpoint can occur during recovery. This low frequency of restartpoint might fill up the disk space with lots of WAL files. I am not sure I understand the point of this scenario. If the primary crashed after a checkpoint, there won't be that much WAL since it happened... If the issue is that you're not using standby_mode (if so, why?), then the fix maybe is to make that apply to a wider range of situations. I guess that he is not using standby_mode because, according to his first email in this thread, he said he would like to prevent WAL from accumulating in pg_xlog during normal archive recovery (i.e., PITR). Well, that doesn't necessarily prevent you from using standby_mode... But yes, that might be the case. I wonder if we shouldn't just always look at checkpoint segments during !crash recovery. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pg_basebackup and pg_stat_tmp directory
On Sat, Feb 1, 2014 at 2:08 AM, Robert Haas robertmh...@gmail.com wrote: On Fri, Jan 31, 2014 at 8:40 AM, Fujii Masao masao.fu...@gmail.com wrote: Yeah, I was thinking that, too. I'm not sure whether including log files in backup really increases the security risk, though. There are already very important data, i.e., database, in backups. Anyway, since the amount of log files can be very large and they are not essential for recovery, it's worth considering whether to exclude them. OTOH, I'm sure that some users prefer current behavior for some reasons. So I think that it's better to expose the pg_basebackup option specifying whether log files are included in backups or not. I don't really see how this can work reliably; pg_log isn't a hard-coded name, but rather a GUC that users can leave set to that value or set to any other value they choose. I'm thinking to change basebackup.c so that it compares the name of the directory that it's trying to back up and the setting value of log_directory parameter, then, if they are the same, it just skips the directory. The patch that I sent upthread does this regarding stats_temp_directory. Regards, -- Fujii Masao -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
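A hypothetical helper for basebackup.c, sketching the comparison being described (this is an illustration, not the posted patch): it decides whether a directory entry, given as a path relative to the data directory, is the server log directory and should therefore be sent as an empty directory. It assumes the log_directory GUC's backing variable Log_directory is visible here, and uses the existing is_absolute_path() macro; it only handles the common case of a relative setting such as 'pg_log'.

    static bool
    is_log_directory(const char *relpath)
    {
        /* an absolute log_directory lives outside the data directory,
         * so it would not be part of the backup anyway */
        if (Log_directory == NULL || is_absolute_path(Log_directory))
            return false;

        return strcmp(relpath, Log_directory) == 0;
    }

The same shape of check is what the upthread patch does for stats_temp_directory; the interesting wrinkle for pg_log is that users can point log_directory anywhere, so only the relative-path case can be excluded this way.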
Re: [HACKERS] narwhal and PGDLLIMPORT
Dave Page dp...@pgadmin.org writes: On Sun, Feb 2, 2014 at 1:03 AM, Tom Lane t...@sss.pgh.pa.us wrote: I think we should give serious consideration to desupporting this combination so that we can get rid of the plague of PGDLLIMPORT marks. No objection here - though I should point out that it's not been offline for a long time (aside from a couple of weeks in January) - it's been happily building most pre-9.2 branches for ages. 9.1 seems to be stuck, along with HEAD, and I forgot to add 9.3. I'm in the process of cleaning that up as time allows, but am happy to drop it instead if we no longer want to support anything that old. We certainly don't use anything resembling that config for the EDB installer builds. Further discussion pointed out that currawong, for example, seems to want PGDLLIMPORT markings but is able to get by without them in some cases that narwhal evidently doesn't like. So at this point, desupporting narwhal's configuration is clearly premature --- we should instead be looking into exactly what is causing the different cases to fail or not fail. I still have hopes that we might be able to get rid of PGDLLIMPORT marks, but by actually understanding why they seem to be needed in some cases and not others, not by just arbitrarily dropping support. In the meantime, please do get HEAD running again on that machine. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [GENERAL] Insert result does not match record count
Vik Fearing vik.fear...@dalibo.com writes: Without re-doing the work, my IRC logs show that I was bothered by this in src/backend/tcop/postgres.c: max_rows = pq_getmsgint(input_message, 4); I needed to change max_rows to int64 which meant I had to change pq_getmsgint to pq_getmsgint64 which made me a little worried. As well you should be, because we are *not* doing that. That would be a guaranteed-incompatible protocol change. Fortunately, I don't see any functional need for widening the row-limit field in execute messages; how likely is it that someone wants to fetch exactly 3 billion rows? The practical use-cases for nonzero row limits generally involve fetching a bufferload worth of data at a time, so that the restriction to getting no more than INT_MAX rows at once is several orders of magnitude away from being a problem. The same goes for internal uses of row limits, which makes it questionable whether it's worth changing the width of ExecutorRun's count parameter, which is what I assume you were on about here. But in any case, if we did that we'd not try to reflect it as far as here, because the message format specs can't change. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [COMMITTERS] pgsql: Include planning time in EXPLAIN ANALYZE output.
Peter Geoghegan p...@heroku.com writes: On Wed, Jan 29, 2014 at 1:10 PM, Robert Haas rh...@postgresql.org wrote: Include planning time in EXPLAIN ANALYZE output. Isn't it perhaps a little confusing that Planning time may well exceed Total runtime? Perhaps s/Total runtime/Execution time/ ? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
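For reference, the kind of output being discussed (timings here are illustrative only); under the proposal the last line would read Execution time rather than Total runtime:

    =# EXPLAIN ANALYZE SELECT 1;
                                         QUERY PLAN
    ------------------------------------------------------------------------------------
     Result  (cost=0.00..0.01 rows=1 width=0) (actual time=0.002..0.003 rows=1 loops=1)
     Planning time: 0.021 ms
     Total runtime: 0.033 ms
    (3 rows)

For trivial queries like this one, planning can indeed take longer than execution, which is the confusion the rename is meant to avoid.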
Re: [HACKERS] mvcc catalo gsnapshots and TopTransactionContext
On 2014-01-31 16:41:33 -0500, Bruce Momjian wrote: On Mon, Oct 7, 2013 at 03:02:36PM -0400, Robert Haas wrote: On Fri, Oct 4, 2013 at 3:20 PM, Andres Freund and...@2ndquadrant.com wrote: On 2013-10-04 15:15:36 -0400, Robert Haas wrote: Andres, are you (or is anyone) going to try to fix this assertion failure? I think short term replacing it by IsTransactionOrTransactionBlock() is the way to go. Entirely restructuring how cache invalidation in the abort path works is not a small task. Well, if we're going to go that route, how about something like the attached? I included the assert-change per se, an explanatory comment, and the test case that Noah devised to cause the current assertion to fail. Is there any plan to commit this? IMO it has to be applied. Tom objected on the grounds that cache invalidation has to be fixed properly but that's a major restructuring of code that worked this way for a long time. So changing the Assert() to reflect that seems fair to me. I'd adapt the tests with a sentence explaining what they test, on a first look they are pretty obscure... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] slow startup due to LWLockAssign() spinlock
Hi, On larger, multi-socket, machines, startup takes a fair bit of time. As I was profiling anyway I looked into it and noticed that just about all of it is spent in LWLockAssign() called by InitBufferPool(). Starting with shared_buffers=48GB on the server Nate Boley provided, takes about 12 seconds. Nearly all of it spent taking the ShmemLock spinlock. Simply modifying LWLockAssign() to not take the spinlock when !IsUnderPostmaster speeds it up to 2 seconds. While certainly not making LWLockAssign() prettier it seems enough of a speedup to be worthwile nonetheless. Since this code is also hit when do an emergency restart, I'd say it has practical relevance... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c index 0679eb5..10ee364 100644 --- a/src/backend/storage/lmgr/lwlock.c +++ b/src/backend/storage/lmgr/lwlock.c @@ -451,9 +451,11 @@ CreateLWLocks(void) * LWLockAssign - assign a dynamically-allocated LWLock number * * We interlock this using the same spinlock that is used to protect - * ShmemAlloc(). Interlocking is not really necessary during postmaster - * startup, but it is needed if any user-defined code tries to allocate - * LWLocks after startup. + * ShmemAlloc(). Interlocking is not necessary during postmaster startup, but + * it is needed if any user-defined code tries to allocate LWLocks after + * startup. Since we allocate large amounts of LWLocks for the buffer pool, we + * avoid taking the spinlock when not needed, as it has shown to slowdown + * startup considerably. */ LWLock * LWLockAssign(void) @@ -464,14 +466,17 @@ LWLockAssign(void) volatile int *LWLockCounter; LWLockCounter = (int *) ((char *) MainLWLockArray - 3 * sizeof(int)); - SpinLockAcquire(ShmemLock); + if (IsUnderPostmaster) + SpinLockAcquire(ShmemLock); if (LWLockCounter[0] = LWLockCounter[1]) { - SpinLockRelease(ShmemLock); + if (IsUnderPostmaster) + SpinLockRelease(ShmemLock); elog(ERROR, no more LWLocks available); } result = MainLWLockArray[LWLockCounter[0]++].lock; - SpinLockRelease(ShmemLock); + if (IsUnderPostmaster) + SpinLockRelease(ShmemLock); return result; } -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Recovery inconsistencies, standby much larger than primary
I've poked at this a bit more. There are at least 10 relations where the last block doesn't match the block mentioned in the xlog record that its LSN indicates. At least that's what it looks like from the info xlogdump prints. That includes two blocks where the correct block has the same LSN, which maybe means the record was correctly replayed the second time. Or perhaps it just means I'm underestimating the complexity of btree splits.

[cur:EAD/511424E0, xid:1423308342, rmid:11(Btree), len/tot_len:3750/12786, info:77, prev:EAD/51142490] split_r: s/d/r:1663/16385/1521566 leftsib 1697585
[cur:EAD/511424E0, xid:1423308342, rmid:11(Btree), len/tot_len:3750/12786, info:77, prev:EAD/51142490] bkpblock[1]: s/d/r:1663/16385/1521566 blk:1697585 hole_off/len:576/4288
[cur:EAD/511424E0, xid:1423308342, rmid:11(Btree), len/tot_len:3750/12786, info:77, prev:EAD/51142490] bkpblock[2]: s/d/r:1663/16385/1521566 blk:786380 hole_off/len:740/3140

 relfilenode | blockno | lsn          | tli | flags | lower | upper | special | pagesize | version | prune_xid
-------------+---------+--------------+-----+-------+-------+-------+---------+----------+---------+-----------
     1521566 | 1697585 | EAD/511456E8 |   6 |     0 |   576 |  4864 |    8176 |     8192 |       4 |         0
     1521566 | 1704143 | EAD/511456E8 |   6 |     0 |   644 |  4456 |    8176 |     8192 |       4 |         0

[cur:EAD/520F0EE0, xid:1423309260, rmid:11(Btree), len/tot_len:4230/4450, info:69, prev:EAD/520F0E98] split_r: s/d/r:1663/16385/4995658 leftsib 139569
[cur:EAD/520F0EE0, xid:1423309260, rmid:11(Btree), len/tot_len:4230/4450, info:69, prev:EAD/520F0E98] bkpblock[2]: s/d/r:1663/16385/4995658 blk:18152 hole_off/len:28/8028

 relfilenode | blockno | lsn          | tli | flags | lower | upper | special | pagesize | version | prune_xid
-------------+---------+--------------+-----+-------+-------+-------+---------+----------+---------+-----------
     4995658 |  139569 | EAD/520F2058 |   6 |     0 |   152 |  4336 |    8176 |     8192 |       4 |         0
     4995658 |  139584 | EAD/520F2058 |   6 |     0 |   164 |  3976 |    8176 |     8192 |       4 |         0
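For anyone wanting to reproduce this kind of check: the per-block columns above look like pageinspect's page_header() output, so something along these lines does the same thing (the relation name and block number are placeholders, and the relfilenode has to be mapped back to a relation first):

    -- requires the pageinspect extension
    SELECT relname FROM pg_class WHERE relfilenode = 1521566;
    SELECT * FROM page_header(get_raw_page('that_relation', 1697585));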
Re: [HACKERS] mvcc catalo gsnapshots and TopTransactionContext
Andres Freund and...@2ndquadrant.com writes: On 2014-01-31 16:41:33 -0500, Bruce Momjian wrote: Is there any plan to commit this? IMO it has to be applied. Tom objected on the grounds that cache invalidation has to be fixed properly but that's a major restructuring of code that worked this way for a long time. So changing the Assert() to reflect that seems fair to me. The replacement Assert is wrong no? At least that's what was said upthread. More to the point, changing the Assert so it doesn't fire doesn't do one damn thing to ameliorate the fact that cache reload during transaction abort is wrong and unsafe. We need to fix the real problem not paper over it. The fact that the fix may be hard doesn't change that. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Recovery inconsistencies, standby much larger than primary
Hm, I'm not entirely convinced those are erroneous replays to wrong blocks. They don't look right but there are no blocks of NULs preceding them. So if they're wrong then they only extended the relations by a single block.

The relfilenodes that have nul blocks before the last block are:

 relfilenode | blockno  | segnum | offset   | lsn | tli | flags | lower | upper | special | pagesize | version | prune_xid
-------------+----------+--------+----------+-----+-----+-------+-------+-------+---------+----------+---------+-----------
     1261982 |  7141472 |     54 |  7141472 | 0/0 |   0 |     0 |     0 |     0 |       0 |        0 |       0 |         0
      473158 | 18090059 |    138 | 18090059 | 0/0 |   0 |     0 |     0 |     0 |       0 |        0 |       0 |         0
     1366221 |  2208159 |     16 |  2208159 | 0/0 |   0 |     0 |     0 |     0 |       0 |        0 |       0 |         0
     1364767 |  3631161 |     27 |  3631161 | 0/0 |   0 |     0 |     0 |     0 |       0 |        0 |       0 |         0
     1519286 |   311808 |      2 |   311808 | 0/0 |   0 |     0 |     0 |     0 |       0 |        0 |       0 |         0
Re: [HACKERS] Recovery inconsistencies, standby much larger than primary
Greg Stark st...@mit.edu writes: The relfilenodes that have nul blocks before the last block are: Can we see the associated WAL records (ie, the ones matching the LSNs in the last blocks of these files)? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] mvcc catalo gsnapshots and TopTransactionContext
On February 2, 2014 5:52:22 PM CET, Tom Lane t...@sss.pgh.pa.us wrote: Andres Freund and...@2ndquadrant.com writes: On 2014-01-31 16:41:33 -0500, Bruce Momjian wrote: Is there any plan to commit this? IMO it has to be applied. Tom objected on the grounds that cache invalidation has to be fixed properly but that's a major restructuring of code that worked this way for a long time. So changing the Assert() to reflect that seems fair to me.

The replacement Assert is wrong, no? At least that's what was said upthread.

Well, no. Noah's version isn't as strict as desirable, but it doesn't trigger in wrong cases. Thus it is still better than what's in 9.3 (nothing).

More to the point, changing the Assert so it doesn't fire doesn't do one damn thing to ameliorate the fact that cache reload during transaction abort is wrong and unsafe.

And, as upthread, I still don't think that's correct. I don't have sources available right now, but IIRC we already have aborted out of the transaction: released locks, the xid and everything. We just haven't changed the state yet - and as far as I recall we can't naively do so, otherwise we'd do incomplete cleanups if we got interrupted somehow.

So yes, it's not pretty and it's really hard to verify. But that doesn't make it entirely wrong. I don't see the point of an assert that triggers for code (and coders) that can't do anything about it because their code is correct - all the while the assertion actually triggers for ugly but correct code.

Andres

--
Please excuse brevity and formatting - I am writing this on my mobile phone.

Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Recovery inconsistencies, standby much larger than primary
On Sun, Feb 2, 2014 at 6:03 PM, Tom Lane t...@sss.pgh.pa.us wrote: Greg Stark st...@mit.edu writes: The relfilenodes that have nul blocks before the last block are: Can we see the associated WAL records (ie, the ones matching the LSNs in the last blocks of these files)? Sorry, I've lost track of what information I already shared or didn't, i've been staring at these records all day. It would be so much easier if xlogdump was a fdw or set returning function so I could do joins with pageinspect data: This is the information from the final block headers: relfilenode | blockno | segnum | offset |lsn | tli | flags | lower | upper | special | pagesize | version | prune_xid -+--++--++-+---+---+---+-+--+-+ 473158 | 18090059 |138 | 18090059 | EA1/625E28 | 6 | 0 | 144 | 896 |8192 | 8192 | 4 | 1401029863 1366221 | 2208159 | 16 | 2208159 | EA1/62DDF0 | 6 | 0 | 1180 | 3552 |8176 | 8192 | 4 | 0 1261982 | 7141472 | 54 | 7141472 | EA1/638988 | 6 | 0 | 1240 | 3312 |8176 | 8192 | 4 | 0 1364767 | 3631161 | 27 | 3631161 | EA1/63A0A8 | 6 | 0 | 1180 | 3552 |8176 | 8192 | 4 | 0 1519286 | 311808 | 2 | 311808 | EA1/708B08 | 6 | 1 | 312 | 3040 |8192 | 8192 | 4 | 0 These are all the records from the tx that corresponds to the lsn in the first relfilenode: [cur:EA1/6240C0, xid:1418089146, rmid:10(Heap), len/tot_len:28/7524, info:72, prev:EA1/624080] hot_update: s/d/r:1663/16385/473158 block 9386399 off 29 to block 9386399 off 30 [cur:EA1/6240C0, xid:1418089146, rmid:10(Heap), len/tot_len:28/7524, info:72, prev:EA1/624080] bkpblock[1]: s/d/r:1663/16385/473158 blk:9386399 hole_off/len:144/752 [cur:EA1/625E28, xid:1418089146, rmid:1(Transaction), len/tot_len:32/64, info:0, prev:EA1/6240C0] d/s:16385/1663 commit at 2014-01-21 05:41:11 UTC The middle three blocks have LSNs that are all part of the same tx, here are all the records for this tx: [cur:EA1/625F28, xid:1418089147, rmid:10(Heap), len/tot_len:21/6441, info:8, prev:EA1/625E68] insert: s/d/r:1663/16385/473158 blk/off:9386398/33 header: none [cur:EA1/625F28, xid:1418089147, rmid:10(Heap), len/tot_len:21/6441, info:8, prev:EA1/625E68] bkpblock[1]: s/d/r:1663/16385/473158 blk:9386398 hole_off/len:156/1828 [cur:EA1/627868, xid:1418089147, rmid:11(Btree), len/tot_len:18/8214, info:8, prev:EA1/625F28] insert_leaf: s/d/r:1663/16385/473176 tid 1319804/405 [cur:EA1/627868, xid:1418089147, rmid:11(Btree), len/tot_len:18/8214, info:8, prev:EA1/625F28] bkpblock[1]: s/d/r:1663/16385/473176 blk:1319804 hole_off/len:1644/52 [cur:EA1/629898, xid:1418089147, rmid:11(Btree), len/tot_len:18/6494, info:8, prev:EA1/627868] insert_leaf: s/d/r:1663/16385/473182 tid 1186167/147 [cur:EA1/629898, xid:1418089147, rmid:11(Btree), len/tot_len:18/6494, info:8, prev:EA1/627868] bkpblock[1]: s/d/r:1663/16385/473182 blk:1186167 hole_off/len:1300/1772 [cur:EA1/62B210, xid:1418089147, rmid:11(Btree), len/tot_len:18/5314, info:8, prev:EA1/629898] insert_leaf: s/d/r:1663/16385/1270734 tid 1294137/2 [cur:EA1/62B210, xid:1418089147, rmid:11(Btree), len/tot_len:18/5314, info:8, prev:EA1/629898] bkpblock[1]: s/d/r:1663/16385/1270734 blk:1294137 hole_off/len:1064/2952 [cur:EA1/62C6E8, xid:1418089147, rmid:11(Btree), len/tot_len:18/5894, info:8, prev:EA1/62B210] insert_leaf: s/d/r:1663/16385/1366221 tid 1364573/66 [cur:EA1/62C6E8, xid:1418089147, rmid:11(Btree), len/tot_len:18/5894, info:8, prev:EA1/62B210] bkpblock[1]: s/d/r:1663/16385/1366221 blk:1364573 hole_off/len:1180/2372 [cur:EA1/62DDF0, xid:1418089147, rmid:11(Btree), len/tot_len:18/4814, info:8, prev:EA1/62C6E8] insert_leaf: 
s/d/r:1663/16385/1404440 tid 1195953/51 [cur:EA1/62DDF0, xid:1418089147, rmid:11(Btree), len/tot_len:18/4814, info:8, prev:EA1/62C6E8] bkpblock[1]: s/d/r:1663/16385/1404440 blk:1195953 hole_off/len:964/3452 [cur:EA1/62F0D8, xid:1418089147, rmid:11(Btree), len/tot_len:18/6862, info:8, prev:EA1/62DDF0] insert_leaf: s/d/r:1663/16385/1410405 tid 1894183/39 [cur:EA1/62F0D8, xid:1418089147, rmid:11(Btree), len/tot_len:18/6862, info:8, prev:EA1/62DDF0] bkpblock[1]: s/d/r:1663/16385/1410405 blk:1894183 hole_off/len:988/1404 [cur:EA1/630BC0, xid:1418089147, rmid:11(Btree), len/tot_len:18/7254, info:8, prev:EA1/62F0D8] insert_leaf: s/d/r:1663/16385/1521566 tid 1691110/132 [cur:EA1/630BC0, xid:1418089147, rmid:11(Btree), len/tot_len:18/7254, info:8, prev:EA1/62F0D8] bkpblock[1]: s/d/r:1663/16385/1521566 blk:1691110 hole_off/len:1044/1012 [cur:EA1/632830, xid:1418089147, rmid:11(Btree), len/tot_len:18/5174, info:8, prev:EA1/630BC0] insert_leaf: s/d/r:1663/16385/5285587 tid 386419/102 [cur:EA1/632830, xid:1418089147, rmid:11(Btree), len/tot_len:18/5174, info:8, prev:EA1/630BC0] bkpblock[1]: s/d/r:1663/16385/5285587 blk:386419
Re: [HACKERS] mvcc catalo gsnapshots and TopTransactionContext
Andres Freund and...@2ndquadrant.com writes: On February 2, 2014 5:52:22 PM CET, Tom Lane t...@sss.pgh.pa.us wrote: More to the point, changing the Assert so it doesn't fire doesn't do one damn thing to ameliorate the fact that cache reload during transaction abort is wrong and unsafe. And, as upthread, I still don't think that's correct. I don't have sources available right now, but IIRC we already have aborted out of the transaction. Released locks, the xid and everything. Nope ... the case exhibited in the example is dying in AtEOSubXact_Inval, which is in the very midst of subxact abort. I've been thinking about this for the past little while, and I believe that it's probably okay to have RelationClearRelation leave the relcache entry un-rebuilt, but with rd_isvalid = false so it will be rebuilt when next opened. The rationale is explained in the comments in the attached patch. I've checked that this fixes Noah's test case and still passes the existing regression tests. regards, tom lane diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c index 2a46cfc..791ddc6 100644 *** a/src/backend/utils/cache/relcache.c --- b/src/backend/utils/cache/relcache.c *** RelationClearRelation(Relation relation, *** 1890,1900 * in case it is a mapped relation whose mapping changed. * * If it's a nailed index, then we need to re-read the pg_class row to see ! * if its relfilenode changed. We can't necessarily do that here, because ! * we might be in a failed transaction. We assume it's okay to do it if ! * there are open references to the relcache entry (cf notes for ! * AtEOXact_RelationCache). Otherwise just mark the entry as possibly ! * invalid, and it'll be fixed when next opened. */ if (relation-rd_isnailed) { --- 1890,1898 * in case it is a mapped relation whose mapping changed. * * If it's a nailed index, then we need to re-read the pg_class row to see ! * if its relfilenode changed. We do that immediately if we're inside a ! * valid transaction. Otherwise just mark the entry as possibly invalid, ! * and it'll be fixed when next opened. */ if (relation-rd_isnailed) { *** RelationClearRelation(Relation relation, *** 1903,1909 if (relation-rd_rel-relkind == RELKIND_INDEX) { relation-rd_isvalid = false; /* needs to be revalidated */ ! if (relation-rd_refcnt 1) RelationReloadIndexInfo(relation); } return; --- 1901,1907 if (relation-rd_rel-relkind == RELKIND_INDEX) { relation-rd_isvalid = false; /* needs to be revalidated */ ! if (IsTransactionState()) RelationReloadIndexInfo(relation); } return; *** RelationClearRelation(Relation relation, *** 1921,1927 relation-rd_indexcxt != NULL) { relation-rd_isvalid = false; /* needs to be revalidated */ ! RelationReloadIndexInfo(relation); return; } --- 1919,1926 relation-rd_indexcxt != NULL) { relation-rd_isvalid = false; /* needs to be revalidated */ ! if (IsTransactionState()) ! RelationReloadIndexInfo(relation); return; } *** RelationClearRelation(Relation relation, *** 1942,1947 --- 1941,1969 /* And release storage */ RelationDestroyRelation(relation); } + else if (!IsTransactionState()) + { + /* + * If we're not inside a valid transaction, we can't do any catalog + * access so it's not possible to rebuild yet. Just exit, leaving + * rd_isvalid = false so that the rebuild will occur when the entry is + * next opened. + * + * Note: it's possible that we come here during subtransaction abort, + * and the reason for wanting to rebuild is that the rel is open in + * the outer transaction. 
In that case it might seem unsafe to not + * rebuild immediately, since whatever code has the rel already open + * will keep on using the relcache entry as-is. However, in such a + * case the outer transaction should be holding a lock that's + * sufficient to prevent any significant change in the rel's schema, + * so the existing entry contents should be good enough for its + * purposes; at worst we might be behind on statistics updates or the + * like. (See also CheckTableNotInUse() and its callers.) These + * same remarks also apply to the cases above where we exit without + * having done RelationReloadIndexInfo() yet. + */ + return; + } else { /* *** RelationClearRelation(Relation relation, *** 2054,2059 --- 2076,2082 * RelationFlushRelation * * Rebuild the relation if it is open (refcount 0), else blow it away. + * This is used when we receive a cache invalidation event for the rel. */ static void RelationFlushRelation(Relation relation) -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your
Re: [HACKERS] [COMMITTERS] pgsql: Include planning time in EXPLAIN ANALYZE output.
On Sun, Feb 2, 2014 at 8:13 AM, Tom Lane t...@sss.pgh.pa.us wrote: Perhaps s/Total runtime/Execution time/ ? +1 -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [COMMITTERS] pgsql: Include planning time in EXPLAIN ANALYZE output.
On 03/02/14 09:44, Peter Geoghegan wrote: On Sun, Feb 2, 2014 at 8:13 AM, Tom Lane t...@sss.pgh.pa.us wrote: Perhaps s/Total runtime/Execution time/ ? +1

If the planning was ever made into a parallel process, then 'elapsed time' would be less than the 'processor time'. So what does 'Execution time' mean? Can I assume: 'Total runtime' is 'elapsed time' and 'Execution time' is 'processor time'? In a parallel implementation, one would likely want both.

Possibly this is not an issue now, and I'm thinking too far ahead!

Cheers,
Gavin
[HACKERS] CacheInvalidateRelcache in btree is a crummy idea
While investigating the assertion failure Noah presented at http://www.postgresql.org/message-id/20130805170931.ga369...@tornado.leadboat.com I was confused for a bit as to why we're getting a relcache rebuild request for t1's index at exit of the subtransaction, when the subtransaction hasn't made any change to the index's schema. Investigation showed that the reason is that _bt_getroot() creates a root page for the index during the subtransaction's INSERT, and then it does CacheInvalidateRelcache() to notify other backends that it's updated the index's metapage. Now, this is a dumb idea, because relcache invalidation is assumed to be a transactional operation, which means that if the subtransaction rolls back, we don't send any invalidation event to other backends. Update of a btree metapage, however, is not a transactional thing. So if we actually *needed* that notification to be sent, this would be a bug. But we don't need that. _bt_getroot() is the primary user of the cached metapage, and it sufficiently verifies that the page it's reached is the right one, which is why we've not seen bug reports. _bt_getrootheight() also uses the cached data, but that's only for a statistical estimate, so it doesn't matter if it's a bit stale. Moreover, if we did need reliable transmission of the invalidation message, that would've gotten broken even more by commit 96675bff1, which suppressed doing that when InRecovery. (In fairness, it wasn't a problem at the time since that was pre-Hot Standby. But the lack of notification would be an issue now for hot standby sessions, if they needed one.) So I'm thinking my commit d2896a9ed, which introduced this mechanism, was poorly thought out and we should just remove the relcache invals as per the attached patch. Letting _bt_getroot() update the cached metapage at next use should be a lot cheaper than a full relcache rebuild for the index. Note: this change will mean that Noah's test case no longer triggers the problem discussed in that thread ... but I'm sure we can find another test case for that. regards, tom lane diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README index 40f09e3..bf5bc38 100644 *** a/src/backend/access/nbtree/README --- b/src/backend/access/nbtree/README *** location of the root page --- both the t *** 438,451 root (fast root). To avoid fetching the metapage for every single index search, we cache a copy of the meta-data information in the index's relcache entry (rd_amcache). This is a bit ticklish since using the cache ! implies following a root page pointer that could be stale. We require ! every metapage update to send out a SI relcache inval message on the ! index relation. That ensures that each backend will flush its cached copy ! not later than the start of its next transaction. Therefore, stale ! pointers cannot be used for longer than the current transaction, which ! reduces the problem to the same one already dealt with for concurrent ! VACUUM --- we can just imagine that each open transaction is potentially ! already in flight to the old root. The algorithm assumes we can fit at least three items per page (a high key and two real data items). Therefore it's unsafe --- 438,454 root (fast root). To avoid fetching the metapage for every single index search, we cache a copy of the meta-data information in the index's relcache entry (rd_amcache). This is a bit ticklish since using the cache ! implies following a root page pointer that could be stale. However, a ! 
backend following a cached pointer can sufficiently verify whether it ! reached the intended page; either by checking the is-root flag when it ! is going to the true root, or by checking that the page has no siblings ! when going to the fast root. At worst, this could result in descending ! some extra tree levels if we have a cached pointer to a fast root that is ! now above the real fast root. Such cases shouldn't arise often enough to ! be worth optimizing; and in any case we can expect a relcache flush will ! discard the cached metapage before long, since a VACUUM that's moved the ! fast root pointer can be expected to issue a statistics update for the ! index. The algorithm assumes we can fit at least three items per page (a high key and two real data items). Therefore it's unsafe diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c index 49c084a..9140749 100644 *** a/src/backend/access/nbtree/nbtinsert.c --- b/src/backend/access/nbtree/nbtinsert.c *** *** 21,27 #include miscadmin.h #include storage/lmgr.h #include storage/predicate.h - #include utils/inval.h #include utils/tqual.h --- 21,26 *** _bt_insertonpg(Relation rel, *** 868,880 END_CRIT_SECTION(); ! /* release buffers; send out relcache inval if metapage changed */ if (BufferIsValid(metabuf)) - { -
Re: [HACKERS] [COMMITTERS] pgsql: Include planning time in EXPLAIN ANALYZE output.
Gavin Flower gavinflo...@archidevsys.co.nz writes: Can I assume: 'Total runtime' is 'elapsed time' and 'Execution time' is 'processor time'. No. It's going to be elapsed time, either way. In a parallel implementation, one would likely want both. When and if we have that, we can argue about what to measure. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pg_basebackup and pg_stat_tmp directory
On 2/2/14, 10:23 AM, Fujii Masao wrote: I'm thinking to change basebackup.c so that it compares the name of the directory that it's trying to back up and the setting value of log_directory parameter, then, if they are the same, it just skips the directory. The patch that I sent upthread does this regarding stats_temp_directory. I'm undecided on whether log files should be copied, but in case we decide not to, it needs to be considered whether we at least recreate the pg_log directory on the standby. Otherwise weird things will happen when you start the standby, and it would introduce an extra fixup step to sort that out. Extra credit for doing something useful when pg_log is a symlink. I fear, however, that if you end up implementing all that logic, it would become too much special magic. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] mvcc catalog snapshots and TopTransactionContext
On 2014-02-02 15:16:45 -0500, Tom Lane wrote: Andres Freund and...@2ndquadrant.com writes: On February 2, 2014 5:52:22 PM CET, Tom Lane t...@sss.pgh.pa.us wrote: More to the point, changing the Assert so it doesn't fire doesn't do one damn thing to ameliorate the fact that cache reload during transaction abort is wrong and unsafe. And, as upthread, I still don't think that's correct. I don't have sources available right now, but IIRC we already have aborted out of the transaction. Released locks, the xid and everything. Nope ... the case exhibited in the example is dying in AtEOSubXact_Inval, which is in the very midst of subxact abort. True. But we've done LWLockReleaseAll(), TransactionIdAbortTree(), XidCacheRemoveRunningXids() and ResourceOwnerRelease(RESOURCE_RELEASE_BEFORE_LOCKS), which is why we are currently able to build correct entries, even though we are in an aborted transaction. I've been thinking about this for the past little while, and I believe that it's probably okay to have RelationClearRelation leave the relcache entry un-rebuilt, but with rd_isvalid = false so it will be rebuilt when next opened. The rationale is explained in the comments in the attached patch. I've checked that this fixes Noah's test case and still passes the existing regression tests. Hm, a bit scary, but I don't see an immediate problem. The following comment now is violated for nailed relations * We assume that at the time we are called, we have at least AccessShareLock * on the target index. (Note: in the calls from RelationClearRelation, * this is legitimate because we know the rel has positive refcount.) but that should be easy to fix. I wonder though, if we couldn't just stop doing the RelationReloadIndexInfo() for nailed indexes. The corresponding comment says: * If it's a nailed index, then we need to re-read the pg_class row to see * if its relfilenode changed. We do that immediately if we're inside a * valid transaction. Otherwise just mark the entry as possibly invalid, * and it'll be fixed when next opened. */ but any relfilenode change should have already been handled by RelationInitPhysicalAddr()? Do you plan to backpatch this? If so, even to 8.4? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] GSOC13 proposal - extend RETURNING syntax
On Wed, Aug 21, 2013 at 08:52:25PM +0200, Karol Trzcionka wrote: W dniu 21.08.2013 19:17, Boszormenyi Zoltan pisze: With this fixed, a more complete review: Thanks. I've done some syntactic and white space cleanup, here attached. Karol, would you care to help with commenting the sections that need same? Cheers, David. -- David Fetter da...@fetter.org http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fet...@gmail.com iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate *** a/doc/src/sgml/ref/update.sgml --- b/doc/src/sgml/ref/update.sgml *** *** 194,205 UPDATE [ ONLY ] replaceable class=PARAMETERtable_name/replaceable [ * ] [ termreplaceable class=PARAMETERoutput_expression/replaceable/term listitem para ! An expression to be computed and returned by the commandUPDATE/ ! command after each row is updated. The expression can use any ! column names of the table named by replaceable class=PARAMETERtable_name/replaceable ! or table(s) listed in literalFROM/. ! Write literal*/ to return all columns. /para /listitem /varlistentry --- 194,220 termreplaceable class=PARAMETERoutput_expression/replaceable/term listitem para ! An expression to be computed and returned by the ! commandUPDATE/ command either before or after (prefixed with ! literalBEFORE./literal and literalAFTER./literal, ! respectively) each row is updated. The expression can use any ! column names of the table named by replaceable ! class=PARAMETERtable_name/replaceable or table(s) listed in ! literalFROM/. Write literalAFTER.*/literal to return all ! columns after the update. Write literalBEFORE.*/literal for all ! columns before the update. Write literal*/literal to return all ! columns after update and all triggers fired (these values are in table ! after command). You may combine BEFORE, AFTER and raw columns in the ! expression. /para + warningpara + Mixing table names or aliases named before or after with the + above will result in confusion and suffering. If you happen to + have a table called literalbefore/literal or + literalafter/literal, alias it to something else when using + RETURNING. + /para/warning + /listitem /varlistentry *** *** 287,301 UPDATE weather SET temp_lo = temp_lo+1, temp_hi = temp_lo+15, prcp = DEFAULT /para para !Perform the same operation and return the updated entries: programlisting UPDATE weather SET temp_lo = temp_lo+1, temp_hi = temp_lo+15, prcp = DEFAULT WHERE city = 'San Francisco' AND date = '2003-07-03' ! RETURNING temp_lo, temp_hi, prcp; /programlisting /para para Use the alternative column-list syntax to do the same update: programlisting --- 302,317 /para para !Perform the same operation and return information on the changed entries: programlisting UPDATE weather SET temp_lo = temp_lo+1, temp_hi = temp_lo+15, prcp = DEFAULT WHERE city = 'San Francisco' AND date = '2003-07-03' ! 
RETURNING temp_lo AS new_low, temp_hi AS new_high, BEFORE.temp_hi/BEFORE.temp_low AS old_ratio, AFTER.temp_hi/AFTER.temp_low AS new_ratio prcp; /programlisting /para + para Use the alternative column-list syntax to do the same update: programlisting *** a/src/backend/commands/trigger.c --- b/src/backend/commands/trigger.c *** *** 2335,2341 ExecASUpdateTriggers(EState *estate, ResultRelInfo *relinfo) TupleTableSlot * ExecBRUpdateTriggers(EState *estate, EPQState *epqstate, ResultRelInfo *relinfo, !ItemPointer tupleid, TupleTableSlot *slot) { TriggerDesc *trigdesc = relinfo-ri_TrigDesc; HeapTuple slottuple = ExecMaterializeSlot(slot); --- 2335,2341 TupleTableSlot * ExecBRUpdateTriggers(EState *estate, EPQState *epqstate, ResultRelInfo *relinfo, !ItemPointer tupleid, TupleTableSlot *slot, TupleTableSlot **planSlot) { TriggerDesc *trigdesc = relinfo-ri_TrigDesc; HeapTuple slottuple = ExecMaterializeSlot(slot); *** *** 2382,2387 ExecBRUpdateTriggers(EState *estate, EPQState *epqstate, --- 2382,2388 if (newSlot != NULL) { slot = ExecFilterJunk(relinfo-ri_junkFilter, newSlot); + *planSlot = newSlot; slottuple = ExecMaterializeSlot(slot); newtuple = slottuple; } *** a/src/backend/executor/nodeModifyTable.c --- b/src/backend/executor/nodeModifyTable.c ***
Re: [HACKERS] Recovery inconsistencies, standby much larger than primary
Greg Stark st...@mit.edu writes: On Sun, Feb 2, 2014 at 6:03 PM, Tom Lane t...@sss.pgh.pa.us wrote: Can we see the associated WAL records (ie, the ones matching the LSNs in the last blocks of these files)? Sorry, I've lost track of what information I already shared or didn't, Hm. So one of these is a heap update, not an index update, which lets out the theory that it's something specific to indexes. But they are all full-page-image updates, so the WAL replay code path for full-page images still seems to be the suspect. What version were you running before 9.1.11 exactly? I took a look through all the diffs from 9.1.9 up to 9.1.11, and couldn't find any changes that seemed even vaguely related to this. There are some changes in known-transaction tracking, but it's hard to see a connection there. Most of the other diffs are in code that wouldn't execute during WAL replay at all. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] GIN improvements part2: fast scan
On 2.2.2014 11:45, Heikki Linnakangas wrote: On 01/30/2014 01:53 AM, Tomas Vondra wrote: (3) A file with explain plans for 4 queries suffering ~2x slowdown, and explain plans with 9.4 master and Heikki's patches is available here: http://www.fuzzy.cz/tmp/gin/queries.txt All the queries have 6 common words, and the explain plans look just fine to me - exactly like the plans for other queries. Two things now caught my eye. First some of these queries actually have words repeated - either exactly like database database or in negated form like !anything anything. Second, while generating the queries, I use dumb frequency, where only exact matches count. I.e. write != written etc. But the actual number of hits may be much higher - for example write matches exactly just 5% documents, but using @@ it matches more than 20%. I don't know if that's the actual cause though. Ok, here's another variant of these patches. Compared to git master, it does three things: 1. It adds the concept of ternary consistent function internally, but no catalog changes. It's implemented by calling the regular boolean consistent function both ways. 2. Use a binary heap to get the next item from the entries in a scan. I'm pretty sure this makes sense, because arguably it makes the code more readable, and reduces the number of item pointer comparisons significantly for queries with a lot of entries. 3. Only perform the pre-consistent check to try skipping entries, if we don't already have the next item from the entry loaded in the array. This is a tradeoff, you will lose some of the performance gain you might get from pre-consistent checks, but it also limits the performance loss you might get from doing useless pre-consistent checks. So taken together, I would expect this patch to make some of the performance gains less impressive, but also limit the loss we saw with some of the other patches. Tomas, could you run your test suite with this patch, please? Sure, will do. Do I get it right that this should be applied instead of the four patches you've posted earlier? Tomas -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
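Point 2 above (using a binary heap to pick the next item from many entry streams) can be illustrated with a small standalone C sketch. The names are invented and this is not the GIN code itself; plain integers stand in for item pointers, and each "entry" yields them in ascending order.

#include <stdio.h>

typedef struct EntryStream
{
    const long *items;          /* sorted "item pointers" for this entry */
    int         nitems;
    int         pos;            /* next unread position */
} EntryStream;

static long head(const EntryStream *e) { return e->items[e->pos]; }

/* Restore the min-heap property starting at slot i. */
static void
sift_down(EntryStream **heap, int n, int i)
{
    for (;;)
    {
        int smallest = i, l = 2 * i + 1, r = 2 * i + 2;

        if (l < n && head(heap[l]) < head(heap[smallest])) smallest = l;
        if (r < n && head(heap[r]) < head(heap[smallest])) smallest = r;
        if (smallest == i)
            break;
        EntryStream *tmp = heap[i]; heap[i] = heap[smallest]; heap[smallest] = tmp;
        i = smallest;
    }
}

int
main(void)
{
    const long a[] = {1, 4, 9}, b[] = {2, 4, 8}, c[] = {3, 9, 12};
    EntryStream s[3] = {{a, 3, 0}, {b, 3, 0}, {c, 3, 0}};
    EntryStream *heap[3] = {&s[0], &s[1], &s[2]};
    int         n = 3;

    /* build the heap over the entries' current heads */
    for (int i = n / 2 - 1; i >= 0; i--)
        sift_down(heap, n, i);

    /* repeatedly emit the minimum head, advance that stream, re-heapify */
    while (n > 0)
    {
        printf("%ld ", head(heap[0]));
        if (++heap[0]->pos >= heap[0]->nitems)
            heap[0] = heap[--n];        /* stream exhausted: drop it */
        sift_down(heap, n, 0);
    }
    printf("\n");
    return 0;
}

Popping the minimum and re-sifting costs O(log n) comparisons per returned item instead of scanning all n entries every time, which is where the reduction in item pointer comparisons comes from.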
[HACKERS] Multiple calls to json_array_elements slow down nonlinearly
Hi all I ran into this performance report over the weekend: http://stackoverflow.com/q/21507127/398670 and wanted to mention it here. json_array_elements seems to spend about 97% of its time in MemoryContextReset(...). Given dummy data: test= create table g as select (select json_agg(random()) json from generate_series(0, (r1*4)::int)) from (select random() r1 from generate_series(1,2)) aux; Compare these two methods of producing the same result set: test= create table q as select json->x foo from g, generate_series(0,json_array_length(g.json)-1) x; SELECT 60103 Time: 157.702 ms test= create table p as select json_array_elements(json) foo from g; SELECT 60103 Time: 4254.494 ms The issue is reproducible and scales non-linearly with row count, which is a clue. At 100k rows input, the lateral query takes 592ms vs 179959ms (3 minutes) for json_array_elements. Whenever I grab a backtrace it looks like: #0 0x0072dd7d in MemoryContextReset (context=0x2a02dc90) at mcxt.c:130 #1 0x0072dd90 in MemoryContextResetChildren (context=<optimized out>) at mcxt.c:155 #2 MemoryContextReset (context=0x1651220) at mcxt.c:131 #3 0x005817f9 in ExecScan (node=node@entry=0x164e1a0, accessMtd=accessMtd@entry=0x592040 <SeqNext>, recheckMtd=recheckMtd@entry=0x592030 <SeqRecheck>) at execScan.c:155 (Sorry for the quote-paste; only way to make @#$ Thunderbird not wrap mail, I need to switch clients or fix that). perf top on the process shows: 96.92% postgres [.] MemoryContextReset 0.15% [kernel] [k] cpuacct_account_field 0.09% [kernel] [k] update_cfs_rq_blocked_load 0.09% postgres [.] AllocSetAlloc At a guess, we're looking at a case where a new child context is created at every call, so every MemoryContextResetChildren call has to deal with more child contexts. I'm going to take a quick look now, I just wanted to get this written up before I got sidetracked. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
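A toy model (plain C, not PostgreSQL code) of the suspected behaviour: if every call leaks one child context into the per-query context, a reset that visits all children does quadratically more total work as the row count grows, which would match the non-linear slowdown reported above.

#include <stdio.h>

int
main(void)
{
    /*
     * Model: each of 'rows' function calls creates one more child memory
     * context, and every subsequent reset walks the whole (growing) list.
     * Total work is 1 + 2 + ... + rows, i.e. O(rows^2).
     */
    for (long rows = 10000; rows <= 100000; rows *= 10)
    {
        long    children = 0;
        long    work = 0;

        for (long call = 0; call < rows; call++)
        {
            children++;          /* a new temporary context is created ... */
            work += children;    /* ... and every reset visits all of them */
        }
        printf("%7ld rows -> %12ld child-context visits\n", rows, work);
    }
    return 0;
}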
Re: [HACKERS] Making strxfrm() blobs in indexes work
On Thu, Jan 30, 2014 at 8:51 PM, Peter Geoghegan p...@heroku.com wrote: I've done some more digging. It turns out that the 1977 paper An Encoding Method for Multifield Sorting and Indexing describes a technique that involves concatenating multiple column values and comparing them using a simple strcmp(). So I thought about it again, and realized that this probably can't be made to work given our present assumptions. Commit 656beff59 added the varstr_cmp() strcmp() tie-breaker. If we cannot trust strcoll() to indicate equality, why should we be able to trust strxfrm() + strcmp() to do so, without an equivalent tie-breaker on the original string? Now, I guess it might be good enough that the comparison in the inner page indicated equivalence provided the comparison on leaf pages indicated equality proper. That seems like the kind of distinction that I'd rather not have to make, though. You could end up with two non-equal but equivalent tuples in an inner page. Seems pretty woolly to me. However, it also occurs to me that strxfrm() blobs have another useful property: We (as, say, the author of an equality operator on text, an operator intended for a btree operator class) *can* trust a strcmp()'s result on blobs, provided the result isn't 0/equal, *even if the blobs are truncated*. So maybe a better scheme, and certainly a simpler one would be to have a pseudo-attribute in inner pages only with, say, the first 8 bytes of a strxfrm() blob formed from the logically-leading text attribute of the same indexTuple. Because we're only doing this on inner pages, there is a very good chance that that will be good enough most of the time. This also allows us to reduce bloat very effectively. As I touched on before, other schemes for other types do seem to suggest themselves. We could support a pseudo leading attribute for numeric too, for example, where we use a 64-bit integer as a proxy for the whole number part (with the implementation having special handling of out of range values by means of an internal magic number - as always with the scheme, equality means inconclusive, ask logically leading attribute). That could win just by mostly avoiding the serialization overhead. Now, the obvious counter-argument here is to point out that there are worst cases where none of this helps. I suspect that those are so rare in the real world, and the cost of doing all this is so low, that it's still likely to be well worth it for any given use-case at the macro level. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
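The property being relied on, that a comparison of truncated strxfrm() blobs can be trusted whenever it is non-zero, is easy to demonstrate with a small standalone C sketch (PREFIX_LEN and the helper functions here are invented for illustration, not proposed PostgreSQL code). A zero result only means "inconclusive"; the caller must then fall back to the full, logically leading attribute, mirroring the inner-page scheme sketched above.

#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define PREFIX_LEN 8    /* bytes of the transformed blob imagined to live in inner pages */

/* Transform a string with strxfrm() and keep a fixed-size, zero-padded prefix. */
static void
xfrm_prefix(const char *s, unsigned char *prefix)
{
    size_t  need = strxfrm(NULL, s, 0) + 1;
    char   *blob = malloc(need);
    size_t  len;

    strxfrm(blob, s, need);
    len = strlen(blob);
    memset(prefix, 0, PREFIX_LEN);
    memcpy(prefix, blob, len < PREFIX_LEN ? len : PREFIX_LEN);
    free(blob);
}

/* Non-zero results are trustworthy; zero means "ask the original strings". */
static int
prefix_cmp(const char *a, const char *b)
{
    unsigned char pa[PREFIX_LEN], pb[PREFIX_LEN];

    xfrm_prefix(a, pa);
    xfrm_prefix(b, pb);
    return memcmp(pa, pb, PREFIX_LEN);
}

int
main(void)
{
    setlocale(LC_COLLATE, "");

    int     cmp = prefix_cmp("alphabet", "alphabetical");

    if (cmp != 0)
        printf("prefix comparison is decisive: %d\n", cmp);
    else
        printf("prefixes equal: inconclusive, compare the full values\n");
    return 0;
}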
Re: [HACKERS] narwhal and PGDLLIMPORT
On 02/02/2014 09:03 AM, Tom Lane wrote: According to the buildfarm database, narwhal is running a gcc build on Windows 2003. That hardly seems like a mainstream use case. I could believe that it might be of interest to developers, but clearly no developers are actually running such a build. I think we should give serious consideration to desupporting this combination I'm not a fan of MinGW (gcc) on Windows, it's a right pain. It's also the only open source compiler currently supported for PostgreSQL on Windows - practically the only one available. I don't know about you, but I'm not too keen on assuming Microsoft will continue to offer free toolchains that're usable for our purposes. They're crippling their free build tools more and more with each release, which isn't a good trend. If you wish to eliminate PGDLLIMPORT from the codebase, the correct approach would be building with --export-all-symbols (a MinGW extension flag to gcc). That would make the MinGW builds consistent with the MSVC build, which generates a .def file that exports all symbols. As for why PGDLLIMPORT appears to be required in some places on the MSVC build, so far it's looking like we auto-export functions, but not necessarily variables. I'd need to read the fairly scary MSVC build generator scripts in detail to confirm that, to see how they produce their DEF files; that'll have to wait until after I've got the row-security work sorted out. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pg_basebackup and pg_stat_tmp directory
On Mon, Feb 3, 2014 at 6:57 AM, Peter Eisentraut pete...@gmx.net wrote: On 2/2/14, 10:23 AM, Fujii Masao wrote: I'm thinking to change basebackup.c so that it compares the name of the directory that it's trying to back up and the setting value of log_directory parameter, then, if they are the same, it just skips the directory. The patch that I sent upthread does this regarding stats_temp_directory. I'm undecided on whether log files should be copied, but in case we decide not to, it needs to be considered whether we at least recreate the pg_log directory on the standby. Otherwise weird things will happen when you start the standby, and it would introduce an extra fixup step to sort that out. Yes, basically we should skip only files under pg_log, but not pg_log directory itself. Extra credit for doing something useful when pg_log is a symlink. I fear, however, that if you end up implementing all that logic, it would become too much special magic. ISTM that pg_xlog has already been handled in that way by basebackup. Regards, -- Fujii Masao -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] bugfix patch for json_array_elements
On 02/03/2014 09:09 AM, Craig Ringer wrote: At a guess, we're looking at a case where a new child context is created at every call, so every MemoryContextResetChildren call has to deal with more child contexts. That would be yes. After a short run, I see 32849 lines like: json_array_elements temporary cxt: 8192 total in 1 blocks; 8160 free (0 chunks); 32 used under the context: PortalMemory: 8192 total in 1 blocks PortalHeapMemory: 7168 total in 3 blocks ExecutorState: 65600 total in 4 blocks ExprContext: 8192 total in 1 blocks json_array_elements temporary cxt: 8192 total in 1 blocks; 8160 free (0 chunks); 32 used The attached patch deletes the context after use, bringing performance back into line. It should be backpatched to 9.3. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services From f4ae516d643dee43200053533dbcc8b362e3f92b Mon Sep 17 00:00:00 2001 From: Craig Ringer cr...@2ndquadrant.com Date: Mon, 3 Feb 2014 09:47:34 +0800 Subject: [PATCH] Fix leaked memory context in json_array_elements and json_each json_array_elements and json_each leaked a memory context with each invocation, causing profiles involving many invocations in a single query context to be dominated by MemoryContextReset because of MemoryContextResetChildren calls made after each tuple is emitted. See initial bug report: http://stackoverflow.com/q/21507127/398670 and list discussion: http://www.postgresql.org/message-id/52eeec37.9040...@2ndquadrant.com --- src/backend/utils/adt/jsonfuncs.c | 6 ++ 1 file changed, 6 insertions(+) diff --git a/src/backend/utils/adt/jsonfuncs.c b/src/backend/utils/adt/jsonfuncs.c index a19b222..116d7a0 100644 --- a/src/backend/utils/adt/jsonfuncs.c +++ b/src/backend/utils/adt/jsonfuncs.c @@ -975,6 +975,9 @@ each_worker(PG_FUNCTION_ARGS, bool as_text) rsi->setResult = state->tuple_store; rsi->setDesc = state->ret_tdesc; + MemoryContextDelete(state->tmp_cxt); + state->tmp_cxt = NULL; + PG_RETURN_NULL(); } @@ -1157,6 +1160,9 @@ elements_worker(PG_FUNCTION_ARGS, bool as_text) rsi->setResult = state->tuple_store; rsi->setDesc = state->ret_tdesc; + MemoryContextDelete(state->tmp_cxt); + state->tmp_cxt = NULL; + PG_RETURN_NULL(); } -- 1.8.3.1 -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] plpgsql.warn_shadow
Hi everyone, Here's a new version of the patch. Added new tests and docs and changed the GUCs per discussion. plpgsql.warnings_as_errors only affects compilation at CREATE FUNCTION time: =# set plpgsql.warnings to 'all'; SET =#* set plpgsql.warnings_as_errors to true; SET =#* select foof(1); -- compiled since it's the first call in this session WARNING: variable f1 shadows a previously defined variable LINE 2: declare f1 int; ^ foof -- (1 row) =#* create or replace function foof(f1 int) returns void as $$ declare f1 int; begin end $$ language plpgsql; ERROR: variable f1 shadows a previously defined variable LINE 3: declare f1 int; Currently, plpgsql_warnings is a boolean since there's only one warning we implement. The idea is to make it a bit field of some kind in the future when we add more warnings. Starting that work for 9.4 seemed like overkill, though. I tried to keep things simple. Regards, Marko Tiikkaja *** a/doc/src/sgml/plpgsql.sgml --- b/doc/src/sgml/plpgsql.sgml *** *** 4712,4717 a_output := a_output || $$ if v_$$ || referrer_keys.kind || $$ like '$$ --- 4712,4762 /variablelist /sect2 + sect2 id=plpgsql-warnings +titleWarnings/title + +para + To aid the user in finding instances of simple but common problems before + they cause harm, applicationPL/PgSQL/ provides a number of + replaceablewarnings/. When enabled, a literalWARNING/ is emitted + during the compilation of a function. Optionally, if + varnameplpgsql.warnings_as_errors/ is set, an literalERROR/ is + raised when attempting to create a function which would otherwise result in + a warning. +/para + + para + Warnings are enabled through the configuration variable + varnameplpgsql.warnings/. It can be set either to a comma-separated list + of warnings, literalnone/ or literalall/. The default is + literalnone/. Currently the list of available warnings includes only + one: + variablelist +varlistentry + termvarnameshadow/varname/term + listitem + para + Warns when a declaration shadows a previously defined variable. 
For + example: + programlisting + CREATE FUNCTION foo(f1 int) RETURNS int AS $$ + DECLARE + f1 int; + BEGIN + RETURN f1; + END + $$ LANGUAGE plpgsql; + WARNING: variable f1 shadows a previously defined variable + LINE 3: f1 int; + ^ + CREATE FUNCTION + /programlisting + /para + /listitem +/varlistentry + /variablelist + /para + /sect2 /sect1 !-- Porting from Oracle PL/SQL -- *** a/src/pl/plpgsql/src/pl_comp.c --- b/src/pl/plpgsql/src/pl_comp.c *** *** 352,357 do_compile(FunctionCallInfo fcinfo, --- 352,360 function-out_param_varno = -1; /* set up for no OUT param */ function-resolve_option = plpgsql_variable_conflict; function-print_strict_params = plpgsql_print_strict_params; + function-warnings = plpgsql_warnings; + /* only promote warnings to errors at CREATE FUNCTION time */ + function-warnings_as_errors = plpgsql_warnings_as_errors forValidator; if (is_dml_trigger) function-fn_is_trigger = PLPGSQL_DML_TRIGGER; *** *** 849,854 plpgsql_compile_inline(char *proc_source) --- 852,860 function-out_param_varno = -1; /* set up for no OUT param */ function-resolve_option = plpgsql_variable_conflict; function-print_strict_params = plpgsql_print_strict_params; + function-warnings = plpgsql_warnings; + /* never promote warnings to errors */ + function-warnings_as_errors = false; plpgsql_ns_init(); plpgsql_ns_push(func_name); *** a/src/pl/plpgsql/src/pl_gram.y --- b/src/pl/plpgsql/src/pl_gram.y *** *** 727,732 decl_varname : T_WORD --- 727,746 $1.ident, NULL, NULL, NULL) != NULL) yyerror(duplicate declaration); + + if (plpgsql_curr_compile-warnings) + { + PLpgSQL_nsitem *nsi; + nsi = plpgsql_ns_lookup(plpgsql_ns_top(), false, + $1.ident, NULL, NULL, NULL); + if (nsi != NULL) + ereport(plpgsql_curr_compile-warnings_as_errors ? ERROR : WARNING, + (errcode(ERRCODE_DUPLICATE_ALIAS), +
Re: [HACKERS] WITH ORDINALITY planner improvements
(2014/02/01 8:01), Tom Lane wrote: Bruce Momjian br...@momjian.us writes: On Thu, Aug 15, 2013 at 07:25:17PM +0900, Etsuro Fujita wrote: Attached is an updated version of the patch. In that version the code for the newly added function build_function_pathkeys() has been made more simple by using the macro INTEGER_BTREE_FAM_OID. Is this patch to be applied? It hasn't been reviewed AFAIK. Thanks for picking this up. I think it doesn't need to be reviewed anymore because the same feature has been committed in [1], if I understand correctly. [1] http://www.postgresql.org/message-id/e1vjel1-0007x9...@gemulon.postgresql.org Best regards, Etsuro Fujita -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] narwhal and PGDLLIMPORT
On Mon, Feb 3, 2014 at 6:52 AM, Craig Ringer cr...@2ndquadrant.com wrote: On 02/02/2014 09:03 AM, Tom Lane wrote: According to the buildfarm database, narwhal is running a gcc build on Windows 2003. That hardly seems like a mainstream use case. I could believe that it might be of interest to developers, but clearly no developers are actually running such a build. I think we should give serious consideration to desupporting this combination I'm not a fan of MinGW (gcc) on Windows, it's a right pain. It's also the only open source compiler currently supported for PostgreSQL on Windows - practically the only one available. I don't know about you, but I'm not too keen on assuming Microsoft will continue to offer free toolchains that're usable for our purposes. They're crippling their free build tools more and more with each release, which isn't a good trend. If you wish to eliminate PGDLLIMPORT from the codebase, the correct approach would be building with --export-all-symbols (a MinGW extension flag to gcc). That would make the MinGW builds consistent with the MSVC build, which generates a .def file that exports all symbols. As for why PGDLLIMPORT appears to be required in some places on the MSVC build, so far it's looking like we auto-export functions, but not necessarily variables. In my experience, that is true. During development of a contrib module, accessing a PostgreSQL global variable crashed on Windows, while the same variable was accessible on non-Windows platforms. I'd need to read the fairly scary MSVC build generator scripts in detail to confirm that, to see how they produce their DEF files; that'll have to wait until after I've got the row-security work sorted out. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers -- -- Thanks & Regards, Ashesh Vashi EnterpriseDB INDIA: Enterprise PostgreSQL Company http://www.enterprisedb.com http://www.linkedin.com/in/asheshvashi
Re: [HACKERS] [PATCH] pg_basebackup: progress report max once per second
On Sat, Feb 1, 2014 at 8:29 PM, Oskari Saarenmaa o...@ohmu.fi wrote: 31.01.2014 10:59, Sawada Masahiko wrote: I think the idea in the new progress_report() call (with force == true) is to make sure that there is at least one progress_report call that actually writes the progress report. Otherwise the final report may go missing if it gets suppressed by the time-based check. The force argument as used in the new call skips that check. I understand now. I have two concerns: - progress_report() may be called more frequently than once per second; a call with force == true can emit a report immediately after a call with force == false has already emitted one. - progress_report() is called even if the -P option is disabled; I'm concerned that this could cause performance degradation. Regards, --- Sawada Masahiko -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
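The throttling pattern under discussion, report at most once per second but never suppress the final report, looks roughly like the standalone sketch below (invented names, not the actual pg_basebackup code). Note that it exhibits exactly the first behaviour Sawada describes: a forced final call can emit a second report within the same second as a non-forced one.

#include <stdbool.h>
#include <stdio.h>
#include <time.h>

static time_t last_report = 0;

/*
 * Invented stand-in for pg_basebackup's progress_report(): suppress the
 * output if a report was already written during the current second,
 * unless force is set (used for the final call so the last numbers are
 * never lost to the time-based check).
 */
static void
progress_report(long done, long total, bool force)
{
    time_t  now = time(NULL);

    if (!force && now == last_report)
        return;                 /* throttled: at most one report per second */
    last_report = now;
    fprintf(stderr, "%ld/%ld kB\r", done, total);
}

int
main(void)
{
    for (long done = 0; done <= 1000; done += 10)
        progress_report(done, 1000, false);
    progress_report(1000, 1000, true);  /* final report, never suppressed */
    fprintf(stderr, "\n");
    return 0;
}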
Re: [HACKERS] inherit support for foreign tables
(2014/01/31 9:56), Robert Haas wrote: On Thu, Jan 30, 2014 at 5:05 PM, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: On Thu, Jan 30, 2014 at 11:04 AM, Tom Lane t...@sss.pgh.pa.us wrote: I think this is totally misguided. Who's to say that some weird FDW might not pay attention to attstorage? I could imagine a file-based FDW using that to decide whether to compress columns, for instance. Admittedly, the chances of that aren't large, but it's pretty hard to argue that going out of our way to prevent it is a useful activity. I think that's a pretty tenuous position. There are already FDW-specific options sufficient to let a particular FDW store whatever kinds of options it likes; letting the user set options that were only ever intended to be applied to tables just because we can seems sort of dubious. I'm tempted by the idea of continuing to disallow SET STORAGE on an unvarnished foreign table, but allowing it on an inheritance hierarchy that contains at least one real table, with the semantics that we quietly ignore the foreign tables and apply the operation to the plain tables. [ shrug... ] By far the easiest implementation of that is just to apply the catalog change to all of them. According to your assumptions, it'll be a no-op on the foreign tables anyway. Well, there's some point to that, too, I suppose. What do others think? Would allowing ALTER COLUMN SET STORAGE on foreign tables make sense if, for example, SELECT * INTO local_table FROM foreign_table created a new local table whose columns used the storage types associated with those of the foreign table? Thanks, Best regards, Etsuro Fujita -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] narwhal and PGDLLIMPORT
On 02/03/2014 11:10 AM, Ashesh Vashi wrote: On Mon, Feb 3, 2014 at 6:52 AM, Craig Ringer cr...@2ndquadrant.com wrote: On 02/02/2014 09:03 AM, Tom Lane wrote: According to the buildfarm database, narwhal is running a gcc build on Windows 2003. That hardly seems like a mainstream use case. I could believe that it might be of interest to developers, but clearly no developers are actually running such a build. I think we should give serious consideration to desupporting this combination I'm not a fan of MinGW (gcc) on Windows, it's a right pain. It's also the only open source compiler currently supported for PostgreSQL on Windows - practically the only one available. I don't know about you, but I'm not too keen on assuming Microsoft will continue to offer free toolchains that're usable for our purposes. They're crippling their free build tools more and more with each release, which isn't a good trend. If you wish to eliminate PGDLLIMPORT from the codebase, the correct approach would be building with --export-all-symbols (a MinGW extension flag to gcc). That would make the MinGW builds consistent with the MSVC build, which generates a .def file that exports all symbols. As for why PGDLLIMPORT appears to be required in some places on the MSVC build, so far it's looking like we auto-export functions, but not necessarily variables. In my experience, that is true. During development of a contrib module, accessing a PostgreSQL global variable crashed on Windows, while the same variable was accessible on non-Windows platforms. I think it's a good thing personally - we shouldn't be exporting every little internal var in the symbol table. If we built with -fvisibility=hidden on 'nix there'd be no need to complain about commits working on 'nix then breaking on Windows, 'cos the 'nix build would break in the same place. That's all or nothing, though; there's no vars hidden, procs exported option in gcc. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] narwhal and PGDLLIMPORT
On Mon, Feb 3, 2014 at 6:52 AM, Craig Ringer cr...@2ndquadrant.com wrote: On 02/02/2014 09:03 AM, Tom Lane wrote: According to the buildfarm database, narwhal is running a gcc build on Windows 2003. That hardly seems like a mainstream use case. I could believe that it might be of interest to developers, but clearly no developers are actually running such a build. I think we should give serious consideration to desupporting this combination I'm not a fan of MinGW (gcc) on Windows, it's a right pain. It's also the only open source compiler currently supported for PostgreSQL on Windows - practically the only one available. I don't know about you, but I'm not too keen on assuming Microsoft will continue to offer free toolchains that're usable for our purposes. They're crippling their free build tools more and more with each release, which isn't a good trend. If you wish to eliminate PGDLLIMPORT from the codebase, the correct approach would be building with --export-all-symbols (a MinGW extension flag to gcc). That would make the MinGW builds consistent with the MSVC build, which generates a .def file that exports all symbols. As for why PGDLLIMPORT appears to be required in some places on the MSVC build, so far it's looking like we auto-export functions, but not necessarily variables. I could see both of the variables Tom points out (DateStyle and IntervalStyle) in the .def file and also with the dumpbin utility. I think these symbols are exported; the main reason for the failure is that they are not imported in postgres_fdw, as we have not used PGDLLIMPORT. Looking at the code below, my understanding is that BUILDING_DLL is not defined for contrib modules, so using PGDLLIMPORT leads to an import of the required variables.

#ifdef BUILDING_DLL
#define PGDLLIMPORT __declspec (dllexport)
#else /* not BUILDING_DLL */
#define PGDLLIMPORT __declspec (dllimport)
#endif

We need this for variables, but it is optional for functions. Please refer to the link below: http://msdn.microsoft.com/en-us/library/8fskxacy(v=vs.80).aspx With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
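Concretely, the situation described above comes down to how the variable is declared on the extension side; the sketch below is simplified (the real macro lives in the Windows port header, and which variables carry the marker varies by branch). With the marker, MSVC generates an import-table reference to the backend's copy of the variable; without it, the module still compiles, but it does not reliably see the backend's data, which matches the junk values reported upthread.

/*
 * Simplified sketch of the export/import rules being discussed; not a
 * drop-in header.  DateStyle and IntervalStyle are used here only
 * because they are the variables from this thread.
 */
#ifdef BUILDING_DLL                     /* defined when building the backend itself */
#define PGDLLIMPORT __declspec (dllexport)
#else                                   /* extensions such as postgres_fdw */
#define PGDLLIMPORT __declspec (dllimport)
#endif

/* With the marker, the extension reads the backend's copy of the variable. */
extern PGDLLIMPORT int DateStyle;

/*
 * Without it, MSVC has no hint that the definition lives in another
 * module, so a contrib module referencing the variable can misbehave.
 */
extern int IntervalStyle;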
Re: [HACKERS] narwhal and PGDLLIMPORT
On Sun, Feb 2, 2014 at 6:33 AM, Tom Lane t...@sss.pgh.pa.us wrote: I happened to notice today that the owner of buildfarm member narwhal is trying to revive it after a long time offline, but it's failing in the 9.3 branch (and not attempting to build HEAD, yet). The cause appears to be that contrib/postgres_fdw is referencing the DateStyle and IntervalStyle variables, which aren't marked PGDLLIMPORT. Hm, well, that would be an easy change ... but that code was committed last March. How is it that we didn't notice this long ago? What this seems to indicate is that narwhal is the only buildfarm animal that has a need for PGDLLIMPORT marks on global variables that are referenced by extensions. I have tried to debug the part of the code where we use the DateStyle and IntervalStyle variables in postgres_fdw on my m/c (Win 7 - 64 bit, MSVC2010) and found that their values are junk; to me the problem looks exactly like what we saw with the test_shm_mq module a few days back. I am not sure why it is passing on other buildfarm m/c, but I think we need to put PGDLLIMPORT on global variables we want to use in extensions. Also, as per the link below, it seems that PGDLLIMPORT is required for exported global variables. http://msdn.microsoft.com/en-us/library/8fskxacy(v=vs.80).aspx With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Re: [review] PostgreSQL Service on Windows does not start if data directory given is relative path
On 1st February 2014, MauMau Wrote: I reviewed the patch content. I find this fix useful. I'd like to suggest some code improvements. I'll apply and test the patch when I receive your reply. Thanks for reviewing the patch. (1) I think it is appropriate to place find_my_abs_path() in path.c rather than exec.c. Please look at the comments at the beginning of those files. exec.c handles functions related to executables, while path.c handles general functions handling paths. I have moved this function to path.c It's better to rename the function to follow the naming of other functions in path.c, something like get_absolute_path() or so. Unfortunately, we cannot use make_absolute_path() as the name because it is used in src/backend/utils/init/miscinit.c, which conflicts in the backend module. Renamed the function as get_absolute_path. (2) In pg_ctl.c, dbpath can be better named as datadir, because it holds data directory location. dbpath is used to mean some different location in other source files. Renamed as dataDir. (3) find_my_abs_path() had better not call make_native_path() because the role of this function should be to just return an absolute path. pg_ctl.c can instead call make_native_path() after find_my_abs_path(). Changed as per suggestion. (4) find_my_abs_path() should not call resolve_symlinks(). For reference, look at make_absolute_path() in src/backend/utils/init/miscinit.c and src/test/regress/pg_regress.c. I guess the reason is that if the user changed to the directory through a symbolic link, we should retain the symbolic link name. Changed as per suggestion. (5) Change file/path in the comment of find_my_abs_path() to path, because file is also a path. Changed as per suggestion. Please find the attached revised patch. Thanks and Regards, Kumar Rajeev Rastogi pgctl_win32service_rel_dbpath_v3.patch Description: pgctl_win32service_rel_dbpath_v3.patch -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] narwhal and PGDLLIMPORT
Amit Kapila amit.kapil...@gmail.com writes: Also as per below link, it seems that PGDLLIMPORT is required for exported global variables. http://msdn.microsoft.com/en-us/library/8fskxacy(v=vs.80).aspx That was what we'd always assumed, but the fact that postgres_fdw has been working for the last year (on everything except narwhal) puts the lie to that as a blanket assumption. So what I want to know now is how come it wasn't failing; because then we might be able to exploit that to eliminate the need for the PGDLLIMPORT cruft everywhere. I've never been a fan of the fact that Windows insists on its own private set of global symbol visibility rules. If we can make that build work more like other platforms, it'll be a step forward IMO. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] narwhal and PGDLLIMPORT
On Sun, Feb 2, 2014 at 5:22 PM, Craig Ringer cr...@2ndquadrant.com wrote: I'm not a fan of MinGW (gcc) on Windows, it's a right pain. It's also the only open source compiler currently supported for PostgreSQL on Windows - practically the only one available. I don't know about you, but I'm not too keen on assuming Microsoft will continue to offer free toolchains that're usable for our purposes. They're crippling their free build tools more and more with each release, which isn't a good trend. I was under the impression that Microsoft had finally come around to implementing a few C99 features in Visual Studio 2013 precisely because they want there to be an open source ecosystem on Windows. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] FOR [SHARE|UPDATE] NOWAIT may still block in EvalPlanQualFetch
On 02/01/2014 05:28 AM, Bruce Momjian wrote: On Fri, Aug 2, 2013 at 04:00:03PM +0800, Craig Ringer wrote: FOR SHARE|UPDATE NOWAIT will still block if they have to follow a ctid chain because the call to EvalPlanQualFetch doesn't take a param for noWait, so it doesn't know not to block if the updated row can't be locked. The attached patch against master includes an isolationtester spec to demonstrate this issue and a proposed fix. Builds with the fix applied pass make check and isolationtester make installcheck. To reach this point you need to apply the patch in http://www.postgresql.org/message-id/51fb4305.3070...@2ndquadrant.com first, otherwise you'll get stuck there and won't touch the code changed in this patch. The above looks like a legitimate patch that was not applied: http://www.postgresql.org/message-id/51fb6703.9090...@2ndquadrant.com The patch mentioned in the text above was applied, I think. The first patch, linked to in the text, was committed as 706f9dd914c64a41e06b5fbfd62d6d6dab43eeb8. I can't see the second in the history either. It'd be good to get it committed, though the issue is obviously not causing any great outcry. It was detected when testing a high-rate queueing system in PostgreSQL. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Turning off HOT/Cleanup sometimes
On Wed, Jan 15, 2014 at 2:43 AM, Simon Riggs si...@2ndquadrant.com wrote: On 8 January 2014 08:33, Simon Riggs si...@2ndquadrant.com wrote: Patch attached, implemented to reduce writes by SELECTs only. This is really a valuable improvement over the current SELECT behaviour w.r.t. writes. While going through the patch, I observed a few points I'd like to share with you: + /* + * If we are tracking pruning in SELECTs then we can only get + * here by heap_page_prune_opt() call that cleans a block, + * so in that case, register it as a pruning operation. + * Make sure we don't double count during VACUUMs. + */ + if (PrunePageDirtyLimit > -1) + PrunePageDirty++; a. As the PrunePageDirtyLimit variable is not initialized for the DDL flow, a statement like Create Function().. will see PrunePageDirtyLimit at its default value of 4, and in such cases MarkBufferDirty() will increment the wrong counter. b. A DDL statement like Create Materialized View will behave as a SELECT statement, e.g. Create Materialized View mv1 as select * from t1; I think this might not be a problem, because there will be no write to t1 anyway, so skipping pruning should not hurt, and the materialized view will have no dead rows, so skipping should be okay as well; however, it does not strictly adhere to the stated goal of reducing writes by SELECTs only, i.e. avoiding pruning only when the top-level statement is a SELECT. Do you think it's better to consider such cases and optimize for them, or should we avoid it by following the thumb rule that pruning will be skipped only for a top-level SELECT? 2. + Allow cleanup of shared buffers by foreground processes, allowing later cleanup by VACUUM, This line is not clear. What do you mean by allowing later cleanup by VACUUM? If the foreground process has already done the cleanup, then it should save VACUUM the effort. In general, though both of the optimisations used in the patch (allow_buffer_cleanup and prune_page_dirty_limit) are similar in that they are used to avoid pruning, I still feel they target different cases (read-only operations and writes on small tables), and since more people are inclined to do this only for SELECT operations, do you think it would be a good idea to make them separate patches? I think some applications or use cases could benefit from avoiding pruning for writes on small tables, but the SELECT case is more general and more applications would benefit from that optimisation, so it would be better to tackle that case first. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] GIN improvements part2: fast scan
Tomas, it'd be nice if you used real data in your testing. One very good application of gin fast-scan is dramatic performance improvement of hstore/jsonb @ operator, see slides 57, 58 http://www.sai.msu.su/~megera/postgres/talks/hstore-dublin-2013.pdf. I'd like not to lose this benefit :) Oleg PS. I used data delicious-rss-1250k.gz from http://randomwalker.info/data/delicious/ On Mon, Feb 3, 2014 at 5:44 AM, Tomas Vondra t...@fuzzy.cz wrote: On 3.2.2014 00:13, Tomas Vondra wrote: On 2.2.2014 11:45, Heikki Linnakangas wrote: On 01/30/2014 01:53 AM, Tomas Vondra wrote: (3) A file with explain plans for 4 queries suffering ~2x slowdown, and explain plans with 9.4 master and Heikki's patches is available here: http://www.fuzzy.cz/tmp/gin/queries.txt All the queries have 6 common words, and the explain plans look just fine to me - exactly like the plans for other queries. Two things now caught my eye. First some of these queries actually have words repeated - either exactly like database database or in negated form like !anything anything. Second, while generating the queries, I use dumb frequency, where only exact matches count. I.e. write != written etc. But the actual number of hits may be much higher - for example write matches exactly just 5% documents, but using @@ it matches more than 20%. I don't know if that's the actual cause though. Ok, here's another variant of these patches. Compared to git master, it does three things: 1. It adds the concept of ternary consistent function internally, but no catalog changes. It's implemented by calling the regular boolean consistent function both ways. 2. Use a binary heap to get the next item from the entries in a scan. I'm pretty sure this makes sense, because arguably it makes the code more readable, and reduces the number of item pointer comparisons significantly for queries with a lot of entries. 3. Only perform the pre-consistent check to try skipping entries, if we don't already have the next item from the entry loaded in the array. This is a tradeoff, you will lose some of the performance gain you might get from pre-consistent checks, but it also limits the performance loss you might get from doing useless pre-consistent checks. So taken together, I would expect this patch to make some of the performance gains less impressive, but also limit the loss we saw with some of the other patches. Tomas, could you run your test suite with this patch, please? Sure, will do. Do I get it right that this should be applied instead of the four patches you've posted earlier? So, I was curious and did a basic testing - I've repeated the tests on current HEAD and 'HEAD with the new patch'. The complete data are available at [http://www.fuzzy.cz/tmp/gin/gin-scan-benchmarks.ods] and I've updated the charts at [http://www.fuzzy.cz/tmp/gin/] too. Look for branches named 9.4-head-2 and 9.4-heikki-2. To me it seems that: (1) The main issue was that with common words, it used to be much slower than HEAD (or 9.3). This seems to be fixed, i.e. it's not slower than before. See http://www.fuzzy.cz/tmp/gin/3-common-words.png (previous patch) http://www.fuzzy.cz/tmp/gin/3-common-words-new.png (new patch) for comparison vs. 9.4 HEAD. With the new patch there's no slowdown, which seems nice. Compared to 9.3 it looks like this: http://www.fuzzy.cz/tmp/gin/3-common-words-new-vs-93.png so there's a significant speedup (thanks to the modified layout). (2) The question is whether the new patch works fine on rare words. 
See this for comparison of the patches against HEAD: http://www.fuzzy.cz/tmp/gin/3-rare-words.png http://www.fuzzy.cz/tmp/gin/3-rare-words-new.png and this is the comparison of the two patches: http://www.fuzzy.cz/tmp/gin/patches-rare-words.png That seems fine to me - some queries are slower, but we're talking about queries taking 1 or 2 ms, so the measurement error is probably the main cause of the differences. (3) With higher numbers of frequent words, the differences (vs. HEAD or the previous patch) are not as dramatic as in (1) - the new patch is consistently ~20% faster. Tomas -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] patch: option --if-exists for pg_dump
Hi Peter, I am not sure why you are getting an unstable build due to white-space errors. Are you referring to these lines? 11:25:01 src/bin/pg_dump/pg_backup_archiver.c:477: indent with spaces. 11:25:01 + dropStmt, 11:25:01 src/bin/pg_dump/pg_backup_archiver.c:478: indent with spaces. 11:25:10 + buffer, 11:25:14 src/bin/pg_dump/pg_backup_archiver.c:479: indent with spaces. 11:25:15 + mark + l); 11:25:15 + echo unstable If yes, then in my latest attached patch, these lines are NOT AT ALL there. I mentioned in my comment that I have fixed these in my version of the patch, but you still got an unstable build. Not sure how. It seems like you are applying the wrong patch. Will you please let us know what's going wrong? Thanks On Thu, Jan 30, 2014 at 6:56 PM, Pavel Stehule pavel.steh...@gmail.com wrote: 2014-01-30 Jeevan Chalke jeevan.cha...@enterprisedb.com: OK. Assigned it to committer. Thanks for the hard work. Thank you for review Pavel On Thu, Jan 30, 2014 at 6:16 PM, Pavel Stehule pavel.steh...@gmail.com wrote: Hello All is ok Thank you Pavel -- Jeevan B Chalke Principal Software Engineer, Product Development EnterpriseDB Corporation The Enterprise PostgreSQL Company -- Jeevan B Chalke Principal Software Engineer, Product Development EnterpriseDB Corporation The Enterprise PostgreSQL Company Phone: +91 20 30589500 Website: www.enterprisedb.com EnterpriseDB Blog: http://blogs.enterprisedb.com/ Follow us on Twitter: http://www.twitter.com/enterprisedb This e-mail message (and any attachment) is intended for the use of the individual or entity to whom it is addressed. This message contains information from EnterpriseDB Corporation that may be privileged, confidential, or exempt from disclosure under applicable law. If you are not the intended recipient or authorized to receive this for the intended recipient, any use, dissemination, distribution, retention, archiving, or copying of this communication is strictly prohibited. If you have received this e-mail in error, please notify the sender immediately by reply e-mail and delete this message.