Re: [HACKERS] Performance degradation in commit 6150a1b0
On Sat, Feb 27, 2016 at 12:41 AM, Andres Freund wrote:
> Hi,
>
> On 2016-02-25 12:56:39 +0530, Amit Kapila wrote:
> > For the past few weeks, we have been facing some performance degradation in the
> > read-only performance benchmarks on high-end machines. My colleague
> > Mithun tried reverting commit ac1d794, which seems to degrade the
> > performance in HEAD on high-end m/c's as reported previously[1], but we were
> > still getting degradation, so we did some profiling to see what
> > caused it, and we found that it's mainly caused by the spinlock when
> > called via pin/unpin buffer. We then tried reverting commit 6150a1b0,
> > which has recently changed the structures in that area, and it turns out
> > that after reverting that patch, we don't see any degradation in performance.
> > The important point to note is that the performance degradation doesn't
> > occur every time, but if the tests are repeated twice or thrice, it
> > is easily visible.
> >
> > m/c details
> > IBM POWER-8
> > 24 cores, 192 hardware threads
> > RAM - 492GB
> >
> > Non-default postgresql.conf settings -
> > shared_buffers=16GB
> > max_connections=200
> > min_wal_size=15GB
> > max_wal_size=20GB
> > checkpoint_timeout=900
> > maintenance_work_mem=1GB
> > checkpoint_completion_target=0.9
> >
> > scale_factor - 300
> >
> > Performance at commit 43cd468cf01007f39312af05c4c92ceb6de8afd8 is 469002 at
> > 64-client count, and then at 6150a1b08a9fe7ead2b25240be46dddeae9d98e1 it
> > went down to 200807. These performance numbers are the median of three 15-min
> > pgbench read-only tests. Similar data is seen even when we revert the
> > patch on the latest commit. We have yet to perform a detailed analysis of why
> > commit 6150a1b08a9fe7ead2b25240be46dddeae9d98e1 leads to the degradation,
> > but any ideas are welcome.
>
> Ugh. Especially the varying performance is odd. Does it vary between
> restarts, or is it just happenstance? If it's the former, we might be
> dealing with some alignment issues.
> It varies between restarts.
>
> If not, I wonder if the issue is massive buffer header contention. On a
> LL/SC architecture, acquiring the content lock might interrupt buffer
> spinlock acquisition and vice versa.
>
> Does applying the patch from http://archives.postgresql.org/message-id/CAPpHfdu77FUi5eiNb%2BjRPFh5S%2B1U%2B8ax4Zw%3DAUYgt%2BCPsKiyWw%40mail.gmail.com
> change the picture?

Not tried, but if this is an alignment issue as you are suspecting above, then does it make sense to try this out?

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers
On Tue, Feb 23, 2016 at 7:06 PM, Robert Haas wrote:
> On Sun, Feb 21, 2016 at 7:45 PM, Amit Kapila wrote:
>>> I mean, my basic feeling is that I would not accept a 2-3% regression in
>>> the single client case to get a 10% speedup in the case where we have 128
>>> clients.
>>
>> When I tried running the pgbench first with the patch and then with HEAD, I see a 1.2% performance increase with the patch. TPS with the patch is 976 and with HEAD it is 964. For the three 30-min TPS runs, refer to "Patch - group_clog_update_v5" and before that "HEAD - Commit 481725c0" in perf_write_clogcontrollock_data_v6.ods attached with this mail. Nonetheless, I have observed that the below new check has been added by the patch, which can affect single-client performance. So I have changed it such that the new check is done only when there is actually a need for a group update, which means when multiple clients try to update the clog at a time.

+ if (!InRecovery &&
+     all_trans_same_page &&
+     nsubxids < PGPROC_MAX_CACHED_SUBXIDS &&
+     !IsGXactActive())

>> I understand your point. I think to verify whether it is run-to-run
>> variation or an actual regression, I will re-run these tests on a single
>> client multiple times and post the result.
>
> Perhaps you could also try it on a couple of different machines (e.g.
> MacBook Pro and a couple of different large servers).

Okay, I have tried the latest patch (group_update_clog_v6.patch) on 2 different big servers and then on a Mac-Pro. The detailed data for various runs can be found in the attached document perf_write_clogcontrollock_data_v6.ods.
I have taken the performance data for higher client counts with a somewhat larger scale factor (1000), and the median data for the same is as below:

M/c configuration -
RAM - 500GB
8 sockets, 64 cores (hyperthreaded, 128 threads total)

Non-default parameters
max_connections = 1000
shared_buffers=32GB
min_wal_size=10GB
max_wal_size=15GB
checkpoint_timeout=35min
maintenance_work_mem = 1GB
checkpoint_completion_target = 0.9
wal_buffers = 256MB

Client_Count/Patch_ver      1      8     64    128    256
HEAD                      871   5090  17760  17616  13907
PATCH                     900   5110  18331  20277  19263

Here, we can see that there is a gain of ~15% to ~38% at higher client counts.

The attached document (perf_write_clogcontrollock_data_v6.ods) contains data mainly focusing on single-client performance. The data is for multiple runs on different machines, so I thought it better to present it in the form of a document rather than dumping everything in the e-mail. Do let me know if there is any confusion in understanding/interpreting the data.

Thanks to Dilip Kumar for helping me in conducting the tests of this patch on the MacBook-Pro.

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com

group_update_clog_v6.patch Description: Binary data
perf_write_clogcontrollock_data_v6.ods Description: application/vnd.oasis.opendocument.spreadsheet

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers
On Sat, Feb 27, 2016 at 10:03 AM, Amit Kapila wrote: > > > Here, we can see that there is a gain of ~15% to ~38% at higher client > count. > > The attached document (perf_write_clogcontrollock_data_v6.ods) contains > data, mainly focussing on single client performance. The data is for > multiple runs on different machines, so I thought it is better to present > in form of document rather than dumping everything in e-mail. Do let me > know if there is any confusion in understanding/interpreting the data. > > Forgot to mention that all these tests have been done by reverting commit-ac1d794. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Move PinBuffer and UnpinBuffer to atomics
On Sun, Feb 28, 2016 at 9:05 PM, Dilip Kumar wrote:
> On Sat, Feb 27, 2016 at 5:14 AM, Andres Freund wrote:
>>>>> ./pgbench -j$ -c$ -T300 -M prepared -S postgres
>>>>>
>>>>> Client   Base     Patch
>>>>> 1        17169    16454
>>>>> 8        108547   105559
>>>>> 32       241619   262818
>>>>> 64       206868   233606
>>>>> 128      137084   217013
>>>>
>>>> So, there's a small regression on low client counts. That's worth
>>>> addressing.
>>>
>>> Interesting. I'll try to reproduce it.
>>
>> Any progress here?
>
> On a multi-socket machine with 8 sockets and 64 cores, I have seen more regression compared to my previous run on POWER8 with 2 sockets. Currently I tested a read-only workload with a 5-min run; when I get time, I will run for a longer time and confirm again.

Have you tried reverting the commits 6150a1b0 and ac1d794, which I think affect read-only performance and sometimes create variation in TPS across different runs? Here the second might have less impact, but the first one could impact performance. Is it possible for you to get perf data with and without the patch and share it with others?

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] RFC: replace pg_stat_activity.waiting with something more descriptive
On Wed, Feb 24, 2016 at 7:14 PM, Robert Haas wrote:
> On Mon, Feb 22, 2016 at 10:05 AM, Amit Kapila wrote:
>>> I wouldn't bother tinkering with it at this point. The value isn't
>>> going to be recorded on disk anywhere, so it will be easy to change
>>> the way it's computed in the future if we ever need to do that.
>>
>> Okay. Find the rebased patch attached with this mail. I have moved
>> this patch to the upcoming CF.
>
> I would call the functions pgstat_report_wait_start() and
> pgstat_report_wait_end() instead of pgstat_report_start_waiting() and
> pgstat_report_end_waiting().
>
> I think pgstat_get_wait_event_type should not return HWLock, a term
> that appears nowhere in our source tree at present. How about just
> "Lock"?
>
> I think the wait event types should be documented - and the wait
> events too, perhaps.

By the above, do you mean to say that we should document the name of each wait event type and wait event? Documenting wait event names is okay, but we have approximately 65~70 wait events (considering individual LWLocks, Tranches, Locks, etc.); if we want to document all the events, then I think we can have a separate table with columns (wait event name, description) just below pg_stat_activity and have a link to that table in the wait_event row of the pg_stat_activity table. Does that match your thoughts, or do you have something else in mind?

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] RFC: replace pg_stat_activity.waiting with something more descriptive
On Thu, Feb 25, 2016 at 2:54 AM, Peter Eisentraut wrote:
> Could you enhance the documentation about the difference between "wait
> event type name" and "wait event name" (examples?)?

I am planning to add the possible values for each wait event type and wait event, and will add a few examples as well. Let me know if you want to see something else covered in the documentation.

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers
On Mon, Feb 29, 2016 at 11:10 PM, Robert Haas wrote:
> On Fri, Feb 26, 2016 at 11:37 PM, Amit Kapila wrote:
>> On Sat, Feb 27, 2016 at 10:03 AM, Amit Kapila wrote:
>>> Here, we can see that there is a gain of ~15% to ~38% at higher client
>>> count.
>>>
>>> The attached document (perf_write_clogcontrollock_data_v6.ods) contains
>>> data, mainly focusing on single client performance. The data is for
>>> multiple runs on different machines, so I thought it is better to present in
>>> form of document rather than dumping everything in e-mail. Do let me know
>>> if there is any confusion in understanding/interpreting the data.
>>
>> Forgot to mention that all these tests have been done by reverting
>> commit-ac1d794.
>
> OK, that seems better. But I have a question: if we don't really need
> to make this optimization apply only when everything is on the same
> page, then why even try? If we didn't try, we wouldn't need the
> all_trans_same_page flag, which would reduce the amount of code
> change.

I am not sure I understood your question. Do you want to know why, in the first place, transactions spanning more than one page call the function TransactionIdSetPageStatus()? If we want to avoid attempting the transaction status update when the transactions are on different pages, then I think we need some major changes in TransactionIdSetTreeStatus().

> Would that hurt anything? Taking it even further, we could
> remove the check from TransactionGroupUpdateXidStatus too. I'd be
> curious to know whether that set of changes would improve performance
> or regress it. Or maybe it does nothing, in which case perhaps
> simpler is better.
>
> All things being equal, it's probably better if the cases where
> transactions from different pages get into the list together is
> something that is more or less expected rather than a
> once-in-a-blue-moon scenario - that way, if any bugs exist, we'll find
> them.
> The downside of that is that we could increase latency for the
> leader that way - doing other work on the same page shouldn't hurt
> much but different pages is a bigger hit. But that hit might be
> trivial enough not to be worth worrying about.

In my tests, the check related to the same page in TransactionGroupUpdateXidStatus() doesn't impact performance in any way, and I think the reason is that it happens rarely that a group contains multiple pages, and even when it occurs, there is hardly much impact. So I will remove that check, and I think that's what you also want for now.

> + /*
> +  * Now that we've released the lock, go back and wake everybody up. We
> +  * don't do this under the lock so as to keep lock hold times to a
> +  * minimum. The system calls we need to perform to wake other processes
> +  * up are probably much slower than the simple memory writes we did while
> +  * holding the lock.
> +  */
>
> This comment was true in the place that you cut-and-pasted it from,
> but it's not true here, since we potentially need to read from disk.

Okay, will change.

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
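The wake-after-release ordering discussed in the quoted comment can be sketched as below. This is a hedged illustration of the pattern only: the `Member` struct, the `lock_held` flag, and the counter are stand-ins invented for this sketch, not PostgreSQL's actual LWLock, PGPROC, or TransactionGroupUpdateXidStatus code.

```c
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for a group member waiting for the leader (illustrative). */
typedef struct Member
{
    struct Member *next;
    bool           done;    /* status written while holding the lock */
    bool           woken;   /* wakeup issued after the lock is released */
} Member;

static bool lock_held = false;          /* stand-in for CLogControlLock */
static int  wakeups_under_lock = 0;     /* sanity counter, must stay 0 */

static void
leader_update_and_wake(Member *head)
{
    Member *m;

    lock_held = true;                   /* stand-in for LWLockAcquire */
    for (m = head; m != NULL; m = m->next)
        m->done = true;                 /* cheap memory writes under the lock */
    lock_held = false;                  /* stand-in for LWLockRelease */

    /* Wake everybody only after the lock is dropped: each wakeup may
     * involve a system call, far slower than the writes above. */
    for (m = head; m != NULL; m = m->next)
    {
        if (lock_held)
            wakeups_under_lock++;       /* would indicate a broken protocol */
        m->woken = true;                /* stand-in for semaphore unlock */
    }
}
```

The point Robert raises is that this ordering keeps lock hold times minimal only when the work under the lock really is cheap memory writes; if a page read from disk can happen there, the premise of the comment no longer holds.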
Re: [HACKERS] Relation extension scalability
On Mon, Feb 29, 2016 at 3:37 PM, Dilip Kumar wrote:
> On Wed, Feb 10, 2016 at 7:06 PM, Dilip Kumar wrote:
>
> Test2: Identify whether the improvement in the multi-extend case is because of avoiding the context switch or some other factor, like reusing blocks between backends by putting them in the FSM.
>
> 1. Test by just extending multiple blocks and reusing them in the same backend (don't put them in the FSM).
> Insert 1K records; data doesn't fit in shared buffers; 512MB shared buffers.
>
> Client   Base   Extend 800 block self use   Extend 1000 Block
> 1        117    131                         118
> 2        111    203                         140
> 3         51    242                         178
> 4         51    231                         190
> 5         52    259                         224
> 6         51    263                         243
> 7         43    253                         254
> 8         43    240                         254
> 16        40    190                         243
>
> We can see the same improvement in the case of the backend using the blocks itself also. It shows that sharing the blocks between the backends was not the win; avoiding the context switch was the major win.

One thing that is slightly unclear is whether there is any overhead due to buffer eviction, especially when the buffer to be evicted is already dirty and needs XLogFlush(). One reason why it might not hurt is that by the time we try to evict the buffer, the corresponding WAL is already flushed by the WALWriter; another possibility could be that even if it is getting done during buffer eviction, the impact of the same is much smaller. Can we try to measure the number of flush calls which happen during buffer eviction?

> 2. Tested the number of ProcSleeps during the run.
> This is a simple script of copying 1 record in one transaction of size 4 bytes.
>
>                    BASE CODE                      PATCH MULTI EXTEND
> Client   Base_TPS   ProcSleep Count   Extend-by-10-Block TPS   ProcSleep Count
> 2        280        457,506           311                      62,641
> 3        235        1,098,701         358                      141,624
> 4        216        1,155,735         368                      188,173
>
> What we can see in the above test is that, in the base code, performance is degrading after 2 threads, while the ProcSleep count is increasing by a huge amount.
>
> Compared to that, in the patch, with extending 10 blocks at a time, ProcSleep reduces to ~1/8 and we can see it is constantly scaling.
> Proc Sleep test for the insert test, when data doesn't fit in shared buffers and we are inserting a big record of 1024 bytes, is currently running;
> once I get the data I will post the same.

Okay. However, if the performance data for the case when data doesn't fit into shared buffers also shows a similar trend, then it might be worth trying to extend with respect to the load on the system. I mean to say we can batch the extension requests (as we have done in ProcArrayGroupClearXid) and extend accordingly; if that works out, then the benefit could be that we don't need any configurable knob for the same.

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
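The batching idea borrowed from ProcArrayGroupClearXid can be sketched roughly as follows. This is a hedged, single-threaded illustration of the lock-free group-request pattern only: the `ExtendRequest` type and function names are invented for this sketch, and the real PostgreSQL code additionally parks non-leader backends on semaphores and distributes the extended blocks back to them.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Each backend pushes its extension request onto a lock-free list; the
 * backend that finds the list empty becomes the leader and extends the
 * relation once for the whole batch. */
typedef struct ExtendRequest
{
    struct ExtendRequest *next;
    int                   nblocks;    /* blocks this backend asked for */
} ExtendRequest;

static _Atomic(ExtendRequest *) extend_head = NULL;

/* Push a request; returns true if the caller became the group leader. */
static bool
extend_group_join(ExtendRequest *req)
{
    ExtendRequest *head = atomic_load(&extend_head);

    do
        req->next = head;
    while (!atomic_compare_exchange_weak(&extend_head, &head, req));

    return head == NULL;    /* list was empty => we lead the group */
}

/* Leader: detach the whole list and compute the batched demand, i.e.
 * how many blocks to extend by in one go. */
static int
extend_group_collect(void)
{
    ExtendRequest *req = atomic_exchange(&extend_head, NULL);
    int total = 0;

    for (; req != NULL; req = req->next)
        total += req->nblocks;
    return total;
}
```

The attraction Amit notes is that the extension amount then tracks actual concurrent demand, so no configurable multi-extend knob is needed.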
Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex
On Fri, Feb 12, 2016 at 9:04 PM, Aleksander Alekseev <a.aleks...@postgrespro.ru> wrote:
> Hello, Robert
>
>> It also strikes me that it's probably quite likely that slock_t
>> mutex[NUM_FREELISTS] is a poor way to lay out this data in memory.
>> For example, on a system where slock_t is just one byte, most likely
>> all of those mutexes are going to be in the same cache line, which
>> means you're going to get a LOT of false sharing. It seems like it
>> would be sensible to define a new struct that contains an slock_t, a
>> long, and a HASHELEMENT *, and then make an array of those structs.
>> That wouldn't completely eliminate false sharing, but it would reduce
>> it quite a bit. My guess is that if you did that, you could reduce
>> the number of freelists to 8 or less and get pretty much the same
>> performance benefit that you're getting right now with 32. And that,
>> too, seems likely to be good for single-client performance.
>
> I experimented for a while trying to fit every spinlock in a separate
> cache line. Indeed we can gain some speedup this way. Here are
> benchmark results on a 12-core server for NUM_LOCK_PARTITIONS = 32 (in
> this case the difference is more notable):
>
> | FREELISTS | SIZE = 32 | SIZE = 64 | SIZE = 128 | SIZE = 256 |
> |-----------|-----------|-----------|------------|------------|
> |     4     |  +25.4%   |  +28.7%   |   +28.4%   |   +28.3%   |
> |     8     |  +28.2%   |  +29.4%   |   +31.7%   |   +31.4%   |
> |    16     |  +29.9%   |  +32.6%   |   +31.6%   |   +30.8%   |
> |    32     |  +33.7%   |  +34.2%   |   +33.6%   |   +32.6%   |
>
> Here SIZE is short for FREELIST_BUFFER_SIZE (part of a hack I used to
> align the FREELIST structure, see the attached patch).

I am not sure if this is exactly what has been suggested by Robert, so it is not straightforward to see if his suggestion can allow us to use NUM_FREELISTS as 8 rather than 32. I think instead of trying to use FREELISTBUFF, why not do it as Robert has suggested and try with different values of NUM_FREELISTS?
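Robert's struct-per-freelist layout can be sketched as below. This is a hedged illustration under stated assumptions: `CACHE_LINE_SIZE`, the union-based padding, and the member types are choices made for this sketch; PostgreSQL's real `slock_t` and `HASHELEMENT` types differ, and the actual line size is platform-dependent.

```c
#include <stddef.h>

/* Bundle each freelist's spinlock, entry counter, and list head into
 * one struct, then pad every element to a full cache line so that
 * adjacent freelists never share a line (reducing false sharing). */
#define CACHE_LINE_SIZE 64          /* assumed line size */

typedef struct FreeListData
{
    volatile int  mutex;            /* stand-in for slock_t */
    long          nentries;         /* number of entries in this freelist */
    void         *freeList;         /* stand-in for HASHELEMENT * chain head */
} FreeListData;

/* The union trick forces sizeof(FreeListPadded) up to the line size
 * without compiler-specific alignment attributes. */
typedef union FreeListPadded
{
    FreeListData data;
    char         pad[CACHE_LINE_SIZE];
} FreeListPadded;
```

With `FreeListPadded freelists[NUM_FREELISTS]`, each element occupies its own line, so one backend spinning on freelist i no longer invalidates the line holding freelist i+1.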
>> I am however wondering if it makes sense to set the freelist affinity based on
>> something other than the hash value, like say the PID, so that the
>> same process doesn't keep switching to a different freelist for every
>> buffer eviction.
>
> Also I tried to use the PID to determine the freelist number:
>
> ```
> #include <sys/types.h>
> #include <unistd.h>
>
> ...
>
> #define FREELIST_IDX(hctl, hashcode) \
>     (IS_PARTITIONED(hctl) ? ((uint32) getpid()) % NUM_FREELISTS : 0)
>
> ...
>
> // now nentries could be negative in this case
> // Assert(FREELIST(hctl, freelist_idx).nentries > 0);
> ```

In which case do you think entries can go negative? I think nentries is incremented and decremented in the same way as without the patch, so I am not getting what can make it go negative.

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
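The two mappings being debated can be put side by side. This is a hedged sketch, not PostgreSQL's dynahash code: the function names are invented here, and the point it illustrates is only that the PID-based index depends on the caller rather than on the entry, so the list charged at remove time need not be the one charged at insert time.

```c
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

#define NUM_FREELISTS 32

/* Hash-based mapping: whoever adds or removes a given entry computes
 * the same index from the entry's hash value, so each list's nentries
 * counter stays consistent. */
static int
freelist_idx_by_hash(uint32_t hashcode)
{
    return (int) (hashcode % NUM_FREELISTS);
}

/* PID-based mapping: the index depends on the calling process, so in a
 * shared hash table the backend removing an entry may decrement a
 * different list's nentries than the one incremented when some other
 * backend inserted it. */
static int
freelist_idx_by_pid(void)
{
    return (int) ((uint32_t) getpid() % NUM_FREELISTS);
}
```

Within a single process the PID-based index is stable, which is presumably why the counters look balanced in a one-backend test; the drift only shows up when different backends insert and remove the same entries.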
Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers
On Mon, Feb 29, 2016 at 11:10 PM, Robert Haas wrote:
> On Fri, Feb 26, 2016 at 11:37 PM, Amit Kapila wrote:
>> On Sat, Feb 27, 2016 at 10:03 AM, Amit Kapila wrote:
>>> Here, we can see that there is a gain of ~15% to ~38% at higher client
>>> count.
>>>
>>> The attached document (perf_write_clogcontrollock_data_v6.ods) contains
>>> data, mainly focusing on single client performance. The data is for
>>> multiple runs on different machines, so I thought it is better to present in
>>> form of document rather than dumping everything in e-mail. Do let me know
>>> if there is any confusion in understanding/interpreting the data.
>>
>> Forgot to mention that all these tests have been done by reverting
>> commit-ac1d794.
>
> OK, that seems better. But I have a question: if we don't really need
> to make this optimization apply only when everything is on the same
> page, then why even try?

This is to handle the case when sub-transactions belonging to a transaction are on different pages, and the reason for the same is that currently I am using the XidCache as stored in each proc to pass the information about subtransactions to TransactionIdSetPageStatusInternal(). Now, if we allow subtransactions from different pages, then I need to extract the subxids from that cache which belong to the page on which we are trying to update the status. This will add a few more cycles in the code path under the ExclusiveLock without any clear benefit; that's why I have not implemented it.
I have explained the same in the code comments as well:

+ * This optimization is only applicable if the transaction and
+ * all child sub-transactions belong to same page which we presume to be the
+ * most common case, we might be able to apply this when they are not on same
+ * page, but that needs us to map sub-transactions in proc's XidCache based
+ * on pageno for which each time Group leader needs to set the transaction
+ * status and that can lead to some performance penalty as well because it
+ * needs to be done after acquiring CLogControlLock, so let's leave that
+ * case for now.

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] POC: Cache data in GetSnapshotData()
On Fri, Feb 26, 2016 at 2:55 PM, Robert Haas wrote:
> On Thu, Feb 25, 2016 at 12:57 PM, Mithun Cy wrote:
> > I have fixed all of the issues reported by the regress test. Also now when
> > a backend tries to cache the snapshot, we also try to store the self-xid and sub
> > xids, so other backends can use them.

+ /* Add my own XID to snapshot. */
+ csnap->xip[count] = MyPgXact->xid;
+ csnap->xcnt = count + 1;

Don't we need to add this only when the xid of the current transaction is valid? Also, I think it will be better if we can explain why we need to add our own transaction id while caching the snapshot.

> > I also did some read-only perf tests.
>
> I'm not sure what you are doing that keeps breaking threading for
> Gmail, but this thread is getting split up for me into multiple
> threads with only a few messages in each. The same seems to be
> happening in the community archives. Please try to figure out what is
> causing that and stop doing it.
>
> I notice you seem not to have implemented this suggestion by Andres:
>
> http://www.postgresql.org//message-id/20160104085845.m5nrypvmmpea5...@alap3.anarazel.de

As far as I can understand by reading the patch, it has kind of already implemented the first suggestion by Andres, which is to use a try lock; the patch is now using atomic ops to implement the same instead of using an lwlock, but I think it should show the same impact. Do you see any problem with the same? Now, talking about the second suggestion, which is to use some form of 'snapshot slots' to reduce the contention further, it seems that could be beneficial if we see any gain with the try lock. If you think that can be a benefit over the current approach taken in the patch, then we can discuss how to implement the same.

> Also, you should test this on a machine with more than 2 cores.
> Andres's original test seems to have been on a 4-core system, where
> this would be more likely to work out.
> Also, Andres suggested testing this on an 80-20 write mix, whereas
> you tested it on 100% read-only. In that case there is no blocking
> ProcArrayLock, which reduces the chances that the patch will benefit.

+1

> Also, you could consider repeating the LWLOCK_STATS testing that Amit
> did in his original reply to Andres. That would tell you whether the
> performance is not increasing because the patch doesn't reduce
> ProcArrayLock contention, or whether the performance is not increasing
> because the patch DOES reduce ProcArrayLock contention but then the
> contention shifts to CLogControlLock or elsewhere.

+1

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
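The xip/xcnt fragment quoted earlier in this message, together with the review comment on it, can be sketched as below. This is a hedged illustration: the `CachedSnapshot` type, the array size, and the helper name are invented for this sketch; PostgreSQL's real snapshot structures and `MyPgXact` handling differ.

```c
#include <stdint.h>
#include <string.h>

typedef uint32_t TransactionId;
#define InvalidTransactionId ((TransactionId) 0)
#define MAX_CACHED_XIDS 64          /* illustrative bound */

typedef struct CachedSnapshot
{
    TransactionId xip[MAX_CACHED_XIDS];
    int           xcnt;
} CachedSnapshot;

/* When a backend publishes a cached snapshot, it appends its own XID:
 * GetSnapshotData normally skips the caller's own entry, but other
 * backends reusing the cache must still see that transaction as
 * in progress.  Per the review comment, the XID is appended only when
 * it is actually assigned (valid). */
static void
cache_snapshot(CachedSnapshot *csnap,
               const TransactionId *xip, int count,
               TransactionId my_xid)
{
    memcpy(csnap->xip, xip, count * sizeof(TransactionId));

    if (my_xid != InvalidTransactionId)
        csnap->xip[count++] = my_xid;   /* add my own XID to the snapshot */

    csnap->xcnt = count;
}
```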
Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex
On Tue, Mar 1, 2016 at 8:13 PM, Aleksander Alekseev <a.aleks...@postgrespro.ru> wrote:
> Hello, Amit
>
>> I am not sure if this is exactly what has been suggested by Robert,
>> so it is not straightforward to see if his suggestion can allow us to
>> use NUM_FREELISTS as 8 rather than 32. I think instead of trying to
>> use FREELISTBUFF, why not do it as Robert has suggested and try with
>> different values of NUM_FREELISTS?
>
> Well, what I did is in fact a bit more general solution than Robert
> suggested. At first I just joined the freeList and mutex arrays into one
> array and tried different NUM_FREELISTS values (see the FREELISTS column).
> That was Robert's idea. Unfortunately it didn't give any considerable
> speedup compared with the previous approach.

Okay, now I can understand the data better, and I think your tests indicate that it is better to keep NUM_FREELISTS as 32.

> Then I tried to manually control the sizeof every array item (see
> different SIZE values). By making it larger we can put every array item
> into a separate cache line. This approach helped a bit in terms of TPS
> (before: +33.9%, after: +34.2%) but only if NUM_FREELISTS remains the
> same (32).
>
> So answering your question - it turned out that we _can't_ reduce
> NUM_FREELISTS this way.
>
> Also I don't believe that +0.3% TPS according to a synthetic benchmark
> that doesn't represent any possible workload is worth it considering the
> additional code complexity.

I think the data structure HASHHDR looks simpler; however, the code looks more complex, especially with the addition of FREELISTBUFF for cache-line purposes. I think you might want to try exactly what has been suggested and see if the code looks better that way, but if you are not convinced, then let's leave it to the committer unless somebody else wants to try that suggestion.

>> In which case, do you think entries can go negative?
>> I think the
>> nentries is incremented and decremented in the way as without patch,
>> so I am not getting what can make it go negative.
>
> In this context we are discussing a quick and dirty fix "what would
> happen if FREELIST_IDX would be implemented as getpid() % NUM_FREELISTS
> instead of hash % NUM_FREELISTS". The code is:
>
> int freelist_idx = FREELIST_IDX(hctl, hashvalue);
>
> // ...
>
> switch (action)
> {
>     // ...
>
>     case HASH_REMOVE:
>         if (currBucket != NULL)
>         {
>             if (IS_PARTITIONED(hctl))
>                 SpinLockAcquire(&(FREELIST(hctl, freelist_idx).mutex));
>
>             // Assert(FREELIST(hctl, freelist_idx).nentries > 0);
>             FREELIST(hctl, freelist_idx).nentries--;
>
>             /* remove record from hash bucket's chain. */
>             *prevBucketPtr = currBucket->link;
>
> // ...
>
> No matter what freelist was used when the element was added to the hash table,
> we always try to return free memory to the same freelist number getpid()
> % NUM_FREELISTS and decrease the number of elements in the hash table using
> the corresponding nentries field.

Won't it always use the same freelist to add and remove the entry, since for both cases it will calculate freelist_idx in the same way?

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] RFC: replace pg_stat_activity.waiting with something more descriptive
On Wed, Feb 24, 2016 at 7:14 PM, Robert Haas wrote:
> On Mon, Feb 22, 2016 at 10:05 AM, Amit Kapila wrote:
>>> I wouldn't bother tinkering with it at this point. The value isn't
>>> going to be recorded on disk anywhere, so it will be easy to change
>>> the way it's computed in the future if we ever need to do that.
>>
>> Okay. Find the rebased patch attached with this mail. I have moved
>> this patch to the upcoming CF.
>
> I would call the functions pgstat_report_wait_start() and
> pgstat_report_wait_end() instead of pgstat_report_start_waiting() and
> pgstat_report_end_waiting().

Changed as per suggestion and made these functions inline.

> I think pgstat_get_wait_event_type should not return HWLock, a term
> that appears nowhere in our source tree at present. How about just
> "Lock"?

Changed as per suggestion.

> I think the wait event types should be documented - and the wait
> events too, perhaps.

As discussed upthread, I have added documentation for all the possible wait events and an example. Some of the LWLocks, like AsyncQueueLock and AsyncCtlLock, are used for quite similar purposes, so I have kept their explanations the same.

> Maybe it's worth having separate wait event type names for lwlocks and
> lwlock tranches. We could report LWLockNamed and LWLockTranche and
> document the difference: "LWLockNamed indicates that the backend is
> waiting for a specific, named LWLock. The event name is the name of
> that lock. LWLockTranche indicates that the backend is waiting for
> any one of a group of locks with similar function. The event name
> identifies the general purpose of locks in that group."

Changed as per suggestion.

> There's no requirement that every session have every tranche
> registered. I think we should consider displaying "extension" for any
> tranche that's not built-in, or at least for tranches that are not
> registered (rather than "unknown wait event").
> + if (lock->tranche == 0 && lockId < NUM_INDIVIDUAL_LWLOCKS)
>
> Isn't the second part automatically true at this point?

Fixed.

> The changes to LockBufferForCleanup() don't look right to me. Waiting
> for a buffer pin might be a reasonable thing to define as a wait
> event, but it shouldn't be reported as if we were waiting on the LWLock
> itself.

As discussed upthread, added a new wait event BufferPin for this case.

> What happens if an error is thrown while we're in a wait?

As discussed upthread, added in AbortTransaction and from wherever LWLockReleaseAll() gets called; the point to note is that we can call this function only when we are sure there is no further possibility of a wait on an LWLock.

> Does this patch hurt performance?

Performance tests are underway.

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com

extend_pg_stat_activity_v11.patch Description: Binary data
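The reporting protocol under discussion, including the error-path concern Robert raised, can be sketched as below. This is a hedged model only: the real patch writes the wait information into shared memory (the backend's status slot), whereas a plain variable stands in here, and `wait_state_abort_cleanup` is an invented name for the cleanup that the abort path must perform.

```c
#include <stdint.h>

static uint32_t current_wait_event = 0;     /* 0 = not waiting */

/* Advertise the wait event just before blocking. */
static inline void
pgstat_report_wait_start(uint32_t wait_event_info)
{
    current_wait_event = wait_event_info;
}

/* Clear the advertised state once the wait is over. */
static inline void
pgstat_report_wait_end(void)
{
    current_wait_event = 0;
}

/* If an error is thrown mid-wait, pgstat_report_wait_end() is never
 * reached on the normal path, so the abort path must clear the state;
 * per the message above, this is safe only once no further LWLock wait
 * is possible (i.e. after LWLockReleaseAll). */
static void
wait_state_abort_cleanup(void)
{
    pgstat_report_wait_end();
}
```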
Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex
On Thu, Mar 3, 2016 at 1:50 PM, Aleksander Alekseev <a.aleks...@postgrespro.ru> wrote:
>> Won't it always use the same freelist to remove and add the entry from
>> freelist as for both cases it will calculate the freelist_idx in same
>> way?
>
> No. If "our" freelist is empty when we try to remove an item from it, we
> borrow an item from another freelist.

I think the way the patch works is that if the indicated freelist is empty, then it tries to allocate new elements in that list, and if it can't allocate, then it tries to borrow from another freelist; in both cases the element to be removed from the freelist is considered to be an element of the indicated freelist (basically it always increments nentries[freelist_idx]).

> Then this borrowed item will be
> returned to "our" freelist instead of the original. Without some sort of
> additional logic there is no way to figure out which freelist was
> the original.

As the patch always operates on nentries[freelist_idx], it seems to me that in both cases (remove, add) the element is considered to belong to the same freelist. Have you faced this problem of negative entries in any of your tests? If so, please share the test, so that I can also understand the scenario.

> Generally speaking, a negative nentries value is not something that
> couldn't be solved. But I would like to remind that in this context we
> are discussing a quick and dirty solution created for benchmark
> purposes in the first place.

I thought negative entries could be a problem for the patch, as indicated by Robert[1], but maybe I am missing something.

>> If you are not convinced, then let's leave it to the committer unless
>> somebody else wants to try that suggestion.
>
> Agree. Frankly, I'm tired of rewriting this patch over and over and over
> again. So I would like to avoid rewriting it once again unless there is
> a clear indication that this way we would gain something. Benchmarks
> show that this is not the case, thus it's only a matter of taste and
> intuition.
I can understand your feeling here and agree with you that it is a matter of taste. However, many times, if we change the patch according to the committer's taste, the chances of the patch getting accepted early are better; but I am not sure if that applies here, so feel free to go in the way you think is better.

[1] - http://www.postgresql.org/message-id/CA+TgmoacVsdcY=qn0do7nok7w2-ssqz3kr2y84bavifckqd...@mail.gmail.com

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] ExecGather() + nworkers
On Fri, Mar 4, 2016 at 11:57 AM, Haribabu Kommi wrote: > On Wed, Jan 13, 2016 at 7:19 PM, Amit Kapila > wrote: > >> > > > > Changed the code such that nworkers_launched gets used wherever > > appropriate instead of nworkers. This includes places other than > > pointed out above. > > The changes of the patch are simple optimizations that are trivial. > I didn't find any problem regarding the changes. I think the same > optimization is required in "ExecParallelFinish" function also. > > There is already one such change for ExecParallelFinish() in the patch:

@@ -492,7 +492,7 @@ ExecParallelFinish(ParallelExecutorInfo *pei)
  WaitForParallelWorkersToFinish(pei->pcxt);

  /* Next, accumulate buffer usage. */
- for (i = 0; i < pei->pcxt->nworkers; ++i)
+ for (i = 0; i < pei->pcxt->nworkers_launched; ++i)
  InstrAccumParallelQuery(&pei->buffer_usage[i]);

Can you be slightly more specific about where exactly you are expecting more changes? With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
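The point of the one-line change quoted above can be sketched in isolation (hypothetical, simplified types; not the actual executor code): per-worker instrumentation slots are only meaningful for workers that actually launched, so accumulation loops must stop at nworkers_launched rather than nworkers:

```c
#include <assert.h>

#define MAX_WORKERS 8

typedef struct
{
    int  nworkers;          /* workers planned */
    int  nworkers_launched; /* workers actually started */
    long buffer_usage[MAX_WORKERS]; /* valid only up to nworkers_launched */
} FakeParallelContext;

/* Sum buffer usage over launched workers only; reading slots beyond
 * nworkers_launched would pick up uninitialized or stale counters. */
static long accum_buffer_usage(const FakeParallelContext *ctx)
{
    long total = 0;
    for (int i = 0; i < ctx->nworkers_launched; ++i)
        total += ctx->buffer_usage[i];
    return total;
}
```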
Re: [HACKERS] ExecGather() + nworkers
On Fri, Mar 4, 2016 at 5:21 PM, Haribabu Kommi wrote: > On Fri, Mar 4, 2016 at 10:33 PM, Amit Kapila > wrote: > > On Fri, Mar 4, 2016 at 11:57 AM, Haribabu Kommi < > kommi.harib...@gmail.com> > > wrote: > >> > >> On Wed, Jan 13, 2016 at 7:19 PM, Amit Kapila > >> wrote: > >> >> > >> > > >> > Changed the code such that nworkers_launched gets used wherever > >> > appropriate instead of nworkers. This includes places other than > >> > pointed out above. > >> > >> The changes of the patch are simple optimizations that are trivial. > >> I didn't find any problem regarding the changes. I think the same > >> optimization is required in "ExecParallelFinish" function also. > >> > > > > There is already one change as below for ExecParallelFinish() in patch. > > > > @@ -492,7 +492,7 @@ ExecParallelFinish(ParallelExecutorInfo *pei) > > > > WaitForParallelWorkersToFinish(pei->pcxt); > > > > > > > > /* Next, accumulate buffer usage. */ > > > > - for (i = 0; i < pei->pcxt->nworkers; ++i) > > > > + for (i = 0; i < pei->pcxt->nworkers_launched; ++i) > > > > InstrAccumParallelQuery(&pei->buffer_usage[i]); > > > > > > Can you be slightly more specific, where exactly you are expecting more > > changes? > > I missed it during the comparison with existing code and patch. > Everything is fine with the patch. I marked the patch as ready for > committer. > > Thanks! With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] RFC: replace pg_stat_activity.waiting with something more descriptive
On Fri, Mar 4, 2016 at 4:01 PM, Alexander Korotkov wrote: > On Fri, Mar 4, 2016 at 7:05 AM, Amit Kapila > wrote: > >> > I think the wait event types should be documented - and the wait >> > events too, perhaps. >> > >> >> As discussed upthread, I have added documentation for all the possible >> wait events and an example. Some of the LWLocks like AsyncQueueLock and >> AsyncCtlLock are used for quite similar purposes, so I have kept their >> explanations the same. >> > > Do you think it worth grouping rows in "wait_event Description" table by > wait event type? > They are already grouped (implicitly). Do you mean that we should add the wait event type name in that table as well? If yes, then the only slight worry is that there will be a lot of repetition in the wait_event_type column; otherwise it is okay. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] ExecGather() + nworkers
On Fri, Mar 4, 2016 at 11:41 PM, Robert Haas wrote: > On Fri, Mar 4, 2016 at 6:55 AM, Amit Kapila > wrote: > > On Fri, Mar 4, 2016 at 5:21 PM, Haribabu Kommi > > > wrote: > >> > >> On Fri, Mar 4, 2016 at 10:33 PM, Amit Kapila > >> wrote: > >> > On Fri, Mar 4, 2016 at 11:57 AM, Haribabu Kommi > >> > > >> > wrote: > >> >> > >> >> On Wed, Jan 13, 2016 at 7:19 PM, Amit Kapila < > amit.kapil...@gmail.com> > >> >> wrote: > >> >> >> > >> >> > > >> >> > Changed the code such that nworkers_launched gets used wherever > >> >> > appropriate instead of nworkers. This includes places other than > >> >> > pointed out above. > >> >> > >> >> The changes of the patch are simple optimizations that are trivial. > >> >> I didn't find any problem regarding the changes. I think the same > >> >> optimization is required in "ExecParallelFinish" function also. > >> >> > >> > > >> > There is already one change as below for ExecParallelFinish() in > patch. > >> > > >> > @@ -492,7 +492,7 @@ ExecParallelFinish(ParallelExecutorInfo *pei) > >> > > >> > WaitForParallelWorkersToFinish(pei->pcxt); > >> > > >> > > >> > > >> > /* Next, accumulate buffer usage. */ > >> > > >> > - for (i = 0; i < pei->pcxt->nworkers; ++i) > >> > > >> > + for (i = 0; i < pei->pcxt->nworkers_launched; ++i) > >> > > >> > InstrAccumParallelQuery(&pei->buffer_usage[i]); > >> > > >> > > >> > Can you be slightly more specific, where exactly you are expecting > more > >> > changes? > >> > >> I missed it during the comparison with existing code and patch. > >> Everything is fine with the patch. I marked the patch as ready for > >> committer. > >> > > > > Thanks! > > OK, committed. > > Thanks. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] RFC: replace pg_stat_activity.waiting with something more descriptive
On Fri, Mar 4, 2016 at 7:23 PM, Thom Brown wrote: > > On 4 March 2016 at 13:41, Alexander Korotkov wrote: > > > >> > >> If yes, then the only slight worry is that there will lot of repetition in > >> wait_event_type column, otherwise it is okay. > > > > > > There is morerows attribute of entry tag in Docbook SGML, it behaves like > > rowspan in HTML. > > +1 > I will try to use morerows in the documentation. > Yes, we do this elsewhere in the docs. And it is difficult to look > through the wait event names at the moment. > > I'm also not keen on all the "A server process is waiting" all the way > down the list. > How about naming the column "Wait For" instead of "Description" and then using text like "finding or allocating space in shared memory"? With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
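For reference, a hypothetical DocBook fragment (the wait event names and wording here are illustrative only, not the patch's actual table) showing how the morerows attribute lets one wait_event_type cell span several wait_event rows:

```xml
<row>
 <!-- this cell spans this row and the two rows that follow -->
 <entry morerows="2"><literal>LWLockNamed</literal></entry>
 <entry><literal>ShmemIndexLock</literal></entry>
 <entry>Waiting to find or allocate space in shared memory.</entry>
</row>
<row>
 <entry><literal>OidGenLock</literal></entry>
 <entry>Waiting to allocate or assign an OID.</entry>
</row>
<row>
 <entry><literal>XidGenLock</literal></entry>
 <entry>Waiting to allocate or assign a transaction id.</entry>
</row>
```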
Re: [HACKERS] Relation extension scalability
On Fri, Mar 4, 2016 at 9:59 PM, Robert Haas wrote: > > On Fri, Mar 4, 2016 at 12:06 AM, Dilip Kumar wrote: > > I have tried the approach of group extend, > > > > 1. We convert the extension lock to TryLock and if we get the lock then > > extend by one block. > > 2. If we don't get the lock then use the group leader concept, where only one > > process will extend for all. A slight change from ProcArrayGroupClear is that > > here, other than satisfying the requested backend, we add some extra blocks in > > FSM, say GroupSize*10. > > 3. So we cannot measure the exact load, but we still have some factor like > > group size that tells us the contention size, and we extend in multiples of > > that. > > This approach seems good to me, and the performance results look very > positive. The nice thing about this is that there is not a > user-configurable knob; the system automatically determines when > larger extensions are needed, which will mean that real-world users > are much more likely to benefit from this. > I think one thing which needs more thought about this approach is that we need to maintain some number of slots so that group extend for different relations can happen in parallel. Do we want to provide simultaneous extension for 1, 2, 3, 4, 5, or more relations? I think providing it for three or four relations should be okay, as the higher the number we want to provide, the bigger the PGPROC structure will be.

+GroupExtendRelation(PGPROC *proc, Relation relation, BulkInsertState bistate)
+{
+ volatile PROC_HDR *procglobal = ProcGlobal;
+ uint32 nextidx;
+ uint32 wakeidx;
+ int extraWaits = -1;
+ BlockNumber targetBlock;
+ int count = 0;
+
+ /* Add ourselves to the list of processes needing a group extend. */
+ proc->groupExtendMember = true;
..
..
+ /* We are the leader. Acquire the lock on behalf of everyone. */
+ LockRelationForExtension(relation, ExclusiveLock);

To provide it for multiple relations, I think you need to advertise the reloid of the relation in each proc and then get the relation descriptor for the relation extension lock. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] WIP: Upper planner pathification
On Fri, Mar 4, 2016 at 11:31 PM, Tom Lane wrote: > > OK, here is a version that I think addresses all of the recent comments: > > * Fixed handling of parallel-query fields in new path node types. > (BTW, I found what seemed to be a couple of pre-existing bugs of > the same kind, eg create_mergejoin_path was different from the > other two kinds of join as to setting parallel_degree.) > I think the reason for keeping parallel_degree as zero for the mergejoin path is that currently it can't participate in parallelism.

*** create_unique_path(PlannerInfo *root, Re
*** 1440,1446 ****
  pathnode->path.param_info = subpath->param_info;
  pathnode->path.parallel_aware = false;
  pathnode->path.parallel_safe = subpath->parallel_safe;
! pathnode->path.parallel_degree = 0;

  /*
   * Assume the output is unsorted, since we don't necessarily have pathkeys
--- 1445,1451 ----
  pathnode->path.param_info = subpath->param_info;
  pathnode->path.parallel_aware = false;
  pathnode->path.parallel_safe = subpath->parallel_safe;
! pathnode->path.parallel_degree = subpath->parallel_degree;

For the same reason as the merge join path, I think this should also be set to 0. Similarly for LimitPath, parallel_degree should be set to 0.

+ RecursiveUnionPath *
+ create_recursiveunion_path(PlannerInfo *root,
+                            RelOptInfo *rel,
+                            Path *leftpath,
+                            Path *rightpath,
+                            PathTarget *target,
+                            List *distinctList,
+                            int wtParam,
+                            double numGroups)
+ {
+     RecursiveUnionPath *pathnode = makeNode(RecursiveUnionPath);
+
+     pathnode->path.pathtype = T_RecursiveUnion;
+     pathnode->path.parent = rel;
+     pathnode->path.pathtarget = target;
+     /* For now, assume we are above any joins, so no parameterization */
+     pathnode->path.param_info = NULL;
+     pathnode->path.parallel_aware = false;
+     pathnode->path.parallel_safe =
+         leftpath->parallel_safe && rightpath->parallel_safe;

I think here we should use rel->consider_parallel to set parallel_safe, as is done in create_mergejoin_path.

+ * It's only needed atop a node that doesn't support projection

"needed atop a node" seems unclear to me; typo? With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] WIP: Upper planner pathification
On Sat, Mar 5, 2016 at 4:51 PM, Amit Kapila wrote: > > On Fri, Mar 4, 2016 at 11:31 PM, Tom Lane wrote: > > > > OK, here is a version that I think addresses all of the recent comments: > > > > * Fixed handling of parallel-query fields in new path node types. > > (BTW, I found what seemed to be a couple of pre-existing bugs of > > the same kind, eg create_mergejoin_path was different from the > > other two kinds of join as to setting parallel_degree.) > > > > I think the reason for keeping parallel_degree as zero for mergejoin path is that currently it can't participate in parallelism. > > > *** create_unique_path(PlannerInfo *root, Re > *** 1440,1446 > pathnode->path.param_info = subpath->param_info; > pathnode->path.parallel_aware = false; > pathnode->path.parallel_safe = subpath->parallel_safe; > ! pathnode->path.parallel_degree = 0; > > /* > * Assume the output is unsorted, since we don't necessarily have pathkeys > --- 1445,1451 > pathnode->path.param_info = subpath->param_info; > pathnode->path.parallel_aware = false; > pathnode->path.parallel_safe = subpath->parallel_safe; > ! pathnode->path.parallel_degree = subpath->parallel_degree; > > Similar to reason for merge join path, I think this should also be set to 0. > > Similarly for LimitPath, parallel_degree should be set to 0. > Oops, it seems Robert commented upthread that we should set parallel_degree from the subpath except for write paths, so I think the above comment can be ignored. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] WIP: Upper planner pathification
On Sat, Mar 5, 2016 at 10:11 PM, Tom Lane wrote: > > Amit Kapila writes: > > On Fri, Mar 4, 2016 at 11:31 PM, Tom Lane wrote: > >> (BTW, I found what seemed to be a couple of pre-existing bugs of > >> the same kind, eg create_mergejoin_path was different from the > >> other two kinds of join as to setting parallel_degree.) > > > I think the reason for keeping parallel_degree as zero for mergejoin path > > is that currently it can't participate in parallelism. > > Is there some reason why hash and nestloop are safe but merge isn't? > I think it is because we consider pushing hash and nestloop down to workers, but not merge join; the reason for not pushing merge join is that we currently don't have executor support for it, so more work is needed there. I think setting parallel_degree for the merge join path as you are doing in the patch is harmless, but ideally there is no need to set it, given what we support today in terms of parallelism. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] WIP: Upper planner pathification
On Sun, Mar 6, 2016 at 9:02 PM, Tom Lane wrote: > > Amit Kapila writes: > > On Sat, Mar 5, 2016 at 10:11 PM, Tom Lane wrote: > >> Is there some reason why hash and nestloop are safe but merge isn't? > > > I think it is because we consider to pushdown hash and nestloop to workers, > > but not merge join and the reason for not pushing mergejoin is that > > currently we don't have executor support for same, more work is needed > > there. > > If that's true, then mergejoin paths ought to be marked parallel-unsafe > explicitly (with a comment as to why), not just silently reduced to degree > zero in a manner that looks more like an oversight than anything > intentional. > > I also note that the regression tests pass with this patch and parallel > mode forced, which seems unlikely if allowing a parallel worker to execute > a join works for only two out of the three join types. And checking the > git history for nodeHashjoin.c, nodeHash.c, and nodeNestloop.c shows no > evidence that any of those files have been touched for parallel query, > so it's pretty hard to see a reason why those would work in parallel > queries but nodeMergejoin.c not. > To make hash and nestloop work in parallel queries, we just push those nodes below the Gather node. Refer to the code paths match_unsorted_outer()->consider_parallel_nestloop() and hash_inner_and_outer()->try_partial_hashjoin_path(). Once the partial paths for hash and nestloop are generated in the above code paths, we generate a Gather path in add_paths_to_joinrel()->generate_gather_paths(). You won't find code to generate a partial path for merge join. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] WIP: Upper planner pathification
On Mon, Mar 7, 2016 at 9:13 AM, Tom Lane wrote: > > Amit Kapila writes: > >>>> Is there some reason why hash and nestloop are safe but merge isn't? > > > To make hash and nestloop work in parallel queries, we just push those > > nodes below gather node. Refer code > > paths match_unsorted_outer()->consider_parallel_nestloop() > > and hash_inner_and_outer()->try_partial_hashjoin_path(). > > AFAICS, those are about generating partial paths, which is a completely > different thing from whether a regular path is parallel-safe or not. > Okay, but the main point I wanted to convey is that setting parallel_degree = 0 in the mergejoin path is not necessarily a copy-paste error. If you look at other path generation code like create_index_path(), create_samplescan_path(), etc., we set parallel_degree to zero there even though the parallel-safety is determined based on the RelOptInfo. And I don't see any use in setting parallel_degree for a path which can't be pushed beneath a Gather (i.e., executed by multiple workers). With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] ExecGather() + nworkers
On Mon, Jan 11, 2016 at 3:14 AM, Peter Geoghegan wrote: > > On Sun, Jan 10, 2016 at 9:13 AM, Robert Haas wrote: > > Now, you might wonder why it is that the leader cannot also sort runs, > just as a worker would. It's possible, but it isn't exactly > straightforward. You have to have special cases in several places, > even though it probably is going to be uncommon to only have one > BackgroundWorkerSlot available in practice. It's simpler to just > opt-out, and seems better given that max_parallel_degree is a way of > resource limiting based on available cores (it's certainly not about > the availability of shared memory for the BackgroundWorkerSlot array). > > More importantly, I have other, entirely general concerns. Other major > RDBMSs have settings that are very similar to max_parallel_degree, > with a setting of 1 effectively disabling all parallelism. Both Oracle > and SQL Server have settings that they both call the "maximum degree > of parallelism". I think it's a bit odd that with Postgres, > max_parallel_degree = 1 can still use parallelism at all. I have to > wonder: are we conflating controlling the resources used by parallel > operations with how shared memory is doled out? > Your point is genuine, but OTOH, if max_parallel_degree = 1 meant that parallelism is disabled, then setting max_parallel_degree = 2 would mean only one worker process can be used for a parallel query, which looks somewhat odd to me. Also, a parallel query would then be able to get parallel workers up to max_worker_processes + 1, which seems less intuitive than the current behaviour. On your point about other databases, I checked and some of them, like Sybase [1], also provide a similar parameter where a value of 1 means serial execution, so we might want to consider treating it similarly; to me, though, the current setting seems quite intuitive. However, if more people feel the same as you, then we should change it. 
[1] - http://infocenter.sybase.com/archive/index.jsp?topic=/com.sybase.help.ase_15.0.sag1/html/sag1/sag1234.htm With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] WIP: Upper planner pathification
On Mon, Mar 7, 2016 at 11:52 AM, Peter Geoghegan wrote: > > On Sun, Mar 6, 2016 at 9:59 PM, Tom Lane wrote: > > Perhaps it was intentional when written, but if Robert's advice is correct > > that the new upper-planner path nodes should copy up parallel_degree from > > their children, then it cannot be the case that parallel_degree>0 in a > > node above the scan level implies that that node type has any special > > behavior for parallelism. > > Right. > > > I continue to bemoan the lack of documentation about what these fields > > mean. > > Point taken; if Robert doesn't feel otherwise, I can try to write a patch to explain the newly added fields. > > As far as I can find, the sum total of the documentation about > > this field is > > > > int parallel_degree; /* desired parallel degree; 0 = not parallel */ > > While it doesn't particularly relate to parallel joins, I've expressed > a general concern about the max_parallel_degree GUC that I think is > worth considering again: > > http://www.postgresql.org/message-id/cam3swzrs1mtvrkkasy1xbshgzxkd6-hnxx3gq7x-p-dz0zt...@mail.gmail.com > > In summary, I think it's surprising that a max_parallel_degree of 1 > doesn't disable parallel workers entirely. > I have responded with my thoughts on the thread where you raised that point; discussing it here would dilute the purpose of this thread. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Relation extension scalability
On Mon, Mar 7, 2016 at 8:34 PM, Robert Haas wrote: > > On Fri, Mar 4, 2016 at 11:49 PM, Amit Kapila wrote: > > I think one thing which needs more thoughts about this approach is that we > > need to maintain some number of slots so that group extend for different > > relations can happen in parallel. Do we want to provide simultaneous > > extension for 1, 2, 3, 4, 5 or more number of relations? I think providing > > it for three or four relations should be okay as higher the number we want > > to provide, bigger the size of PGPROC structure will be. > > Hmm. Can we drive this off of the heavyweight lock manager's idea of > how big the relation extension lock wait queue is, instead of adding > more stuff to PGPROC? > One idea to make it work without adding additional stuff in PGPROC: after acquiring the relation extension lock, check whether there is any available block in the FSM. If a block is found, release the lock and proceed; otherwise, extend the relation by one block, then check the lock's wait queue size or the number of lock requests (nRequested), extend the relation further in proportion to the wait queue size, and then release the lock and proceed. I think we could even check the wait queue size before extending the relation by the first block. The benefit of doing it with PGPROC is that there will be relatively fewer LockAcquire calls than with the heavyweight lock approach, but I think that should not matter much because we are planning to extend the relation in proportion to the wait queue size (probably wait queue size * 10). With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
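As a minimal sketch of the batching heuristic described above (the function name and the multiplier of 10 are assumptions taken from this discussion, not actual patch code): once the backend holds the extension lock and the FSM turns up empty, it extends by one block for itself plus a batch proportional to the lock's wait queue size:

```c
#include <assert.h>

/* Hypothetical multiplier, from the "wait queue size * 10" suggestion
 * in this thread. */
#define EXTRA_BLOCKS_PER_WAITER 10

/* Compute how many blocks to add: one for the extending backend itself,
 * plus extra blocks in proportion to the number of backends queued
 * behind the relation extension lock. */
static int blocks_to_extend(int lock_waiters)
{
    return 1 + lock_waiters * EXTRA_BLOCKS_PER_WAITER;
}
```

With no waiters this degenerates to the current one-block-at-a-time behaviour, and under contention the batch grows automatically, which is the self-tuning property Robert liked about the group-extend approach.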
Re: [HACKERS] Relation extension scalability
On Tue, Mar 8, 2016 at 7:23 PM, Robert Haas wrote: > > On Tue, Mar 8, 2016 at 4:27 AM, Amit Kapila wrote: > >> Hmm. Can we drive this off of the heavyweight lock manager's idea of > >> how big the relation extension lock wait queue is, instead of adding > >> more stuff to PGPROC? > > > > One idea to make it work without adding additional stuff in PGPROC is that > > after acquiring relation extension lock, check if there is any available > > block in fsm, if it founds any block, then release the lock and proceed, > > else extend the relation by one block and then check lock's wait queue size > > or number of lock requests (nRequested) and extend the relation further in > > proportion to wait queue size and then release the lock and proceed. Here, > > I think we can check for wait queue size even before extending the relation > > by one block. > > > > The benefit of doing it with PGPROC is that there will be relatively less > > number LockAcquire calls as compare to heavyweight lock approach, which I > > think should not matter much because we are planing to extend the relation > > in proportion to wait queue size (probably wait queue size * 10). > > I don't think switching relation extension from heavyweight locks to > lightweight locks is going to work. > Sorry, but I am not suggesting changing it to lightweight locks. I am just suggesting how to make batching work with heavyweight locks, as you asked. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
[HACKERS] Re: [HACKERS] Re: [HACKERS] Re: [HACKERS] Windows service is not starting so there’s message in log: FATAL: "could not create shared memory segment “Global/PostgreSQL.851401618”: Permission
On Wed, Mar 9, 2016 at 11:46 AM, Haribabu Kommi wrote: > > > I tried replacing the random() with PostmasterRandom() for a test and it worked. > This is generating different random values, so the issue is not occurring. > > "Global/PostgreSQL.2115609797" > > I feel we should add the data directory path + the random number to > generate the name for dynamic shared memory; this can fix the problem. > As mentioned above, I think investigating why this error is generated will be helpful. Currently the code ensures that if the segment already exists, it retries creating a segment with another name (refer dsm_impl_windows()), so the point of investigation is: why is it not going via that path? I am guessing that for some reason CreateFileMapping() is returning NULL in this case, whereas ideally it should return the existing handle with the error ERROR_ALREADY_EXISTS. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] WIP: Upper planner pathification
On Tue, Mar 8, 2016 at 2:31 AM, Tom Lane wrote: > > I wrote: > >> Attached is a version that addresses today's concerns, and also finishes > >> filling in the loose ends I'd left before, such as documentation and > >> outfuncs.c support. I think this is in a committable state now, though > >> I plan to read through the whole thing again. > > The extra read-through located some minor bugs, mainly places where I'd > forgotten to ensure that Path cost info was transposed into the generated > Plan. That would only have the cosmetic effect that EXPLAIN would print > zeroes for estimated costs, and since we only use EXPLAIN COSTS OFF in > the regression tests, no test failures ensued :-(. > > I've pushed it now; we'll soon see if the buildfarm finds any problems. > On latest commit 51c0f63e, I am seeing some issues w.r.t. parallel query. Consider the case below:

create table t1(c1 int, c2 char(1000));
insert into t1 values(generate_series(1,30),'');
analyze t1;
set max_parallel_degree=2;

postgres=# explain select c1, count(c1) from t1 where c1 < 1000 group by c1;
ERROR: ORDER/GROUP BY expression not found in targetlist

Without setting max_parallel_degree, it works fine and generates the appropriate results. The issue seems to be that the code in grouping_planner doesn't apply the required PathTarget to the Path below the Gather path. Because of that, when we generate the target list for the scan plan, it misses the required information (in this case, sortgrouprefs), and the same target list is then propagated to upper nodes, which eventually leads to the above failure. For the same reason, if the target list contains an expression, it will give wrong results when a parallel query is used. I can see the following ways to solve this issue. Approach-1 - The first way to solve this issue is to jam the PathTarget into partial paths, similar to what we are doing for Paths, and I have verified that this resolves the issue; a patch for it is attached to this mail. 
However, it won't work as-is, because it would push target lists to workers (since we have applied them below the Gather path), which we don't want if the target list has any parallel-unsafe functions. To make this approach work, we need to ensure that we jam the PathTarget into partial paths (or generate a projection) only if they contain parallel-safe expressions. Then, while creating the Gather plan (create_gather_plan()), we need to ensure in some way that if the path below doesn't generate the required tlist, the Gather plan generates its own tlist rather than using the one from the subplan. Approach-2 - Always generate a tlist in the Gather plan, as we do in many other cases. I think this approach would also resolve the issue, but I haven't tried it yet. Approach-1 has the benefit that it allows target lists to be pushed to workers, which will speed up parallel queries, especially when the target list contains costly expressions, so I am slightly inclined to follow it even though it looks like more work. Thoughts? With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com apply_tlist_partial_path_v1.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] RFC: replace pg_stat_activity.waiting with something more descriptive
On Fri, Mar 4, 2016 at 7:11 PM, Alexander Korotkov wrote: > >> >> If yes, then the only slight worry is that there will lot of repetition in wait_event_type column, otherwise it is okay. > > > There is morerows attribute of entry tag in Docbook SGML, it behaves like rowspan in HTML. > Thanks for the suggestion. I have updated the patch to include the wait_event_type information in the wait_event table. As asked above by Robert, below is the performance data with the patch.

M/C Details
--
IBM POWER-8
24 cores, 192 hardware threads
RAM = 492GB

Performance Data
min_wal_size=15GB
max_wal_size=20GB
checkpoint_timeout=15min
maintenance_work_mem = 1GB
checkpoint_completion_target = 0.9

pgbench read-only (median of 3, 5-min runs)
clients  BASE           PATCH           %
1        19703.549206   19992.141542    1.4646718364
8        120105.542849  127717.835367   6.3380026745
64       487334.338764  495861.7211254  1.7498012521

The read-only data shows some improvement with the patch, but I think this is mostly attributable to run-to-run variation.

pgbench read-write (median of 3, 30-min runs)
clients  BASE          PATCH          %
1        1703.275728   1696.568881    -0.3937616729
8        8884.406185   9442.387472    6.2804567394
64       32648.82798   32113.002416   -1.6411785572

In the above data, the read-write results show a small regression (1.6%) at the higher client count, but when I ran that test individually, the difference was 0.5%. I think it is mostly attributable to run-to-run variation, as we see with the read-only tests. Thanks to Mithun C Y for doing the performance testing of this patch. As this patch adds a 4-byte variable to the shared memory structure PGPROC, it is susceptible to memory alignment issues for shared buffers as discussed in thread [1], but in general the performance data doesn't indicate any regression. [1] - http://www.postgresql.org/message-id/caa4ek1+zeb8pmwwktf+3brs0pt4ux6rs6aom0uip8c6shjw...@mail.gmail.com With Regards, Amit Kapila. 
EnterpriseDB: http://www.enterprisedb.com extend_pg_stat_activity_v13.patch Description: Binary data
Re: [HACKERS] RFC: replace pg_stat_activity.waiting with something more descriptive
On Wed, Mar 9, 2016 at 7:17 PM, Robert Haas wrote: > > On Wed, Mar 9, 2016 at 8:31 AM, Amit Kapila wrote: > > Thanks for the suggestion. I have updated the patch to include wait_event_type information in the wait_event table. > > I think we should remove "a server process is" from all of these entries. > > Also, I think this kind of thing should be tightened up: > > + A server process is waiting on any one of the commit_timestamp > + buffer locks to read or write the commit_timestamp page in the > + pg_commit_ts subdirectory. > > I'd just write: Waiting to read or write a commit timestamp buffer. > Okay, changed as per the suggestion, and fixed the morerows issue pointed out by Thom. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com extend_pg_stat_activity_v14.patch Description: Binary data
Re: [HACKERS] POC: Cache data in GetSnapshotData()
On Thu, Mar 10, 2016 at 1:04 PM, Mithun Cy wrote: > > > On Thu, Mar 3, 2016 at 11:50 PM, Robert Haas > wrote: > >What if you apply both this and Amit's clog control log patch(es)? Maybe > the combination of the two helps substantially more than either >one alone. > > > I did the above tests along with Amit's clog patch. Machine: 8 socket, 64 > core, 128 hyperthreads.
>
> clients  BASE          ONLY CLOG CHANGES  % Increase     ONLY SAVE SNAPSHOT  % Increase    CLOG CHANGES + SAVE SNAPSHOT  % Increase
> 64       29247.658034  30855.728835       5.4981181711   29254.532186        0.0235032562  32691.832776                  11.7758992463
> 88       31214.305391  33313.393095       6.7247618606   32109.248609        2.8670931702  35433.655203                  13.5173592978
> 128      30896.673949  34015.362152       10.0939285832  ***                 ***           34947.296355                  13.110221549
> 256      27183.780921  31192.895437       14.7481857938  ***                 ***           32873.782735                  20.9316056164
>
> With clog changes, perf of the save-snapshot patch improves. > > This data looks promising to me and indicates that saving the snapshot has benefits, and we can see a noticeable performance improvement especially once the CLOG contention gets reduced. I wonder if we should try these tests with unlogged tables, because I suspect most of the contention after CLogControlLock and ProcArrayLock is for WAL-related locks, so you might see better gains with these patches. Do you think it makes sense to check the performance with an increased number of CLOG buffers (a patch for this is posted in the Speed Up Clog thread [1]), as that also relieves contention on CLOG per the tests I have done? [1] - http://www.postgresql.org/message-id/caa4ek1lmmgnq439bum0lcs3p0sb8s9kc-cugu_thnqmwa8_...@mail.gmail.com With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Explain [Analyze] produces parallel scan for select Into table statements.
On Wed, Mar 9, 2016 at 8:18 PM, Mithun Cy wrote:
>
> Hi All,
>
> Explain [Analyze] Select Into table produces a plan which uses parallel scans.
>
> Possible fix:
> I tried to make a patch to fix this. Now in ExplainOneQuery, if the into clause is defined, then parallel plans are disabled, similar to their execution.
>
> - /* plan the query */
> - plan = pg_plan_query(query, CURSOR_OPT_PARALLEL_OK, params);
> + /*
> +  * plan the query.
> +  * Note: If Explain is for CreateTableAs / SelectInto Avoid parallel
> +  * plans.
> +  */
> + plan = pg_plan_query(query, into ? 0:CURSOR_OPT_PARALLEL_OK, params);

There should be white space around the colon in 0:CURSOR_OPT_PARALLEL_OK. Also, I don't see that the comment is required, as other similar usages don't have any such comment. I have fixed these two points in the attached patch. In general, the patch looks good to me and solves the problem mentioned. I have run the regression tests with force_parallel_mode and don't see any problem.

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com

Analyze_select_into_disable_parallel_scan_v2.patch
[HACKERS] Re: [HACKERS] Re: [HACKERS] Re: [HACKERS] Windows service is not starting so there’s message in log: FATAL: "could not create shared memory segment “Global/PostgreSQL.851401618”: Permission
On Wed, Mar 9, 2016 at 5:46 PM, Haribabu Kommi wrote:
> On Wed, Mar 9, 2016 at 10:06 PM, Amit Kapila wrote:
> > On Wed, Mar 9, 2016 at 11:46 AM, Haribabu Kommi <kommi.harib...@gmail.com> wrote:
> >>
> >> I tried replacing the random() with PostmasterRandom() for a test and it worked.
> >> This generates different random values, so the issue does not occur.
> >>
> >> "Global/PostgreSQL.2115609797"
> >>
> >> I feel we should add the data directory path + the random number to generate the name for dynamic shared memory; this can fix the problem.
> >
> > As mentioned above, I think if we can investigate why this error is generated, that will be helpful. Currently the code ensures that if the segment already exists, it retries to create a segment with another name (refer dsm_impl_windows()), so the point of investigation is: why is it not going via that path? I am guessing that for some reason CreateFileMapping() is returning NULL in this case, whereas ideally it should return the existing handle with the error ERROR_ALREADY_EXISTS.
>
> DEBUG: mapped win32 error code 5 to 13
>
> Yes, CreateFileMapping() is returning NULL with an error of ERROR_ACCESS_DENIED.

Okay, so one probable theory for such an error could be that when an object with the same name already exists, this API requests access to that existing object and finds that it can't access it for some reason. On googling, I found some people suggesting trying with UAC [1] disabled on your m/c; can you try that once to see what the result is (this experiment is just to find out the actual reason for the failure, rather than a permanent change suggestion)?

> I am not able to find the reason for this error. This error occurs only when PostgreSQL is started as a service.

Did you use pg_ctl register/unregister to register the different services? Can you share the detailed steps and the OS version on which you saw this behaviour?
[1] - http://windows.microsoft.com/en-in/windows/turn-user-account-control-on-off#1TC=windows-7 With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
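The retry path referred to above in dsm_impl_windows() can be illustrated abstractly. This is a hypothetical Python sketch of the behaviour, not the real implementation (which is C code around CreateFileMapping()); the names existing_names and make_name are stand-ins invented here for the OS namespace and the random name generator:

```python
def create_segment(existing_names, make_name, max_tries=10):
    # Sketch: when the chosen name collides with an existing segment,
    # generate a fresh name and try again instead of failing outright.
    for _ in range(max_tries):
        name = make_name()
        if name not in existing_names:
            existing_names.add(name)
            return name
    raise RuntimeError("could not create shared memory segment")
```

The reported failure corresponds to the case where the existence check itself errors out with a permission failure, so the loop never gets the chance to pick a new name.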
Re: [HACKERS] [COMMITTERS] pgsql: Provide much better wait information in pg_stat_activity.
On Fri, Mar 11, 2016 at 9:19 AM, Joel Jacobson wrote:
>
> On Fri, Mar 11, 2016 at 9:36 AM, Robert Haas wrote:
> > Well, this was discussed. If we keep the Boolean "waiting" column, then either:
>
> Oh, sorry for missing out on that discussion.
>
> > 1. We make it true only for heavyweight lock waits, and false for other kinds of waits. That's pretty strange.
> > 2. We make it true for all kinds of waits that we now know how to report. That still breaks compatibility.
>
> Why not 3: We make it true for exactly the same type of situations as in previous versions. Or is it not possible to do so for some reason?

That's exactly Robert's first point (1). One thing that will be strange, in my view, is that in some cases waiting will be false while wait_event and wait_event_type still contain some wait information, and I think that will look odd to anybody new looking at the view.

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] RFC: replace pg_stat_activity.waiting with something more descriptive
On Fri, Mar 11, 2016 at 12:28 AM, Robert Haas wrote:
>
> Committed with some further editing. In particular, the way you determined whether we could safely access the tranche information for any given ID was wrong; please check over what I did and make sure that isn't also wrong.

There are a few typos which I have tried to fix with the attached patch. Can you tell me what was wrong with the way it was done in the patch?

@@ -4541,9 +4542,10 @@ AbortSubTransaction(void)
   */
  LWLockReleaseAll();

+ pgstat_report_wait_end();
+ pgstat_progress_end_command();
  AbortBufferIO();
  UnlockBuffers();
- pgstat_progress_end_command();

  /* Reset WAL record construction state */
  XLogResetInsertion();

@@ -4653,6 +4655,9 @@ AbortSubTransaction(void)
   */
  XactReadOnly = s->prevXactReadOnly;

+ /* Report wait end here, when there is no further possibility of wait */
+ pgstat_report_wait_end();
+
  RESUME_INTERRUPTS();
 }

AbortSubTransaction() does call pgstat_report_wait_end() twice; is this intentional? I have kept one at the end because there is a chance that in between, APIs can again set the state to wait, and also by that time we have not released buffer pins and heavyweight locks, so I am not sure it makes sense to report wait end before that stage. I have noticed that in WaitOnLock(), on error the wait end is set, but now, thinking about it again, it seems it would be better to set it at the end of AbortTransaction/AbortSubTransaction. What do you think?

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com

fix_typo_lwlock_v1.patch
[HACKERS] Re: [HACKERS] Re: [HACKERS] Re: [HACKERS] Windows service is not starting so there’s message in log: FATAL: "could not create shared memory segment “Global/PostgreSQL.851401618”: Permission
On Fri, Mar 11, 2016 at 5:21 PM, Haribabu Kommi wrote:
>
> On Fri, Mar 11, 2016 at 12:00 AM, Amit Kapila wrote:
> >
> > Okay, so one probable theory for such an error could be that when an object with the same name already exists, this API requests access to that existing object and finds that it can't access it for some reason. On googling, I found some people suggesting trying with UAC [1] disabled on your m/c; can you try that once to see what the result is (this experiment is just to find out the actual reason for the failure, rather than a permanent change suggestion)?
>
> Thanks for the details. Currently I am unable to change the UAC settings on my laptop. I will try to do it on a different system and let you know the result later.
>
> >> I am not able to find the reason for this error. This error occurs only when PostgreSQL is started as a service.
> >
> > Did you use pg_ctl register/unregister to register the different services? Can you share the detailed steps and the OS version on which you saw this behaviour?
>
> Operating system - Windows 7
> Binary - PostgreSQL 9.5 (This doesn't matter, 9.4+ can produce the problem)
>
> 1. Create two standard users in the system (test_user1 and test_user2)

I think one possibility is that one user is not able to access the object created by another user. If possible, can you also try with just one user (have the same user for both the services)?

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Relation extension scalability
On Sat, Mar 12, 2016 at 8:16 AM, Dilip Kumar wrote:
>
> On Sat, Mar 12, 2016 at 5:31 AM, Jim Nasby wrote:
>>
>> FWIW, this is definitely a real possibility in any shop that has very high downtime costs and high transaction rates.
>>
>> I also think some kind of clamp is a good idea. It's not that uncommon to run max_connections significantly higher than 100, so the extension could be way larger than 16MB. In those cases this patch could actually make things far worse as everyone backs up waiting on the OS to extend many MB when all you actually needed were a couple dozen more pages.
>
> I agree. We can have some max limit on the number of extra pages. What do others think?
>
>> BTW, how was *20 arrived at? ISTM that if you have a lot of concurrent demand for extension that means you're running lots of small DML operations, not really big ones. I'd think that would make *1 more appropriate.
>
> *1 will not solve this problem. Here the main problem was that many backends sleep/wake up on the extension lock, and that was causing the bottleneck. So if we do *1, this will satisfy only the current requesters which have already waited on the lock; but our goal is to keep backends from requesting this lock at all.
>
> The idea is to find the number of requesters as a statistic on this lock (the load on the lock) and extend by a multiple of the load, so that this situation is avoided for a long time, and when it happens again, extend by a multiple of the load again.
>
> How did 20 come about? I tested with multiple client loads 1..64, with multiple load sizes from 4-byte records to 1KB records, COPY/INSERT, and found that 20 works best.

Can you post the numbers for 1, 5, 10, 15, 25, or whatever other multipliers you have tried, so that it is clear that 20 is best?

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
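The extend-by-a-multiple-of-the-load policy with a clamp, as discussed above, can be sketched as follows. The *20 multiplier comes from the thread; the clamp value (512 pages, i.e. 4MB with 8KB pages) is an illustrative assumption, not a figure from the actual patch:

```python
def pages_to_add(lock_waiters, multiplier=20, max_extra_pages=512):
    # Extend the relation by a multiple of the number of backends waiting
    # on the extension lock, clamped to an upper bound so that a high
    # max_connections setting cannot trigger a huge OS-level extension.
    return min(max(lock_waiters, 1) * multiplier, max_extra_pages)
```

With the clamp in place, a burst of 100 waiters would still only extend by 512 pages rather than 2000.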
Re: [HACKERS] Explain [Analyze] produces parallel scan for select Into table statements.
On Fri, Mar 11, 2016 at 3:34 PM, Mithun Cy wrote:
>
> On Thu, Mar 10, 2016 at 9:39 PM, Robert Haas wrote:
> > I guess there must not be an occurrence of this pattern in the regression tests, or previous force_parallel_mode testing would have found this problem. Perhaps this patch should add one?
>
> I have added the test to select_into.sql: an Explain select into statement.

I don't see how this test will fail with force_parallel_mode=regress and max_parallel_degree > 0 even without the patch proposed to fix the issue at hand. In short, I don't think this test would have caught the issue, so I don't see much advantage in adding it. Even if we want to add such a test case, I think as proposed it will substantially increase the timing of the "Select Into" test, which might not be an acceptable addition, especially for testing one corner case.

> Explain Analyze produces planning time and execution time even with TIMING OFF, so I am not adding the same to the regression tests.

Yeah, that makes the addition of a test for this functionality difficult. Robert, do you have any idea what kind of test would have caught this issue?

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] [COMMITTERS] pgsql: Provide much better wait information in pg_stat_activity.
On Sat, Mar 12, 2016 at 5:01 AM, Jim Nasby wrote:
>
> On 3/10/16 8:36 PM, Robert Haas wrote:
>>
>> 1. We make it true only for heavyweight lock waits, and false for other kinds of waits. That's pretty strange.
>> 2. We make it true for all kinds of waits that we now know how to report. That still breaks compatibility.
>
> I would absolutely vote for 2 here. You could even argue that it's a bug fix, since those were waits we technically should have been indicating.

I see it the other way around. I think waiting=true for only heavyweight locks makes sense in the existing versions, as the user can still find out what's actually going on in the system either by looking at "query" in pg_stat_activity or by referring to pg_locks; but OTOH, if waiting is true for all kinds of waits (lwlock, heavyweight lock, I/O, etc.), then I think it will be difficult for the user to make any sense out of it. So I see going for option 2 as confusing users rather than simplifying anything.

> Another random thought... changes like this would probably be easier to handle if we provided backwards compatibility extensions that created views that mimicked the catalog for a specific Postgres version.

That makes sense to me if other people agree to it. I think there will be some maintenance overhead, but I see that as worth the effort in terms of user convenience.

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Explain [Analyze] produces parallel scan for select Into table statements.
On Sat, Mar 12, 2016 at 2:02 PM, Mithun Cy wrote:
>
> On Sat, Mar 12, 2016 at 12:28 PM, Amit Kapila wrote:
> > I don't see how this test will fail with force_parallel_mode=regress and max_parallel_degree > 0 even without the patch proposed to fix the issue at hand. In short, I don't think this test would have caught the issue, so I don't see much advantage in adding such a test. Even if we want to add such a test case, I think as proposed it will substantially increase the timing of the "Select Into" test, which might not be an acceptable addition, especially for testing one corner case.
>
> Without the above patch, make installcheck fails for select_into.sql with the below diff when
> force_parallel_mode = on
> max_parallel_degree = 3

With force_parallel_mode=on, I could see many other failures as well. I think it is better to have a test which tests this functionality with force_parallel_mode=regress.

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Background Processes and reporting
On Sat, Mar 12, 2016 at 3:10 AM, Andres Freund wrote:
>
> > Similarly for the wait event stuff - checkpointer, wal writer, background writer are in many cases processes that very often are blocked on locks, IO and such. Thus restricting the facility to database connected processes seems like a loss.
>
> I think one way to address this would be to not only report PgBackendStatus type processes in pg_stat_activity. While that'd obviously be a compatibility break, I think it'd be an improvement.

I think another point here that needs more thought is that many of the pg_stat_activity fields are not relevant for background processes. Of course, one can say that we can keep those fields as NULL, but I still think that indicates it is not the most suitable way to expose such information. Another way could be to have a new view like pg_stat_background_activity with only the relevant fields, or to expose the information via individual views like pg_stat_bgwriter. Do you intend to get this done for 9.6, considering an add-on patch for the wait event information displayed in pg_stat_activity?

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Background Processes and reporting
On Sat, Mar 12, 2016 at 2:38 PM, Alexander Korotkov <a.korot...@postgrespro.ru> wrote:
>
> On Sat, Mar 12, 2016 at 2:45 AM, Andres Freund wrote:
>>
>> I think I agree with Robert here. Providing hooks into very low level places tends to lead to problems in my experience; tight control over what happens is often important - I certainly don't want any external code to run while we're waiting for an lwlock.
>
> So, I get the following.
>
> 1) Detailed wait monitoring might cause high overhead on some systems.
> 2) We want wait monitoring to be always on. And we don't want options to enable additional features of wait monitoring.

I am not able to see how any of the above comments indicate that wait monitoring needs to be always on. Why can't we consider it off by default, especially for features like timing calculations where we suspect some performance penalty, and then, if during development it is proven that none of the additional wait events cause any overhead, keep them on by default?

> 3) We don't want a hook for wait events to be exposed.
>
> Can I conclude that we reject detailed wait monitoring by design?

I don't think so.

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Explain [Analyze] produces parallel scan for select Into table statements.
On Sat, Mar 12, 2016 at 7:11 PM, Mithun Cy wrote:
>
> On Sat, Mar 12, 2016 at 2:32 PM, Amit Kapila wrote:
> > With force_parallel_mode=on, I could see many other failures as well. I think it is better to have a test which tests this functionality with force_parallel_mode=regress.
>
> As per the user manual: setting this value to regress has all of the same effects as setting it to on, plus some additional effects that are intended to facilitate automated regression testing.

Yes, that is the only reason I mentioned that it is better to have a test which can be checked in an automated way, and I understand that the way you have written the test using Explain won't work in an automated way, so I am not sure it is a good idea to add such a test.

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
[HACKERS] Re: [HACKERS] Re: [HACKERS] Re: [HACKERS] Windows service is not starting so there’s message in log: FATAL: "could not create shared memory segment “Global/PostgreSQL.851401618”: Permission
On Fri, Mar 11, 2016 at 5:21 PM, Haribabu Kommi wrote:
>
> On Fri, Mar 11, 2016 at 12:00 AM, Amit Kapila wrote:
> >
> >> I am not able to find the reason for this error. This error occurs only when PostgreSQL is started as a service.
> >
> > Did you use pg_ctl register/unregister to register the different services? Can you share the detailed steps and the OS version on which you saw this behaviour?
>
> Operating system - Windows 7
> Binary - PostgreSQL 9.5 (This doesn't matter, 9.4+ can produce the problem)
>
> 1. Create two standard users in the system (test_user1 and test_user2)
> 2. Create two databases belonging to each of the users listed above.
> 3. Now, using pg_ctl, register the services for the two users.
> 4. Provide logon permissions to these users to run the services by changing the service properties.

Did you mean that you changed "Log on as: Local System Account" in the service properties, or something else?

> 5. Now try to start the services; the second service fails with the error message.
> 6. Error details can be found in the Event Log viewer.

If I follow the above steps and do as I mentioned for step 4, I am not able to reproduce the issue on a Windows 7 m/c using code from HEAD.

> Yes, it works as same-user services. The main problem is that PostgreSQL as a service for two different users on the same system does not work, because the same random number gets generated for the two services.

I am not sure why you think the same random number is the problem; as mentioned above, even if the dsm name is the same due to the same random number, the code has logic to process it appropriately (regenerate the name of the dsm). Having said that, I don't mean to say that we shouldn't have logic to generate a unique name, and I think we might want to add the data directory path to name generation as we do for the main shared memory; however, it is better to first completely understand the underlying issue.
If I understand correctly, the problem here is that the second user doesn't have the appropriate access rights to access the object created by the first user. On reading the documentation of CreateFileMapping(), it seems that the user should have the SeCreateGlobalPrivilege privilege to create an object in the Global namespace. Can you once try giving that privilege to the users you created? To give this privilege, go to Control Panel -> System And Security -> Administrative Tools -> Local Security Policy -> Local Policies -> User Rights Assignment; in the right window, select "Create global objects", double-click it, and add the newly created users. Rerun your test after these steps.

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
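The name-generation idea raised earlier in this thread (mix the data directory path into the dsm segment name, as is done for the main shared memory) can be sketched like this; the name format and the choice of hash are purely illustrative assumptions, not the actual PostgreSQL scheme:

```python
import hashlib
import random

def dsm_segment_name(data_dir, rng):
    # Hypothetical sketch: derive a short tag from the data directory path
    # and combine it with the random handle, so two clusters (services) on
    # the same machine cannot collide even if their random sequences match.
    dir_tag = hashlib.sha1(data_dir.encode()).hexdigest()[:8]
    return "Global/PostgreSQL.%s.%d" % (dir_tag, rng.randint(1, 2**31 - 1))
```

With identical random seeds, two clusters with different data directories still get distinct names.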
Re: [HACKERS] Prepared Statement support for Parallel query
On Fri, Feb 26, 2016 at 4:37 PM, Robert Haas wrote:
>
> And, I'm going to revert this part. If you'd run the regression tests under force_parallel_mode=regress, max_parallel_degree>0, you would have noticed that this part breaks it, because of CREATE TABLE ... AS EXECUTE.

I have looked into this issue and found that the reason for the failure is that with force_parallel_mode=regress, we enable the parallel mode restrictions even if a parallel plan is not chosen, as part of the below code in standard_planner():

if (force_parallel_mode == FORCE_PARALLEL_OFF || !glob->parallelModeOK)
{
    glob->parallelModeNeeded = false;
    glob->wholePlanParallelSafe = false; /* either false or don't care */
}
else
{
    glob->parallelModeNeeded = true;
    glob->wholePlanParallelSafe =
        !has_parallel_hazard((Node *) parse, false);
}

The failure cases fall into that category: basically, wholePlanParallelSafe will be false, but parallelModeNeeded will be true, which enables the parallel mode restrictions even though the plan won't contain a Gather node. I think if we want to operate in the above way for testing purposes, then we need to force during execution that statements for non-read-only operations do not enter parallel mode, similar to what we are doing for the non-zero tuple count case. The attached patch fixes the problem.

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com

prepared_stmt_parallel_query_v2.patch
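For readers following along, the planner behaviour quoted above can be restated in a compact, runnable form; this is only an illustration of the C snippet's logic, not the actual PostgreSQL code:

```python
FORCE_PARALLEL_OFF = "off"

def parallel_flags(force_parallel_mode, parallel_mode_ok, has_parallel_hazard):
    # Simplified restatement of the standard_planner() snippet: under
    # force_parallel_mode=regress/on, parallelModeNeeded is set even when
    # the whole plan is not parallel safe, which is exactly the state that
    # trips the parallel-mode restrictions for CREATE TABLE ... AS EXECUTE.
    if force_parallel_mode == FORCE_PARALLEL_OFF or not parallel_mode_ok:
        return {"parallelModeNeeded": False, "wholePlanParallelSafe": False}
    return {"parallelModeNeeded": True,
            "wholePlanParallelSafe": not has_parallel_hazard}
```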
Re: [HACKERS] Prepared Statement support for Parallel query
On Tue, Mar 15, 2016 at 12:21 AM, Robert Haas wrote: > > On Mon, Mar 14, 2016 at 9:18 AM, Amit Kapila wrote: > > On Fri, Feb 26, 2016 at 4:37 PM, Robert Haas wrote: > > > > > > The failure cases fall into that category, basically wholePlanParallelSafe > > will be false, but parallelModeNeeded will be true which will enable > > parallel mode restrictions even though the plan won't contain Gather node. > > I think if we want to operate in above way for testing purpose, then we need > > to force during execution that statements for non read-only operations > > should not enter into parallel mode similar to what we are doing for > > non-zero tuple count case. Attached patch fixes the problem. > > This seems like a really ugly fix. It might be possible to come up > with a fix along these lines, but I don't have much confidence in the > specific new test you've injected into executePlan(). Broadly, any > change of this type implicitly changes the contract between > executePlan() and the planner infrastructure - the planner can now > legally generate parallel plans in some cases where that would > previously have been unacceptable. But I'm not in a hurry to rethink > where we've drawn the line there for 9.6. Let's punt this issue for > now and come back to it in a future release. > No issues. I felt that it might be good to support parallel query via Prepare statement as there is no fundamental issue in the same, but as you say, we can do that in future release as well. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers
On Tue, Mar 15, 2016 at 12:00 AM, David Steele wrote:
>
> On 2/26/16 11:37 PM, Amit Kapila wrote:
>> On Sat, Feb 27, 2016 at 10:03 AM, Amit Kapila
>> Here, we can see that there is a gain of ~15% to ~38% at higher client counts.
>>
>> The attached document (perf_write_clogcontrollock_data_v6.ods) contains data, mainly focussing on single-client performance. The data is for multiple runs on different machines, so I thought it better to present it in the form of a document rather than dumping everything in e-mail. Do let me know if there is any confusion in understanding/interpreting the data.
>>
>> Forgot to mention that all these tests have been done by reverting commit ac1d794.
>
> This patch no longer applies cleanly:
>
> $ git apply ../other/group_update_clog_v6.patch
> error: patch failed: src/backend/storage/lmgr/proc.c:404
> error: src/backend/storage/lmgr/proc.c: patch does not apply
> error: patch failed: src/include/storage/proc.h:152
> error: src/include/storage/proc.h: patch does not apply

For me, with patch -p1 < it works, but anyhow I have updated the patch based on a recent commit. Can you please check the latest patch and see if it applies cleanly for you now?

> It's not clear to me whether Robert has completed a review of this code or it still needs to be reviewed more comprehensively.
>
> Other than a comment that needs to be fixed, it seems that all questions have been answered by Amit.

I have updated the comments and changed the name of one variable from "all_trans_same_page" to "all_xact_same_page", as pointed out offlist by Alvaro.

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com

group_update_clog_v7.patch
Re: [HACKERS] Explain [Analyze] produces parallel scan for select Into table statements.
On Tue, Mar 15, 2016 at 5:22 AM, Robert Haas wrote: > > On Sat, Mar 12, 2016 at 1:58 AM, Amit Kapila wrote: > > Yeah, that makes the addition of test for this functionality difficult. > > Robert, do you have any idea what kind of test would have caught this issue? > > Yep. Committed with that test: > Nice. Thanks! With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Background Processes and reporting
On Tue, Mar 15, 2016 at 1:32 AM, Andres Freund wrote:
>
> On 2016-03-12 16:29:11 +0530, Amit Kapila wrote:
> > On Sat, Mar 12, 2016 at 3:10 AM, Andres Freund wrote:
> > >
> > > > Similarly for the wait event stuff - checkpointer, wal writer, background writer are in many cases processes that very often are blocked on locks, IO and such. Thus restricting the facility to database connected processes seems like a loss.
> > >
> > > I think one way to address this would be to not only report PgBackendStatus type processes in pg_stat_activity. While that'd obviously be a compatibility break, I think it'd be an improvement.
> >
> > I think another point here that needs more thought is that many of the pg_stat_activity fields are not relevant for background processes. Of course, one can say that we can keep those fields as NULL, but I still think that indicates it is not the most suitable way to expose such information.
>
> But neither are all of them relevant for autovacuum workers, and we already show them.

Right; currently any process which has been assigned a BackendId is a probable candidate for being displayed in pg_stat_activity, and all the related information is captured in PgBackendStatus.

> pg_stat_activity as a name imo doesn't really imply that it's about plain queries. ISTM we should add a 'backend_type' column that is one of backend|checkpointer|autovacuum|autovacuum-worker|wal writer|bgwriter|bgworker (or similar). That makes querying easier.

+1 for going that way if we decide to display background process information in the pg_stat_activity view. However, I think we might need some additional space in shared memory to track some of the statistics as we track in PgBackendStatus, or maybe for now just display some minimal stats like wait events for background processes.
> > > Another way could be to have a new view like pg_stat_background_activity with only the relevant fields, or to expose the information via individual views like pg_stat_bgwriter.
>
> I think the second is a pretty bad alternative; it'll force us to add new views with very similar information; and it'll be hard to get information about the whole system. I mean if you want to know which locks are causing problems, you don't primarily care whether it's a background process or a backend that has contention issues.

Agreed. OTOH, adding information from two different kinds of processes (one which has a BackendId and one which doesn't) also doesn't sound neat from the internal code perspective, but maybe this is just an initial fear, or maybe it is because we haven't come up with a patch which can show that it is actually a simple thing to achieve. Yet another idea for 9.6 could be to just define statistics functions to get wait events for background processes, like we have for backends (similar to pg_stat_get_backend_idset() and pg_stat_get_backend_wait_event()). I think we will most probably need those kinds of functions anyway once we expose such information for background processes, so having them now will at least provide some way for the user to get minimal information about background processes. I think that won't need much additional work.

> > Do you intend to get this done for 9.6, considering an add-on patch for the wait event information displayed in pg_stat_activity?
>
> I think we should fix this for 9.6; it's a weakness in a new interface. Let's not yank people around more than we need to.
>
> I'm willing to do some work on that, if we can agree upon a course.

Good to hear.

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers
On Tue, Mar 15, 2016 at 7:54 PM, David Steele wrote: > > On 3/15/16 1:17 AM, Amit Kapila wrote: > > > On Tue, Mar 15, 2016 at 12:00 AM, David Steele > > >> This patch no longer applies cleanly: > >> > >> $ git apply ../other/group_update_clog_v6.patch > >> error: patch failed: src/backend/storage/lmgr/proc.c:404 > >> error: src/backend/storage/lmgr/proc.c: patch does not apply > >> error: patch failed: src/include/storage/proc.h:152 > >> error: src/include/storage/proc.h: patch does not apply > > > > For me, with patch -p1 < it works, but any how I have > > updated the patch based on recent commit. Can you please check the > > latest patch and see if it applies cleanly for you now. > > Yes, it now applies cleanly (101fd93). > Thanks for verification. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] [WIP] speeding up GIN build with parallel workers
On Wed, Mar 16, 2016 at 5:41 AM, Constantin S. Pan wrote:
> On Mon, 14 Mar 2016 08:42:26 -0400 David Steele wrote:
>
> > On 2/18/16 10:10 AM, Constantin S. Pan wrote:
> > > On Wed, 17 Feb 2016 23:01:47 +0300 Oleg Bartunov wrote:
> > >
> > >> My feedback is (Mac OS X 10.11.3)
> > >>
> > >> set gin_parallel_workers=2;
> > >> create index message_body_idx on messages using gin(body_tsvector);
> > >> LOG: worker process: parallel worker for PID 5689 (PID 6906) was terminated by signal 11: Segmentation fault
> > >
> > > Fixed this, try the new patch. The bug was in incorrect handling of some GIN categories.
> >
> > Oleg, it looks like Constantin has updated the patch to address the issue you were seeing. Do you have time to retest and review?
>
> Actually, there has been some progress since. The patch is attached.
>
> 1. Added another GUC parameter for changing the amount of shared memory for parallel GIN workers.
>
> 2. Changed the way results are merged. It uses a shared memory message queue now.
>
> 3. Tested on some real data (GIN index on email message body tsvectors). Here are the timings for different values of 'gin_shared_mem' and 'gin_parallel_workers' on a 4-CPU machine. Seems 'gin_shared_mem' has no visible effect.
>
> wnum  mem(MB)  time(s)
>  0     16      247
>  1     16      256

It seems from your data that with 1 worker you are always seeing a slowdown; have you investigated the reason for this?

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Pushdown target list below gather node (WAS Re: [HACKERS] WIP: Upper planner pathification)
On Wed, Mar 9, 2016 at 11:58 PM, Robert Haas wrote: > > On Wed, Mar 9, 2016 at 12:33 PM, Tom Lane wrote: > > > > Gather is a bit weird, because although it can project (and needs to, > > per the example of needing to compute a non-parallel-safe function), > > you would rather push down as much work as possible to the child node; > > and doing so is semantically OK for parallel-safe functions. (Pushing > > functions down past a Sort node, for a counterexample, is not so OK > > if you are concerned about function evaluation order, or even number > > of executions.) > > > > In the current code structure it would perhaps be reasonable to teach > > apply_projection_to_path about that --- although this would require > > logic to separate parallel-safe and non-parallel-safe subexpressions, > > which doesn't quite seem like something apply_projection_to_path > > should be doing. > > I think for v1 it would be fine to make this all-or-nothing; that's > what I had in mind to do. That is, if the entire tlist is > parallel-safe, push it all down. If not, let the workers just return > the necessary Vars and have Gather compute the final tlist. > I found it quite convenient to teach apply_projection_to_path() to push the target list down beneath the Gather node when the target list contains only parallel-safe expressions. The attached patch implements pushing the target list beneath the Gather node. Below is the output of a simple test which shows the effect of the implementation.
Without Patch -

postgres=# explain verbose select c1+2 from t1 where c1<10;
                                 QUERY PLAN
-----------------------------------------------------------------------------
 Gather  (cost=0.00..44420.43 rows=30 width=4)
   Output: (c1 + 2)
   Number of Workers: 2
   ->  Parallel Seq Scan on public.t1  (cost=0.00..44420.35 rows=13 width=4)
         Output: c1
         Filter: (t1.c1 < 10)
(6 rows)

With Patch -

postgres=# explain verbose select c1+2 from t1 where c1<10;
                                 QUERY PLAN
-----------------------------------------------------------------------------
 Gather  (cost=0.00..45063.75 rows=30 width=4)
   Output: ((c1 + 2))
   Number of Workers: 1
   ->  Parallel Seq Scan on public.t1  (cost=0.00..45063.68 rows=18 width=4)
         Output: (c1 + 2)
         Filter: (t1.c1 < 10)
(6 rows)

In the above plans, you can notice that the target list expression (c1 + 2) is pushed beneath the Gather node after the patch. Thoughts? With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com parallel-tlist-pushdown-v1.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex
On Sat, Mar 19, 2016 at 1:41 AM, Robert Haas wrote: > > On Tue, Mar 1, 2016 at 9:43 AM, Aleksander Alekseev > wrote: > > > > So answering your question - it turned out that we _can't_ reduce > > NUM_FREELISTS this way. > > That's perplexing. I would have expected that with all of the mutexes > packed in back-to-back like this, we would end up with a considerable > amount of false sharing. I don't know why it ever helps to have an > array of bytes all in the same cache line of which each one is a > heavily-trafficked mutex. Anybody else have a theory? > > One other thing that concerns me mildly is the way we're handling > nentries. It should be true, with this patch, that the individual > nentries sum to the right value modulo 2^32. But I don't think > there's any guarantee that the values are positive any more, and in > theory after running long enough one of them could manage to overflow > or underflow. > In theory, won't nentries overflow even without the patch after running for a very long time? I think with the patch it is more prone to overflow because we start borrowing from other free lists as well. > So at a very minimum I think we need to remove the Assert() that the > value is not negative. But really, I wonder if we > shouldn't rework things a little more than that. > > One idea is to jigger things so that we maintain a count of the total > number of entries that doesn't change except when we allocate, and > then for each freelist partition we maintain the number of entries in > that freelist partition. So then the size of the hash table, instead > of being sum(nentries) is totalsize - sum(nfree). > To me, your idea sounds much better than the current code in terms of understanding the free-list concept as well. So, +1 for changing the code in this way. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Performance degradation in commit ac1d794
On Sat, Mar 19, 2016 at 12:00 AM, Andres Freund wrote: > > On 2016-03-18 20:14:07 +0530, Amit Kapila wrote: > > > I have done some > > tests on Windows with 0003 patch which includes running the regressions > > (vcregress check) and it passes. Will look into it tomorrow once again and > > share if I find anything wrong with it, but feel to proceed if you want. > > Thanks for the testing thus far! Let's see what the buildfarm has to > say. > Won't the new code need to ensure that ResetEvent(latchevent) gets called in case WaitForMultipleObjects() comes out when both pgwin32_signal_event and latchevent are signalled at the same time? With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Performance degradation in commit ac1d794
On Sat, Mar 19, 2016 at 12:14 PM, Andres Freund wrote: > > > > On March 18, 2016 11:32:53 PM PDT, Amit Kapila wrote: > >On Sat, Mar 19, 2016 at 12:00 AM, Andres Freund > >wrote: > >> > >> On 2016-03-18 20:14:07 +0530, Amit Kapila wrote: > >> > >> > I have done some > >> > tests on Windows with 0003 patch which includes running the > >regressions > >> > (vcregress check) and it passes. Will look into it tomorrow once > >again > >and > >> > share if I find anything wrong with it, but feel to proceed if you > >want. > >> > >> Thanks for the testing thus far! Let's see what the buildfarm has to > >> say. > >> > > > >Won't the new code needs to ensure that ResetEvent(latchevent) should > >get > >called in case WaitForMultipleObjects() comes out when both > >pgwin32_signal_event and latchevent are signalled at the same time? > > WaitForMultiple only reports the readiness of on event at a time, no? > I don't think so, please read link [1] with a focus on below paragraph which states how it reports the readiness or signaled state when multiple objects become signaled. "When *bWaitAll* is *FALSE*, this function checks the handles in the array in order starting with index 0, until one of the objects is signaled. If multiple objects become signaled, the function returns the index of the first handle in the array whose object was signaled." [1] - https://msdn.microsoft.com/en-us/library/windows/desktop/ms687025(v=vs.85).aspx With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: Pushdown target list below gather node (WAS Re: [HACKERS] WIP: Upper planner pathification)
On Thu, Mar 17, 2016 at 7:10 PM, Tom Lane wrote: > > Amit Kapila writes: > > While reading above code changes, it looks like it is assuming that subpath > > and subplan will always be same (as it is verifying projection capability > > of subpath and attaching the tlist to subplan), but I think it is possible > > that subpath and subplan correspond to different nodes when gating Result > > node is added on to top of scan plan by create_scan_plan(). > > A little more thought will show you that that's not actually relevant, > because the tlist computation would have happened (or not) below the > gating Result. If gating Results had an impact on > apply_projection_to_path's decisions we'd have had to do something about > that before this. > I understand that gating Results won't impact it, but it was just not apparent from looking at the code I had referred to. If you think it is quite an obvious thing, then we can leave the comment as it is. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] [WIP] speeding up GIN build with parallel workers
On Thu, Mar 17, 2016 at 2:56 PM, Constantin S. Pan wrote: > > On Thu, 17 Mar 2016 13:21:32 +0530 > Amit Kapila wrote: > > > On Wed, Mar 16, 2016 at 7:50 PM, Constantin S. Pan > > wrote: > > > > > > On Wed, 16 Mar 2016 18:08:38 +0530 > > > Amit Kapila wrote: > > > > > > > > > > > Why backend just waits, why can't it does the same work as any > > > > worker does? In general, for other parallelism features the > > > > backend also behaves the same way as worker in producing the > > > > results if the results from workers is not available. > > > > > > We can make backend do the same work as any worker, but that > > > will complicate the code for less than 2 % perfomance boost. > > > > Why do you think it will be just 2%? I think for single worker case, > > it should be much more as the master backend will be less busy in > > consuming tuples from tuple queue. I can't say much about > > code-complexity, as I haven't yet looked carefully at the logic of > > patch, but we didn't find much difficulty while doing it for parallel > > scans. One of the commit which might help you in understanding how > > currently heap scans are parallelised is > > ee7ca559fcf404f9a3bd99da85c8f4ea9fbc2e92, you can see if that can > > help you in anyway for writing a generic API for Gin parallel builds. > > I looked at the timing details some time ago, which showed > that the backend spent about 1% of total time on data > transfer from 1 worker, and 3% on transfer and merging from > 2 workers. So if we use (active backend + 1 worker) instead > of (passive backend + 2 workers), we still have to spend > 1.5% on transfer and merging. > I think the comparison here should be between (active backend + 1 worker) and (passive backend + 1 worker), or between (active backend + 2 workers) and (passive backend + 2 workers). I don't think it is a good assumption that workers are always freely available and can be used as and when required for any operation.
> > Or we can look at these measurements (from yesterday's > message): > > wnum mem(MB) time(s) >0 16 247 >1 16 256 >2 16 126 > > If 2 workers didn't have to transfer and merge their > results, they would have finished in 247 / 2 = 123.5 > seconds. But the transfer and merging took another 2.5 > seconds. The merging takes a little longer than the > transfer. If we now use backend+worker we get rid of 1 > transfer, but still have to do 1 transfer and then merge, so > we will save less than a quarter of those 2.5 seconds. > If I understand the above data correctly, then it seems to indicate that the majority of the work is done in processing the data, so I think it would be better if the master and worker could both work together. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Performance degradation in commit ac1d794
On Thu, Mar 17, 2016 at 7:34 AM, Andres Freund wrote: > Hi, > > * I can do a blind rewrite of the windows implementation, but I'm > obviously not going to get that entirely right. So I need some help > from a windows person to test this. > I can help you verifying the windows implementation. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers
On Wed, Mar 16, 2016 at 11:57 PM, Jesper Pedersen < jesper.peder...@redhat.com> wrote: > > On 03/15/2016 01:17 AM, Amit Kapila wrote: >> >> I have updated the comments and changed the name of one of the variables from >> "all_trans_same_page" to "all_xact_same_page" as pointed out offlist by >> Alvaro. >> >> > > I have done a run, and don't see any regressions. > Can you provide the details of the test, e.g. is this a pgbench read-write test, and, if possible, the steps for executing it? I wonder if you can do the test with unlogged tables (if you are using pgbench, then I think you need to change the Create Table command to use the Unlogged option). > > Intel Xeon 28C/56T @ 2GHz w/ 256GB + 2 x RAID10 (data + xlog) SSD. > Can you provide CPU information (probably by using lscpu)? With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
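For reference, a possible set of commands for such a run might look like the following. These are illustrative pgbench invocations only (scale factor, client count, and duration are guesses to be adjusted to the setup under discussion); note that pgbench's --unlogged-tables initialization option avoids hand-editing the CREATE TABLE commands:

```shell
# Initialize with unlogged tables at scale factor 300 (adjust as needed).
pgbench -i -s 300 --unlogged-tables postgres

# A 15-minute read-write run at 64 clients; add -S instead for the
# select-only (read-only) variant.
pgbench -c 64 -j 64 -T 900 -M prepared postgres

# CPU topology details, as requested.
lscpu
```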
Re: [HACKERS] Parallel Aggregate
On Thu, Mar 17, 2016 at 10:35 AM, David Rowley wrote: > > On 17 March 2016 at 01:19, Amit Kapila wrote: > > Few assorted comments: > > > > 2. > > AggPath * > > create_agg_path(PlannerInfo *root, > > @@ -2397,9 +2399,11 @@ create_agg_path(PlannerInfo *root, > > > > List *groupClause, > > List *qual, > > const AggClauseCosts > > *aggcosts, > > - double numGroups) > > + double numGroups, > > + > > bool combineStates, > > + bool finalizeAggs) > > > > Don't you need to set parallel_aware flag in this function as we do for > > create_seqscan_path()? > > I don't really know the answer to that... I mean there's nothing > special done in nodeAgg.c if the node is running in a worker or in the > main process. > Thinking about it again, I think it is okay to set the parallel_aware flag to false. This flag indicates whether that particular node has any parallelism-specific behaviour, which is true for seqscan, but I think not for a partial aggregate node. A few other comments on the latest patch: 1. + /* + * XXX does this need estimated for each partial path, or are they + * all going to be the same anyway? + */ + dNumPartialGroups = get_number_of_groups(root, + clamp_row_est(partial_aggregate_path->rows), + rollup_lists, + rollup_groupclauses); When estimating partial groups, do we need the rollup-related lists? 2. + hashaggtablesize = estimate_hashagg_tablesize(partial_aggregate_path, + &agg_costs, + dNumPartialGroups); + + /* + * Generate a hashagg Path, if we can, but we'll skip this if the hash + * table looks like it'll exceed work_mem. + */ + if (can_hash && hashaggtablesize < work_mem * 1024L) The hash table size should be estimated only if can_hash is true. 3. + foreach(lc, grouped_rel->partial_pathlist) + { + Path *path = (Path *) lfirst(lc); + double total_groups; + + total_groups = path->parallel_degree * path->rows; + + path = (Path *) create_gather_path(root, grouped_rel, path, NULL, + &total_groups); Do you need to perform it for each partial path, or just for the first partial path?
With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Performance degradation in commit ac1d794
On Sat, Mar 19, 2016 at 12:40 PM, Andres Freund wrote: > > On March 18, 2016 11:52:08 PM PDT, Amit Kapila wrote: > >> >Won't the new code needs to ensure that ResetEvent(latchevent) > >should > >> >get > >> >called in case WaitForMultipleObjects() comes out when both > >> >pgwin32_signal_event and latchevent are signalled at the same time? > >> > >> WaitForMultiple only reports the readiness of on event at a time, no? > >> > > > >I don't think so, please read link [1] with a focus on below paragraph > >which states how it reports the readiness or signaled state when > >multiple > >objects become signaled. > > > >"When *bWaitAll* is *FALSE*, this function checks the handles in the > >array > >in order starting with index 0, until one of the objects is signaled. > >If > >multiple objects become signaled, the function returns the index of the > >first handle in the array whose object was signaled." > > I think that's OK. We'll just get the next event the next time we call waitfor*. It's also not different to the way the routine is currently handling normal socket and postmaster events, no? > I think the primary difference between the socket/postmaster events and the latch event is that for the former we never start waiting with the event already in a signalled state. For the socket event, we close the event at the end and create it again before entering the wait loop in WaitLatchOrSocket. Even if we don't reset the latch event, I could not see any major problem apart from possible spurious wake-ups in a few cases (since we haven't reset the event to the non-signalled state for the latch event before entering the wait, it can return immediately). With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Performance degradation in commit ac1d794
On Fri, Mar 18, 2016 at 1:34 PM, Andres Freund wrote: > > Hi, > > On 2016-03-17 09:01:36 -0400, Robert Haas wrote: > > 0001: Looking at this again, I'm no longer sure this is a bug. > > Doesn't your patch just check the same conditions in the opposite > > order? > > Which is important, because what's in what pfds[x] depends on > wakeEvents. Folded it into a later patch; it's not harmful as long as > we're only ever testing pfds[0]. > > > > 0003: Mostly boring. But the change to win32_latch.c seems to remove > > an unrelated check. > > Argh. > + * from inside a signal handler in latch_sigusr1_handler(). * * Note: we assume that the kernel calls involved in drainSelfPipe() * and SetLatch() will provide adequate synchronization on machines * with weak memory ordering, so that we cannot miss seeing is_set if * the signal byte is already in the pipe when we drain it. */ - drainSelfPipe(); - The above part of the comment looks redundant after this patch. I have done some tests on Windows with the 0003 patch, which included running the regressions (vcregress check), and it passes. I will look into it once again tomorrow and share if I find anything wrong with it, but feel free to proceed if you want. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Performance degradation in commit ac1d794
On Fri, Mar 18, 2016 at 1:34 PM, Andres Freund wrote: > > > Attached is a significantly revised version of the earlier series. Most > importantly I have: > * Unified the window/unix latch implementation into one file (0004) > After applying patch 0004* on HEAD, using command patch -p1 < , I am getting a build failure: c1 : fatal error C1083: Cannot open source file: 'src/backend/storage/ipc/latch.c': No such file or directory I think it could not rename port/unix_latch.c => storage/ipc/latch.c. I have tried with git apply, but with no success. Am I doing something wrong? One minor suggestion about the patch: +#ifndef WIN32 void latch_sigusr1_handler(void) { if (waiting) sendSelfPipeByte(); } +#endif /* !WIN32 */ /* Send one byte to the self-pipe, to wake up WaitLatch */ +#ifndef WIN32 static void sendSelfPipeByte(void) Instead of wrapping these functions in #ifndef WIN32 individually, isn't it better to combine them all, as is done at the end of the file? With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: Pushdown target list below gather node (WAS Re: [HACKERS] WIP: Upper planner pathification)
On Wed, Mar 16, 2016 at 10:57 PM, Robert Haas wrote: > > On Wed, Mar 16, 2016 at 12:57 PM, Tom Lane wrote: > > Yeah, I was thinking about the same thing. The comment block above > > where you're looking would need some adjustment. > > OK, how about this? > + * children. Alternatively, apply_projection_to_path might have created + * a projection path as the subpath of a Gather node even though the + * subpath was projection-capable. So, if the subpath is capable of + * projection or the desired tlist is the same expression-wise as the + * subplan's, just jam it in there. We'll have charged for a Result that + * doesn't actually appear in the plan, but that's better than having a + * Result we don't need. */ - if (tlist_same_exprs(tlist, subplan->targetlist)) + if (is_projection_capable_path(best_path->subpath) || + tlist_same_exprs(tlist, subplan->targetlist)) While reading the above code changes, it looks like they assume that subpath and subplan will always be the same (the code verifies the projection capability of the subpath and attaches the tlist to the subplan), but I think it is possible for subpath and subplan to correspond to different nodes when a gating Result node is added on top of the scan plan by create_scan_plan(). I think it might be better to explain in the comments why it is safe to rely on the projection capability of the subpath to attach the tlist to the subplan. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] [WIP] speeding up GIN build with parallel workers
On Wed, Mar 16, 2016 at 2:55 PM, Constantin S. Pan wrote: > > On Wed, 16 Mar 2016 12:14:51 +0530 > Amit Kapila wrote: > > > On Wed, Mar 16, 2016 at 5:41 AM, Constantin S. Pan > > wrote: > > > 3. Tested on some real data (GIN index on email message body > > > tsvectors). Here are the timings for different values of > > > 'gin_shared_mem' and 'gin_parallel_workers' on a 4-CPU > > > machine. Seems 'gin_shared_mem' has no visible effect. > > > > > > wnum mem(MB) time(s) > > >0 16 247 > > >1 16 256 > > > > > > > > > It seems from you data that with 1 worker, you are always seeing > > slowdown, have you investigated the reason of same? > > > > That slowdown is expected. It slows down because with 1 worker it > has to transfer the results from the worker to the backend. > > The backend just waits for the results from the workers and merges them > (in case wnum > 0). Why does the backend just wait; why can't it do the same work as any worker does? In general, for other parallelism features, the backend also behaves the same way as a worker in producing results when the results from the workers are not available. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Parallel Aggregate
of context->subplan_itlist->has_non_vars as it is handled in fix_upper_expr_mutator()? If not, then I think adding the reason in the comments above the function would be better. 7. tlist.c +} \ No newline at end of file There should be a newline at the end of the file. [1] - http://www.postgresql.org/message-id/CAA4eK1Jk8hm-2j-CKjvdd0CZTsdPX=edk_qhzc4689hq0xt...@mail.gmail.com With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Parallel Aggregate
On Thu, Mar 17, 2016 at 6:41 PM, David Rowley wrote: > > On 18 March 2016 at 01:22, Amit Kapila wrote: > > On Thu, Mar 17, 2016 at 10:35 AM, David Rowley > > wrote: > > Updated patch is attached. Thanks for the re-review. > A few more comments: 1. + if (parse->groupClause) + path = (Path *) create_sort_path(root, + grouped_rel, + path, + root->group_pathkeys, + -1.0); For the final path, why do you want to sort only for the GROUP BY case? 2. + path = (Path *) create_gather_path(root, partial_grouped_rel, path, + NULL, &total_groups); + + if (parse->groupClause) + path = (Path *) create_sort_path(root, + grouped_rel, + path, + root->group_pathkeys, + -1.0); + + if (parse->hasAggs) + add_path(grouped_rel, (Path *) + create_agg_path(root, + grouped_rel, + path, + target, + parse->groupClause ? AGG_SORTED : AGG_PLAIN, + parse->groupClause, + (List *) parse->havingQual, + &agg_costs, + partial_grouped_rel->rows, + true, + true)); + else + add_path(grouped_rel, (Path *) + create_group_path(root, + grouped_rel, + path, + target, + parse->groupClause, + (List *) parse->havingQual, + total_groups)); In the above part of the patch, it seems you are using the number of groups differently: for create_group_path() and create_gather_path() you have used total_groups, whereas for create_agg_path() partial_grouped_rel->rows is used. Is there a reason for this? 3. + if (grouped_rel->partial_pathlist) + { + Path *path = (Path *) linitial(grouped_rel->partial_pathlist); + double total_groups; + + total_groups = path->rows * path->parallel_degree; + path = (Path *) create_gather_path(root, partial_grouped_rel, path, + NULL, &total_groups); A. Won't passing partial_grouped_rel lead to incomplete information required by create_gather_path() w.r.t. the case of parameterized path info? B. You have mentioned that passing grouped_rel will make the gather path contain the final path target information, but what is the problem with that?
I mean to ask why the Gather node is required to contain the partial path target information instead of the final path target. C. Can we consider passing pathtarget to create_gather_path(), as that seems to save us from inventing a new UpperRelationKind? If you are worried about adding the new parameter (pathtarget) to create_gather_path(), then I think we already pass it in many other path generation functions, so why not for gather path generation as well? 4A. Overall, the function create_grouping_paths() looks better than before, but I think it is still difficult to read. I think it can be improved by generating the partial aggregate paths separately, as we do for nestloop joins; refer to the function consider_parallel_nestloop. 4B. Rather than using create_gather_path() directly, can't we use generate_gather_paths(), since generate_gather_paths() is used in all other places where we generate a Gather node? 5. +make_partialgroup_input_target(PlannerInfo *root, PathTarget *final_target) { .. .. + foreach(lc, final_target->exprs) + { + Expr *expr = (Expr *) lfirst(lc); + + i++; + + if (parse->groupClause) + { + Index sgref = final_target->sortgrouprefs[i]; + + if (sgref && get_sortgroupref_clause_noerr(sgref, parse->groupClause) + != NULL) + { + /* + * It's a grouping column, so add it to the input target as-is. + */ + add_column_to_pathtarget(input_target, expr, sgref); + continue; + } + } + + /* + * Non-grouping column, so just remember the expression for later + * call to pull_var_clause. + */ + non_group_cols = lappend(non_group_cols, expr); + } .. } Do we want to achieve something different in the above foreach loop than in the similar loop in make_group_input_target()? If not, then why are they not exactly the same? 6. + /* XXX this causes some redundant cost calculation ... */ + input_target = set_pathtarget_cost_width(root, input_target); + return input_target; Can't we return set_pathtarget_cost_width() directly rather than storing it in input_target and then returning input_target?
With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] RFC: replace pg_stat_activity.waiting with something more descriptive
On Thu, Mar 17, 2016 at 7:33 AM, Michael Paquier wrote: > > On Wed, Mar 16, 2016 at 5:28 AM, Robert Haas wrote: > > On Tue, Mar 15, 2016 at 9:17 AM, Thomas Reiss wrote: > >> Here's a small docpatch to fix two typos in the new documentation. > > > > Thanks, committed. > > I just had a quick look at the wait_event committed, and I got a > little bit disappointed that we actually do not track latch waits yet, > which is perhaps not that useful actually as long as an event name is > not associated to a given latch wait when calling WaitLatch. > You are right, and a few more, like I/O waits, were also intentionally left out of the initial patch just to get the base functionality in. One of the reasons I did not keep them in the patch was that they need much more thorough performance testing (even though theoretically there shouldn't be any overhead) with specific kinds of tests. > > I am not > asking for that with this release, this is just for the archive's > sake, and I don't mind coding that myself anyway if need be. Thanks, feel free to pick it up in the next release (or for this release, if everybody feels strongly about having it in this release) if you don't see any patch for it. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] POC: Cache data in GetSnapshotData()
On Wed, Mar 16, 2016 at 10:59 PM, Robert Haas wrote: > On Wed, Mar 16, 2016 at 4:40 AM, Mithun Cy > wrote: > >> >> On Thu, Mar 3, 2016 at 6:20 AM, Mithun Cy >> wrote: >> > I will continue to benchmark above tests with much wider range of >> clients. >> >> Latest Benchmarking shows following results for unlogged tables. >>
>> clients  BASE          ONLY CLOG CHANGES  % Increase     CLOG CHANGES + SAVE SNAPSHOT  % Increase
>> 1        1198.326337   1328.069656        10.8270439357  1234.078342                   2.9834948875
>> 32       37455.181727  38295.250519       2.2428640131   41023.126293                  9.5259037641
>> 64       48838.016451  50675.845885       3.7631123611   51662.814319                  5.7840143259
>> 88       36878.187766  53173.577363       44.1870671639  56025.454917                  51.9203038731
>> 128      35901.537773  52026.024098       44.9130798434  53864.486733                  50.0339263281
>> 256      28130.354402  46793.134156       66.3439197647  46817.04602                   66.4289235427
>> > > Whoa. At 64 clients, we're hardly getting any benefit, but then by 88 > clients, we're getting a huge benefit. I wonder why there's that sharp > change there. > > If you look at the Base readings, there is a performance increase up to 64 clients and then a fall at 88 clients, which to me indicates that it hits very high contention around CLogControlLock at 88 clients, which the CLog patch is able to control to a great degree (if required, I think this can be verified with LWLock stats data). One reason for hitting contention at 88 clients is that this machine seems to have 64 cores (it has 8 sockets and 8 cores per socket), as per the lscpu output below.
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                128
On-line CPU(s) list:   0-127
Thread(s) per core:    2
Core(s) per socket:    8
Socket(s):             8
NUMA node(s):          8
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 47
Model name:            Intel(R) Xeon(R) CPU E7-8830 @ 2.13GHz
Stepping:              2
CPU MHz:               1064.000
BogoMIPS:              4266.62
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              24576K
NUMA node0 CPU(s):     0,65-71,96-103
NUMA node1 CPU(s):     72-79,104-111
NUMA node2 CPU(s):     80-87,112-119
NUMA node3 CPU(s):     88-95,120-127
NUMA node4 CPU(s):     1-8,33-40
NUMA node5 CPU(s):     9-16,41-48
NUMA node6 CPU(s):     17-24,49-56
NUMA node7 CPU(s):     25-32,57-64

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] [WIP] speeding up GIN build with parallel workers
On Wed, Mar 16, 2016 at 7:50 PM, Constantin S. Pan wrote: > > On Wed, 16 Mar 2016 18:08:38 +0530 > Amit Kapila wrote: > > > > > Why backend just waits, why can't it does the same work as any worker > > does? In general, for other parallelism features the backend also > > behaves the same way as worker in producing the results if the > > results from workers is not available. > > We can make backend do the same work as any worker, but that > will complicate the code for less than 2 % perfomance boost. Why do you think it will be just 2%? I think for the single-worker case it should be much more, as the master backend will be less busy consuming tuples from the tuple queue. I can't say much about code complexity, as I haven't yet looked carefully at the logic of the patch, but we didn't find much difficulty while doing it for parallel scans. One of the commits which might help you understand how heap scans are currently parallelised is ee7ca559fcf404f9a3bd99da85c8f4ea9fbc2e92; you can see if that helps you in any way in writing a generic API for GIN parallel builds. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex
On Sat, Mar 19, 2016 at 7:02 PM, Robert Haas wrote: > > On Sat, Mar 19, 2016 at 12:28 AM, Amit Kapila wrote: > > Won't in theory, without patch as well nentries can overflow after running > > for very long time? I think with patch it is more prone to overflow because > > we start borrowing from other free lists as well. > > Uh, I don't think so. Without the patch, there is just one entries > counter and it goes up and down. How would it ever overflow? > I thought it could overflow because we haven't kept any upper limit on incrementing it short of running out of memory (of course that is just a theoretical assumption, as the decrements will keep the number under control). So, are you thinking about the risk of overflow with the patch because we have to use the sum of all the nentries from all the arrays for the total, or is there anything else which makes you think that changing nentries into an array of nentries makes it prone to overflow? With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex
On Mon, Mar 21, 2016 at 5:12 AM, Robert Haas wrote: > > On Sun, Mar 20, 2016 at 3:01 AM, Amit Kapila wrote: > > On Sat, Mar 19, 2016 at 7:02 PM, Robert Haas wrote: > >> On Sat, Mar 19, 2016 at 12:28 AM, Amit Kapila > >> wrote: > >> > Won't in theory, without patch as well nentries can overflow after > >> > running > >> > for very long time? I think with patch it is more prone to overflow > >> > because > >> > we start borrowing from other free lists as well. > >> > >> Uh, I don't think so. Without the patch, there is just one entries > >> counter and it goes up and down. How would it ever overflow? > > > > I thought it can overflow because we haven't kept any upper limit on > > incrementing it unless the memory finishes (ofcourse that is just a > > theoretical assumption, as the decrements will keep the number in control), > > so are you thinking about the risk of overflow with patch because we have to > > use sum of all the nentries from all the arrays for total or is there any > > thing else which makes you think that changing nentries into arrays of > > nentries can make it prone to overflow? > > Well, I mean, perhaps nentries could overflow if you had more than > 2^32 elements, but I'm not even positive we support that. If you > assume a fixed table with a million entries, the nentries value can > vary only between 0 and a million. But now split that into a bunch of > separate counters. The increment when you allocate an entry and the > decrement when you put one back don't have to hit the same bucket, > This is the point where I think I am missing something about the patch. As far as I can understand, it uses the same freelist index (freelist_idx) for allocating an entry and putting it back, so an increment in one list and a decrement in another can only happen if the value of freelist_idx is calculated differently for the same input. Is that so, or is there something else in the patch that I am missing? With Regards, Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
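The point under discussion, that an increment on allocation and a decrement on release always hit the same per-freelist counter when freelist_idx is derived deterministically from the same input, can be illustrated with a small Python model. This is only a sketch of the idea, not PostgreSQL's dynahash code; NUM_FREELISTS and the modulo mapping are assumptions for illustration:

```python
# Toy model of partitioned freelist counters (hypothetical, not the
# actual dynahash implementation). Because freelist_idx() is a pure
# function of the hash value, alloc and free of the same entry touch
# the same partition, so per-partition counters cannot drift apart.
NUM_FREELISTS = 32

nentries = [0] * NUM_FREELISTS

def freelist_idx(hashvalue: int) -> int:
    # Deterministic mapping from hash value to freelist partition.
    return hashvalue % NUM_FREELISTS

def alloc_entry(hashvalue: int) -> None:
    nentries[freelist_idx(hashvalue)] += 1

def free_entry(hashvalue: int) -> None:
    nentries[freelist_idx(hashvalue)] -= 1

def total_entries() -> int:
    # The aggregate count is the sum over all partitions.
    return sum(nentries)
```

Under this scheme, the total varies in exactly the same range as a single counter would, which is why splitting nentries into an array does not by itself create an overflow hazard.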
Re: [HACKERS] Performance degradation in commit ac1d794
On Sun, Mar 20, 2016 at 7:13 AM, Andres Freund wrote: > > On 2016-03-19 15:43:27 +0530, Amit Kapila wrote: > > On Sat, Mar 19, 2016 at 12:40 PM, Andres Freund wrote: > > > > > > On March 18, 2016 11:52:08 PM PDT, Amit Kapila < amit.kapil...@gmail.com> > > wrote: > > > >> >Won't the new code needs to ensure that ResetEvent(latchevent) > > > >should > > > >> >get > > > >> >called in case WaitForMultipleObjects() comes out when both > > > >> >pgwin32_signal_event and latchevent are signalled at the same time? > > > > >> WaitForMultiple only reports the readiness of one event at a time, no? > > > >> > > > > > > > >I don't think so, please read link [1] with a focus on below paragraph > > > >which states how it reports the readiness or signaled state when > > > >multiple > > > >objects become signaled. > > > > > > > >"When *bWaitAll* is *FALSE*, this function checks the handles in the > > > >array > > > >in order starting with index 0, until one of the objects is signaled. > > > >If > > > >multiple objects become signaled, the function returns the index of the > > > >first handle in the array whose object was signaled." > > I think this is just incredibly bad documentation. See > https://blogs.msdn.microsoft.com/oldnewthing/20150409-00/?p=44273 > (Raymond Chen can be considered an authority here imo). > The article you pointed to confirms that the way ResetEvent is done by the patch is correct. I am not sure, but you can weigh whether there is a need for a comment, so that if we want to enhance this part of the code (or write something similar) in future, we don't need to rediscover this fact. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Performance degradation in commit ac1d794
On Mon, Mar 21, 2016 at 10:21 AM, Andres Freund wrote: > > > On March 21, 2016 5:12:38 AM GMT+01:00, Amit Kapila < > amit.kapil...@gmail.com> wrote: > > >The article pointed by you justifies that the way ResetEvent is done by > >patch is correct. I am not sure, but you can weigh, if there is a need > >of > >comment so that if we want enhance this part of code (or want to write > >something similar) in future, we don't need to rediscover this fact. > > I've added a reference in a comment. > > Did you have a chance of running the patched versions on windows? > > I am planning to do it in next few hours. > I plan to push this sometime today, so I can get on to some performance > patches I was planning to look into committing. > > Have we done testing to ensure that it actually mitigates the impact of the performance degradation due to commit ac1d794? I wanted to do that, but unfortunately the high-end m/c on which this problem is reproducible is down. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Performance degradation in commit ac1d794
On Mon, Mar 21, 2016 at 10:26 AM, Amit Kapila wrote: > > On Mon, Mar 21, 2016 at 10:21 AM, Andres Freund wrote: >> >> >> >> On March 21, 2016 5:12:38 AM GMT+01:00, Amit Kapila < amit.kapil...@gmail.com> wrote: >> >> >The article pointed by you justifies that the way ResetEvent is done by >> >patch is correct. I am not sure, but you can weigh, if there is a need >> >of >> >comment so that if we want enhance this part of code (or want to write >> >something similar) in future, we don't need to rediscover this fact. >> >> I've added a reference in a comment. >> >> Did you have a chance of running the patched versions on windows? >> > > I am planning to do it in next few hours. > With 0002-Introduce-new-WaitEventSet-API, initdb is successful, but server start leads to below problem: LOG: database system was shut down at 2016-03-21 11:17:13 IST LOG: MultiXact member wraparound protections are now enabled LOG: database system is ready to accept connections LOG: autovacuum launcher started FATAL: failed to set up event for socket: error code 10022 LOG: statistics collector process (PID 668) exited with exit code 1 LOG: autovacuum launcher process (PID 3868) was terminated by exception 0xC005 HINT: See C include file "ntstatus.h" for a description of the hexadecimal value. I haven't investigated the problem still. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Relation extension scalability
On Fri, Mar 18, 2016 at 2:38 PM, Dilip Kumar wrote: > > > On Thu, Mar 17, 2016 at 1:31 PM, Petr Jelinek wrote: >> >> Great. >> >> Just small notational thing, maybe this would be simpler?: >> extraBlocks = Min(512, lockWaiters * 20); > > > Done, new patch attached. > Review comments: 1. /* + * RelationAddOneBlock + * + * Extend relation by one block and lock the buffer + */ +static Buffer +RelationAddOneBlock(Relation relation, Buffer otherBuffer, BulkInsertState bistate) Shall we mention in the comments that this API returns a locked buffer and it's the responsibility of the caller to unlock it? 2. + /* To avoid the cases when there are huge number of lock waiters, and + * extend file size by big amount at a time, put some limit on The first line of a multi-line comment should not contain anything. 3. + extraBlocks = Min(512, lockWaiters * 20); I think we should explain in a comment the reason for choosing 20 in the above calculation. 4. Some time back [1], you agreed to do some analysis of the overhead that XLogFlush can cause during buffer eviction, but I don't see the results of the same; are you planning to complete it? 5. + if (LOCKACQUIRE_OK + != RelationExtensionLockConditional(relation, ExclusiveLock)) I think the coding style is to keep the constant on the right side of the condition; did you see any other place in the code which uses the check in a similar way? 6. - /* - * Now acquire lock on the new page. + /* For all case we need to add at least one block to satisfy our + * own request. */ Same problem as in point 2. 7. @@ -472,6 +581,8 @@ RelationGetBufferForTuple(Relation relation, Size len, if (needLock) UnlockRelationForExtension(relation, ExclusiveLock); + + Spurious newline addition. 8. +int +RelationExtensionLockWaiter(Relation relation) How about naming this function RelationExtensionLockWaiterCount? 9. + /* Other waiter has extended the block for us*/ Provide an extra space before ending the comment. 10. 
+ if (use_fsm) + { + if (lastValidBlock != InvalidBlockNumber) + { + targetBlock = RecordAndGetPageWithFreeSpace(relation, + lastValidBlock, + pageFreeSpace, + len + saveFreeSpace); + } Are you using RecordAndGetPageWithFreeSpace() instead of GetPageWithFreeSpace() to get the page close to the previous target page? If yes, then do you see enough benefit of the same that it can compensate the additional write operation which Record* API might cause due to additional dirtying of buffer? 11. + { + /* + * First try to get the lock in no-wait mode, if succeed extend one + * block, else get the lock in normal mode and after we get the lock + * extend some extra blocks, extra blocks will be added to satisfy + * request of other waiters and avoid future contention. Here instead + * of directly taking lock we try no-wait mode, this is to handle the + * case, when there is no contention - it should not find the lock + * waiter and execute extra instructions. + */ + if (LOCKACQUIRE_OK + != RelationExtensionLockConditional(relation, ExclusiveLock)) + { + LockRelationForExtension(relation, ExclusiveLock); + + if (use_fsm) + { + if (lastValidBlock != InvalidBlockNumber) + { + targetBlock = RecordAndGetPageWithFreeSpace(relation, + lastValidBlock, + pageFreeSpace, + len + saveFreeSpace); + } + + /* Other waiter has extended the block for us*/ + if (targetBlock != InvalidBlockNumber) + { + UnlockRelationForExtension(relation, ExclusiveLock); + goto loop; + } + + RelationAddExtraBlocks(relation, bistate); + } + } + } - /* - * Now acquire lock on the new page. + /* For all case we need to add at least one block to satisfy our + * own request. */ - LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE); + buffer = RelationAddOneBlock(relation, otherBuffer, bistate); Won't this cause one extra block addition after backend extends the relation for multiple blocks, what is the need of same? 12. 
I think it is good to once test pgbench read-write tests to ensure that this doesn't introduce any new regression. [1] - http://www.postgresql.org/message-id/caa4ek1lonxz4qa_dquqbanspxsctjxrkexjii8h3gnd9z8u...@mail.gmail.com With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
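The heuristic quoted in the review above, extraBlocks = Min(512, lockWaiters * 20), is simple enough to sketch in a few lines of Python (a model of the expression under discussion, not the patch itself). The idea is that the lock holder extends the relation in proportion to the number of backends waiting on the extension lock, capped at 512 blocks so a transient spike in waiters cannot bloat the relation:

```python
# Model of the proposed extension sizing: each waiter is assumed to
# need roughly 20 blocks' worth of headroom, with a hard cap of 512
# blocks per extension to bound worst-case relation growth.
def extra_blocks(lock_waiters: int) -> int:
    return min(512, lock_waiters * 20)
```

For example, a single waiter triggers a 20-block extension, ten waiters trigger 200 blocks, and anything above 25 waiters hits the 512-block cap.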
[HACKERS] Re: [HACKERS] Re: [HACKERS] Re: [HACKERS] Windows service is not starting so there’s message in log: FATAL: "could not create shared memory segment “Global/PostgreSQL.851401618”: Permission
On Mon, Mar 21, 2016 at 6:16 PM, Haribabu Kommi wrote: > > On Mon, Mar 14, 2016 at 4:51 PM, Amit Kapila wrote: > >> Operating system - windows 7 > >> Binary - PostgreSQL 9.5 (This doesn't matter, 9.4+ can produce the > >> problem) > >> > >> 1. Create two standard users in the system (test_user1 and test_user2) > >> 2. Create two databases belongs each user listed above. > >> 3. Now using pg_ctl register the services for the two users. > >> 4. Provide logon permissions to these users to run the services by > >> changing > >> service properties. > > > > Did you mean to say that you changed Log on as: Local System Account in > > service properties or something else? > > No. Not as local service. The user should be the new standard user > that is created > in the system. > So what do you exactly mean by "Provide logon permissions to these users", can you describe in detail what exactly you have done to give those permissions. If I try to do with a new user, it gives me error "could not open service manager" at start of service. > > >> 5. Now try to start the services, the second service fails with the > >> error message. > >> 6. Error details can be found out in Event log viewer. > >> > > > > If I follow above steps and do as I mentioned for step-4, I am not able to > > reproduce the issue on Windows-7 m/c using code of HEAD. > > I am not able to start a service with HEAD code in the same machine, where > as it is working for 9.5. I will look into it later and update it. > Okay. But it is confusing for me because you told earlier that you are able to reproduce problem in 9.5. > >> Yes, it is working as same user services. The main problem is, PostgreSQL > >> as a service for two different users in the same system is not working > >> because > >> of same random getting generated for two services. 
> >> > > > > I am not sure why you think same random number is problem, as mentioned > > above, even if the dsm name is same due to same random number, the code has > > logic to process it appropriately (regenerate the name of dsm). Having said > > that, I don't mean to say that we shouldn't have logic to generate unique > > name and I think we might want to add data dir path to name generation as we > > do for main shared memory, however it is better to first completely > > understand the underneath issue. > > Yes, same random number generation is not the problem. In windows apart > from EEXIST error, EACCES also needs to be validated and returned for > new random number generation, instead of throwing an error. > Doing the same handling for EACCES doesn't seem to be sane because if EACCES came for reason other than duplicate dsm name, then we want to report the error instead of trying to regenerate the name. I think here fix should be to append data_dir path as we do for main shared memory. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] OOM in libpq and infinite loop with getCopyStart()
On Mon, Mar 21, 2016 at 10:13 PM, Robert Haas wrote: > > On Tue, Mar 1, 2016 at 12:38 AM, Michael Paquier > wrote: > > Thoughts? I have registered that in the CF app, and a patch is attached. > > It is very difficult to believe that this is a good idea: > > --- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c > +++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c > @@ -445,6 +445,7 @@ libpqrcv_PQexec(const char *query) > if (PQresultStatus(lastResult) == PGRES_COPY_IN || > PQresultStatus(lastResult) == PGRES_COPY_OUT || > PQresultStatus(lastResult) == PGRES_COPY_BOTH || > +PQresultStatus(lastResult) == PGRES_FATAL_ERROR || > PQstatus(streamConn) == CONNECTION_BAD) > break; > } > > I mean, why would it be a good idea to blindly skip over fatal errors? > I think it is not about skipping the FATAL error, but rather about stopping trying to get further results on a FATAL error. This is to ensure that the OOM error is reported rather than ignored. There has been discussion about this previously as well [1]. [1] - http://www.postgresql.org/message-id/CAB7nPqT6gKj6iS9VTPth_h6Sz5Jo-177s6QJN_jrW66wyCjJ=w...@mail.gmail.com With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
[HACKERS] Re: [HACKERS] Re: [HACKERS] Re: [HACKERS] Windows service is not starting so there’s message in log: FATAL: "could not create shared memory segment “Global/PostgreSQL.851401618”: Permission
On Tue, Mar 22, 2016 at 9:13 AM, Haribabu Kommi wrote: > > On Tue, Mar 22, 2016 at 2:19 PM, Amit Kapila wrote: > > On Mon, Mar 21, 2016 at 6:16 PM, Haribabu Kommi < kommi.harib...@gmail.com> > > wrote: > >> > >> On Mon, Mar 14, 2016 at 4:51 PM, Amit Kapila > >> wrote: > >> >> Operating system - windows 7 > >> >> Binary - PostgreSQL 9.5 (This doesn't matter, 9.4+ can produce the > >> >> problem) > >> >> > >> >> 1. Create two standard users in the system (test_user1 and test_user2) > >> >> 2. Create two databases belongs each user listed above. > >> >> 3. Now using pg_ctl register the services for the two users. > >> >> 4. Provide logon permissions to these users to run the services by > >> >> changing > >> >> service properties. > >> > > >> > Did you mean to say that you changed Log on as: Local System Account in > >> > service properties or something else? > >> > >> No. Not as local service. The user should be the new standard user > >> that is created > >> in the system. > >> > > > > So what do you exactly mean by "Provide logon permissions to these users", > > can you describe in detail what exactly you have done to give those > > permissions. If I try to do with a new user, it gives me error "could not > > open service manager" at start of service. > > 1. Start the cmd with administrator user and add the new postgresql service > with a standard user that is created. > 2. Start the services window with the user having administrator privileges and > go to the corresponding added service. > 3. Right click on the service provides an properties option. > 4. In the properties, there is an logon tab. Click it > 5. Provide the password for the new user that is used for creating the service. > 6. This adds the user to log on permissions. > I am also able to reproduce the issue with these steps. > >> > >> Yes, same random number generation is not the problem. 
In windows apart > >> from EEXIST error, EACCES also needs to be validated and returned for > >> new random number generation, instead of throwing an error. > >> > > > > Doing the same handling for EACCES doesn't seem to be sane because if EACCES > > came for reason other than duplicate dsm name, then we want to report the > > error instead of trying to regenerate the name. I think here fix should be > > to append data_dir path as we do for main shared memory. > > Yes, EACCES may be possible other than duplicate dsm name. > So as far as I can see there are two ways to resolve this issue: one is to retry generation of the dsm name if CreateFileMapping returns EACCES, and the second is to append the data_dir name to the dsm name as is done for the main shared memory, which will avoid the error altogether. The first approach has a minor flaw: if CreateFileMapping returns EACCES for a reason other than a duplicate dsm name (which I am not sure is possible to identify), then we should report the error instead of trying to regenerate the name. Robert and/or others, can you share your opinion on what is the best way to proceed for this issue? With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
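The second option discussed above, making the segment name unique per cluster by mixing in the data directory, can be sketched in Python. This is an illustrative model only; the function name, the digest scheme, and the name layout are assumptions, not PostgreSQL's actual dsm naming code:

```python
import hashlib

def dsm_segment_name(data_dir: str, handle: int) -> str:
    # Hypothetical per-cluster dsm name: mixing a digest of the data
    # directory into the name means two clusters (e.g. two Windows
    # services run by different users) can never generate colliding
    # names, even if they draw the same random handle.
    tag = hashlib.sha1(data_dir.encode()).hexdigest()[:8]
    return "Global/PostgreSQL.%s.%d" % (tag, handle)
```

With a scheme like this, the EACCES-on-collision question goes away for the cross-cluster case, because the names differ by construction rather than by retry.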
Re: [HACKERS] OOM in libpq and infinite loop with getCopyStart()
On Tue, Mar 22, 2016 at 9:46 AM, Tom Lane wrote: > > Amit Kapila writes: > > On Mon, Mar 21, 2016 at 10:13 PM, Robert Haas wrote: > >> It is very difficult to believe that this is a good idea: > >> > >> --- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c > >> +++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c > >> @@ -445,6 +445,7 @@ libpqrcv_PQexec(const char *query) > >> if (PQresultStatus(lastResult) == PGRES_COPY_IN || > >> PQresultStatus(lastResult) == PGRES_COPY_OUT || > >> PQresultStatus(lastResult) == PGRES_COPY_BOTH || > >> +PQresultStatus(lastResult) == PGRES_FATAL_ERROR || > >> PQstatus(streamConn) == CONNECTION_BAD) > >> break; > >> > >> I mean, why would it be a good idea to blindly skip over fatal errors? > > > I think it is not about skipping the FATAL error, rather to stop trying to > > get further results on FATAL error. > > If the code already includes "lost the connection" as a case to break on, > I'm not quite sure why "got a query error" is not. > This error check is exactly the same as in PQexecFinish(), and some explanation is given there in the comments hinting at the reason for continuing on error; basically, both PQexecFinish() and libpqrcv_PQexec() return the last result if there are many, and PQexecFinish() concatenates the errors as well in some cases. Do we see any use in continuing to get results after getting a PGRES_FATAL_ERROR? With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
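The result-draining loop under discussion can be modeled in a few lines of Python (a simplified sketch of the libpqrcv_PQexec() control flow, not libpq itself): results are consumed and the last one is kept, but the loop stops early on the COPY states, and, with the proposed change, on PGRES_FATAL_ERROR too, so an OOM error is handed back to the caller instead of being overwritten by a later result.

```python
# Statuses on which the (patched) loop stops draining further results.
STOP_STATES = {"PGRES_COPY_IN", "PGRES_COPY_OUT",
               "PGRES_COPY_BOTH", "PGRES_FATAL_ERROR"}

def drain_results(statuses):
    """Return the status the loop would hand back to the caller."""
    last = None
    for status in statuses:
        last = status          # keep only the most recent result
        if status in STOP_STATES:
            break              # stop early; don't let later results
                               # mask a fatal error or a COPY state
    return last
```

Without PGRES_FATAL_ERROR in the stop set, a trailing non-error result would replace the error as the returned status, which is exactly the masking behaviour the patch is trying to prevent.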
Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers
On Tue, Mar 22, 2016 at 4:22 PM, Andres Freund wrote: > > Hi, > > On 2016-03-15 10:47:12 +0530, Amit Kapila wrote: > > @@ -248,12 +256,67 @@ set_status_by_pages(int nsubxids, TransactionId *subxids, > > * Record the final state of transaction entries in the commit log for > > * all entries on a single page. Atomic only on this page. > > * > > + * Group the status update for transactions. This improves the efficiency > > + * of the transaction status update by reducing the number of lock > > + * acquisitions required for it. To achieve the group transaction status > > + * update, we need to populate the transaction status related information > > + * in shared memory and doing it for overflowed sub-transactions would need > > + * a big chunk of shared memory, so we are not doing this optimization for > > + * such cases. This optimization is only applicable if the transaction and > > + * all child sub-transactions belong to same page which we presume to be the > > + * most common case, we might be able to apply this when they are not on same > > + * page, but that needs us to map sub-transactions in proc's XidCache based > > + * on pageno for which each time a group leader needs to set the transaction > > + * status and that can lead to some performance penalty as well because it > > + * needs to be done after acquiring CLogControlLock, so let's leave that > > + * case for now. We don't do this optimization for prepared transactions > > + * as the dummy proc associated with such transactions doesn't have a > > + * semaphore associated with it and the same is required for group status > > + * update. We choose not to create a semaphore for dummy procs for this > > + * purpose as the advantage of using this optimization for prepared transactions > > + * is not clear. > > + * > > I think you should try to break up some of the sentences, one of them > spans 7 lines. > Okay, I will try to do so in next version. 
> I'm actually rather unconvinced that it's all that common that all > subtransactions are on one page. If you have concurrency - otherwise > there'd be not much point in this patch - they'll usually be heavily > interleaved, no? You can argue that you don't care about subxacts, > because they're more often used in less concurrent scenarios, but if > that's the argument, it should actually be made. > Note that we are doing it only when a transaction has less than or equal to 64 subtransactions. Now, I am not denying that there will be cases where subtransactions fall on different pages, but I think the chances of such transactions participating in group mode will be low, and this patch is mainly targeting scalability for short transactions. > > > * Otherwise API is same as TransactionIdSetTreeStatus() > > */ > > static void > > TransactionIdSetPageStatus(TransactionId xid, int nsubxids, > > TransactionId *subxids, XidStatus status, > > -XLogRecPtr lsn, int pageno) > > +XLogRecPtr lsn, int pageno, > > +bool all_xact_same_page) > > +{ > > + /* > > + * If we can immediately acquire CLogControlLock, we update the status > > + * of our own XID and release the lock. If not, use group XID status > > + * update to improve efficiency and if still not able to update, then > > + * acquire CLogControlLock and update it. > > + */ > > + if (LWLockConditionalAcquire(CLogControlLock, LW_EXCLUSIVE)) > > + { > > + TransactionIdSetPageStatusInternal(xid, nsubxids, subxids, status, lsn, pageno); > > + LWLockRelease(CLogControlLock); > > + } > > + else if (!all_xact_same_page || > > + nsubxids > PGPROC_MAX_CACHED_SUBXIDS || > > + IsGXactActive() || > > + !TransactionGroupUpdateXidStatus(xid, status, lsn, pageno)) > > + { > > + LWLockAcquire(CLogControlLock, LW_EXCLUSIVE); > > + > > + TransactionIdSetPageStatusInternal(xid, nsubxids, subxids, status, lsn, pageno); > > + > > + LWLockRelease(CLogControlLock); > > + } > > +} > > > > This code is a bit arcane. 
I think it should be restructured to > a) Directly go for LWLockAcquire if !all_xact_same_page || nsubxids > >PGPROC_MAX_CACHED_SUBXIDS || IsGXactActive(). Going for a conditional >lock acquire first
Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers
On Tue, Mar 22, 2016 at 6:29 PM, Andres Freund wrote: > > On 2016-03-22 18:19:48 +0530, Amit Kapila wrote: > > > I'm actually rather unconvinced that it's all that common that all > > > subtransactions are on one page. If you have concurrency - otherwise > > > there'd be not much point in this patch - they'll usually be heavily > > > interleaved, no? You can argue that you don't care about subxacts, > > > because they're more often used in less concurrent scenarios, but if > > > that's the argument, it should actually be made. > > > > > > > Note, that we are doing it only when a transaction has less than equal to > > 64 sub transactions. > > So? > They should fall on one page, unless they are heavily interleaved as pointed out by you. Whether or not subtransactions are present, I think this patch won't help bigger transactions. I will address your other review comments and send an updated patch. Thanks for the review. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers
On Tue, Mar 22, 2016 at 4:22 PM, Andres Freund wrote: > > Hi, > > On 2016-03-15 10:47:12 +0530, Amit Kapila wrote: > > @@ -248,12 +256,67 @@ set_status_by_pages(int nsubxids, TransactionId *subxids, > > * Record the final state of transaction entries in the commit log for > > * all entries on a single page. Atomic only on this page. > > * > > + * Group the status update for transactions. This improves the efficiency > > + * of the transaction status update by reducing the number of lock > > + * acquisitions required for it. To achieve the group transaction status > > + * update, we need to populate the transaction status related information > > + * in shared memory and doing it for overflowed sub-transactions would need > > + * a big chunk of shared memory, so we are not doing this optimization for > > + * such cases. This optimization is only applicable if the transaction and > > + * all child sub-transactions belong to same page which we presume to be the > > + * most common case, we might be able to apply this when they are not on same > > + * page, but that needs us to map sub-transactions in proc's XidCache based > > + * on pageno for which each time a group leader needs to set the transaction > > + * status and that can lead to some performance penalty as well because it > > + * needs to be done after acquiring CLogControlLock, so let's leave that > > + * case for now. We don't do this optimization for prepared transactions > > + * as the dummy proc associated with such transactions doesn't have a > > + * semaphore associated with it and the same is required for group status > > + * update. We choose not to create a semaphore for dummy procs for this > > + * purpose as the advantage of using this optimization for prepared transactions > > + * is not clear. > > + * > > I think you should try to break up some of the sentences, one of them > spans 7 lines. > Okay, I have simplified the sentences in the comment. 
> > > > * Otherwise API is same as TransactionIdSetTreeStatus() > > */ > > static void > > TransactionIdSetPageStatus(TransactionId xid, int nsubxids, > > TransactionId *subxids, XidStatus status, > > -XLogRecPtr lsn, int pageno) > > +XLogRecPtr lsn, int pageno, > > +bool all_xact_same_page) > > +{ > > + /* > > + * If we can immediately acquire CLogControlLock, we update the status > > + * of our own XID and release the lock. If not, use group XID status > > + * update to improve efficiency and if still not able to update, then > > + * acquire CLogControlLock and update it. > > + */ > > + if (LWLockConditionalAcquire(CLogControlLock, LW_EXCLUSIVE)) > > + { > > + TransactionIdSetPageStatusInternal(xid, nsubxids, subxids, status, lsn, pageno); > > + LWLockRelease(CLogControlLock); > > + } > > + else if (!all_xact_same_page || > > + nsubxids > PGPROC_MAX_CACHED_SUBXIDS || > > + IsGXactActive() || > > + !TransactionGroupUpdateXidStatus(xid, status, lsn, pageno)) > > + { > > + LWLockAcquire(CLogControlLock, LW_EXCLUSIVE); > > + > > + TransactionIdSetPageStatusInternal(xid, nsubxids, subxids, status, lsn, pageno); > > + > > + LWLockRelease(CLogControlLock); > > + } > > +} > > > > This code is a bit arcane. I think it should be restructured to > a) Directly go for LWLockAcquire if !all_xact_same_page || nsubxids > >PGPROC_MAX_CACHED_SUBXIDS || IsGXactActive(). Going for a conditional >lock acquire first can be rather expensive. > b) I'd rather see an explicit fallback for the >!TransactionGroupUpdateXidStatus case, this way it's too hard to >understand. It's also harder to add probes to detect whether that > Changed. > > > > The first process to add itself to the list will acquire > > + * CLogControlLock in exclusive mode and perform TransactionIdSetPageStatusInternal > > + * on behalf of all group members. 
This avoids a great deal of contention > > + * around CLogControlLock when many processes are trying to commit at once, > > + * since the lock need not be repeatedly handed off from one committing > > + * process to the next. > > + * > > + * Returns true, if transaction status is updated in clog page, else return
Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers
On Wed, Mar 23, 2016 at 12:26 PM, Amit Kapila wrote: > > On Tue, Mar 22, 2016 at 4:22 PM, Andres Freund wrote: > > > > > > I think it's worthwhile to create a benchmark that does something like > > BEGIN;SELECT ... FOR UPDATE; SELECT pg_sleep(random_time); > > INSERT;COMMIT; you'd find that if random is a bit larger (say 20-200ms, > > completely realistic values for network RTT + application computation), > > the success rate of group updates shrinks noticeably. > > > > Will do some tests based on above test and share results. > Forgot to mention that the effect of the patch is more visible with unlogged tables, so I will do the test with those, and request you to use the same if you yourself are also planning to perform some tests. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers
On Tue, Mar 22, 2016 at 12:33 PM, Dilip Kumar wrote: > > > On Thu, Mar 17, 2016 at 11:39 AM, Amit Kapila wrote: > > I have reviewed the patch.. here are some review comments, I will continue to review.. > > 1. > > + > + /* > + * Add the proc to list, if the clog page where we need to update the > > + */ > + if (nextidx != INVALID_PGPROCNO && > + ProcGlobal->allProcs[nextidx].clogGroupMemberPage != proc->clogGroupMemberPage) > + return false; > > Should we clear all these structure variables that we set above in case we are not adding ourselves to the group? I can see it will not have any problem even if we don't clear them; > I think if we don't want to clear them, we can add a comment mentioning the same. > I have updated the patch to just clear clogGroupMember as that is what is done when we wake the processes. > > 2. > > Here we are updating in our own proc, I think we don't need atomic operation here, we are not yet added to the list. > > + if (nextidx != INVALID_PGPROCNO && + ProcGlobal->allProcs[nextidx].clogGroupMemberPage != proc->clogGroupMemberPage) > + return false; > + > + pg_atomic_write_u32(&proc->clogGroupNext, nextidx); > > We won't be able to assign nextidx to clogGroupNext directly, as clogGroupNext is of type pg_atomic_uint32. Thanks for the review. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
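The group-update protocol being reviewed here, each backend pushes itself onto a shared list, and whoever lands on an empty list becomes the leader that updates the clog for everyone, can be modeled compactly in Python. This is a toy sketch of the pattern, not the patch: the class and method names are invented, and a plain lock stands in for the atomic compare-and-swap the real code uses on the list head:

```python
import threading

class GroupUpdateList:
    """Toy model of the clog group-update list. push() returns True
    for the member that becomes the group leader (i.e. the one that
    found the list empty); the leader later pops the whole group and
    performs the status update on behalf of every member."""
    INVALID = -1  # stands in for INVALID_PGPROCNO

    def __init__(self, nprocs):
        self.head = self.INVALID
        self.next = [self.INVALID] * nprocs
        self.lock = threading.Lock()  # stand-in for the atomic CAS

    def push(self, idx):
        with self.lock:
            self.next[idx] = self.head   # link to previous head
            self.head = idx
            return self.next[idx] == self.INVALID  # True => leader

    def pop_all(self):
        with self.lock:
            members, cur = [], self.head
            while cur != self.INVALID:
                members.append(cur)
                cur = self.next[cur]
            self.head = self.INVALID     # detach the whole group
            return members
```

The payoff of the pattern is that CLogControlLock is taken once per group rather than once per committing backend, which is where the contention reduction comes from.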
Re: [HACKERS] Relation extension scalability
On Thu, Mar 24, 2016 at 12:09 AM, Robert Haas wrote: > > On Tue, Mar 22, 2016 at 1:12 PM, Petr Jelinek wrote: > > I've read this over several times and looked at > RecordAndGetPageWithFreeSpace() and I'm still confused. First of all, > if the lock was acquired by some other backend which did > RelationAddExtraBlocks(), it *will* have updated the FSM - that's the > whole point. > It doesn't update the FSM all the way up to the root in some cases, as per the comments on top of RecordPageWithFreeSpace and the code as well. > > Second, if the other backend extended the relation in > some other manner and did not extend the FSM, how does calling > RecordAndGetPageWithFreeSpace help? As far as I can see, > GetPageWithFreeSpace and RecordAndGetPageWithFreeSpace are both just > searching the FSM, so if one is stymied the other will be too. What > am I missing? > RecordAndGetPageWithFreeSpace() tries to search from the oldPage passed to it, rather than from the top, so even if RecordPageWithFreeSpace() doesn't update up to the root, it will be able to find the newly added page. I agree with what you have said in another mail that we should introduce a new API to do a more targeted search for such cases. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers
On Thu, Mar 24, 2016 at 5:40 AM, Andres Freund wrote: > > On 2016-03-23 21:43:41 +0100, Andres Freund wrote: > > I'm playing around with SELECT txid_current(); right now - that should > > be about the most specific load for setting clog bits. > > Or so I thought. > > In my testing that showed just about zero performance difference between > the patch and master. And more surprisingly, profiling showed very > little contention on the control lock. Hacking > TransactionIdSetPageStatus() to return without doing anything, actually > only showed minor performance benefits. > > [there's also the fact that txid_current() indirectly acquires two > lwlock twice, which showed up more prominently than control lock, but > that I could easily hack around by adding a xid_current().] > > Similar with an INSERT only workload. And a small scale pgbench. > > > Looking through the thread showed that the positive results you'd posted > all were with relatively big scale factors. > I have seen smaller benefits at 300 scale factor and somewhat larger benefits at 1000 scale factor. Also, Mithun has done similar testing with unlogged tables and the results of the same [1] also look good. > > Which made me think. Running > a bigger pgbench showed that most the interesting (i.e. long) lock waits > were both via TransactionIdSetPageStatus *and* TransactionIdGetStatus(). > Yes, this is the same as what I have observed as well. > > So I think what happens is that once you have a big enough table, the > UPDATEs standard pgbench does start to often hit *old* xids (in unhinted > rows). Thus old pages have to be read in, potentially displacing slru > content needed very shortly after. > > > Have you, in your evaluation of the performance of this patch, done > profiles over time? I.e. whether the performance benefits are there > immediately, or only after a significant amount of test time? Comparing > TPS over time, for both patched/unpatched looks relevant. 
I have mainly done it with half-hour read-write tests. What do you want to observe via smaller tests? Sometimes they give inconsistent data for read-write tests.

> Even after changing to scale 500, the performance benefits on this,
> older 2-socket, machine were minor; even though contention on the
> ClogControlLock was the second most severe (after ProcArrayLock).
>

I have tried this patch mainly on an 8-socket machine with 300 and 1000 scale factors. I am hoping that you have tried this test on unlogged tables; by the way, at what client count have you seen these results?

> Afaics that squares with Jesper's result, which basically also didn't
> show a difference either way?
>

One difference was that I think Jesper has done testing with synchronous_commit mode off, whereas my tests were with synchronous_commit mode on.

> I'm afraid that this patch might be putting a bandaid on some of the
> absolutely worst cases, without actually addressing the core
> problem. Simon's patch in [1] seems to come closer to addressing that
> (which I don't believe is safe without doing every status
> manipulation atomically, as individual status bits are smaller than 4
> bytes). Now it's possible to argue that the bandaid might slow the
> bleeding to a survivable level, but I have to admit I'm doubtful.
>
> Here's the stats for a -s 500 run btw:
> Performance counter stats for 'system wide':
>   18,747  probe_postgres:TransactionIdSetTreeStatus
>   68,884  probe_postgres:TransactionIdGetStatus
>    9,718  probe_postgres:PGSemaphoreLock
> (the PGSemaphoreLock is over 50% ProcArrayLock, followed by ~15%
> SimpleLruReadPage_ReadOnly)
>
> My suspicion is that a better approach for now would be to take Simon's
> patch, but add a (per-page?) 'ClogModificationLock' to avoid the need
> of doing something fancier in TransactionIdSetStatusBit().
>

I think we can try that as well, and if you see better results by that approach, then we can use that instead of the current patch.
[1] - http://www.postgresql.org/message-id/cad__oujrdwqdjdovhahqldg-6ivu6ibci9ij1qpu6atuqpl...@mail.gmail.com With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
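To make the clog mechanics in this exchange concrete: clog stores two status bits per transaction, so one 8 kB page covers 32768 xids, and every xid mapping onto the same page funnels through the same control lock — which is why per-page locking and atomic status manipulation keep coming up. A simplified sketch of the addressing (hypothetical helper names, not the real clog.c/slru.c code):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Simplified sketch of clog addressing: 2 status bits per xact means
 * 4 xacts per byte and 32768 xacts per 8 kB page.  Status values here
 * follow the usual convention (0 = in progress, 1 = committed,
 * 2 = aborted, 3 = subcommitted), but this is illustrative code only.
 */
#define BLCKSZ              8192
#define CLOG_BITS_PER_XACT  2
#define CLOG_XACTS_PER_BYTE 4
#define CLOG_XACTS_PER_PAGE (BLCKSZ * CLOG_XACTS_PER_BYTE)

typedef uint32_t TransactionId;

/* which clog page an xid's status lives on */
static int clog_page_no(TransactionId xid)
{
    return (int) (xid / CLOG_XACTS_PER_PAGE);
}

static void clog_set_status(uint8_t *page, TransactionId xid, int status)
{
    int byteno = (int) ((xid % CLOG_XACTS_PER_PAGE) / CLOG_XACTS_PER_BYTE);
    int bshift = (int) ((xid % CLOG_XACTS_PER_BYTE) * CLOG_BITS_PER_XACT);

    /* clear the xid's 2 bits, then set the new status */
    page[byteno] = (uint8_t) ((page[byteno] & ~(0x03 << bshift)) |
                              (status << bshift));
}

static int clog_get_status(const uint8_t *page, TransactionId xid)
{
    int byteno = (int) ((xid % CLOG_XACTS_PER_PAGE) / CLOG_XACTS_PER_BYTE);
    int bshift = (int) ((xid % CLOG_XACTS_PER_BYTE) * CLOG_BITS_PER_XACT);

    return (page[byteno] >> bshift) & 0x03;
}
```

Note how four xids share a byte: this is why individual status updates cannot be made independent without either a lock covering the byte/page or a compare-and-swap loop, as discussed above.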
Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers
On Thu, Mar 24, 2016 at 8:08 AM, Amit Kapila wrote:
>
> On Thu, Mar 24, 2016 at 5:40 AM, Andres Freund wrote:
> >
> > Even after changing to scale 500, the performance benefits on this,
> > older 2-socket, machine were minor; even though contention on the
> > ClogControlLock was the second most severe (after ProcArrayLock).
> >

One more point I wanted to make here is that I think the benefit will show up mainly when contention on the ClogControlLock is more than or near to that on ProcArrayLock; otherwise, even if the patch reduces contention (which you can see via LWLock stats), performance doesn't increase. From Mithun's data [1] related to LWLocks, it seems that at 88 clients in his test, the contention on CLogControlLock becomes more than that on ProcArrayLock, and that is the point where it starts showing a noticeable performance gain. I have explained some more about this point on that thread [2].

Is it possible for you to test once in a similar situation (like for a client count greater than the number of cores) and see the behaviour w.r.t. locking contention and TPS?

[1] - http://www.postgresql.org/message-id/cad__ouh6pahj+q1mzzjdzlo4v9gjyufegtjnayxc0_lfh-4...@mail.gmail.com
[2] - http://www.postgresql.org/message-id/caa4ek1lboq4e3ycge+fe0euzvu89cqgtugneajoienxjr0a...@mail.gmail.com

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Relation extension scalability
On Thu, Mar 24, 2016 at 1:48 PM, Petr Jelinek wrote:
>
> On 24/03/16 07:04, Dilip Kumar wrote:
>>
>> On Thu, Mar 24, 2016 at 10:44 AM, Robert Haas wrote:
>>
>> On Wed, Mar 23, 2016 at 9:43 PM, Amit Kapila wrote:
>> > RecordAndGetPageWithFreeSpace() tries to search from the oldPage passed to
>> > it, rather than from the top, so even if RecordPageWithFreeSpace() doesn't
>> > update till the root, it will be able to search the newly added page. I agree
>> > with whatever you have said in another mail that we should introduce a new
>> > API to do a more targeted search for such cases.
>>
>> OK, let's do that, then.
>>
>> Ok, I have added a new API which just finds the free block, starting the
>> search from the last given page.
>>
>> 1. I have named the new function GetPageWithFreeSpaceUsingOldPage. I
>> don't like this name, but I could not come up with any better one. Please
>> suggest one.
>>

1.
+GetPageWithFreeSpaceUsingOldPage(Relation rel, BlockNumber oldPage,
+ Size spaceNeeded)
{
..
+ /*
+ * If fsm_set_and_search found a suitable new block, return that.
+ * Otherwise, search as usual.
+ */
..
}

In the above comment, you are referring to the wrong function.

2.
+static int
+fsm_search_from_addr(Relation rel, FSMAddress addr, uint8 minValue)
+{
+ Buffer buf;
+ int newslot = -1;
+
+ buf = fsm_readbuf(rel, addr, true);
+ LockBuffer(buf, BUFFER_LOCK_SHARE);
+
+ if (minValue != 0)
+ {
+ /* Search while we still hold the lock */
+ newslot = fsm_search_avail(buf, minValue,
+ addr.level == FSM_BOTTOM_LEVEL,
+ false);

In this new API, I don't understand why we need the minValue != 0 check; basically, if the user of the API doesn't want to search for space > 0, then what is the need of calling this API? I think this API should use an Assert for minValue != 0, unless you see a reason for not doing so.

> GetNearestPageWithFreeSpace?
> (although not sure that's an accurate description; maybe Nearby would be better)
>

Better than what is used in the patch. Yet another possibility could be to call it GetPageWithFreeSpaceExtended and call it from GetPageWithFreeSpace with the value of oldPage as InvalidBlockNumber.

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
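The GetPageWithFreeSpaceExtended suggestion above follows a familiar shape: the plain entry point becomes a thin wrapper over an "Extended" variant, passing InvalidBlockNumber when there is no hint page. A minimal hypothetical sketch of that shape — the function bodies are stubs standing in for the real FSM tree lookup:

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t BlockNumber;
#define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)

/*
 * Hypothetical sketch of the suggested API shape.  The "Extended"
 * variant optionally starts the search near a caller-supplied old
 * page; the plain variant is a wrapper that searches from the top.
 * The bodies below are stubs, not the real freespace.c logic.
 */
static BlockNumber
GetPageWithFreeSpaceExtended(BlockNumber oldPage, int spaceNeeded)
{
    (void) spaceNeeded;         /* a real search would use this */

    if (oldPage != InvalidBlockNumber)
        return oldPage + 1;     /* stub: pretend the page after oldPage fits */
    return 0;                   /* stub: pretend a top-down search found block 0 */
}

static BlockNumber
GetPageWithFreeSpace(int spaceNeeded)
{
    /* existing call sites keep this signature, unchanged behaviour */
    return GetPageWithFreeSpaceExtended(InvalidBlockNumber, spaceNeeded);
}
```

The design benefit is that existing callers of GetPageWithFreeSpace need no changes, while the relation-extension path can pass the last-extended block as a hint.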
Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers
On Thu, Mar 24, 2016 at 8:08 AM, Amit Kapila wrote:
>
> On Thu, Mar 24, 2016 at 5:40 AM, Andres Freund wrote:
> >
> > Have you, in your evaluation of the performance of this patch, done
> > profiles over time? I.e. whether the performance benefits are there
> > immediately, or only after a significant amount of test time? Comparing
> > TPS over time, for both patched/unpatched, looks relevant.
> >
> I have mainly done it with half-hour read-write tests. What do you want to observe via smaller tests? Sometimes they give inconsistent data for read-write tests.
>

I have done some tests on both Intel and POWER m/c (configurations of which are mentioned at the end of this mail) to see the results at different time intervals, and it always shows greater than 50% improvement on the POWER m/c at 128 client count and greater than 29% improvement on the Intel m/c at 88 client count.

Non-default parameters
max_connections = 300
shared_buffers=8GB
min_wal_size=10GB
max_wal_size=15GB
checkpoint_timeout=35min
maintenance_work_mem = 1GB
checkpoint_completion_target = 0.9
wal_buffers = 256MB

pgbench setup
scale factor - 300
used *unlogged* tables: pgbench -i --unlogged-tables -s 300 ..
pgbench -M prepared tpc-b

Results on Intel m/c (client count - 88)
Time (minutes)   Base    Patch   %
5                39978   51858   29.71
10               38169   52195   36.74
20               36992   52173   41.03
30               37042   52149   40.78

Results on POWER m/c (client count - 128)
Time (minutes)   Base    Patch   %
5                42479   65655   54.55
10               41876   66050   57.72
20               38099   65200   71.13
30               37838   61908   63.61

> > Even after changing to scale 500, the performance benefits on this,
> > older 2-socket, machine were minor; even though contention on the
> > ClogControlLock was the second most severe (after ProcArrayLock).
> >
> I have tried this patch mainly on an 8-socket machine with 300 and 1000 scale factors. I am hoping that you have tried this test on unlogged tables; by the way, at what client count have you seen these results?
Do you think we don't see an increase in performance in your tests because of the m/c difference (sockets/CPU cores) or the client count?

Intel m/c config (lscpu) -
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 128
On-line CPU(s) list: 0-127
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 8
NUMA node(s): 8
Vendor ID: GenuineIntel
CPU family: 6
Model: 47
Model name: Intel(R) Xeon(R) CPU E7- 8830 @ 2.13GHz
Stepping: 2
CPU MHz: 1064.000
BogoMIPS: 4266.62
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 24576K
NUMA node0 CPU(s): 0,65-71,96-103
NUMA node1 CPU(s): 72-79,104-111
NUMA node2 CPU(s): 80-87,112-119
NUMA node3 CPU(s): 88-95,120-127
NUMA node4 CPU(s): 1-8,33-40
NUMA node5 CPU(s): 9-16,41-48
NUMA node6 CPU(s): 17-24,49-56
NUMA node7 CPU(s): 25-32,57-64

Power m/c config (lscpu) -
Architecture: ppc64le
Byte Order: Little Endian
CPU(s): 192
On-line CPU(s) list: 0-191
Thread(s) per core: 8
Core(s) per socket: 1
Socket(s): 24
NUMA node(s): 4
Model: IBM,8286-42A
L1d cache: 64K
L1i cache: 32K
L2 cache: 512K
L3 cache: 8192K
NUMA node0 CPU(s): 0-47
NUMA node1 CPU(s): 48-95
NUMA node2 CPU(s): 96-143
NUMA node3 CPU(s): 144-191

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
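For anyone re-deriving the result tables above: the % column is the plain relative TPS gain of patched over base. A trivial sketch of that arithmetic, checked against the first rows of the Intel and POWER tables:

```c
#include <assert.h>

/*
 * Relative TPS improvement, as used in the % column of the tables
 * above: 100 * (patched - base) / base.
 */
static double
tps_improvement_pct(double base, double patched)
{
    return 100.0 * (patched - base) / base;
}
```

For example, the Intel 5-minute row is 100 * (51858 - 39978) / 39978 ≈ 29.71%.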
Re: [HACKERS] Performance degradation in commit 6150a1b0
On Wed, Mar 23, 2016 at 1:59 PM, Ashutosh Sharma wrote:
>
> Hi All,
>
> I have been working on this issue for the last few days, trying to investigate what could be the probable reasons for the performance degradation at commit 6150a1b0. After going through Andres' patch for moving the buffer I/O and content lock out of the main tranche, the following two things come to my mind.
>
> 1. The content lock is no longer used as a pointer in the BufferDesc structure; instead it is included as an LWLock structure. This basically increases the overall structure size from 64 bytes to 80 bytes. Just to investigate this, I have reverted the changes related to the content lock from commit 6150a1b0 and taken at least 10 readings, and with this change I can see that the overall performance is similar to what was observed earlier, i.e. before commit 6150a1b0.
>
> 2. Secondly, I can see that the BufferDesc structure padding is 64 bytes; however, the PG cache line alignment is 128 bytes. After changing the BufferDesc structure padding size to 128 bytes, along with the changes mentioned in point #1 above, I see that the overall performance is again similar to what was observed before commit 6150a1b0.
>
> Please have a look at the attached test report that contains the performance test results for all the scenarios discussed above, and let me know your thoughts.
>

So this indicates that changing the content lock back to LWLock* in BufferDesc brings back the performance, which indicates that the increase in BufferDesc size to more than 64 bytes on this platform has caused the regression. I think it is worth trying the patch [1] suggested by Andres, as that will reduce the size of BufferDesc, which can bring back the performance. Can you try the same?

[1] - http://www.postgresql.org/message-id/capphfdsrot1jmsnrnccqpnzeu9vut7tx6b-n1wyouwwfhd6...@mail.gmail.com

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
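The two effects described above come down to simple padding arithmetic: descriptors are padded up to a fixed unit, and once the struct grows past that unit (64 → 80 bytes after embedding the lock), each descriptor occupies two units and neighbouring descriptors start sharing cache lines. A small illustrative sketch — the numbers are taken from the mail above, not from a real build:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Round a struct size up to a multiple of the padding unit -- the same
 * arithmetic that a union-with-char-array padding trick performs at
 * compile time for padded buffer descriptors.
 */
static size_t
pad_to(size_t size, size_t unit)
{
    return ((size + unit - 1) / unit) * unit;
}
```

With 64-byte padding, a 64-byte descriptor fits exactly one unit; an 80-byte descriptor pads out to 128 bytes, spanning two 64-byte lines (or, on a 128-byte-line platform, padding the struct directly to 128 bytes keeps each descriptor on its own line, matching observation #2 above).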
Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers
On Thu, Mar 24, 2016 at 8:08 AM, Amit Kapila wrote:
>
> On Thu, Mar 24, 2016 at 5:40 AM, Andres Freund wrote:
> >
> > Even after changing to scale 500, the performance benefits on this,
> > older 2-socket, machine were minor; even though contention on the
> > ClogControlLock was the second most severe (after ProcArrayLock).
> >
> I have tried this patch mainly on an 8-socket machine with 300 and 1000 scale factors. I am hoping that you have tried this test on unlogged tables; by the way, at what client count have you seen these results?
>
> > Afaics that squares with Jesper's result, which basically also didn't
> > show a difference either way?
> >
> One difference was that I think Jesper has done testing with synchronous_commit mode off, whereas my tests were with synchronous_commit mode on.
>

On again looking at the results posted by Jesper [1] and Mithun [2], I have one more observation, which is that in HEAD, the performance doesn't dip even at higher client counts (>75) in the tests done by Jesper, whereas the results of the tests done by Mithun indicate that it dips at high client counts (>64) in HEAD, and that is where the patch is helping. Now there is certainly some difference in test environment: Jesper has done testing on a 2-socket m/c, whereas mine and Mithun's tests were done on 4- or 8-socket m/c. So I think the difference in TPS due to reduced contention on CLogControlLock is mainly visible on high-socket m/c.

Can anybody having access to a 4-or-more-socket m/c help in testing this patch with --unlogged-tables?

[1] - http://www.postgresql.org/message-id/56e9a596.2000...@redhat.com
[2] - http://www.postgresql.org/message-id/cad__oujrdwqdjdovhahqldg-6ivu6ibci9ij1qpu6atuqpl...@mail.gmail.com

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Relation extension scalability
On Mon, Mar 28, 2016 at 1:55 AM, Robert Haas wrote:
>
> On Sun, Mar 27, 2016 at 8:00 AM, Dilip Kumar wrote:
> >
> > Results:
> > --------
> > 1. With this, performance is a little less than v14, but the problem of extra
> > relation size is solved.
> > 2. With this, we can conclude that the extra size of the relation in v14 is
> > because, while extending the pages, some are not immediately available, and at
> > the end some of the pages are left unused.
>
> I agree with that conclusion. I'm not quite sure where that leaves
> us, though. We can go back to v13, but why isn't that producing extra
> pages? It seems like it should: whenever a bulk extend rolls over to
> a new FSM page, concurrent backends will search either the old or the
> new one but not both.
>

I have not debugged the flow, but by looking at the v13 code, it looks like it will search both old and new. In the call chain GetPageWithFreeSpaceExtended() -> fsm_search_from_addr() -> fsm_search_avail(), the basic idea of the search is: start the search from the target slot; at every step, move one node to the right, then climb up to the parent; stop when we reach a node with enough free space (as we must, since the root has enough space). So shouldn't it be able to find the new FSM page where the bulk extend rolls over?

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
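The search order described above can be sketched on a toy max-heap FSM page (root at index 0, children of node i at 2i+1 and 2i+2, leaves holding free-space categories and each internal node the max of its children). In that indexing, "move one node to the right, then climb to the parent" collapses to node/2, since the parent of node+1 is (node+1-1)/2 = node/2. This is only an illustration — the real fsm_search_avail() also wraps at level boundaries and copes with concurrently changing values:

```c
#include <assert.h>

/*
 * Toy version of the FSM within-page search sketched above.  tree[] is
 * a perfect binary max-heap with nleaves leaves stored after the
 * internal nodes; returns the slot number of a leaf with at least
 * `min` free space, or -1 if none exists.
 */
static int
fsm_search_avail_toy(const int *tree, int nleaves, int start_slot, int min)
{
    int node = nleaves - 1 + start_slot;    /* heap index of the starting leaf */

    if (tree[0] < min)
        return -1;                          /* root says no slot has enough space */

    /*
     * Climb phase: while the current node lacks space, move one node to
     * the right and up to that node's parent -- simply node/2 here,
     * because parent(node + 1) == node / 2 in this layout.  The loop
     * must terminate, since the root is known to have enough space.
     */
    while (node != 0 && tree[node] < min)
        node = node / 2;

    /* Descend phase: walk down to a leaf, preferring the left child. */
    while (node < nleaves - 1)
    {
        int left = 2 * node + 1;

        node = (tree[left] >= min) ? left : left + 1;
    }
    return node - (nleaves - 1);            /* convert heap index back to slot */
}
```

Because the climb phase keeps moving right before going up, a search that starts in the "old" part of the tree can still reach subtrees added to its right — which is the behaviour being argued above.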
Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers
On Fri, Sep 11, 2015 at 8:01 PM, Amit Kapila wrote:
>
> On Thu, Sep 3, 2015 at 5:11 PM, Andres Freund wrote:
> >
> Updated comments and the patch (increate_clog_bufs_v2.patch)
> containing the same is attached.
>

Andres mentioned to me in an off-list discussion that he thinks we should first try to fix the clog buffers problem, as he sees in his tests that clog buffer replacement is one of the bottlenecks. He also suggested a test to me to see if the increase in buffers could lead to a regression. The basic idea of the test was to ensure that every clog access becomes a disk access. Based on his suggestion, I have written a SQL statement which will force every CLOG access to be a disk access, and the query used for the same is as below:

With ins AS (INSERT INTO test_clog_access values(default) RETURNING c1)
Select * from test_clog_access where c1 = (Select c1 from ins) - 32768 * :client_id;

Test Results -
HEAD - commit d12e5bb7, Clog Buffers - 32
Patch-1 - Clog Buffers - 64
Patch-2 - Clog Buffers - 128

Patch_Ver/Client_Count   1       64
HEAD                     12677   57470
Patch-1                  12305   58079
Patch-2                  12761   58637

The above data is the median of 3 10-min runs. It indicates that there is no substantial dip from increasing clog buffers.

Test scripts used in the testing are attached with this mail. In perf_clog_access.sh, you need to change the data_directory path as per your m/c; you might also want to change the binary name if you want to create postgres binaries with different names.

Andres, is this test in line with what you have in mind?

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com

access_clog_prep.sql (attachment: binary data)
access_clog.sql (attachment: binary data)
perf_clog_access.sh (attachment: Bourne shell script)

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Relation extension scalability
On Tue, Mar 29, 2016 at 3:21 AM, Petr Jelinek wrote:
>
> On 28/03/16 14:46, Dilip Kumar wrote:
> >
> > Conclusion:
> > -----------
> > 1. I think v15 is solving the problem that exists with v13, and performance
> > is significantly higher compared to base, and relation size is also
> > stable. So IMHO v15 is the winner over the other solutions; what do others think?
> >
> It seems so. Do you have the ability to reasonably test with 64 clients? I am
> mostly wondering if we see the performance going further down or just
> plateau.
>

Yes, that makes sense. One more point is that if the reason for v13 giving better performance is extra blocks (which we believe in certain cases can leak till the time Vacuum updates the FSM tree), do you think it makes sense to test once by increasing the lockWaiters * 20 limit to maybe lockWaiters * 25 or lockWaiters * 30?

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
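As a concrete reading of the knob being discussed: the patch extends the relation by a number of extra blocks proportional to the number of backends waiting on the extension lock, so one extension satisfies the whole queue. A hedged sketch of that heuristic — the multiplier comes from the thread (lockWaiters * 20), while the upper clamp is an assumed illustrative safety cap, not a value stated in this mail:

```c
#include <assert.h>

/* assumed illustrative cap on a single extension, not from this thread */
#define MAX_EXTRA_BLOCKS 512

/*
 * Sketch of the relation-extension heuristic under discussion: extend
 * by (waiters * multiplier) blocks, clamped, so the backends queued on
 * the extension lock all find a free page without re-contending.
 */
static int
extra_blocks_to_extend(int lock_waiters, int multiplier)
{
    int blocks = lock_waiters * multiplier;

    return (blocks > MAX_EXTRA_BLOCKS) ? MAX_EXTRA_BLOCKS : blocks;
}
```

Raising the multiplier to 25 or 30, as proposed above, trades more potentially leaked blocks (until Vacuum repairs the FSM) for fewer extension-lock waits.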
Re: [HACKERS] OOM in libpq and infinite loop with getCopyStart()
On Tue, Mar 29, 2016 at 8:31 PM, David Steele wrote:
>
> Hi Amit,
>
> On 3/22/16 3:37 AM, Michael Paquier wrote:
> >
> > Hope this brings some light in.
>
> Do you know when you'll have time to respond to Michael's last email? I've
> marked this patch "waiting on author" in the meantime.
>

I think this patch needs to be looked at by a committer now. I did a review and added some code to this patch quite a while back (see the e-mail [1]); the patch is just the same as it was in October 2015. I think Michael and I are in agreement that this patch solves the reported problem. There is one similar problem [2], reported by Heikki, which I think can be fixed separately.

I think the main reason for moving this thread from bugs to hackers was to gain some more traction, which I see it has achieved to some extent, but I find that we are more or less in the same situation as we were back in October. Let me know if you think anything more from my side can help in moving the patch forward.

[1] - http://www.postgresql.org/message-id/cab7npqtnrw8lr1hd7zlnojjc1bp1evw_emadgorv+s-sbcr...@mail.gmail.com
[2] - http://www.postgresql.org/message-id/566ef84f.1030...@iki.fi

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Updated backup APIs for non-exclusive backups
On Tue, Mar 29, 2016 at 11:39 PM, Magnus Hagander wrote:
>
> On Tue, Mar 29, 2016 at 6:40 PM, Magnus Hagander wrote:
> >
> > So - I can definitely see the argument for returning the stop wal
> > *location*. But I'm still not sure what the definition of the time would
> > be? We can't return it before we know what it means...
> >
>
> I had a chat with Heikki, and here's another suggestion:
>
> 1. We don't touch the current exclusive backups at all, as previously
> discussed, other than deprecating their use. For backwards compat.
>
> 2. For new backups, we return the contents of pg_control as a bytea from
> pg_stop_backup(). We tell backup programs they are supposed to write this
> out as pg_control.backup, *not* as pg_control.
>
> 3a. On recovery, if it's an exclusive backup, we do as we did before.
>
> 3b. On recovery, in non-exclusive backups (determined from backup_label),
> we check that pg_control.backup exists *and* that pg_control does *not*
> exist.
>

Currently pg_control is read before the backup_label file, so as per this proposal, do you want to change that? If yes, I think that will make this patch more invasive with respect to the handling of failure modes. Also, as David points out, I too feel that it will raise the bar for usage of this API.

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com