Re: [HACKERS] sinval synchronization considered harmful
On Thu, Jul 28, 2011 at 11:05 PM, Noah Misch wrote:
> The comparison I had in mind was (a) master + lazy-vxid + [1]sinval-fastpath
> vs. (b) master + lazy-vxid + [2]sinval-hasmessages.  The only claimed benefit
> of [2] over [1], as far as I can see, is invulnerability to the hazard
> described in [3].  Before selecting [2] over [1], it would be instructive to
> know how much performance we exchange for its benefit.
>
> I did a bit of (relatively unrigorous) poking at this workload with oprofile,
> and I never saw SIInsertDataEntries take more than 0.26% of CPU time.  It's
> perhaps too cheap relative to the workload's other costs to matter.  That's
> good enough for me.

Me, too.

There's another possible benefit, though: in sinval-fastpath, everybody's
got to read maxMsgNum every time through; whereas in sinval-hasmessages,
everyone's reading their own flag.  I'm not sure how much it costs to have
memory that gets read by multiple readers rather than just a single reader,
but I think I've seen some things that indicate it might not always be free.

> Interesting.  I had hypothesized that at two clients per core, nearly every
> call to SIGetDataEntries() would find work to do, making the patch a strict
> loss.  A bit of instrumented testing showed otherwise: at two clients per
> core, sinval-hasmessages optimized away 93% of the calls.  At fifty clients
> per core, it still optimized away more than half of the calls.

Wow.  That's better than I expected.  Great idea for a test, too.

Unless someone has another concern here, I think we've got this one nailed.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
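[Editorial note] The cost Robert is gesturing at above is cache-line sharing: one word polled by every backend bounces between caches on every write, while a per-backend word can stay resident in its reader's cache. A minimal layout sketch of the two shapes; the 64-byte constant, struct names, and explicit padding here are illustrative assumptions, not PostgreSQL's actual ProcState layout:

```c
#include <stdint.h>

#define CACHELINE 64    /* typical line size on x86; an assumption here */

/* sinval-fastpath style: one hot word that every backend polls.  Each
 * writer of maxMsgNum invalidates the line in every reader's cache. */
struct shared_counter {
    volatile uint32_t maxMsgNum;
};

/* sinval-hasmessages style: one flag per backend.  Padding each flag out
 * to its own cache line (which the real ProcState does not do) would
 * additionally keep two backends' flags from false-sharing a line. */
struct per_backend_flag {
    volatile uint32_t hasMessages;
    char pad[CACHELINE - sizeof(uint32_t)];
};
```

With the padded layout, a writer's O(N) stores touch N distinct lines, but each reader's polling load stays local until a message actually arrives.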
Re: [HACKERS] sinval synchronization considered harmful
On Thu, Jul 28, 2011 at 03:03:05PM -0400, Robert Haas wrote:
> On Thu, Jul 28, 2011 at 10:05 AM, Robert Haas wrote:
> >> I'll also test out creating and dropping some tables.
> >
> > Still need to work on this one.
>
> And the results are in.  I set up the following sophisticated test
> script for pgbench:
>
> CREATE TEMP TABLE foo (a int);
> DROP TABLE foo;
>
> And then did 5-minute test runs with varying numbers of clients on
> Nate Boley's 32-core machine with (a) master, (b) master +
> sinval-hasmessages, (c) master + lazy-vxid, and (d) master + lazy-vxid
> + sinval-hasmessages.

The comparison I had in mind was (a) master + lazy-vxid + [1]sinval-fastpath vs. (b) master + lazy-vxid + [2]sinval-hasmessages.  The only claimed benefit of [2] over [1], as far as I can see, is invulnerability to the hazard described in [3].  Before selecting [2] over [1], it would be instructive to know how much performance we exchange for its benefit.

I did a bit of (relatively unrigorous) poking at this workload with oprofile, and I never saw SIInsertDataEntries take more than 0.26% of CPU time.  It's perhaps too cheap relative to the workload's other costs to matter.  That's good enough for me.
[1] http://archives.postgresql.org/message-id/ca+tgmozc8z_jtj44xvpwpxkqt2jgjb5ygcz3t9u5-qzvdbm...@mail.gmail.com
[2] http://archives.postgresql.org/message-id/CA+TgmoZjo1bkP6eX=xox3ahupybmjt9+pkw6qubqzn7sukk...@mail.gmail.com
[3] http://archives.postgresql.org/message-id/20110727033537.gb18...@tornado.leadboat.com

> 80L  tps = 1192.768020 (including connections establishing)
> 80L  tps = 1165.545010 (including connections establishing)
> 80L  tps = 1169.776066 (including connections establishing)
> 80LS tps = 1510.778084 (including connections establishing)
> 80LS tps = 1484.423486 (including connections establishing)
> 80LS tps = 1480.692051 (including connections establishing)
> 80   tps = 1154.272515 (including connections establishing)
> 80   tps = 1168.805881 (including connections establishing)
> 80   tps = 1173.971801 (including connections establishing)
> 80S  tps = 1483.037788 (including connections establishing)
> 80S  tps = 1481.059504 (including connections establishing)
> 80S  tps = 1487.215748 (including connections establishing)
>
> So, apparently, the extra work in SIInsertDataEntries() is more than
> paid for by the speedup in SIGetDataEntries().  I'm guessing that at
> high client counts you win because of reduced spinlock contention, and
> at low client counts you still win a little bit because
> SIGetDataEntries() is called multiple times per transaction, whereas
> SIInsertDataEntries() is only called once.  I could be all wet on the
> reason, but at any rate the numbers look pretty good.

Interesting.  I had hypothesized that at two clients per core, nearly every call to SIGetDataEntries() would find work to do, making the patch a strict loss.  A bit of instrumented testing showed otherwise: at two clients per core, sinval-hasmessages optimized away 93% of the calls.  At fifty clients per core, it still optimized away more than half of the calls.
--
Noah Misch                    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] sinval synchronization considered harmful
On Thu, Jul 28, 2011 at 10:05 AM, Robert Haas wrote:
>> I'll also test out creating and dropping some tables.
>
> Still need to work on this one.

And the results are in.  I set up the following sophisticated test script for pgbench:

CREATE TEMP TABLE foo (a int);
DROP TABLE foo;

And then did 5-minute test runs with varying numbers of clients on Nate Boley's 32-core machine with (a) master, (b) master + sinval-hasmessages, (c) master + lazy-vxid, and (d) master + lazy-vxid + sinval-hasmessages.  I held my breath until the results started popping out, then felt much better.  In the table below, results with L are with the lazy-vxid patch, S is with the sinval-hasmessages patch, LS with both, and results with no letters are neither.  The numbers are the client count.  Long story short, it seems that the patch actually makes this test case faster at any client count, though the improvement at 1 and 8 clients might be in the noise:

01L  tps = 514.880290 (including connections establishing)
01L  tps = 525.097199 (including connections establishing)
01L  tps = 508.319588 (including connections establishing)
08L  tps = 1834.259810 (including connections establishing)
08L  tps = 1846.846089 (including connections establishing)
08L  tps = 1841.402433 (including connections establishing)
32L  tps = 1463.822936 (including connections establishing)
32L  tps = 1481.169483 (including connections establishing)
32L  tps = 1393.780335 (including connections establishing)
80L  tps = 1192.768020 (including connections establishing)
80L  tps = 1165.545010 (including connections establishing)
80L  tps = 1169.776066 (including connections establishing)
01LS tps = 517.624068 (including connections establishing)
01LS tps = 524.507723 (including connections establishing)
01LS tps = 507.847622 (including connections establishing)
08LS tps = 1831.248178 (including connections establishing)
08LS tps = 1873.932133 (including connections establishing)
08LS tps = 1863.048113 (including connections establishing)
32LS tps = 1851.143407 (including connections establishing)
32LS tps = 1754.683356 (including connections establishing)
32LS tps = 1785.926527 (including connections establishing)
80LS tps = 1510.778084 (including connections establishing)
80LS tps = 1484.423486 (including connections establishing)
80LS tps = 1480.692051 (including connections establishing)
01   tps = 511.572832 (including connections establishing)
01   tps = 499.389527 (including connections establishing)
01   tps = 495.697080 (including connections establishing)
08   tps = 1832.762548 (including connections establishing)
08   tps = 1819.884564 (including connections establishing)
08   tps = 1835.608561 (including connections establishing)
32   tps = 1417.168790 (including connections establishing)
32   tps = 1447.478971 (including connections establishing)
32   tps = 1427.489879 (including connections establishing)
80   tps = 1154.272515 (including connections establishing)
80   tps = 1168.805881 (including connections establishing)
80   tps = 1173.971801 (including connections establishing)
01S  tps = 519.860218 (including connections establishing)
01S  tps = 510.759087 (including connections establishing)
01S  tps = 517.159276 (including connections establishing)
08S  tps = 1880.179600 (including connections establishing)
08S  tps = 1829.693454 (including connections establishing)
08S  tps = 1886.168401 (including connections establishing)
32S  tps = 1809.950929 (including connections establishing)
32S  tps = 1809.474070 (including connections establishing)
32S  tps = 1798.620842 (including connections establishing)
80S  tps = 1483.037788 (including connections establishing)
80S  tps = 1481.059504 (including connections establishing)
80S  tps = 1487.215748 (including connections establishing)

So, apparently, the extra work in SIInsertDataEntries() is more than paid for by the speedup in SIGetDataEntries().
I'm guessing that at high client counts you win because of reduced spinlock contention, and at low client counts you still win a little bit because SIGetDataEntries() is called multiple times per transaction, whereas SIInsertDataEntries() is only called once.  I could be all wet on the reason, but at any rate the numbers look pretty good.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] sinval synchronization considered harmful
On Wed, Jul 27, 2011 at 2:29 PM, Robert Haas wrote:
> The reason the benefit is smaller is, I believe, because the previous
> numbers were generated with the lazy vxid locks patch applied, and
> these were generated against master.  With the lock manager as a
> bottleneck, the sinval stuff doesn't get hit quite as hard, so the
> benefit is less.  I can regenerate the numbers with the lazy vxid
> patch applied; I suspect they'll be similar to what we saw before.

Yep.  Here's with both lazy-vxids and sinval-hasmessages:

01 tps = 4470.133776 (including connections establishing)
01 tps = 4471.450650 (including connections establishing)
01 tps = 4490.833194 (including connections establishing)
32 tps = 191416.960956 (including connections establishing)
32 tps = 190653.742400 (including connections establishing)
32 tps = 191832.231458 (including connections establishing)
80 tps = 189348.509378 (including connections establishing)
80 tps = 191080.641878 (including connections establishing)
80 tps = 191366.728275 (including connections establishing)

And with just lazy vxids:

01 tps = 4458.667013 (including connections establishing)
01 tps = 4526.922638 (including connections establishing)
01 tps = 4480.415099 (including connections establishing)
32 tps = 193273.434028 (including connections establishing)
32 tps = 190661.279391 (including connections establishing)
32 tps = 189526.560031 (including connections establishing)
80 tps = 150572.020250 (including connections establishing)
80 tps = 118643.970622 (including connections establishing)
80 tps = 119211.643930 (including connections establishing)

Same select-only, scale-factor-100 pgbench test, same 32-core machine, as I've been using for my other recent tests.

> I'll also test out creating and dropping some tables.

Still need to work on this one.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] sinval synchronization considered harmful
On Wed, Jul 27, 2011 at 1:58 PM, Noah Misch wrote:
>> > I think a benchmark is in order, something like 900 idle connections and
>> > 80 connections running small transactions that create a few temporary
>> > tables.  If that shows no statistically significant regression, then
>> > we're probably fine here.  I'm not sure what result to expect, honestly.
>>
>> That's setting the bar pretty high.  I don't mind doing the
>> experiment, but I'm not sure that's the case we should be optimizing
>> for.
>
> Granted.  How about 32 clients running the temporary table transaction, no
> idle connections?  Given the meager benefit of this patch compared to your
> previous version, it would be hard to justify a notable performance drop in
> return.

The reason the benefit is smaller is, I believe, because the previous numbers were generated with the lazy vxid locks patch applied, and these were generated against master.  With the lock manager as a bottleneck, the sinval stuff doesn't get hit quite as hard, so the benefit is less.  I can regenerate the numbers with the lazy vxid patch applied; I suspect they'll be similar to what we saw before.

I'll also test out creating and dropping some tables.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] sinval synchronization considered harmful
On Wed, Jul 27, 2011 at 01:30:47PM -0400, Robert Haas wrote:
> On Wed, Jul 27, 2011 at 12:55 PM, Noah Misch wrote:
> > [wrong objection]
>
> Eh, how can this possibly happen?  You have to hold msgNumLock to set
> maxMsgNum and msgNumLock to read maxMsgNum.  If that's not enough to
> guarantee that we never read a stale value, what is?

Indeed, my analysis was all wrong.

> > I think a benchmark is in order, something like 900 idle connections and 80
> > connections running small transactions that create a few temporary tables.
> > If that shows no statistically significant regression, then we're probably
> > fine here.  I'm not sure what result to expect, honestly.
>
> That's setting the bar pretty high.  I don't mind doing the
> experiment, but I'm not sure that's the case we should be optimizing
> for.

Granted.  How about 32 clients running the temporary table transaction, no idle connections?  Given the meager benefit of this patch compared to your previous version, it would be hard to justify a notable performance drop in return.

> > What did you think of making the message number a uint64, removing
> > wraparound handling, and retaining SISeg.msgnumLock for 32-bit only?  You
> > could isolate the variant logic in READ_MSGNUM and WRITE_MSGNUM macros.
>
> Well, what you really need to know is whether the platform has 8-byte
> atomic stores, which doesn't seem trivial to figure out, plus you need
> a memory fence.  If that's the only method of fixing this problem we
> can agree on, I'm willing to work on it, but an
> architecture-independent fix would be nicer.

Fair enough.

Thanks,
nm

--
Noah Misch                    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] sinval synchronization considered harmful
On Wed, Jul 27, 2011 at 12:55 PM, Noah Misch wrote:
> This approach would work if a spinlock release constrained the global stores
> timeline.  It makes a weaker guarantee: all stores preceding the lock release
> in program order will precede it globally.  Consequently, no processor will
> observe the spinlock to be available without also observing all stores made
> by the last holding processor before that processor released it.  That
> guarantee is not enough for this sequence of events:
>
> Backend 0:
>   add a message for table "a"
>   store 5 => maxMsgNum
>   store true => timeToPayAttention
>   LWLockRelease(SInvalWriteLock)
>
> Backend 2:
>   LOCK TABLE a;
>   load timeToPayAttention => true
>
> Backend 1:
>   add a message for table "b"
>   store 6 => maxMsgNum
>   SpinLockRelease(&vsegP->msgnumLock);
>   store true => timeToPayAttention
>
> Backend 2:
>   store false => timeToPayAttention
>   LWLockAcquire(SInvalReadLock, LW_SHARED)
>   load maxMsgNum => 5  [!]

Eh, how can this possibly happen?  You have to hold msgNumLock to set maxMsgNum and msgNumLock to read maxMsgNum.  If that's not enough to guarantee that we never read a stale value, what is?

> I think a benchmark is in order, something like 900 idle connections and 80
> connections running small transactions that create a few temporary tables.
> If that shows no statistically significant regression, then we're probably
> fine here.  I'm not sure what result to expect, honestly.

That's setting the bar pretty high.  I don't mind doing the experiment, but I'm not sure that's the case we should be optimizing for.

> What did you think of making the message number a uint64, removing wraparound
> handling, and retaining SISeg.msgnumLock for 32-bit only?  You could isolate
> the variant logic in READ_MSGNUM and WRITE_MSGNUM macros.

Well, what you really need to know is whether the platform has 8-byte atomic stores, which doesn't seem trivial to figure out, plus you need a memory fence.  If that's the only method of fixing this problem we can agree on, I'm willing to work on it, but an architecture-independent fix would be nicer.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
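[Editorial note] The combination Robert describes, an 8-byte atomic store plus a memory fence, is exactly what C11's &lt;stdatomic.h&gt; later standardized; it was not an option for PostgreSQL when this thread was written, so this is purely an illustrative sketch, not anything from the patches under discussion:

```c
#include <stdatomic.h>
#include <stdint.h>

/* A 64-bit message counter; uint64 makes wraparound a non-issue. */
static _Atomic uint64_t maxMsgNum;

/* Writer: the release store orders all earlier stores (the queue
 * entries themselves) before the new counter value becomes visible. */
static void write_msgnum(uint64_t n)
{
    atomic_store_explicit(&maxMsgNum, n, memory_order_release);
}

/* Reader: the acquire load pairs with the release store, so a reader
 * that observes n also observes the messages written before it. */
static uint64_t read_msgnum(void)
{
    return atomic_load_explicit(&maxMsgNum, memory_order_acquire);
}
```

On platforms without native 8-byte atomics the compiler supplies a lock-based fallback, which is the architecture-independent behavior Robert was hoping for.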
Re: [HACKERS] sinval synchronization considered harmful
On Tue, Jul 26, 2011 at 09:57:10PM -0400, Robert Haas wrote:
> On Tue, Jul 26, 2011 at 4:38 PM, Noah Misch wrote:
> > No new ideas come to mind, here.
>
> OK, I have a new idea.  :-)
>
> 1. Add a new flag to each procState called something like
> "timeToPayAttention".
> 2. Each call to SIInsertDataEntries() iterates over all the ProcStates
> whose index is < lastBackend and sets stateP->timeToPayAttention =
> TRUE for each.
> 3. At the beginning of SIGetDataEntries(), we do an unlocked if
> (!stateP->timeToPayAttention) return 0.
> 4. Immediately following that if statement and before acquiring any
> locks, we set stateP->timeToPayAttention = FALSE.
>
> The LWLockRelease() in SIInsertDataEntries() will be sufficient to force
> all of the stateP->timeToPayAttention writes out to main memory, and
> the read side is OK because we either just took a lock (which acted as
> a fence) or else there's a race anyway.  But unlike my previous
> proposal, it doesn't involve *comparing* anything.  We needn't worry
> about whether we could read two different values that are through
> great misfortune out of sync because we're only reading one value.
>
> If, by chance, the value is set to true just after we set it to false,
> that won't mess us up either: we'll still read all the messages after
> acquiring the lock.

This approach would work if a spinlock release constrained the global stores timeline.  It makes a weaker guarantee: all stores preceding the lock release in program order will precede it globally.  Consequently, no processor will observe the spinlock to be available without also observing all stores made by the last holding processor before that processor released it.  That guarantee is not enough for this sequence of events:

Backend 0:
  add a message for table "a"
  store 5 => maxMsgNum
  store true => timeToPayAttention
  LWLockRelease(SInvalWriteLock)

Backend 2:
  LOCK TABLE a;
  load timeToPayAttention => true

Backend 1:
  add a message for table "b"
  store 6 => maxMsgNum
  SpinLockRelease(&vsegP->msgnumLock);
  store true => timeToPayAttention

Backend 2:
  store false => timeToPayAttention
  LWLockAcquire(SInvalReadLock, LW_SHARED)
  load maxMsgNum => 5  [!]

Backend 1:
  LWLockRelease(SInvalWriteLock);

Backend 2:
  LOCK TABLE b;
  load timeToPayAttention => false

The "SpinLockRelease(&vsegP->msgnumLock)" guarantees that any process observing the backend 1 store of timeToPayAttention will also observe maxMsgNum = 6.  However, nothing constrains which timeToPayAttention store will "win" in main memory.  Backend 1 can observe neither update from backend 2, yet still have its store appear later than the backend 1 stores in the global timeline.

> Now, there's some downside to this approach - it involves doing O(N)
> work for each invalidation message we receive.  But as long as it's
> only O(N) stores and not O(N) lock acquisitions and releases, I think
> that might be OK.

I think a benchmark is in order, something like 900 idle connections and 80 connections running small transactions that create a few temporary tables.  If that shows no statistically significant regression, then we're probably fine here.  I'm not sure what result to expect, honestly.

What did you think of making the message number a uint64, removing wraparound handling, and retaining SISeg.msgnumLock for 32-bit only?  You could isolate the variant logic in READ_MSGNUM and WRITE_MSGNUM macros.

--
Noah Misch                    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
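[Editorial note] Noah's uint64 suggestion could look roughly like the sketch below. The READ_MSGNUM/WRITE_MSGNUM names come from the thread; the struct, the configure-style test, and everything else here are invented for illustration, and a 32-bit build would keep msgnumLock around the access because an 8-byte store can tear there:

```c
#include <stdint.h>

/* A 64-bit message number never wraps in practice (58 million years at
 * 10k messages/sec), so MSGNUMWRAPAROUND handling disappears. */
typedef struct {
    volatile uint64_t maxMsgNum;
    /* slock_t msgnumLock;   retained for 32-bit platforms only */
} SISegSketch;

#if UINTPTR_MAX == UINT64_MAX   /* crude stand-in for a real configure test */
/* Aligned 8-byte loads and stores are atomic on mainstream 64-bit CPUs,
 * so no lock is needed; ordering still needs a fence at the call sites. */
#define READ_MSGNUM(seg)      ((seg)->maxMsgNum)
#define WRITE_MSGNUM(seg, n)  ((seg)->maxMsgNum = (n))
#else
/* 32-bit fallback: wrap the access in SpinLockAcquire/SpinLockRelease
 * on seg->msgnumLock; elided in this sketch. */
#error "32-bit fallback elided in this sketch"
#endif
```

Isolating the two variants behind the macros keeps the callers identical on both kinds of platform, which is the point of Noah's suggestion.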
Re: [HACKERS] sinval synchronization considered harmful
On Wed, Jul 27, 2011 at 12:34 PM, Tom Lane wrote:
> Robert Haas writes:
>> On Tue, Jul 26, 2011 at 9:57 PM, Robert Haas wrote:
>>> 1. Add a new flag to each procState called something like
>>> "timeToPayAttention".
>>> 2. Each call to SIInsertDataEntries() iterates over all the ProcStates
>>> whose index is < lastBackend and sets stateP->timeToPayAttention =
>>> TRUE for each.
>>> 3. At the beginning of SIGetDataEntries(), we do an unlocked if
>>> (!stateP->timeToPayAttention) return 0.
>>> 4. Immediately following that if statement and before acquiring any
>>> locks, we set stateP->timeToPayAttention = FALSE.
>
>> There turned out to be a little bit of further subtlety to this, but
>> it seems to work.  Patch attached.
>
> And?
>
> It didn't sound to me like this could possibly be a performance win,
> but I await some numbers ...

On master, with the patch, scale factor 100, pgbench -S -c $CLIENTS -j $CLIENTS -T 300 results for different client counts, on a 32-core machine with 128GB RAM, shared_buffers = 8GB:

01 tps = .280459 (including connections establishing)
01 tps = 4438.365576 (including connections establishing)
01 tps = 4423.718466 (including connections establishing)
08 tps = 33187.827872 (including connections establishing)
08 tps = 33288.247330 (including connections establishing)
08 tps = 33304.307835 (including connections establishing)
32 tps = 178876.350559 (including connections establishing)
32 tps = 177293.082295 (including connections establishing)
32 tps = 175577.058885 (including connections establishing)
80 tps = 159202.449993 (including connections establishing)
80 tps = 151541.717088 (including connections establishing)
80 tps = 150454.658132 (including connections establishing)

Without the patch:

01 tps = 4447.660101 (including connections establishing)
01 tps = 4440.830713 (including connections establishing)
01 tps = 4411.610348 (including connections establishing)
08 tps = 33135.773476 (including connections establishing)
08 tps = 33365.387051 (including connections establishing)
08 tps = 33364.859705 (including connections establishing)
32 tps = 175834.280471 (including connections establishing)
32 tps = 176713.118271 (including connections establishing)
32 tps = 176830.687087 (including connections establishing)
80 tps = 135211.036587 (including connections establishing)
80 tps = 130642.264016 (including connections establishing)
80 tps = 133621.549513 (including connections establishing)

I'm fairly certain the results will be even more dramatic with the lazy-vxid patch applied, since master still has to fight with the lock manager on this workload.  I haven't tested that yet, but there's not much reason to suppose that the results will be dramatically different from any of the other methods of reducing the sinval overhead that I've benchmarked, many of which I've already posted about.  See, for example, the OP on this thread.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] sinval synchronization considered harmful
Robert Haas wrote:
> There turned out to be a little bit of further subtlety to this,
> but it seems to work.  Patch attached.

Stylistic question: Why is stateP->hasMessages set to false in one place and FALSE and TRUE in others?  It seems like it would be less confusing to be consistent.  A quick count of the usage of both forms in the PostgreSQL source code shows that lowercase is used about 10 times more often:

kgrittn@kgrittn-desktop:~/git/postgresql/kgrittn$ find -name '*.h' -or -name '*.c' | egrep -v '/tmp_check/' | xargs cat | egrep -c '\b(FALSE|TRUE)\b'
1670
kgrittn@kgrittn-desktop:~/git/postgresql/kgrittn$ find -name '*.h' -or -name '*.c' | egrep -v '/tmp_check/' | xargs cat | egrep -c '\b(false|true)\b'
17349

-Kevin
Re: [HACKERS] sinval synchronization considered harmful
Robert Haas writes:
> On Tue, Jul 26, 2011 at 9:57 PM, Robert Haas wrote:
>> 1. Add a new flag to each procState called something like
>> "timeToPayAttention".
>> 2. Each call to SIInsertDataEntries() iterates over all the ProcStates
>> whose index is < lastBackend and sets stateP->timeToPayAttention =
>> TRUE for each.
>> 3. At the beginning of SIGetDataEntries(), we do an unlocked if
>> (!stateP->timeToPayAttention) return 0.
>> 4. Immediately following that if statement and before acquiring any
>> locks, we set stateP->timeToPayAttention = FALSE.

> There turned out to be a little bit of further subtlety to this, but
> it seems to work.  Patch attached.

And?

It didn't sound to me like this could possibly be a performance win, but I await some numbers ...

			regards, tom lane
Re: [HACKERS] sinval synchronization considered harmful
On Tue, Jul 26, 2011 at 9:57 PM, Robert Haas wrote:
> On Tue, Jul 26, 2011 at 4:38 PM, Noah Misch wrote:
>> No new ideas come to mind, here.
>
> OK, I have a new idea.  :-)
>
> 1. Add a new flag to each procState called something like
> "timeToPayAttention".
> 2. Each call to SIInsertDataEntries() iterates over all the ProcStates
> whose index is < lastBackend and sets stateP->timeToPayAttention =
> TRUE for each.
> 3. At the beginning of SIGetDataEntries(), we do an unlocked if
> (!stateP->timeToPayAttention) return 0.
> 4. Immediately following that if statement and before acquiring any
> locks, we set stateP->timeToPayAttention = FALSE.
>
> The LWLockRelease() in SIInsertDataEntries() will be sufficient to force
> all of the stateP->timeToPayAttention writes out to main memory, and
> the read side is OK because we either just took a lock (which acted as
> a fence) or else there's a race anyway.  But unlike my previous
> proposal, it doesn't involve *comparing* anything.  We needn't worry
> about whether we could read two different values that are through
> great misfortune out of sync because we're only reading one value.
>
> If, by chance, the value is set to true just after we set it to false,
> that won't mess us up either: we'll still read all the messages after
> acquiring the lock.

There turned out to be a little bit of further subtlety to this, but it seems to work.  Patch attached.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment: sinval-hasmessages.patch (Description: Binary data)
Re: [HACKERS] sinval synchronization considered harmful
On Tue, Jul 26, 2011 at 06:04:16PM -0400, Tom Lane wrote:
> Noah Misch writes:
> > On Tue, Jul 26, 2011 at 05:05:15PM -0400, Tom Lane wrote:
> >> Dirty cache line, maybe not, but what if the assembly code commands the
> >> CPU to load those variables into CPU registers before doing the
> >> comparison?  If they're loaded with maxMsgNum coming in last (or at
> >> least after resetState), I think you can have the problem without any
> >> assumptions about cache line behavior at all.  You just need the process
> >> to lose the CPU at the right time.
>
> > True.  If the compiler places the resetState load first, you could hit the
> > anomaly by "merely" setting a breakpoint on the next instruction, waiting
> > for exactly MSGNUMWRAPAROUND messages to enqueue, and letting the backend
> > continue.  I think, though, we should either plug that _and_ the cache
> > incoherency case or worry about neither.
>
> How do you figure that?  The poor-assembly-code-order risk is both a lot
> easier to fix and a lot higher probability.  Admittedly, it's still way
> way down there, but you only need a precisely-timed sleep, not a
> precisely-timed sleep *and* a cache line that somehow remained stale.

I think both probabilities are too low to usefully distinguish.  An sinval wraparound takes a long time even in a deliberate test setup: almost 30 hours @ 10k messages/sec.  To get a backend to sleep that long, you'll probably need something like SIGSTOP or a debugger attach.  The sleep has to fall within the space of no more than a few instructions.  Then, you'd need to release the process at the exact moment for it to observe wrapped equality.  In other words, you get one split-millisecond opportunity every 30 hours of process sleep time.  If your backends don't have multi-hour sleeps, it can't ever happen.

Even so, all the better if we settle on an approach that has neither hazard.
--
Noah Misch                    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
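[Editorial note] Noah's 30-hour figure is easy to check arithmetically, assuming MSGNUMWRAPAROUND is INT_MAX/2, its value in sinvaladt.c of this era:

```c
#include <limits.h>

/* Hours of continuous sinval traffic needed to wrap the message counter,
 * at a given sustained message rate. */
static double hours_to_wraparound(double msgs_per_sec)
{
    const double msgnum_wraparound = INT_MAX / 2;   /* about 1.07e9 messages */
    return msgnum_wraparound / msgs_per_sec / 3600.0;
}
```

At 10,000 messages/sec this comes out to roughly 29.8 hours, matching the "almost 30 hours" above.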
Re: [HACKERS] sinval synchronization considered harmful
On Tue, Jul 26, 2011 at 4:38 PM, Noah Misch wrote:
> No new ideas come to mind, here.

OK, I have a new idea.  :-)

1. Add a new flag to each procState called something like "timeToPayAttention".
2. Each call to SIInsertDataEntries() iterates over all the ProcStates whose index is < lastBackend and sets stateP->timeToPayAttention = TRUE for each.
3. At the beginning of SIGetDataEntries(), we do an unlocked if (!stateP->timeToPayAttention) return 0.
4. Immediately following that if statement and before acquiring any locks, we set stateP->timeToPayAttention = FALSE.

The LWLockRelease() in SIInsertDataEntries() will be sufficient to force all of the stateP->timeToPayAttention writes out to main memory, and the read side is OK because we either just took a lock (which acted as a fence) or else there's a race anyway.  But unlike my previous proposal, it doesn't involve *comparing* anything.  We needn't worry about whether we could read two different values that are through great misfortune out of sync because we're only reading one value.

If, by chance, the value is set to true just after we set it to false, that won't mess us up either: we'll still read all the messages after acquiring the lock.

Now, there's some downside to this approach - it involves doing O(N) work for each invalidation message we receive.  But as long as it's only O(N) stores and not O(N) lock acquisitions and releases, I think that might be OK.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
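[Editorial note] A single-process toy model of the scheme in the post above (the flag was called timeToPayAttention in this proposal and hasMessages in the eventual patch); all locking and shared memory are elided, and the names beyond those two are invented for illustration:

```c
#include <stdbool.h>

#define MAX_BACKENDS 8

typedef struct { volatile bool hasMessages; } ProcStateSketch;

static ProcStateSketch procState[MAX_BACKENDS];
static int lastBackend = MAX_BACKENDS;

/* Writer side (step 2): O(N) plain stores, one per backend. */
static void si_insert_messages(void)
{
    for (int i = 0; i < lastBackend; i++)
        procState[i].hasMessages = true;
}

/* Reader side (steps 3-4): unlocked early exit on our own flag.  The
 * flag is cleared *before* reading messages, so a set that races with
 * us is never lost; at worst we recheck the queue on the next call. */
static int si_get_messages(int self)
{
    if (!procState[self].hasMessages)
        return 0;               /* common case: no lock taken at all */
    procState[self].hasMessages = false;
    /* ... the real code would take SInvalReadLock and drain the queue ... */
    return 1;
}
```

The benchmark discussion later in the thread is about exactly this trade: the writer pays N stores so that the far more frequent reader calls can usually return without touching any lock.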
Re: [HACKERS] sinval synchronization considered harmful
On Tue, Jul 26, 2011 at 3:25 PM, Simon Riggs wrote: > On Tue, Jul 26, 2011 at 8:11 PM, Robert Haas wrote: >> It wouldn't, although it might be bad in the case where there are lots >> of temp tables being created and dropped. > > Do temp tables cause relcache invalidations? > > That seems like something we'd want to change in itself. I agree. Unfortunately, I think it's a non-trivial fix. I've also been wondering if we could avoid taking an AccessExclusiveLock on a newly created (temporary?) table. It seems like no one should be able to see it until commit, at which point we'd be releasing the lock anyway. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] sinval synchronization considered harmful
On Tue, Jul 26, 2011 at 3:34 PM, Simon Riggs wrote: > You might be right, but I think we have little knowledge of how some > memory barrier code you haven't written yet effects performance on > various architectures. > > A spinlock per backend would cache very nicely, now you mention it. So > my money would be on the multiple copies. Maybe so, but you can see from the numbers in my OP that the results still leave something to be desired. > It's not completely clear to me that updating N copies would be more > expensive. Accessing N low contention copies rather than 1 > high-contention value might actually be a win. Yeah, I haven't tested that approach. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] sinval synchronization considered harmful
Noah Misch writes: > On Tue, Jul 26, 2011 at 05:05:15PM -0400, Tom Lane wrote: >> Dirty cache line, maybe not, but what if the assembly code commands the >> CPU to load those variables into CPU registers before doing the >> comparison? If they're loaded with maxMsgNum coming in last (or at >> least after resetState), I think you can have the problem without any >> assumptions about cache line behavior at all. You just need the process >> to lose the CPU at the right time. > True. If the compiler places the resetState load first, you could hit the > anomaly by "merely" setting a breakpoint on the next instruction, waiting for > exactly MSGNUMWRAPAROUND messages to enqueue, and letting the backend > continue. > I think, though, we should either plug that _and_ the cache incoherency case > or > worry about neither. How do you figure that? The poor-assembly-code-order risk is both a lot easier to fix and a lot higher probability. Admittedly, it's still way way down there, but you only need a precisely-timed sleep, not a precisely-timed sleep *and* a cache line that somehow remained stale. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] sinval synchronization considered harmful
On Tue, Jul 26, 2011 at 05:05:15PM -0400, Tom Lane wrote: > Noah Misch writes: > > That's the theoretical risk I wished to illustrate. Though this appears > > possible on an abstract x86_64 system, I think it's unrealistic to suppose > > that > > a dirty cache line could persist *throughout* the issuance of more than 10^9 > > invalidation messages on a concrete implementation. > > Dirty cache line, maybe not, but what if the assembly code commands the > CPU to load those variables into CPU registers before doing the > comparison? If they're loaded with maxMsgNum coming in last (or at > least after resetState), I think you can have the problem without any > assumptions about cache line behavior at all. You just need the process > to lose the CPU at the right time. True. If the compiler places the resetState load first, you could hit the anomaly by "merely" setting a breakpoint on the next instruction, waiting for exactly MSGNUMWRAPAROUND messages to enqueue, and letting the backend continue. I think, though, we should either plug that _and_ the cache incoherency case or worry about neither. -- Noah Mischhttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] sinval synchronization considered harmful
Noah Misch writes: > On Tue, Jul 26, 2011 at 03:40:32PM -0400, Tom Lane wrote: >> After some further reflection I believe this patch actually is pretty >> safe, although Noah's explanation of why seems a bit confused. > Here's the way it can fail: > 1. Backend enters SIGetDataEntries() with main memory bearing > stateP->resetState > = false, stateP->nextMsgNum = 500, segP->maxMsgNum = 505. The CPU has those > latest stateP values in cache, but segP->maxMsgNum is *not* in cache. > 2. Backend stalls for . Meanwhile, other backends issue > MSGNUMWRAPAROUND - 5 invalidation messages. Main memory bears > stateP->resetState = true, stateP->nextMsgNum = 500 - MSGNUMWRAPAROUND, > segP->maxMsgNum = 500. > 3. Backend wakes up, uses its cached stateP values and reads segP->maxMsgNum = > 500 from main memory. The patch's test finds no need to reset or process > invalidation messages. [ squint... ] Hmm, you're right. The case where this would break things is if (some of) the five unprocessed messages relate to some object we've just locked. But the initial state you describe would be valid right after obtaining such a lock. > That's the theoretical risk I wished to illustrate. Though this appears > possible on an abstract x86_64 system, I think it's unrealistic to suppose > that > a dirty cache line could persist *throughout* the issuance of more than 10^9 > invalidation messages on a concrete implementation. Dirty cache line, maybe not, but what if the assembly code commands the CPU to load those variables into CPU registers before doing the comparison? If they're loaded with maxMsgNum coming in last (or at least after resetState), I think you can have the problem without any assumptions about cache line behavior at all. You just need the process to lose the CPU at the right time. If we marked the pointers volatile, we could probably ensure that the assembly code tests resetState last, but that's not sufficient to avoid the stale-cache-line risk. 
regards, tom lane
Re: [HACKERS] sinval synchronization considered harmful
On Tue, Jul 26, 2011 at 01:36:26PM -0400, Robert Haas wrote: > On Mon, Jul 25, 2011 at 6:24 PM, Noah Misch wrote: > > On Fri, Jul 22, 2011 at 03:54:03PM -0400, Robert Haas wrote: > >> On Fri, Jul 22, 2011 at 3:28 PM, Noah Misch wrote: > >> > This is attractive, and I don't see any problems with it. (In theory, > >> > you could > >> > hit a case where the load of resetState gives an ancient "false" just as > >> > the > >> > counters wrap to match. Given that the wrap interval is 100x as > >> > long as the > >> > reset interval, I'm not worried about problems on actual silicon.) > >> > >> It's actually 262,144 times as long - see MSGNUMWRAPAROUND. > > > > Ah, so it is. > > > >> It would be pretty easy to eliminate even the theoretical possibility > >> of a race by getting rid of resetState altogether and using nextMsgNum > >> = -1 to mean that. Maybe I should go ahead and do that. > > > > Seems like a nice simplification. > > On further reflection, I don't see that this helps: it just moves the > problem around. Alas, yes. > With resetState as a separate variable, nextMsgNum is > never changed by anyone other than the owner, so we can never have a > stale load. It's also changed when SICleanupQueue() decides to wrap everyone. This could probably be eliminated by using uint32 and letting overflow take care of wrap implicitly, but ... > But if we overload nextMsgNum to also indicate whether > our state has been reset, then there's a race between when we load > nextMsgNum and when we load maxMsgNum (instead of code I posted > previously, which has a race between when we load resetState and when > we load maxMsgNum). Now, as you say, it seems really, really > difficult to hit that in practice, but I don't see a way of getting > rid of the theoretical possibility without either (1) a spinlock or > (2) a fence. (Of course, on x86, the fence could be optimized down to > a compiler barrier.) No new ideas come to mind, here. 
We could migrate back toward your original proposal of making the counter a non-wrapping uint64; Florian described some nice techniques for doing atomic 64-bit reads on x86: http://archives.postgresql.org/message-id/7c94c386-122f-4918-8624-a14352e8d...@phlo.org I'm not personally acquainted with those approaches, but they sound promising. Since the overlap between 32-bit installations and performance-sensitive installations is all but gone, we could even just use a spinlock under 32-bit. > I guess the question is "should we worry about that?". I'd lean toward "no". I share Tom's unease about writing off a race condition as being too improbable, but this is quite exceptionally improbable. On the other hand, some of the fixes don't look too invasive. -- Noah Misch http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
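One generic way to read a 64-bit counter without 64-bit atomics (not necessarily one of the techniques in Florian's message, which is only linked above) is to keep it as two 32-bit halves and retry the read if the high word changed underneath us. All names here are hypothetical, and volatile stands in for real load/store ordering:

```c
#include <stdint.h>

/* Monotonic 64-bit counter kept as two 32-bit halves, for platforms
 * lacking atomic 64-bit loads.  Single writer; many lock-free readers. */
typedef struct Counter64
{
    volatile uint32_t hi;
    volatile uint32_t lo;
} Counter64;

/* Writer: bump 'lo' and carry into 'hi' on 32-bit overflow. */
static void
counter_advance(Counter64 *c, uint32_t n)
{
    uint32_t old = c->lo;

    c->lo = old + n;
    if (c->lo < old)            /* unsigned wrap detected */
        c->hi++;
}

/* Reader: if the high word changed while we sampled the low word, a
 * carry may have raced past us, so retry until we get a stable pair. */
static uint64_t
counter_read(const Counter64 *c)
{
    uint32_t h1, l, h2;

    do
    {
        h1 = c->hi;
        l = c->lo;
        h2 = c->hi;
    } while (h1 != h2);

    return ((uint64_t) h1 << 32) | l;
}
```

Since a carry happens only once per 2^32 messages, the retry loop almost never iterates; the common-case cost is three 32-bit loads.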
Re: [HACKERS] sinval synchronization considered harmful
On Tue, Jul 26, 2011 at 03:40:32PM -0400, Tom Lane wrote: > Noah Misch writes: > > On Thu, Jul 21, 2011 at 11:37:27PM -0400, Robert Haas wrote: > >> I think I have a simpler idea, though: > >> before acquiring any locks, just have SIGetDataEntries() do this: > >> > >> + if (stateP->nextMsgNum == segP->maxMsgNum && !stateP->resetState) > >> + return 0; > >> > >> Patch (with comment explaining why I think this is OK) attached. If > >> the message numbers happen to be equal only because the counter has > >> wrapped, then stateP->resetState will be true, so we'll still realize > >> we need to do some work. > > > This is attractive, and I don't see any problems with it. (In theory, you > > could > > hit a case where the load of resetState gives an ancient "false" just as the > > counters wrap to match. Given that the wrap interval is 100x as long > > as the > > reset interval, I'm not worried about problems on actual silicon.) > > > +1 for doing this and moving on. > > After some further reflection I believe this patch actually is pretty > safe, although Noah's explanation of why seems a bit confused. The key > points are that (1) each of the reads is atomic, and (2) we should not > be able to see a value that is older than our last acquisition/release > of a shared memory lock. These being granted, we will make a decision > that is correct, or at least was correct as of some time after that last > lock action. As stated in the patch comments, we are not required to > promptly assimilate sinval actions on objects we don't hold any lock on, > so this seems sufficient. In essence, we are relying on an assumed > prior visit to the lock manager to provide memory access synchronization > for the sinval no-work-to-do test. > > The case Noah speculates about above amounts to supposing that this > reasoning fails to hold for the resetState value, and I don't see why > that would be true. 
> Even if the above scenario did manage to happen, > it would not mean that we missed the reset, just that we hadn't noticed > it *yet*. And by hypothesis, none of the as-yet-not-seen catalog > updates are a problem for us. Here's the way it can fail: 1. Backend enters SIGetDataEntries() with main memory bearing stateP->resetState = false, stateP->nextMsgNum = 500, segP->maxMsgNum = 505. The CPU has those latest stateP values in cache, but segP->maxMsgNum is *not* in cache. 2. Backend stalls for a long time. Meanwhile, other backends issue MSGNUMWRAPAROUND - 5 invalidation messages. Main memory bears stateP->resetState = true, stateP->nextMsgNum = 500 - MSGNUMWRAPAROUND, segP->maxMsgNum = 500. 3. Backend wakes up, uses its cached stateP values and reads segP->maxMsgNum = 500 from main memory. The patch's test finds no need to reset or process invalidation messages. That's the theoretical risk I wished to illustrate. Though this appears possible on an abstract x86_64 system, I think it's unrealistic to suppose that a dirty cache line could persist *throughout* the issuance of more than 10^9 invalidation messages on a concrete implementation. A way to think about the problem is that our read of segP->maxMsgNum is too new. If we had used a snapshot of values as of the most recent lock acquisition/release, there would be no problem. It's the mix of a new-enough stateP with an all-too-new segP that yields the anomaly. -- Noah Misch http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
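Noah's three steps can be replayed directly. This single-threaded model (names mirror sinvaladt.c; the struct is a simplified stand-in) represents the stale cache as a saved copy of the old stateP values, and shows the patch's unlocked test wrongly reporting "nothing to do":

```c
#include <stdbool.h>

#define MSGNUMWRAPAROUND (4096 * 262144)   /* 2^30, as in sinvaladt.c */

typedef struct ProcState
{
    int  nextMsgNum;
    bool resetState;
} ProcState;

/* The unlocked fast-path test from the patch under discussion. */
static bool
fastpath_says_no_work(const ProcState *stateP, int maxMsgNum)
{
    return stateP->nextMsgNum == maxMsgNum && !stateP->resetState;
}

/* Replay the scenario: returns true if the backend with stale cached
 * stateP values would wrongly skip its reset. */
static bool
scenario_misses_reset(void)
{
    ProcState stale = { 500, false };   /* step 1: cached before the stall */

    /* step 2: MSGNUMWRAPAROUND - 5 messages are issued while we sleep,
     * then SICleanupQueue wraps the counters down by MSGNUMWRAPAROUND.
     * Shared memory now has resetState = true and nextMsgNum wrapped,
     * but we never reload stateP -- only maxMsgNum is read fresh. */
    int fresh_maxMsgNum = 505 + (MSGNUMWRAPAROUND - 5) - MSGNUMWRAPAROUND;

    /* step 3: stale stateP + fresh maxMsgNum == 500 -> "nothing to do". */
    return fastpath_says_no_work(&stale, fresh_maxMsgNum);
}
```

The arithmetic makes the "too new" observation concrete: the stale nextMsgNum of 500 collides with a freshly read maxMsgNum of 500 only because exactly one full wraparound elapsed in between.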
Re: [HACKERS] sinval synchronization considered harmful
Noah Misch writes: > On Thu, Jul 21, 2011 at 11:37:27PM -0400, Robert Haas wrote: >> I think I have a simpler idea, though: >> before acquiring any locks, just have SIGetDataEntries() do this: >> >> + if (stateP->nextMsgNum == segP->maxMsgNum && !stateP->resetState) >> + return 0; >> >> Patch (with comment explaining why I think this is OK) attached. If >> the message numbers happen to be equal only because the counter has >> wrapped, then stateP->resetState will be true, so we'll still realize >> we need to do some work. > This is attractive, and I don't see any problems with it. (In theory, you > could > hit a case where the load of resetState gives an ancient "false" just as the > counters wrap to match. Given that the wrap interval is 100x as long as > the > reset interval, I'm not worried about problems on actual silicon.) > +1 for doing this and moving on. After some further reflection I believe this patch actually is pretty safe, although Noah's explanation of why seems a bit confused. The key points are that (1) each of the reads is atomic, and (2) we should not be able to see a value that is older than our last acquisition/release of a shared memory lock. These being granted, we will make a decision that is correct, or at least was correct as of some time after that last lock action. As stated in the patch comments, we are not required to promptly assimilate sinval actions on objects we don't hold any lock on, so this seems sufficient. In essence, we are relying on an assumed prior visit to the lock manager to provide memory access synchronization for the sinval no-work-to-do test. The case Noah speculates about above amounts to supposing that this reasoning fails to hold for the resetState value, and I don't see why that would be true. Even if the above scenario did manage to happen, it would not mean that we missed the reset, just that we hadn't noticed it *yet*. And by hypothesis, none of the as-yet-not-seen catalog updates are a problem for us. 
BTW, the patch really ought to add something to the comment around line 100. regards, tom lane
Re: [HACKERS] sinval synchronization considered harmful
On Tue, Jul 26, 2011 at 8:11 PM, Robert Haas wrote: * Can we partition the sinval lock, so we have multiple copies? That increases the task for those who trigger an invalidation, but will relieve the pressure for most readers. >>> >>> Not sure there's a way to meaningfully partition the queue. In any >>> case, I think the problem being dealt with here is how to update the >>> read heads of the queue, not its contents. >> >> I agree there's no meaningful way to partition the queue, but we can >> store the information in more than one place to reduce the contention >> of people requesting it. > > I thought about that. Basically, that saves you a factor of N in > contention on the read side (where N is the number of copies) and > costs you a factor of N on the write side (you have to update N > copies, taking a spinlock or lwlock for each). In the limit, you > could do one copy of the counter per backend. > > I think, though, that a lock-free implementation using memory barriers > is going to end up being the only real solution. We might possibly > convince ourselves that we're OK with increasing the cost of > SIInsertDataEntries(), but any solution that involves replication the > data is still based on doing at least some locking. And I am pretty > well convinced that even one spinlock acquisition in > SIInsertDataEntries() is too many. You might be right, but I think we have little knowledge of how some memory barrier code you haven't written yet effects performance on various architectures. A spinlock per backend would cache very nicely, now you mention it. So my money would be on the multiple copies. It's not completely clear to me that updating N copies would be more expensive. Accessing N low contention copies rather than 1 high-contention value might actually be a win. 
-- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] sinval synchronization considered harmful
On Tue, Jul 26, 2011 at 8:11 PM, Robert Haas wrote: > It wouldn't, although it might be bad in the case where there are lots > of temp tables being created and dropped. Do temp tables cause relcache invalidations? That seems like something we'd want to change in itself. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] sinval synchronization considered harmful
Simon Riggs writes: > On Tue, Jul 26, 2011 at 7:24 PM, Alvaro Herrera > wrote: >> Signals are already in use for special cases (queue is full), and I >> think going through the kernel to achieve much more will lower >> performance significantly. > If there are no invalidations, there would be no signals. How would > zero signals decrease performance? But if there *is* an invalidation (which is not a negligible case), it'd get significantly slower. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] sinval synchronization considered harmful
On Tue, Jul 26, 2011 at 2:56 PM, Simon Riggs wrote: > On Tue, Jul 26, 2011 at 7:24 PM, Alvaro Herrera > wrote: >> Excerpts from Simon Riggs's message of mar jul 26 14:11:19 -0400 2011: >> >>> Let me ask a few questions to stimulate a different solution >>> >>> * Can we do this using an active technique (e.g. signals) rather than >>> a passive one (reading a counter?) >> >> Signals are already in use for special cases (queue is full), and I >> think going through the kernel to achieve much more will lower >> performance significantly. > > If there are no invalidations, there would be no signals. How would > zero signals decrease performance? It wouldn't, although it might be bad in the case where there are lots of temp tables being created and dropped. I think the biggest problem with signals is that they don't provide any meaningful synchronization guarantees. When you send somebody a signal, you don't really know how long it's going to take for them to receive it. >>> * Can we partition the sinval lock, so we have multiple copies? That >>> increases the task for those who trigger an invalidation, but will >>> relieve the pressure for most readers. >> >> Not sure there's a way to meaningfully partition the queue. In any >> case, I think the problem being dealt with here is how to update the >> read heads of the queue, not its contents. > > I agree there's no meaningful way to partition the queue, but we can > store the information in more than one place to reduce the contention > of people requesting it. I thought about that. Basically, that saves you a factor of N in contention on the read side (where N is the number of copies) and costs you a factor of N on the write side (you have to update N copies, taking a spinlock or lwlock for each). In the limit, you could do one copy of the counter per backend. I think, though, that a lock-free implementation using memory barriers is going to end up being the only real solution. 
We might possibly convince ourselves that we're OK with increasing the cost of SIInsertDataEntries(), but any solution that involves replicating the data is still based on doing at least some locking. And I am pretty well convinced that even one spinlock acquisition in SIInsertDataEntries() is too many. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] sinval synchronization considered harmful
On Tue, Jul 26, 2011 at 7:24 PM, Alvaro Herrera wrote: > Excerpts from Simon Riggs's message of mar jul 26 14:11:19 -0400 2011: > >> Let me ask a few questions to stimulate a different solution >> >> * Can we do this using an active technique (e.g. signals) rather than >> a passive one (reading a counter?) > > Signals are already in use for special cases (queue is full), and I > think going through the kernel to achieve much more will lower > performance significantly. If there are no invalidations, there would be no signals. How would zero signals decrease performance? >> * Can we partition the sinval lock, so we have multiple copies? That >> increases the task for those who trigger an invalidation, but will >> relieve the pressure for most readers. > > Not sure there's a way to meaningfully partition the queue. In any > case, I think the problem being dealt with here is how to update the > read heads of the queue, not its contents. I agree there's no meaningful way to partition the queue, but we can store the information in more than one place to reduce the contention of people requesting it. Both those ideas are still on the table. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] sinval synchronization considered harmful
Excerpts from Simon Riggs's message of mar jul 26 14:11:19 -0400 2011: > Let me ask a few questions to stimulate a different solution > > * Can we do this using an active technique (e.g. signals) rather than > a passive one (reading a counter?) Signals are already in use for special cases (queue is full), and I think going through the kernel to achieve much more will lower performance significantly. > * Can we partition the sinval lock, so we have multiple copies? That > increases the task for those who trigger an invalidation, but will > relieve the pressure for most readers. Not sure there's a way to meaningfully partition the queue. In any case, I think the problem being dealt with here is how to update the read heads of the queue, not its contents. > * Can we put the sinval info in a different place? e.g. inside each > lock partition. I don't think you can relate sinval messages to particular locks, which would make this infeasible. I think this bit is directly related to the point above. > * Why do we have a different mechanism for cache invalidation > internally (sinval) to the one we offer externally (LISTEN/NOTIFY)? > Why don't we have just one? Performance. Not sure there are other reasons as well; but just this one is significant enough. -- Álvaro Herrera The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] sinval synchronization considered harmful
On Tue, Jul 26, 2011 at 6:36 PM, Robert Haas wrote: > Now, as you say, it seems really, really > difficult to hit that in practice, but I don't see a way of getting > rid of the theoretical possibility without either (1) a spinlock or > (2) a fence. (Of course, on x86, the fence could be optimized down to > a compiler barrier.) I guess the question is "should we worry about > that?". Perhaps the answer lies in a different direction altogether? Let me ask a few questions to stimulate a different solution * Can we do this using an active technique (e.g. signals) rather than a passive one (reading a counter?) * Can we partition the sinval lock, so we have multiple copies? That increases the task for those who trigger an invalidation, but will relieve the pressure for most readers. * Can we put the sinval info in a different place? e.g. inside each lock partition. * Why do we have a different mechanism for cache invalidation internally (sinval) to the one we offer externally (LISTEN/NOTIFY)? Why don't we have just one? -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] sinval synchronization considered harmful
Excerpts from Tom Lane's message of mar jul 26 13:43:08 -0400 2011: > Uh, yes. I've lost count of the number of times I've seen people hit > one-instruction-wide race condition windows, SIGSEGV crash based on > accessing only one byte past the valid data structure, etc etc. I think you need a better locking protocol on that counter of yours. -- Álvaro Herrera The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Re: [HACKERS] sinval synchronization considered harmful
Robert Haas writes: > On further reflection, I don't see that this helps: it just moves the > problem around. With resetState as a separate variable, nextMsgNum is > never changed by anyone other than the owner, so we can never have a > stale load. But if we overload nextMsgNum to also indicate whether > our state has been reset, then there's a race between when we load > nextMsgNum and when we load maxMsgNum (instead of code I posted > previously, which has a race between when we load resetState and when > we load maxMsgNum). Now, as you say, it seems really, really > difficult to hit that in practice, but I don't see a way of getting > rid of the theoretical possibility without either (1) a spinlock or > (2) a fence. (Of course, on x86, the fence could be optimized down to > a compiler barrier.) I guess the question is "should we worry about > that?". Uh, yes. I've lost count of the number of times I've seen people hit one-instruction-wide race condition windows, SIGSEGV crash based on accessing only one byte past the valid data structure, etc etc. The worst thing about this type of bug is that you can't reproduce the failure when you try; doesn't mean your users won't see it. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] sinval synchronization considered harmful
On Mon, Jul 25, 2011 at 6:24 PM, Noah Misch wrote: > On Fri, Jul 22, 2011 at 03:54:03PM -0400, Robert Haas wrote: >> On Fri, Jul 22, 2011 at 3:28 PM, Noah Misch wrote: >> > This is attractive, and I don't see any problems with it. (In theory, you >> > could >> > hit a case where the load of resetState gives an ancient "false" just as >> > the >> > counters wrap to match. Given that the wrap interval is 100x as long >> > as the >> > reset interval, I'm not worried about problems on actual silicon.) >> >> It's actually 262,144 times as long - see MSGNUMWRAPAROUND. > > Ah, so it is. > >> It would be pretty easy to eliminate even the theoretical possibility >> of a race by getting rid of resetState altogether and using nextMsgNum >> = -1 to mean that. Maybe I should go ahead and do that. > > Seems like a nice simplification. On further reflection, I don't see that this helps: it just moves the problem around. With resetState as a separate variable, nextMsgNum is never changed by anyone other than the owner, so we can never have a stale load. But if we overload nextMsgNum to also indicate whether our state has been reset, then there's a race between when we load nextMsgNum and when we load maxMsgNum (instead of code I posted previously, which has a race between when we load resetState and when we load maxMsgNum). Now, as you say, it seems really, really difficult to hit that in practice, but I don't see a way of getting rid of the theoretical possibility without either (1) a spinlock or (2) a fence. (Of course, on x86, the fence could be optimized down to a compiler barrier.) I guess the question is "should we worry about that?". -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
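The "(2) a fence" option could look something like the sketch below. This uses C11 atomics as a hypothetical spelling; PostgreSQL had no barrier abstraction at the time, so nothing here is tree code. The idea follows Tom's observation downthread that resetState must be tested last: load the message counters first, fence, then load resetState, so a reset flagged before the counters wrapped cannot be missed:

```c
#include <stdbool.h>
#include <stdatomic.h>

typedef struct ProcState
{
    _Atomic int  nextMsgNum;
    _Atomic bool resetState;
} ProcState;

/* Fast-path "no work to do" test.  The acquire fence keeps the
 * resetState load from being hoisted above the maxMsgNum load, so the
 * reset indication we see is at least as fresh as the counter we
 * compare (assuming the writer publishes resetState before adjusting
 * the counters at wraparound). */
static bool
no_work_to_do(ProcState *stateP, _Atomic int *maxMsgNum)
{
    int next = atomic_load_explicit(&stateP->nextMsgNum, memory_order_relaxed);
    int max = atomic_load_explicit(maxMsgNum, memory_order_relaxed);

    atomic_thread_fence(memory_order_acquire);   /* read barrier */

    bool reset = atomic_load_explicit(&stateP->resetState, memory_order_relaxed);

    return next == max && !reset;
}
```

On x86, as Robert notes, the acquire fence compiles down to a pure compiler barrier (loads are not reordered with loads), so the fast path costs nothing extra at runtime there.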
Re: [HACKERS] sinval synchronization considered harmful
On Fri, Jul 22, 2011 at 03:54:03PM -0400, Robert Haas wrote: > On Fri, Jul 22, 2011 at 3:28 PM, Noah Misch wrote: > > This is attractive, and I don't see any problems with it. (In theory, you > > could > > hit a case where the load of resetState gives an ancient "false" just as the > > counters wrap to match. Given that the wrap interval is 100x as long > > as the > > reset interval, I'm not worried about problems on actual silicon.) > > It's actually 262,144 times as long - see MSGNUMWRAPAROUND. Ah, so it is. > It would be pretty easy to eliminate even the theoretical possibility > of a race by getting rid of resetState altogether and using nextMsgNum > = -1 to mean that. Maybe I should go ahead and do that. Seems like a nice simplification. -- Noah Misch http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] sinval synchronization considered harmful
On Fri, Jul 22, 2011 at 3:28 PM, Noah Misch wrote: > On Thu, Jul 21, 2011 at 11:37:27PM -0400, Robert Haas wrote: >> I think I have a simpler idea, though: >> before acquiring any locks, just have SIGetDataEntries() do this: >> >> + if (stateP->nextMsgNum == segP->maxMsgNum && !stateP->resetState) >> + return 0; >> >> Patch (with comment explaining why I think this is OK) attached. If >> the message numbers happen to be equal only because the counter has >> wrapped, then stateP->resetState will be true, so we'll still realize >> we need to do some work. > > This is attractive, and I don't see any problems with it. (In theory, you > could > hit a case where the load of resetState gives an ancient "false" just as the > counters wrap to match. Given that the wrap interval is 100x as long as > the > reset interval, I'm not worried about problems on actual silicon.) It's actually 262,144 times as long - see MSGNUMWRAPAROUND. It would be pretty easy to eliminate even the theoretical possibility of a race by getting rid of resetState altogether and using nextMsgNum = -1 to mean that. Maybe I should go ahead and do that. > +1 for doing this and moving on. Yeah, I think I'll go ahead and commit something along these lines if no one objects. We can always fine-tune it more if needed (but it probably isn't). -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] sinval synchronization considered harmful
On Thu, Jul 21, 2011 at 11:37:27PM -0400, Robert Haas wrote: > I think I have a simpler idea, though: > before acquiring any locks, just have SIGetDataEntries() do this: > > + if (stateP->nextMsgNum == segP->maxMsgNum && !stateP->resetState) > + return 0; > > Patch (with comment explaining why I think this is OK) attached. If > the message numbers happen to be equal only because the counter has > wrapped, then stateP->resetState will be true, so we'll still realize > we need to do some work. This is attractive, and I don't see any problems with it. (In theory, you could hit a case where the load of resetState gives an ancient "false" just as the counters wrap to match. Given that the wrap interval is 100x as long as the reset interval, I'm not worried about problems on actual silicon.) +1 for doing this and moving on. -- Noah Misch http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] sinval synchronization considered harmful
On Thu, Jul 21, 2011 at 9:19 PM, Tom Lane wrote: > Robert Haas writes: >> On Thu, Jul 21, 2011 at 6:43 PM, Noah Misch wrote: >>> On Wed, Jul 20, 2011 at 09:46:33PM -0400, Robert Haas wrote: SIGetDataEntries() can pretty easily be made lock-free. The only real changes that seem to be needed are (1) to use a 64-bit counter, so you never need to decrement > >>> On second thought, won't this be inadequate on 32-bit systems, where >>> updating >>> the 64-bit counter produces two stores? You must avoid reading it between >>> those stores. > >> Now that is a potentially big problem. > > Could we do something similar to the xxid hacks? That is, we have a lot > of counters that should be fairly close to each other, so we store only > the low-order 32 bits of each notional value, and separately maintain a > common high-order word. You probably would need some additional > overhead each time the high-order word bumps, but that's reasonably > infrequent. Well, the trouble is figuring out what the shape of that additional overhead needs to look like. I think I have a simpler idea, though: before acquiring any locks, just have SIGetDataEntries() do this: + if (stateP->nextMsgNum == segP->maxMsgNum && !stateP->resetState) + return 0; Patch (with comment explaining why I think this is OK) attached. If the message numbers happen to be equal only because the counter has wrapped, then stateP->resetState will be true, so we'll still realize we need to do some work.
Test results, with the lazy vxid patch plus this patch, at 8 clients: tps = 34028.144439 (including connections establishing) tps = 34079.085935 (including connections establishing) tps = 34125.295938 (including connections establishing) And at 32 clients: tps = 185521.605364 (including connections establishing) tps = 188250.700451 (including connections establishing) tps = 186077.847215 (including connections establishing) And at 80 clients: tps = 188568.886569 (including connections establishing) tps = 191035.971512 (including connections establishing) tps = 189363.019377 (including connections establishing) Not quite as good as the unlocked version, but better than the per-backend mutex, and a whole lot simpler. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company [Attachment: sinval-fastpath.patch]
Re: [HACKERS] sinval synchronization considered harmful
Robert Haas writes: > On Thu, Jul 21, 2011 at 6:43 PM, Noah Misch wrote: >> On Wed, Jul 20, 2011 at 09:46:33PM -0400, Robert Haas wrote: >>> SIGetDataEntries() can pretty easily be made lock-free. The only real >>> changes that seem to be needed are (1) to use a 64-bit counter, so >>> you never need to decrement >> On second thought, won't this be inadequate on 32-bit systems, where updating >> the 64-bit counter produces two stores? You must avoid reading it between >> those stores. > Now that is a potentially big problem. Could we do something similar to the xxid hacks? That is, we have a lot of counters that should be fairly close to each other, so we store only the low-order 32 bits of each notional value, and separately maintain a common high-order word. You probably would need some additional overhead each time the high-order word bumps, but that's reasonably infrequent. regards, tom lane
Re: [HACKERS] sinval synchronization considered harmful
On Thu, Jul 21, 2011 at 6:43 PM, Noah Misch wrote: > On Wed, Jul 20, 2011 at 09:46:33PM -0400, Robert Haas wrote: >> Profiling this combination of patches reveals that there is still some >> pretty ugly spinlock contention on sinval's msgNumLock. And it occurs >> to me that on x86, we really don't need this lock ... or >> SInvalReadLock ... or a per-backend mutex. The whole of >> SIGetDataEntries() can pretty easily be made lock-free. The only real >> changes that seem to be needed are (1) to use a 64-bit counter, so >> you never need to decrement > > On second thought, won't this be inadequate on 32-bit systems, where updating > the 64-bit counter produces two stores? You must avoid reading it between > those > stores. Now that is a potentially big problem. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] sinval synchronization considered harmful
On Thu, Jul 21, 2011 at 6:44 PM, Dan Ports wrote: > If you're suggesting that hardware memory barriers aren't going to be > needed to implement lock-free code on x86, that isn't true. Because a > read can be reordered with respect to a write to a different memory > location, you can still have problems. So you do still need memory > barriers, just fewer of them. > > Dekker's algorithm is the classic example: two threads each set a flag > and then check whether the other thread's flag is set. In any > sequential execution, at least one should see the other's flag set, but > on the x86 that doesn't always happen. One thread's read might be > reordered before its write. In the case of sinval, what we need to do for SIGetDataEntries() is, approximately, a bunch of loads, followed by a store to one of the locations we loaded (which no one else can have written meanwhile). So I think that's OK. In SIInsertDataEntries(), what we need to do is, approximately, take a lwlock, load from a location which can only be written while holding the lwlock, do a bunch of stores, ending with a store to that first location, and release the lwlock. I think that's OK, too. >> 2. Machines with weak memory ordering. On this category of machines >> (which includes PowerPC, Dec Alpha, and maybe some others), the CPU >> reorders memory accesses arbitrarily unless you explicitly issue >> instructions that enforce synchronization. You still need to keep the >> compiler from moving things around, too. Alpha is particularly >> pernicious, because something like a->b can fetch the pointed-to value >> before loading the pointer itself. This is otherwise known as "we >> have basically no cache coherency circuits on this chip at all". On >> these machines, you need to issue an explicit memory barrier >> instruction at each sequence point, or just acquire and release a >> spinlock. > > The Alpha is pretty much unique (thankfully!) in allowing dependent > reads to be reordered. 
That makes it even weaker than the typical > weak-ordering machine. Since reading a pointer and then dereferencing > it is a pretty reasonable thing to do regularly in RCU code, you > probably don't want to emit barriers in between on architectures where > it's not actually necessary. That argues for another operation that's > defined to be a barrier (mb) on the Alpha but a no-op elsewhere. > Certainly the Linux kernel found it useful to do so > (read_barrier_depends) > > Alternatively, one might question how important it is to support the > Alpha these days... Well, currently, we do, so we probably don't want to drop support for that without some careful thought. I searched the archive and found someone trying to compile 8.3.something on Alpha just a few years ago, so it's apparently not totally dead yet. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] sinval synchronization considered harmful
On Thu, Jul 21, 2011 at 6:22 PM, Florian Pflug wrote: > On Jul21, 2011, at 21:15 , Robert Haas wrote: >> On Thu, Jul 21, 2011 at 2:50 PM, Tom Lane wrote: >>> Robert Haas writes: ... On these machines, you need to issue an explicit memory barrier instruction at each sequence point, or just acquire and release a spinlock. >>> >>> Right, and the reason that a spinlock fixes it is that we have memory >>> barrier instructions built into the spinlock code sequences on machines >>> where it matters. >>> >>> To get to the point where we could do the sort of optimization Robert >>> is talking about, someone will have to build suitable primitives for >>> all the platforms we support. In the cases where we use gcc ASM in >>> s_lock.h, it shouldn't be too hard to pull out the barrier >>> instruction(s) ... but on platforms where we rely on OS-supplied >>> functions, some research is going to be needed. >> >> Yeah, although falling back to SpinLockAcquire() and SpinLockRelease() >> on a backend-private slock_t should work anywhere that PostgreSQL >> works at all[1]. That will probably be slower than a memory fence >> instruction and certainly slower than a compiler barrier, but the >> point is that - right now - we're doing it the slow way everywhere. > > As I discovered while playing with various lockless algorithms to > improve our LWLocks, spin locks aren't actually a replacement for > a (full) barrier. > > Lock acquisition only really needs to guarantee that loads and stores > which come after the acquisition operation in program order (i.e., in > the instruction stream) aren't globally visible before that operation > completes. This kind of barrier behaviour is often fittingly called > "acquire barrier". > > Similarly, a lock release operation only needs to guarantee that loads > and stores which occur before that operation in program order are > globally visible before the release operation completes. This, again, > is fittingly called "release barrier". 
> > Now assume the following code fragment > > global1 = 1; > SpinLockAcquire(); > SpinLockRelease(); > global2 = 1; > > If SpinLockAcquire() has "acquire barrier" semantics, and SpinLockRelease() > has "release barrier" semantics, then it's possible for the store to global1 > to be delayed until after SpinLockAcquire(), and similarly for the store > to global2 to be executed before SpinLockRelease() completes. In other > words, what happens is > > SpinLockAcquire(); > global1 = 1; > global2 = 1; > SpinLockRelease(); > > But once that can happen, there's no reason that it couldn't also be > > SpinLockAcquire(); > global2 = 1; > global1 = 1; > SpinLockRelease(); > > I didn't check if any of our spin lock implementations is actually affected > by this, but it doesn't seem wise to rely on them being full barriers, even > if it may be true today. Hmm. I'm not worried about that. AFAIK, only IA64 has such an implementation, and our existing spinlock implementation doesn't use it. If we were to add something like that in the future, we'd presumably know that we were doing it, and would add the appropriate memory barrier primitive at the same time. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] sinval synchronization considered harmful
On Jul21, 2011, at 03:46 , Robert Haas wrote: > Profiling this combination of patches reveals that there is still some > pretty ugly spinlock contention on sinval's msgNumLock. And it occurs > to me that on x86, we really don't need this lock ... or > SInvalReadLock ... or a per-backend mutex. The whole of > SIGetDataEntries() can pretty easily be made lock-free. The only real > changes that seem to be needed are (1) to use a 64-bit counter, so > you never need to decrement and (2) to recheck resetState after > reading the entries from the queue, to see if we got reset while we > were reading those entries. Since x86 guarantees that writes will > become visible in the order they are executed, we only need to make > sure that the compiler doesn't rearrange things. As long as we first > read the maxMsgNum and then read the messages, we can't read garbage. > As long as we read the messages before we check resetState, we will be > sure to notice if we got reset before we read all the messages (which > is the only way that we can have read garbage messages). Sounds sensible. There's one additional hazard, though - you'll also need the reads to be atomic. x86 guarantees that for loads of up to 32 bits (i386) respectively 64 bits (x64), but only for reads from properly aligned addresses (4 bytes for 4-byte reads, 8 bytes for 8-byte reads). I found that out the hard way a few days ago, again while playing with different LWLock implementations, when I botched my test setup and the proc array entries ended up being misaligned. Boy, was it fun to debug the random crashes caused by non-atomic pointer reads... If we widen the counter to 64-bit, reading it atomically becomes a bit of a challenge on i386, but is doable also. From what I remember, there are two options. You can use the 8-byte compare-and-exchange operation, but it might be that only quite recent CPUs support that. The other option seems to be to use floating-point instructions.
I believe the latter is what Intel's own Threading Building Blocks library does, but I'd have to re-check to be sure. It might also be that, once you start using floating-point instructions, you find that you actually do need fencing instructions even on x86. Dunno if the weaker ordering affects only SIMD instructions or all floating point stuff... best regards, Florian Pflug
Re: [HACKERS] sinval synchronization considered harmful
On Wed, Jul 20, 2011 at 09:46:33PM -0400, Robert Haas wrote: > Profiling this combination of patches reveals that there is still some > pretty ugly spinlock contention on sinval's msgNumLock. And it occurs > to me that on x86, we really don't need this lock ... or > SInvalReadLock ... or a per-backend mutex. The whole of > SIGetDataEntries() can pretty easily be made lock-free. The only real > changes that seem to be needed are (1) to use a 64-bit counter, so > you never need to decrement On second thought, won't this be inadequate on 32-bit systems, where updating the 64-bit counter produces two stores? You must avoid reading it between those stores. -- Noah Misch http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] sinval synchronization considered harmful
On Thu, Jul 21, 2011 at 02:31:15PM -0400, Robert Haas wrote: > 1. Machines with strong memory ordering. On this category of machines > (which include x86), the CPU basically does not perform loads or > stores out of order. On some of these machines, it is apparently > possible for there to be some reordering of stores relative to loads, > but if the program stores two values or loads two values, those > operations will be performed in the same order they appear in the > program. This is all correct, but... > The main thing you need to make your code work reliably on > these machines is a primitive that keeps the compiler from reordering > your code during optimization. If you're suggesting that hardware memory barriers aren't going to be needed to implement lock-free code on x86, that isn't true. Because a read can be reordered with respect to a write to a different memory location, you can still have problems. So you do still need memory barriers, just fewer of them. Dekker's algorithm is the classic example: two threads each set a flag and then check whether the other thread's flag is set. In any sequential execution, at least one should see the other's flag set, but on the x86 that doesn't always happen. One thread's read might be reordered before its write. > 2. Machines with weak memory ordering. On this category of machines > (which includes PowerPC, Dec Alpha, and maybe some others), the CPU > reorders memory accesses arbitrarily unless you explicitly issue > instructions that enforce synchronization. You still need to keep the > compiler from moving things around, too. Alpha is particularly > pernicious, because something like a->b can fetch the pointed-to value > before loading the pointer itself. This is otherwise known as "we > have basically no cache coherency circuits on this chip at all". On > these machines, you need to issue an explicit memory barrier > instruction at each sequence point, or just acquire and release a > spinlock.
The Alpha is pretty much unique (thankfully!) in allowing dependent reads to be reordered. That makes it even weaker than the typical weak-ordering machine. Since reading a pointer and then dereferencing it is a pretty reasonable thing to do regularly in RCU code, you probably don't want to emit barriers in between on architectures where it's not actually necessary. That argues for another operation that's defined to be a barrier (mb) on the Alpha but a no-op elsewhere. Certainly the Linux kernel found it useful to do so (read_barrier_depends) Alternatively, one might question how important it is to support the Alpha these days... Dan -- Dan R. K. Ports MIT CSAIL http://drkp.net/
Re: [HACKERS] sinval synchronization considered harmful
On Jul21, 2011, at 21:15 , Robert Haas wrote: > On Thu, Jul 21, 2011 at 2:50 PM, Tom Lane wrote: >> Robert Haas writes: >>> ... On these machines, you need to issue an explicit memory barrier >>> instruction at each sequence point, or just acquire and release a >>> spinlock. >> >> Right, and the reason that a spinlock fixes it is that we have memory >> barrier instructions built into the spinlock code sequences on machines >> where it matters. >> >> To get to the point where we could do the sort of optimization Robert >> is talking about, someone will have to build suitable primitives for >> all the platforms we support. In the cases where we use gcc ASM in >> s_lock.h, it shouldn't be too hard to pull out the barrier >> instruction(s) ... but on platforms where we rely on OS-supplied >> functions, some research is going to be needed. > > Yeah, although falling back to SpinLockAcquire() and SpinLockRelease() > on a backend-private slock_t should work anywhere that PostgreSQL > works at all[1]. That will probably be slower than a memory fence > instruction and certainly slower than a compiler barrier, but the > point is that - right now - we're doing it the slow way everywhere. As I discovered while playing with various lockless algorithms to improve our LWLocks, spin locks aren't actually a replacement for a (full) barrier. Lock acquisition only really needs to guarantee that loads and stores which come after the acquisition operation in program order (i.e., in the instruction stream) aren't globally visible before that operation completes. This kind of barrier behaviour is often fittingly called "acquire barrier". Similarly, a lock release operation only needs to guarantee that loads and stores which occur before that operation in program order are globally visible before the release operation completes. This, again, is fittingly called "release barrier". 
Now assume the following code fragment global1 = 1; SpinLockAcquire(); SpinLockRelease(); global2 = 1; If SpinLockAcquire() has "acquire barrier" semantics, and SpinLockRelease() has "release barrier" semantics, then it's possible for the store to global1 to be delayed until after SpinLockAcquire(), and similarly for the store to global2 to be executed before SpinLockRelease() completes. In other words, what happens is SpinLockAcquire(); global1 = 1; global2 = 1; SpinLockRelease(); But once that can happen, there's no reason that it couldn't also be SpinLockAcquire(); global2 = 1; global1 = 1; SpinLockRelease(); I didn't check if any of our spin lock implementations is actually affected by this, but it doesn't seem wise to rely on them being full barriers, even if it may be true today. best regards, Florian Pflug
Re: [HACKERS] sinval synchronization considered harmful
On Thu, Jul 21, 2011 at 4:54 PM, Noah Misch wrote: > On Wed, Jul 20, 2011 at 09:46:33PM -0400, Robert Haas wrote: >> For the last week or so, in between various other tasks, I've been >> trying to understand why performance drops off when you run pgbench -n >> -S -c $CLIENTS -j $CLIENTS -T $A_FEW_MINUTES at very high client >> counts. The answer, in a word, is SIGetDataEntries(). I believe we >> need to bite the bullet and rewrite this using a lock-free algorithm, >> using memory barriers on processors with weak memory ordering. >> Perhaps there is another way to do it, but nothing I've tried has >> really worked so far, and I've tried quite a few things. Here's the >> data. >> >> On unpatched master, performance scales pretty much linearly out to 32 >> clients. As you add more clients, it drops off: > >> [80 clients] >> tps = 132518.586371 (including connections establishing) >> tps = 130968.749747 (including connections establishing) >> tps = 132574.338942 (including connections establishing) > >> [80 clients, with lazy vxid locks and sinval-unlocked] >> tps = 203256.701227 (including connections establishing) >> tps = 190637.957571 (including connections establishing) >> tps = 190228.617178 (including connections establishing) > > Nice numbers. The sinval-unlocked.patch implementation looks like it's > taking a > sound direction. > > In > http://archives.postgresql.org/message-id/ca+tgmobbxmh_9zjudheswo6m8sbmb5hdzt+3chcluv5eztv...@mail.gmail.com, > you quoted 210k TPS when you stubbed out AcceptInvalidationMessages(). Is it > correct to conclude that AcceptInvalidationMessages() still reduces the > transaction rate by 5-10% with this stack of patches? Good question - I have not tested. One idea I just had... 
if we use a 64-bit counter for maxMsgNum, maybe we could make AcceptInvalidationMessages() a macro, something like this: if (MyProcState->nextMsgNum != shmInvalState->maxMsgNum) ReallyAcceptInvalidationMessages(); That ought to be extremely cheap and - if we use 64-bit counters for the message-number counters - safe. You might object that the load of maxMsgNum might migrate backward, but it can't possibly back up any further than the preceding lock acquisition, since that's required to be a full memory barrier on every architecture. And if we haven't acquired a relevant lock, then a relevant sinval message could show up an instant after we check regardless of the implementation. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] sinval synchronization considered harmful
On Wed, Jul 20, 2011 at 09:46:33PM -0400, Robert Haas wrote: > For the last week or so, in between various other tasks, I've been > trying to understand why performance drops off when you run pgbench -n > -S -c $CLIENTS -j $CLIENTS -T $A_FEW_MINUTES at very high client > counts. The answer, in a word, is SIGetDataEntries(). I believe we > need to bite the bullet and rewrite this using a lock-free algorithm, > using memory barriers on processors with weak memory ordering. > Perhaps there is another way to do it, but nothing I've tried has > really worked so far, and I've tried quite a few things. Here's the > data. > > On unpatched master, performance scales pretty much linearly out to 32 > clients. As you add more clients, it drops off: > [80 clients] > tps = 132518.586371 (including connections establishing) > tps = 130968.749747 (including connections establishing) > tps = 132574.338942 (including connections establishing) > [80 clients, with lazy vxid locks and sinval-unlocked] > tps = 203256.701227 (including connections establishing) > tps = 190637.957571 (including connections establishing) > tps = 190228.617178 (including connections establishing) Nice numbers. The sinval-unlocked.patch implementation looks like it's taking a sound direction. In http://archives.postgresql.org/message-id/ca+tgmobbxmh_9zjudheswo6m8sbmb5hdzt+3chcluv5eztv...@mail.gmail.com, you quoted 210k TPS when you stubbed out AcceptInvalidationMessages(). Is it correct to conclude that AcceptInvalidationMessages() still reduces the transaction rate by 5-10% with this stack of patches? -- Noah Misch http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] sinval synchronization considered harmful
On Thu, Jul 21, 2011 at 3:53 PM, Tom Lane wrote: > Robert Haas writes: >> I think the real challenge is going to be testing. If anyone has a >> machine with weak memory ordering they can give me access to, that >> would be really helpful for flushing the bugs out of this stuff. > > There are multi-CPU PPCen in the buildfarm, or at least there were last > time I broke the sinval code ;-). Note that testing on a single-core > PPC will prove nothing. Yeah, I was just thinking about that. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] sinval synchronization considered harmful
Robert Haas writes: > I think the real challenge is going to be testing. If anyone has a > machine with weak memory ordering they can give me access to, that > would be really helpful for flushing the bugs out of this stuff. There are multi-CPU PPCen in the buildfarm, or at least there were last time I broke the sinval code ;-). Note that testing on a single-core PPC will prove nothing. regards, tom lane
Re: [HACKERS] sinval synchronization considered harmful
On Thu, Jul 21, 2011 at 8:25 PM, Robert Haas wrote: > On Thu, Jul 21, 2011 at 3:22 PM, Dave Page wrote: >> On Thu, Jul 21, 2011 at 8:15 PM, Robert Haas wrote: >>> I think the real challenge is going to be testing. If anyone has a >>> machine with weak memory ordering they can give me access to, that >>> would be really helpful for flushing the bugs out of this stuff. >>> Getting it to work on x86 is not the hard part. >> >> I believe there's a PPC box in our storage facility in NJ that we >> might be able to dig out for you. There's also a couple in our India >> office. Let me know if they'd be of help. > > Yes! > > More processors is better, of course, but having anything at all to > test on would be an improvement. OK, will check with India first, as it'll be easier for them to deploy. -- Dave Page Blog: http://pgsnake.blogspot.com Twitter: @pgsnake EnterpriseDB UK: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] sinval synchronization considered harmful
On Thu, Jul 21, 2011 at 3:22 PM, Dave Page wrote: > On Thu, Jul 21, 2011 at 8:15 PM, Robert Haas wrote: >> I think the real challenge is going to be testing. If anyone has a >> machine with weak memory ordering they can give me access to, that >> would be really helpful for flushing the bugs out of this stuff. >> Getting it to work on x86 is not the hard part. > > I believe there's a PPC box in our storage facility in NJ that we > might be able to dig out for you. There's also a couple in our India > office. Let me know if they'd be of help. Yes! More processors is better, of course, but having anything at all to test on would be an improvement. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] sinval synchronization considered harmful
On Thu, Jul 21, 2011 at 8:15 PM, Robert Haas wrote: > > I think the real challenge is going to be testing. If anyone has a > machine with weak memory ordering they can give me access to, that > would be really helpful for flushing the bugs out of this stuff. > Getting it to work on x86 is not the hard part. I believe there's a PPC box in our storage facility in NJ that we might be able to dig out for you. There's also a couple in our India office. Let me know if they'd be of help. -- Dave Page Blog: http://pgsnake.blogspot.com Twitter: @pgsnake EnterpriseDB UK: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] sinval synchronization considered harmful
On Thu, Jul 21, 2011 at 2:50 PM, Tom Lane wrote: > Robert Haas writes: >> ... On these machines, you need to issue an explicit memory barrier >> instruction at each sequence point, or just acquire and release a >> spinlock. > > Right, and the reason that a spinlock fixes it is that we have memory > barrier instructions built into the spinlock code sequences on machines > where it matters. > > To get to the point where we could do the sort of optimization Robert > is talking about, someone will have to build suitable primitives for > all the platforms we support. In the cases where we use gcc ASM in > s_lock.h, it shouldn't be too hard to pull out the barrier > instruction(s) ... but on platforms where we rely on OS-supplied > functions, some research is going to be needed. Yeah, although falling back to SpinLockAcquire() and SpinLockRelease() on a backend-private slock_t should work anywhere that PostgreSQL works at all[1]. That will probably be slower than a memory fence instruction and certainly slower than a compiler barrier, but the point is that - right now - we're doing it the slow way everywhere. I think the real challenge is going to be testing. If anyone has a machine with weak memory ordering they can give me access to, that would be really helpful for flushing the bugs out of this stuff. Getting it to work on x86 is not the hard part. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company [1] This was a suggestion from Noah Misch. I wasn't quite convinced when he initially made it, but having studied the issue a lot more, I now am. The CPU doesn't know how many processes have the memory mapped into their address space.
Re: [HACKERS] sinval synchronization considered harmful
Robert Haas writes:
> ... On these machines, you need to issue an explicit memory barrier
> instruction at each sequence point, or just acquire and release a
> spinlock.

Right, and the reason that a spinlock fixes it is that we have memory
barrier instructions built into the spinlock code sequences on
machines where it matters.

To get to the point where we could do the sort of optimization Robert
is talking about, someone will have to build suitable primitives for
all the platforms we support. In the cases where we use gcc ASM in
s_lock.h, it shouldn't be too hard to pull out the barrier
instruction(s) ... but on platforms where we rely on OS-supplied
functions, some research is going to be needed.

regards, tom lane
Re: [HACKERS] sinval synchronization considered harmful
On Thu, Jul 21, 2011 at 12:16 PM, Kevin Grittner wrote:
> Very impressive! Those numbers definitely justify some #ifdef code
> to provide alternatives for weak memory ordering machines versus
> others. With the number of CPUs climbing as it is, this is very
> important work!

Thanks. I'm not thinking so much about #ifdef (although that could
work, too) as I am about providing some primitives to allow this sort
of code-writing to be done in a somewhat less ad-hoc fashion.

It seems like there are basically two categories of machines we need
to worry about.

1. Machines with strong memory ordering. On this category of machines
(which includes x86), the CPU basically does not perform loads or
stores out of order. On some of these machines, it is apparently
possible for there to be some reordering of stores relative to loads,
but if the program stores two values or loads two values, those
operations will be performed in the same order they appear in the
program. The main thing you need to make your code work reliably on
these machines is a primitive that keeps the compiler from reordering
your code during optimization. On x86, certain categories of exotic
instructions do require explicit memory barriers, but ordinary loads
and stores do not.

2. Machines with weak memory ordering. On this category of machines
(which includes PowerPC, Dec Alpha, and maybe some others), the CPU
reorders memory accesses arbitrarily unless you explicitly issue
instructions that enforce synchronization. You still need to keep the
compiler from moving things around, too. Alpha is particularly
pernicious, because something like a->b can fetch the pointed-to value
before loading the pointer itself. This is otherwise known as "we have
basically no cache coherency circuits on this chip at all". On these
machines, you need to issue an explicit memory barrier instruction at
each sequence point, or just acquire and release a spinlock.
So you can imagine a primitive that is defined to be a compiler
barrier on machines with strong memory ordering, and a memory fencing
instruction on machines with weak memory ordering.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] sinval synchronization considered harmful
Robert Haas wrote:
> SIGetDataEntries(). I believe we need to bite the bullet and
> rewrite this using a lock-free algorithm, using memory barriers on
> processors with weak memory ordering.

> [32 processors; 80 clients]
> On unpatched master
> tps = 132518.586371 (including connections establishing)
> tps = 130968.749747 (including connections establishing)
> tps = 132574.338942 (including connections establishing)

> With the lazy vxid locks patch
> tps = 119215.958372 (including connections establishing)
> tps = 113056.859871 (including connections establishing)
> tps = 160562.770998 (including connections establishing)

> gets rid of SInvalReadLock and instead gives each backend its own
> spinlock.
> tps = 167392.042393 (including connections establishing)
> tps = 171336.145020 (including connections establishing)
> tps = 170500.529303 (including connections establishing)

> SIGetDataEntries() can pretty easily be made lock-free.
> tps = 203256.701227 (including connections establishing)
> tps = 190637.957571 (including connections establishing)
> tps = 190228.617178 (including connections establishing)

> Thoughts? Comments? Ideas?

Very impressive! Those numbers definitely justify some #ifdef code to
provide alternatives for weak memory ordering machines versus others.
With the number of CPUs climbing as it is, this is very important
work!

-Kevin
[HACKERS] sinval synchronization considered harmful
For the last week or so, in between various other tasks, I've been
trying to understand why performance drops off when you run "pgbench
-n -S -c $CLIENTS -j $CLIENTS -T $A_FEW_MINUTES" at very high client
counts. The answer, in a word, is SIGetDataEntries(). I believe we
need to bite the bullet and rewrite this using a lock-free algorithm,
using memory barriers on processors with weak memory ordering.
Perhaps there is another way to do it, but nothing I've tried has
really worked so far, and I've tried quite a few things.

Here's the data. On unpatched master, performance scales pretty much
linearly out to 32 clients. As you add more clients, it drops off:

[1 client]
tps = 4459.203159 (including connections establishing)
tps = 4488.456559 (including connections establishing)
tps = 4449.570530 (including connections establishing)

[8 clients]
tps = 33524.668762 (including connections establishing)
tps = 33501.833907 (including connections establishing)
tps = 33382.380908 (including connections establishing)

[32 clients]
tps = 178394.863906 (including connections establishing)
tps = 178457.706972 (including connections establishing)
tps = 179365.310179 (including connections establishing)

[80 clients]
tps = 132518.586371 (including connections establishing)
tps = 130968.749747 (including connections establishing)
tps = 132574.338942 (including connections establishing)

With the lazy vxid locks patch, all of those numbers get better by a
few percentage points (the more clients, the more points) except at 80
clients, where the performance is sometimes better and sometimes
worse.
These are 5-minute runs:

[80 clients, with lazy vxid locks]
tps = 119215.958372 (including connections establishing)
tps = 113056.859871 (including connections establishing)
tps = 160562.770998 (including connections establishing)

I can't explain in detail why there is so much variation between the
5-minute runs, or why some runs perform worse than without the lazy
vxid locks, but I can say that applying the first of the two attached
patches (sinval-per-backend-mutex.patch) fixes it. The patch gets rid
of SInvalReadLock and instead gives each backend its own spinlock.
SICleanupQueue() must therefore get somewhat fuzzy answers, but it
shouldn't matter. The only real problem is if you need to do the
super-duper-cleanup where you subtract a constant from all the values
in unison. I just got rid of that completely, by widening the
counters to 64 bits.

Assuming I haven't botched the implementation, this is clearly a win.
There is still some performance drop-off going from 32 clients to 80
clients, but it is much less.

[80 clients, with lazy vxid locks and sinval-per-backend-mutex]
tps = 167392.042393 (including connections establishing)
tps = 171336.145020 (including connections establishing)
tps = 170500.529303 (including connections establishing)

Profiling this combination of patches reveals that there is still some
pretty ugly spinlock contention on sinval's msgNumLock. And it occurs
to me that on x86, we really don't need this lock ... or
SInvalReadLock ... or a per-backend mutex. The whole of
SIGetDataEntries() can pretty easily be made lock-free. The only real
changes that seem to be needed are (1) to use a 64-bit counter, so you
never need to decrement and (2) to recheck resetState after reading
the entries from the queue, to see if we got reset while we were
reading those entries. Since x86 guarantees that writes will become
visible in the order they are executed, we only need to make sure that
the compiler doesn't rearrange things.
As long as we first read the maxMsgNum and then read the messages, we
can't read garbage. As long as we read the messages before we check
resetState, we will be sure to notice if we got reset before we read
all the messages (which is the only way that we can have read garbage
messages).

Now this implementation (sinval-unlocked.patch) is obviously not a
serious proposal as it stands, because on processors with weak memory
ordering it will presumably blow up all over the place. But here are
the results on x86:

[80 clients, with lazy vxid locks and sinval-unlocked]
tps = 203256.701227 (including connections establishing)
tps = 190637.957571 (including connections establishing)
tps = 190228.617178 (including connections establishing)

Now, that's actually *faster* than the above 32-core numbers. Turns
out, with this approach, there's essentially no degradation between 32
clients and 80 clients. It appears that even one spinlock acquisition
in SIGetDataEntries() is too many. Currently, we have *three*: one to
get SInvalReadLock, one to get msgnumLock, and one to release
SInvalReadLock.

For contrast, on a two-core MacBook Pro, I can't measure any
difference at all between lazy-vxid,
lazy-vxid+sinval-per-backend-mutex, and lazy-vxid+sinval-unlocked.
The last might even be a touch slower, although there's enough noise
that it's hard to tell for sure, and the effect is very small if ther