Re: [PATCH v3 0/6] Static calls
On 1/14/19 9:01 PM, H. Peter Anvin wrote:
>
> This could be as simple as spinning for a limited time waiting for
> states 0 or 3 if we are not the patching CPU. It is also not necessary
> to wait for the mask to become zero for the first sync if we find
> ourselves suddenly in state 4.

So this would look something like this for the #BP handler; I think this
is safe.  This uses the TLB miss on the write page intentionally to slow
down the loop a bit to reduce the risk of livelock.  Note that
"bp_write_addr" here refers to the write address for the breakpoint that
was taken.

	state = atomic_read(&bp_poke_state);
	if (state == 0)
		return 0;	/* No patching in progress */

recheck:
	clear bit in mask

	switch (state) {
	case 1:
	case 4:
		if (smp_processor_id() != bp_patching_cpu) {
			int retries = NNN;
			while (retries--) {
				invlpg
				if (*bp_write_addr != 0xcc)
					goto recheck;
				state = atomic_read(&bp_poke_state);
				if (state != 1 && state != 4)
					goto recheck;
			}
		}
		state = cmpxchg(&bp_poke_state, 1, 4);
		if (state != 1 && state != 4)
			goto recheck;
		atomic_write(bp_write_addr, bp_old_value);
		break;
	case 2:
		if (smp_processor_id() != bp_patching_cpu) {
			invlpg
			state = atomic_read(&bp_poke_state);
			if (state != 2)
				goto recheck;
		}
		complete patch sequence
		remove breakpoint
		break;
	case 3:
	case 0:
		/*
		 * If we are here, the #BP will go away on its
		 * own, or we will re-take it if it was a "real"
		 * breakpoint.
		 */
		break;
	}
	return 1;
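For illustration, the bounded spin in case 1/4 above can be sketched as a
userspace model using C11 atomics.  This is not kernel code: SPIN_RETRIES
stands in for NNN, and poke_state is a stand-in for the kernel-side state
word; the TLB-miss slowdown and the mask handling are elided.

```c
#include <assert.h>
#include <stdatomic.h>

enum { SPIN_RETRIES = 1000 };           /* stands in for NNN */
static _Atomic int poke_state;          /* stand-in for the state word */

/*
 * Spin for a bounded time hoping to observe a state other than 1/4,
 * so a non-patching CPU can avoid canceling the patch (state 4).
 */
static int wait_before_cancel(void)
{
	for (int retries = SPIN_RETRIES; retries-- > 0; ) {
		int s = atomic_load(&poke_state);

		if (s != 1 && s != 4)
			return 1;	/* patcher moved on; no cancel needed */
	}
	return 0;			/* budget exhausted: fall back to canceling */
}
```

The point of the bound is exactly the trade-off described in the quoted
text: it does not make livelock impossible, it just makes a rollback much
less likely under ordinary load.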
Re: [PATCH v3 0/6] Static calls
On 1/14/19 7:05 PM, Andy Lutomirski wrote:
> On Mon, Jan 14, 2019 at 2:55 PM H. Peter Anvin wrote:
>>
>> I think this sequence ought to work (keep in mind we are already under a
>> mutex, so the global data is safe even if we are preempted):
>
> I'm trying to wrap my head around this.  The states are:
>
> 0: normal operation
> 1: writing 0xcc, can be canceled
> 2: writing final instruction.  The 0xcc was definitely synced to all CPUs.
> 3: patch is definitely installed but maybe not sync_cored.

4: breakpoint has been canceled; need to redo patching.

>> set up page table entries
>> invlpg
>> set up bp patching global data
>>
>> cpu = get_cpu()
>>
>> bp_old_value = atomic_read(bp_write_addr)
>>
>> do {
>
> So we're assuming that the state is 0 here.  A WARN_ON_ONCE to check
> that would be nice.

The state here can be 0 or 4.

>> 	atomic_write(&bp_poke_state, 1)
>>
>> 	atomic_write(bp_write_addr, 0xcc)
>>
>> 	mask <- online_cpu_mask - self
>> 	send IPIs
>> 	wait for mask = 0
>>
>> } while (cmpxchg(&bp_poke_state, 1, 2) != 1);
>>
>> patch sites, remove breakpoints after patching each one
>
> Not sure what you mean by patch *sites*.  As written, this only
> supports one patch site at a time, since there's only one
> bp_write_addr, and fixing that may be complicated.  Not fixing it
> might also be a scalability problem.

Fixing it isn't all that complicated; we just need to have a list of
patch locations (which we need anyway!) and walk (or search) it instead
of checking just one; I omitted that detail for simplicity.

>> atomic_write(&bp_poke_state, 3);
>>
>> mask <- online_cpu_mask - self
>> send IPIs
>> wait for mask = 0
>>
>> atomic_write(&bp_poke_state, 0);
>>
>> tear down patching global data
>> tear down page table entries
>>
>> The #BP handler would then look like:
>>
>> 	state = cmpxchg(&bp_poke_state, 1, 4);
>> 	switch (state) {
>> 	case 1:
>> 	case 4:
>
> What is state 4?
>
>> 		invlpg
>> 		cmpxchg(bp_write_addr, 0xcc, bp_old_value)

I'm 85% sure that the cmpxchg here is actually unnecessary, an
atomic_write() is sufficient.

>> 		break;
>> 	case 2:
>> 		invlpg
>> 		complete patch sequence
>> 		remove breakpoint
>> 		break;
>
> ISTM you might as well change state to 3 here, but it's arguably
> unnecessary.

If and only if you have only one patch location you could, but again,
unnecessary.

>> 	case 3:
>> 		/* If we are here, the #BP will go away on its own */
>> 		break;
>> 	case 0:
>> 		/* No patching in progress!!! */
>> 		return 0;
>> 	}
>>
>> 	clear bit in mask
>> 	return 1;
>>
>> The IPI handler:
>>
>> 	clear bit in mask
>> 	sync_core	/* Needed if multiple IPI events are chained */
>
> I really like that this doesn't require fixups -- text_poke_bp() just
> works.  But I'm nervous about livelocks or maybe just extreme slowness
> under nasty loads.  Suppose some perf NMI code does a static call or
> uses a static call.  Now there's a situation where, under high
> frequency perf sampling, the patch process might almost always hit the
> breakpoint while in state 1.  It'll get reversed and done again, and
> we get stuck.  It would be neat if we could get the same "no
> deadlocks" property while significantly reducing the chance of a
> rollback.

This could be as simple as spinning for a limited time waiting for
states 0 or 3 if we are not the patching CPU.  It is also not necessary
to wait for the mask to become zero for the first sync if we find
ourselves suddenly in state 4.

This wouldn't reduce the livelock probability to zero, but it ought to
reduce it enough that if we really are under such heavy event load we
may end up getting stuck in any number of ways...

> This is why I proposed something where we try to guarantee forward
> progress by making sure that any NMI code that might spin and wait for
> other CPUs is guaranteed to eventually sync_core(), clear its bit, and
> possibly finish a patch.  But this is a bit gross.

Yes, this gets really grotty and who knows how many code paths it would
touch.

	-hpa
Re: [PATCH v3 0/6] Static calls
On Mon, Jan 14, 2019 at 2:55 PM H. Peter Anvin wrote:
>
> I think this sequence ought to work (keep in mind we are already under a
> mutex, so the global data is safe even if we are preempted):

I'm trying to wrap my head around this.  The states are:

0: normal operation
1: writing 0xcc, can be canceled
2: writing final instruction.  The 0xcc was definitely synced to all CPUs.
3: patch is definitely installed but maybe not sync_cored.

> set up page table entries
> invlpg
> set up bp patching global data
>
> cpu = get_cpu()
>
> bp_old_value = atomic_read(bp_write_addr)
>
> do {

So we're assuming that the state is 0 here.  A WARN_ON_ONCE to check
that would be nice.

> 	atomic_write(&bp_poke_state, 1)
>
> 	atomic_write(bp_write_addr, 0xcc)
>
> 	mask <- online_cpu_mask - self
> 	send IPIs
> 	wait for mask = 0
>
> } while (cmpxchg(&bp_poke_state, 1, 2) != 1);
>
> patch sites, remove breakpoints after patching each one

Not sure what you mean by patch *sites*.  As written, this only
supports one patch site at a time, since there's only one
bp_write_addr, and fixing that may be complicated.  Not fixing it
might also be a scalability problem.

> atomic_write(&bp_poke_state, 3);
>
> mask <- online_cpu_mask - self
> send IPIs
> wait for mask = 0
>
> atomic_write(&bp_poke_state, 0);
>
> tear down patching global data
> tear down page table entries
>
> The #BP handler would then look like:
>
> 	state = cmpxchg(&bp_poke_state, 1, 4);
> 	switch (state) {
> 	case 1:
> 	case 4:

What is state 4?

> 		invlpg
> 		cmpxchg(bp_write_addr, 0xcc, bp_old_value)
> 		break;
> 	case 2:
> 		invlpg
> 		complete patch sequence
> 		remove breakpoint
> 		break;

ISTM you might as well change state to 3 here, but it's arguably
unnecessary.

> 	case 3:
> 		/* If we are here, the #BP will go away on its own */
> 		break;
> 	case 0:
> 		/* No patching in progress!!! */
> 		return 0;
> 	}
>
> 	clear bit in mask
> 	return 1;
>
> The IPI handler:
>
> 	clear bit in mask
> 	sync_core	/* Needed if multiple IPI events are chained */

I really like that this doesn't require fixups -- text_poke_bp() just
works.  But I'm nervous about livelocks or maybe just extreme slowness
under nasty loads.  Suppose some perf NMI code does a static call or
uses a static call.  Now there's a situation where, under high
frequency perf sampling, the patch process might almost always hit the
breakpoint while in state 1.  It'll get reversed and done again, and
we get stuck.  It would be neat if we could get the same "no
deadlocks" property while significantly reducing the chance of a
rollback.

This is why I proposed something where we try to guarantee forward
progress by making sure that any NMI code that might spin and wait for
other CPUs is guaranteed to eventually sync_core(), clear its bit, and
possibly finish a patch.  But this is a bit gross.
Re: [PATCH v3 0/6] Static calls
On January 14, 2019 3:27:55 PM PST, Andy Lutomirski wrote:
> On Mon, Jan 14, 2019 at 2:01 PM H. Peter Anvin wrote:
>>
>> So I was already in the middle of composing this message when Andy posted:
>>
>>> I don't even think this is sufficient.  I think we also need everyone
>>> who clears the bit to check if all bits are clear and, if so, remove
>>> the breakpoint.  Otherwise we have a situation where, if you are in
>>> text_poke_bp() and you take an NMI (or interrupt or MCE or whatever)
>>> and that interrupt then hits the breakpoint, then you deadlock because
>>> no one removes the breakpoint.
>>>
>>> If we do this, and if we can guarantee that all CPUs make forward
>>> progress, then maybe the problem is solved.  Can we guarantee something
>>> like all NMI handlers that might wait in a spinlock or for any other
>>> reason will periodically check if a sync is needed while they're
>>> spinning?
>>
>> So the really, really nasty case is when an asynchronous event on the
>> *patching* processor gets stuck spinning on a resource which is
>> unavailable due to another processor spinning on the #BP.  We can disable
>> interrupts, but we can't stop NMIs from coming in (although we could
>> test in the NMI handler if we are in that condition and return
>> immediately; I'm not sure we want to do that, and we still have to deal
>> with #MC and what not.)
>>
>> The fundamental problem here is that we don't see the #BP on the
>> patching processor, in which case we could simply complete the patching
>> from the #BP handler on that processor.
>>
>> On 1/13/19 6:40 PM, H. Peter Anvin wrote:
>>> On 1/13/19 6:31 PM, H. Peter Anvin wrote:
>>>>
>>>> static cpumask_t text_poke_cpumask;
>>>>
>>>> static void text_poke_sync(void)
>>>> {
>>>> 	smp_wmb();
>>>> 	text_poke_cpumask = cpu_online_mask;
>>>> 	smp_wmb();	/* Should be optional on x86 */
>>>> 	cpumask_clear_cpu(&text_poke_cpumask, smp_processor_id());
>>>> 	on_each_cpu_mask(&text_poke_cpumask, text_poke_sync_cpu, NULL, false);
>>>> 	while (!cpumask_empty(&text_poke_cpumask)) {
>>>> 		cpu_relax();
>>>> 		smp_rmb();
>>>> 	}
>>>> }
>>>>
>>>> static void text_poke_sync_cpu(void *dummy)
>>>> {
>>>> 	(void)dummy;
>>>>
>>>> 	smp_rmb();
>>>> 	cpumask_clear_cpu(&text_poke_cpumask, smp_processor_id());
>>>> 	/*
>>>> 	 * We are guaranteed to return with an IRET, either from the
>>>> 	 * IPI or the #BP handler; this provides serialization.
>>>> 	 */
>>>> }
>>>
>>> The invariants here are:
>>>
>>> 1. The patching routine must set each bit in the cpumask after each event
>>>    that requires synchronization is complete.
>>> 2. The bit can be (atomically) cleared on the target CPU only, and only in a
>>>    place that guarantees a synchronizing event (e.g. IRET) before it may
>>>    reach the poked instruction.
>>> 3. At a minimum the IPI handler and #BP handler needs to clear the bit.  It
>>>    *is* also possible to clear it in other places, e.g. the NMI handler, if
>>>    necessary as long as condition 2 is satisfied.
>>
>> OK, so with interrupts enabled *on the processor doing the patching* we
>> still have a problem if it takes an interrupt which in turn takes a #BP.
>> Disabling interrupts would not help, because an NMI and #MC could
>> still cause problems unless we can guarantee that no path which may be
>> invoked by NMI/#MC can do text_poke, which seems to be a very aggressive
>> assumption.
>>
>> Note: I am assuming preemption is disabled.
>>
>> The easiest/sanest way to deal with this might be to switch the IDT (or
>> provide a hook in the generic exception entry code) on the patching
>> processor, such that if an asynchronous event comes in, we either roll
>> forward or revert.  This is doable because the second sync we currently
>> do is not actually necessary per the hardware guys.
>
> This is IMO insanely complicated.  I much prefer the kind of
> complexity that is more or less deterministic and easy to test to the
> kind of complexity (like this) that only happens in corner cases.
>
> I see two solutions here:
>
> 1. Just suck it up and emulate the CALL.  And find a way to write a
> test case so we know it works.
>
> 2. Find a non-deadlocky way to make the breakpoint handler wait for
> the breakpoint to get removed, without any mucking at all with the
> entry code.  And find a way to write a test case so we know it works.
> (E.g. stick an actual static_call call site *in text_poke_bp()* that
> fires once on boot so that the really awful recursive case gets
> exercised all the time.)
>
> But if we're going to do any mucking with the entry code, let's just
> do the simple mucking to make emulating CALL work.
>
> --Andy

Ugh.  So much for not really proofreading.  Yes, I think the second
solution is the right thing since I think I figured out how to do it
without deadlock; see other mail.

-- 
Sent from my Android device with K-9 Mail.
Re: [PATCH v3 0/6] Static calls
> On Jan 14, 2019, at 3:27 PM, Andy Lutomirski wrote:
>
> On Mon, Jan 14, 2019 at 2:01 PM H. Peter Anvin wrote:
>> So I was already in the middle of composing this message when Andy posted:
>>
>>> I don't even think this is sufficient.  I think we also need everyone
>>> who clears the bit to check if all bits are clear and, if so, remove
>>> the breakpoint.  Otherwise we have a situation where, if you are in
>>> text_poke_bp() and you take an NMI (or interrupt or MCE or whatever)
>>> and that interrupt then hits the breakpoint, then you deadlock because
>>> no one removes the breakpoint.
>>>
>>> If we do this, and if we can guarantee that all CPUs make forward
>>> progress, then maybe the problem is solved.  Can we guarantee something
>>> like all NMI handlers that might wait in a spinlock or for any other
>>> reason will periodically check if a sync is needed while they're
>>> spinning?
>>
>> So the really, really nasty case is when an asynchronous event on the
>> *patching* processor gets stuck spinning on a resource which is
>> unavailable due to another processor spinning on the #BP.  We can disable
>> interrupts, but we can't stop NMIs from coming in (although we could
>> test in the NMI handler if we are in that condition and return
>> immediately; I'm not sure we want to do that, and we still have to deal
>> with #MC and what not.)
>>
>> The fundamental problem here is that we don't see the #BP on the
>> patching processor, in which case we could simply complete the patching
>> from the #BP handler on that processor.
>>
>> On 1/13/19 6:40 PM, H. Peter Anvin wrote:
>>> On 1/13/19 6:31 PM, H. Peter Anvin wrote:
>>>>
>>>> static cpumask_t text_poke_cpumask;
>>>>
>>>> static void text_poke_sync(void)
>>>> {
>>>> 	smp_wmb();
>>>> 	text_poke_cpumask = cpu_online_mask;
>>>> 	smp_wmb();	/* Should be optional on x86 */
>>>> 	cpumask_clear_cpu(&text_poke_cpumask, smp_processor_id());
>>>> 	on_each_cpu_mask(&text_poke_cpumask, text_poke_sync_cpu, NULL, false);
>>>> 	while (!cpumask_empty(&text_poke_cpumask)) {
>>>> 		cpu_relax();
>>>> 		smp_rmb();
>>>> 	}
>>>> }
>>>>
>>>> static void text_poke_sync_cpu(void *dummy)
>>>> {
>>>> 	(void)dummy;
>>>>
>>>> 	smp_rmb();
>>>> 	cpumask_clear_cpu(&text_poke_cpumask, smp_processor_id());
>>>> 	/*
>>>> 	 * We are guaranteed to return with an IRET, either from the
>>>> 	 * IPI or the #BP handler; this provides serialization.
>>>> 	 */
>>>> }
>>>
>>> The invariants here are:
>>>
>>> 1. The patching routine must set each bit in the cpumask after each event
>>>    that requires synchronization is complete.
>>> 2. The bit can be (atomically) cleared on the target CPU only, and only in a
>>>    place that guarantees a synchronizing event (e.g. IRET) before it may
>>>    reach the poked instruction.
>>> 3. At a minimum the IPI handler and #BP handler needs to clear the bit.  It
>>>    *is* also possible to clear it in other places, e.g. the NMI handler, if
>>>    necessary as long as condition 2 is satisfied.
>>
>> OK, so with interrupts enabled *on the processor doing the patching* we
>> still have a problem if it takes an interrupt which in turn takes a #BP.
>> Disabling interrupts would not help, because an NMI and #MC could
>> still cause problems unless we can guarantee that no path which may be
>> invoked by NMI/#MC can do text_poke, which seems to be a very aggressive
>> assumption.
>>
>> Note: I am assuming preemption is disabled.
>>
>> The easiest/sanest way to deal with this might be to switch the IDT (or
>> provide a hook in the generic exception entry code) on the patching
>> processor, such that if an asynchronous event comes in, we either roll
>> forward or revert.  This is doable because the second sync we currently
>> do is not actually necessary per the hardware guys.
>
> This is IMO insanely complicated.  I much prefer the kind of
> complexity that is more or less deterministic and easy to test to the
> kind of complexity (like this) that only happens in corner cases.
>
> I see two solutions here:
>
> 1. Just suck it up and emulate the CALL.  And find a way to write a
> test case so we know it works.
>
> 2. Find a non-deadlocky way to make the breakpoint handler wait for
> the breakpoint to get removed, without any mucking at all with the
> entry code.  And find a way to write a test case so we know it works.
> (E.g. stick an actual static_call call site *in text_poke_bp()* that
> fires once on boot so that the really awful recursive case gets
> exercised all the time.)
>
> But if we're going to do any mucking with the entry code, let's just
> do the simple mucking to make emulating CALL work.

These two approaches still seem more complicated to me than having a
trampoline as backup, which is patched dynamically.  IIUC, the current
pushback against this option is that it makes batching more
complicated.  I am not sure it is that bad, but there are other
variants of this solution, for example using an
Re: [PATCH v3 0/6] Static calls
On Mon, Jan 14, 2019 at 2:01 PM H. Peter Anvin wrote:
>
> So I was already in the middle of composing this message when Andy posted:
>
> > I don't even think this is sufficient.  I think we also need everyone
> > who clears the bit to check if all bits are clear and, if so, remove
> > the breakpoint.  Otherwise we have a situation where, if you are in
> > text_poke_bp() and you take an NMI (or interrupt or MCE or whatever)
> > and that interrupt then hits the breakpoint, then you deadlock because
> > no one removes the breakpoint.
> >
> > If we do this, and if we can guarantee that all CPUs make forward
> > progress, then maybe the problem is solved.  Can we guarantee something
> > like all NMI handlers that might wait in a spinlock or for any other
> > reason will periodically check if a sync is needed while they're
> > spinning?
>
> So the really, really nasty case is when an asynchronous event on the
> *patching* processor gets stuck spinning on a resource which is
> unavailable due to another processor spinning on the #BP.  We can disable
> interrupts, but we can't stop NMIs from coming in (although we could
> test in the NMI handler if we are in that condition and return
> immediately; I'm not sure we want to do that, and we still have to deal
> with #MC and what not.)
>
> The fundamental problem here is that we don't see the #BP on the
> patching processor, in which case we could simply complete the patching
> from the #BP handler on that processor.
>
> On 1/13/19 6:40 PM, H. Peter Anvin wrote:
> > On 1/13/19 6:31 PM, H. Peter Anvin wrote:
> >>
> >> static cpumask_t text_poke_cpumask;
> >>
> >> static void text_poke_sync(void)
> >> {
> >> 	smp_wmb();
> >> 	text_poke_cpumask = cpu_online_mask;
> >> 	smp_wmb();	/* Should be optional on x86 */
> >> 	cpumask_clear_cpu(&text_poke_cpumask, smp_processor_id());
> >> 	on_each_cpu_mask(&text_poke_cpumask, text_poke_sync_cpu, NULL, false);
> >> 	while (!cpumask_empty(&text_poke_cpumask)) {
> >> 		cpu_relax();
> >> 		smp_rmb();
> >> 	}
> >> }
> >>
> >> static void text_poke_sync_cpu(void *dummy)
> >> {
> >> 	(void)dummy;
> >>
> >> 	smp_rmb();
> >> 	cpumask_clear_cpu(&text_poke_cpumask, smp_processor_id());
> >> 	/*
> >> 	 * We are guaranteed to return with an IRET, either from the
> >> 	 * IPI or the #BP handler; this provides serialization.
> >> 	 */
> >> }
> >
> > The invariants here are:
> >
> > 1. The patching routine must set each bit in the cpumask after each event
> >    that requires synchronization is complete.
> > 2. The bit can be (atomically) cleared on the target CPU only, and only in a
> >    place that guarantees a synchronizing event (e.g. IRET) before it may
> >    reach the poked instruction.
> > 3. At a minimum the IPI handler and #BP handler needs to clear the bit.  It
> >    *is* also possible to clear it in other places, e.g. the NMI handler, if
> >    necessary as long as condition 2 is satisfied.
>
> OK, so with interrupts enabled *on the processor doing the patching* we
> still have a problem if it takes an interrupt which in turn takes a #BP.
> Disabling interrupts would not help, because an NMI and #MC could
> still cause problems unless we can guarantee that no path which may be
> invoked by NMI/#MC can do text_poke, which seems to be a very aggressive
> assumption.
>
> Note: I am assuming preemption is disabled.
>
> The easiest/sanest way to deal with this might be to switch the IDT (or
> provide a hook in the generic exception entry code) on the patching
> processor, such that if an asynchronous event comes in, we either roll
> forward or revert.  This is doable because the second sync we currently
> do is not actually necessary per the hardware guys.

This is IMO insanely complicated.  I much prefer the kind of
complexity that is more or less deterministic and easy to test to the
kind of complexity (like this) that only happens in corner cases.

I see two solutions here:

1. Just suck it up and emulate the CALL.  And find a way to write a
test case so we know it works.

2. Find a non-deadlocky way to make the breakpoint handler wait for
the breakpoint to get removed, without any mucking at all with the
entry code.  And find a way to write a test case so we know it works.
(E.g. stick an actual static_call call site *in text_poke_bp()* that
fires once on boot so that the really awful recursive case gets
exercised all the time.)

But if we're going to do any mucking with the entry code, let's just
do the simple mucking to make emulating CALL work.

--Andy
Re: [PATCH v3 0/6] Static calls
I think this sequence ought to work (keep in mind we are already under a
mutex, so the global data is safe even if we are preempted):

	set up page table entries
	invlpg
	set up bp patching global data

	cpu = get_cpu()

	bp_old_value = atomic_read(bp_write_addr)

	do {
		atomic_write(&bp_poke_state, 1)

		atomic_write(bp_write_addr, 0xcc)

		mask <- online_cpu_mask - self
		send IPIs
		wait for mask = 0

	} while (cmpxchg(&bp_poke_state, 1, 2) != 1);

	patch sites, remove breakpoints after patching each one

	atomic_write(&bp_poke_state, 3);

	mask <- online_cpu_mask - self
	send IPIs
	wait for mask = 0

	atomic_write(&bp_poke_state, 0);

	tear down patching global data
	tear down page table entries

The #BP handler would then look like:

	state = cmpxchg(&bp_poke_state, 1, 4);
	switch (state) {
	case 1:
	case 4:
		invlpg
		cmpxchg(bp_write_addr, 0xcc, bp_old_value)
		break;
	case 2:
		invlpg
		complete patch sequence
		remove breakpoint
		break;
	case 3:
		/* If we are here, the #BP will go away on its own */
		break;
	case 0:
		/* No patching in progress!!! */
		return 0;
	}

	clear bit in mask
	return 1;

The IPI handler:

	clear bit in mask
	sync_core	/* Needed if multiple IPI events are chained */
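For illustration, the core state machine of this sequence can be modeled
in userspace with C11 atomics.  This is a sketch, not kernel code: the
page-table setup, IPIs, and mask-wait are elided, and poke_state, insn,
and the byte values are stand-ins for the kernel-side globals.

```c
#include <assert.h>
#include <stdatomic.h>

/*
 * States: 0 idle, 1 breakpoint armed (cancelable), 2 committed,
 * 3 patched but not yet fully synced, 4 canceled (redo required).
 */
static _Atomic int poke_state;		/* stand-in for bp_poke_state */
static _Atomic unsigned char insn;	/* one byte of "text" */
static unsigned char old_byte, new_byte;

/* Patcher side: loop until the 1 -> 2 transition wins the race. */
static void patcher(void)
{
	int expected;

	old_byte = atomic_load(&insn);
	do {
		atomic_store(&poke_state, 1);
		atomic_store(&insn, 0xcc);	/* arm the breakpoint */
		/* ...IPIs and mask-wait elided in this sketch... */
		expected = 1;
	} while (!atomic_compare_exchange_strong(&poke_state, &expected, 2));

	atomic_store(&insn, new_byte);		/* complete the patch */
	atomic_store(&poke_state, 3);
	/* ...second sync elided... */
	atomic_store(&poke_state, 0);
}

/* #BP-handler side: try to cancel an in-progress arm (1 -> 4). */
static int bp_handler(void)
{
	int expected = 1;

	if (atomic_compare_exchange_strong(&poke_state, &expected, 4)) {
		atomic_store(&insn, old_byte);	/* restore original byte */
		return 1;
	}
	return expected == 0 ? 0 : 1;	/* 0 means no patching active */
}
```

The do/while mirrors the key property of the sequence: if a #BP handler
cancels the arm (state 4), the patcher's cmpxchg from 1 to 2 fails and
the whole arm-and-sync step is redone.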
Re: [PATCH v3 0/6] Static calls
So I was already in the middle of composing this message when Andy posted:

> I don't even think this is sufficient.  I think we also need everyone
> who clears the bit to check if all bits are clear and, if so, remove
> the breakpoint.  Otherwise we have a situation where, if you are in
> text_poke_bp() and you take an NMI (or interrupt or MCE or whatever)
> and that interrupt then hits the breakpoint, then you deadlock because
> no one removes the breakpoint.
>
> If we do this, and if we can guarantee that all CPUs make forward
> progress, then maybe the problem is solved.  Can we guarantee something
> like all NMI handlers that might wait in a spinlock or for any other
> reason will periodically check if a sync is needed while they're
> spinning?

So the really, really nasty case is when an asynchronous event on the
*patching* processor gets stuck spinning on a resource which is
unavailable due to another processor spinning on the #BP.  We can disable
interrupts, but we can't stop NMIs from coming in (although we could
test in the NMI handler if we are in that condition and return
immediately; I'm not sure we want to do that, and we still have to deal
with #MC and what not.)

The fundamental problem here is that we don't see the #BP on the
patching processor, in which case we could simply complete the patching
from the #BP handler on that processor.

On 1/13/19 6:40 PM, H. Peter Anvin wrote:
> On 1/13/19 6:31 PM, H. Peter Anvin wrote:
>>
>> static cpumask_t text_poke_cpumask;
>>
>> static void text_poke_sync(void)
>> {
>> 	smp_wmb();
>> 	text_poke_cpumask = cpu_online_mask;
>> 	smp_wmb();	/* Should be optional on x86 */
>> 	cpumask_clear_cpu(&text_poke_cpumask, smp_processor_id());
>> 	on_each_cpu_mask(&text_poke_cpumask, text_poke_sync_cpu, NULL, false);
>> 	while (!cpumask_empty(&text_poke_cpumask)) {
>> 		cpu_relax();
>> 		smp_rmb();
>> 	}
>> }
>>
>> static void text_poke_sync_cpu(void *dummy)
>> {
>> 	(void)dummy;
>>
>> 	smp_rmb();
>> 	cpumask_clear_cpu(&text_poke_cpumask, smp_processor_id());
>> 	/*
>> 	 * We are guaranteed to return with an IRET, either from the
>> 	 * IPI or the #BP handler; this provides serialization.
>> 	 */
>> }
>
> The invariants here are:
>
> 1. The patching routine must set each bit in the cpumask after each event
>    that requires synchronization is complete.
> 2. The bit can be (atomically) cleared on the target CPU only, and only in a
>    place that guarantees a synchronizing event (e.g. IRET) before it may
>    reach the poked instruction.
> 3. At a minimum the IPI handler and #BP handler needs to clear the bit.  It
>    *is* also possible to clear it in other places, e.g. the NMI handler, if
>    necessary as long as condition 2 is satisfied.

OK, so with interrupts enabled *on the processor doing the patching* we
still have a problem if it takes an interrupt which in turn takes a #BP.
Disabling interrupts would not help, because an NMI and #MC could
still cause problems unless we can guarantee that no path which may be
invoked by NMI/#MC can do text_poke, which seems to be a very aggressive
assumption.

Note: I am assuming preemption is disabled.

The easiest/sanest way to deal with this might be to switch the IDT (or
provide a hook in the generic exception entry code) on the patching
processor, such that if an asynchronous event comes in, we either roll
forward or revert.  This is doable because the second sync we currently
do is not actually necessary per the hardware guys.

If we take that #BP during the breakpoint deployment phase -- that is,
before the first sync has completed -- restore the previous value of the
breakpoint byte.  Upon return text_poke_bp() will then have to loop back
to the beginning and do it again.

If we take the #BP after that point, we can complete the patch in the
normal manner, by writing the rest of the instruction and then removing
the #BP.  text_poke_bp() will complete the synchronization sequence on
return, but if another processor is spinning and sees the breakpoint
having been removed, it is good to go.

Right now we do completely unnecessary setup and teardown of the PDT
entries for each phase of the patching.  This would have to be removed,
so that the asynchronous event handler will always be able to do the
roll forward/roll back as required.

If this is unpalatable, the solution you touched on is probably also
doable, but I need to think *really* carefully about the sequencing
constraints, because now you also have to worry about events
interrupting a patch in progress but not completed.  It would however
have the advantage that an arbitrary interrupt on the patching processor
is unlikely to cause a rollback, and so would be safer to execute with
interrupts enabled without causing a livelock.  Now, you can't just
remove the breakpoint; you have to make sure that if you do, during the
first phase text_poke_bp() will loop, and in the second phase complete
the
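For illustration, the roll-forward / roll-back rule described above can
be sketched as a small userspace model.  This is hypothetical, not
kernel code: the phase enum, insn, and the byte values are stand-ins,
and the actual write-through-alias mechanics are elided.

```c
#include <assert.h>
#include <stdatomic.h>

enum phase { DEPLOYING, DEPLOYED };	/* before/after the first sync */

static _Atomic unsigned char insn;	/* one byte of "text" */
static unsigned char old_byte = 0xE8, new_byte = 0x90;

/*
 * Async-event hook on the patching CPU: while still deploying the
 * breakpoint, roll back (text_poke_bp() will loop and redo it);
 * once the first sync has completed, roll forward by finishing
 * the patch.
 */
static int handle_async_event(enum phase p)
{
	if (p == DEPLOYING) {
		atomic_store(&insn, old_byte);	/* revert breakpoint byte */
		return 0;			/* caller must redo the patch */
	}
	atomic_store(&insn, new_byte);		/* complete the patch */
	return 1;
}
```

Either way the invariant holds: after the handler runs, the text never
contains a stale 0xcc that nobody is responsible for removing.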
Re: [PATCH v3 0/6] Static calls
On Sun, Jan 13, 2019 at 6:41 PM H. Peter Anvin wrote:
>
> On 1/13/19 6:31 PM, H. Peter Anvin wrote:
> >
> > static cpumask_t text_poke_cpumask;
> >
> > static void text_poke_sync(void)
> > {
> > 	smp_wmb();
> > 	text_poke_cpumask = cpu_online_mask;
> > 	smp_wmb();	/* Should be optional on x86 */
> > 	cpumask_clear_cpu(&text_poke_cpumask, smp_processor_id());
> > 	on_each_cpu_mask(&text_poke_cpumask, text_poke_sync_cpu, NULL, false);
> > 	while (!cpumask_empty(&text_poke_cpumask)) {
> > 		cpu_relax();
> > 		smp_rmb();
> > 	}
> > }
> >
> > static void text_poke_sync_cpu(void *dummy)
> > {
> > 	(void)dummy;
> >
> > 	smp_rmb();
> > 	cpumask_clear_cpu(&text_poke_cpumask, smp_processor_id());
> > 	/*
> > 	 * We are guaranteed to return with an IRET, either from the
> > 	 * IPI or the #BP handler; this provides serialization.
> > 	 */
> > }
>
> The invariants here are:
>
> 1. The patching routine must set each bit in the cpumask after each event
>    that requires synchronization is complete.
> 2. The bit can be (atomically) cleared on the target CPU only, and only in a
>    place that guarantees a synchronizing event (e.g. IRET) before it may
>    reach the poked instruction.
> 3. At a minimum the IPI handler and #BP handler needs to clear the bit.  It
>    *is* also possible to clear it in other places, e.g. the NMI handler, if
>    necessary as long as condition 2 is satisfied.

I don't even think this is sufficient.  I think we also need everyone
who clears the bit to check if all bits are clear and, if so, remove
the breakpoint.  Otherwise we have a situation where, if you are in
text_poke_bp() and you take an NMI (or interrupt or MCE or whatever)
and that interrupt then hits the breakpoint, then you deadlock because
no one removes the breakpoint.

If we do this, and if we can guarantee that all CPUs make forward
progress, then maybe the problem is solved.  Can we guarantee something
like all NMI handlers that might wait in a spinlock or for any other
reason will periodically check if a sync is needed while they're
spinning?

--Andy
Re: [PATCH v3 0/6] Static calls
On Fri, Jan 11, 2019 at 01:05:20PM -0800, Andy Lutomirski wrote:
> On Fri, Jan 11, 2019 at 12:54 PM Linus Torvalds wrote:
> >
> > On Fri, Jan 11, 2019 at 12:31 PM Josh Poimboeuf wrote:
> > >
> > > I was referring to the fact that a single static call key update will
> > > usually result in patching multiple call sites.  But you're right, it's
> > > only 1-2 trampolines per text_poke_bp() invocation.  Though eventually
> > > we may want to batch all the writes like what Daniel has proposed for
> > > jump labels, to reduce IPIs.
> >
> > Yeah, my suggestion doesn't allow for batching, since it would
> > basically generate one trampoline for every rewritten instruction.
>
> Sure it does.  Just make 1000 trampolines and patch 1000 sites in a
> batch :)  As long as the number of trampolines is smallish (e.g. fits
> in a page), then we should be in good shape.

Much easier still would be to make the ARCH_DEFINE_STATIC_TRAMP thing
generate the two trampolines per callsite and simply keep them around.

Another advantage is that you then only have to patch the JMP target,
since the return address will always stay the same (since these things
are generated per call-site).

Anyway... the STI-shadow thing is very clever.  But I'm with Josh in
that I think I prefer the IRET frame offset thing -- but yes, I've read
Linus' argument against that.
Re: [PATCH v3 0/6] Static calls
On 1/13/19 6:31 PM, H. Peter Anvin wrote:
>
> static cpumask_t text_poke_cpumask;
>
> static void text_poke_sync(void)
> {
> 	smp_wmb();
> 	text_poke_cpumask = cpu_online_mask;
> 	smp_wmb();	/* Should be optional on x86 */
> 	cpumask_clear_cpu(&text_poke_cpumask, smp_processor_id());
> 	on_each_cpu_mask(&text_poke_cpumask, text_poke_sync_cpu, NULL, false);
> 	while (!cpumask_empty(&text_poke_cpumask)) {
> 		cpu_relax();
> 		smp_rmb();
> 	}
> }
>
> static void text_poke_sync_cpu(void *dummy)
> {
> 	(void)dummy;
>
> 	smp_rmb();
> 	cpumask_clear_cpu(&text_poke_cpumask, smp_processor_id());
> 	/*
> 	 * We are guaranteed to return with an IRET, either from the
> 	 * IPI or the #BP handler; this provides serialization.
> 	 */
> }

The invariants here are:

1. The patching routine must set each bit in the cpumask after each event
   that requires synchronization is complete.
2. The bit can be (atomically) cleared on the target CPU only, and only in a
   place that guarantees a synchronizing event (e.g. IRET) before it may
   reach the poked instruction.
3. At a minimum the IPI handler and #BP handler needs to clear the bit.  It
   *is* also possible to clear it in other places, e.g. the NMI handler, if
   necessary as long as condition 2 is satisfied.

	-hpa
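For illustration, invariants 1 and 2 can be modeled sequentially in
userspace (an atomic word stands in for the cpumask, and the handler
calls stand in for the IPIs; the real code is concurrent, and the
serializing IRET is only implied here):

```c
#include <assert.h>
#include <stdatomic.h>

#define NCPUS 4u
static _Atomic unsigned int sync_mask;	/* stand-in for the cpumask */

/*
 * Invariant 2: only CPU 'cpu' clears its own bit, and only at a point
 * that implies a serializing event (IRET) before it can reach the
 * poked instruction.
 */
static void ipi_or_bp_handler(unsigned int cpu)
{
	atomic_fetch_and(&sync_mask, ~(1u << cpu));
}

/*
 * Invariant 1: the patcher sets every bit except its own, then waits
 * for each bit to be cleared by its owning CPU.
 */
static void text_poke_sync_model(unsigned int self)
{
	atomic_store(&sync_mask, ((1u << NCPUS) - 1u) & ~(1u << self));
	for (unsigned int cpu = 0; cpu < NCPUS; cpu++)
		if (cpu != self)
			ipi_or_bp_handler(cpu);	/* "send IPIs" */
	while (atomic_load(&sync_mask) != 0u)
		;				/* cpu_relax() in the kernel */
}
```

Because each bit has exactly one owner, it does not matter whether the
clear happens in the IPI handler, the #BP handler, or (per invariant 3)
the NMI handler; the patcher's wait loop terminates once every non-self
CPU has passed through some serializing return.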
Re: [PATCH v3 0/6] Static calls
On 1/11/19 11:39 AM, Jiri Kosina wrote:
> On Fri, 11 Jan 2019, h...@zytor.com wrote:
>
>> I still don't see why we can't simply spin in the #BP handler until the
>> patch is complete.
>
> I think this brings us to the already discussed possible deadlock, when
> CPU#0 is in the middle of text_poke_bp(), CPU#1 is spinning inside
> spin_lock_irq*() and CPU#2 hits the breakpoint while holding that very
> 'lock'.
>
> Then we're stuck forever, because CPU#1 will never handle the pending
> sync_core() IPI (it's not NMI).
>
> Or have I misunderstood what you meant?

OK, I was thinking about this quite a while ago, and even started hacking on it, but apparently I managed to forget some key details.

Specifically, you do *not* want to use the acknowledgment of the IPI as the blocking condition, so don't use a waiting IPI. Instead, you want a CPU bitmap (or percpu variable) that the IPI handler clears. When you are spinning in the #BP handler, you *also* need to clear that bit. Doing so is safe even in the case of batched updates, because you are guaranteed to execute an IRET before you get to patched code.

So the synchronization part of the patching routine becomes:

static cpumask_t text_poke_cpumask;

static void text_poke_sync(void)
{
	smp_wmb();
	text_poke_cpumask = cpu_online_mask;
	smp_wmb();	/* Optional on x86 */
	cpumask_clear_cpu(smp_processor_id(), &text_poke_cpumask);
	on_each_cpu_mask(&text_poke_cpumask, text_poke_sync_cpu, NULL, false);
	while (!cpumask_empty(&text_poke_cpumask)) {
		cpu_relax();
		smp_rmb();
	}
}

static void text_poke_sync_cpu(void *dummy)
{
	(void)dummy;

	smp_rmb();
	cpumask_clear_cpu(smp_processor_id(), &text_poke_cpumask);
	/*
	 * We are guaranteed to return with an IRET, either from the
	 * IPI or the #BP handler; this provides serialization.
	 */
}

The spin routine then needs to add a call to do something like this.
By (optionally) not comparing to a specific breakpoint address we allow for batching, but we may end up spinning on a breakpoint that is not actually a patching breakpoint until the patching is done.

int poke_int3_handler(struct pt_regs *regs)
{
	/* In the current handler, but shouldn't be needed... */
	smp_rmb();

	if (likely(!atomic_read(&bp_patching_in_progress)))
		return 0;

	if (unlikely(!user_mode(regs) &&
		     atomic_read(&bp_patching_in_progress))) {
		text_poke_sync();
		regs->ip--;
		return 1;	/* May end up retaking the trap */
	} else {
		return 0;
	}
}

Unless I'm totally mistaken, the worst thing that will happen with this code is that it may end up taking a harmless spurious IPI at a later time.

-hpa
Re: [PATCH v3 0/6] Static calls
On January 11, 2019 11:34:34 AM PST, Linus Torvalds wrote: >On Fri, Jan 11, 2019 at 11:24 AM wrote: >> >> I still don't see why we can't simply spin in the #BP handler until the >patch is complete. > >So here's at least one problem: > >text_poke_bp() > text_poke(addr, &int3, sizeof(int3)); > *interrupt* > interrupt has a static call >*BP* > poke_int3_handler > *BOOM* > >Note how at BOOM we cannot just spin (or return) to wait for the >'int3' to be switched back. Because it never will. Because we are >interrupting the thing that would do that switch-back. > >So we'd have to do the 'text_poke_bp()' sequence with interrupts >disabled. Which we can't do right now at least, because part of that >sequence involves that on_each_cpu(do_sync_core) thing, which needs >interrupts enabled. > >See? > >Or am I missing something? > >Linus Ok, I was thinking far more about spinning with an IRET and letting the exception be delivered. Patching with interrupts disabled has other problems... -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
Re: [PATCH v3 0/6] Static calls
On January 11, 2019 11:34:34 AM PST, Linus Torvalds wrote: >On Fri, Jan 11, 2019 at 11:24 AM wrote: >> >> I still don't see why we can't simply spin in the #BP handler until the >patch is complete. > >So here's at least one problem: > >text_poke_bp() > text_poke(addr, &int3, sizeof(int3)); > *interrupt* > interrupt has a static call >*BP* > poke_int3_handler > *BOOM* > >Note how at BOOM we cannot just spin (or return) to wait for the >'int3' to be switched back. Because it never will. Because we are >interrupting the thing that would do that switch-back. > >So we'd have to do the 'text_poke_bp()' sequence with interrupts >disabled. Which we can't do right now at least, because part of that >sequence involves that on_each_cpu(do_sync_core) thing, which needs >interrupts enabled. > >See? > >Or am I missing something? > >Linus Let me think about it. -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
Re: [PATCH v3 0/6] Static calls
On Fri, Jan 11, 2019 at 1:22 PM Josh Poimboeuf wrote: > > On Fri, Jan 11, 2019 at 12:46:39PM -0800, Linus Torvalds wrote: > > On Fri, Jan 11, 2019 at 12:31 PM Josh Poimboeuf wrote: > > > > > > I was referring to the fact that a single static call key update will > > > usually result in patching multiple call sites. But you're right, it's > > > only 1-2 trampolines per text_poke_bp() invocation. Though eventually > > > we may want to batch all the writes like what Daniel has proposed for > > > jump labels, to reduce IPIs. > > > > Yeah, my suggestion doesn't allow for batching, since it would > > basically generate one trampoline for every rewritten instruction. > > As Andy said, I think batching would still be possible, it's just that > we'd have to create multiple trampolines at a time. > > Or... we could do a hybrid approach: create a single custom trampoline > which has the call destination patched in, but put the return address in > %rax -- which is always clobbered, even for callee-saved PV ops. Like: > One thing I particularly like about the current design is that there are no requirements at all on the calling convention. I think it seems fragile to add a calling convention constraint that only applies when there's a race. I'd rather do a longjmp-like hack or a stack gap adding hack than make the actual static calls more fragile.
Re: [PATCH v3 0/6] Static calls
> On Jan 11, 2019, at 1:55 PM, Steven Rostedt wrote: > > On Fri, 11 Jan 2019 15:41:22 -0600 > Josh Poimboeuf wrote: > >>> I don’t see RCU-sched solves the problem if you don’t disable preemption. On >>> a fully preemptable kernel, you can get preempted between the push and the >>> call (jmp) or before the push. RCU-sched can then finish, and the preempted >>> task may later jump to a wrong patched-dest. >> >> Argh, I misspoke about RCU-sched. Words are hard. >> >> I meant synchronize_rcu_tasks(), which is a completely different animal. >> My understanding is that it waits until all runnable tasks (including >> preempted tasks) have gotten a chance to run. > > Not quite, but does the same thing. It waits for all tasks to either > schedule voluntarily (not preempted), or be / go into idle, or be /go > into userspace. In any case, it makes sure code is off of trampolines. > I use this before freeing trampolines used by ftrace. Interesting. So my last email is completely wrong.
Re: [PATCH v3 0/6] Static calls
> On Jan 11, 2019, at 1:41 PM, Josh Poimboeuf wrote: > > On Fri, Jan 11, 2019 at 09:36:59PM +, Nadav Amit wrote: >>> On Jan 11, 2019, at 1:22 PM, Josh Poimboeuf wrote: >>> >>> On Fri, Jan 11, 2019 at 12:46:39PM -0800, Linus Torvalds wrote: On Fri, Jan 11, 2019 at 12:31 PM Josh Poimboeuf wrote: > I was referring to the fact that a single static call key update will > usually result in patching multiple call sites. But you're right, it's > only 1-2 trampolines per text_poke_bp() invocation. Though eventually > we may want to batch all the writes like what Daniel has proposed for > jump labels, to reduce IPIs. Yeah, my suggestion doesn't allow for batching, since it would basically generate one trampoline for every rewritten instruction. >>> >>> As Andy said, I think batching would still be possible, it's just that >>> we'd have to create multiple trampolines at a time. >>> >>> Or... we could do a hybrid approach: create a single custom trampoline >>> which has the call destination patched in, but put the return address in >>> %rax -- which is always clobbered, even for callee-saved PV ops. Like: >>> >>> trampoline: >>> push %rax >>> call patched-dest >>> >>> That way the batching could be done with a single trampoline >>> (particularly if using rcu-sched to avoid the sti hack). >> >> I don’t see RCU-sched solves the problem if you don’t disable preemption. On >> a fully preemptable kernel, you can get preempted between the push and the >> call (jmp) or before the push. RCU-sched can then finish, and the preempted >> task may later jump to a wrong patched-dest. > > Argh, I misspoke about RCU-sched. Words are hard. > > I meant synchronize_rcu_tasks(), which is a completely different animal. > My understanding is that it waits until all runnable tasks (including > preempted tasks) have gotten a chance to run. Actually, I just used the term you used, and thought about synchronize_sched(). 
If you look at my patch [1], you’ll see I did something similar using synchronize_sched(). But this required some delicate work of restarting any preempted “optpoline” (or whatever name you want) block. [Note that my implementation has a terrible bug in this respect]. This is required since running a preempted task does not prevent it from being preempted again before making any “real” progress. If we want to adapt the same solution to static_calls, this means that in retint_kernel (entry_64.S), you need to check whether you got preempted inside the trampoline and, in that case, change the saved RIP back to the static_call site. IMHO, sti+jmp is simpler. [1] https://lore.kernel.org/lkml/20181231072112.21051-6-na...@vmware.com/
Re: [PATCH v3 0/6] Static calls
On Fri, 11 Jan 2019 15:41:22 -0600 Josh Poimboeuf wrote: > > I don’t see RCU-sched solves the problem if you don’t disable preemption. On > > a fully preemptable kernel, you can get preempted between the push and the > > call (jmp) or before the push. RCU-sched can then finish, and the preempted > > task may later jump to a wrong patched-dest. > > Argh, I misspoke about RCU-sched. Words are hard. > > I meant synchronize_rcu_tasks(), which is a completely different animal. > My understanding is that it waits until all runnable tasks (including > preempted tasks) have gotten a chance to run. Not quite, but does the same thing. It waits for all tasks to either schedule voluntarily (not preempted), or be / go into idle, or be /go into userspace. In any case, it makes sure code is off of trampolines. I use this before freeing trampolines used by ftrace. -- Steve
Re: [PATCH v3 0/6] Static calls
On Fri, Jan 11, 2019 at 09:36:59PM +, Nadav Amit wrote: > > On Jan 11, 2019, at 1:22 PM, Josh Poimboeuf wrote: > > > > On Fri, Jan 11, 2019 at 12:46:39PM -0800, Linus Torvalds wrote: > >> On Fri, Jan 11, 2019 at 12:31 PM Josh Poimboeuf > >> wrote: > >>> I was referring to the fact that a single static call key update will > >>> usually result in patching multiple call sites. But you're right, it's > >>> only 1-2 trampolines per text_poke_bp() invocation. Though eventually > >>> we may want to batch all the writes like what Daniel has proposed for > >>> jump labels, to reduce IPIs. > >> > >> Yeah, my suggestion doesn't allow for batching, since it would > >> basically generate one trampoline for every rewritten instruction. > > > > As Andy said, I think batching would still be possible, it's just that > > we'd have to create multiple trampolines at a time. > > > > Or... we could do a hybrid approach: create a single custom trampoline > > which has the call destination patched in, but put the return address in > > %rax -- which is always clobbered, even for callee-saved PV ops. Like: > > > > trampoline: > > push %rax > > call patched-dest > > > > That way the batching could be done with a single trampoline > > (particularly if using rcu-sched to avoid the sti hack). > > I don’t see RCU-sched solves the problem if you don’t disable preemption. On > a fully preemptable kernel, you can get preempted between the push and the > call (jmp) or before the push. RCU-sched can then finish, and the preempted > task may later jump to a wrong patched-dest. Argh, I misspoke about RCU-sched. Words are hard. I meant synchronize_rcu_tasks(), which is a completely different animal. My understanding is that it waits until all runnable tasks (including preempted tasks) have gotten a chance to run. -- Josh
Re: [PATCH v3 0/6] Static calls
> On Jan 11, 2019, at 1:22 PM, Josh Poimboeuf wrote: > > On Fri, Jan 11, 2019 at 12:46:39PM -0800, Linus Torvalds wrote: >> On Fri, Jan 11, 2019 at 12:31 PM Josh Poimboeuf wrote: >>> I was referring to the fact that a single static call key update will >>> usually result in patching multiple call sites. But you're right, it's >>> only 1-2 trampolines per text_poke_bp() invocation. Though eventually >>> we may want to batch all the writes like what Daniel has proposed for >>> jump labels, to reduce IPIs. >> >> Yeah, my suggestion doesn't allow for batching, since it would >> basically generate one trampoline for every rewritten instruction. > > As Andy said, I think batching would still be possible, it's just that > we'd have to create multiple trampolines at a time. > > Or... we could do a hybrid approach: create a single custom trampoline > which has the call destination patched in, but put the return address in > %rax -- which is always clobbered, even for callee-saved PV ops. Like: > > trampoline: > push %rax > call patched-dest > > That way the batching could be done with a single trampoline > (particularly if using rcu-sched to avoid the sti hack). I don’t see RCU-sched solves the problem if you don’t disable preemption. On a fully preemptable kernel, you can get preempted between the push and the call (jmp) or before the push. RCU-sched can then finish, and the preempted task may later jump to a wrong patched-dest.
Re: [PATCH v3 0/6] Static calls
On Fri, Jan 11, 2019 at 01:10:47PM -0800, Linus Torvalds wrote: > On Fri, Jan 11, 2019 at 1:05 PM Andy Lutomirski wrote: > > > > > Yeah, my suggestion doesn't allow for batching, since it would > > > basically generate one trampoline for every rewritten instruction. > > > > Sure it does. Just make 1000 trampolines and patch 1000 sites in a > > batch :) As long as the number of trampolines is smallish (e.g. fits > > in a page), then we should be in good shape. > > Yeah, I guess that's true. > > But let's not worry about it. I don't think we do batching for any > alternative instruction rewriting right now, do we? Not at the moment, but Daniel has been working on some patches to do so: https://lkml.kernel.org/r/cover.1545228276.git.bris...@redhat.com -- Josh
Re: [PATCH v3 0/6] Static calls
On Fri, Jan 11, 2019 at 03:22:10PM -0600, Josh Poimboeuf wrote: > b) Adding a gap to the #DB entry stack And that should be #BP of course... Not sure why my fingers keep doing that! -- Josh
Re: [PATCH v3 0/6] Static calls
On Fri, Jan 11, 2019 at 03:22:10PM -0600, Josh Poimboeuf wrote: > trampoline: > push %rax > call patched-dest That should be a JMP of course. -- Josh
Re: [PATCH v3 0/6] Static calls
On Fri, Jan 11, 2019 at 12:46:39PM -0800, Linus Torvalds wrote: > On Fri, Jan 11, 2019 at 12:31 PM Josh Poimboeuf wrote: > > > > I was referring to the fact that a single static call key update will > > usually result in patching multiple call sites. But you're right, it's > > only 1-2 trampolines per text_poke_bp() invocation. Though eventually > > we may want to batch all the writes like what Daniel has proposed for > > jump labels, to reduce IPIs. > > Yeah, my suggestion doesn't allow for batching, since it would > basically generate one trampoline for every rewritten instruction. As Andy said, I think batching would still be possible, it's just that we'd have to create multiple trampolines at a time. Or... we could do a hybrid approach: create a single custom trampoline which has the call destination patched in, but put the return address in %rax -- which is always clobbered, even for callee-saved PV ops. Like: trampoline: push %rax call patched-dest That way the batching could be done with a single trampoline (particularly if using rcu-sched to avoid the sti hack). If you don't completely hate that approach then I may try it. > > Regardless, the trampoline management seems more complex to me. But > > it's easier to argue about actual code, so maybe I'll code it up to make > > it easier to compare solutions. > > I do agree hat the stack games are likely "simpler" in one sense. I > just abhor playing those kinds of games with the entry code and entry > stack. > > A small bit of extra complexity in the code that actually does the > rewriting would be much more palatable to me than the complexity in > the entry code. I prefer seeing the onus of complexity being on the > code that introduces the problem, not on a innocent bystander. > > I'd like to say that our entry code actually looks fairly sane these > days. I'd _like_ to say that, but I'd be lying through my teeth if I > did. The macros we use make any normal persons head spin. 
> > The workaround for the stack corruption was fairly simple, but the > subtlety behind the *reason* for it was what made my hackles rise > about that code. > > The x86 entry code is some of the nastiest in the kernel, I feel, with > all the subtle interactions about odd stack switches, odd CPU bugs > causing odd TLB switches, NMI interactions etc etc. > > So I am fully cognizant that the workaround to shift the stack in the > entry code was just a couple of lines, and not very complicated. > > And I agree that I may be a bit oversensitive about that area, but it > really is one of those places where I go "damn, I think I know some > low-level x86 stuff better than most, but that code scares *me*". > > Which is why I'd accept a rather bigger complexity hit just about > anywhere else in the code... I agree that, to a certain extent, it can make sense to put the "onus of complexity" on the code that introduces the problem. But of course it's not an absolute rule, and should be considered in context with the relative complexities of the competing solutions. But I think where I see things differently is that: a) The entry code is, by far, in the best shape it's ever been [*], thanks to Andy's considerable efforts. I find it to be quite readable, but that might be due to many hours of intense study... [*] overlooking recent unavoidable meltdown/spectre hacks b) Adding a gap to the #DB entry stack is (in my mind) a simple localized change, which is easily understandable by a reader of the entry code -- assuming certain personality characteristics of a person whose life decisions have resulted in them reading entry code in the first place... c) Doing so is an order of magnitude simpler than the custom trampoline thing (or really any of the other many alternatives we've discussed). At least that's my feeling. -- Josh
Re: [PATCH v3 0/6] Static calls
On Fri, Jan 11, 2019 at 1:05 PM Andy Lutomirski wrote: > > > Yeah, my suggestion doesn't allow for batching, since it would > > basically generate one trampoline for every rewritten instruction. > > Sure it does. Just make 1000 trampolines and patch 1000 sites in a > batch :) As long as the number of trampolines is smallish (e.g. fits > in a page), then we should be in good shape. Yeah, I guess that's true. But let's not worry about it. I don't think we do batching for any alternative instruction rewriting right now, do we? Linus
Re: [PATCH v3 0/6] Static calls
On Fri, Jan 11, 2019 at 12:54 PM Linus Torvalds wrote: > > On Fri, Jan 11, 2019 at 12:31 PM Josh Poimboeuf wrote: > > > > I was referring to the fact that a single static call key update will > > usually result in patching multiple call sites. But you're right, it's > > only 1-2 trampolines per text_poke_bp() invocation. Though eventually > > we may want to batch all the writes like what Daniel has proposed for > > jump labels, to reduce IPIs. > > Yeah, my suggestion doesn't allow for batching, since it would > basically generate one trampoline for every rewritten instruction. Sure it does. Just make 1000 trampolines and patch 1000 sites in a batch :) As long as the number of trampolines is smallish (e.g. fits in a page), then we should be in good shape.
Re: [PATCH v3 0/6] Static calls
On Fri, Jan 11, 2019 at 12:31 PM Josh Poimboeuf wrote: > > I was referring to the fact that a single static call key update will > usually result in patching multiple call sites. But you're right, it's > only 1-2 trampolines per text_poke_bp() invocation. Though eventually > we may want to batch all the writes like what Daniel has proposed for > jump labels, to reduce IPIs. Yeah, my suggestion doesn't allow for batching, since it would basically generate one trampoline for every rewritten instruction. > Regardless, the trampoline management seems more complex to me. But > it's easier to argue about actual code, so maybe I'll code it up to make > it easier to compare solutions. I do agree that the stack games are likely "simpler" in one sense. I just abhor playing those kinds of games with the entry code and entry stack. A small bit of extra complexity in the code that actually does the rewriting would be much more palatable to me than the complexity in the entry code. I prefer seeing the onus of complexity being on the code that introduces the problem, not on an innocent bystander. I'd like to say that our entry code actually looks fairly sane these days. I'd _like_ to say that, but I'd be lying through my teeth if I did. The macros we use make any normal person's head spin. The workaround for the stack corruption was fairly simple, but the subtlety behind the *reason* for it was what made my hackles rise about that code. The x86 entry code is some of the nastiest in the kernel, I feel, with all the subtle interactions about odd stack switches, odd CPU bugs causing odd TLB switches, NMI interactions etc etc. So I am fully cognizant that the workaround to shift the stack in the entry code was just a couple of lines, and not very complicated. And I agree that I may be a bit oversensitive about that area, but it really is one of those places where I go "damn, I think I know some low-level x86 stuff better than most, but that code scares *me*".
Which is why I'd accept a rather bigger complexity hit just about anywhere else in the code... Linus
Re: [PATCH v3 0/6] Static calls
On Fri, Jan 11, 2019 at 12:12:30PM -0800, Linus Torvalds wrote: > On Fri, Jan 11, 2019 at 12:04 PM Josh Poimboeuf wrote: > > > > But really, to me, having to create and manage all those custom > > trampolines still feels a lot more complex than just making a gap on the > > stack. > > There are no "all those custom trampolines". > > There is literally *one* custom trampoline that you generate as you do > the rewriting. > > Well, two, since you need the version with the "sti" before the jmp. > > It would be possible to generate the custom trampoline on the fly in > the BP handler itself, and just have a magic flag for that case. But > it's probably simpler to do it in the caller, since you need to > generate that special writable and executable code sequence. You > probably don't want to do that at BP time. > > You probably want to use a FIX_TEXT_POKE2 page for the generated > sequence that just maps some generated code executably for a short > while. Or something like that. I was referring to the fact that a single static call key update will usually result in patching multiple call sites. But you're right, it's only 1-2 trampolines per text_poke_bp() invocation. Though eventually we may want to batch all the writes like what Daniel has proposed for jump labels, to reduce IPIs. Regardless, the trampoline management seems more complex to me. But it's easier to argue about actual code, so maybe I'll code it up to make it easier to compare solutions. -- Josh
Re: [PATCH v3 0/6] Static calls
On Fri, Jan 11, 2019 at 12:04 PM Josh Poimboeuf wrote: > > But really, to me, having to create and manage all those custom > trampolines still feels a lot more complex than just making a gap on the > stack. There are no "all those custom trampolines". There is literally *one* custom trampoline that you generate as you do the rewriting. Well, two, since you need the version with the "sti" before the jmp. It would be possible to generate the custom trampoline on the fly in the BP handler itself, and just have a magic flag for that case. But it's probably simpler to do it in the caller, since you need to generate that special writable and executable code sequence. You probably don't want to do that at BP time. You probably want to use a FIX_TEXT_POKE2 page for the generated sequence that just maps some generated code executably for a short while. Or something like that. Linus
Re: [PATCH v3 0/6] Static calls
On Fri, Jan 11, 2019 at 11:03:30AM -0800, Linus Torvalds wrote: > Then we'd change the end of poke_int3_handler() to do something like > this instead: > > void *newip = bp_int3_handler; > .. > if (newip == magic_static_call_bp_int3_handler) { > if (regs->flags & X86_EFLAGS_IF) { > newip = magic_static_call_bp_int3_handler_sti; > regs->flags &= ~X86_EFLAGS_IF; > } > regs->ip = (unsigned long) newip; > return 1; > > AAND now we're *really* done. > > Does anybody see any issues in this? This sounds ok, with a possible tweak: instead of the sti tricks, couldn't we just use synchronize_rcu_tasks() (as Jason suggested), to make sure the stubs are no longer used by a preempted task? But really, to me, having to create and manage all those custom trampolines still feels a lot more complex than just making a gap on the stack. -- Josh
Re: [PATCH v3 0/6] Static calls
On Fri, 11 Jan 2019, h...@zytor.com wrote: > I still don't see why we can't simply spin in the #BP handler until the > patch is complete. I think this brings us to the already discussed possible deadlock, when CPU#0 is in the middle of text_poke_bp(), CPU#1 is spinning inside spin_lock_irq*() and CPU#2 hits the breakpoint while holding that very 'lock'. Then we're stuck forever, because CPU#1 will never handle the pending sync_core() IPI (it's not NMI). Or have I misunderstood what you meant? -- Jiri Kosina SUSE Labs
Re: [PATCH v3 0/6] Static calls
On Fri, Jan 11, 2019 at 11:24 AM wrote: > > I still don't see why we can't simply spin in the #BP handler until the patch is > complete. So here's at least one problem: text_poke_bp() text_poke(addr, &int3, sizeof(int3)); *interrupt* interrupt has a static call *BP* poke_int3_handler *BOOM* Note how at BOOM we cannot just spin (or return) to wait for the 'int3' to be switched back. Because it never will. Because we are interrupting the thing that would do that switch-back. So we'd have to do the 'text_poke_bp()' sequence with interrupts disabled. Which we can't do right now at least, because part of that sequence involves that on_each_cpu(do_sync_core) thing, which needs interrupts enabled. See? Or am I missing something? Linus
Re: [PATCH v3 0/6] Static calls
> On Jan 11, 2019, at 11:23 AM, h...@zytor.com wrote: > > On January 11, 2019 11:03:30 AM PST, Linus Torvalds > wrote: >> On Fri, Jan 11, 2019 at 7:15 AM Josh Poimboeuf >> wrote: Now, in the int3 handler can you take the faulting RIP and search >> for it in the “static-calls” table, writing the RIP+5 (offset) into R10 >> (return address) and the target into R11. You make the int3 handler to >> divert the code execution by changing pt_regs->rip to point to a new function >> that does: push R10 jmp __x86_indirect_thunk_r11 And then you are done. No? >>> >>> IIUC, that sounds pretty much like what Steven proposed: >> https://lkml.kernel.org/r/20181129122000.7fb4fb04@gandalf.local.home >>> I liked the idea, BUT, how would it work for callee-saved PV ops? In >>> that case there's only one clobbered register to work with (rax). >> >> Actually, there's a much simpler model now that I think about it. >> >> The BP fixup just fixes up %rip to point to "bp_int3_handler". >> >> And that's just a random text address set up by "text_poke_bp()". >> >> So how about the static call rewriting simply do this: >> >> - for each static call: >> >> 1) create a fixup code stub that does >> >> push $returnaddressforTHIScall >> jmp targetforTHIScall >> >> 2) do >> >> on_each_cpu(do_sync_core, NULL, 1); >> >>to make sure all CPU's see this generated code >> >> 3) do >> >> text_poke_bp(addr, newcode, newlen, generatedcode); >> >> Ta-daa! Done. >> >> In fact, it turns out that even the extra "do_sync_core()" in #2 is >> unnecessary, because taking the BP will be serializing on the CPU that >> takes it, so we can skip it.
>> >> End result: the text_poke_bp() function will do the two do_sync_core >> IPI's that guarantee that by the time it returns, no other CPU is >> using the generated code any more, so it can be re-used for the next >> static call fixup. >> >> Notice? No odd emulation, no need to adjust the stack in the BP >> handler, just the regular "return to a different IP". >> >> Now, there is a nasty special case with that stub, though. >> >> So nasty thing with the whole "generate a stub for each call" case: >> because it's dynamic and because of the re-use of the stub, you could >> be in the situation where: >> >> CPU1 CPU2 >> >> >> generate a stub >> on_each_cpu(do_sync_core..) >> text_poke_bp() >> ... >> >> rewrite to BP >> trigger the BP >> return to the stub >> fun the first instruction of the stub >> *INTERRUPT causes rescheduling* >> >> on_each_cpu(do_sync_core..) >> rewrite to good instruction >> on_each_cpu(do_sync_core..) >> >> free or re-generate the stub >> >> !! The stub is still in use !! >> >> So that simple "just generate the stub dynamically" isn't so simple >> after all. >> >> But it turns out that that is really simple to handle too. How do we do >> that? >> >> We do that by giving the BP handler *two* code sequences, and we make >> the BP handler pick one depending on whether it is returning to a >> "interrupts disabled" or "interrupts enabled" case. >> >> So the BP handler does this: >> >> - if we're returning with interrupts disabled, pick the simple stub >> >> - if we're returning with interrupts enabled, clkear IF in the return >> %rflags, and pick a *slightly* more complex stub: >> >> push $returnaddressforTHIScall >> sti >> jmp targetforTHIScall >> >> and now the STI shadow will mean that this sequence is uninterruptible. >> >> So we'd not do complex emulation of the call instruction at BP time, >> but we'd do that *trivial* change at BP time. 
>> >> This seems simple, doesn't need any temporary registers at all, and >> doesn't need any extra stack magic. It literally needs just a trivial >> sequence in poke_int3_handler(). >> >> Then we'd change the end of poke_int3_handler() to do something like >> this instead: >> >> void *newip = bp_int3_handler; >> .. >> if (newip == magic_static_call_bp_int3_handler) { >> if (regs->flags & X86_EFLAGS_IF) { >> newip = magic_static_call_bp_int3_handler_sti; >> regs->flags &= ~X86_EFLAGS_IF; >> } >> regs->ip = (unsigned long) newip; >> return 1; >> >> AAND now we're *really* done. >> >> Does anybody see any issues in this? >> >> Linus > > I still don't see why we can't simply spin in the #BP handler until the patch is > complete. > > We can't have the #BP handler do any additional patching, as previously > discussed, but spinning should be perfectly safe.
Re: [PATCH v3 0/6] Static calls
On January 11, 2019 11:03:30 AM PST, Linus Torvalds wrote: >On Fri, Jan 11, 2019 at 7:15 AM Josh Poimboeuf >wrote: >> >> > >> > Now, in the int3 handler can you take the faulting RIP and search >for it in >> > the “static-calls” table, writing the RIP+5 (offset) into R10 >(return >> > address) and the target into R11. You make the int3 handler to >divert the >> > code execution by changing pt_regs->rip to point to a new function >that does: >> > >> > push R10 >> > jmp __x86_indirect_thunk_r11 >> > >> > And then you are done. No? >> >> IIUC, that sounds pretty much like what Steven proposed: >> >> >https://lkml.kernel.org/r/20181129122000.7fb4f...@gandalf.local.home >> >> I liked the idea, BUT, how would it work for callee-saved PV ops? In >> that case there's only one clobbered register to work with (rax). > >Actually, there's a much simpler model now that I think about it. > >The BP fixup just fixes up %rip to to point to "bp_int3_handler". > >And that's just a random text address set up by "text_poke_bp()". > >So how about the static call rewriting simply do this: > > - for each static call: > > 1) create a fixup code stub that does > >push $returnaddressforTHIScall >jmp targetforTHIScall > > 2) do > >on_each_cpu(do_sync_core, NULL, 1); > > to make sure all CPU's see this generated code > > 3) do > >text_poke_bp(addr, newcode, newlen, generatedcode); > >Ta-daa! Done. > >In fact, it turns out that even the extra "do_sync_core()" in #2 is >unnecessary, because taking the BP will be serializing on the CPU that >takes it, so we can skip it. > >End result: the text_poke_bp() function will do the two do_sync_core >IPI's that guarantee that by the time it returns, no other CPU is >using the generated code any more, so it can be re-used for the next >static call fixup. > >Notice? No odd emulation, no need to adjust the stack in the BP >handler, just the regular "return to a different IP". > >Now, there is a nasty special case with that stub, though. 
>
> So nasty thing with the whole "generate a stub for each call" case: because it's dynamic and because of the re-use of the stub, you could be in the situation where:
>
>    CPU1                          CPU2
>
>    generate a stub
>    on_each_cpu(do_sync_core..)
>    text_poke_bp()
>    ...
>    rewrite to BP
>                                  trigger the BP
>                                  return to the stub
>                                  run the first instruction of the stub
>                                  *INTERRUPT causes rescheduling*
>    on_each_cpu(do_sync_core..)
>    rewrite to good instruction
>    on_each_cpu(do_sync_core..)
>    free or re-generate the stub
>                                  !! The stub is still in use !!
>
> So that simple "just generate the stub dynamically" isn't so simple after all.
>
> But it turns out that that is really simple to handle too. How do we do that?
>
> We do that by giving the BP handler *two* code sequences, and we make the BP handler pick one depending on whether it is returning to a "interrupts disabled" or "interrupts enabled" case.
>
> So the BP handler does this:
>
>  - if we're returning with interrupts disabled, pick the simple stub
>
>  - if we're returning with interrupts enabled, clear IF in the return %rflags, and pick a *slightly* more complex stub:
>
>        push $returnaddressforTHIScall
>        sti
>        jmp targetforTHIScall
>
> and now the STI shadow will mean that this sequence is uninterruptible.
>
> So we'd not do complex emulation of the call instruction at BP time, but we'd do that *trivial* change at BP time.
>
> This seems simple, doesn't need any temporary registers at all, and doesn't need any extra stack magic. It literally needs just a trivial sequence in poke_int3_handler().
>
> Then we'd change the end of poke_int3_handler() to do something like this instead:
>
>    void *newip = bp_int3_handler;
>    ..
>    if (new == magic_static_call_bp_int3_handler) {
>        if (regs->flags & X86_FLAGS_IF) {
>            newip = magic_static_call_bp_int3_handler_sti;
>            regs->flags &= ~X86_FLAGS_IF;
>        }
>        regs->ip = (unsigned long) newip;
>        return 1;
>    }
>
> AAND now we're *really* done.
>
> Does anybody see any issues in this?
> > Linus

I still don't see why we can't simply spin in the #BP handler until the patch is complete. We can't have the #BP handler do any additional patching, as previously discussed, but spinning should be perfectly safe.

The simplest way to spin is to just IRET; that both serializes and will re-take the exception if the patch is still in progress. It requires exactly *no* awareness in the #BP handler, and allows for the call to be replaced with inline code or a simple NOP if desired (or vice versa, as long as it is a single instruction.)

If I'm missing something, then please educate me or point me to previous discussion; I would greatly appreciate it.

--
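A minimal model of the IRET-spin idea above, with made-up names (nothing here is an actual kernel symbol): the handler claims the trap only while patching is in progress at that address, and the "spin" is simply returning without advancing RIP, so the still-present 0xcc is re-executed and the exception is re-taken.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical sketch of the "just IRET to spin" idea. The #BP handler
 * does no patching at all: if the trapping address is a site that is
 * still being patched, it returns without advancing RIP. The IRET
 * serializes the CPU and re-executes the (still 0xcc) byte, re-taking
 * the exception until the patching CPU writes the final instruction.
 */
static volatile int patch_in_progress;  /* set/cleared by the patching CPU */
static uintptr_t patch_site;            /* address currently being patched */

/* Return true if this #BP is ours: the caller then IRETs straight back
 * to trap_rip, which is the whole "spin". False means a real breakpoint. */
static bool bp_spin_handler(uintptr_t trap_rip)
{
        if (!patch_in_progress || trap_rip != patch_site)
                return false;
        return true;
}
```

The point of the sketch is how little state the handler needs: one flag and one address, and no knowledge of what is being patched in.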
Re: [PATCH v3 0/6] Static calls
> On Jan 11, 2019, at 11:03 AM, Linus Torvalds wrote:
>
> On Fri, Jan 11, 2019 at 7:15 AM Josh Poimboeuf wrote:
>>> Now, in the int3 handler can you take the faulting RIP and search for it in the “static-calls” table, writing the RIP+5 (offset) into R10 (return address) and the target into R11. You make the int3 handler divert the code execution by changing pt_regs->rip to point to a new function that does:
>>>
>>>    push R10
>>>    jmp __x86_indirect_thunk_r11
>>>
>>> And then you are done. No?
>>
>> IIUC, that sounds pretty much like what Steven proposed:
>>
>> https://lkml.kernel.org/r/20181129122000.7fb4fb04@gandalf.local.home
>>
>> I liked the idea, BUT, how would it work for callee-saved PV ops? In that case there's only one clobbered register to work with (rax).
>
> Actually, there's a much simpler model now that I think about it.
>
> The BP fixup just fixes up %rip to point to "bp_int3_handler".
>
> And that's just a random text address set up by "text_poke_bp()".
>
> So how about the static call rewriting simply do this:
>
>  - for each static call:
>
>    1) create a fixup code stub that does
>
>        push $returnaddressforTHIScall
>        jmp targetforTHIScall
>
>    2) do
>
>        on_each_cpu(do_sync_core, NULL, 1);
>
>       to make sure all CPU's see this generated code
>
>    3) do
>
>        text_poke_bp(addr, newcode, newlen, generatedcode);
>
> Ta-daa! Done.
>
> In fact, it turns out that even the extra "do_sync_core()" in #2 is unnecessary, because taking the BP will be serializing on the CPU that takes it, so we can skip it.
>
> End result: the text_poke_bp() function will do the two do_sync_core IPI's that guarantee that by the time it returns, no other CPU is using the generated code any more, so it can be re-used for the next static call fixup.
>
> Notice? No odd emulation, no need to adjust the stack in the BP handler, just the regular "return to a different IP".
>
> Now, there is a nasty special case with that stub, though.
>
> So nasty thing with the whole "generate a stub for each call" case: because it's dynamic and because of the re-use of the stub, you could be in the situation where:
>
>    CPU1                          CPU2
>
>    generate a stub
>    on_each_cpu(do_sync_core..)
>    text_poke_bp()
>    ...
>    rewrite to BP
>                                  trigger the BP
>                                  return to the stub
>                                  run the first instruction of the stub
>                                  *INTERRUPT causes rescheduling*
>    on_each_cpu(do_sync_core..)
>    rewrite to good instruction
>    on_each_cpu(do_sync_core..)
>    free or re-generate the stub
>                                  !! The stub is still in use !!
>
> So that simple "just generate the stub dynamically" isn't so simple after all.
>
> But it turns out that that is really simple to handle too. How do we do that?
>
> We do that by giving the BP handler *two* code sequences, and we make the BP handler pick one depending on whether it is returning to a "interrupts disabled" or "interrupts enabled" case.
>
> So the BP handler does this:
>
>  - if we're returning with interrupts disabled, pick the simple stub
>
>  - if we're returning with interrupts enabled, clear IF in the return %rflags, and pick a *slightly* more complex stub:
>
>        push $returnaddressforTHIScall
>        sti
>        jmp targetforTHIScall
>
> and now the STI shadow will mean that this sequence is uninterruptible.
>
> So we'd not do complex emulation of the call instruction at BP time, but we'd do that *trivial* change at BP time.
>
> This seems simple, doesn't need any temporary registers at all, and doesn't need any extra stack magic. It literally needs just a trivial sequence in poke_int3_handler().
>
> Then we'd change the end of poke_int3_handler() to do something like this instead:
>
>    void *newip = bp_int3_handler;
>    ..
>    if (new == magic_static_call_bp_int3_handler) {
>        if (regs->flags & X86_FLAGS_IF) {
>            newip = magic_static_call_bp_int3_handler_sti;
>            regs->flags &= ~X86_FLAGS_IF;
>        }
>        regs->ip = (unsigned long) newip;
>        return 1;
>    }
>
> AAND now we're *really* done.
>
> Does anybody see any issues in this?

I think it’s a better articulated, more detailed version of the solution I proposed in this thread (which also used a patched trampoline with STI+JMP). So obviously I like it. ;-)
Re: [PATCH v3 0/6] Static calls
On Fri, Jan 11, 2019 at 7:15 AM Josh Poimboeuf wrote:
>> Now, in the int3 handler can you take the faulting RIP and search for it in the “static-calls” table, writing the RIP+5 (offset) into R10 (return address) and the target into R11. You make the int3 handler divert the code execution by changing pt_regs->rip to point to a new function that does:
>>
>>    push R10
>>    jmp __x86_indirect_thunk_r11
>>
>> And then you are done. No?
>
> IIUC, that sounds pretty much like what Steven proposed:
>
> https://lkml.kernel.org/r/20181129122000.7fb4f...@gandalf.local.home
>
> I liked the idea, BUT, how would it work for callee-saved PV ops? In that case there's only one clobbered register to work with (rax).

Actually, there's a much simpler model now that I think about it.

The BP fixup just fixes up %rip to point to "bp_int3_handler".

And that's just a random text address set up by "text_poke_bp()".

So how about the static call rewriting simply do this:

 - for each static call:

   1) create a fixup code stub that does

       push $returnaddressforTHIScall
       jmp targetforTHIScall

   2) do

       on_each_cpu(do_sync_core, NULL, 1);

      to make sure all CPU's see this generated code

   3) do

       text_poke_bp(addr, newcode, newlen, generatedcode);

Ta-daa! Done.

In fact, it turns out that even the extra "do_sync_core()" in #2 is unnecessary, because taking the BP will be serializing on the CPU that takes it, so we can skip it.

End result: the text_poke_bp() function will do the two do_sync_core IPI's that guarantee that by the time it returns, no other CPU is using the generated code any more, so it can be re-used for the next static call fixup.

Notice? No odd emulation, no need to adjust the stack in the BP handler, just the regular "return to a different IP".

Now, there is a nasty special case with that stub, though.
So nasty thing with the whole "generate a stub for each call" case: because it's dynamic and because of the re-use of the stub, you could be in the situation where:

   CPU1                          CPU2

   generate a stub
   on_each_cpu(do_sync_core..)
   text_poke_bp()
   ...
   rewrite to BP
                                 trigger the BP
                                 return to the stub
                                 run the first instruction of the stub
                                 *INTERRUPT causes rescheduling*
   on_each_cpu(do_sync_core..)
   rewrite to good instruction
   on_each_cpu(do_sync_core..)
   free or re-generate the stub
                                 !! The stub is still in use !!

So that simple "just generate the stub dynamically" isn't so simple after all.

But it turns out that that is really simple to handle too. How do we do that?

We do that by giving the BP handler *two* code sequences, and we make the BP handler pick one depending on whether it is returning to a "interrupts disabled" or "interrupts enabled" case.

So the BP handler does this:

 - if we're returning with interrupts disabled, pick the simple stub

 - if we're returning with interrupts enabled, clear IF in the return %rflags, and pick a *slightly* more complex stub:

       push $returnaddressforTHIScall
       sti
       jmp targetforTHIScall

and now the STI shadow will mean that this sequence is uninterruptible.

So we'd not do complex emulation of the call instruction at BP time, but we'd do that *trivial* change at BP time.

This seems simple, doesn't need any temporary registers at all, and doesn't need any extra stack magic. It literally needs just a trivial sequence in poke_int3_handler().

Then we'd change the end of poke_int3_handler() to do something like this instead:

   void *newip = bp_int3_handler;
   ..
   if (new == magic_static_call_bp_int3_handler) {
       if (regs->flags & X86_FLAGS_IF) {
           newip = magic_static_call_bp_int3_handler_sti;
           regs->flags &= ~X86_FLAGS_IF;
       }
       regs->ip = (unsigned long) newip;
       return 1;
   }

AAND now we're *really* done.

Does anybody see any issues in this?

Linus
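The stub-selection step above can be modeled in a few lines. This is a hedged sketch with invented names (`pick_stub`, `fake_regs`), not the actual poke_int3_handler() change: if the interrupted context had IF set, clear it in the saved flags and return to the STI variant of the stub, so the STI's one-instruction shadow covers the final jmp and the whole stub runs without interruption.

```c
#include <assert.h>

#define X86_EFLAGS_IF (1UL << 9)   /* interrupt-enable flag in RFLAGS */

/* Stand-in for the saved exception frame (struct pt_regs in the kernel). */
struct fake_regs {
        unsigned long flags;
        unsigned long ip;
};

/*
 * Illustrative model of the two-stub selection. With interrupts
 * originally enabled, we clear IF and divert to the "push; sti; jmp"
 * stub: the push runs with interrupts off, and the STI shadow delays
 * interrupt delivery until after the jmp. Otherwise the plain
 * "push; jmp" stub is already safe.
 */
static unsigned long pick_stub(struct fake_regs *regs,
                               unsigned long plain_stub,
                               unsigned long sti_stub)
{
        if (regs->flags & X86_EFLAGS_IF) {
                regs->flags &= ~X86_EFLAGS_IF;
                regs->ip = sti_stub;
        } else {
                regs->ip = plain_stub;
        }
        return regs->ip;
}
```

The design point is that the handler itself stays trivial; all of the "uninterruptible call emulation" lives in three instructions of generated stub code.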
Re: [PATCH v3 0/6] Static calls
> On Jan 11, 2019, at 8:07 AM, Josh Poimboeuf wrote:
>
> On Fri, Jan 11, 2019 at 03:48:59PM, Nadav Amit wrote:
>>> I liked the idea, BUT, how would it work for callee-saved PV ops? In that case there's only one clobbered register to work with (rax).
>>
>> That would be more tricky. How about using a per-CPU trampoline code to hold a direct call to the target and temporarily disable preemption (which might be simpler by disabling IRQs):

Allow me to simplify/correct:

>> Static-call modifier:
>>
>>  1. synchronize_sched() to ensure per-cpu trampoline is not used

No need for (1) (we are going to sync using IPI).

>>  2. Patches the jmp in a per-cpu trampoline (see below)
>>  3. Saves the call source RIP in [per-cpu scratchpad RIP] (below)

Both do not need to be per-cpu.

>>  4. Configures the int3 handler to use static-call int3 handler
>>  5. Patches the call target (as it currently does).

Note that text_poke_bp() would eventually do: on_each_cpu(do_sync_core, NULL, 1); So you should know no cores run the trampoline after (5).
Re: [PATCH v3 0/6] Static calls
On Fri, Jan 11, 2019 at 03:48:59PM, Nadav Amit wrote:
>> I liked the idea, BUT, how would it work for callee-saved PV ops? In that case there's only one clobbered register to work with (rax).
>
> That would be more tricky. How about using a per-CPU trampoline code to hold a direct call to the target and temporarily disable preemption (which might be simpler by disabling IRQs):
>
> Static-call modifier:
>
>  1. synchronize_sched() to ensure per-cpu trampoline is not used
>  2. Patches the jmp in a per-cpu trampoline (see below)
>  3. Saves the call source RIP in [per-cpu scratchpad RIP] (below)
>  4. Configures the int3 handler to use static-call int3 handler
>  5. Patches the call target (as it currently does).
>
> Static-call int3 handler:
>  1. Changes flags on the stack to keep IRQs disabled on return
>  2. Jumps to per-cpu trampoline on return
>
> Per-cpu trampoline:
>    push [per-CPU scratchpad RIP]
>    sti
>    jmp [target] (this one is patched)
>
> Note that no IRQ should be possible between the STI and the JMP due to STI blocking.
>
> What do you say?

This could work, but it's more complex than I was hoping for.

My current leading contender is to do call emulation in the #BP handler, either by making a gap or by doing Andy's longjmp-style thingie.

--
Josh
Re: [PATCH v3 0/6] Static calls
> On Jan 11, 2019, at 7:15 AM, Josh Poimboeuf wrote:
>
> On Fri, Jan 11, 2019 at 01:47:01AM, Nadav Amit wrote:
>> Here is an alternative idea (although similar to Steven’s and my code).
>>
>> Assume that we always clobber R10, R11 on static-calls explicitly, as anyhow should be done by the calling convention (and gcc plugin should allow us to enforce). Also assume that we hold a table with all source RIP and the matching target.
>>
>> Now, in the int3 handler can you take the faulting RIP and search for it in the “static-calls” table, writing the RIP+5 (offset) into R10 (return address) and the target into R11. You make the int3 handler divert the code execution by changing pt_regs->rip to point to a new function that does:
>>
>>    push R10
>>    jmp __x86_indirect_thunk_r11
>>
>> And then you are done. No?
>
> IIUC, that sounds pretty much like what Steven proposed:
>
> https://lkml.kernel.org/r/20181129122000.7fb4fb04@gandalf.local.home

Stupid me. I’ve remembered it slightly differently (the caller saving the target in a register).

> I liked the idea, BUT, how would it work for callee-saved PV ops? In that case there's only one clobbered register to work with (rax).

That would be more tricky. How about using a per-CPU trampoline code to hold a direct call to the target and temporarily disable preemption (which might be simpler by disabling IRQs):

Static-call modifier:

 1. synchronize_sched() to ensure per-cpu trampoline is not used
 2. Patches the jmp in a per-cpu trampoline (see below)
 3. Saves the call source RIP in [per-cpu scratchpad RIP] (below)
 4. Configures the int3 handler to use static-call int3 handler
 5. Patches the call target (as it currently does).

Static-call int3 handler:
 1. Changes flags on the stack to keep IRQs disabled on return
 2. Jumps to per-cpu trampoline on return

Per-cpu trampoline:
   push [per-CPU scratchpad RIP]
   sti
   jmp [target] (this one is patched)

Note that no IRQ should be possible between the STI and the JMP due to STI blocking.

What do you say?
Re: [PATCH v3 0/6] Static calls
On Fri, Jan 11, 2019 at 01:47:01AM +, Nadav Amit wrote: > Here is an alternative idea (although similar to Steven’s and my code). > > Assume that we always clobber R10, R11 on static-calls explicitly, as anyhow > should be done by the calling convention (and gcc plugin should allow us to > enforce). Also assume that we hold a table with all source RIP and the > matching target. > > Now, in the int3 handler can you take the faulting RIP and search for it in > the “static-calls” table, writing the RIP+5 (offset) into R10 (return > address) and the target into R11. You make the int3 handler to divert the > code execution by changing pt_regs->rip to point to a new function that does: > > push R10 > jmp __x86_indirect_thunk_r11 > > And then you are done. No? IIUC, that sounds pretty much like what Steven proposed: https://lkml.kernel.org/r/20181129122000.7fb4f...@gandalf.local.home I liked the idea, BUT, how would it work for callee-saved PV ops? In that case there's only one clobbered register to work with (rax). -- Josh
Re: [PATCH v3 0/6] Static calls
> On Jan 10, 2019, at 4:56 PM, Andy Lutomirski wrote: > > On Thu, Jan 10, 2019 at 3:02 PM Linus Torvalds > wrote: >> On Thu, Jan 10, 2019 at 12:52 PM Josh Poimboeuf wrote: >>> Right, emulating a call instruction from the #BP handler is ugly, >>> because you have to somehow grow the stack to make room for the return >>> address. Personally I liked the idea of shifting the iret frame by 16 >>> bytes in the #DB entry code, but others hated it. >> >> Yeah, I hated it. >> >> But I'm starting to think it's the simplest solution. >> >> So still not loving it, but all the other models have had huge issues too. > > Putting my maintainer hat on: > > I'm okay-ish with shifting the stack by 16 bytes. If this is done, I > want an assertion in do_int3() or wherever the fixup happens that the > write isn't overlapping pt_regs (which is easy to implement because > that code has the relevant pt_regs pointer). And I want some code > that explicitly triggers the fixup when a CONFIG_DEBUG_ENTRY=y or > similar kernel is built so that this whole mess actually gets > exercised. Because the fixup only happens when a > really-quite-improbable race gets hit, and the issues depend on stack > alignment, which is presumably why Josh was able to submit a buggy > series without noticing. > > BUT: this is going to be utterly gross whenever anyone tries to > implement shadow stacks for the kernel, and we might need to switch to > a longjmp-like approach if that happens. Here is an alternative idea (although similar to Steven’s and my code). Assume that we always clobber R10, R11 on static-calls explicitly, as anyhow should be done by the calling convention (and gcc plugin should allow us to enforce). Also assume that we hold a table with all source RIP and the matching target. Now, in the int3 handler can you take the faulting RIP and search for it in the “static-calls” table, writing the RIP+5 (offset) into R10 (return address) and the target into R11. 
You make the int3 handler divert the code execution by changing pt_regs->rip to point to a new function that does:

   push R10
   jmp __x86_indirect_thunk_r11

And then you are done. No?
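The table-lookup step above can be sketched as follows; all names here are illustrative (`static_call_site`, `fake_regs`, the trampoline address), not actual kernel code. The handler searches the faulting RIP in the table, stashes the return address (RIP+5, the length of a 5-byte call) in R10 and the target in R11, then diverts to the `push %r10; jmp __x86_indirect_thunk_r11` trampoline.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One entry per static-call site: where the call lives and where it goes. */
struct static_call_site {
        uintptr_t rip;      /* address of the patched call instruction */
        uintptr_t target;   /* current call destination */
};

/* Stand-in for the registers the handler would touch in pt_regs. */
struct fake_regs { uintptr_t ip, r10, r11; };

/* Look up the faulting RIP; on a hit, set up R10/R11 and divert to the
 * trampoline. A miss means the #BP was a real breakpoint, not ours. */
static bool static_call_int3(struct fake_regs *regs,
                             const struct static_call_site *tbl, size_t n,
                             uintptr_t trampoline)
{
        for (size_t i = 0; i < n; i++) {
                if (tbl[i].rip == regs->ip) {
                        regs->r10 = tbl[i].rip + 5; /* return address */
                        regs->r11 = tbl[i].target;
                        regs->ip  = trampoline;
                        return true;
                }
        }
        return false;
}
```

A real implementation would keep the table sorted and binary-search it, since the linear walk above runs in exception context.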
Re: [PATCH v3 0/6] Static calls
On Thu, Jan 10, 2019 at 3:02 PM Linus Torvalds wrote: > > On Thu, Jan 10, 2019 at 12:52 PM Josh Poimboeuf wrote: > > > > Right, emulating a call instruction from the #BP handler is ugly, > > because you have to somehow grow the stack to make room for the return > > address. Personally I liked the idea of shifting the iret frame by 16 > > bytes in the #DB entry code, but others hated it. > > Yeah, I hated it. > > But I'm starting to think it's the simplest solution. > > So still not loving it, but all the other models have had huge issues too. > Putting my maintainer hat on: I'm okay-ish with shifting the stack by 16 bytes. If this is done, I want an assertion in do_int3() or wherever the fixup happens that the write isn't overlapping pt_regs (which is easy to implement because that code has the relevant pt_regs pointer). And I want some code that explicitly triggers the fixup when a CONFIG_DEBUG_ENTRY=y or similar kernel is built so that this whole mess actually gets exercised. Because the fixup only happens when a really-quite-improbable race gets hit, and the issues depend on stack alignment, which is presumably why Josh was able to submit a buggy series without noticing. BUT: this is going to be utterly gross whenever anyone tries to implement shadow stacks for the kernel, and we might need to switch to a longjmp-like approach if that happens. --Andy
Re: [PATCH v3 0/6] Static calls
On Thu, Jan 10, 2019 at 12:52 PM Josh Poimboeuf wrote: > > Right, emulating a call instruction from the #BP handler is ugly, > because you have to somehow grow the stack to make room for the return > address. Personally I liked the idea of shifting the iret frame by 16 > bytes in the #DB entry code, but others hated it. Yeah, I hated it. But I'm starting to think it's the simplest solution. So still not loving it, but all the other models have had huge issues too. Linus
Re: [PATCH v3 0/6] Static calls
> On Jan 10, 2019, at 12:57 PM, Josh Poimboeuf wrote:
>
> On Thu, Jan 10, 2019 at 08:48:31PM, Nadav Amit wrote:
>>> On Jan 10, 2019, at 12:32 PM, Josh Poimboeuf wrote:
>>>
>>> On Thu, Jan 10, 2019 at 07:45:26PM, Nadav Amit wrote:
>>>>>> I’m not GCC expert either and writing this code was not making me full of joy, etc.. I’ll be happy that my code would be reviewed, but it does work. I don’t think an early pass is needed, as long as hardware registers were not allocated.
>>>>>>
>>>>>>> Would it work with more than 5 arguments, where args get passed on the stack?
>>>>>>
>>>>>> It does.
>>>>>>
>>>>>>> At the very least, it would (at least partially) defeat the point of the callee-saved paravirt ops.
>>>>>>
>>>>>> Actually, I think you can even deal with callee-saved functions and remove all the (terrible) macros. You would need to tell the extension not to clobber the registers through a new attribute.
>>>>>
>>>>> Ok, it does sound interesting then. I assume you'll be sharing the code?
>>>>
>>>> Of course. If this is what is going to convince, I’ll make a small version for PV callee-saved first.
>>>
>>> It wasn't *only* the PV callee-saved part which interested me, so if you already have something which implements the other parts, I'd still like to see it.
>>
>> Did you have a look at https://lore.kernel.org/lkml/20181231072112.21051-4-namit@vmware.com/ ?
>>
>> See the changes to x86_call_markup_plugin.c .
>>
>> The missing part (that I just finished but need to cleanup) is attributes that allow you to provide key and dynamically enable the patching.
>
> Aha, so it's basically the same plugin you had for optpolines. I missed that. I'll need to stare at the code for a little bit.

Pretty much.
You would want to change the assembly code block, and based on the use-case (e.g., callee-saved, static-calls) clobber or set as an input more or fewer registers.
Re: [PATCH v3 0/6] Static calls
On Thu, Jan 10, 2019 at 08:48:31PM +, Nadav Amit wrote: > > On Jan 10, 2019, at 12:32 PM, Josh Poimboeuf wrote: > > > > On Thu, Jan 10, 2019 at 07:45:26PM +, Nadav Amit wrote: > I’m not GCC expert either and writing this code was not making me full of > joy, etc.. I’ll be happy that my code would be reviewed, but it does > work. I > don’t think an early pass is needed, as long as hardware registers were > not > allocated. > > > Would it work with more than 5 arguments, where args get passed on the > > stack? > > It does. > > > At the very least, it would (at least partially) defeat the point of the > > callee-saved paravirt ops. > > Actually, I think you can even deal with callee-saved functions and > remove > all the (terrible) macros. You would need to tell the extension not to > clobber the registers through a new attribute. > >>> > >>> Ok, it does sound interesting then. I assume you'll be sharing the > >>> code? > >> > >> Of course. If this what is going to convince, I’ll make a small version for > >> PV callee-saved first. > > > > It wasn't *only* the PV callee-saved part which interested me, so if you > > already have something which implements the other parts, I'd still like > > to see it. > > Did you have a look at > https://lore.kernel.org/lkml/20181231072112.21051-4-na...@vmware.com/ ? > > See the changes to x86_call_markup_plugin.c . > > The missing part (that I just finished but need to cleanup) is attributes > that allow you to provide key and dynamically enable the patching. Aha, so it's the basically the same plugin you had for optpolines. I missed that. I'll need to stare at the code for a little bit. > > What if we just used a plugin in a simpler fashion -- to do call site > > alignment, if necessary, to ensure the instruction doesn't cross > > cacheline boundaries. This could be done in a later pass, with no side > > effects other than code layout. 
> > And it would allow us to avoid breakpoints altogether -- again, assuming somebody can verify that intra-cacheline call destination writes are atomic with respect to instruction decoder reads.
>
> The plugin should not be able to do so. Layout of the bytecode is done by the assembler, so I don’t think a plugin would help you with this one.
>
> >>> Actually I think we could use .bundle_align_mode for this purpose:
> >>>
> >>> https://sourceware.org/binutils/docs-2.31/as/Bundle-directives.html
> >>
> >> Hm… I don’t understand what you have in mind (i.e., when would these assembly directives be emitted).
> >
> > For example, it could replace
> >
> >     callq static_call_tramp_my_key
> >
> > with
> >
> >     .bundle_align_mode 6
> >     callq static_call_tramp_my_key
> >     .bundle_align_mode 0
> >
> > which ensures the instruction is within a cache line, aligning it with NOPs if necessary. That would allow my current implementation to upgrade out-of-line calls to inline calls 100% of the time, instead of 95% of the time.
>
> Heh. I almost wrote, based on the feature description, that this will add unnecessary padding no matter what, but actually (experimentally) it works well…

Yeah, based on the poorly worded docs, I made the same assumption, until I tried it.

--
Josh
Re: [PATCH v3 0/6] Static calls
On 1/10/19 9:31 AM, Linus Torvalds wrote: > On Wed, Jan 9, 2019 at 2:59 PM Josh Poimboeuf wrote: >> >> NOTE: At least experimentally, the call destination writes seem to be >> atomic with respect to instruction fetching. On Nehalem I can easily >> trigger crashes when writing a call destination across cachelines while >> reading the instruction on other CPU; but I get no such crashes when >> respecting cacheline boundaries. > > I still doubt ifetch is atomic on a cacheline boundary for the simple > reason that the bus between the IU and the L1 I$ is narrower in older > CPU's. > As far as I understand, on P6+ (and possibly earlier, but I don't know) it is atomic on a 16-byte fetch datum, at least for Intel CPUs. However, single byte accesses are always going to be safe. -hpa
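Since single-byte writes are always safe, the only delicate case is the 4-byte rel32 of a rewritten `call`. The boundary check being discussed in this thread can be sketched as follows (a hedged illustration assuming the usual 5-byte `e8 rel32` encoding and 64-byte cache lines, not kernel code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CACHELINE 64

/*
 * A 5-byte direct call is "e8 rel32". Rewriting the destination means
 * overwriting the 4 bytes at call_insn + 1. That write can only be
 * atomic with respect to instruction fetch if all 4 bytes land in the
 * same cache line (assuming the fetch-atomicity observation above
 * holds); otherwise the patching must fall back to another mechanism,
 * such as keeping the trampoline.
 */
static bool call_dest_in_one_cacheline(uintptr_t call_insn)
{
        uintptr_t first = call_insn + 1;    /* first byte of rel32 */
        uintptr_t last  = call_insn + 4;    /* last byte of rel32 */

        return (first / CACHELINE) == (last / CACHELINE);
}
```

Note that the check is on the *operand* bytes, not the whole instruction: a call whose opcode byte sits at the end of one line but whose rel32 starts the next line is still patchable.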
Re: [PATCH v3 0/6] Static calls
On Thu, Jan 10, 2019 at 09:30:23PM +0100, Peter Zijlstra wrote: > On Wed, Jan 09, 2019 at 04:59:35PM -0600, Josh Poimboeuf wrote: > > With this version, I stopped trying to use text_poke_bp(), and instead > > went with a different approach: if the call site destination doesn't > > cross a cacheline boundary, just do an atomic write. Otherwise, keep > > using the trampoline indefinitely. > > > - Get rid of the use of text_poke_bp(), in favor of atomic writes. > > Out-of-line calls will be promoted to inline only if the call sites > > don't cross cache line boundaries. [Linus/Andy] > > Can we perserve why text_poke_bp() didn't work? I seem to have forgotten > again. The problem was poking the return address onto the stack from the > int3 handler, or something along those lines? Right, emulating a call instruction from the #BP handler is ugly, because you have to somehow grow the stack to make room for the return address. Personally I liked the idea of shifting the iret frame by 16 bytes in the #DB entry code, but others hated it. So many bad-but-not-completely-unacceptable options to choose from. -- Josh
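For context, "emulating a call" in the #BP handler means doing by hand what the patched `call` would have done. A toy model of just that step (made-up types; the real difficulty is that the interrupted context's stack is only reachable through the exception frame, which is exactly what makes growing it ugly):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Toy register/stack model. In the kernel, the "stack" here is the
 * interrupted context's stack as seen through pt_regs, and making room
 * on it from inside the exception handler is the hard part being
 * debated (the 16-byte iret-frame shift, the longjmp trick, etc.).
 */
struct toy_cpu {
        uintptr_t ip;
        uintptr_t *sp;          /* grows down */
};

/* Emulate "call target" for a 5-byte call at cpu->ip: push the return
 * address (ip + 5) and transfer control to the target. */
static void emulate_call(struct toy_cpu *cpu, uintptr_t target)
{
        *--cpu->sp = cpu->ip + 5;   /* push return address */
        cpu->ip = target;
}
```

The two lines of the function body are trivial; the thread's disagreement is entirely about where the pushed word can legally go.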
Re: [PATCH v3 0/6] Static calls
> On Jan 10, 2019, at 12:32 PM, Josh Poimboeuf wrote: > > On Thu, Jan 10, 2019 at 07:45:26PM +, Nadav Amit wrote: I’m not GCC expert either and writing this code was not making me full of joy, etc.. I’ll be happy that my code would be reviewed, but it does work. I don’t think an early pass is needed, as long as hardware registers were not allocated. > Would it work with more than 5 arguments, where args get passed on the > stack? It does. > At the very least, it would (at least partially) defeat the point of the > callee-saved paravirt ops. Actually, I think you can even deal with callee-saved functions and remove all the (terrible) macros. You would need to tell the extension not to clobber the registers through a new attribute. >>> >>> Ok, it does sound interesting then. I assume you'll be sharing the >>> code? >> >> Of course. If this what is going to convince, I’ll make a small version for >> PV callee-saved first. > > It wasn't *only* the PV callee-saved part which interested me, so if you > already have something which implements the other parts, I'd still like > to see it. Did you have a look at https://lore.kernel.org/lkml/20181231072112.21051-4-na...@vmware.com/ ? See the changes to x86_call_markup_plugin.c . The missing part (that I just finished but need to cleanup) is attributes that allow you to provide key and dynamically enable the patching. > What if we just used a plugin in a simpler fashion -- to do call site > alignment, if necessary, to ensure the instruction doesn't cross > cacheline boundaries. This could be done in a later pass, with no side > effects other than code layout. And it would allow us to avoid > breakpoints altogether -- again, assuming somebody can verify that > intra-cacheline call destination writes are atomic with respect to > instruction decoder reads. The plugin should not be able to do so. Layout of the bytecode is done by the assembler, so I don’t think a plugin would help you with this one. 
>>> Actually I think we could use .bundle_align_mode for this purpose:
>>>
>>> https://sourceware.org/binutils/docs-2.31/as/Bundle-directives.html
>>
>> Hm… I don’t understand what you have in mind (i.e., when would these assembly directives be emitted).
>
> For example, it could replace
>
>     callq static_call_tramp_my_key
>
> with
>
>     .bundle_align_mode 6
>     callq static_call_tramp_my_key
>     .bundle_align_mode 0
>
> which ensures the instruction is within a cache line, aligning it with NOPs if necessary. That would allow my current implementation to upgrade out-of-line calls to inline calls 100% of the time, instead of 95% of the time.

Heh. I almost wrote, based on the feature description, that this will add unnecessary padding no matter what, but actually (experimentally) it works well…
Re: [PATCH v3 0/6] Static calls
On Thu, Jan 10, 2019 at 07:45:26PM +, Nadav Amit wrote: > >> I’m not GCC expert either and writing this code was not making me full of > >> joy, etc.. I’ll be happy that my code would be reviewed, but it does work. > >> I > >> don’t think an early pass is needed, as long as hardware registers were not > >> allocated. > >> > >>> Would it work with more than 5 arguments, where args get passed on the > >>> stack? > >> > >> It does. > >> > >>> At the very least, it would (at least partially) defeat the point of the > >>> callee-saved paravirt ops. > >> > >> Actually, I think you can even deal with callee-saved functions and remove > >> all the (terrible) macros. You would need to tell the extension not to > >> clobber the registers through a new attribute. > > > > Ok, it does sound interesting then. I assume you'll be sharing the > > code? > > Of course. If this what is going to convince, I’ll make a small version for > PV callee-saved first. It wasn't *only* the PV callee-saved part which interested me, so if you already have something which implements the other parts, I'd still like to see it. > >>> What if we just used a plugin in a simpler fashion -- to do call site > >>> alignment, if necessary, to ensure the instruction doesn't cross > >>> cacheline boundaries. This could be done in a later pass, with no side > >>> effects other than code layout. And it would allow us to avoid > >>> breakpoints altogether -- again, assuming somebody can verify that > >>> intra-cacheline call destination writes are atomic with respect to > >>> instruction decoder reads. > >> > >> The plugin should not be able to do so. Layout of the bytecode is done by > >> the assembler, so I don’t think a plugin would help you with this one. 
> > Actually I think we could use .bundle_align_mode for this purpose:
> >
> > https://sourceware.org/binutils/docs-2.31/as/Bundle-directives.html
>
> Hm… I don’t understand what you have in mind (i.e., when these
> assembly directives would be emitted).

For example, it could replace

    callq static_call_tramp_my_key

with

    .bundle_align_mode 6
    callq static_call_tramp_my_key
    .bundle_align_mode 0

which ensures the instruction is within a cache line, aligning it with
NOPs if necessary. That would allow my current implementation to
upgrade out-of-line calls to inline calls 100% of the time, instead of
95% of the time.

-- Josh
Re: [PATCH v3 0/6] Static calls
On Wed, Jan 09, 2019 at 04:59:35PM -0600, Josh Poimboeuf wrote:
> With this version, I stopped trying to use text_poke_bp(), and instead
> went with a different approach: if the call site destination doesn't
> cross a cacheline boundary, just do an atomic write. Otherwise, keep
> using the trampoline indefinitely.

> - Get rid of the use of text_poke_bp(), in favor of atomic writes.
>   Out-of-line calls will be promoted to inline only if the call sites
>   don't cross cache line boundaries. [Linus/Andy]

Can we preserve the reason why text_poke_bp() didn't work? I seem to have
forgotten again. The problem was poking the return address onto the stack
from the int3 handler, or something along those lines?
Re: [PATCH v3 0/6] Static calls
> On Jan 10, 2019, at 10:18 AM, Josh Poimboeuf wrote: > > On Thu, Jan 10, 2019 at 05:32:08PM +, Nadav Amit wrote: >>> On Jan 10, 2019, at 8:44 AM, Josh Poimboeuf wrote: >>> >>> On Thu, Jan 10, 2019 at 01:21:00AM +, Nadav Amit wrote: > On Jan 9, 2019, at 2:59 PM, Josh Poimboeuf wrote: > > With this version, I stopped trying to use text_poke_bp(), and instead > went with a different approach: if the call site destination doesn't > cross a cacheline boundary, just do an atomic write. Otherwise, keep > using the trampoline indefinitely. > > NOTE: At least experimentally, the call destination writes seem to be > atomic with respect to instruction fetching. On Nehalem I can easily > trigger crashes when writing a call destination across cachelines while > reading the instruction on other CPU; but I get no such crashes when > respecting cacheline boundaries. > > BUT, the SDM doesn't document this approach, so it would be great if any > CPU people can confirm that it's safe! I (still) think that having a compiler plugin can make things much cleaner (as done in [1]). The callers would not need to be changed, and the key can be provided through an attribute. Using a plugin should also allow to use Steven’s proposal for doing text_poke() safely: by changing 'func()' into 'asm (“call func”)', as done by the plugin, you can be guaranteed that registers are clobbered. Then, you can store in the assembly block the return address in one of these registers. >>> >>> I'm no GCC expert (why do I find myself saying this a lot lately?), but >>> this sounds to me like it could be tricky to get right. >>> >>> I suppose you'd have to do it in an early pass, to allow GCC to clobber >>> the registers in a later pass. So it would necessarily have side >>> effects, but I don't know what the risks are. >> >> I’m not GCC expert either and writing this code was not making me full of >> joy, etc.. I’ll be happy that my code would be reviewed, but it does work. 
I >> don’t think an early pass is needed, as long as hardware registers were not >> allocated. >> >>> Would it work with more than 5 arguments, where args get passed on the >>> stack? >> >> It does. >> >>> At the very least, it would (at least partially) defeat the point of the >>> callee-saved paravirt ops. >> >> Actually, I think you can even deal with callee-saved functions and remove >> all the (terrible) macros. You would need to tell the extension not to >> clobber the registers through a new attribute. > > Ok, it does sound interesting then. I assume you'll be sharing the > code? Of course. If this is what is going to convince you, I’ll make a small version for PV callee-saved first. >>> What if we just used a plugin in a simpler fashion -- to do call site >>> alignment, if necessary, to ensure the instruction doesn't cross >>> cacheline boundaries. This could be done in a later pass, with no side >>> effects other than code layout. And it would allow us to avoid >>> breakpoints altogether -- again, assuming somebody can verify that >>> intra-cacheline call destination writes are atomic with respect to >>> instruction decoder reads. >> >> The plugin should not be able to do so. Layout of the bytecode is done by >> the assembler, so I don’t think a plugin would help you with this one. > > Actually I think we could use .bundle_align_mode for this purpose: > > https://sourceware.org/binutils/docs-2.31/as/Bundle-directives.html Hm… I don’t understand what you have in mind (i.e., when these assembly directives would be emitted).
Re: [PATCH v3 0/6] Static calls
On Thu, Jan 10, 2019 at 05:32:08PM +, Nadav Amit wrote: > > On Jan 10, 2019, at 8:44 AM, Josh Poimboeuf wrote: > > > > On Thu, Jan 10, 2019 at 01:21:00AM +, Nadav Amit wrote: > >>> On Jan 9, 2019, at 2:59 PM, Josh Poimboeuf wrote: > >>> > >>> With this version, I stopped trying to use text_poke_bp(), and instead > >>> went with a different approach: if the call site destination doesn't > >>> cross a cacheline boundary, just do an atomic write. Otherwise, keep > >>> using the trampoline indefinitely. > >>> > >>> NOTE: At least experimentally, the call destination writes seem to be > >>> atomic with respect to instruction fetching. On Nehalem I can easily > >>> trigger crashes when writing a call destination across cachelines while > >>> reading the instruction on other CPU; but I get no such crashes when > >>> respecting cacheline boundaries. > >>> > >>> BUT, the SDM doesn't document this approach, so it would be great if any > >>> CPU people can confirm that it's safe! > >> > >> I (still) think that having a compiler plugin can make things much cleaner > >> (as done in [1]). The callers would not need to be changed, and the key can > >> be provided through an attribute. > >> > >> Using a plugin should also allow to use Steven’s proposal for doing > >> text_poke() safely: by changing 'func()' into 'asm (“call func”)', as done > >> by the plugin, you can be guaranteed that registers are clobbered. Then, > >> you > >> can store in the assembly block the return address in one of these > >> registers. > > > > I'm no GCC expert (why do I find myself saying this a lot lately?), but > > this sounds to me like it could be tricky to get right. > > > > I suppose you'd have to do it in an early pass, to allow GCC to clobber > > the registers in a later pass. So it would necessarily have side > > effects, but I don't know what the risks are. > > I’m not GCC expert either and writing this code was not making me full of > joy, etc.. 
I’ll be happy that my code would be reviewed, but it does work. I > don’t think an early pass is needed, as long as hardware registers were not > allocated. > > > Would it work with more than 5 arguments, where args get passed on the > > stack? > > It does. > > > > > At the very least, it would (at least partially) defeat the point of the > > callee-saved paravirt ops. > > Actually, I think you can even deal with callee-saved functions and remove > all the (terrible) macros. You would need to tell the extension not to > clobber the registers through a new attribute. Ok, it does sound interesting then. I assume you'll be sharing the code? > > What if we just used a plugin in a simpler fashion -- to do call site > > alignment, if necessary, to ensure the instruction doesn't cross > > cacheline boundaries. This could be done in a later pass, with no side > > effects other than code layout. And it would allow us to avoid > > breakpoints altogether -- again, assuming somebody can verify that > > intra-cacheline call destination writes are atomic with respect to > > instruction decoder reads. > > The plugin should not be able to do so. Layout of the bytecode is done by > the assembler, so I don’t think a plugin would help you with this one. Actually I think we could use .bundle_align_mode for this purpose: https://sourceware.org/binutils/docs-2.31/as/Bundle-directives.html -- Josh
Re: [PATCH v3 0/6] Static calls
> On Jan 10, 2019, at 8:44 AM, Josh Poimboeuf wrote: > > On Thu, Jan 10, 2019 at 01:21:00AM +, Nadav Amit wrote: >>> On Jan 9, 2019, at 2:59 PM, Josh Poimboeuf wrote: >>> >>> With this version, I stopped trying to use text_poke_bp(), and instead >>> went with a different approach: if the call site destination doesn't >>> cross a cacheline boundary, just do an atomic write. Otherwise, keep >>> using the trampoline indefinitely. >>> >>> NOTE: At least experimentally, the call destination writes seem to be >>> atomic with respect to instruction fetching. On Nehalem I can easily >>> trigger crashes when writing a call destination across cachelines while >>> reading the instruction on other CPU; but I get no such crashes when >>> respecting cacheline boundaries. >>> >>> BUT, the SDM doesn't document this approach, so it would be great if any >>> CPU people can confirm that it's safe! >> >> I (still) think that having a compiler plugin can make things much cleaner >> (as done in [1]). The callers would not need to be changed, and the key can >> be provided through an attribute. >> >> Using a plugin should also allow to use Steven’s proposal for doing >> text_poke() safely: by changing 'func()' into 'asm (“call func”)', as done >> by the plugin, you can be guaranteed that registers are clobbered. Then, you >> can store in the assembly block the return address in one of these >> registers. > > I'm no GCC expert (why do I find myself saying this a lot lately?), but > this sounds to me like it could be tricky to get right. > > I suppose you'd have to do it in an early pass, to allow GCC to clobber > the registers in a later pass. So it would necessarily have side > effects, but I don't know what the risks are. I’m not GCC expert either and writing this code was not making me full of joy, etc.. I’ll be happy that my code would be reviewed, but it does work. I don’t think an early pass is needed, as long as hardware registers were not allocated. 
> Would it work with more than 5 arguments, where args get passed on the > stack? It does. > > At the very least, it would (at least partially) defeat the point of the > callee-saved paravirt ops. Actually, I think you can even deal with callee-saved functions and remove all the (terrible) macros. You would need to tell the extension not to clobber the registers through a new attribute. > What if we just used a plugin in a simpler fashion -- to do call site > alignment, if necessary, to ensure the instruction doesn't cross > cacheline boundaries. This could be done in a later pass, with no side > effects other than code layout. And it would allow us to avoid > breakpoints altogether -- again, assuming somebody can verify that > intra-cacheline call destination writes are atomic with respect to > instruction decoder reads. The plugin should not be able to do so. Layout of the bytecode is done by the assembler, so I don’t think a plugin would help you with this one.
Re: [PATCH v3 0/6] Static calls
On Wed, Jan 9, 2019 at 2:59 PM Josh Poimboeuf wrote:
>
> NOTE: At least experimentally, the call destination writes seem to be
> atomic with respect to instruction fetching. On Nehalem I can easily
> trigger crashes when writing a call destination across cachelines while
> reading the instruction on other CPU; but I get no such crashes when
> respecting cacheline boundaries.

I still doubt ifetch is atomic on a cacheline boundary, for the simple
reason that the bus between the IU and the L1 I$ is narrower in older
CPUs. Also, the fill of the L1 I$ from the (cache-coherent) L2 may not
be a cacheline at a time either. That said, the fetch may be
sufficiently ordered that it works in practice.

It _would_ be absolutely lovely to be able to do things like this.

I do agree with Nadav that if there's some way to avoid this, it would
be good. I'm not in general a huge fan of compiler plugins (compiler
instability is just about my worst fear, and I feel plugins tend to open
up that area a lot), but it does feel like this might be something where
compiler tweaking would possibly be the cleanest approach.

Linus
Re: [PATCH v3 0/6] Static calls
On Thu, Jan 10, 2019 at 01:21:00AM +, Nadav Amit wrote:
> > On Jan 9, 2019, at 2:59 PM, Josh Poimboeuf wrote:
> >
> > With this version, I stopped trying to use text_poke_bp(), and instead
> > went with a different approach: if the call site destination doesn't
> > cross a cacheline boundary, just do an atomic write. Otherwise, keep
> > using the trampoline indefinitely.
> >
> > NOTE: At least experimentally, the call destination writes seem to be
> > atomic with respect to instruction fetching. On Nehalem I can easily
> > trigger crashes when writing a call destination across cachelines while
> > reading the instruction on other CPU; but I get no such crashes when
> > respecting cacheline boundaries.
> >
> > BUT, the SDM doesn't document this approach, so it would be great if any
> > CPU people can confirm that it's safe!
>
> I (still) think that having a compiler plugin can make things much cleaner
> (as done in [1]). The callers would not need to be changed, and the key can
> be provided through an attribute.
>
> Using a plugin should also allow to use Steven’s proposal for doing
> text_poke() safely: by changing 'func()' into 'asm (“call func”)', as done
> by the plugin, you can be guaranteed that registers are clobbered. Then, you
> can store in the assembly block the return address in one of these
> registers.

I'm no GCC expert (why do I find myself saying this a lot lately?), but
this sounds to me like it could be tricky to get right.

I suppose you'd have to do it in an early pass, to allow GCC to clobber
the registers in a later pass. So it would necessarily have side
effects, but I don't know what the risks are.

Would it work with more than 5 arguments, where args get passed on the
stack?

At the very least, it would (at least partially) defeat the point of the
callee-saved paravirt ops.

What if we just used a plugin in a simpler fashion -- to do call site
alignment, if necessary, to ensure the instruction doesn't cross
cacheline boundaries. This could be done in a later pass, with no side
effects other than code layout. And it would allow us to avoid
breakpoints altogether -- again, assuming somebody can verify that
intra-cacheline call destination writes are atomic with respect to
instruction decoder reads.

-- Josh
Re: [PATCH v3 0/6] Static calls
> On Jan 9, 2019, at 2:59 PM, Josh Poimboeuf wrote:
>
> With this version, I stopped trying to use text_poke_bp(), and instead
> went with a different approach: if the call site destination doesn't
> cross a cacheline boundary, just do an atomic write. Otherwise, keep
> using the trampoline indefinitely.
>
> NOTE: At least experimentally, the call destination writes seem to be
> atomic with respect to instruction fetching. On Nehalem I can easily
> trigger crashes when writing a call destination across cachelines while
> reading the instruction on other CPU; but I get no such crashes when
> respecting cacheline boundaries.
>
> BUT, the SDM doesn't document this approach, so it would be great if any
> CPU people can confirm that it's safe!

I (still) think that having a compiler plugin can make things much cleaner
(as done in [1]). The callers would not need to be changed, and the key can
be provided through an attribute.

Using a plugin should also allow using Steven’s proposal for doing
text_poke() safely: by changing 'func()' into 'asm (“call func”)', as done
by the plugin, you can be guaranteed that registers are clobbered. Then,
you can store the return address in one of these registers in the assembly
block.

[1] https://lkml.org/lkml/2018/12/31/25
[PATCH v3 0/6] Static calls
With this version, I stopped trying to use text_poke_bp(), and instead
went with a different approach: if the call site destination doesn't
cross a cacheline boundary, just do an atomic write. Otherwise, keep
using the trampoline indefinitely.

NOTE: At least experimentally, the call destination writes seem to be
atomic with respect to instruction fetching. On Nehalem I can easily
trigger crashes when writing a call destination across cachelines while
reading the instruction on another CPU; but I get no such crashes when
respecting cacheline boundaries.

BUT, the SDM doesn't document this approach, so it would be great if any
CPU people can confirm that it's safe!

v3:
- Split up the patches a bit more so that out-of-line static calls can
  be separately mergeable. Inline is more controversial, and other
  approaches or improvements might be considered. For example, Nadav is
  looking at implementing it with the help of a GCC plugin to ensure the
  call sites don't cross cacheline boundaries.
- Get rid of the use of text_poke_bp(), in favor of atomic writes.
  Out-of-line calls will be promoted to inline only if the call sites
  don't cross cache line boundaries. [Linus/Andy]
- Converge the inline and out-of-line trampolines into a single
  implementation, which uses a direct jump. This was made possible by
  making static_call_update() safe to be called during early boot.
- Remove trampoline poisoning for now, since trampolines may still be
  needed for call sites which cross cache line boundaries.
- Rename CONFIG_HAVE_STATIC_CALL_OUTLINE -> CONFIG_HAVE_STATIC_CALL [Steven]
- Add missing __static_call_update() call to static_call_update() [Edward]
- Add missing key->func update in __static_call_update() [Edward]
- Put trampoline in a separate section to prevent 2-byte tail calls [Linus]

v2:
- fix STATIC_CALL_TRAMP() macro by using __PASTE() [Ard]
- rename optimized/unoptimized -> inline/out-of-line [Ard]
- tweak arch interfaces for PLT and add key->tramp field [Ard]
- rename 'poison' to 'defuse' and do it after all sites have been
  patched [Ard]
- fix .init handling [Ard, Steven]
- add CONFIG_HAVE_STATIC_CALL [Steven]
- make interfaces more consistent across configs to allow tracepoints
  to use them [Steven]
- move __ADDRESSABLE() to static_call() macro [Steven]
- prevent 2-byte jumps [Steven]
- add offset to asm-offsets.c instead of hard coding key->func offset
- add kernel_text_address() sanity check
- make __ADDRESSABLE() symbols truly unique

Static calls are a replacement for global function pointers. They use
code patching to allow direct calls to be used instead of indirect
calls. They give the flexibility of function pointers, but with
improved performance. This is especially important for cases where
retpolines would otherwise be used, as retpolines can significantly
impact performance.
The concept and code are an extension of previous work done by Ard
Biesheuvel and Steven Rostedt:

  https://lkml.kernel.org/r/20181005081333.15018-1-ard.biesheu...@linaro.org
  https://lkml.kernel.org/r/20181006015110.653946...@goodmis.org

There are three implementations, depending on arch support:

  1) basic function pointers
  2) out-of-line: patched trampolines (CONFIG_HAVE_STATIC_CALL)
  3) inline: patched call sites (CONFIG_HAVE_STATIC_CALL_INLINE)

Josh Poimboeuf (6):
  compiler.h: Make __ADDRESSABLE() symbol truly unique
  static_call: Add basic static call infrastructure
  x86/static_call: Add out-of-line static call implementation
  static_call: Add inline static call infrastructure
  x86/alternative: Use a single access in text_poke() where possible
  x86/static_call: Add inline static call implementation for x86-64

 arch/Kconfig                       |   7 +
 arch/x86/Kconfig                   |   4 +-
 arch/x86/include/asm/static_call.h |  33 ++
 arch/x86/kernel/Makefile           |   1 +
 arch/x86/kernel/alternative.c      |  31 +-
 arch/x86/kernel/static_call.c      |  57
 arch/x86/kernel/vmlinux.lds.S      |   1 +
 include/asm-generic/vmlinux.lds.h  |  15 +
 include/linux/compiler.h           |   2 +-
 include/linux/module.h             |  10 +
 include/linux/static_call.h        | 196 +++
 include/linux/static_call_types.h  |  22 ++
 kernel/Makefile                    |   1 +
 kernel/module.c                    |   5 +
 kernel/static_call.c               | 316 ++
 scripts/Makefile.build             |   3 +
 tools/objtool/Makefile             |   3 +-
 tools/objtool/builtin-check.c      |   3 +-
 tools/objtool/builtin.h            |   2 +-
 tools/objtool/check.c              | 131 +++-
 tools/objtool/check.h              |   2 +
 tools/objtool/elf.h                |   1 +