On 1/14/19 9:01 PM, H. Peter Anvin wrote:
>
> This could be as simple as spinning for a limited time waiting for
> states 0 or 3 if we are not the patching CPU. It is also not necessary
> to wait for the mask to become zero for the first sync if we find
> ourselves suddenly in state 4.
>
So
On 1/14/19 7:05 PM, Andy Lutomirski wrote:
> On Mon, Jan 14, 2019 at 2:55 PM H. Peter Anvin wrote:
>>
>> I think this sequence ought to work (keep in mind we are already under a
>> mutex, so the global data is safe even if we are preempted):
>
> I'm trying to wrap my head around this. The
On Mon, Jan 14, 2019 at 2:55 PM H. Peter Anvin wrote:
>
> I think this sequence ought to work (keep in mind we are already under a
> mutex, so the global data is safe even if we are preempted):
I'm trying to wrap my head around this. The states are:
0: normal operation
1: writing 0xcc, can be
On January 14, 2019 3:27:55 PM PST, Andy Lutomirski wrote:
>On Mon, Jan 14, 2019 at 2:01 PM H. Peter Anvin wrote:
>>
>> So I was already in the middle of composing this message when Andy
>posted:
>>
>> > I don't even think this is sufficient. I think we also need
>everyone
>> > who clears the
> On Jan 14, 2019, at 3:27 PM, Andy Lutomirski wrote:
>
> On Mon, Jan 14, 2019 at 2:01 PM H. Peter Anvin wrote:
>> So I was already in the middle of composing this message when Andy posted:
>>
>>> I don't even think this is sufficient. I think we also need everyone
>>> who clears the bit to
On Mon, Jan 14, 2019 at 2:01 PM H. Peter Anvin wrote:
>
> So I was already in the middle of composing this message when Andy posted:
>
> > I don't even think this is sufficient. I think we also need everyone
> > who clears the bit to check if all bits are clear and, if so, remove
> > the
I think this sequence ought to work (keep in mind we are already under a
mutex, so the global data is safe even if we are preempted):
set up page table entries
invlpg
set up bp patching global data
cpu = get_cpu()
bp_old_value = atomic_read(bp_write_addr)
So I was already in the middle of composing this message when Andy posted:
> I don't even think this is sufficient. I think we also need everyone
> who clears the bit to check if all bits are clear and, if so, remove
> the breakpoint. Otherwise we have a situation where, if you are in
>
On Sun, Jan 13, 2019 at 6:41 PM H. Peter Anvin wrote:
>
> On 1/13/19 6:31 PM, H. Peter Anvin wrote:
> >
> > static cpumask_t text_poke_cpumask;
> >
> > static void text_poke_sync(void)
> > {
> > smp_wmb();
> > text_poke_cpumask = cpu_online_mask;
> > smp_wmb(); /* Should be
On Fri, Jan 11, 2019 at 01:05:20PM -0800, Andy Lutomirski wrote:
> On Fri, Jan 11, 2019 at 12:54 PM Linus Torvalds
> wrote:
> >
> > On Fri, Jan 11, 2019 at 12:31 PM Josh Poimboeuf wrote:
> > >
> > > I was referring to the fact that a single static call key update will
> > > usually result in
On 1/13/19 6:31 PM, H. Peter Anvin wrote:
>
> static cpumask_t text_poke_cpumask;
>
> static void text_poke_sync(void)
> {
> smp_wmb();
> text_poke_cpumask = cpu_online_mask;
> smp_wmb(); /* Should be optional on x86 */
> cpumask_clear_cpu(smp_processor_id(), &text_poke_cpumask);
On 1/11/19 11:39 AM, Jiri Kosina wrote:
> On Fri, 11 Jan 2019, h...@zytor.com wrote:
>
>> I still don't see why can't simply spin in the #BP handler until the
>> patch is complete.
>
> I think this brings us to the already discussed possible deadlock, when
> one CPU#0 is in the middle of
On January 11, 2019 11:34:34 AM PST, Linus Torvalds
wrote:
>On Fri, Jan 11, 2019 at 11:24 AM wrote:
>>
>> I still don't see why can't simply spin in the #BP handler until the
>patch is complete.
>
>So here's at least one problem:
>
>text_poke_bp()
> text_poke(addr, &int3, sizeof(int3));
>
On Fri, Jan 11, 2019 at 1:22 PM Josh Poimboeuf wrote:
>
> On Fri, Jan 11, 2019 at 12:46:39PM -0800, Linus Torvalds wrote:
> > On Fri, Jan 11, 2019 at 12:31 PM Josh Poimboeuf wrote:
> > >
> > > I was referring to the fact that a single static call key update will
> > > usually result in patching
> On Jan 11, 2019, at 1:55 PM, Steven Rostedt wrote:
>
> On Fri, 11 Jan 2019 15:41:22 -0600
> Josh Poimboeuf wrote:
>
>>> I don’t see RCU-sched solves the problem if you don’t disable preemption. On
>>> a fully preemptable kernel, you can get preempted between the push and the
>>> call (jmp)
> On Jan 11, 2019, at 1:41 PM, Josh Poimboeuf wrote:
>
> On Fri, Jan 11, 2019 at 09:36:59PM +0000, Nadav Amit wrote:
>>> On Jan 11, 2019, at 1:22 PM, Josh Poimboeuf wrote:
>>>
>>> On Fri, Jan 11, 2019 at 12:46:39PM -0800, Linus Torvalds wrote:
On Fri, Jan 11, 2019 at 12:31 PM Josh
On Fri, 11 Jan 2019 15:41:22 -0600
Josh Poimboeuf wrote:
> > I don’t see RCU-sched solves the problem if you don’t disable preemption. On
> > a fully preemptable kernel, you can get preempted between the push and the
> > call (jmp) or before the push. RCU-sched can then finish, and the preempted
On Fri, Jan 11, 2019 at 09:36:59PM +0000, Nadav Amit wrote:
> > On Jan 11, 2019, at 1:22 PM, Josh Poimboeuf wrote:
> >
> > On Fri, Jan 11, 2019 at 12:46:39PM -0800, Linus Torvalds wrote:
> >> On Fri, Jan 11, 2019 at 12:31 PM Josh Poimboeuf
> >> wrote:
> >>> I was referring to the fact that a
> On Jan 11, 2019, at 1:22 PM, Josh Poimboeuf wrote:
>
> On Fri, Jan 11, 2019 at 12:46:39PM -0800, Linus Torvalds wrote:
>> On Fri, Jan 11, 2019 at 12:31 PM Josh Poimboeuf wrote:
>>> I was referring to the fact that a single static call key update will
>>> usually result in patching multiple
On Fri, Jan 11, 2019 at 01:10:47PM -0800, Linus Torvalds wrote:
> On Fri, Jan 11, 2019 at 1:05 PM Andy Lutomirski wrote:
> >
> > > Yeah, my suggestion doesn't allow for batching, since it would
> > > basically generate one trampoline for every rewritten instruction.
> >
> > Sure it does. Just
On Fri, Jan 11, 2019 at 03:22:10PM -0600, Josh Poimboeuf wrote:
> b) Adding a gap to the #DB entry stack
And that should be #BP of course... Not sure why my fingers keep doing
that!
--
Josh
On Fri, Jan 11, 2019 at 03:22:10PM -0600, Josh Poimboeuf wrote:
> trampoline:
> push %rax
> call patched-dest
That should be a JMP of course.
--
Josh
On Fri, Jan 11, 2019 at 12:46:39PM -0800, Linus Torvalds wrote:
> On Fri, Jan 11, 2019 at 12:31 PM Josh Poimboeuf wrote:
> >
> > I was referring to the fact that a single static call key update will
> > usually result in patching multiple call sites. But you're right, it's
> > only 1-2
On Fri, Jan 11, 2019 at 1:05 PM Andy Lutomirski wrote:
>
> > Yeah, my suggestion doesn't allow for batching, since it would
> > basically generate one trampoline for every rewritten instruction.
>
> Sure it does. Just make 1000 trampolines and patch 1000 sites in a
> batch :) As long as the
On Fri, Jan 11, 2019 at 12:54 PM Linus Torvalds
wrote:
>
> On Fri, Jan 11, 2019 at 12:31 PM Josh Poimboeuf wrote:
> >
> > I was referring to the fact that a single static call key update will
> > usually result in patching multiple call sites. But you're right, it's
> > only 1-2 trampolines per
On Fri, Jan 11, 2019 at 12:31 PM Josh Poimboeuf wrote:
>
> I was referring to the fact that a single static call key update will
> usually result in patching multiple call sites. But you're right, it's
> only 1-2 trampolines per text_poke_bp() invocation. Though eventually
> we may want to
On Fri, Jan 11, 2019 at 12:12:30PM -0800, Linus Torvalds wrote:
> On Fri, Jan 11, 2019 at 12:04 PM Josh Poimboeuf wrote:
> >
> > But really, to me, having to create and manage all those custom
> > trampolines still feels a lot more complex than just making a gap on the
> > stack.
>
> There are
On Fri, Jan 11, 2019 at 12:04 PM Josh Poimboeuf wrote:
>
> But really, to me, having to create and manage all those custom
> trampolines still feels a lot more complex than just making a gap on the
> stack.
There are no "all those custom trampolines".
There is literally *one* custom trampoline
On Fri, Jan 11, 2019 at 11:03:30AM -0800, Linus Torvalds wrote:
> The we'd change the end of poke_int3_handler() to do something like
> this instead:
>
> void *newip = bp_int3_handler;
> ..
> if (new == magic_static_call_bp_int3_handler) {
> if (regs->flags
On Fri, 11 Jan 2019, h...@zytor.com wrote:
> I still don't see why can't simply spin in the #BP handler until the
> patch is complete.
I think this brings us to the already discussed possible deadlock, when
one CPU#0 is in the middle of text_poke_bp(), CPU#1 is spinning inside
On Fri, Jan 11, 2019 at 11:24 AM wrote:
>
> I still don't see why can't simply spin in the #BP handler until the patch is
> complete.
So here's at least one problem:
text_poke_bp()
text_poke(addr, &int3, sizeof(int3));
*interrupt*
interrupt has a static call
*BP*
> On Jan 11, 2019, at 11:23 AM, h...@zytor.com wrote:
>
> On January 11, 2019 11:03:30 AM PST, Linus Torvalds
> wrote:
>> On Fri, Jan 11, 2019 at 7:15 AM Josh Poimboeuf
>> wrote:
>>> Now, in the int3 handler can you take the faulting RIP and search
>>> for it in
>>> the “static-calls” table,
On January 11, 2019 11:03:30 AM PST, Linus Torvalds
wrote:
>On Fri, Jan 11, 2019 at 7:15 AM Josh Poimboeuf
>wrote:
>>
>> >
>> > Now, in the int3 handler can you take the faulting RIP and search
>for it in
>> > the “static-calls” table, writing the RIP+5 (offset) into R10
>(return
>> > address)
> On Jan 11, 2019, at 11:03 AM, Linus Torvalds
> wrote:
>
> On Fri, Jan 11, 2019 at 7:15 AM Josh Poimboeuf wrote:
>>> Now, in the int3 handler can you take the faulting RIP and search for it in
>>> the “static-calls” table, writing the RIP+5 (offset) into R10 (return
>>> address) and the
On Fri, Jan 11, 2019 at 7:15 AM Josh Poimboeuf wrote:
>
> >
> > Now, in the int3 handler can you take the faulting RIP and search for it in
> > the “static-calls” table, writing the RIP+5 (offset) into R10 (return
> > address) and the target into R11. You make the int3 handler to divert the
> >
> On Jan 11, 2019, at 8:07 AM, Josh Poimboeuf wrote:
>
> On Fri, Jan 11, 2019 at 03:48:59PM +0000, Nadav Amit wrote:
>>> I liked the idea, BUT, how would it work for callee-saved PV ops? In
>>> that case there's only one clobbered register to work with (rax).
>>
>> That’s would be more tricky.
On Fri, Jan 11, 2019 at 03:48:59PM +0000, Nadav Amit wrote:
> > I liked the idea, BUT, how would it work for callee-saved PV ops? In
> > that case there's only one clobbered register to work with (rax).
>
> That’s would be more tricky. How about using a per-CPU trampoline code to
> hold a direct
> On Jan 11, 2019, at 7:15 AM, Josh Poimboeuf wrote:
>
> On Fri, Jan 11, 2019 at 01:47:01AM +0000, Nadav Amit wrote:
>> Here is an alternative idea (although similar to Steven’s and my code).
>>
>> Assume that we always clobber R10, R11 on static-calls explicitly, as anyhow
>> should be done by
On Fri, Jan 11, 2019 at 01:47:01AM +0000, Nadav Amit wrote:
> Here is an alternative idea (although similar to Steven’s and my code).
>
> Assume that we always clobber R10, R11 on static-calls explicitly, as anyhow
> should be done by the calling convention (and gcc plugin should allow us to
>
> On Jan 10, 2019, at 4:56 PM, Andy Lutomirski wrote:
>
> On Thu, Jan 10, 2019 at 3:02 PM Linus Torvalds
> wrote:
>> On Thu, Jan 10, 2019 at 12:52 PM Josh Poimboeuf wrote:
>>> Right, emulating a call instruction from the #BP handler is ugly,
>>> because you have to somehow grow the stack to
On Thu, Jan 10, 2019 at 3:02 PM Linus Torvalds
wrote:
>
> On Thu, Jan 10, 2019 at 12:52 PM Josh Poimboeuf wrote:
> >
> > Right, emulating a call instruction from the #BP handler is ugly,
> > because you have to somehow grow the stack to make room for the return
> > address. Personally I liked
On Thu, Jan 10, 2019 at 12:52 PM Josh Poimboeuf wrote:
>
> Right, emulating a call instruction from the #BP handler is ugly,
> because you have to somehow grow the stack to make room for the return
> address. Personally I liked the idea of shifting the iret frame by 16
> bytes in the #DB entry
> On Jan 10, 2019, at 12:57 PM, Josh Poimboeuf wrote:
>
> On Thu, Jan 10, 2019 at 08:48:31PM +0000, Nadav Amit wrote:
>>> On Jan 10, 2019, at 12:32 PM, Josh Poimboeuf wrote:
>>>
>>> On Thu, Jan 10, 2019 at 07:45:26PM +0000, Nadav Amit wrote:
>> I’m not GCC expert either and writing this
On Thu, Jan 10, 2019 at 08:48:31PM +0000, Nadav Amit wrote:
> > On Jan 10, 2019, at 12:32 PM, Josh Poimboeuf wrote:
> >
> > On Thu, Jan 10, 2019 at 07:45:26PM +0000, Nadav Amit wrote:
> I’m not GCC expert either and writing this code was not making me full of
> joy, etc.. I’ll be happy
On 1/10/19 9:31 AM, Linus Torvalds wrote:
> On Wed, Jan 9, 2019 at 2:59 PM Josh Poimboeuf wrote:
>>
>> NOTE: At least experimentally, the call destination writes seem to be
>> atomic with respect to instruction fetching. On Nehalem I can easily
>> trigger crashes when writing a call destination
On Thu, Jan 10, 2019 at 09:30:23PM +0100, Peter Zijlstra wrote:
> On Wed, Jan 09, 2019 at 04:59:35PM -0600, Josh Poimboeuf wrote:
> > With this version, I stopped trying to use text_poke_bp(), and instead
> > went with a different approach: if the call site destination doesn't
> > cross a
> On Jan 10, 2019, at 12:32 PM, Josh Poimboeuf wrote:
>
> On Thu, Jan 10, 2019 at 07:45:26PM +, Nadav Amit wrote:
I’m not GCC expert either and writing this code was not making me full of
joy, etc.. I’ll be happy that my code would be reviewed, but it does work.
I
don’t
On Thu, Jan 10, 2019 at 07:45:26PM +0000, Nadav Amit wrote:
> >> I’m not GCC expert either and writing this code was not making me full of
> >> joy, etc.. I’ll be happy that my code would be reviewed, but it does work.
> >> I
> >> don’t think an early pass is needed, as long as hardware registers
On Wed, Jan 09, 2019 at 04:59:35PM -0600, Josh Poimboeuf wrote:
> With this version, I stopped trying to use text_poke_bp(), and instead
> went with a different approach: if the call site destination doesn't
> cross a cacheline boundary, just do an atomic write. Otherwise, keep
> using the
> On Jan 10, 2019, at 10:18 AM, Josh Poimboeuf wrote:
>
> On Thu, Jan 10, 2019 at 05:32:08PM +0000, Nadav Amit wrote:
>>> On Jan 10, 2019, at 8:44 AM, Josh Poimboeuf wrote:
>>>
>>> On Thu, Jan 10, 2019 at 01:21:00AM +0000, Nadav Amit wrote:
> On Jan 9, 2019, at 2:59 PM, Josh Poimboeuf
On Thu, Jan 10, 2019 at 05:32:08PM +0000, Nadav Amit wrote:
> > On Jan 10, 2019, at 8:44 AM, Josh Poimboeuf wrote:
> >
> > On Thu, Jan 10, 2019 at 01:21:00AM +0000, Nadav Amit wrote:
> >>> On Jan 9, 2019, at 2:59 PM, Josh Poimboeuf wrote:
> >>>
> >>> With this version, I stopped trying to use
> On Jan 10, 2019, at 8:44 AM, Josh Poimboeuf wrote:
>
> On Thu, Jan 10, 2019 at 01:21:00AM +0000, Nadav Amit wrote:
>>> On Jan 9, 2019, at 2:59 PM, Josh Poimboeuf wrote:
>>>
>>> With this version, I stopped trying to use text_poke_bp(), and instead
>>> went with a different approach: if the
On Wed, Jan 9, 2019 at 2:59 PM Josh Poimboeuf wrote:
>
> NOTE: At least experimentally, the call destination writes seem to be
> atomic with respect to instruction fetching. On Nehalem I can easily
> trigger crashes when writing a call destination across cachelines while
> reading the
On Thu, Jan 10, 2019 at 01:21:00AM +0000, Nadav Amit wrote:
> > On Jan 9, 2019, at 2:59 PM, Josh Poimboeuf wrote:
> >
> > With this version, I stopped trying to use text_poke_bp(), and instead
> > went with a different approach: if the call site destination doesn't
> > cross a cacheline
> On Jan 9, 2019, at 2:59 PM, Josh Poimboeuf wrote:
>
> With this version, I stopped trying to use text_poke_bp(), and instead
> went with a different approach: if the call site destination doesn't
> cross a cacheline boundary, just do an atomic write. Otherwise, keep
> using the trampoline