Re: [Xen-devel] [PATCH 11/13] x86/paravirt: Add paravirt alternatives infrastructure

2017-10-17 Thread Josh Poimboeuf
On Tue, Oct 17, 2017 at 04:59:41PM -0400, Boris Ostrovsky wrote:
> On 10/17/2017 04:50 PM, Josh Poimboeuf wrote:
> > On Tue, Oct 17, 2017 at 04:36:00PM -0400, Boris Ostrovsky wrote:
> >> On 10/17/2017 04:17 PM, Josh Poimboeuf wrote:
> >>> On Tue, Oct 17, 2017 at 11:36:57AM -0400, Boris Ostrovsky wrote:
>  On 10/17/2017 10:36 AM, Josh Poimboeuf wrote:
> > Maybe we can add a new field to the alternatives entry struct which
> > specifies the offset to the CALL instruction, so apply_alternatives()
> > can find it.
>  We'd also have to assume that the restore part of an alternative entry
>  is the same size as the save part. Which is true now.
> >>> Why?
> >>>
> >> Don't you need to know the size of the instruction without save and
> >> restore part?
> >>
> >> + if (a->replacementlen == 6 && *insnbuf == 0xff && *(insnbuf+1) == 0x15)
> >>
> >> Otherwise you'd need another field for the actual instruction length.
> > If we know where the CALL instruction starts, and can verify that it
> > starts with "ff 15", then we know the instruction length: 6 bytes.
> > Right?
> >
> 
> Oh, OK. Then you shouldn't need a->replacementlen test(s?) in
> apply_alternatives()?

Right.  Though in the above code it was needed for a different reason,
to make sure it wasn't reading past the end of the buffer.

-- 
Josh

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH 11/13] x86/paravirt: Add paravirt alternatives infrastructure

2017-10-17 Thread Boris Ostrovsky
On 10/17/2017 04:50 PM, Josh Poimboeuf wrote:
> On Tue, Oct 17, 2017 at 04:36:00PM -0400, Boris Ostrovsky wrote:
>> On 10/17/2017 04:17 PM, Josh Poimboeuf wrote:
>>> On Tue, Oct 17, 2017 at 11:36:57AM -0400, Boris Ostrovsky wrote:
 On 10/17/2017 10:36 AM, Josh Poimboeuf wrote:
> Maybe we can add a new field to the alternatives entry struct which
> specifies the offset to the CALL instruction, so apply_alternatives()
> can find it.
 We'd also have to assume that the restore part of an alternative entry
 is the same size as the save part. Which is true now.
>>> Why?
>>>
>> Don't you need to know the size of the instruction without save and
>> restore part?
>>
>> + if (a->replacementlen == 6 && *insnbuf == 0xff && *(insnbuf+1) == 0x15)
>>
>> Otherwise you'd need another field for the actual instruction length.
> If we know where the CALL instruction starts, and can verify that it
> starts with "ff 15", then we know the instruction length: 6 bytes.
> Right?
>

Oh, OK. Then you shouldn't need a->replacementlen test(s?) in
apply_alternatives()?

-boris

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH 11/13] x86/paravirt: Add paravirt alternatives infrastructure

2017-10-17 Thread Josh Poimboeuf
On Tue, Oct 17, 2017 at 04:36:00PM -0400, Boris Ostrovsky wrote:
> On 10/17/2017 04:17 PM, Josh Poimboeuf wrote:
> > On Tue, Oct 17, 2017 at 11:36:57AM -0400, Boris Ostrovsky wrote:
> >> On 10/17/2017 10:36 AM, Josh Poimboeuf wrote:
> >>> Maybe we can add a new field to the alternatives entry struct which
> >>> specifies the offset to the CALL instruction, so apply_alternatives()
> >>> can find it.
> >> We'd also have to assume that the restore part of an alternative entry
> >> is the same size as the save part. Which is true now.
> > Why?
> >
> 
> Don't you need to know the size of the instruction without save and
> restore part?
> 
> + if (a->replacementlen == 6 && *insnbuf == 0xff && *(insnbuf+1) == 0x15)
> 
> Otherwise you'd need another field for the actual instruction length.

If we know where the CALL instruction starts, and can verify that it
starts with "ff 15", then we know the instruction length: 6 bytes.
Right?

-- 
Josh

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH 11/13] x86/paravirt: Add paravirt alternatives infrastructure

2017-10-17 Thread Boris Ostrovsky
On 10/17/2017 04:17 PM, Josh Poimboeuf wrote:
> On Tue, Oct 17, 2017 at 11:36:57AM -0400, Boris Ostrovsky wrote:
>> On 10/17/2017 10:36 AM, Josh Poimboeuf wrote:
>>> Maybe we can add a new field to the alternatives entry struct which
>>> specifies the offset to the CALL instruction, so apply_alternatives()
>>> can find it.
>> We'd also have to assume that the restore part of an alternative entry
>> is the same size as the save part. Which is true now.
> Why?
>

Don't you need to know the size of the instruction without save and
restore part?

+ if (a->replacementlen == 6 && *insnbuf == 0xff && *(insnbuf+1) == 0x15)

Otherwise you'd need another field for the actual instruction length.

-boris

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH 11/13] x86/paravirt: Add paravirt alternatives infrastructure

2017-10-17 Thread Josh Poimboeuf
On Tue, Oct 17, 2017 at 11:36:57AM -0400, Boris Ostrovsky wrote:
> On 10/17/2017 10:36 AM, Josh Poimboeuf wrote:
> >
> > Maybe we can add a new field to the alternatives entry struct which
> > specifies the offset to the CALL instruction, so apply_alternatives()
> > can find it.
> 
> We'd also have to assume that the restore part of an alternative entry
> is the same size as the save part. Which is true now.

Why?

-- 
Josh

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH 11/13] x86/paravirt: Add paravirt alternatives infrastructure

2017-10-17 Thread Boris Ostrovsky
On 10/17/2017 10:36 AM, Josh Poimboeuf wrote:
>
> Maybe we can add a new field to the alternatives entry struct which
> specifies the offset to the CALL instruction, so apply_alternatives()
> can find it.

We'd also have to assume that the restore part of an alternative entry
is the same size as the save part. Which is true now.

-boris


___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH 11/13] x86/paravirt: Add paravirt alternatives infrastructure

2017-10-17 Thread Josh Poimboeuf
On Tue, Oct 17, 2017 at 09:58:59AM -0400, Boris Ostrovsky wrote:
> On 10/17/2017 01:24 AM, Josh Poimboeuf wrote:
> > On Mon, Oct 16, 2017 at 02:18:48PM -0400, Boris Ostrovsky wrote:
> >> On 10/12/2017 03:53 PM, Boris Ostrovsky wrote:
> >>> On 10/12/2017 03:27 PM, Andrew Cooper wrote:
>  On 12/10/17 20:11, Boris Ostrovsky wrote:
> > There is also another problem:
> >
> > [1.312425] general protection fault:  [#1] SMP
> > [1.312901] Modules linked in:
> > [1.313389] CPU: 0 PID: 1 Comm: init Not tainted 4.14.0-rc4+ #6
> > [1.313878] task: 88003e2c task.stack: c938c000
> > [1.314360] RIP: 1e030:entry_SYSCALL_64_fastpath+0x1/0xa5
> > [1.314854] RSP: e02b:c938ff50 EFLAGS: 00010046
> > [1.315336] RAX: 000c RBX: 55f550168040 RCX:
> > 7fcfc959f59a
> > [1.315827] RDX:  RSI:  RDI:
> > 
> > [1.316315] RBP: 000a R08: 037f R09:
> > 0064
> > [1.316805] R10: 1f89cbf5 R11: 88003e2c R12:
> > 7fcfc958ad60
> > [1.317300] R13:  R14: 55f550185954 R15:
> > 1000
> > [1.317801] FS:  () GS:88003f80()
> > knlGS:
> > [1.318267] CS:  e033 DS:  ES:  CR0: 80050033
> > [1.318750] CR2: 7fcfc97ab218 CR3: 3c88e000 CR4:
> > 00042660
> > [1.319235] Call Trace:
> > [1.319700] Code: 51 50 57 56 52 51 6a da 41 50 41 51 41 52 41 53 48
> > 83 ec 30 65 4c 8b 1c 25 c0 d2 00 00 41 f7 03 df 39 08 90 0f 85 a5 00 00
> > 00 50  15 9c 95 d0 ff 58 48 3d 4c 01 00 00 77 0f 4c 89 d1 ff 14 c5
> > [1.321161] RIP: entry_SYSCALL_64_fastpath+0x1/0xa5 RSP: 
> > c938ff50
> > [1.344255] ---[ end trace d7cb8cd6cd7c294c ]---
> > [1.345009] Kernel panic - not syncing: Attempted to kill init!
> > exitcode=0x000b
> >
> >
> > All code
> > 
> >0:51   push   %rcx
> >1:50   push   %rax
> >2:57   push   %rdi
> >3:56   push   %rsi
> >4:52   push   %rdx
> >5:51   push   %rcx
> >6:6a dapushq  $0xffda
> >8:41 50push   %r8
> >a:41 51push   %r9
> >c:41 52push   %r10
> >e:41 53push   %r11
> >   10:48 83 ec 30  sub$0x30,%rsp
> >   14:65 4c 8b 1c 25 c0 d2 mov%gs:0xd2c0,%r11
> >   1b:00 00
> >   1d:41 f7 03 df 39 08 90 testl  $0x900839df,(%r11)
> >   24:0f 85 a5 00 00 00jne0xcf
> >   2a:50   push   %rax
> >   2b:*ff 15 9c 95 d0 ffcallq  *-0x2f6a64(%rip)#
> > 0xffd095cd<-- trapping instruction
> >   31:58   pop%rax
> >   32:48 3d 4c 01 00 00cmp$0x14c,%rax
> >   38:77 0fja 0x49
> >   3a:4c 89 d1 mov%r10,%rcx
> >   3d:ff   .byte 0xff
> >   3e:14 c5adc$0xc5,%al
> >
> >
> > so the original 'cli' was replaced with the pv call but to me the offset
> > looks a bit off, no? Shouldn't it always be positive?
>  callq takes a 32bit signed displacement, so jumping back by up to 2G is
>  perfectly legitimate.
> >>> Yes, but
> >>>
> >>> ostr@workbase> nm vmlinux | grep entry_SYSCALL_64_fastpath
> >>> 817365dd t entry_SYSCALL_64_fastpath
> >>> ostr@workbase> nm vmlinux | grep " pv_irq_ops"
> >>> 81c2dbc0 D pv_irq_ops
> >>> ostr@workbase>
> >>>
> >>> so pv_irq_ops.irq_disable is about 5MB ahead of where we are now. (I
> >>> didn't mean that x86 instruction set doesn't allow negative
> >>> displacement, I was trying to say that pv_irq_ops always live further 
> >>> down)
> >> I believe the problem is this:
> >>
> >> #define PV_INDIRECT(addr)   *addr(%rip)
> >>
> >> The displacement that the linker computes will be relative to the where
> >> this instruction is placed at the time of linking, which is in
> >> .pv_altinstructions (and not .text). So when we copy it into .text the
> >> displacement becomes bogus.
> > apply_alternatives() is supposed to adjust that displacement based on
> > the new IP, though it could be messing that up somehow.  (See patch
> > 10/13.)
> >
> 
> That patch doesn't take into account the fact that replacement
> instructions may have to save/restore registers. So, for example,
> 
> 
> -if (a->replacementlen && is_jmp(replacement[0]))
> +} 

Re: [Xen-devel] [PATCH 11/13] x86/paravirt: Add paravirt alternatives infrastructure

2017-10-17 Thread Boris Ostrovsky
On 10/17/2017 09:10 AM, Brian Gerst wrote:
> On Mon, Oct 16, 2017 at 2:18 PM, Boris Ostrovsky
>  wrote:
>>
>> Replacing the macro with
>>
>> #define PV_INDIRECT(addr)   *addr  // well, it's not so much
>> indirect anymore
>>
>> makes things work. Or maybe it can be adjusted top be kept truly indirect.
> That is still an indirect call, just using absolute addressing for the
> pointer instead of RIP-relative. 

Oh, right, I've got my terminology all wrong.

-boris

>  Alternatives has very limited
> relocation capabilities.  It will only handle a single call or jmp
> replacement. Using absolute addressing is slightly less efficient
> (takes one extra byte to encode, and needs a relocation for KASLR),
> but it works just as well.  You could also relocate the instruction
> manually by adding the delta between the original and replacement code
> to the displacement.


___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH 11/13] x86/paravirt: Add paravirt alternatives infrastructure

2017-10-17 Thread Boris Ostrovsky
On 10/17/2017 01:24 AM, Josh Poimboeuf wrote:
> On Mon, Oct 16, 2017 at 02:18:48PM -0400, Boris Ostrovsky wrote:
>> On 10/12/2017 03:53 PM, Boris Ostrovsky wrote:
>>> On 10/12/2017 03:27 PM, Andrew Cooper wrote:
 On 12/10/17 20:11, Boris Ostrovsky wrote:
> There is also another problem:
>
> [1.312425] general protection fault:  [#1] SMP
> [1.312901] Modules linked in:
> [1.313389] CPU: 0 PID: 1 Comm: init Not tainted 4.14.0-rc4+ #6
> [1.313878] task: 88003e2c task.stack: c938c000
> [1.314360] RIP: 1e030:entry_SYSCALL_64_fastpath+0x1/0xa5
> [1.314854] RSP: e02b:c938ff50 EFLAGS: 00010046
> [1.315336] RAX: 000c RBX: 55f550168040 RCX:
> 7fcfc959f59a
> [1.315827] RDX:  RSI:  RDI:
> 
> [1.316315] RBP: 000a R08: 037f R09:
> 0064
> [1.316805] R10: 1f89cbf5 R11: 88003e2c R12:
> 7fcfc958ad60
> [1.317300] R13:  R14: 55f550185954 R15:
> 1000
> [1.317801] FS:  () GS:88003f80()
> knlGS:
> [1.318267] CS:  e033 DS:  ES:  CR0: 80050033
> [1.318750] CR2: 7fcfc97ab218 CR3: 3c88e000 CR4:
> 00042660
> [1.319235] Call Trace:
> [1.319700] Code: 51 50 57 56 52 51 6a da 41 50 41 51 41 52 41 53 48
> 83 ec 30 65 4c 8b 1c 25 c0 d2 00 00 41 f7 03 df 39 08 90 0f 85 a5 00 00
> 00 50  15 9c 95 d0 ff 58 48 3d 4c 01 00 00 77 0f 4c 89 d1 ff 14 c5
> [1.321161] RIP: entry_SYSCALL_64_fastpath+0x1/0xa5 RSP: 
> c938ff50
> [1.344255] ---[ end trace d7cb8cd6cd7c294c ]---
> [1.345009] Kernel panic - not syncing: Attempted to kill init!
> exitcode=0x000b
>
>
> All code
> 
>0:51   push   %rcx
>1:50   push   %rax
>2:57   push   %rdi
>3:56   push   %rsi
>4:52   push   %rdx
>5:51   push   %rcx
>6:6a dapushq  $0xffda
>8:41 50push   %r8
>a:41 51push   %r9
>c:41 52push   %r10
>e:41 53push   %r11
>   10:48 83 ec 30  sub$0x30,%rsp
>   14:65 4c 8b 1c 25 c0 d2 mov%gs:0xd2c0,%r11
>   1b:00 00
>   1d:41 f7 03 df 39 08 90 testl  $0x900839df,(%r11)
>   24:0f 85 a5 00 00 00jne0xcf
>   2a:50   push   %rax
>   2b:*ff 15 9c 95 d0 ffcallq  *-0x2f6a64(%rip)#
> 0xffd095cd<-- trapping instruction
>   31:58   pop%rax
>   32:48 3d 4c 01 00 00cmp$0x14c,%rax
>   38:77 0fja 0x49
>   3a:4c 89 d1 mov%r10,%rcx
>   3d:ff   .byte 0xff
>   3e:14 c5adc$0xc5,%al
>
>
> so the original 'cli' was replaced with the pv call but to me the offset
> looks a bit off, no? Shouldn't it always be positive?
 callq takes a 32bit signed displacement, so jumping back by up to 2G is
 perfectly legitimate.
>>> Yes, but
>>>
>>> ostr@workbase> nm vmlinux | grep entry_SYSCALL_64_fastpath
>>> 817365dd t entry_SYSCALL_64_fastpath
>>> ostr@workbase> nm vmlinux | grep " pv_irq_ops"
>>> 81c2dbc0 D pv_irq_ops
>>> ostr@workbase>
>>>
>>> so pv_irq_ops.irq_disable is about 5MB ahead of where we are now. (I
>>> didn't mean that x86 instruction set doesn't allow negative
>>> displacement, I was trying to say that pv_irq_ops always live further down)
>> I believe the problem is this:
>>
>> #define PV_INDIRECT(addr)   *addr(%rip)
>>
>> The displacement that the linker computes will be relative to the where
>> this instruction is placed at the time of linking, which is in
>> .pv_altinstructions (and not .text). So when we copy it into .text the
>> displacement becomes bogus.
> apply_alternatives() is supposed to adjust that displacement based on
> the new IP, though it could be messing that up somehow.  (See patch
> 10/13.)
>

That patch doesn't take into account the fact that replacement
instructions may have to save/restore registers. So, for example,


-if (a->replacementlen && is_jmp(replacement[0]))
+} else if (a->replacementlen == 6 && *insnbuf == 0xff &&
+   *(insnbuf+1) == 0x15) {
+/* indirect call */
+*(s32 *)(insnbuf + 2) += replacement - instr;
+DPRINTK("Fix indirect CALL offset: 0x%x, CALL *0x%lx",
+

Re: [Xen-devel] [PATCH 11/13] x86/paravirt: Add paravirt alternatives infrastructure

2017-10-17 Thread Brian Gerst
On Mon, Oct 16, 2017 at 2:18 PM, Boris Ostrovsky
 wrote:
> On 10/12/2017 03:53 PM, Boris Ostrovsky wrote:
>> On 10/12/2017 03:27 PM, Andrew Cooper wrote:
>>> On 12/10/17 20:11, Boris Ostrovsky wrote:
 There is also another problem:

 [1.312425] general protection fault:  [#1] SMP
 [1.312901] Modules linked in:
 [1.313389] CPU: 0 PID: 1 Comm: init Not tainted 4.14.0-rc4+ #6
 [1.313878] task: 88003e2c task.stack: c938c000
 [1.314360] RIP: 1e030:entry_SYSCALL_64_fastpath+0x1/0xa5
 [1.314854] RSP: e02b:c938ff50 EFLAGS: 00010046
 [1.315336] RAX: 000c RBX: 55f550168040 RCX:
 7fcfc959f59a
 [1.315827] RDX:  RSI:  RDI:
 
 [1.316315] RBP: 000a R08: 037f R09:
 0064
 [1.316805] R10: 1f89cbf5 R11: 88003e2c R12:
 7fcfc958ad60
 [1.317300] R13:  R14: 55f550185954 R15:
 1000
 [1.317801] FS:  () GS:88003f80()
 knlGS:
 [1.318267] CS:  e033 DS:  ES:  CR0: 80050033
 [1.318750] CR2: 7fcfc97ab218 CR3: 3c88e000 CR4:
 00042660
 [1.319235] Call Trace:
 [1.319700] Code: 51 50 57 56 52 51 6a da 41 50 41 51 41 52 41 53 48
 83 ec 30 65 4c 8b 1c 25 c0 d2 00 00 41 f7 03 df 39 08 90 0f 85 a5 00 00
 00 50  15 9c 95 d0 ff 58 48 3d 4c 01 00 00 77 0f 4c 89 d1 ff 14 c5
 [1.321161] RIP: entry_SYSCALL_64_fastpath+0x1/0xa5 RSP: 
 c938ff50
 [1.344255] ---[ end trace d7cb8cd6cd7c294c ]---
 [1.345009] Kernel panic - not syncing: Attempted to kill init!
 exitcode=0x000b


 All code
 
0:51   push   %rcx
1:50   push   %rax
2:57   push   %rdi
3:56   push   %rsi
4:52   push   %rdx
5:51   push   %rcx
6:6a dapushq  $0xffda
8:41 50push   %r8
a:41 51push   %r9
c:41 52push   %r10
e:41 53push   %r11
   10:48 83 ec 30  sub$0x30,%rsp
   14:65 4c 8b 1c 25 c0 d2 mov%gs:0xd2c0,%r11
   1b:00 00
   1d:41 f7 03 df 39 08 90 testl  $0x900839df,(%r11)
   24:0f 85 a5 00 00 00jne0xcf
   2a:50   push   %rax
   2b:*ff 15 9c 95 d0 ffcallq  *-0x2f6a64(%rip)#
 0xffd095cd<-- trapping instruction
   31:58   pop%rax
   32:48 3d 4c 01 00 00cmp$0x14c,%rax
   38:77 0fja 0x49
   3a:4c 89 d1 mov%r10,%rcx
   3d:ff   .byte 0xff
   3e:14 c5adc$0xc5,%al


 so the original 'cli' was replaced with the pv call but to me the offset
 looks a bit off, no? Shouldn't it always be positive?
>>> callq takes a 32bit signed displacement, so jumping back by up to 2G is
>>> perfectly legitimate.
>> Yes, but
>>
>> ostr@workbase> nm vmlinux | grep entry_SYSCALL_64_fastpath
>> 817365dd t entry_SYSCALL_64_fastpath
>> ostr@workbase> nm vmlinux | grep " pv_irq_ops"
>> 81c2dbc0 D pv_irq_ops
>> ostr@workbase>
>>
>> so pv_irq_ops.irq_disable is about 5MB ahead of where we are now. (I
>> didn't mean that x86 instruction set doesn't allow negative
>> displacement, I was trying to say that pv_irq_ops always live further down)
>
> I believe the problem is this:
>
> #define PV_INDIRECT(addr)   *addr(%rip)
>
> The displacement that the linker computes will be relative to the where
> this instruction is placed at the time of linking, which is in
> .pv_altinstructions (and not .text). So when we copy it into .text the
> displacement becomes bogus.
>
> Replacing the macro with
>
> #define PV_INDIRECT(addr)   *addr  // well, it's not so much
> indirect anymore
>
> makes things work. Or maybe it can be adjusted top be kept truly indirect.

That is still an indirect call, just using absolute addressing for the
pointer instead of RIP-relative.  Alternatives has very limited
relocation capabilities.  It will only handle a single call or jmp
replacement. Using absolute addressing is slightly less efficient
(takes one extra byte to encode, and needs a relocation for KASLR),
but it works just as well.  You could also relocate the instruction
manually by adding the delta between the original and replacement code
to the displacement.

--
Brian Gerst


Re: [Xen-devel] [PATCH 11/13] x86/paravirt: Add paravirt alternatives infrastructure

2017-10-16 Thread Josh Poimboeuf
On Mon, Oct 16, 2017 at 02:18:48PM -0400, Boris Ostrovsky wrote:
> On 10/12/2017 03:53 PM, Boris Ostrovsky wrote:
> > On 10/12/2017 03:27 PM, Andrew Cooper wrote:
> >> On 12/10/17 20:11, Boris Ostrovsky wrote:
> >>> There is also another problem:
> >>>
> >>> [1.312425] general protection fault:  [#1] SMP
> >>> [1.312901] Modules linked in:
> >>> [1.313389] CPU: 0 PID: 1 Comm: init Not tainted 4.14.0-rc4+ #6
> >>> [1.313878] task: 88003e2c task.stack: c938c000
> >>> [1.314360] RIP: 1e030:entry_SYSCALL_64_fastpath+0x1/0xa5
> >>> [1.314854] RSP: e02b:c938ff50 EFLAGS: 00010046
> >>> [1.315336] RAX: 000c RBX: 55f550168040 RCX:
> >>> 7fcfc959f59a
> >>> [1.315827] RDX:  RSI:  RDI:
> >>> 
> >>> [1.316315] RBP: 000a R08: 037f R09:
> >>> 0064
> >>> [1.316805] R10: 1f89cbf5 R11: 88003e2c R12:
> >>> 7fcfc958ad60
> >>> [1.317300] R13:  R14: 55f550185954 R15:
> >>> 1000
> >>> [1.317801] FS:  () GS:88003f80()
> >>> knlGS:
> >>> [1.318267] CS:  e033 DS:  ES:  CR0: 80050033
> >>> [1.318750] CR2: 7fcfc97ab218 CR3: 3c88e000 CR4:
> >>> 00042660
> >>> [1.319235] Call Trace:
> >>> [1.319700] Code: 51 50 57 56 52 51 6a da 41 50 41 51 41 52 41 53 48
> >>> 83 ec 30 65 4c 8b 1c 25 c0 d2 00 00 41 f7 03 df 39 08 90 0f 85 a5 00 00
> >>> 00 50  15 9c 95 d0 ff 58 48 3d 4c 01 00 00 77 0f 4c 89 d1 ff 14 c5
> >>> [1.321161] RIP: entry_SYSCALL_64_fastpath+0x1/0xa5 RSP: 
> >>> c938ff50
> >>> [1.344255] ---[ end trace d7cb8cd6cd7c294c ]---
> >>> [1.345009] Kernel panic - not syncing: Attempted to kill init!
> >>> exitcode=0x000b
> >>>
> >>>
> >>> All code
> >>> 
> >>>0:51   push   %rcx
> >>>1:50   push   %rax
> >>>2:57   push   %rdi
> >>>3:56   push   %rsi
> >>>4:52   push   %rdx
> >>>5:51   push   %rcx
> >>>6:6a dapushq  $0xffda
> >>>8:41 50push   %r8
> >>>a:41 51push   %r9
> >>>c:41 52push   %r10
> >>>e:41 53push   %r11
> >>>   10:48 83 ec 30  sub$0x30,%rsp
> >>>   14:65 4c 8b 1c 25 c0 d2 mov%gs:0xd2c0,%r11
> >>>   1b:00 00
> >>>   1d:41 f7 03 df 39 08 90 testl  $0x900839df,(%r11)
> >>>   24:0f 85 a5 00 00 00jne0xcf
> >>>   2a:50   push   %rax
> >>>   2b:*ff 15 9c 95 d0 ffcallq  *-0x2f6a64(%rip)#
> >>> 0xffd095cd<-- trapping instruction
> >>>   31:58   pop%rax
> >>>   32:48 3d 4c 01 00 00cmp$0x14c,%rax
> >>>   38:77 0fja 0x49
> >>>   3a:4c 89 d1 mov%r10,%rcx
> >>>   3d:ff   .byte 0xff
> >>>   3e:14 c5adc$0xc5,%al
> >>>
> >>>
> >>> so the original 'cli' was replaced with the pv call but to me the offset
> >>> looks a bit off, no? Shouldn't it always be positive?
> >> callq takes a 32bit signed displacement, so jumping back by up to 2G is
> >> perfectly legitimate.
> > Yes, but
> >
> > ostr@workbase> nm vmlinux | grep entry_SYSCALL_64_fastpath
> > 817365dd t entry_SYSCALL_64_fastpath
> > ostr@workbase> nm vmlinux | grep " pv_irq_ops"
> > 81c2dbc0 D pv_irq_ops
> > ostr@workbase>
> >
> > so pv_irq_ops.irq_disable is about 5MB ahead of where we are now. (I
> > didn't mean that x86 instruction set doesn't allow negative
> > displacement, I was trying to say that pv_irq_ops always live further down)
> 
> I believe the problem is this:
> 
> #define PV_INDIRECT(addr)   *addr(%rip)
> 
> The displacement that the linker computes will be relative to the where
> this instruction is placed at the time of linking, which is in
> .pv_altinstructions (and not .text). So when we copy it into .text the
> displacement becomes bogus.

apply_alternatives() is supposed to adjust that displacement based on
the new IP, though it could be messing that up somehow.  (See patch
10/13.)

-- 
Josh

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH 11/13] x86/paravirt: Add paravirt alternatives infrastructure

2017-10-16 Thread Boris Ostrovsky
On 10/12/2017 03:53 PM, Boris Ostrovsky wrote:
> On 10/12/2017 03:27 PM, Andrew Cooper wrote:
>> On 12/10/17 20:11, Boris Ostrovsky wrote:
>>> There is also another problem:
>>>
>>> [1.312425] general protection fault:  [#1] SMP
>>> [1.312901] Modules linked in:
>>> [1.313389] CPU: 0 PID: 1 Comm: init Not tainted 4.14.0-rc4+ #6
>>> [1.313878] task: 88003e2c task.stack: c938c000
>>> [1.314360] RIP: 1e030:entry_SYSCALL_64_fastpath+0x1/0xa5
>>> [1.314854] RSP: e02b:c938ff50 EFLAGS: 00010046
>>> [1.315336] RAX: 000c RBX: 55f550168040 RCX:
>>> 7fcfc959f59a
>>> [1.315827] RDX:  RSI:  RDI:
>>> 
>>> [1.316315] RBP: 000a R08: 037f R09:
>>> 0064
>>> [1.316805] R10: 1f89cbf5 R11: 88003e2c R12:
>>> 7fcfc958ad60
>>> [1.317300] R13:  R14: 55f550185954 R15:
>>> 1000
>>> [1.317801] FS:  () GS:88003f80()
>>> knlGS:
>>> [1.318267] CS:  e033 DS:  ES:  CR0: 80050033
>>> [1.318750] CR2: 7fcfc97ab218 CR3: 3c88e000 CR4:
>>> 00042660
>>> [1.319235] Call Trace:
>>> [1.319700] Code: 51 50 57 56 52 51 6a da 41 50 41 51 41 52 41 53 48
>>> 83 ec 30 65 4c 8b 1c 25 c0 d2 00 00 41 f7 03 df 39 08 90 0f 85 a5 00 00
>>> 00 50  15 9c 95 d0 ff 58 48 3d 4c 01 00 00 77 0f 4c 89 d1 ff 14 c5
>>> [1.321161] RIP: entry_SYSCALL_64_fastpath+0x1/0xa5 RSP: c938ff50
>>> [1.344255] ---[ end trace d7cb8cd6cd7c294c ]---
>>> [1.345009] Kernel panic - not syncing: Attempted to kill init!
>>> exitcode=0x000b
>>>
>>>
>>> All code
>>> 
>>>0:51   push   %rcx
>>>1:50   push   %rax
>>>2:57   push   %rdi
>>>3:56   push   %rsi
>>>4:52   push   %rdx
>>>5:51   push   %rcx
>>>6:6a dapushq  $0xffda
>>>8:41 50push   %r8
>>>a:41 51push   %r9
>>>c:41 52push   %r10
>>>e:41 53push   %r11
>>>   10:48 83 ec 30  sub$0x30,%rsp
>>>   14:65 4c 8b 1c 25 c0 d2 mov%gs:0xd2c0,%r11
>>>   1b:00 00
>>>   1d:41 f7 03 df 39 08 90 testl  $0x900839df,(%r11)
>>>   24:0f 85 a5 00 00 00jne0xcf
>>>   2a:50   push   %rax
>>>   2b:*ff 15 9c 95 d0 ffcallq  *-0x2f6a64(%rip)#
>>> 0xffd095cd<-- trapping instruction
>>>   31:58   pop%rax
>>>   32:48 3d 4c 01 00 00cmp$0x14c,%rax
>>>   38:77 0fja 0x49
>>>   3a:4c 89 d1 mov%r10,%rcx
>>>   3d:ff   .byte 0xff
>>>   3e:14 c5adc$0xc5,%al
>>>
>>>
>>> so the original 'cli' was replaced with the pv call but to me the offset
>>> looks a bit off, no? Shouldn't it always be positive?
>> callq takes a 32bit signed displacement, so jumping back by up to 2G is
>> perfectly legitimate.
> Yes, but
>
> ostr@workbase> nm vmlinux | grep entry_SYSCALL_64_fastpath
> 817365dd t entry_SYSCALL_64_fastpath
> ostr@workbase> nm vmlinux | grep " pv_irq_ops"
> 81c2dbc0 D pv_irq_ops
> ostr@workbase>
>
> so pv_irq_ops.irq_disable is about 5MB ahead of where we are now. (I
> didn't mean that x86 instruction set doesn't allow negative
> displacement, I was trying to say that pv_irq_ops always live further down)

I believe the problem is this:

#define PV_INDIRECT(addr)   *addr(%rip)

The displacement that the linker computes will be relative to the where
this instruction is placed at the time of linking, which is in
.pv_altinstructions (and not .text). So when we copy it into .text the
displacement becomes bogus.

Replacing the macro with

#define PV_INDIRECT(addr)   *addr  // well, it's not so much
indirect anymore

makes things work. Or maybe it can be adjusted top be kept truly indirect.

-boris



___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH 11/13] x86/paravirt: Add paravirt alternatives infrastructure

2017-10-12 Thread Boris Ostrovsky
On 10/12/2017 03:27 PM, Andrew Cooper wrote:
> On 12/10/17 20:11, Boris Ostrovsky wrote:
>> On 10/06/2017 10:32 AM, Josh Poimboeuf wrote:
>>> On Thu, Oct 05, 2017 at 04:35:03PM -0400, Boris Ostrovsky wrote:
>  #ifdef CONFIG_PARAVIRT
> +/*
> + * Paravirt alternatives are applied much earlier than normal 
> alternatives.
> + * They are only applied when running on a hypervisor.  They replace some
> + * native instructions with calls to pv ops.
> + */
> +void __init apply_pv_alternatives(void)
> +{
> + setup_force_cpu_cap(X86_FEATURE_PV_OPS);
 Not for Xen HVM guests.
>>> From what I can tell, HVM guests still use pv_time_ops and
>>> pv_mmu_ops.exit_mmap, right?
>>>
> + apply_alternatives(__pv_alt_instructions, __pv_alt_instructions_end);
> +}
 This is a problem (at least for Xen PV guests):
 apply_alternatives()->text_poke_early()->local_irq_save()->...'cli'->death.
>>> Ah, right.
>>>
 It might be possible not to turn off/on the interrupts in this
 particular case since the guest probably won't be able to handle an
 interrupt at this point anyway.
>>> Yeah, that should work.  For Xen and for the other hypervisors, this is
>>> called well before irq init, so interrupts can't be handled yet anyway.
>> There is also another problem:
>>
>> [1.312425] general protection fault:  [#1] SMP
>> [1.312901] Modules linked in:
>> [1.313389] CPU: 0 PID: 1 Comm: init Not tainted 4.14.0-rc4+ #6
>> [1.313878] task: 88003e2c task.stack: c938c000
>> [1.314360] RIP: 1e030:entry_SYSCALL_64_fastpath+0x1/0xa5
>> [1.314854] RSP: e02b:c938ff50 EFLAGS: 00010046
>> [1.315336] RAX: 000c RBX: 55f550168040 RCX:
>> 7fcfc959f59a
>> [1.315827] RDX:  RSI:  RDI:
>> 
>> [1.316315] RBP: 000a R08: 037f R09:
>> 0064
>> [1.316805] R10: 1f89cbf5 R11: 88003e2c R12:
>> 7fcfc958ad60
>> [1.317300] R13:  R14: 55f550185954 R15:
>> 1000
>> [1.317801] FS:  () GS:88003f80()
>> knlGS:
>> [1.318267] CS:  e033 DS:  ES:  CR0: 80050033
>> [1.318750] CR2: 7fcfc97ab218 CR3: 3c88e000 CR4:
>> 00042660
>> [1.319235] Call Trace:
>> [1.319700] Code: 51 50 57 56 52 51 6a da 41 50 41 51 41 52 41 53 48
>> 83 ec 30 65 4c 8b 1c 25 c0 d2 00 00 41 f7 03 df 39 08 90 0f 85 a5 00 00
>> 00 50  15 9c 95 d0 ff 58 48 3d 4c 01 00 00 77 0f 4c 89 d1 ff 14 c5
>> [1.321161] RIP: entry_SYSCALL_64_fastpath+0x1/0xa5 RSP: c938ff50
>> [1.344255] ---[ end trace d7cb8cd6cd7c294c ]---
>> [1.345009] Kernel panic - not syncing: Attempted to kill init!
>> exitcode=0x000b
>>
>>
>> All code
>> 
>>0:51   push   %rcx
>>1:50   push   %rax
>>2:57   push   %rdi
>>3:56   push   %rsi
>>4:52   push   %rdx
>>5:51   push   %rcx
>>6:6a dapushq  $0xffda
>>8:41 50push   %r8
>>a:41 51push   %r9
>>c:41 52push   %r10
>>e:41 53push   %r11
>>   10:48 83 ec 30  sub$0x30,%rsp
>>   14:65 4c 8b 1c 25 c0 d2 mov%gs:0xd2c0,%r11
>>   1b:00 00
>>   1d:41 f7 03 df 39 08 90 testl  $0x900839df,(%r11)
>>   24:0f 85 a5 00 00 00jne0xcf
>>   2a:50   push   %rax
>>   2b:*ff 15 9c 95 d0 ffcallq  *-0x2f6a64(%rip)#
>> 0xffd095cd<-- trapping instruction
>>   31:58   pop%rax
>>   32:48 3d 4c 01 00 00cmp$0x14c,%rax
>>   38:77 0fja 0x49
>>   3a:4c 89 d1 mov%r10,%rcx
>>   3d:ff   .byte 0xff
>>   3e:14 c5adc$0xc5,%al
>>
>>
>> so the original 'cli' was replaced with the pv call but to me the offset
>> looks a bit off, no? Shouldn't it always be positive?
> callq takes a 32bit signed displacement, so jumping back by up to 2G is
> perfectly legitimate.

Yes, but

ostr@workbase> nm vmlinux | grep entry_SYSCALL_64_fastpath
817365dd t entry_SYSCALL_64_fastpath
ostr@workbase> nm vmlinux | grep " pv_irq_ops"
81c2dbc0 D pv_irq_ops
ostr@workbase>

so pv_irq_ops.irq_disable is about 5MB ahead of where we are now. (I
didn't mean that x86 instruction set doesn't allow negative
displacement, I was trying to say that pv_irq_ops always live further down)


>
> The #GP[0] however means that whatever 8 byte value was found at
> -0x2f6a64(%rip) was a non-canonical address.
>
> One option is that the pvops structure hasn't been initialised 

Re: [Xen-devel] [PATCH 11/13] x86/paravirt: Add paravirt alternatives infrastructure

2017-10-12 Thread Andrew Cooper
On 12/10/17 20:11, Boris Ostrovsky wrote:
> On 10/06/2017 10:32 AM, Josh Poimboeuf wrote:
>> On Thu, Oct 05, 2017 at 04:35:03PM -0400, Boris Ostrovsky wrote:
  #ifdef CONFIG_PARAVIRT
 +/*
 + * Paravirt alternatives are applied much earlier than normal 
 alternatives.
 + * They are only applied when running on a hypervisor.  They replace some
 + * native instructions with calls to pv ops.
 + */
 +void __init apply_pv_alternatives(void)
 +{
 +  setup_force_cpu_cap(X86_FEATURE_PV_OPS);
>>> Not for Xen HVM guests.
>> From what I can tell, HVM guests still use pv_time_ops and
>> pv_mmu_ops.exit_mmap, right?
>>
 +  apply_alternatives(__pv_alt_instructions, __pv_alt_instructions_end);
 +}
>>> This is a problem (at least for Xen PV guests):
>>> apply_alternatives()->text_poke_early()->local_irq_save()->...'cli'->death.
>> Ah, right.
>>
>>> It might be possible not to turn off/on the interrupts in this
>>> particular case since the guest probably won't be able to handle an
>>> interrupt at this point anyway.
>> Yeah, that should work.  For Xen and for the other hypervisors, this is
>> called well before irq init, so interrupts can't be handled yet anyway.
> There is also another problem:
>
> [1.312425] general protection fault:  [#1] SMP
> [1.312901] Modules linked in:
> [1.313389] CPU: 0 PID: 1 Comm: init Not tainted 4.14.0-rc4+ #6
> [1.313878] task: 88003e2c task.stack: c938c000
> [1.314360] RIP: 1e030:entry_SYSCALL_64_fastpath+0x1/0xa5
> [1.314854] RSP: e02b:c938ff50 EFLAGS: 00010046
> [1.315336] RAX: 000c RBX: 55f550168040 RCX:
> 7fcfc959f59a
> [1.315827] RDX:  RSI:  RDI:
> 
> [1.316315] RBP: 000a R08: 037f R09:
> 0064
> [1.316805] R10: 1f89cbf5 R11: 88003e2c R12:
> 7fcfc958ad60
> [1.317300] R13:  R14: 55f550185954 R15:
> 1000
> [1.317801] FS:  () GS:88003f80()
> knlGS:
> [1.318267] CS:  e033 DS:  ES:  CR0: 80050033
> [1.318750] CR2: 7fcfc97ab218 CR3: 3c88e000 CR4:
> 00042660
> [1.319235] Call Trace:
> [1.319700] Code: 51 50 57 56 52 51 6a da 41 50 41 51 41 52 41 53 48
> 83 ec 30 65 4c 8b 1c 25 c0 d2 00 00 41 f7 03 df 39 08 90 0f 85 a5 00 00
> 00 50  15 9c 95 d0 ff 58 48 3d 4c 01 00 00 77 0f 4c 89 d1 ff 14 c5
> [1.321161] RIP: entry_SYSCALL_64_fastpath+0x1/0xa5 RSP: c938ff50
> [1.344255] ---[ end trace d7cb8cd6cd7c294c ]---
> [1.345009] Kernel panic - not syncing: Attempted to kill init!
> exitcode=0x000b
>
>
> All code
> 
>0:51   push   %rcx
>1:50   push   %rax
>2:57   push   %rdi
>3:56   push   %rsi
>4:52   push   %rdx
>5:51   push   %rcx
>6:6a dapushq  $0xffda
>8:41 50push   %r8
>a:41 51push   %r9
>c:41 52push   %r10
>e:41 53push   %r11
>   10:48 83 ec 30  sub$0x30,%rsp
>   14:65 4c 8b 1c 25 c0 d2 mov%gs:0xd2c0,%r11
>   1b:00 00
>   1d:41 f7 03 df 39 08 90 testl  $0x900839df,(%r11)
>   24:0f 85 a5 00 00 00jne0xcf
>   2a:50   push   %rax
>   2b:*ff 15 9c 95 d0 ffcallq  *-0x2f6a64(%rip)#
> 0xffd095cd<-- trapping instruction
>   31:58   pop%rax
>   32:48 3d 4c 01 00 00cmp$0x14c,%rax
>   38:77 0fja 0x49
>   3a:4c 89 d1 mov%r10,%rcx
>   3d:ff   .byte 0xff
>   3e:14 c5adc$0xc5,%al
>
>
> so the original 'cli' was replaced with the pv call but to me the offset
> looks a bit off, no? Shouldn't it always be positive?

callq takes a 32bit signed displacement, so jumping back by up to 2G is
perfectly legitimate.

The #GP[0] however means that whatever 8 byte value was found at
-0x2f6a64(%rip) was a non-canonical address.

One option is that the pvops structure hasn't been initialised properly,
but an alternative is that the relocation wasn't processed correctly,
and the code is trying to reference something which isn't a function
pointer.

~Andrew

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH 11/13] x86/paravirt: Add paravirt alternatives infrastructure

2017-10-12 Thread Boris Ostrovsky
On 10/06/2017 10:32 AM, Josh Poimboeuf wrote:
> On Thu, Oct 05, 2017 at 04:35:03PM -0400, Boris Ostrovsky wrote:
>>>  #ifdef CONFIG_PARAVIRT
>>> +/*
>>> + * Paravirt alternatives are applied much earlier than normal alternatives.
>>> + * They are only applied when running on a hypervisor.  They replace some
>>> + * native instructions with calls to pv ops.
>>> + */
>>> +void __init apply_pv_alternatives(void)
>>> +{
>>> +   setup_force_cpu_cap(X86_FEATURE_PV_OPS);
>> Not for Xen HVM guests.
> From what I can tell, HVM guests still use pv_time_ops and
> pv_mmu_ops.exit_mmap, right?
>
>>> +   apply_alternatives(__pv_alt_instructions, __pv_alt_instructions_end);
>>> +}
>>
>> This is a problem (at least for Xen PV guests):
>> apply_alternatives()->text_poke_early()->local_irq_save()->...'cli'->death.
> Ah, right.
>
>> It might be possible not to turn off/on the interrupts in this
>> particular case since the guest probably won't be able to handle an
>> interrupt at this point anyway.
> Yeah, that should work.  For Xen and for the other hypervisors, this is
> called well before irq init, so interrupts can't be handled yet anyway.

There is also another problem:

[1.312425] general protection fault:  [#1] SMP
[1.312901] Modules linked in:
[1.313389] CPU: 0 PID: 1 Comm: init Not tainted 4.14.0-rc4+ #6
[1.313878] task: 88003e2c task.stack: c938c000
[1.314360] RIP: 1e030:entry_SYSCALL_64_fastpath+0x1/0xa5
[1.314854] RSP: e02b:c938ff50 EFLAGS: 00010046
[1.315336] RAX: 000c RBX: 55f550168040 RCX:
7fcfc959f59a
[1.315827] RDX:  RSI:  RDI:

[1.316315] RBP: 000a R08: 037f R09:
0064
[1.316805] R10: 1f89cbf5 R11: 88003e2c R12:
7fcfc958ad60
[1.317300] R13:  R14: 55f550185954 R15:
1000
[1.317801] FS:  () GS:88003f80()
knlGS:
[1.318267] CS:  e033 DS:  ES:  CR0: 80050033
[1.318750] CR2: 7fcfc97ab218 CR3: 3c88e000 CR4:
00042660
[1.319235] Call Trace:
[1.319700] Code: 51 50 57 56 52 51 6a da 41 50 41 51 41 52 41 53 48
83 ec 30 65 4c 8b 1c 25 c0 d2 00 00 41 f7 03 df 39 08 90 0f 85 a5 00 00
00 50  15 9c 95 d0 ff 58 48 3d 4c 01 00 00 77 0f 4c 89 d1 ff 14 c5
[1.321161] RIP: entry_SYSCALL_64_fastpath+0x1/0xa5 RSP: c938ff50
[1.344255] ---[ end trace d7cb8cd6cd7c294c ]---
[1.345009] Kernel panic - not syncing: Attempted to kill init!
exitcode=0x000b


All code

   0:51   push   %rcx
   1:50   push   %rax
   2:57   push   %rdi
   3:56   push   %rsi
   4:52   push   %rdx
   5:51   push   %rcx
   6:6a dapushq  $0xffda
   8:41 50push   %r8
   a:41 51push   %r9
   c:41 52push   %r10
   e:41 53push   %r11
  10:48 83 ec 30  sub$0x30,%rsp
  14:65 4c 8b 1c 25 c0 d2 mov%gs:0xd2c0,%r11
  1b:00 00
  1d:41 f7 03 df 39 08 90 testl  $0x900839df,(%r11)
  24:0f 85 a5 00 00 00jne0xcf
  2a:50   push   %rax
  2b:*ff 15 9c 95 d0 ffcallq  *-0x2f6a64(%rip)#
0xffd095cd<-- trapping instruction
  31:58   pop%rax
  32:48 3d 4c 01 00 00cmp$0x14c,%rax
  38:77 0fja 0x49
  3a:4c 89 d1 mov%r10,%rcx
  3d:ff   .byte 0xff
  3e:14 c5adc$0xc5,%al


so the original 'cli' was replaced with the pv call but to me the offset
looks a bit off, no? Shouldn't it always be positive?


-boris

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH 11/13] x86/paravirt: Add paravirt alternatives infrastructure

2017-10-06 Thread Josh Poimboeuf
On Fri, Oct 06, 2017 at 11:29:52AM -0400, Boris Ostrovsky wrote:
> >>> +
> >>>  void __init_or_module apply_paravirt(struct paravirt_patch_site *start,
> >>>struct paravirt_patch_site *end)
> >>>  {
> >>> diff --git a/arch/x86/kernel/cpu/hypervisor.c 
> >>> b/arch/x86/kernel/cpu/hypervisor.c
> >>> index 4fa90006ac68..17243fe0f5ce 100644
> >>> --- a/arch/x86/kernel/cpu/hypervisor.c
> >>> +++ b/arch/x86/kernel/cpu/hypervisor.c
> >>> @@ -71,6 +71,8 @@ void __init init_hypervisor_platform(void)
> >>>   if (!x86_hyper)
> >>>   return;
> >>>  
> >>> + apply_pv_alternatives();
> >> Not for Xen PV guests who have already done this.
> > I think it would be harmless, but yeah, it's probably best to only write
> > it once.
> 
> I also wonder whether calling apply_pv_alternatives() here before
> x86_hyper->init_platform() will work since the latter may be setting
> those op. In fact, that's what Xen HVM does for pv_mmu_ops.exit_mmap.

apply_pv_alternatives() changes:

  (native code)

to

  call *pv_whatever_ops.whatever

So apply_pv_alternatives() should be called *before* any of the ops are
set up.

-- 
Josh

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH 11/13] x86/paravirt: Add paravirt alternatives infrastructure

2017-10-06 Thread Boris Ostrovsky
On 10/06/2017 10:32 AM, Josh Poimboeuf wrote:
> On Thu, Oct 05, 2017 at 04:35:03PM -0400, Boris Ostrovsky wrote:
>>>  #ifdef CONFIG_PARAVIRT
>>> +/*
>>> + * Paravirt alternatives are applied much earlier than normal alternatives.
>>> + * They are only applied when running on a hypervisor.  They replace some
>>> + * native instructions with calls to pv ops.
>>> + */
>>> +void __init apply_pv_alternatives(void)
>>> +{
>>> +   setup_force_cpu_cap(X86_FEATURE_PV_OPS);
>> Not for Xen HVM guests.
> From what I can tell, HVM guests still use pv_time_ops and
> pv_mmu_ops.exit_mmap, right?


Right, I forgot about that one.

>>> +
>>>  void __init_or_module apply_paravirt(struct paravirt_patch_site *start,
>>>  struct paravirt_patch_site *end)
>>>  {
>>> diff --git a/arch/x86/kernel/cpu/hypervisor.c 
>>> b/arch/x86/kernel/cpu/hypervisor.c
>>> index 4fa90006ac68..17243fe0f5ce 100644
>>> --- a/arch/x86/kernel/cpu/hypervisor.c
>>> +++ b/arch/x86/kernel/cpu/hypervisor.c
>>> @@ -71,6 +71,8 @@ void __init init_hypervisor_platform(void)
>>> if (!x86_hyper)
>>> return;
>>>  
>>> +   apply_pv_alternatives();
>> Not for Xen PV guests who have already done this.
> I think it would be harmless, but yeah, it's probably best to only write
> it once.

I also wonder whether calling apply_pv_alternatives() here before
x86_hyper->init_platform() will work since the latter may be setting
those op. In fact, that's what Xen HVM does for pv_mmu_ops.exit_mmap.

-boris

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH 11/13] x86/paravirt: Add paravirt alternatives infrastructure

2017-10-06 Thread Josh Poimboeuf
On Thu, Oct 05, 2017 at 04:35:03PM -0400, Boris Ostrovsky wrote:
> 
> >  #ifdef CONFIG_PARAVIRT
> > +/*
> > + * Paravirt alternatives are applied much earlier than normal alternatives.
> > + * They are only applied when running on a hypervisor.  They replace some
> > + * native instructions with calls to pv ops.
> > + */
> > +void __init apply_pv_alternatives(void)
> > +{
> > +   setup_force_cpu_cap(X86_FEATURE_PV_OPS);
> 
> Not for Xen HVM guests.

From what I can tell, HVM guests still use pv_time_ops and
pv_mmu_ops.exit_mmap, right?

> > +   apply_alternatives(__pv_alt_instructions, __pv_alt_instructions_end);
> > +}
> 
> 
> This is a problem (at least for Xen PV guests):
> apply_alternatives()->text_poke_early()->local_irq_save()->...'cli'->death.

Ah, right.

> It might be possible not to turn off/on the interrupts in this
> particular case since the guest probably won't be able to handle an
> interrupt at this point anyway.

Yeah, that should work.  For Xen and for the other hypervisors, this is
called well before irq init, so interrupts can't be handled yet anyway.

> > +
> >  void __init_or_module apply_paravirt(struct paravirt_patch_site *start,
> >  struct paravirt_patch_site *end)
> >  {
> > diff --git a/arch/x86/kernel/cpu/hypervisor.c 
> > b/arch/x86/kernel/cpu/hypervisor.c
> > index 4fa90006ac68..17243fe0f5ce 100644
> > --- a/arch/x86/kernel/cpu/hypervisor.c
> > +++ b/arch/x86/kernel/cpu/hypervisor.c
> > @@ -71,6 +71,8 @@ void __init init_hypervisor_platform(void)
> > if (!x86_hyper)
> > return;
> >  
> > +   apply_pv_alternatives();
> 
> Not for Xen PV guests who have already done this.

I think it would be harmless, but yeah, it's probably best to only write
it once.

Thanks for the review!

-- 
Josh

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH 11/13] x86/paravirt: Add paravirt alternatives infrastructure

2017-10-05 Thread Boris Ostrovsky

>  #ifdef CONFIG_PARAVIRT
> +/*
> + * Paravirt alternatives are applied much earlier than normal alternatives.
> + * They are only applied when running on a hypervisor.  They replace some
> + * native instructions with calls to pv ops.
> + */
> +void __init apply_pv_alternatives(void)
> +{
> + setup_force_cpu_cap(X86_FEATURE_PV_OPS);

Not for Xen HVM guests.

> + apply_alternatives(__pv_alt_instructions, __pv_alt_instructions_end);
> +}


This is a problem (at least for Xen PV guests):
apply_alternatives()->text_poke_early()->local_irq_save()->...'cli'->death.

It might be possible not to turn off/on the interrupts in this
particular case since the guest probably won't be able to handle an
interrupt at this point anyway.


> +
>  void __init_or_module apply_paravirt(struct paravirt_patch_site *start,
>struct paravirt_patch_site *end)
>  {
> diff --git a/arch/x86/kernel/cpu/hypervisor.c 
> b/arch/x86/kernel/cpu/hypervisor.c
> index 4fa90006ac68..17243fe0f5ce 100644
> --- a/arch/x86/kernel/cpu/hypervisor.c
> +++ b/arch/x86/kernel/cpu/hypervisor.c
> @@ -71,6 +71,8 @@ void __init init_hypervisor_platform(void)
>   if (!x86_hyper)
>   return;
>  
> + apply_pv_alternatives();

Not for Xen PV guests who have already done this.

-boris


> +
>   if (x86_hyper->init_platform)
>   x86_hyper->init_platform();
>  }
>


___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


[Xen-devel] [PATCH 11/13] x86/paravirt: Add paravirt alternatives infrastructure

2017-10-04 Thread Josh Poimboeuf
With CONFIG_PARAVIRT, the kernel .text is littered with a bunch of calls
to pv_irq_ops function pointers, like:

  callq  *0x81e3a400 (pv_irq_ops.save_fl)

In non-Xen paravirt environments -- including native, KVM, Hyper-V, and
VMware -- the above code gets patched by native_patch() to look like
this instead:

   pushfq
   pop%rax
   nopl   0x0(%rax,%rax,1)

So in most scenarios, there's a mismatch between what vmlinux shows and
the actual runtime code.  This mismatch hurts debuggability and makes
the assembly code harder to understand.

It also causes the ORC unwinder to produce KASAN warnings like:

  BUG: KASAN: stack-out-of-bounds in deref_stack_reg+0x123/0x140

This warning is due to the fact that objtool doesn't know about
parainstructions, so it doesn't know about the "pushfq; pop %rax"
sequence above.

Prepare to fix both of these issues (debuggability and ORC KASAN
warnings) by adding a paravirt alternatives infrastructure to put the
native instructions in .text by default.  Then, when booting on a
hypervisor, replace the native instructions with pv ops calls.

The pv ops calls need to be available much earlier than when
alternatives are normally applied.  So put these alternatives in a
dedicated ".pv_alternatives" section.

So now these instructions may be patched twice:

- in apply_pv_alternatives(), to allow the kernel to boot in the
  virtualized environment;

- and again in apply_paravirt(), to enable performance improvements
  (e.g., replacing an indirect call with a direct call).

That's a bit more complex, but overall this approach should cause less
confusion than before because the vmlinux code is now much more likely
to represent the actual runtime state of the code in the most common
paravirt cases (everything except Xen and vSMP).

It could be simplified by redesigning the paravirt patching code such
that it uses alternatives for all of its patching.  Instead of using pv
ops to specify which functions they need, they would instead set CPU
feature bits, which would then be used by the alternatives to decide
what to replace the native code with.  Then each site would only be
patched once.

But that's going to be a bit more work.  At least this patch creates a
good foundation for eventually getting rid of .parainstructions and pv
ops completely.

Suggested-by: Andy Lutomirski 
Signed-off-by: Josh Poimboeuf 
---
 arch/x86/include/asm/alternative-asm.h |  9 +++-
 arch/x86/include/asm/alternative.h | 12 +++--
 arch/x86/include/asm/cpufeatures.h |  1 +
 arch/x86/include/asm/paravirt-asm.h| 10 
 arch/x86/include/asm/paravirt_types.h  | 84 ++
 arch/x86/kernel/alternative.c  | 13 ++
 arch/x86/kernel/cpu/hypervisor.c   |  2 +
 arch/x86/kernel/module.c   | 11 -
 arch/x86/kernel/vmlinux.lds.S  |  6 +++
 arch/x86/xen/enlighten_pv.c|  1 +
 10 files changed, 141 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/alternative-asm.h 
b/arch/x86/include/asm/alternative-asm.h
index 60073947350d..0ced2e3d0a30 100644
--- a/arch/x86/include/asm/alternative-asm.h
+++ b/arch/x86/include/asm/alternative-asm.h
@@ -39,14 +39,14 @@
  * @newinstr. ".skip" directive takes care of proper instruction padding
  * in case @newinstr is longer than @oldinstr.
  */
-#define ALTERNATIVE(oldinstr, newinstr, feature)   \
+#define __ALTERNATIVE(section, oldinstr, newinstr, feature)\
 140:;  \
oldinstr;   \
 141:;  \
.skip -(((144f-143f)-(141b-140b)) > 0) *\
((144f-143f)-(141b-140b)),0x90; \
 142:;  \
-   .pushsection .altinstructions, "a"; \
+   .pushsection section, "a";  \
altinstruction_entry 140b,143f,feature,142b-140b,144f-143f,142b-141b;\
.popsection;\
.pushsection .altinstr_replacement, "ax";   \
@@ -55,6 +55,11 @@
 144:;  \
.popsection
 
+#define ARGS(args...) args
+
+#define ALTERNATIVE(oldinstr, newinstr, feature)   \
+   __ALTERNATIVE(.altinstructions, ARGS(oldinstr), ARGS(newinstr), feature)
+
 #define old_len141b-140b
 #define new_len1   144f-143f
 #define new_len2   145f-144f
diff --git a/arch/x86/include/asm/alternative.h 
b/arch/x86/include/asm/alternative.h
index c096624137ae..8482f90d5078 100644
--- a/arch/x86/include/asm/alternative.h
+++ b/arch/x86/include/asm/alternative.h
@@ -61,6 +61,7 @@