Re: Compat 32-bit syscall entry from 64-bit task!?

2017-03-08 Thread Andrew Lutomirski
On Wed, Mar 8, 2017 at 3:41 PM, Dmitry V. Levin  wrote:
> Hi,
>
> On Thu, Jan 26, 2012 at 07:03:43PM +0100, Denys Vlasenko wrote:
>> Hi Linus,
>>
>> On Thu, Jan 26, 2012 at 4:47 AM, Linus Torvalds
>>  wrote:
>> >> Please look at strace source, get_scno() function, where
>> >> it reads syscall no and parameters. Let's see
>> >> - POWERPC: has 32-bit and 64-bit mode
>> >> - X86_64: has 32-bit and 64-bit mode
>> >> - IA64: has i386-compat mode
>> >> - ARM: has more than one ABI
>> >> - SPARC: has 32-bit and 64-bit mode
>> >>
>> >> Do you want to re-invent a different arch-specific way to report
>> >> syscall type for each of these arches?
>> >
>> > I think an arch-specific one is better than trying to make some
>> > generic one that is messy.
>> >
>> > As you say, many architectures have multiple system call ABIs.
>> >
>> > But they tend to be very *different* issues. They can be about
>> > multiple ABI's, as you mention, and even when they *look* similar
>> > (32-bit vs 64-bit ABI's) they are actually totally different issues.
>> > [skip]
>>
>> I don't have a particular attachment to my solution,
>> and I think we already talk about this problem for
>> far too long.
>>
>> Looks like nobody is _strongly_ opposed to your patch
>> which uses a few bits in eflags to report bitness
>> of the x86 syscall.
>>
>> Lets just do that already. If you commit it to kernel git,
>> I will immediately change strace accordingly.
>
> Is there any progress with this (or any alternative) solution?
>
> I see the kernel side has changed a bit, and the strace part
> is in a better shape than 5 years ago (although I'm biased of course),
> but I don't see any kernel interface that would allow strace to reliably
> recognize this 0x80 case.

I am strongly opposed to fudging registers to half-arsedly slightly
improve the epicly crappy ptrace(2) interface for syscalls.

To fix this right, please just add PTRACE_GET_SYSCALL_INFO or similar
to, in one shot, read out all the syscall details.  This means: arch,
no, arg0..arg5, and *whether it's entry or exit*.  I propose returning
this structure:

struct ptrace_syscall_info {
  u8 op;  /* 0 for entry, 1 for exit */
  u8 pad0;
  u16 pad1;
  u32 pad2;
  union {
struct seccomp_data syscall_entry;
s64 syscall_exit_retval;
  };
};

because struct seccomp_data already gets this right.  There's plenty
of opportunity to fine-tune this.  Now it works on all architectures.

Since struct seccomp_data may be extended in the future, the operation
should be:

ptrace(PTRACE_GET_SYSCALL_INFO, pid, (void *)sizeof(struct
ptrace_syscall_info), );

returns 0 on success and some error code if, for example, the current
ptrace stop isn't a syscall entry or exit.

--Andy


Re: Compat 32-bit syscall entry from 64-bit task!?

2017-03-08 Thread Andrew Lutomirski
On Wed, Mar 8, 2017 at 3:41 PM, Dmitry V. Levin  wrote:
> Hi,
>
> On Thu, Jan 26, 2012 at 07:03:43PM +0100, Denys Vlasenko wrote:
>> Hi Linus,
>>
>> On Thu, Jan 26, 2012 at 4:47 AM, Linus Torvalds
>>  wrote:
>> >> Please look at strace source, get_scno() function, where
>> >> it reads syscall no and parameters. Let's see
>> >> - POWERPC: has 32-bit and 64-bit mode
>> >> - X86_64: has 32-bit and 64-bit mode
>> >> - IA64: has i386-compat mode
>> >> - ARM: has more than one ABI
>> >> - SPARC: has 32-bit and 64-bit mode
>> >>
>> >> Do you want to re-invent a different arch-specific way to report
>> >> syscall type for each of these arches?
>> >
>> > I think an arch-specific one is better than trying to make some
>> > generic one that is messy.
>> >
>> > As you say, many architectures have multiple system call ABIs.
>> >
>> > But they tend to be very *different* issues. They can be about
>> > multiple ABI's, as you mention, and even when they *look* similar
>> > (32-bit vs 64-bit ABI's) they are actually totally different issues.
>> > [skip]
>>
>> I don't have a particular attachment to my solution,
>> and I think we already talk about this problem for
>> far too long.
>>
>> Looks like nobody is _strongly_ opposed to your patch
>> which uses a few bits in eflags to report bitness
>> of the x86 syscall.
>>
>> Lets just do that already. If you commit it to kernel git,
>> I will immediately change strace accordingly.
>
> Is there any progress with this (or any alternative) solution?
>
> I see the kernel side has changed a bit, and the strace part
> is in a better shape than 5 years ago (although I'm biased of course),
> but I don't see any kernel interface that would allow strace to reliably
> recognize this 0x80 case.

I am strongly opposed to fudging registers to half-arsedly slightly
improve the epicly crappy ptrace(2) interface for syscalls.

To fix this right, please just add PTRACE_GET_SYSCALL_INFO or similar
to, in one shot, read out all the syscall details.  This means: arch,
no, arg0..arg5, and *whether it's entry or exit*.  I propose returning
this structure:

struct ptrace_syscall_info {
  u8 op;  /* 0 for entry, 1 for exit */
  u8 pad0;
  u16 pad1;
  u32 pad2;
  union {
struct seccomp_data syscall_entry;
s64 syscall_exit_retval;
  };
};

because struct seccomp_data already gets this right.  There's plenty
of opportunity to fine-tune this.  Now it works on all architectures.

Since struct seccomp_data may be extended in the future, the operation
should be:

ptrace(PTRACE_GET_SYSCALL_INFO, pid, (void *)sizeof(struct
ptrace_syscall_info), );

returns 0 on success and some error code if, for example, the current
ptrace stop isn't a syscall entry or exit.

--Andy


Re: [RFC] capabilities: Ambient capabilities

2015-03-12 Thread Andrew Lutomirski
On Thu, Mar 12, 2015 at 3:10 PM, Andrew G. Morgan  wrote:
> I'm unclear why you refer to the inheritable set in this test:
>
> +   } else {
> +   if (arg2 == PR_CAP_AMBIENT_RAISE &&
> +   (!cap_raised(current_cred()->cap_permitted, arg3) 
> ||
> +!cap_raised(current_cred()->cap_inheritable,
> +arg3)))
> +   return -EPERM;

It's to preserve the invariant that pA is always a subset of pI.

>
> I'm also unclear how you can turn off this new 'feature' for a process
> tree? As it is, the code creates an exploit path for a capable (pP !=
> 0) program with an exploitable flaw to create a privilege escalation
> for an arbitrary child program.

Huh?  If you exploit the parent, you already win.  Yes, if a kiddie
injects shellcode that does system("/bin/bash") into some pP != 0
program, they don't actually elevate their privileges.  On the other
hand, by the time an attacker injected shellcode for:

prctl(PR_CAP_AMBIENT, PR_CAP_AMBIENT_RAISE, CAP_SYS_ADMIN);
system("/bin/bash");

into a target, they can already do whatever they want.

> While I understand that everyone
> 'knows what they are doing' in implementing this change, I'm convinced
> that folk that are up to no good also do... Why not provide a lockable
> secure bit to selectively disable this support?

Show me a legitimate use case and I'll gladly implement a secure bit.
In the mean time, I don't even believe that there's a legitimate use
for any of the other secure bits (except keepcaps, and I don't know
why that's a securebit in the first place).

In the mean time, see CVE-2014-3215 for an example of why securebits
are probably more trouble than they're worth.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] capabilities: Ambient capabilities

2015-03-12 Thread Andrew Lutomirski
On Thu, Mar 12, 2015 at 3:10 PM, Andrew G. Morgan mor...@kernel.org wrote:
 I'm unclear why you refer to the inheritable set in this test:

 +   } else {
 +   if (arg2 == PR_CAP_AMBIENT_RAISE 
 +   (!cap_raised(current_cred()-cap_permitted, arg3) 
 ||
 +!cap_raised(current_cred()-cap_inheritable,
 +arg3)))
 +   return -EPERM;

It's to preserve the invariant that pA is always a subset of pI.


 I'm also unclear how you can turn off this new 'feature' for a process
 tree? As it is, the code creates an exploit path for a capable (pP !=
 0) program with an exploitable flaw to create a privilege escalation
 for an arbitrary child program.

Huh?  If you exploit the parent, you already win.  Yes, if a kiddie
injects shellcode that does system(/bin/bash) into some pP != 0
program, they don't actually elevate their privileges.  On the other
hand, by the time an attacker injected shellcode for:

prctl(PR_CAP_AMBIENT, PR_CAP_AMBIENT_RAISE, CAP_SYS_ADMIN);
system(/bin/bash);

into a target, they can already do whatever they want.

 While I understand that everyone
 'knows what they are doing' in implementing this change, I'm convinced
 that folk that are up to no good also do... Why not provide a lockable
 secure bit to selectively disable this support?

Show me a legitimate use case and I'll gladly implement a secure bit.
In the mean time, I don't even believe that there's a legitimate use
for any of the other secure bits (except keepcaps, and I don't know
why that's a securebit in the first place).

In the mean time, see CVE-2014-3215 for an example of why securebits
are probably more trouble than they're worth.

--Andy
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [GIT PULL] platform-drivers-x86 for 3.19

2015-01-15 Thread Andrew Lutomirski
On Jan 15, 2015 8:43 AM, "Kirill A. Shutemov"  wrote:
>
> On Tue, Jan 13, 2015 at 10:04:55AM -0800, Andrew Lutomirski wrote:
> > On Tue, Jan 13, 2015 at 9:56 AM, Darren Hart  wrote:
> > > On Mon, Jan 12, 2015 at 02:12:44PM -0800, Andrew Lutomirski wrote:
> > >> On Mon, Jan 12, 2015 at 12:30 PM, Kirill A. Shutemov
> > >>  wrote:
> > >> > On Mon, Jan 12, 2015 at 12:05:45PM -0800, Andrew Lutomirski wrote:
> > >> >> On Mon, Jan 12, 2015 at 12:03 PM, Kirill A. Shutemov
> > >> >>  wrote:
> > >> >> > On Mon, Jan 12, 2015 at 10:42:11AM -0800, Andrew Lutomirski wrote:
> > >> >> >> On Mon, Jan 12, 2015 at 10:38 AM, Darren Hart 
> > >> >> >>  wrote:
> > >> >> >> > On Mon, Jan 12, 2015 at 12:58:02AM +0200, Kirill A. Shutemov 
> > >> >> >> > wrote:
> > >> >> >> >> On Thu, Dec 18, 2014 at 09:51:27AM -0800, Darren Hart wrote:
> > >> >> >> >> > thinkpad-acpi using software mute simplifies the driver and 
> > >> >> >> >> > the user experience
> > >> >> >> >> > significantly.
> > >> >> >> >>
> > >> >> >> >> Except when it doesn't.
> > >> >> >> >>
> > >> >> >> >> I'm probably in minority, but I don't use fancy userspace to 
> > >> >> >> >> mess with my
> > >> >> >> >> mixer and the mute button worked just fine for me before the 
> > >> >> >> >> change.
> > >> >> >> >> Wasted half an hour to find out what happened is not a pure win 
> > >> >> >> >> from user
> > >> >> >> >> experience point of view.
> > >> >> >> >>
> > >> >> >> >> Is it really necessary to have software_mute_requested == true 
> > >> >> >> >> by default?
> > >> >> >> >> Can fancy userspace ask for desired behaviour instead and 
> > >> >> >> >> change kernel to
> > >> >> >> >> not send hotkeys change notification until software_mute is 
> > >> >> >> >> enabled?
> > >> >> >> >>
> > >> >> >> >> --
> > >> >> >> >>  Kirill A. Shutemov
> > >> >> >> >>
> > >> >> >> >
> > >> >> >> > Thanks for the report Kirill,
> > >> >> >> >
> > >> >> >> > Andy, we're at RC4, so if we need to fix (or revert) this fix, 
> > >> >> >> > we only have a
> > >> >> >> > couple weeks to do so.
> > >> >> >> >
> > >> >> >> > Kirill, to define the scope of the problem, if you specify
> > >> >> >> > software_mute_requested as false on the kernel command line, 
> > >> >> >> > does your system
> > >> >> >> > function as expected?
> > >> >> >>
> > >> >> >> If I understood Kirill's email correctly, the only issue is that he
> > >> >> >> liked the old behavior.  Kirill, is that correct?
> > >> >> >
> > >> >> > Yes. For now I use thinkpad_acpi.software_mute=0 to get old 
> > >> >> > behaviour.
> > >> >> >
> > >> >>
> > >> >> What aspect of the old behavior is better than the new default 
> > >> >> behavior?
> > >> >
> > >> > It's not necessary better. The new behavior just broke my use-case.
> > >> >
> > >> > I don't have anything in my system what would react on KEY_MUTE and I 
> > >> > used
> > >> > the functionality platform provides until it suddenly stopped working.
> > >> >
> > >> > I personally can live with new thinkpad_acpi. I'll probably update my
> > >> > configuration to react on the software button.
> > >> >
> > >> > But who else will get frustrated after update to v3.19?
> > >>
> > >> Can you try the equivalent of:
> > >>
> > >> bindsym XF86AudioMute exec amixer -q set Master toggle
> > >>
> 

Re: [GIT PULL] platform-drivers-x86 for 3.19

2015-01-15 Thread Andrew Lutomirski
On Jan 15, 2015 8:43 AM, Kirill A. Shutemov kir...@shutemov.name wrote:

 On Tue, Jan 13, 2015 at 10:04:55AM -0800, Andrew Lutomirski wrote:
  On Tue, Jan 13, 2015 at 9:56 AM, Darren Hart dvh...@infradead.org wrote:
   On Mon, Jan 12, 2015 at 02:12:44PM -0800, Andrew Lutomirski wrote:
   On Mon, Jan 12, 2015 at 12:30 PM, Kirill A. Shutemov
   kir...@shutemov.name wrote:
On Mon, Jan 12, 2015 at 12:05:45PM -0800, Andrew Lutomirski wrote:
On Mon, Jan 12, 2015 at 12:03 PM, Kirill A. Shutemov
kir...@shutemov.name wrote:
 On Mon, Jan 12, 2015 at 10:42:11AM -0800, Andrew Lutomirski wrote:
 On Mon, Jan 12, 2015 at 10:38 AM, Darren Hart 
 dvh...@infradead.org wrote:
  On Mon, Jan 12, 2015 at 12:58:02AM +0200, Kirill A. Shutemov 
  wrote:
  On Thu, Dec 18, 2014 at 09:51:27AM -0800, Darren Hart wrote:
   thinkpad-acpi using software mute simplifies the driver and 
   the user experience
   significantly.
 
  Except when it doesn't.
 
  I'm probably in minority, but I don't use fancy userspace to 
  mess with my
  mixer and the mute button worked just fine for me before the 
  change.
  Wasted half an hour to find out what happened is not a pure win 
  from user
  experience point of view.
 
  Is it really necessary to have software_mute_requested == true 
  by default?
  Can fancy userspace ask for desired behaviour instead and 
  change kernel to
  not send hotkeys change notification until software_mute is 
  enabled?
 
  --
   Kirill A. Shutemov
 
 
  Thanks for the report Kirill,
 
  Andy, we're at RC4, so if we need to fix (or revert) this fix, 
  we only have a
  couple weeks to do so.
 
  Kirill, to define the scope of the problem, if you specify
  software_mute_requested as false on the kernel command line, 
  does your system
  function as expected?

 If I understood Kirill's email correctly, the only issue is that he
 liked the old behavior.  Kirill, is that correct?

 Yes. For now I use thinkpad_acpi.software_mute=0 to get old 
 behaviour.

   
What aspect of the old behavior is better than the new default 
behavior?
   
It's not necessary better. The new behavior just broke my use-case.
   
I don't have anything in my system what would react on KEY_MUTE and I 
used
the functionality platform provides until it suddenly stopped working.
   
I personally can live with new thinkpad_acpi. I'll probably update my
configuration to react on the software button.
   
But who else will get frustrated after update to v3.19?
  
   Can you try the equivalent of:
  
   bindsym XF86AudioMute exec amixer -q set Master toggle
  
   for your desktop environment?  Note that this is exactly what you'd do
   if you were using any laptop that wasn't a thinkpad.
  
   Unless we hear back from Kirill on why this isn't workable, I'm inclined 
   to
   leave this as is. Between the kernel module option and this keybinding,
   equivalent behavior can be easily achieved, and the general use case is
   significantly improved.
  
   The current gap appears to be the mute LED in the button itself, which 
   can be
   addressed moving forward.
 
  I should have clarified this better.  The command:
 
  amixer -q set Master toggle
 
  should toggle the mute LED in sync with the master mixer on all
  ThinkPad models with a mute LED.  This works on my X220, and Borislav
  confirmed off-list that it works on his X230.  If there's a ThinkPad
  with a mute LED on which this doesn't work, then IMO it's a bug and
  should be fixed.

 Okay. Mute and mute led works. Any idea how to get mic mute led work?
 What happend to your patch on the topic?

 http://permalink.gmane.org/gmane.linux.drivers.platform.x86.devel/1962

It was addressed differently -- the mic mute works like the mute
button as a result of:

b317b032d2dc ALSA: hda - Split Thinkpad ACPI-related code
b67ae3f1c97e ALSA: hda - Enable Thinkpad mute/micmute LEDs for Realtek
08cf680ccafd ALSA: hda - add connection to thinkpad_acpi to control
mute/micmute LEDs

On my X220, amixer set Capture toggle toggles the mute LED.

--Andy
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [GIT PULL] platform-drivers-x86 for 3.19

2015-01-13 Thread Andrew Lutomirski
On Tue, Jan 13, 2015 at 9:56 AM, Darren Hart  wrote:
> On Mon, Jan 12, 2015 at 02:12:44PM -0800, Andrew Lutomirski wrote:
>> On Mon, Jan 12, 2015 at 12:30 PM, Kirill A. Shutemov
>>  wrote:
>> > On Mon, Jan 12, 2015 at 12:05:45PM -0800, Andrew Lutomirski wrote:
>> >> On Mon, Jan 12, 2015 at 12:03 PM, Kirill A. Shutemov
>> >>  wrote:
>> >> > On Mon, Jan 12, 2015 at 10:42:11AM -0800, Andrew Lutomirski wrote:
>> >> >> On Mon, Jan 12, 2015 at 10:38 AM, Darren Hart  
>> >> >> wrote:
>> >> >> > On Mon, Jan 12, 2015 at 12:58:02AM +0200, Kirill A. Shutemov wrote:
>> >> >> >> On Thu, Dec 18, 2014 at 09:51:27AM -0800, Darren Hart wrote:
>> >> >> >> > thinkpad-acpi using software mute simplifies the driver and the 
>> >> >> >> > user experience
>> >> >> >> > significantly.
>> >> >> >>
>> >> >> >> Except when it doesn't.
>> >> >> >>
>> >> >> >> I'm probably in minority, but I don't use fancy userspace to mess 
>> >> >> >> with my
>> >> >> >> mixer and the mute button worked just fine for me before the change.
>> >> >> >> Wasted half an hour to find out what happened is not a pure win 
>> >> >> >> from user
>> >> >> >> experience point of view.
>> >> >> >>
>> >> >> >> Is it really necessary to have software_mute_requested == true by 
>> >> >> >> default?
>> >> >> >> Can fancy userspace ask for desired behaviour instead and change 
>> >> >> >> kernel to
>> >> >> >> not send hotkeys change notification until software_mute is enabled?
>> >> >> >>
>> >> >> >> --
>> >> >> >>  Kirill A. Shutemov
>> >> >> >>
>> >> >> >
>> >> >> > Thanks for the report Kirill,
>> >> >> >
>> >> >> > Andy, we're at RC4, so if we need to fix (or revert) this fix, we 
>> >> >> > only have a
>> >> >> > couple weeks to do so.
>> >> >> >
>> >> >> > Kirill, to define the scope of the problem, if you specify
>> >> >> > software_mute_requested as false on the kernel command line, does 
>> >> >> > your system
>> >> >> > function as expected?
>> >> >>
>> >> >> If I understood Kirill's email correctly, the only issue is that he
>> >> >> liked the old behavior.  Kirill, is that correct?
>> >> >
>> >> > Yes. For now I use thinkpad_acpi.software_mute=0 to get old behaviour.
>> >> >
>> >>
>> >> What aspect of the old behavior is better than the new default behavior?
>> >
>> > It's not necessary better. The new behavior just broke my use-case.
>> >
>> > I don't have anything in my system what would react on KEY_MUTE and I used
>> > the functionality platform provides until it suddenly stopped working.
>> >
>> > I personally can live with new thinkpad_acpi. I'll probably update my
>> > configuration to react on the software button.
>> >
>> > But who else will get frustrated after update to v3.19?
>>
>> Can you try the equivalent of:
>>
>> bindsym XF86AudioMute exec amixer -q set Master toggle
>>
>> for your desktop environment?  Note that this is exactly what you'd do
>> if you were using any laptop that wasn't a thinkpad.
>
> Unless we hear back from Kirill on why this isn't workable, I'm inclined to
> leave this as is. Between the kernel module option and this keybinding,
> equivalent behavior can be easily achieved, and the general use case is
> significantly improved.
>
> The current gap appears to be the mute LED in the button itself, which can be
> addressed moving forward.

I should have clarified this better.  The command:

amixer -q set Master toggle

should toggle the mute LED in sync with the master mixer on all
ThinkPad models with a mute LED.  This works on my X220, and Borislav
confirmed off-list that it works on his X230.  If there's a ThinkPad
with a mute LED on which this doesn't work, then IMO it's a bug and
should be fixed.

--Andy

>
> --
> Darren Hart
> Intel Open Source Technology Center
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [GIT PULL] platform-drivers-x86 for 3.19

2015-01-13 Thread Andrew Lutomirski
On Tue, Jan 13, 2015 at 9:56 AM, Darren Hart dvh...@infradead.org wrote:
 On Mon, Jan 12, 2015 at 02:12:44PM -0800, Andrew Lutomirski wrote:
 On Mon, Jan 12, 2015 at 12:30 PM, Kirill A. Shutemov
 kir...@shutemov.name wrote:
  On Mon, Jan 12, 2015 at 12:05:45PM -0800, Andrew Lutomirski wrote:
  On Mon, Jan 12, 2015 at 12:03 PM, Kirill A. Shutemov
  kir...@shutemov.name wrote:
   On Mon, Jan 12, 2015 at 10:42:11AM -0800, Andrew Lutomirski wrote:
   On Mon, Jan 12, 2015 at 10:38 AM, Darren Hart dvh...@infradead.org 
   wrote:
On Mon, Jan 12, 2015 at 12:58:02AM +0200, Kirill A. Shutemov wrote:
On Thu, Dec 18, 2014 at 09:51:27AM -0800, Darren Hart wrote:
 thinkpad-acpi using software mute simplifies the driver and the 
 user experience
 significantly.
   
Except when it doesn't.
   
I'm probably in minority, but I don't use fancy userspace to mess 
with my
mixer and the mute button worked just fine for me before the change.
Wasted half an hour to find out what happened is not a pure win 
from user
experience point of view.
   
Is it really necessary to have software_mute_requested == true by 
default?
Can fancy userspace ask for desired behaviour instead and change 
kernel to
not send hotkeys change notification until software_mute is enabled?
   
--
 Kirill A. Shutemov
   
   
Thanks for the report Kirill,
   
Andy, we're at RC4, so if we need to fix (or revert) this fix, we 
only have a
couple weeks to do so.
   
Kirill, to define the scope of the problem, if you specify
software_mute_requested as false on the kernel command line, does 
your system
function as expected?
  
   If I understood Kirill's email correctly, the only issue is that he
   liked the old behavior.  Kirill, is that correct?
  
   Yes. For now I use thinkpad_acpi.software_mute=0 to get old behaviour.
  
 
  What aspect of the old behavior is better than the new default behavior?
 
  It's not necessary better. The new behavior just broke my use-case.
 
  I don't have anything in my system what would react on KEY_MUTE and I used
  the functionality platform provides until it suddenly stopped working.
 
  I personally can live with new thinkpad_acpi. I'll probably update my
  configuration to react on the software button.
 
  But who else will get frustrated after update to v3.19?

 Can you try the equivalent of:

 bindsym XF86AudioMute exec amixer -q set Master toggle

 for your desktop environment?  Note that this is exactly what you'd do
 if you were using any laptop that wasn't a thinkpad.

 Unless we hear back from Kirill on why this isn't workable, I'm inclined to
 leave this as is. Between the kernel module option and this keybinding,
 equivalent behavior can be easily achieved, and the general use case is
 significantly improved.

 The current gap appears to be the mute LED in the button itself, which can be
 addressed moving forward.

I should have clarified this better.  The command:

amixer -q set Master toggle

should toggle the mute LED in sync with the master mixer on all
ThinkPad models with a mute LED.  This works on my X220, and Borislav
confirmed off-list that it works on his X230.  If there's a ThinkPad
with a mute LED on which this doesn't work, then IMO it's a bug and
should be fixed.

--Andy


 --
 Darren Hart
 Intel Open Source Technology Center
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [GIT PULL] platform-drivers-x86 for 3.19

2015-01-12 Thread Andrew Lutomirski
On Mon, Jan 12, 2015 at 12:30 PM, Kirill A. Shutemov
 wrote:
> On Mon, Jan 12, 2015 at 12:05:45PM -0800, Andrew Lutomirski wrote:
>> On Mon, Jan 12, 2015 at 12:03 PM, Kirill A. Shutemov
>>  wrote:
>> > On Mon, Jan 12, 2015 at 10:42:11AM -0800, Andrew Lutomirski wrote:
>> >> On Mon, Jan 12, 2015 at 10:38 AM, Darren Hart  
>> >> wrote:
>> >> > On Mon, Jan 12, 2015 at 12:58:02AM +0200, Kirill A. Shutemov wrote:
>> >> >> On Thu, Dec 18, 2014 at 09:51:27AM -0800, Darren Hart wrote:
>> >> >> > thinkpad-acpi using software mute simplifies the driver and the user 
>> >> >> > experience
>> >> >> > significantly.
>> >> >>
>> >> >> Except when it doesn't.
>> >> >>
>> >> >> I'm probably in minority, but I don't use fancy userspace to mess with 
>> >> >> my
>> >> >> mixer and the mute button worked just fine for me before the change.
>> >> >> Wasted half an hour to find out what happened is not a pure win from 
>> >> >> user
>> >> >> experience point of view.
>> >> >>
>> >> >> Is it really necessary to have software_mute_requested == true by 
>> >> >> default?
>> >> >> Can fancy userspace ask for desired behaviour instead and change 
>> >> >> kernel to
>> >> >> not send hotkeys change notification until software_mute is enabled?
>> >> >>
>> >> >> --
>> >> >>  Kirill A. Shutemov
>> >> >>
>> >> >
>> >> > Thanks for the report Kirill,
>> >> >
>> >> > Andy, we're at RC4, so if we need to fix (or revert) this fix, we only 
>> >> > have a
>> >> > couple weeks to do so.
>> >> >
>> >> > Kirill, to define the scope of the problem, if you specify
>> >> > software_mute_requested as false on the kernel command line, does your 
>> >> > system
>> >> > function as expected?
>> >>
>> >> If I understood Kirill's email correctly, the only issue is that he
>> >> liked the old behavior.  Kirill, is that correct?
>> >
>> > Yes. For now I use thinkpad_acpi.software_mute=0 to get old behaviour.
>> >
>>
>> What aspect of the old behavior is better than the new default behavior?
>
> It's not necessary better. The new behavior just broke my use-case.
>
> I don't have anything in my system what would react on KEY_MUTE and I used
> the functionality platform provides until it suddenly stopped working.
>
> I personally can live with new thinkpad_acpi. I'll probably update my
> configuration to react on the software button.
>
> But who else will get frustrated after update to v3.19?

Can you try the equivalent of:

bindsym XF86AudioMute exec amixer -q set Master toggle

for your desktop environment?  Note that this is exactly what you'd do
if you were using any laptop that wasn't a thinkpad.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [GIT PULL] platform-drivers-x86 for 3.19

2015-01-12 Thread Andrew Lutomirski
On Mon, Jan 12, 2015 at 12:30 PM, Kirill A. Shutemov
 wrote:
> On Mon, Jan 12, 2015 at 12:05:45PM -0800, Andrew Lutomirski wrote:
>> On Mon, Jan 12, 2015 at 12:03 PM, Kirill A. Shutemov
>>  wrote:
>> > On Mon, Jan 12, 2015 at 10:42:11AM -0800, Andrew Lutomirski wrote:
>> >> On Mon, Jan 12, 2015 at 10:38 AM, Darren Hart  
>> >> wrote:
>> >> > On Mon, Jan 12, 2015 at 12:58:02AM +0200, Kirill A. Shutemov wrote:
>> >> >> On Thu, Dec 18, 2014 at 09:51:27AM -0800, Darren Hart wrote:
>> >> >> > thinkpad-acpi using software mute simplifies the driver and the user 
>> >> >> > experience
>> >> >> > significantly.
>> >> >>
>> >> >> Except when it doesn't.
>> >> >>
>> >> >> I'm probably in minority, but I don't use fancy userspace to mess with 
>> >> >> my
>> >> >> mixer and the mute button worked just fine for me before the change.
>> >> >> Wasted half an hour to find out what happened is not a pure win from 
>> >> >> user
>> >> >> experience point of view.
>> >> >>
>> >> >> Is it really necessary to have software_mute_requested == true by 
>> >> >> default?
>> >> >> Can fancy userspace ask for desired behaviour instead and change 
>> >> >> kernel to
>> >> >> not send hotkeys change notification until software_mute is enabled?
>> >> >>
>> >> >> --
>> >> >>  Kirill A. Shutemov
>> >> >>
>> >> >
>> >> > Thanks for the report Kirill,
>> >> >
>> >> > Andy, we're at RC4, so if we need to fix (or revert) this fix, we only 
>> >> > have a
>> >> > couple weeks to do so.
>> >> >
>> >> > Kirill, to define the scope of the problem, if you specify
>> >> > software_mute_requested as false on the kernel command line, does your 
>> >> > system
>> >> > function as expected?
>> >>
>> >> If I understood Kirill's email correctly, the only issue is that he
>> >> liked the old behavior.  Kirill, is that correct?
>> >
>> > Yes. For now I use thinkpad_acpi.software_mute=0 to get old behaviour.
>> >
>>
>> What aspect of the old behavior is better than the new default behavior?
>
> It's not necessary better. The new behavior just broke my use-case.
>
> I don't have anything in my system what would react on KEY_MUTE and I used
> the functionality platform provides until it suddenly stopped working.
>
> I personally can live with new thinkpad_acpi. I'll probably update my
> configuration to react on the software button.

You'd have to do that anyway if you switched to a non-ThinkPad, for
better or for worse.

>
> But who else will get frustrated after update to v3.19?

Hopefully fewer people than got frustrated by the old, broken
behavior.  (It probably wasn't obviously broken for you, because you
didn't have software that reacted to KEY_MUTE.  But, if you did, you
might have noticed all kinds of strange behavior.)

--Andy

>
> --
>  Kirill A. Shutemov
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [GIT PULL] platform-drivers-x86 for 3.19

2015-01-12 Thread Andrew Lutomirski
On Mon, Jan 12, 2015 at 12:26 PM, Borislav Petkov  wrote:
> On Mon, Jan 12, 2015 at 12:05:45PM -0800, Andrew Lutomirski wrote:
>> What aspect of the old behavior is better than the new default behavior?
>
> Btw, in my case, if I boot without the thinkpad_acpi.software_mute=0
> thing, the small control light in the mute button doesn't light up to
> show that mute is enabled. It still mutes properly in both cases though
> so it is only a feedback thing which doesn't work anymore...

This is supposed to work, and it works fine on my X220.  What Thinkpad
do you have, and what distro are you running?

Can you run alsamixer (with software_mute=1) and watch the controls to
see what changes when you press mute?  IIRC, when the master volume
goes to 0 or is muted, the magic ALSA hooks will tell thinkpad_acpi so
that thinkpad_acpi can adjust the LED.

--Andy

>
> --
> Regards/Gruss,
> Boris.
>
> Sent from a fat crate under my desk. Formatting is fine.
> --
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [GIT PULL] platform-drivers-x86 for 3.19

2015-01-12 Thread Andrew Lutomirski
On Mon, Jan 12, 2015 at 12:03 PM, Kirill A. Shutemov
 wrote:
> On Mon, Jan 12, 2015 at 10:42:11AM -0800, Andrew Lutomirski wrote:
>> On Mon, Jan 12, 2015 at 10:38 AM, Darren Hart  wrote:
>> > On Mon, Jan 12, 2015 at 12:58:02AM +0200, Kirill A. Shutemov wrote:
>> >> On Thu, Dec 18, 2014 at 09:51:27AM -0800, Darren Hart wrote:
>> >> > thinkpad-acpi using software mute simplifies the driver and the user 
>> >> > experience
>> >> > significantly.
>> >>
>> >> Except when it doesn't.
>> >>
>> >> I'm probably in minority, but I don't use fancy userspace to mess with my
>> >> mixer and the mute button worked just fine for me before the change.
>> >> Wasted half an hour to find out what happened is not a pure win from user
>> >> experience point of view.
>> >>
>> >> Is it really necessary to have software_mute_requested == true by default?
>> >> Can fancy userspace ask for desired behaviour instead and change kernel to
>> >> not send hotkeys change notification until software_mute is enabled?
>> >>
>> >> --
>> >>  Kirill A. Shutemov
>> >>
>> >
>> > Thanks for the report Kirill,
>> >
>> > Andy, we're at RC4, so if we need to fix (or revert) this fix, we only 
>> > have a
>> > couple weeks to do so.
>> >
>> > Kirill, to define the scope of the problem, if you specify
>> > software_mute_requested as false on the kernel command line, does your 
>> > system
>> > function as expected?
>>
>> If I understood Kirill's email correctly, the only issue is that he
>> liked the old behavior.  Kirill, is that correct?
>
> Yes. For now I use thinkpad_acpi.software_mute=0 to get old behaviour.
>

What aspect of the old behavior is better than the new default behavior?

--Andy

> --
>  Kirill A. Shutemov
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [GIT PULL] platform-drivers-x86 for 3.19

2015-01-12 Thread Andrew Lutomirski
On Mon, Jan 12, 2015 at 10:38 AM, Darren Hart  wrote:
> On Mon, Jan 12, 2015 at 12:58:02AM +0200, Kirill A. Shutemov wrote:
>> On Thu, Dec 18, 2014 at 09:51:27AM -0800, Darren Hart wrote:
>> > thinkpad-acpi using software mute simplifies the driver and the user 
>> > experience
>> > significantly.
>>
>> Except when it doesn't.
>>
>> I'm probably in minority, but I don't use fancy userspace to mess with my
>> mixer and the mute button worked just fine for me before the change.
>> Wasted half an hour to find out what happened is not a pure win from user
>> experience point of view.
>>
>> Is it really necessary to have software_mute_requested == true by default?
>> Can fancy userspace ask for desired behaviour instead and change kernel to
>> not send hotkeys change notification until software_mute is enabled?
>>
>> --
>>  Kirill A. Shutemov
>>
>
> Thanks for the report Kirill,
>
> Andy, we're at RC4, so if we need to fix (or revert) this fix, we only have a
> couple weeks to do so.
>
> Kirill, to define the scope of the problem, if you specify
> software_mute_requested as false on the kernel command line, does your system
> function as expected?

If I understood Kirill's email correctly, the only issue is that he
liked the old behavior.  Kirill, is that correct?

--Andy

>
> Thanks,
>
> --
> Darren Hart
> Intel Open Source Technology Center
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [GIT PULL] platform-drivers-x86 for 3.19

2015-01-12 Thread Andrew Lutomirski
On Mon, Jan 12, 2015 at 12:03 PM, Kirill A. Shutemov
kir...@shutemov.name wrote:
 On Mon, Jan 12, 2015 at 10:42:11AM -0800, Andrew Lutomirski wrote:
 On Mon, Jan 12, 2015 at 10:38 AM, Darren Hart dvh...@infradead.org wrote:
  On Mon, Jan 12, 2015 at 12:58:02AM +0200, Kirill A. Shutemov wrote:
  On Thu, Dec 18, 2014 at 09:51:27AM -0800, Darren Hart wrote:
   thinkpad-acpi using software mute simplifies the driver and the user 
   experience
   significantly.
 
  Except when it doesn't.
 
  I'm probably in minority, but I don't use fancy userspace to mess with my
  mixer and the mute button worked just fine for me before the change.
  Wasted half an hour to find out what happened is not a pure win from user
  experience point of view.
 
  Is it really necessary to have software_mute_requested == true by default?
  Can fancy userspace ask for desired behaviour instead and change kernel to
  not send hotkeys change notification until software_mute is enabled?
 
  --
   Kirill A. Shutemov
 
 
  Thanks for the report Kirill,
 
  Andy, we're at RC4, so if we need to fix (or revert) this fix, we only 
  have a
  couple weeks to do so.
 
  Kirill, to define the scope of the problem, if you specify
  software_mute_requested as false on the kernel command line, does your 
  system
  function as expected?

 If I understood Kirill's email correctly, the only issue is that he
 liked the old behavior.  Kirill, is that correct?

 Yes. For now I use thinkpad_acpi.software_mute=0 to get old behaviour.


What aspect of the old behavior is better than the new default behavior?

--Andy

 --
  Kirill A. Shutemov
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [GIT PULL] platform-drivers-x86 for 3.19

2015-01-12 Thread Andrew Lutomirski
On Mon, Jan 12, 2015 at 12:30 PM, Kirill A. Shutemov
kir...@shutemov.name wrote:
 On Mon, Jan 12, 2015 at 12:05:45PM -0800, Andrew Lutomirski wrote:
 On Mon, Jan 12, 2015 at 12:03 PM, Kirill A. Shutemov
 kir...@shutemov.name wrote:
  On Mon, Jan 12, 2015 at 10:42:11AM -0800, Andrew Lutomirski wrote:
  On Mon, Jan 12, 2015 at 10:38 AM, Darren Hart dvh...@infradead.org 
  wrote:
   On Mon, Jan 12, 2015 at 12:58:02AM +0200, Kirill A. Shutemov wrote:
   On Thu, Dec 18, 2014 at 09:51:27AM -0800, Darren Hart wrote:
thinkpad-acpi using software mute simplifies the driver and the user 
experience
significantly.
  
   Except when it doesn't.
  
   I'm probably in minority, but I don't use fancy userspace to mess with 
   my
   mixer and the mute button worked just fine for me before the change.
   Wasted half an hour to find out what happened is not a pure win from 
   user
   experience point of view.
  
   Is it really necessary to have software_mute_requested == true by 
   default?
   Can fancy userspace ask for desired behaviour instead and change 
   kernel to
   not send hotkeys change notification until software_mute is enabled?
  
   --
Kirill A. Shutemov
  
  
   Thanks for the report Kirill,
  
   Andy, we're at RC4, so if we need to fix (or revert) this fix, we only 
   have a
   couple weeks to do so.
  
   Kirill, to define the scope of the problem, if you specify
   software_mute_requested as false on the kernel command line, does your 
   system
   function as expected?
 
  If I understood Kirill's email correctly, the only issue is that he
  liked the old behavior.  Kirill, is that correct?
 
  Yes. For now I use thinkpad_acpi.software_mute=0 to get old behaviour.
 

 What aspect of the old behavior is better than the new default behavior?

 It's not necessary better. The new behavior just broke my use-case.

 I don't have anything in my system what would react on KEY_MUTE and I used
 the functionality platform provides until it suddenly stopped working.

 I personally can live with new thinkpad_acpi. I'll probably update my
 configuration to react on the software button.

You'd have to do that anyway if you switched to a non-ThinkPad, for
better or for worse.


 But who else will get frustrated after update to v3.19?

Hopefully fewer people than got frustrated by the old, broken
behavior.  (It probably wasn't obviously broken for you, because you
didn't have software that reacted to KEY_MUTE.  But, if you did, you
might have noticed all kinds of strange behavior.)

--Andy


 --
  Kirill A. Shutemov
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [GIT PULL] platform-drivers-x86 for 3.19

2015-01-12 Thread Andrew Lutomirski
On Mon, Jan 12, 2015 at 12:26 PM, Borislav Petkov b...@alien8.de wrote:
 On Mon, Jan 12, 2015 at 12:05:45PM -0800, Andrew Lutomirski wrote:
 What aspect of the old behavior is better than the new default behavior?

 Btw, in my case, if I boot without the thinkpad_acpi.software_mute=0
 thing, the small control light in the mute button doesn't light up to
 show that mute is enabled. It still mutes properly in both cases though
 so it is only a feedback thing which doesn't work anymore...

This is supposed to work, and it works fine on my X220.  What Thinkpad
do you have, and what distro are you running?

Can you run alsamixer (with software_mute=1) and watch the controls to
see what changes when you press mute?  IIRC, when the master volume
goes to 0 or is muted, the magic ALSA hooks will tell thinkpad_acpi so
that thinkpad_acpi can adjust the LED.

--Andy


 --
 Regards/Gruss,
 Boris.

 Sent from a fat crate under my desk. Formatting is fine.
 --
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [GIT PULL] platform-drivers-x86 for 3.19

2015-01-12 Thread Andrew Lutomirski
On Mon, Jan 12, 2015 at 12:30 PM, Kirill A. Shutemov
kir...@shutemov.name wrote:
 On Mon, Jan 12, 2015 at 12:05:45PM -0800, Andrew Lutomirski wrote:
 On Mon, Jan 12, 2015 at 12:03 PM, Kirill A. Shutemov
 kir...@shutemov.name wrote:
  On Mon, Jan 12, 2015 at 10:42:11AM -0800, Andrew Lutomirski wrote:
  On Mon, Jan 12, 2015 at 10:38 AM, Darren Hart dvh...@infradead.org 
  wrote:
   On Mon, Jan 12, 2015 at 12:58:02AM +0200, Kirill A. Shutemov wrote:
   On Thu, Dec 18, 2014 at 09:51:27AM -0800, Darren Hart wrote:
thinkpad-acpi using software mute simplifies the driver and the user 
experience
significantly.
  
   Except when it doesn't.
  
   I'm probably in minority, but I don't use fancy userspace to mess with 
   my
   mixer and the mute button worked just fine for me before the change.
   Wasted half an hour to find out what happened is not a pure win from 
   user
   experience point of view.
  
   Is it really necessary to have software_mute_requested == true by 
   default?
   Can fancy userspace ask for desired behaviour instead and change 
   kernel to
   not send hotkeys change notification until software_mute is enabled?
  
   --
Kirill A. Shutemov
  
  
   Thanks for the report Kirill,
  
   Andy, we're at RC4, so if we need to fix (or revert) this fix, we only 
   have a
   couple weeks to do so.
  
   Kirill, to define the scope of the problem, if you specify
   software_mute_requested as false on the kernel command line, does your 
   system
   function as expected?
 
  If I understood Kirill's email correctly, the only issue is that he
  liked the old behavior.  Kirill, is that correct?
 
  Yes. For now I use thinkpad_acpi.software_mute=0 to get old behaviour.
 

 What aspect of the old behavior is better than the new default behavior?

 It's not necessary better. The new behavior just broke my use-case.

 I don't have anything in my system what would react on KEY_MUTE and I used
 the functionality platform provides until it suddenly stopped working.

 I personally can live with new thinkpad_acpi. I'll probably update my
 configuration to react on the software button.

 But who else will get frustrated after update to v3.19?

Can you try the equivalent of:

bindsym XF86AudioMute exec amixer -q set Master toggle

for your desktop environment?  Note that this is exactly what you'd do
if you were using any laptop that wasn't a thinkpad.

--Andy
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [GIT PULL] platform-drivers-x86 for 3.19

2015-01-12 Thread Andrew Lutomirski
On Mon, Jan 12, 2015 at 10:38 AM, Darren Hart dvh...@infradead.org wrote:
 On Mon, Jan 12, 2015 at 12:58:02AM +0200, Kirill A. Shutemov wrote:
 On Thu, Dec 18, 2014 at 09:51:27AM -0800, Darren Hart wrote:
  thinkpad-acpi using software mute simplifies the driver and the user 
  experience
  significantly.

 Except when it doesn't.

 I'm probably in minority, but I don't use fancy userspace to mess with my
 mixer and the mute button worked just fine for me before the change.
 Wasted half an hour to find out what happened is not a pure win from user
 experience point of view.

 Is it really necessary to have software_mute_requested == true by default?
 Can fancy userspace ask for desired behaviour instead and change kernel to
 not send hotkeys change notification until software_mute is enabled?

 --
  Kirill A. Shutemov


 Thanks for the report Kirill,

 Andy, we're at RC4, so if we need to fix (or revert) this fix, we only have a
 couple weeks to do so.

 Kirill, to define the scope of the problem, if you specify
 software_mute_requested as false on the kernel command line, does your system
 function as expected?

If I understood Kirill's email correctly, the only issue is that he
liked the old behavior.  Kirill, is that correct?

--Andy


 Thanks,

 --
 Darren Hart
 Intel Open Source Technology Center
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [GIT PULL] platform-drivers-x86 for 3.19

2015-01-11 Thread Andrew Lutomirski
On Sun, Jan 11, 2015 at 2:58 PM, Kirill A. Shutemov
 wrote:
> On Thu, Dec 18, 2014 at 09:51:27AM -0800, Darren Hart wrote:
>> thinkpad-acpi using software mute simplifies the driver and the user 
>> experience
>> significantly.
>
> Except when it doesn't.
>
> I'm probably in minority, but I don't use fancy userspace to mess with my
> mixer and the mute button worked just fine for me before the change.
> Wasted half an hour to find out what happened is not a pure win from user
> experience point of view.
>
> Is it really necessary to have software_mute_requested == true by default?
> Can fancy userspace ask for desired behaviour instead and change kernel to
> not send hotkeys change notification until software_mute is enabled?

The problem is that fancy userspace (which isn't really very fancy,
nor does it need to be new) expects KEY_MUTE to be a normal keypress.
It turns out to be a normal keypress on every single computer in
existence AFAICT except ThinkPads.  As a result, any remotely portable
user program that handled volume hotkeys got confused on ThinkPads.

I think the real solution is to have some little daemon that handles
KEY_MUTE and changes the ALSA state if that's the behavior you want.
Having the EC change the sound output *without* changing the ALSA
state or even giving a notification (the pre-3.19 behavior) really
doesn't seem like a sensible default to me.

(It may be that on newer pre-3.19 kernels, it wasn't quite as bad as
it was on much older kernels, since the ALSA hack that tries to
control the mute LED seems to feed back into the EC control of the
hardware mute switch, so at least the light would stay in sync with
everything.)

--Andy

>
> --
>  Kirill A. Shutemov
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [GIT PULL] platform-drivers-x86 for 3.19

2015-01-11 Thread Andrew Lutomirski
On Sun, Jan 11, 2015 at 2:58 PM, Kirill A. Shutemov
kir...@shutemov.name wrote:
 On Thu, Dec 18, 2014 at 09:51:27AM -0800, Darren Hart wrote:
 thinkpad-acpi using software mute simplifies the driver and the user 
 experience
 significantly.

 Except when it doesn't.

 I'm probably in minority, but I don't use fancy userspace to mess with my
 mixer and the mute button worked just fine for me before the change.
 Wasted half an hour to find out what happened is not a pure win from user
 experience point of view.

 Is it really necessary to have software_mute_requested == true by default?
 Can fancy userspace ask for desired behaviour instead and change kernel to
 not send hotkeys change notification until software_mute is enabled?

The problem is that fancy userspace (which isn't really very fancy,
nor does it need to be new) expects KEY_MUTE to be a normal keypress.
It turns out to be a normal keypress on every single computer in
existence AFAICT except ThinkPads.  As a result, any remotely portable
user program that handled volume hotkeys got confused on ThinkPads.

I think the real solution is to have some little daemon that handles
KEY_MUTE and changes the ALSA state if that's the behavior you want.
Having the EC change the sound output *without* changing the ALSA
state or even giving a notification (the pre-3.19 behavior) really
doesn't seem like a sensible default to me.

(It may be that on newer pre-3.19 kernels, it wasn't quite as bad as
it was on much older kernels, since the ALSA hack that tries to
control the mute LED seems to feed back into the EC control of the
hardware mute switch, so at least the light would stay in sync with
everything.)

--Andy


 --
  Kirill A. Shutemov
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: DO NOT APPLY: x86-64, espfix: Don't leak bits 31:16 of %esp returning to 16-bit stack

2014-07-15 Thread Andrew Lutomirski
On Tue, Jul 15, 2014 at 4:52 PM, Greg KH  wrote:
> On Tue, Jul 15, 2014 at 04:21:46PM -0700, Andrew Lutomirski wrote:
>> On Tue, Jul 15, 2014 at 2:28 PM, Kamal Mostafa  wrote:
>> > 3.13.11.5 -stable review patch.  If anyone has any objections, please let 
>> > me know.
>> >
>> > --
>> >
>> > From: "H. Peter Anvin" 
>> >
>> > commit 3891a04aafd668686239349ea58f3314ea2af86b upstream.
>>
>> Do not apply to any -stable release yet.  This causes nasty regressions on 
>> Xen.
>
> I thought you all found the Xen-regression-fix patch a few hours ago,
> right?

That patch is insufficient: Xen guests still fail to initialize
espfix64 correctly on SMP.  It's currently unclear that espfix64 can
work at all on Xen -- it's may be rather fundamentally incompatible
with the Xen hypercall IRET mechanism.  So it might need to be
disabled entirely on Xen (and maybe Xen will fix the info leak in the
hypervisor).

>
> This means that Linus's tree is also broken?

Yes.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


DO NOT APPLY: x86-64, espfix: Don't leak bits 31:16 of %esp returning to 16-bit stack

2014-07-15 Thread Andrew Lutomirski
On Tue, Jul 15, 2014 at 2:28 PM, Kamal Mostafa  wrote:
> 3.13.11.5 -stable review patch.  If anyone has any objections, please let me 
> know.
>
> --
>
> From: "H. Peter Anvin" 
>
> commit 3891a04aafd668686239349ea58f3314ea2af86b upstream.

Do not apply to any -stable release yet.  This causes nasty regressions on Xen.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


DO NOT APPLY: x86-64, espfix: Don't leak bits 31:16 of %esp returning to 16-bit stack

2014-07-15 Thread Andrew Lutomirski
On Tue, Jul 15, 2014 at 2:28 PM, Kamal Mostafa ka...@canonical.com wrote:
 3.13.11.5 -stable review patch.  If anyone has any objections, please let me 
 know.

 --

 From: H. Peter Anvin h...@linux.intel.com

 commit 3891a04aafd668686239349ea58f3314ea2af86b upstream.

Do not apply to any -stable release yet.  This causes nasty regressions on Xen.

--Andy
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: DO NOT APPLY: x86-64, espfix: Don't leak bits 31:16 of %esp returning to 16-bit stack

2014-07-15 Thread Andrew Lutomirski
On Tue, Jul 15, 2014 at 4:52 PM, Greg KH g...@kroah.com wrote:
 On Tue, Jul 15, 2014 at 04:21:46PM -0700, Andrew Lutomirski wrote:
 On Tue, Jul 15, 2014 at 2:28 PM, Kamal Mostafa ka...@canonical.com wrote:
  3.13.11.5 -stable review patch.  If anyone has any objections, please let 
  me know.
 
  --
 
  From: H. Peter Anvin h...@linux.intel.com
 
  commit 3891a04aafd668686239349ea58f3314ea2af86b upstream.

 Do not apply to any -stable release yet.  This causes nasty regressions on 
 Xen.

 I thought you all found the Xen-regression-fix patch a few hours ago,
 right?

That patch is insufficient: Xen guests still fail to initialize
espfix64 correctly on SMP.  It's currently unclear that espfix64 can
work at all on Xen -- it's may be rather fundamentally incompatible
with the Xen hypercall IRET mechanism.  So it might need to be
disabled entirely on Xen (and maybe Xen will fix the info leak in the
hypervisor).


 This means that Linus's tree is also broken?

Yes.

--Andy
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [systemd-devel] [PATCH v5 12/14] autoconf: xen: enable explicit preference option for xenstored preference

2014-06-05 Thread Andrew Lutomirski
On Thu, Jun 5, 2014 at 12:24 PM, Lennart Poettering
 wrote:
> On Thu, 05.06.14 20:01, Luis R. Rodriguez (mcg...@suse.com) wrote:
>
>> > Hmm? You should "exec" the real daemon binary at the end, not just fork
>> > it off. That wait the shell script process is replaced by the daemon
>> > binary, which is what you want.
>>
>> I tried both just running it and also running exec foo; both presented
>> the same issue given that shell exec does not really execve.
>
> Hmmm? You shell's "exec" command doesn't actually execve()? What are you
> using? This doesn't sound very accurate...

$ strace -e execve /bin/sh -c 'exec /bin/echo test'
execve("/bin/sh", ["/bin/sh", "-c", "exec /bin/echo test"], [/* 54 vars */]) = 0
execve("/bin/echo", ["/bin/echo", "test"], [/* 54 vars */]) = 0
test
+++ exited with 0 +++

I get similar results on Ubuntu using dash.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [systemd-devel] [PATCH v5 12/14] autoconf: xen: enable explicit preference option for xenstored preference

2014-06-05 Thread Andrew Lutomirski
On Thu, Jun 5, 2014 at 12:24 PM, Lennart Poettering
mzxre...@0pointer.de wrote:
 On Thu, 05.06.14 20:01, Luis R. Rodriguez (mcg...@suse.com) wrote:

  Hmm? You should exec the real daemon binary at the end, not just fork
  it off. That wait the shell script process is replaced by the daemon
  binary, which is what you want.

 I tried both just running it and also running exec foo; both presented
 the same issue given that shell exec does not really execve.

 Hmmm? You shell's exec command doesn't actually execve()? What are you
 using? This doesn't sound very accurate...

$ strace -e execve /bin/sh -c 'exec /bin/echo test'
execve(/bin/sh, [/bin/sh, -c, exec /bin/echo test], [/* 54 vars */]) = 0
execve(/bin/echo, [/bin/echo, test], [/* 54 vars */]) = 0
test
+++ exited with 0 +++

I get similar results on Ubuntu using dash.

--Andy
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-28 Thread Andrew Lutomirski
On Mon, Apr 28, 2014 at 5:02 PM, Andrew Lutomirski  wrote:
> On Mon, Apr 28, 2014 at 4:08 PM, H. Peter Anvin  wrote:
>> On 04/28/2014 04:05 PM, H. Peter Anvin wrote:
>>>
>>> So I tried writing this bit up, but it fails in some rather spectacular
>>> ways.  Furthermore, I have been unable to debug it under Qemu, because
>>> breakpoints don't work right (common Qemu problem, sadly.)
>>>
>>> The kernel code is at:
>>>
>>> https://git.kernel.org/cgit/linux/kernel/git/hpa/espfix64.git/
>>>
>>> There are two tests:
>>>
>>> git://git.zytor.com/users/hpa/test16/test16.git, build it, and run
>>> ./run16 test/hello.elf
>>> http://www.zytor.com/~hpa/ldttest.c
>>>
>>> The former will exercise the irq_return_ldt path, but not the fault
>>> path; the latter will exercise the fault path, but doesn't actually use
>>> a 16-bit segment.
>>>
>>> Under the 3.14 stock kernel, the former should die with SIGBUS and the
>>> latter should pass.
>>>
>>
>> Current status of the above code: if I remove the randomization in
>> espfix_64.c then the first test passes; the second generally crashes the
>> machine.  With the randomization there, both generally crash the machine.
>>
>> All my testing so far has been under KVM or Qemu, so there is always the
>> possibility that I'm chasing a KVM/Qemu bug, but I suspect it is
>> something simpler than that.
>
> I'm compiling your branch.  In the mean time, two possibly stupid questions:
>
> What's the assembly code in the double-fault entry for?
>
> Have you tried hbreak in qemu?  I've had better luck with hbreak than
> regular break in the past.
>

ldttest segfaults on 3.13 and 3.14 for me.  It reboots (triple fault?)
on your branch.  It even said this:

qemu-system-x86_64: 9pfs:virtfs_reset: One or more uncluncked fids
found during reset

I have no idea what an uncluncked fd is :)

hello.elf fails to sigbus.  weird.  gdb says:

1: x/i $pc
=> 0x8170559c :
jmp0x81705537 
(gdb) si

1: x/i $pc
=> 0x81705537 :iretq
(gdb) si
Cannot access memory at address 0xf000f
(gdb) info registers
rax0xffe4000f1000-7881234923384832
rbx0x10001068719476752
rcx0xffe4f558f000-7611541041909760
rdx0x805d000134598656
rsi0x10217ffe3283772784279523
rdi0xf000764424509447
rbp0xf000f0xf000f
rsp0xf000f0xf000f
r8 0x00
r9 0x00
r100x00
r110x00
r120x00
r130x00
r140x00
r150x00
rip0x00x0 
eflags 0x0[ ]
cs 0x00
ss 0x37f895
ds 0x00
es 0x00
fs 0x00
---Type  to continue, or q  to quit---
gs 0x00

I got this with 'hbreak irq_return_ldt' using 'target remote :1234'
and virtme-run --console --kimg
~/apps/linux-devel/arch/x86/boot/bzImage --qemu-opts -s

This set of registers looks thoroughly bogus.  I don't trust it.  I'm
now stuck -- single-stepping stays exactly where it started.
Something is rather screwed up here.  Telling gdb to continue causes
gdb to explode and 'Hello, Afterworld!' to be displayed.

I was not able to get a breakpoint on __do_double_fault to hit.

FWIW, I think that gdb is known to have issues debugging a guest that
switches bitness.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-28 Thread Andrew Lutomirski
On Mon, Apr 28, 2014 at 4:08 PM, H. Peter Anvin  wrote:
> On 04/28/2014 04:05 PM, H. Peter Anvin wrote:
>>
>> So I tried writing this bit up, but it fails in some rather spectacular
>> ways.  Furthermore, I have been unable to debug it under Qemu, because
>> breakpoints don't work right (common Qemu problem, sadly.)
>>
>> The kernel code is at:
>>
>> https://git.kernel.org/cgit/linux/kernel/git/hpa/espfix64.git/
>>
>> There are two tests:
>>
>> git://git.zytor.com/users/hpa/test16/test16.git, build it, and run
>> ./run16 test/hello.elf
>> http://www.zytor.com/~hpa/ldttest.c
>>
>> The former will exercise the irq_return_ldt path, but not the fault
>> path; the latter will exercise the fault path, but doesn't actually use
>> a 16-bit segment.
>>
>> Under the 3.14 stock kernel, the former should die with SIGBUS and the
>> latter should pass.
>>
>
> Current status of the above code: if I remove the randomization in
> espfix_64.c then the first test passes; the second generally crashes the
> machine.  With the randomization there, both generally crash the machine.
>
> All my testing so far has been under KVM or Qemu, so there is always the
> possibility that I'm chasing a KVM/Qemu bug, but I suspect it is
> something simpler than that.

I'm compiling your branch.  In the mean time, two possibly stupid questions:

What's the assembly code in the double-fault entry for?

Have you tried hbreak in qemu?  I've had better luck with hbreak than
regular break in the past.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-28 Thread Andrew Lutomirski
On Mon, Apr 28, 2014 at 4:08 PM, H. Peter Anvin h...@linux.intel.com wrote:
 On 04/28/2014 04:05 PM, H. Peter Anvin wrote:

 So I tried writing this bit up, but it fails in some rather spectacular
 ways.  Furthermore, I have been unable to debug it under Qemu, because
 breakpoints don't work right (common Qemu problem, sadly.)

 The kernel code is at:

 https://git.kernel.org/cgit/linux/kernel/git/hpa/espfix64.git/

 There are two tests:

 git://git.zytor.com/users/hpa/test16/test16.git, build it, and run
 ./run16 test/hello.elf
 http://www.zytor.com/~hpa/ldttest.c

 The former will exercise the irq_return_ldt path, but not the fault
 path; the latter will exercise the fault path, but doesn't actually use
 a 16-bit segment.

 Under the 3.14 stock kernel, the former should die with SIGBUS and the
 latter should pass.


 Current status of the above code: if I remove the randomization in
 espfix_64.c then the first test passes; the second generally crashes the
 machine.  With the randomization there, both generally crash the machine.

 All my testing so far has been under KVM or Qemu, so there is always the
 possibility that I'm chasing a KVM/Qemu bug, but I suspect it is
 something simpler than that.

I'm compiling your branch.  In the mean time, two possibly stupid questions:

What's the assembly code in the double-fault entry for?

Have you tried hbreak in qemu?  I've had better luck with hbreak than
regular break in the past.

--Andy
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-28 Thread Andrew Lutomirski
On Mon, Apr 28, 2014 at 5:02 PM, Andrew Lutomirski aml...@gmail.com wrote:
 On Mon, Apr 28, 2014 at 4:08 PM, H. Peter Anvin h...@linux.intel.com wrote:
 On 04/28/2014 04:05 PM, H. Peter Anvin wrote:

 So I tried writing this bit up, but it fails in some rather spectacular
 ways.  Furthermore, I have been unable to debug it under Qemu, because
 breakpoints don't work right (common Qemu problem, sadly.)

 The kernel code is at:

 https://git.kernel.org/cgit/linux/kernel/git/hpa/espfix64.git/

 There are two tests:

 git://git.zytor.com/users/hpa/test16/test16.git, build it, and run
 ./run16 test/hello.elf
 http://www.zytor.com/~hpa/ldttest.c

 The former will exercise the irq_return_ldt path, but not the fault
 path; the latter will exercise the fault path, but doesn't actually use
 a 16-bit segment.

 Under the 3.14 stock kernel, the former should die with SIGBUS and the
 latter should pass.


 Current status of the above code: if I remove the randomization in
 espfix_64.c then the first test passes; the second generally crashes the
 machine.  With the randomization there, both generally crash the machine.

 All my testing so far has been under KVM or Qemu, so there is always the
 possibility that I'm chasing a KVM/Qemu bug, but I suspect it is
 something simpler than that.

 I'm compiling your branch.  In the mean time, two possibly stupid questions:

 What's the assembly code in the double-fault entry for?

 Have you tried hbreak in qemu?  I've had better luck with hbreak than
 regular break in the past.


ldttest segfaults on 3.13 and 3.14 for me.  It reboots (triple fault?)
on your branch.  It even said this:

qemu-system-x86_64: 9pfs:virtfs_reset: One or more uncluncked fids
found during reset

I have no idea what an uncluncked fd is :)

hello.elf fails to sigbus.  weird.  gdb says:

1: x/i $pc
= 0x8170559c irq_return_ldt+90:
jmp0x81705537 irq_return_iret
(gdb) si
signal handler called
1: x/i $pc
= 0x81705537 irq_return_iret:iretq
(gdb) si
Cannot access memory at address 0xf000f
(gdb) info registers
rax0xffe4000f1000-7881234923384832
rbx0x10001068719476752
rcx0xffe4f558f000-7611541041909760
rdx0x805d000134598656
rsi0x10217ffe3283772784279523
rdi0xf000764424509447
rbp0xf000f0xf000f
rsp0xf000f0xf000f
r8 0x00
r9 0x00
r100x00
r110x00
r120x00
r130x00
r140x00
r150x00
rip0x00x0 irq_stack_union
eflags 0x0[ ]
cs 0x00
ss 0x37f895
ds 0x00
es 0x00
fs 0x00
---Type return to continue, or q return to quit---
gs 0x00

I got this with 'hbreak irq_return_ldt' using 'target remote :1234'
and virtme-run --console --kimg
~/apps/linux-devel/arch/x86/boot/bzImage --qemu-opts -s

This set of registers looks thoroughly bogus.  I don't trust it.  I'm
now stuck -- single-stepping stays exactly where it started.
Something is rather screwed up here.  Telling gdb to continue causes
gdb to explode and 'Hello, Afterworld!' to be displayed.

I was not able to get a breakpoint on __do_double_fault to hit.

FWIW, I think that gdb is known to have issues debugging a guest that
switches bitness.

--Andy
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-24 Thread Andrew Lutomirski
On Thu, Apr 24, 2014 at 3:37 PM, H. Peter Anvin  wrote:
> On 04/24/2014 03:31 PM, Andrew Lutomirski wrote:
>>
>> I was imagining just randomizing a couple of high bits so the whole
>> espfix area moves as a unit.
>>
>>> We could XOR with a random constant with no penalty at all.  Only
>>> problem is that this happens early, so the entropy system is not yet
>>> available.  Fine if we have RDRAND, but...
>>
>> How many people have SMAP and not RDRAND?  I think this is a complete
>> nonissue for non-SMAP systems.
>>
>
> Most likely none, unless some "clever" virtualizer turns off RDRAND out
> of spite.
>
>>>> Peter, is this idea completely nuts?  The only exceptions that can
>>>> happen there are NMI, MCE, #DB, #SS, and #GP.  The first four use IST,
>>>> so they won't double-fault.
>>>
>>> It is completely nuts, but sometimes completely nuts is actually useful.
>>>  It is more complexity, to be sure, but it doesn't seem completely out
>>> of the realm of reason, and avoids having to unwind the ministack except
>>> in the normally-fatal #DF handler.  #DFs are documented as not
>>> recoverable, but we might be able to do something here.
>>>
>>> The only real disadvantage I see is the need for more bookkeeping
>>> metadata.  Basically the bitmask in espfix_64.c now needs to turn into
>>> an array, plus we need a second percpu variable.  Given that if
>>> CONFIG_NR_CPUS=8192 the array has 128 entries I think we can survive that.
>>
>> Doing something in #DF needs percpu data?  What am I missing?
>
> You need the second percpu variable in the espfix setup code so you have
> both the write address and the target rsp (read address).
>

Duh. :)

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-24 Thread Andrew Lutomirski
On Thu, Apr 24, 2014 at 3:24 PM, H. Peter Anvin  wrote:
> On 04/23/2014 09:53 PM, Andrew Lutomirski wrote:
>>>
>>> - The user can put arbitrary data in registers before returning to the
>>> LDT in order to get it saved at a known address accessible from the
>>> kernel.  With SMAP and KASLR this might otherwise be difficult.
>>
>> For one thing, this only matters on Haswell.  Otherwise the user can
>> put arbitrary data in userspace.
>>
>> On Haswell, the HPET fixmap is currently a much simpler vector that
>> can do much the same thing, as long as you're willing to wait for the
>> HPET counter to contain some particular value.  I have patches that
>> will fix that as a side effect.
>>
>> Would it pay to randomize the location of the espfix area?  Another
>> somewhat silly idea is to add some random offset to the CPU number mod
>> NR_CPUS so that at attacker won't know which ministack is which.
>
> Since we store the espfix stack location explicitly, as long as the
> scrambling happens in the initialization code that's fine.  However, we
> don't want to reduce locality lest we massively blow up the memory
> requirements.

I was imagining just randomizing a couple of high bits so the whole
espfix area moves as a unit.

>
> We could XOR with a random constant with no penalty at all.  Only
> problem is that this happens early, so the entropy system is not yet
> available.  Fine if we have RDRAND, but...

How many people have SMAP and not RDRAND?  I think this is a complete
nonissue for non-SMAP systems.

>> Peter, is this idea completely nuts?  The only exceptions that can
>> happen there are NMI, MCE, #DB, #SS, and #GP.  The first four use IST,
>> so they won't double-fault.
>
> It is completely nuts, but sometimes completely nuts is actually useful.
>  It is more complexity, to be sure, but it doesn't seem completely out
> of the realm of reason, and avoids having to unwind the ministack except
> in the normally-fatal #DF handler.  #DFs are documented as not
> recoverable, but we might be able to do something here.
>
> The only real disadvantage I see is the need for more bookkeeping
> metadata.  Basically the bitmask in espfix_64.c now needs to turn into
> an array, plus we need a second percpu variable.  Given that if
> CONFIG_NR_CPUS=8192 the array has 128 entries I think we can survive that.

Doing something in #DF needs percpu data?  What am I missing?

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-24 Thread Andrew Lutomirski
On Thu, Apr 24, 2014 at 3:24 PM, H. Peter Anvin h...@linux.intel.com wrote:
 On 04/23/2014 09:53 PM, Andrew Lutomirski wrote:

 - The user can put arbitrary data in registers before returning to the
 LDT in order to get it saved at a known address accessible from the
 kernel.  With SMAP and KASLR this might otherwise be difficult.

 For one thing, this only matters on Haswell.  Otherwise the user can
 put arbitrary data in userspace.

 On Haswell, the HPET fixmap is currently a much simpler vector that
 can do much the same thing, as long as you're willing to wait for the
 HPET counter to contain some particular value.  I have patches that
 will fix that as a side effect.

 Would it pay to randomize the location of the espfix area?  Another
 somewhat silly idea is to add some random offset to the CPU number mod
 NR_CPUS so that at attacker won't know which ministack is which.

 Since we store the espfix stack location explicitly, as long as the
 scrambling happens in the initialization code that's fine.  However, we
 don't want to reduce locality lest we massively blow up the memory
 requirements.

I was imagining just randomizing a couple of high bits so the whole
espfix area moves as a unit.


 We could XOR with a random constant with no penalty at all.  Only
 problem is that this happens early, so the entropy system is not yet
 available.  Fine if we have RDRAND, but...

How many people have SMAP and not RDRAND?  I think this is a complete
nonissue for non-SMAP systems.

 Peter, is this idea completely nuts?  The only exceptions that can
 happen there are NMI, MCE, #DB, #SS, and #GP.  The first four use IST,
 so they won't double-fault.

 It is completely nuts, but sometimes completely nuts is actually useful.
  It is more complexity, to be sure, but it doesn't seem completely out
 of the realm of reason, and avoids having to unwind the ministack except
 in the normally-fatal #DF handler.  #DFs are documented as not
 recoverable, but we might be able to do something here.

 The only real disadvantage I see is the need for more bookkeeping
 metadata.  Basically the bitmask in espfix_64.c now needs to turn into
 an array, plus we need a second percpu variable.  Given that if
 CONFIG_NR_CPUS=8192 the array has 128 entries I think we can survive that.

Doing something in #DF needs percpu data?  What am I missing?

--Andy
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-24 Thread Andrew Lutomirski
On Thu, Apr 24, 2014 at 3:37 PM, H. Peter Anvin h...@zytor.com wrote:
 On 04/24/2014 03:31 PM, Andrew Lutomirski wrote:

 I was imagining just randomizing a couple of high bits so the whole
 espfix area moves as a unit.

 We could XOR with a random constant with no penalty at all.  Only
 problem is that this happens early, so the entropy system is not yet
 available.  Fine if we have RDRAND, but...

 How many people have SMAP and not RDRAND?  I think this is a complete
 nonissue for non-SMAP systems.


 Most likely none, unless some clever virtualizer turns off RDRAND out
 of spite.

 Peter, is this idea completely nuts?  The only exceptions that can
 happen there are NMI, MCE, #DB, #SS, and #GP.  The first four use IST,
 so they won't double-fault.

 It is completely nuts, but sometimes completely nuts is actually useful.
  It is more complexity, to be sure, but it doesn't seem completely out
 of the realm of reason, and avoids having to unwind the ministack except
 in the normally-fatal #DF handler.  #DFs are documented as not
 recoverable, but we might be able to do something here.

 The only real disadvantage I see is the need for more bookkeeping
 metadata.  Basically the bitmask in espfix_64.c now needs to turn into
 an array, plus we need a second percpu variable.  Given that if
 CONFIG_NR_CPUS=8192 the array has 128 entries I think we can survive that.

 Doing something in #DF needs percpu data?  What am I missing?

 You need the second percpu variable in the espfix setup code so you have
 both the write address and the target rsp (read address).


Duh. :)

--Andy
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-23 Thread Andrew Lutomirski
On Wed, Apr 23, 2014 at 9:13 PM, comex  wrote:
> On Mon, Apr 21, 2014 at 6:47 PM, H. Peter Anvin  wrote:
>> This is a prototype of espfix for the 64-bit kernel.  espfix is a
>> workaround for the architectural definition of IRET, which fails to
>> restore bits [31:16] of %esp when returning to a 16-bit stack
>> segment.  We have a workaround for the 32-bit kernel, but that
>> implementation doesn't work for 64 bits.
>
> Hi,
>
> A comment: The main purpose of espfix is to prevent attackers from
> learning sensitive addresses, right?  But as far as I can tell, this
> mini-stack becomes itself somewhat sensitive:
>
> - The user can put arbitrary data in registers before returning to the
> LDT in order to get it saved at a known address accessible from the
> kernel.  With SMAP and KASLR this might otherwise be difficult.

For one thing, this only matters on Haswell.  Otherwise the user can
put arbitrary data in userspace.

On Haswell, the HPET fixmap is currently a much simpler vector that
can do much the same thing, as long as you're willing to wait for the
HPET counter to contain some particular value.  I have patches that
will fix that as a side effect.

Would it pay to randomize the location of the espfix area?  Another
somewhat silly idea is to add some random offset to the CPU number mod
NR_CPUS so that at attacker won't know which ministack is which.

> - If the iret faults, kernel addresses will get stored there (and not
> cleared).  If a vulnerability could return data from an arbitrary
> specified address to the user, this would be harmful.

Can this be fixed by clearing the ministack in bad_iret?  There will
still be a window in which the kernel address is in there, but it'll
be short.

>
> I guess with the current KASLR implementation you could get the same
> effects via brute force anyway, by filling up and browsing memory,
> respectively, but ideally there wouldn't be any virtual addresses
> guaranteed not to fault.
>
> - If a vulnerability allowed overwriting data at an arbitrary
> specified address, the exception frame could get overwritten at
> exactly the right moment between the copy and iret (or right after the
> iret to mess up fixup_exception)?  You probably know better than I
> whether or not caches prevent this from actually being possible.

To attack this, you'd change the saved CS value.  I don't think caches
would make a difference.

This particular vector hurts: you can safely keep trying until it works.

This just gave me an evil idea: what if we make the whole espfix area
read-only?  This has some weird effects.  To switch to the espfix
stack, you have to write to an alias.  That's a little strange but
harmless and barely complicates the implementation.  If the iret
faults, though, I think the result will be a #DF.  This may actually
be a good thing: if the #DF handler detects that the cause was a bad
espfix iret, it could just return directly to bad_iret or send the
signal itself the same way that do_stack_segment does.  This could
even be written in C :)

Peter, is this idea completely nuts?  The only exceptions that can
happen there are NMI, MCE, #DB, #SS, and #GP.  The first four use IST,
so they won't double-fault.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-23 Thread Andrew Lutomirski
On Wed, Apr 23, 2014 at 10:28 AM, H. Peter Anvin  wrote:
> On 04/23/2014 10:25 AM, Andrew Lutomirski wrote:
>> On Wed, Apr 23, 2014 at 10:16 AM, H. Peter Anvin  wrote:
>>> On 04/23/2014 10:08 AM, Andrew Lutomirski wrote:
>>>>
>>>> The only way I can see to trigger the race is with sigreturn, but it's
>>>> still there.  Sigh.
>>>
>>> I don't see why sigreturn needs to be involved... all you need is
>>> modify_ldt() on one CPU while the other is in the middle of an IRET
>>> return.  Small window, so hard to hit, but still.
>>
>> If you set the flag as soon as anyone calls modify_ldt, before any
>> descriptor is installed, then I don't think this can happen.  But
>> there's still sigreturn, and I don't think this is worth all the
>> complexity to save a single branch on #GP.
>>
>
> Who cares?  Since we only need to enter the fixup path for LDT
> selectors, anything that is dependent on having called modify_ldt() is
> already redundant.

But you still have to test this, and folding it into the existing
check for thread flags would eliminate that.  Still, I think this
would not be worth it, even if it were correct.

>
> In some ways that is the saving grace.  SS being an LDT selector is
> fortunately a rare case.
>
>> I do mean intra-kernel.  And yes, this has nothing to do with espfix,
>> but it would make write_msr_safe fail more quickly :)
>
> And, pray tell, how important is that?

Not very.

Page faults may be a different story for some workloads, particularly
if they are IO-heavy.  Returning to preempted kernel threads may also
matter.

For my particular workload, returns from rescheduling interrupts
delivered to idle cpus probably also matters, but the fact that those
interrupts are happening at all is a bug that tglx is working on.

--Andy

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-23 Thread Andrew Lutomirski
On Wed, Apr 23, 2014 at 10:16 AM, H. Peter Anvin  wrote:
> On 04/23/2014 10:08 AM, Andrew Lutomirski wrote:
>>
>> The only way I can see to trigger the race is with sigreturn, but it's
>> still there.  Sigh.
>>
>
> I don't see why sigreturn needs to be involved... all you need is
> modify_ldt() on one CPU while the other is in the middle of an IRET
> return.  Small window, so hard to hit, but still.
>

If you set the flag as soon as anyone calls modify_ldt, before any
descriptor is installed, then I don't think this can happen.  But
there's still sigreturn, and I don't think this is worth all the
complexity to save a single branch on #GP.

>> 2. I've often pondered changing the way we return *to* CPL 0 to bypass
>> iret entirely.  It could be something like:
>>
>> SS
>> RSP
>> EFLAGS
>> CS
>> RIP
>>
>> push 16($rsp)
>> popfq [does this need to force rex.w somehow?]
>> ret $64
>
> When you say return to CPL 0 you mean intra-kernel return?  That isn't
> really the problem here, though.  I think this will also break the
> kernel debugger since it will have the wrong behavior for TF and RF.

I do mean intra-kernel.  And yes, this has nothing to do with espfix,
but it would make write_msr_safe fail more quickly :)

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-23 Thread Andrew Lutomirski
On Wed, Apr 23, 2014 at 8:53 AM, H. Peter Anvin  wrote:
> On 04/23/2014 02:54 AM, One Thousand Gnomes wrote:
>>> Ideally the tests should be doable such that on a normal machine the
>>> tests can be overlapped with the other things we have to do on that
>>> path.  The exit branch will be strongly predicted in the negative
>>> direction, so it shouldn't be a significant problem.
>>>
>>> Again, this is not the case in the current prototype.
>>
>> Or you make sure that you switch to those code paths only after software
>> has executed syscalls that make it possible it will use a 16bit ss.
>>
>
> Which, again, would introduce a race, I believe, at least if we have an
> LDT at all (and since we only enter these code paths for LDT descriptors
> in the first place, it is equivalent to the current code minus the filters.)

The only way I can see to trigger the race is with sigreturn, but it's
still there.  Sigh.

Here are two semi-related things:

1. The Intel manual's description of iretq does seems like it forgot
to mention that iret restores the stack pointer in anything except
vm86 mode.  Fortunately, the AMD manual seems to thing that, when
returning *from* 64-bit mode, RSP is always restored, which I think is
necessary for this patch to work correctly.

2. I've often pondered changing the way we return *to* CPL 0 to bypass
iret entirely.  It could be something like:

SS
RSP
EFLAGS
CS
RIP

push 16($rsp)
popfq [does this need to force rex.w somehow?]
ret $64

This may break backtraces if cfi isn't being used and we get an NMI
just before the popfq.  I'm not quite sure how that works.

I haven't benchmarked this at all, but the only slow part should be
the popfq, and I doubt it's anywhere near as slow as iret.

>
>> The other question I have is - is there any reason we can't fix up the
>> IRET to do a 32bit return into a vsyscall type userspace page which then
>> does a long jump or retf to the right place ?
>
> I did a writeup on this a while ago.  It does have the problem that you
> need additional memory in userspace, which is per-thread and in the
> right region of userspace; this pretty much means you have to muck about
> with the user space stack when user space is running in weird modes.
> This gets complex very quickly and does have some "footprint".
> Furthermore, on some CPUs (not including any recent Intel CPUs) there is
> still a way to leak bits [63:32].  I believe the in-kernel solution is
> actually simpler.
>

There's also no real guarantee that user code won't unmap the vdso.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-23 Thread Andrew Lutomirski
On Wed, Apr 23, 2014 at 8:53 AM, H. Peter Anvin h...@zytor.com wrote:
 On 04/23/2014 02:54 AM, One Thousand Gnomes wrote:
 Ideally the tests should be doable such that on a normal machine the
 tests can be overlapped with the other things we have to do on that
 path.  The exit branch will be strongly predicted in the negative
 direction, so it shouldn't be a significant problem.

 Again, this is not the case in the current prototype.

 Or you make sure that you switch to those code paths only after software
 has executed syscalls that make it possible it will use a 16bit ss.


 Which, again, would introduce a race, I believe, at least if we have an
 LDT at all (and since we only enter these code paths for LDT descriptors
 in the first place, it is equivalent to the current code minus the filters.)

The only way I can see to trigger the race is with sigreturn, but it's
still there.  Sigh.

Here are two semi-related things:

1. The Intel manual's description of iretq does seems like it forgot
to mention that iret restores the stack pointer in anything except
vm86 mode.  Fortunately, the AMD manual seems to thing that, when
returning *from* 64-bit mode, RSP is always restored, which I think is
necessary for this patch to work correctly.

2. I've often pondered changing the way we return *to* CPL 0 to bypass
iret entirely.  It could be something like:

SS
RSP
EFLAGS
CS
RIP

push 16($rsp)
popfq [does this need to force rex.w somehow?]
ret $64

This may break backtraces if cfi isn't being used and we get an NMI
just before the popfq.  I'm not quite sure how that works.

I haven't benchmarked this at all, but the only slow part should be
the popfq, and I doubt it's anywhere near as slow as iret.


 The other question I have is - is there any reason we can't fix up the
 IRET to do a 32bit return into a vsyscall type userspace page which then
 does a long jump or retf to the right place ?

 I did a writeup on this a while ago.  It does have the problem that you
 need additional memory in userspace, which is per-thread and in the
 right region of userspace; this pretty much means you have to muck about
 with the user space stack when user space is running in weird modes.
 This gets complex very quickly and does have some footprint.
 Furthermore, on some CPUs (not including any recent Intel CPUs) there is
 still a way to leak bits [63:32].  I believe the in-kernel solution is
 actually simpler.


There's also no real guarantee that user code won't unmap the vdso.

--Andy
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-23 Thread Andrew Lutomirski
On Wed, Apr 23, 2014 at 10:16 AM, H. Peter Anvin h...@zytor.com wrote:
 On 04/23/2014 10:08 AM, Andrew Lutomirski wrote:

 The only way I can see to trigger the race is with sigreturn, but it's
 still there.  Sigh.


 I don't see why sigreturn needs to be involved... all you need is
 modify_ldt() on one CPU while the other is in the middle of an IRET
 return.  Small window, so hard to hit, but still.


If you set the flag as soon as anyone calls modify_ldt, before any
descriptor is installed, then I don't think this can happen.  But
there's still sigreturn, and I don't think this is worth all the
complexity to save a single branch on #GP.

 2. I've often pondered changing the way we return *to* CPL 0 to bypass
 iret entirely.  It could be something like:

 SS
 RSP
 EFLAGS
 CS
 RIP

 push 16($rsp)
 popfq [does this need to force rex.w somehow?]
 ret $64

 When you say return to CPL 0 you mean intra-kernel return?  That isn't
 really the problem here, though.  I think this will also break the
 kernel debugger since it will have the wrong behavior for TF and RF.

I do mean intra-kernel.  And yes, this has nothing to do with espfix,
but it would make write_msr_safe fail more quickly :)

--Andy
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-23 Thread Andrew Lutomirski
On Wed, Apr 23, 2014 at 10:28 AM, H. Peter Anvin h...@zytor.com wrote:
 On 04/23/2014 10:25 AM, Andrew Lutomirski wrote:
 On Wed, Apr 23, 2014 at 10:16 AM, H. Peter Anvin h...@zytor.com wrote:
 On 04/23/2014 10:08 AM, Andrew Lutomirski wrote:

 The only way I can see to trigger the race is with sigreturn, but it's
 still there.  Sigh.

 I don't see why sigreturn needs to be involved... all you need is
 modify_ldt() on one CPU while the other is in the middle of an IRET
 return.  Small window, so hard to hit, but still.

 If you set the flag as soon as anyone calls modify_ldt, before any
 descriptor is installed, then I don't think this can happen.  But
 there's still sigreturn, and I don't think this is worth all the
 complexity to save a single branch on #GP.


 Who cares?  Since we only need to enter the fixup path for LDT
 selectors, anything that is dependent on having called modify_ldt() is
 already redundant.

But you still have to test this, and folding it into the existing
check for thread flags would eliminate that.  Still, I think this
would not be worth it, even if it were correct.


 In some ways that is the saving grace.  SS being an LDT selector is
 fortunately a rare case.

 I do mean intra-kernel.  And yes, this has nothing to do with espfix,
 but it would make write_msr_safe fail more quickly :)

 And, pray tell, how important is that?

Not very.

Page faults may be a different story for some workloads, particularly
if they are IO-heavy.  Returning to preempted kernel threads may also
matter.

For my particular workload, returns from rescheduling interrupts
delivered to idle cpus probably also matters, but the fact that those
interrupts are happening at all is a bug that tglx is working on.

--Andy

--Andy
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-23 Thread Andrew Lutomirski
On Wed, Apr 23, 2014 at 9:13 PM, comex com...@gmail.com wrote:
 On Mon, Apr 21, 2014 at 6:47 PM, H. Peter Anvin h...@linux.intel.com wrote:
 This is a prototype of espfix for the 64-bit kernel.  espfix is a
 workaround for the architectural definition of IRET, which fails to
 restore bits [31:16] of %esp when returning to a 16-bit stack
 segment.  We have a workaround for the 32-bit kernel, but that
 implementation doesn't work for 64 bits.

 Hi,

 A comment: The main purpose of espfix is to prevent attackers from
 learning sensitive addresses, right?  But as far as I can tell, this
 mini-stack becomes itself somewhat sensitive:

 - The user can put arbitrary data in registers before returning to the
 LDT in order to get it saved at a known address accessible from the
 kernel.  With SMAP and KASLR this might otherwise be difficult.

For one thing, this only matters on Haswell.  Otherwise the user can
put arbitrary data in userspace.

On Haswell, the HPET fixmap is currently a much simpler vector that
can do much the same thing, as long as you're willing to wait for the
HPET counter to contain some particular value.  I have patches that
will fix that as a side effect.

Would it pay to randomize the location of the espfix area?  Another
somewhat silly idea is to add some random offset to the CPU number mod
NR_CPUS so that at attacker won't know which ministack is which.

 - If the iret faults, kernel addresses will get stored there (and not
 cleared).  If a vulnerability could return data from an arbitrary
 specified address to the user, this would be harmful.

Can this be fixed by clearing the ministack in bad_iret?  There will
still be a window in which the kernel address is in there, but it'll
be short.


 I guess with the current KASLR implementation you could get the same
 effects via brute force anyway, by filling up and browsing memory,
 respectively, but ideally there wouldn't be any virtual addresses
 guaranteed not to fault.

 - If a vulnerability allowed overwriting data at an arbitrary
 specified address, the exception frame could get overwritten at
 exactly the right moment between the copy and iret (or right after the
 iret to mess up fixup_exception)?  You probably know better than I
 whether or not caches prevent this from actually being possible.

To attack this, you'd change the saved CS value.  I don't think caches
would make a difference.

This particular vector hurts: you can safely keep trying until it works.

This just gave me an evil idea: what if we make the whole espfix area
read-only?  This has some weird effects.  To switch to the espfix
stack, you have to write to an alias.  That's a little strange but
harmless and barely complicates the implementation.  If the iret
faults, though, I think the result will be a #DF.  This may actually
be a good thing: if the #DF handler detects that the cause was a bad
espfix iret, it could just return directly to bad_iret or send the
signal itself the same way that do_stack_segment does.  This could
even be written in C :)

Peter, is this idea completely nuts?  The only exceptions that can
happen there are NMI, MCE, #DB, #SS, and #GP.  The first four use IST,
so they won't double-fault.

--Andy
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-22 Thread Andrew Lutomirski
On Tue, Apr 22, 2014 at 6:17 PM, H. Peter Anvin  wrote:
> Another spin of the prototype.  This one avoids the espfix for anything
> but #GP, and avoids save/restore/saving registers... one can wonder,
> though, how much that actually matters in practice.
>
> It still does redundant SWAPGS on the slow path.  I'm not sure I
> personally care enough to optimize that, as it means some fairly
> significant restructuring of some of the code paths.  Some of that
> restructuring might actually be beneficial, but still...
>

What's the to_dmesg thing for?

It looks sane, although I haven't checked the detailed register manipulation.

Users of big systems may complain when every single CPU lines up for
that mutex.  Maybe no one cares.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-22 Thread Andrew Lutomirski
On Tue, Apr 22, 2014 at 10:29 AM, H. Peter Anvin  wrote:
> On 04/22/2014 10:19 AM, Linus Torvalds wrote:
>> On Tue, Apr 22, 2014 at 10:11 AM, Andrew Lutomirski  wrote:
>>>
>>>>
>>>> Anyway, if done correctly, this whole espfix should be totally free
>>>> for normal processes, since it should only trigger if SS is a LDT
>>>> entry (bit #2 set in the segment descriptor). So the normal fast-path
>>>> should just have a simple test for that.
>>>
>>> How?  Doesn't something still need to check whether SS is funny before
>>> doing iret?
>>
>> Just test bit #2. Don't do anything else if it's clear, because you
>> should be done. You don't need to do anything special if it's clear,
>> because I don't *think* we have any 16-bit data segments in the GDT on
>> x86-64.
>>
>
> And we don't (neither do we on i386, and we depend on that invariance.)
>
> Hence:
>
>  irq_return:
> +   /*
> +* Are we returning to the LDT?  Note: in 64-bit mode
> +* SS:RSP on the exception stack is always valid.
> +*/
> +   testb $4,(SS-RIP)(%rsp)
> +   jnz irq_return_ldt
> +
> +irq_return_iret:
> INTERRUPT_RETURN
> -   _ASM_EXTABLE(irq_return, bad_iret)
> +   _ASM_EXTABLE(irq_return_iret, bad_iret)
>
>
> That is the whole impact of the IRET path.
>
> If using IST for #GP won't cause trouble (ISTs don't nest, so we need to
> make sure there is absolutely no way we could end up nested) then the
> rest of the fixup code can go away and we kill the common path
> exception-handling overhead; the only new overhead is the IST
> indirection for #GP, which isn't a performance-critical fault (good
> thing, because untangling #GP faults is a major effort.)

I'd be a bit nervous about read_msr_safe and friends.  Also, what
happens if userspace triggers a #GP and the signal stack setup causes
a page fault?

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-22 Thread Andrew Lutomirski
On Tue, Apr 22, 2014 at 10:26 AM, Borislav Petkov  wrote:
> On Tue, Apr 22, 2014 at 10:11:27AM -0700, H. Peter Anvin wrote:
>> The fastpath interference is:
>>
>> 1. Testing for an LDT SS selector before IRET.  This is actually easier
>> than on 32 bits, because on 64 bits the SS:RSP on the stack is always valid.
>>
>> 2. Testing for an RSP inside the espfix region on exception entry, so we
>> can switch back the stack.  This has to be done very early in the
>> exception entry since the espfix stack is so small.  If NMI and #MC
>> didn't use IST it wouldn't work at all (but neither would SYSCALL).
>>
>> #2 currently saves/restores %rdi, which is also saved further down.
>> This is obviously wasteful.
>
> Btw, can we runtime-patch the fastpath interference chunk the moment we
> see a 16-bit segment? I.e., connect it to the write_idt() path, i.e. in
> the hunk you've removed in there and enable the espfix checks there the
> moment we load a 16-bit segment.
>
> I know, I know, this is not so important right now but let me put it out
> there just the same.

Or we could add a TIF_NEEDS_ESPFIX that gets set once you have a
16-bit LDT entry.

But I think it makes sense to nail down everything else first.  I
suspect that a single test-and-branch in the iret path will be lost in
the noise from iret itself.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-22 Thread Andrew Lutomirski
On Tue, Apr 22, 2014 at 10:09 AM, H. Peter Anvin  wrote:
>
> As for Andy's questions:
>
>> What happens on the IST entries?  If I've read your patch right,
>> you're still switching back to the normal stack, which looks
>> questionable.
>
> No, in that case %rsp won't point into the espfix region, and the switch
> will be bypassed.  We will resume back into the espfix region on IRET,
> which is actually required e.g. if we take an NMI in the middle of the
> espfix setup.

Aha.  I misread that.  Would it be worth adding a comment along the lines of

/* Check whether we are running on the espfix stack.  This is
different from checking whether we faulted from the espfix stack,
since an ist exception will have switched us off of the espfix stack.
*/

>
>> Also, if you want to same some register abuse on each exception entry,
>> could you check the saved RIP instead of the current RSP?  I.e. use
>> the test instruction with offset(%rsp)?  Maybe there are multiple
>> possible values, though, and just testing some bits doesn't help.
>
> I don't see how that would work.

It won't, given the above.  I misunderstood what you were checking.

It still seems to me that only #GP needs this special handling.  The
IST entries should never run on the espfix stack, and #MC, #DB, #NM,
and #SS (I missed that one earlier) use IST.

Would it ever make sense to make #GP use IST as well?  That might
allow espfix_adjust_stack to be removed entirely.  I don't know how
much other fiddling would be needed to make that work.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-22 Thread Andrew Lutomirski
On Tue, Apr 22, 2014 at 10:04 AM, Linus Torvalds
 wrote:
> On Tue, Apr 22, 2014 at 10:00 AM, Andrew Lutomirski  wrote:
>>
>> My point is that it may be safe to remove the special espfix fixup
>> from #PF, which is probably the most performance-critical piece here,
>> aside from iret itself.
>
> Actually, even that is unsafe.
>
> Why?
>
> The segment table is shared for a process. So you can have one thread
> doing a load_ldt() that invalidates a segment, while another thread is
> busy taking a page fault. The segment was valid at page fault time and
> is saved on the kernel stack, but by the time the page fault returns,
> it is no longer valid and the iretq will fault.

Let me try that again: I think it should be safe to remove the check
for "did we fault from the espfix stack" from the #PF entry.  You can
certainly have all kinds of weird things happen on return from #PF,
but the overhead that I'm talking about is a test on exception *entry*
to see whether the fault happened on the espfix stack so that we can
switch back to running on a real stack.

If the espfix code and the iret at the end can't cause #PF, then the
check in #PF entry can be removed, I think.

>
> Anyway, if done correctly, this whole espfix should be totally free
> for normal processes, since it should only trigger if SS is a LDT
> entry (bit #2 set in the segment descriptor). So the normal fast-path
> should just have a simple test for that.

How?  Doesn't something still need to check whether SS is funny before
doing iret?

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-22 Thread Andrew Lutomirski
On Tue, Apr 22, 2014 at 9:43 AM, Linus Torvalds
 wrote:
> On Tue, Apr 22, 2014 at 9:33 AM, Andrew Lutomirski  wrote:
>>
>> For the espfix_adjust_stack thing, when can it actually need to do
>> anything?  irqs should be off, I think, and MCE, NMI, and debug
>> exceptions use ist, so that leaves just #SS and #GP, I think.  How can
>> those actually occur?  Is there a way to trigger them deliberately
>> from userspace?  Why do you have three espfix_adjust_stack
>
> Yes, you can very much trigger GP deliberately.
>
> The way to do it is to just make an invalid segment descriptor on the
> iret stack. Or make it a valid 16-bit one, but make it a code segment
> for the stack pointer, or read-only, or whatever. All of which is
> trivial to do with a sigretun system call. But you can do it other
> ways too - enter with a SS that is valid, but do a load_ldt() system
> call that makes it invalid, so that by the time you exit it is no
> longer valid etc.
>
> There's a reason we mark that "iretq" as taking faults with that
>
> _ASM_EXTABLE(native_iret, bad_iret)
>
> and that "bad_iret" creates a GP fault.
>
> And that's a lot of kernel stack. The whole initial GP fault path,
> which goes to the C code that finds the exception table etc. See
> do_general_protection_fault() and fixup_exception().

My point is that it may be safe to remove the special espfix fixup
from #PF, which is probably the most performance-critical piece here,
aside from iret itself.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-22 Thread Andrew Lutomirski
On Tue, Apr 22, 2014 at 9:10 AM, H. Peter Anvin  wrote:
> Honestly, guys... you're painting the bikeshed at the moment.
>
> Initialization is the easiest bit of all this code.  The tricky part is
> *the rest of the code*, i.e. the stuff in entry_64.S.

That's because the initialization code is much simpler, so it's easy
to pick on :)  Sorry.

For the espfix_adjust_stack thing, when can it actually need to do
anything?  irqs should be off, I think, and MCE, NMI, and debug
exceptions use ist, so that leaves just #SS and #GP, I think.  How can
those actually occur?  Is there a way to trigger them deliberately
from userspace?  Why do you have three espfix_adjust_stack

What happens on the IST entries?  If I've read your patch right,
you're still switching back to the normal stack, which looks
questionable.

Also, if you want to same some register abuse on each exception entry,
could you check the saved RIP instead of the current RSP?  I.e. use
the test instruction with offset(%rsp)?  Maybe there are multiple
possible values, though, and just testing some bits doesn't help.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-22 Thread Andrew Lutomirski
On Tue, Apr 22, 2014 at 7:46 AM, Borislav Petkov  wrote:
> On Tue, Apr 22, 2014 at 01:23:12PM +0200, Borislav Petkov wrote:
>> I wonder if it would be workable to use a bit in the espfix PGD to
>> denote that it has been initialized already... I hear, near NX there's
>> some room :-)
>
> Ok, I realized this won't work when I hit send... Oh well.
>
> Anyway, another dumb idea: have we considered making this lazy? I.e.,
> preallocate pages to fit the stack of NR_CPUS after smp init is done but
> not setup the percpu espfix stack. Only do that in espfix_fix_stack the
> first time we land there and haven't been setup yet on this cpu.
>
> This should cover the 1% out there who still use 16-bit segments and the
> rest simply doesn't use it and get to save themselves the PT-walk in
> start_secondary().
>
> Hmmm...

I'm going to try to do the math to see what's actually going on.

Each 4G slice contains 64kB of ministacks, which corresponds to 1024
ministacks.  Virtual addresses are divided up as:

12 bits (0..11): address within page.
9 bits (12..20): identifies the PTE within the level 1 directory
9 bits (21..29): identifies the level 1 directory (pmd?) within the
level 2 directory
9 bits (30..38): identifies the level 2 directory (pud) within the
level 3 directory

Critically, each 1024 CPUs can share the same level 1 directory --
there are just a bunch of copies of the same thing in there.
Similarly, they can share the same level 2 directory, and each slot in
that directory will point to the same level 1 directory.

For the level 3 directory, there is only one globally.  It needs 8
entries per 1024 CPUs.

I imagine there's a scalability problem here, too: it's okay if each
of a very large number of CPUs waits while shared structures are
allocated, but owners of big systems won't like it if they all
serialize on the way out.

So maybe it would make sense to refactor this into two separate
functions.  First, before we start the first non-boot CPU:

static pte_t *slice_pte_tables[NR_CPUS / 1024];
Allocate and initialize them all;

It might even make sense to do this at build time instead of run time.
 I can't imagine that parallelizing this would provide any benefit
unless it were done *very* carefully and there were hundreds of
thousands of CPUs.  At worst, we're wasting 4 bytes per CPU not
present.

Then, for the per-CPU part, have one init-once structure (please tell
me the kernel has one of these) per 64 possible CPUs.  Each CPU will
make sure that its group of 64 cpus is initialized, using the init
once mechanism, and then it will set its percpu variable accordingly.

There are only 64 CPUs per slice, so mutexes may no be so bad here.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-22 Thread Andrew Lutomirski
On Tue, Apr 22, 2014 at 7:46 AM, Borislav Petkov b...@alien8.de wrote:
 On Tue, Apr 22, 2014 at 01:23:12PM +0200, Borislav Petkov wrote:
 I wonder if it would be workable to use a bit in the espfix PGD to
 denote that it has been initialized already... I hear, near NX there's
 some room :-)

 Ok, I realized this won't work when I hit send... Oh well.

 Anyway, another dumb idea: have we considered making this lazy? I.e.,
 preallocate pages to fit the stack of NR_CPUS after smp init is done but
 not setup the percpu espfix stack. Only do that in espfix_fix_stack the
 first time we land there and haven't been setup yet on this cpu.

 This should cover the 1% out there who still use 16-bit segments and the
 rest simply doesn't use it and get to save themselves the PT-walk in
 start_secondary().

 Hmmm...

I'm going to try to do the math to see what's actually going on.

Each 4G slice contains 64kB of ministacks, which corresponds to 1024
ministacks.  Virtual addresses are divided up as:

12 bits (0..11): address within page.
9 bits (12..20): identifies the PTE within the level 1 directory
9 bits (21..29): identifies the level 1 directory (pmd?) within the
level 2 directory
9 bits (30..38): identifies the level 2 directory (pud) within the
level 3 directory

Critically, each 1024 CPUs can share the same level 1 directory --
there are just a bunch of copies of the same thing in there.
Similarly, they can share the same level 2 directory, and each slot in
that directory will point to the same level 1 directory.

For the level 3 directory, there is only one globally.  It needs 8
entries per 1024 CPUs.

I imagine there's a scalability problem here, too: it's okay if each
of a very large number of CPUs waits while shared structures are
allocated, but owners of big systems won't like it if they all
serialize on the way out.

So maybe it would make sense to refactor this into two separate
functions.  First, before we start the first non-boot CPU:

static pte_t *slice_pte_tables[NR_CPUS / 1024];
Allocate and initialize them all;

It might even make sense to do this at build time instead of run time.
 I can't imagine that parallelizing this would provide any benefit
unless it were done *very* carefully and there were hundreds of
thousands of CPUs.  At worst, we're wasting 4 bytes per CPU not
present.

Then, for the per-CPU part, have one init-once structure (please tell
me the kernel has one of these) per 64 possible CPUs.  Each CPU will
make sure that its group of 64 cpus is initialized, using the init
once mechanism, and then it will set its percpu variable accordingly.

There are only 64 CPUs per slice, so mutexes may no be so bad here.

--Andy
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-22 Thread Andrew Lutomirski
On Tue, Apr 22, 2014 at 9:10 AM, H. Peter Anvin h...@zytor.com wrote:
 Honestly, guys... you're painting the bikeshed at the moment.

 Initialization is the easiest bit of all this code.  The tricky part is
 *the rest of the code*, i.e. the stuff in entry_64.S.

That's because the initialization code is much simpler, so it's easy
to pick on :)  Sorry.

For the espfix_adjust_stack thing, when can it actually need to do
anything?  irqs should be off, I think, and MCE, NMI, and debug
exceptions use ist, so that leaves just #SS and #GP, I think.  How can
those actually occur?  Is there a way to trigger them deliberately
from userspace?  Why do you have three espfix_adjust_stack

What happens on the IST entries?  If I've read your patch right,
you're still switching back to the normal stack, which looks
questionable.

Also, if you want to same some register abuse on each exception entry,
could you check the saved RIP instead of the current RSP?  I.e. use
the test instruction with offset(%rsp)?  Maybe there are multiple
possible values, though, and just testing some bits doesn't help.

--Andy
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-22 Thread Andrew Lutomirski
On Tue, Apr 22, 2014 at 9:43 AM, Linus Torvalds
torva...@linux-foundation.org wrote:
 On Tue, Apr 22, 2014 at 9:33 AM, Andrew Lutomirski aml...@gmail.com wrote:

 For the espfix_adjust_stack thing, when can it actually need to do
 anything?  irqs should be off, I think, and MCE, NMI, and debug
 exceptions use ist, so that leaves just #SS and #GP, I think.  How can
 those actually occur?  Is there a way to trigger them deliberately
 from userspace?  Why do you have three espfix_adjust_stack

 Yes, you can very much trigger GP deliberately.

 The way to do it is to just make an invalid segment descriptor on the
 iret stack. Or make it a valid 16-bit one, but make it a code segment
 for the stack pointer, or read-only, or whatever. All of which is
 trivial to do with a sigretun system call. But you can do it other
 ways too - enter with a SS that is valid, but do a load_ldt() system
 call that makes it invalid, so that by the time you exit it is no
 longer valid etc.

 There's a reason we mark that iretq as taking faults with that

 _ASM_EXTABLE(native_iret, bad_iret)

 and that bad_iret creates a GP fault.

 And that's a lot of kernel stack. The whole initial GP fault path,
 which goes to the C code that finds the exception table etc. See
 do_general_protection_fault() and fixup_exception().

My point is that it may be safe to remove the special espfix fixup
from #PF, which is probably the most performance-critical piece here,
aside from iret itself.

--Andy
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-22 Thread Andrew Lutomirski
On Tue, Apr 22, 2014 at 10:04 AM, Linus Torvalds
torva...@linux-foundation.org wrote:
 On Tue, Apr 22, 2014 at 10:00 AM, Andrew Lutomirski aml...@gmail.com wrote:

 My point is that it may be safe to remove the special espfix fixup
 from #PF, which is probably the most performance-critical piece here,
 aside from iret itself.

 Actually, even that is unsafe.

 Why?

 The segment table is shared for a process. So you can have one thread
 doing a load_ldt() that invalidates a segment, while another thread is
 busy taking a page fault. The segment was valid at page fault time and
 is saved on the kernel stack, but by the time the page fault returns,
 it is no longer valid and the iretq will fault.

Let me try that again: I think it should be safe to remove the check
for did we fault from the espfix stack from the #PF entry.  You can
certainly have all kinds of weird things happen on return from #PF,
but the overhead that I'm talking about is a test on exception *entry*
to see whether the fault happened on the espfix stack so that we can
switch back to running on a real stack.

If the espfix code and the iret at the end can't cause #PF, then the
check in #PF entry can be removed, I think.


 Anyway, if done correctly, this whole espfix should be totally free
 for normal processes, since it should only trigger if SS is a LDT
 entry (bit #2 set in the segment descriptor). So the normal fast-path
 should just have a simple test for that.

How?  Doesn't something still need to check whether SS is funny before
doing iret?

--Andy
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-22 Thread Andrew Lutomirski
On Tue, Apr 22, 2014 at 10:09 AM, H. Peter Anvin h...@zytor.com wrote:

 As for Andy's questions:

 What happens on the IST entries?  If I've read your patch right,
 you're still switching back to the normal stack, which looks
 questionable.

 No, in that case %rsp won't point into the espfix region, and the switch
 will be bypassed.  We will resume back into the espfix region on IRET,
 which is actually required e.g. if we take an NMI in the middle of the
 espfix setup.

Aha.  I misread that.  Would it be worth adding a comment along the lines of

/* Check whether we are running on the espfix stack.  This is
different from checking whether we faulted from the espfix stack,
since an ist exception will have switched us off of the espfix stack.
*/


 Also, if you want to same some register abuse on each exception entry,
 could you check the saved RIP instead of the current RSP?  I.e. use
 the test instruction with offset(%rsp)?  Maybe there are multiple
 possible values, though, and just testing some bits doesn't help.

 I don't see how that would work.

It won't, given the above.  I misunderstood what you were checking.

It still seems to me that only #GP needs this special handling.  The
IST entries should never run on the espfix stack, and #MC, #DB, #NM,
and #SS (I missed that one earlier) use IST.

Would it ever make sense to make #GP use IST as well?  That might
allow espfix_adjust_stack to be removed entirely.  I don't know how
much other fiddling would be needed to make that work.

--Andy
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-22 Thread Andrew Lutomirski
On Tue, Apr 22, 2014 at 10:26 AM, Borislav Petkov b...@alien8.de wrote:
 On Tue, Apr 22, 2014 at 10:11:27AM -0700, H. Peter Anvin wrote:
 The fastpath interference is:

 1. Testing for an LDT SS selector before IRET.  This is actually easier
 than on 32 bits, because on 64 bits the SS:RSP on the stack is always valid.

 2. Testing for an RSP inside the espfix region on exception entry, so we
 can switch back the stack.  This has to be done very early in the
 exception entry since the espfix stack is so small.  If NMI and #MC
 didn't use IST it wouldn't work at all (but neither would SYSCALL).

 #2 currently saves/restores %rdi, which is also saved further down.
 This is obviously wasteful.

 Btw, can we runtime-patch the fastpath interference chunk the moment we
 see a 16-bit segment? I.e., connect it to the write_idt() path, i.e. in
 the hunk you've removed in there and enable the espfix checks there the
 moment we load a 16-bit segment.

 I know, I know, this is not so important right now but let me put it out
 there just the same.

Or we could add a TIF_NEEDS_ESPFIX that gets set once you have a
16-bit LDT entry.

But I think it makes sense to nail down everything else first.  I
suspect that a single test-and-branch in the iret path will be lost in
the noise from iret itself.

--Andy
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-22 Thread Andrew Lutomirski
On Tue, Apr 22, 2014 at 10:29 AM, H. Peter Anvin h...@zytor.com wrote:
 On 04/22/2014 10:19 AM, Linus Torvalds wrote:
 On Tue, Apr 22, 2014 at 10:11 AM, Andrew Lutomirski aml...@gmail.com wrote:


 Anyway, if done correctly, this whole espfix should be totally free
 for normal processes, since it should only trigger if SS is a LDT
 entry (bit #2 set in the segment descriptor). So the normal fast-path
 should just have a simple test for that.

 How?  Doesn't something still need to check whether SS is funny before
 doing iret?

 Just test bit #2. Don't do anything else if it's clear, because you
 should be done. You don't need to do anything special if it's clear,
 because I don't *think* we have any 16-bit data segments in the GDT on
 x86-64.


 And we don't (neither do we on i386, and we depend on that invariance.)

 Hence:

  irq_return:
 +   /*
 +* Are we returning to the LDT?  Note: in 64-bit mode
 +* SS:RSP on the exception stack is always valid.
 +*/
 +   testb $4,(SS-RIP)(%rsp)
 +   jnz irq_return_ldt
 +
 +irq_return_iret:
 INTERRUPT_RETURN
 -   _ASM_EXTABLE(irq_return, bad_iret)
 +   _ASM_EXTABLE(irq_return_iret, bad_iret)


 That is the whole impact of the IRET path.

 If using IST for #GP won't cause trouble (ISTs don't nest, so we need to
 make sure there is absolutely no way we could end up nested) then the
 rest of the fixup code can go away and we kill the common path
 exception-handling overhead; the only new overhead is the IST
 indirection for #GP, which isn't a performance-critical fault (good
 thing, because untangling #GP faults is a major effort.)

I'd be a bit nervous about read_msr_safe and friends.  Also, what
happens if userspace triggers a #GP and the signal stack setup causes
a page fault?

--Andy
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-22 Thread Andrew Lutomirski
On Tue, Apr 22, 2014 at 6:17 PM, H. Peter Anvin h...@linux.intel.com wrote:
 Another spin of the prototype.  This one avoids the espfix for anything
 but #GP, and avoids save/restore/saving registers... one can wonder,
 though, how much that actually matters in practice.

 It still does redundant SWAPGS on the slow path.  I'm not sure I
 personally care enough to optimize that, as it means some fairly
 significant restructuring of some of the code paths.  Some of that
 restructuring might actually be beneficial, but still...


What's the to_dmesg thing for?

It looks sane, although I haven't checked the detailed register manipulation.

Users of big systems may complain when every single CPU lines up for
that mutex.  Maybe no one cares.

--Andy
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-21 Thread Andrew Lutomirski
On Mon, Apr 21, 2014 at 6:47 PM, H. Peter Anvin  wrote:
> Race condition (although with x86 being globally ordered, it probably can't 
> actually happen.) The bitmask is probably the way to go.

Does the race matter?  In the worst case you take the lock
unnecessarily.  But yes, the bitmask is easy.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-21 Thread Andrew Lutomirski
On Mon, Apr 21, 2014 at 6:14 PM, H. Peter Anvin  wrote:
> I wanted to avoid the "another cpu made this allocation, now I have to free" 
> crap, but I also didn't want to grab the lock if there was no work needed.

I guess you also want to avoid bouncing all these cachelines around on
boot on bit multicore machines.

I'd advocate using the bitmap approach or simplifying the existing
code.  For example:

+   for (n = 0; n < ESPFIX_PUD_CLONES; n++) {
+   pud = ACCESS_ONCE(pud_p[n]);
+   if (!pud_present(pud))
+   return false;
+   }

I don't see why that needs to be a loop.  How can one clone exist but
not the others?

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-21 Thread Andrew Lutomirski
On Mon, Apr 21, 2014 at 5:53 PM, H. Peter Anvin  wrote:
> Well, if 2^17 CPUs are allocated we might 2K pages allocated.  We could 
> easily do a bitmap here, of course.  NR_CPUS/64 is a small number, and would 
> reduce the code complexity.
>

Even simpler: just get rid of the check entirely.  That is, break out
of the higher level loops once one of them is set (this should be a
big speedup regardless) and don't allocate the page if the first PTE
is already pointing at something.

After all, espfix_already_there is mostly a duplicate of init_espfix_cpu.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-21 Thread Andrew Lutomirski
On Mon, Apr 21, 2014 at 4:29 PM, H. Peter Anvin  wrote:
> On 04/21/2014 04:19 PM, Andrew Lutomirski wrote:
>>
>> Hahaha! :)
>>
>> Some comments:
>>
>> Does returning to 64-bit CS with 16-bit SS not need espfix?
>
> There is no such thing.  With a 64-bit CS, the flags on SS are ignored
> (although you still have to have a non-null SS... the conditions are a
> bit complex.)
>
>> Conversely, does 16-bit CS and 32-bit SS need espfix?
>
> It does not, at least to the best of my knowledge (it is controlled by
> the SS size, not the CS size.)
>
> I'm going to double-check the corner cases just out of healthy paranoia,
> but I'm 98% sure this is correct (and if not, the 32-bit code needs to
> be fixed, too.)
>
>>> @@ -1058,6 +1095,7 @@ bad_iret:
>>>  * So pretend we completed the iret and took the #GPF in user mode.
>>>  *
>>>  * We are now running with the kernel GS after exception recovery.
>>> +* Exception entry will have removed us from the espfix stack.
>>>  * But error_entry expects us to have user GS to match the user %cs,
>>>  * so swap back.
>>>  */
>>
>> What is that referring to?
>
> It means that we have already switched back from the espfix stack to the
> real stack.
>
>>> +   /*
>>> +* Switch from the espfix stack to the proper stack: tricky stuff.
>>> +* On the stack right now is 5 words of exception frame,
>>> +* error code/oldeax, RDI, and the return value, so no additional
>>> +* stack is available.
>>> +*
>>> +* We will always be using the user space GS on entry.
>>> +   */
>>> +ENTRY(espfix_fix_stack)
>>> +   SWAPGS
>>> +   cld
>>> +   movq PER_CPU_VAR(kernel_stack),%rdi
>>> +   subq $8*8,%rdi
>>> +   /* Use the real stack to hold these registers for now */
>>> +   movq %rsi,-8(%rdi)
>>> +   movq %rcx,-16(%rdi)
>>> +   movq %rsp,%rsi
>>> +   movl $8,%ecx
>>> +   rep;movsq
>>> +   leaq -(10*8)(%rdi),%rsp
>>> +   popq %rcx
>>> +   popq %rsi
>>> +   SWAPGS
>>> +   retq
>>>
>>
>> Is it guaranteed that the userspace thread that caused this is dead?
>> If not, do you need to change RIP so that espfix gets invoked again
>> when you return from the exception?
>
> It is not guaranteed to be dead at all.  Why would you need to change
> RIP, though?

Oh.  You're not changing the RSP that you return to.  So this should be okay.

>
>>> +
>>> +void init_espfix_cpu(void)
>>> +{
>>> +   int cpu = smp_processor_id();
>>> +   unsigned long addr;
>>> +   pgd_t pgd, *pgd_p;
>>> +   pud_t pud, *pud_p;
>>> +   pmd_t pmd, *pmd_p;
>>> +   pte_t pte, *pte_p;
>>> +   int n;
>>> +   void *stack_page;
>>> +
>>> +   cpu = smp_processor_id();
>>> +   BUG_ON(cpu >= (8 << 20)/ESPFIX_STACK_SIZE);
>>> +
>>> +   /* We only have to do this once... */
>>> +   if (likely(this_cpu_read(espfix_stack)))
>>> +   return; /* Already initialized */
>>> +
>>> +   addr = espfix_base_addr(cpu);
>>> +
>>> +   /* Did another CPU already set this up? */
>>> +   if (likely(espfix_already_there(addr)))
>>> +   goto done;
>>> +
>>> +   mutex_lock(_init_mutex);
>>> +
>>> +   if (unlikely(espfix_already_there(addr)))
>>> +   goto unlock_done;
>>
>> Wouldn't it be simpler to just have a single static bool to indicate
>> whether espfix is initialized?
>
> No, you would have to allocate memory for every possible CPU, which I
> wanted to avoid in case NR_CPUS >> actual CPUs (I don't know if we have
> already done that for percpu, but we *should* if we haven't yet.)
>
>> Even better: why not separate the percpu init from the pagetable init
>> and just do the pagetable init once from main or even modify_ldt?
>
> It needs to be done once per CPU.  I wanted to do it late enough that
> the page allocator is fully functional, so we don't have to do the ugly
> hacks to call one allocator or another as the percpu initialization code
> does (otherwise it would have made a lot of sense to co-locate with percpu.)

Hmm.  I guess espfix_already_there isn't so bad.  Given that, in the
worst case, I think there are 16 pages allocated, it might make sense
to just track which of those 16 pages have been allocated in some
array.  That whole array would probably be shorter than the test of
espfix_already_there.  Or am I still failing to understand how this
works?

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-21 Thread Andrew Lutomirski
On Mon, Apr 21, 2014 at 3:47 PM, H. Peter Anvin  wrote:
> This is a prototype of espfix for the 64-bit kernel.  espfix is a
> workaround for the architectural definition of IRET, which fails to
> restore bits [31:16] of %esp when returning to a 16-bit stack
> segment.  We have a workaround for the 32-bit kernel, but that
> implementation doesn't work for 64 bits.
>
> The 64-bit implementation works like this:
>
> Set up a ministack for each CPU, which is then mapped 65536 times
> using the page tables.  This implementation uses the second-to-last
> PGD slot for this; with a 64-byte espfix stack this is sufficient for
> 2^18 CPUs (currently we support a max of 2^13 CPUs.)
>
> 64 bytes appear to be sufficient, because NMI and #MC cause a task
> switch.
>
> THIS IS A PROTOTYPE AND IS NOT COMPLETE.  We need to make sure all
> code paths that can interrupt userspace execute this code.
> Fortunately we never need to use the espfix stack for nested faults,
> so one per CPU is guaranteed to be safe.
>
> Furthermore, this code adds unnecessary instructions to the common
> path.  For example, on exception entry we push %rdi, pop %rdi, and
> then save away %rdi.  Ideally we should do this in such a way that we
> avoid unnecessary swapgs, especially on the IRET path (the exception
> path is going to be very rare, and so is less critical.)
>
> Putting this version out there for people to look at/laugh at/play
> with.

Hahaha! :)

Some comments:

Does returning to 64-bit CS with 16-bit SS not need espfix?
Conversely, does 16-bit CS and 32-bit SS need espfix?


> @@ -1058,6 +1095,7 @@ bad_iret:
>  * So pretend we completed the iret and took the #GPF in user mode.
>  *
>  * We are now running with the kernel GS after exception recovery.
> +* Exception entry will have removed us from the espfix stack.
>  * But error_entry expects us to have user GS to match the user %cs,
>  * so swap back.
>  */

What is that referring to?


> +   /*
> +* Switch from the espfix stack to the proper stack: tricky stuff.
> +* On the stack right now is 5 words of exception frame,
> +* error code/oldeax, RDI, and the return value, so no additional
> +* stack is available.
> +*
> +* We will always be using the user space GS on entry.
> +   */
> +ENTRY(espfix_fix_stack)
> +   SWAPGS
> +   cld
> +   movq PER_CPU_VAR(kernel_stack),%rdi
> +   subq $8*8,%rdi
> +   /* Use the real stack to hold these registers for now */
> +   movq %rsi,-8(%rdi)
> +   movq %rcx,-16(%rdi)
> +   movq %rsp,%rsi
> +   movl $8,%ecx
> +   rep;movsq
> +   leaq -(10*8)(%rdi),%rsp
> +   popq %rcx
> +   popq %rsi
> +   SWAPGS
> +   retq
>

Is it guaranteed that the userspace thread that caused this is dead?
If not, do you need to change RIP so that espfix gets invoked again
when you return from the exception?

> +
> +void init_espfix_cpu(void)
> +{
> +   int cpu = smp_processor_id();
> +   unsigned long addr;
> +   pgd_t pgd, *pgd_p;
> +   pud_t pud, *pud_p;
> +   pmd_t pmd, *pmd_p;
> +   pte_t pte, *pte_p;
> +   int n;
> +   void *stack_page;
> +
> +   cpu = smp_processor_id();
> +   BUG_ON(cpu >= (8 << 20)/ESPFIX_STACK_SIZE);
> +
> +   /* We only have to do this once... */
> +   if (likely(this_cpu_read(espfix_stack)))
> +   return; /* Already initialized */
> +
> +   addr = espfix_base_addr(cpu);
> +
> +   /* Did another CPU already set this up? */
> +   if (likely(espfix_already_there(addr)))
> +   goto done;
> +
> +   mutex_lock(_init_mutex);
> +
> +   if (unlikely(espfix_already_there(addr)))
> +   goto unlock_done;

Wouldn't it be simpler to just have a single static bool to indicate
whether espfix is initialized?

Even better: why not separate the percpu init from the pagetable init
and just do the pagetable init once from main or even modify_ldt?

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-21 Thread Andrew Lutomirski
On Mon, Apr 21, 2014 at 3:47 PM, H. Peter Anvin h...@linux.intel.com wrote:
 This is a prototype of espfix for the 64-bit kernel.  espfix is a
 workaround for the architectural definition of IRET, which fails to
 restore bits [31:16] of %esp when returning to a 16-bit stack
 segment.  We have a workaround for the 32-bit kernel, but that
 implementation doesn't work for 64 bits.

 The 64-bit implementation works like this:

 Set up a ministack for each CPU, which is then mapped 65536 times
 using the page tables.  This implementation uses the second-to-last
 PGD slot for this; with a 64-byte espfix stack this is sufficient for
 2^18 CPUs (currently we support a max of 2^13 CPUs.)

 64 bytes appear to be sufficient, because NMI and #MC cause a task
 switch.

 THIS IS A PROTOTYPE AND IS NOT COMPLETE.  We need to make sure all
 code paths that can interrupt userspace execute this code.
 Fortunately we never need to use the espfix stack for nested faults,
 so one per CPU is guaranteed to be safe.

 Furthermore, this code adds unnecessary instructions to the common
 path.  For example, on exception entry we push %rdi, pop %rdi, and
 then save away %rdi.  Ideally we should do this in such a way that we
 avoid unnecessary swapgs, especially on the IRET path (the exception
 path is going to be very rare, and so is less critical.)

 Putting this version out there for people to look at/laugh at/play
 with.

Hahaha! :)

Some comments:

Does returning to 64-bit CS with 16-bit SS not need espfix?
Conversely, does 16-bit CS and 32-bit SS need espfix?


 @@ -1058,6 +1095,7 @@ bad_iret:
  * So pretend we completed the iret and took the #GPF in user mode.
  *
  * We are now running with the kernel GS after exception recovery.
 +* Exception entry will have removed us from the espfix stack.
  * But error_entry expects us to have user GS to match the user %cs,
  * so swap back.
  */

What is that referring to?


 +   /*
 +* Switch from the espfix stack to the proper stack: tricky stuff.
 +* On the stack right now is 5 words of exception frame,
 +* error code/oldeax, RDI, and the return value, so no additional
 +* stack is available.
 +*
 +* We will always be using the user space GS on entry.
 +   */
 +ENTRY(espfix_fix_stack)
 +   SWAPGS
 +   cld
 +   movq PER_CPU_VAR(kernel_stack),%rdi
 +   subq $8*8,%rdi
 +   /* Use the real stack to hold these registers for now */
 +   movq %rsi,-8(%rdi)
 +   movq %rcx,-16(%rdi)
 +   movq %rsp,%rsi
 +   movl $8,%ecx
 +   rep;movsq
 +   leaq -(10*8)(%rdi),%rsp
 +   popq %rcx
 +   popq %rsi
 +   SWAPGS
 +   retq


Is it guaranteed that the userspace thread that caused this is dead?
If not, do you need to change RIP so that espfix gets invoked again
when you return from the exception?

 +
 +void init_espfix_cpu(void)
 +{
 +   int cpu = smp_processor_id();
 +   unsigned long addr;
 +   pgd_t pgd, *pgd_p;
 +   pud_t pud, *pud_p;
 +   pmd_t pmd, *pmd_p;
 +   pte_t pte, *pte_p;
 +   int n;
 +   void *stack_page;
 +
 +   cpu = smp_processor_id();
 +   BUG_ON(cpu = (8  20)/ESPFIX_STACK_SIZE);
 +
 +   /* We only have to do this once... */
 +   if (likely(this_cpu_read(espfix_stack)))
 +   return; /* Already initialized */
 +
 +   addr = espfix_base_addr(cpu);
 +
 +   /* Did another CPU already set this up? */
 +   if (likely(espfix_already_there(addr)))
 +   goto done;
 +
 +   mutex_lock(espfix_init_mutex);
 +
 +   if (unlikely(espfix_already_there(addr)))
 +   goto unlock_done;

Wouldn't it be simpler to just have a single static bool to indicate
whether espfix is initialized?

Even better: why not separate the percpu init from the pagetable init
and just do the pagetable init once from main or even modify_ldt?

--Andy
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-21 Thread Andrew Lutomirski
On Mon, Apr 21, 2014 at 4:29 PM, H. Peter Anvin h...@zytor.com wrote:
 On 04/21/2014 04:19 PM, Andrew Lutomirski wrote:

 Hahaha! :)

 Some comments:

 Does returning to 64-bit CS with 16-bit SS not need espfix?

 There is no such thing.  With a 64-bit CS, the flags on SS are ignored
 (although you still have to have a non-null SS... the conditions are a
 bit complex.)

 Conversely, does 16-bit CS and 32-bit SS need espfix?

 It does not, at least to the best of my knowledge (it is controlled by
 the SS size, not the CS size.)

 I'm going to double-check the corner cases just out of healthy paranoia,
 but I'm 98% sure this is correct (and if not, the 32-bit code needs to
 be fixed, too.)

 @@ -1058,6 +1095,7 @@ bad_iret:
  * So pretend we completed the iret and took the #GPF in user mode.
  *
  * We are now running with the kernel GS after exception recovery.
 +* Exception entry will have removed us from the espfix stack.
  * But error_entry expects us to have user GS to match the user %cs,
  * so swap back.
  */

 What is that referring to?

 It means that we have already switched back from the espfix stack to the
 real stack.

 +   /*
 +* Switch from the espfix stack to the proper stack: tricky stuff.
 +* On the stack right now is 5 words of exception frame,
 +* error code/oldeax, RDI, and the return value, so no additional
 +* stack is available.
 +*
 +* We will always be using the user space GS on entry.
 +   */
 +ENTRY(espfix_fix_stack)
 +   SWAPGS
 +   cld
 +   movq PER_CPU_VAR(kernel_stack),%rdi
 +   subq $8*8,%rdi
 +   /* Use the real stack to hold these registers for now */
 +   movq %rsi,-8(%rdi)
 +   movq %rcx,-16(%rdi)
 +   movq %rsp,%rsi
 +   movl $8,%ecx
 +   rep;movsq
 +   leaq -(10*8)(%rdi),%rsp
 +   popq %rcx
 +   popq %rsi
 +   SWAPGS
 +   retq


 Is it guaranteed that the userspace thread that caused this is dead?
 If not, do you need to change RIP so that espfix gets invoked again
 when you return from the exception?

 It is not guaranteed to be dead at all.  Why would you need to change
 RIP, though?

Oh.  You're not changing the RSP that you return to.  So this should be okay.


 +
 +void init_espfix_cpu(void)
 +{
 +   int cpu = smp_processor_id();
 +   unsigned long addr;
 +   pgd_t pgd, *pgd_p;
 +   pud_t pud, *pud_p;
 +   pmd_t pmd, *pmd_p;
 +   pte_t pte, *pte_p;
 +   int n;
 +   void *stack_page;
 +
 +   cpu = smp_processor_id();
 +   BUG_ON(cpu = (8  20)/ESPFIX_STACK_SIZE);
 +
 +   /* We only have to do this once... */
 +   if (likely(this_cpu_read(espfix_stack)))
 +   return; /* Already initialized */
 +
 +   addr = espfix_base_addr(cpu);
 +
 +   /* Did another CPU already set this up? */
 +   if (likely(espfix_already_there(addr)))
 +   goto done;
 +
 +   mutex_lock(espfix_init_mutex);
 +
 +   if (unlikely(espfix_already_there(addr)))
 +   goto unlock_done;

 Wouldn't it be simpler to just have a single static bool to indicate
 whether espfix is initialized?

 No, you would have to allocate memory for every possible CPU, which I
 wanted to avoid in case NR_CPUS  actual CPUs (I don't know if we have
 already done that for percpu, but we *should* if we haven't yet.)

 Even better: why not separate the percpu init from the pagetable init
 and just do the pagetable init once from main or even modify_ldt?

 It needs to be done once per CPU.  I wanted to do it late enough that
 the page allocator is fully functional, so we don't have to do the ugly
 hacks to call one allocator or another as the percpu initialization code
 does (otherwise it would have made a lot of sense to co-locate with percpu.)

Hmm.  I guess espfix_already_there isn't so bad.  Given that, in the
worst case, I think there are 16 pages allocated, it might make sense
to just track which of those 16 pages have been allocated in some
array.  That whole array would probably be shorter than the test of
espfix_already_there.  Or am I still failing to understand how this
works?

--Andy
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-21 Thread Andrew Lutomirski
On Mon, Apr 21, 2014 at 5:53 PM, H. Peter Anvin h...@zytor.com wrote:
 Well, if 2^17 CPUs are allocated we might 2K pages allocated.  We could 
 easily do a bitmap here, of course.  NR_CPUS/64 is a small number, and would 
 reduce the code complexity.


Even simpler: just get rid of the check entirely.  That is, break out
of the higher level loops once one of them is set (this should be a
big speedup regardless) and don't allocate the page if the first PTE
is already pointing at something.

After all, espfix_already_there is mostly a duplicate of init_espfix_cpu.

--Andy
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-21 Thread Andrew Lutomirski
On Mon, Apr 21, 2014 at 6:14 PM, H. Peter Anvin h...@zytor.com wrote:
 I wanted to avoid the another cpu made this allocation, now I have to free 
 crap, but I also didn't want to grab the lock if there was no work needed.

I guess you also want to avoid bouncing all these cachelines around on
boot on bit multicore machines.

I'd advocate using the bitmap approach or simplifying the existing
code.  For example:

+   for (n = 0; n  ESPFIX_PUD_CLONES; n++) {
+   pud = ACCESS_ONCE(pud_p[n]);
+   if (!pud_present(pud))
+   return false;
+   }

I don't see why that needs to be a loop.  How can one clone exist but
not the others?

--Andy
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-21 Thread Andrew Lutomirski
On Mon, Apr 21, 2014 at 6:47 PM, H. Peter Anvin h...@zytor.com wrote:
 Race condition (although with x86 being globally ordered, it probably can't 
 actually happen.) The bitmask is probably the way to go.

Does the race matter?  In the worst case you take the lock
unnecessarily.  But yes, the bitmask is easy.

--Andy
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: O_TMPFILE fs corruption (Re: Linux 3.11-rc4)

2013-08-04 Thread Andrew Lutomirski
On Sun, Aug 4, 2013 at 8:45 PM, Linus Torvalds
 wrote:
> The patch looks right to me - we should pass in similar flags for the
> create case as for tmpfile to the filesystem.

Alternatively, in case anyone ever wants to add more O_TMPFILE-related
flags, open could return -EINVAL if __O_TMPFILE is set and mode
contains bits outside S_IALLUGO.  Given the ABI discussion about
O_TMPFILE, this might be useful some day.

--Andy

>
> But let's make sure we're all on the same page. Al?
>
>  Linus
>
> On Sun, Aug 4, 2013 at 7:34 PM, Andy Lutomirski  wrote:
>> On 08/04/2013 02:09 PM, Linus Torvalds wrote:
>>>
>>> It's that time of the week again..
>>
>>
>> I still get filesystem corruption with O_TMPFILE.  The program below, run as
>> flinktest foo proc (or flinktest foo linkat if you're root) will produce a
>> bogus inode.  On ext4, once the inode is gone from cache, the inode will be
>> impossible to delete and will require a fsck to fix
>>
>> A patch (not necessarily the appropriate fix) is here:
>>
>> http://article.gmane.org/gmane.linux.kernel/1537088
>>
>> --- cut here ---
>>
>> #include 
>> #include 
>> #include 
>> #include 
>> #include 
>>
>> #define __O_TMPFILE 02000
>> #define O_TMPFILE (__O_TMPFILE | O_DIRECTORY)
>> #define AT_EMPTY_PATH 0x1000
>>
>> int main(int argc, char **argv)
>> {
>>   char buf[128];
>>
>>   if (argc != 3)
>> errx(1, "Usage: flinktest PATH linkat|proc");
>>
>>   int fd = open(".", O_TMPFILE | O_RDWR, 0600);
>>   if (fd == -1)
>> err(1, "O_TMPFILE");
>>   write(fd, "test", 4);
>>
>>   if (!strcmp(argv[2], "linkat")) {
>> if (linkat(fd, "", AT_FDCWD, argv[1], AT_EMPTY_PATH) != 0)
>>   err(1, "linkat");
>>   } else if (!strcmp(argv[2], "proc")) {
>> sprintf(buf, "/proc/self/fd/%d", fd);
>> if (linkat(AT_FDCWD, buf, AT_FDCWD, argv[1], AT_SYMLINK_FOLLOW) != 0)
>>   err(1, "linkat");
>>   } else {
>> errx(1, "invalid mode");
>>   }
>>   return 0;
>> }
>>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: O_TMPFILE fs corruption (Re: Linux 3.11-rc4)

2013-08-04 Thread Andrew Lutomirski
On Sun, Aug 4, 2013 at 8:45 PM, Linus Torvalds
torva...@linux-foundation.org wrote:
 The patch looks right to me - we should pass in similar flags for the
 create case as for tmpfile to the filesystem.

Alternatively, in case anyone ever wants to add more O_TMPFILE-related
flags, open could return -EINVAL if __O_TMPFILE is set and mode
contains bits outside S_IALLUGO.  Given the ABI discussion about
O_TMPFILE, this might be useful some day.

--Andy


 But let's make sure we're all on the same page. Al?

  Linus

 On Sun, Aug 4, 2013 at 7:34 PM, Andy Lutomirski l...@mit.edu wrote:
 On 08/04/2013 02:09 PM, Linus Torvalds wrote:

 It's that time of the week again..


 I still get filesystem corruption with O_TMPFILE.  The program below, run as
 flinktest foo proc (or flinktest foo linkat if you're root) will produce a
 bogus inode.  On ext4, once the inode is gone from cache, the inode will be
 impossible to delete and will require a fsck to fix

 A patch (not necessarily the appropriate fix) is here:

 http://article.gmane.org/gmane.linux.kernel/1537088

 --- cut here ---

 #include stdio.h
 #include err.h
 #include fcntl.h
 #include unistd.h
 #include string.h

 #define __O_TMPFILE 02000
 #define O_TMPFILE (__O_TMPFILE | O_DIRECTORY)
 #define AT_EMPTY_PATH 0x1000

 int main(int argc, char **argv)
 {
   char buf[128];

   if (argc != 3)
 errx(1, Usage: flinktest PATH linkat|proc);

   int fd = open(., O_TMPFILE | O_RDWR, 0600);
   if (fd == -1)
 err(1, O_TMPFILE);
   write(fd, test, 4);

   if (!strcmp(argv[2], linkat)) {
 if (linkat(fd, , AT_FDCWD, argv[1], AT_EMPTY_PATH) != 0)
   err(1, linkat);
   } else if (!strcmp(argv[2], proc)) {
 sprintf(buf, /proc/self/fd/%d, fd);
 if (linkat(AT_FDCWD, buf, AT_FDCWD, argv[1], AT_SYMLINK_FOLLOW) != 0)
   err(1, linkat);
   } else {
 errx(1, invalid mode);
   }
   return 0;
 }

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: adopt(pid_t pid) syscall proposal [patch included]

2013-06-11 Thread Andrew Lutomirski
On Tue, Jun 11, 2013 at 11:38 AM,   wrote:
> On Tue, Jun 11, 2013 at 09:48:22AM -0700, Andy Lutomirski wrote:
>> On 06/10/2013 06:23 PM, vcap...@gnugeneration.com wrote:
>> >+if (!uid_eq(cred->euid, tcred->suid) &&
>> >+!uid_eq(cred->euid, tcred->uid)  &&
>> >+!uid_eq(cred->uid,  tcred->suid) &&
>> >+!uid_eq(cred->uid,  tcred->uid) &&
>> >+!ns_capable(cred->user_ns, CAP_KILL)) {
>> >+ret = -EPERM;
>> >+goto out_unlock;
>> >+}
>> >+
>>
>> That check's far too permissive.
>
> I suspected, but that's easily improved, I just wanted to get this out
> there and see what people thought, start the discussion, as well as
> get my use case functional.  Although I'm curious, what is your cause
> for concern with the existing checks?

For example, "!uid_eq(cred->euid, tcred->uid)" means that you can
adopt setuid things.  Looking at ptrace may be a good start.

>
>>
>> This sounds like it will break anything that uses wait and expects its
>> children to not be stolen out from under it.
>
> This thought crossed my mind, and originally I intended to restrict
> adoption to orphans who had become children of init.  It seemed like
> more general use might be desirable so I left that out.  If this
> really is a possibility worth preventing we could put that restriction
> on it.  To the best of my knowledge, nothing informs init of its
> becoming parent of an orphan, so we should be able to steal the orphan
> back without harm.  This still enables the use case of screen/tmux
> reattachment.
>
>>
>> Also, you'll have problems with screen -x or the default tmux shareable
>> configuration.  It sounds like this is better done in userspace.
>
> It just needs some determination of "session leader" applied to which
> attach adopts the backend before invoking adopt(), that logic goes in
> userspace.  I imagine the implementations of both screen and tmux
> already have such determinations happening for other reasons.

What I mean is: why not just teach your userspace tool to do this
without kernel help?

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: adopt(pid_t pid) syscall proposal [patch included]

2013-06-11 Thread Andrew Lutomirski
On Tue, Jun 11, 2013 at 11:38 AM,  vcap...@gnugeneration.com wrote:
 On Tue, Jun 11, 2013 at 09:48:22AM -0700, Andy Lutomirski wrote:
 On 06/10/2013 06:23 PM, vcap...@gnugeneration.com wrote:
 +if (!uid_eq(cred-euid, tcred-suid) 
 +!uid_eq(cred-euid, tcred-uid)  
 +!uid_eq(cred-uid,  tcred-suid) 
 +!uid_eq(cred-uid,  tcred-uid) 
 +!ns_capable(cred-user_ns, CAP_KILL)) {
 +ret = -EPERM;
 +goto out_unlock;
 +}
 +

 That check's far too permissive.

 I suspected, but that's easily improved, I just wanted to get this out
 there and see what people thought, start the discussion, as well as
 get my use case functional.  Although I'm curious, what is your cause
 for concern with the existing checks?

For example, !uid_eq(cred-euid, tcred-uid) means that you can
adopt setuid things.  Looking at ptrace may be a good start.



 This sounds like it will break anything that uses wait and expects its
 children to not be stolen out from under it.

 This thought crossed my mind, and originally I intended to restrict
 adoption to orphans who had become children of init.  It seemed like
 more general use might be desirable so I left that out.  If this
 really is a possibility worth preventing we could put that restriction
 on it.  To the best of my knowledge, nothing informs init of its
 becoming parent of an orphan, so we should be able to steal the orphan
 back without harm.  This still enables the use case of screen/tmux
 reattachment.


 Also, you'll have problems with screen -x or the default tmux shareable
 configuration.  It sounds like this is better done in userspace.

 It just needs some determination of session leader applied to which
 attach adopts the backend before invoking adopt(), that logic goes in
 userspace.  I imagine the implementations of both screen and tmux
 already have such determinations happening for other reasons.

What I mean is: why not just teach your userspace tool to do this
without kernel help?

--Andy
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 54/74] x86, lto, vdso: Don't duplicate vvar address variables

2012-08-20 Thread Andrew Lutomirski
On Sat, Aug 18, 2012 at 7:56 PM, Andi Kleen  wrote:
> From: Andi Kleen 
>
> Every includer of vvar.h currently gets own static variables
> for all the vvar addresses. Generate just one set each for the
> main kernel and for the vdso. This saves some data space.
>
> Cc: Andy Lutomirski 
> Signed-off-by: Andi Kleen 

[This doesn't apply to -linus or to 3.5, so I haven't actually tested it.]

NACK, without significant further evidence that this is a good idea.

On input like this:

static const int * const vvaraddr_test = 0xff601000;

int func(void)
{
return *vvaraddr_test;
}

gcc -O2 generates:

.file   "constptr.c"
.text
.p2align 4,,15
.globl  func
.type   func, @function
func:
.LFB0:
.cfi_startproc
movl-10481664, %eax
ret
.cfi_endproc
.LFE0:
.size   func, .-func
.ident  "GCC: (GNU) 4.6.3 20120306 (Red Hat 4.6.3-2)"
.section.note.GNU-stack,"",@progbits

Note, in particular, that (a) the load from the vvar uses an immediate
memory operand (this avoids a cacheline access, which is a measureable
speedup) and (b) vvaraddr_test was not emitted as data at all.

Your code will force each vvar address to be emitted as data and will
cause each reference to reference it as data.  Barring cleverness (and
I don't remember whether the vdso build is currently clever), this
could result in double-indirect access via the GOT from the vdso.

This kind of change IMO needs actual size measurements, benchmarks,
and some evidence that duplicate .data/.rodata things were emitted.

--Andy

> ---
>  arch/x86/include/asm/vvar.h|   27 +--
>  arch/x86/vdso/vclock_gettime.c |1 +
>  arch/x86/vdso/vma.c|1 +
>  3 files changed, 19 insertions(+), 10 deletions(-)
>
> diff --git a/arch/x86/include/asm/vvar.h b/arch/x86/include/asm/vvar.h
> index d76ac40..1fd06a8 100644
> --- a/arch/x86/include/asm/vvar.h
> +++ b/arch/x86/include/asm/vvar.h
> @@ -24,27 +24,34 @@
>  /* The kernel linker script defines its own magic to put vvars in the
>   * right place.
>   */
> -#define DECLARE_VVAR(offset, type, name) \
> -   EMIT_VVAR(name, offset)
> +#define DECLARE_VVAR(type, name) \
> +   EMIT_VVAR(name, VVAR_OFFSET_ ## name)
> +
> +#elif defined(__VVAR_ADDR)
> +
> +#define DECLARE_VVAR(type, name)   \
> +   type const * const vvaraddr_ ## name =  \
> +   (void *)(VVAR_ADDRESS + (VVAR_OFFSET_ ## name));
>
>  #else
>
> -#define DECLARE_VVAR(offset, type, name)   \
> -   static type const * const vvaraddr_ ## name =   \
> -   (void *)(VVAR_ADDRESS + (offset));
> +#define DECLARE_VVAR(type, name)   \
> +   extern type const * const vvaraddr_ ## name;
>
>  #define DEFINE_VVAR(type, name)  
>   \
> type name   \
> __attribute__((section(".vvar_" #name), aligned(16))) __visible
> +#endif
>
>  #define VVAR(name) (*vvaraddr_ ## name)
>
> -#endif
> -
>  /* DECLARE_VVAR(offset, type, name) */
>
> -DECLARE_VVAR(0, volatile unsigned long, jiffies)
> -DECLARE_VVAR(16, int, vgetcpu_mode)
> -DECLARE_VVAR(128, struct vsyscall_gtod_data, vsyscall_gtod_data)
> +#define VVAR_OFFSET_jiffies 0
> +DECLARE_VVAR(volatile unsigned long, jiffies)
> +#define VVAR_OFFSET_vgetcpu_mode 16
> +DECLARE_VVAR(int, vgetcpu_mode)
> +#define VVAR_OFFSET_vsyscall_gtod_data 128
> +DECLARE_VVAR(struct vsyscall_gtod_data, vsyscall_gtod_data)
>
>  #undef DECLARE_VVAR
> diff --git a/arch/x86/vdso/vclock_gettime.c b/arch/x86/vdso/vclock_gettime.c
> index 885eff4..007eac4 100644
> --- a/arch/x86/vdso/vclock_gettime.c
> +++ b/arch/x86/vdso/vclock_gettime.c
> @@ -10,6 +10,7 @@
>
>  /* Disable profiling for userspace code: */
>  #define DISABLE_BRANCH_PROFILING
> +#define __VVAR_ADDR 1
>
>  #include 
>  #include 
> diff --git a/arch/x86/vdso/vma.c b/arch/x86/vdso/vma.c
> index fe08e2b..4432cfc 100644
> --- a/arch/x86/vdso/vma.c
> +++ b/arch/x86/vdso/vma.c
> @@ -3,6 +3,7 @@
>   * Copyright 2007 Andi Kleen, SUSE Labs.
>   * Subject to the GPL, v.2
>   */
> +#define __VVAR_ADDR 1
>  #include 
>  #include 
>  #include 
> --
> 1.7.7.6
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 54/74] x86, lto, vdso: Don't duplicate vvar address variables

2012-08-20 Thread Andrew Lutomirski
On Sat, Aug 18, 2012 at 7:56 PM, Andi Kleen a...@firstfloor.org wrote:
 From: Andi Kleen a...@linux.intel.com

 Every includer of vvar.h currently gets own static variables
 for all the vvar addresses. Generate just one set each for the
 main kernel and for the vdso. This saves some data space.

 Cc: Andy Lutomirski l...@mit.edu
 Signed-off-by: Andi Kleen a...@linux.intel.com

[This doesn't apply to -linus or to 3.5, so I haven't actually tested it.]

NACK, without significant further evidence that this is a good idea.

On input like this:

static const int * const vvaraddr_test = 0xff601000;

int func(void)
{
return *vvaraddr_test;
}

gcc -O2 generates:

.file   constptr.c
.text
.p2align 4,,15
.globl  func
.type   func, @function
func:
.LFB0:
.cfi_startproc
movl-10481664, %eax
ret
.cfi_endproc
.LFE0:
.size   func, .-func
.ident  GCC: (GNU) 4.6.3 20120306 (Red Hat 4.6.3-2)
.section.note.GNU-stack,,@progbits

Note, in particular, that (a) the load from the vvar uses an immediate
memory operand (this avoids a cacheline access, which is a measureable
speedup) and (b) vvaraddr_test was not emitted as data at all.

Your code will force each vvar address to be emitted as data and will
cause each reference to reference it as data.  Barring cleverness (and
I don't remember whether the vdso build is currently clever), this
could result in double-indirect access via the GOT from the vdso.

This kind of change IMO needs actual size measurements, benchmarks,
and some evidence that duplicate .data/.rodata things were emitted.

--Andy

 ---
  arch/x86/include/asm/vvar.h|   27 +--
  arch/x86/vdso/vclock_gettime.c |1 +
  arch/x86/vdso/vma.c|1 +
  3 files changed, 19 insertions(+), 10 deletions(-)

 diff --git a/arch/x86/include/asm/vvar.h b/arch/x86/include/asm/vvar.h
 index d76ac40..1fd06a8 100644
 --- a/arch/x86/include/asm/vvar.h
 +++ b/arch/x86/include/asm/vvar.h
 @@ -24,27 +24,34 @@
  /* The kernel linker script defines its own magic to put vvars in the
   * right place.
   */
 -#define DECLARE_VVAR(offset, type, name) \
 -   EMIT_VVAR(name, offset)
 +#define DECLARE_VVAR(type, name) \
 +   EMIT_VVAR(name, VVAR_OFFSET_ ## name)
 +
 +#elif defined(__VVAR_ADDR)
 +
 +#define DECLARE_VVAR(type, name)   \
 +   type const * const vvaraddr_ ## name =  \
 +   (void *)(VVAR_ADDRESS + (VVAR_OFFSET_ ## name));

  #else

 -#define DECLARE_VVAR(offset, type, name)   \
 -   static type const * const vvaraddr_ ## name =   \
 -   (void *)(VVAR_ADDRESS + (offset));
 +#define DECLARE_VVAR(type, name)   \
 +   extern type const * const vvaraddr_ ## name;

  #define DEFINE_VVAR(type, name)  
   \
 type name   \
 __attribute__((section(.vvar_ #name), aligned(16))) __visible
 +#endif

  #define VVAR(name) (*vvaraddr_ ## name)

 -#endif
 -
  /* DECLARE_VVAR(offset, type, name) */

 -DECLARE_VVAR(0, volatile unsigned long, jiffies)
 -DECLARE_VVAR(16, int, vgetcpu_mode)
 -DECLARE_VVAR(128, struct vsyscall_gtod_data, vsyscall_gtod_data)
 +#define VVAR_OFFSET_jiffies 0
 +DECLARE_VVAR(volatile unsigned long, jiffies)
 +#define VVAR_OFFSET_vgetcpu_mode 16
 +DECLARE_VVAR(int, vgetcpu_mode)
 +#define VVAR_OFFSET_vsyscall_gtod_data 128
 +DECLARE_VVAR(struct vsyscall_gtod_data, vsyscall_gtod_data)

  #undef DECLARE_VVAR
 diff --git a/arch/x86/vdso/vclock_gettime.c b/arch/x86/vdso/vclock_gettime.c
 index 885eff4..007eac4 100644
 --- a/arch/x86/vdso/vclock_gettime.c
 +++ b/arch/x86/vdso/vclock_gettime.c
 @@ -10,6 +10,7 @@

  /* Disable profiling for userspace code: */
  #define DISABLE_BRANCH_PROFILING
 +#define __VVAR_ADDR 1

  #include linux/kernel.h
  #include linux/posix-timers.h
 diff --git a/arch/x86/vdso/vma.c b/arch/x86/vdso/vma.c
 index fe08e2b..4432cfc 100644
 --- a/arch/x86/vdso/vma.c
 +++ b/arch/x86/vdso/vma.c
 @@ -3,6 +3,7 @@
   * Copyright 2007 Andi Kleen, SUSE Labs.
   * Subject to the GPL, v.2
   */
 +#define __VVAR_ADDR 1
  #include linux/mm.h
  #include linux/err.h
  #include linux/sched.h
 --
 1.7.7.6

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/3] vsyscall_64: allow SECCOMP_RET_TRACErs to skip

2012-08-13 Thread Andrew Lutomirski
On Fri, Aug 10, 2012 at 2:14 AM, James Morris  wrote:
> On Sat, 14 Jul 2012, Will Drewry wrote:
>
>> Agreed :) I don't mind making tweaks to get it right, but this only
>> matters to users that want to:
>> - use seccomp filter
>> - with ptrace (or trap with resumption and not sigreturn)
>> - of time, gettimeofday, and getcpu
>> since they will then have to include quirk management _just_ in case
>> their code is linked against something using vsyscall and
>> vsyscall=emulate is in effect!
>
> I think these patches came out during the merge window -- is there any
> further discussion on them?

The patch I sent on Aug 1 is the latest version.  I don't think
there's been any further discussion -- everyone seems reasonably happy
with that version.

--Andy

>
>
> - James
> --
> James Morris
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/3] vsyscall_64: allow SECCOMP_RET_TRACErs to skip

2012-08-13 Thread Andrew Lutomirski
On Fri, Aug 10, 2012 at 2:14 AM, James Morris jmor...@namei.org wrote:
 On Sat, 14 Jul 2012, Will Drewry wrote:

 Agreed :) I don't mind making tweaks to get it right, but this only
 matters to users that want to:
 - use seccomp filter
 - with ptrace (or trap with resumption and not sigreturn)
 - of time, gettimeofday, and getcpu
 since they will then have to include quirk management _just_ in case
 their code is linked against something using vsyscall and
 vsyscall=emulate is in effect!

 I think these patches came out during the merge window -- is there any
 further discussion on them?

The patch I sent on Aug 1 is the latest version.  I don't think
there's been any further discussion -- everyone seems reasonably happy
with that version.

--Andy



 - James
 --
 James Morris
 jmor...@namei.org
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/3] vsyscall_64: allow SECCOMP_RET_TRACErs to skip

2012-07-14 Thread Andrew Lutomirski
On Sat, Jul 14, 2012 at 8:57 AM, Will Drewry  wrote:
> On Sat, Jul 14, 2012 at 10:50 AM, Will Drewry  wrote:
>> On Sat, Jul 14, 2012 at 10:44 AM, Andrew Lutomirski  wrote:
>>> I think I'd prefer if changing to something other than whatever value is
>>> used to cancel the syscall resulted in a crash rather than just being
>>> ignored.
>>
>> I was trying to keep as much seccomp-ptrace behavior intact rather
>> than making it terminal in this special case.  Is there a reason why
>> it'd make more sense to crash?
>
> Unless you meant something the tracer could catch?  That may make
> sense, but they could also use singlestep or whatever else to get
> similar behavior.  But maybe I'm missing the bigger picture!

I think it would be nice to not introduce any special behavior that
things might rely on if we do this better in the future.  Similarly,
for almost all purposes, a tracer could change gettimeofday to write,
but there would be a silent behavior change if gettimeofday were
entered via vsyscall.

What's the standard way of skipping a syscall?  sigreturn?  (I don't
know off the top of my head what sigreturn does.)  sys_ni_syscall?
I'd be all for making those continue to work but making anything that
can't be emulated correctly do something sufficiently unpleasant that
people won't do it.  Is there a syscall that does nothing at all?

I wish we could just increment rip by 7 and set a flag to allow the
vsyscall page instructions to be fully emulated until one of the ret
instructions happens, but I don't know how to do that without
monkeying with the entry assembly -- the do_page_fault path doesn't
look enough like system_call to pull it off easily.

In any case, can you change the docs to indicate that the special
behavior only happens iff rip & ~0x0c00 == 0xff60?  That
way a hypothetical future emulator could add better emulation
(incrementing rip by 7, for example) and everything would still work.
(There's no need to check the syscall number.)

FWIW, this crap is why I'm sort of tempted to say that seccomp should
just ignore vsyscalls entirely (except in mode 1).  None of them
actually do anything other than querying things that can be read out
of the vsyscall page (for the most part) without any kernel entry.
(Making *that* part of the ABI would be bad, though.)  Better ideas
are welcome.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/3] vsyscall_64: allow SECCOMP_RET_TRACErs to skip

2012-07-14 Thread Andrew Lutomirski
On Sat, Jul 14, 2012 at 8:57 AM, Will Drewry w...@chromium.org wrote:
 On Sat, Jul 14, 2012 at 10:50 AM, Will Drewry w...@chromium.org wrote:
 On Sat, Jul 14, 2012 at 10:44 AM, Andrew Lutomirski l...@mit.edu wrote:
 I think I'd prefer if changing to something other than whatever value is
 used to cancel the syscall resulted in a crash rather than just being
 ignored.

 I was trying to keep as much seccomp-ptrace behavior intact rather
 than making it terminal in this special case.  Is there a reason why
 it'd make more sense to crash?

 Unless you meant something the tracer could catch?  That may make
 sense, but they could also use singlestep or whatever else to get
 similar behavior.  But maybe I'm missing the bigger picture!

I think it would be nice to not introduce any special behavior that
things might rely on if we do this better in the future.  Similarly,
for almost all purposes, a tracer could change gettimeofday to write,
but there would be a silent behavior change if gettimeofday were
entered via vsyscall.

What's the standard way of skipping a syscall?  sigreturn?  (I don't
know off the top of my head what sigreturn does.)  sys_ni_syscall?
I'd be all for making those continue to work but making anything that
can't be emulated correctly do something sufficiently unpleasant that
people won't do it.  Is there a syscall that does nothing at all?

I wish we could just increment rip by 7 and set a flag to allow the
vsyscall page instructions to be fully emulated until one of the ret
instructions happens, but I don't know how to do that without
monkeying with the entry assembly -- the do_page_fault path doesn't
look enough like system_call to pull it off easily.

In any case, can you change the docs to indicate that the special
behavior only happens iff rip  ~0x0c00 == 0xff60?  That
way a hypothetical future emulator could add better emulation
(incrementing rip by 7, for example) and everything would still work.
(There's no need to check the syscall number.)

FWIW, this crap is why I'm sort of tempted to say that seccomp should
just ignore vsyscalls entirely (except in mode 1).  None of them
actually do anything other than querying things that can be read out
of the vsyscall page (for the most part) without any kernel entry.
(Making *that* part of the ABI would be bad, though.)  Better ideas
are welcome.

--Andy
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] x86/vsyscall: allow seccomp filter in vsyscall=emulate

2012-07-13 Thread Andrew Lutomirski
On Fri, Jul 13, 2012 at 10:06 AM, Will Drewry  wrote:
> If a seccomp filter program is installed, older static binaries and
> distributions with older libc implementations (glibc 2.13 and earlier)
> that rely on vsyscall use will be terminated regardless of the filter
> program policy when executing time, gettimeofday, or getcpu.  This is
> only the case when vsyscall emulation is in use (vsyscall=emulate is the
> default).
>
> This patch emulates system call entry inside a vsyscall=emulate by
> populating regs->ax and regs->orig_ax with the system call number prior
> to calling into seccomp such that all seccomp-dependencies function
> normally.  Additionally, system call return behavior is emulated in line
> with other vsyscall entrypoints for the trace/trap cases.
>
> Reported-by: Owen Kibel 
> Signed-off-by: Will Drewry 
>
> v2: - fixed ip and sp on SECCOMP_RET_TRAP/TRACE (thanks to l...@mit.edu)

> @@ -253,6 +273,12 @@ bool emulate_vsyscall(struct pt_regs *regs, unsigned 
> long address)
>
> current_thread_info()->sig_on_uaccess_error = 
> prev_sig_on_uaccess_error;
>
> +   if (skip) {
> +   if ((long)regs->ax <= 0L) /* seccomp errno emulation */
> +   goto do_ret;
> +   goto done; /* seccomp trace/trap */
> +   }
> +
> if (ret == -EFAULT) {
> /* Bad news -- userspace fed a bad pointer to a vsyscall. */
> warn_bad_vsyscall(KERN_INFO, regs,
> @@ -271,10 +297,11 @@ bool emulate_vsyscall(struct pt_regs *regs, unsigned 
> long address)
>
> regs->ax = ret;
>
> +do_ret:
> /* Emulate a ret instruction. */
> regs->ip = caller;
> regs->sp += 8;
> -
> +done:
> return true;
>
>  sigsegv:
> --
> 1.7.9.5
>

This has the same odd property as the sigsegv path that the faulting
instruction will appear to be the mov, not the syscall.  That seems to
be okay, though -- various pieces of code that try to restart the segv
are okay with that.

Is there any code that assumes that changing rax (i.e. the syscall
number) and restarting a syscall after SIGSYS will invoke the new
syscall?  (The RET_TRACE path might be similar -- does the
ptrace_event(PTRACE_EVENT_SECCOMP, data) in seccomp.c give a debugger
a chance to synchronously cancel or change the syscall?

If those issues aren't problems, then:

Reviewed-by: Andy Lutomirski 

(If the syscall number needs to change after the fact in the
SECCOMP_RET_TRAP case, it'll be a mess.)

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] x86/vsyscall: allow seccomp filter in vsyscall=emulate

2012-07-13 Thread Andrew Lutomirski
On Thu, Jul 12, 2012 at 10:17 PM, Will Drewry  wrote:
> If a seccomp filter program is installed, older static binaries and
> distributions with older libc implementations (glibc 2.13 and earlier)
> that rely on vsyscall use will be terminated regardless of the filter
> program policy when executing time, gettimeofday, or getcpu.  This is
> only the case when vsyscall emulation is in use (vsyscall=emulate is the
> default).
>
> This patch emulates system call entry inside a vsyscall=emulate trap
> such that seccomp can properly evaluate the system call.
>
> Reported-by: Owen Kibel 
> Signed-off-by: Will Drewry 
> ---
>  arch/x86/kernel/vsyscall_64.c |   29 ++---
>  1 file changed, 26 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/kernel/vsyscall_64.c b/arch/x86/kernel/vsyscall_64.c
> index 7515cf0..433545f 100644
> --- a/arch/x86/kernel/vsyscall_64.c
> +++ b/arch/x86/kernel/vsyscall_64.c
> @@ -139,6 +139,14 @@ static int addr_to_vsyscall_nr(unsigned long addr)
> return nr;
>  }
>
> +static int vsyscall_seccomp(struct task_struct *tsk, int syscall_nr)
> +{
> +   if (!seccomp_mode(>seccomp))
> +   return 0;
> +   task_pt_regs(tsk)->orig_ax = syscall_nr;
> +   return __secure_computing(syscall_nr);
> +}
> +
>  static bool write_ok_or_segv(unsigned long ptr, size_t size)
>  {
> /*
> @@ -174,6 +182,7 @@ bool emulate_vsyscall(struct pt_regs *regs, unsigned long 
> address)
> int vsyscall_nr;
> int prev_sig_on_uaccess_error;
> long ret;
> +   int skip;
>
> /*
>  * No point in checking CS -- the only way to get here is a user mode
> @@ -205,9 +214,6 @@ bool emulate_vsyscall(struct pt_regs *regs, unsigned long 
> address)
> }
>
> tsk = current;
> -   if (seccomp_mode(>seccomp))
> -   do_exit(SIGKILL);
> -
> /*
>  * With a real vsyscall, page faults cause SIGSEGV.  We want to
>  * preserve that behavior to make writing exploits harder.
> @@ -222,8 +228,13 @@ bool emulate_vsyscall(struct pt_regs *regs, unsigned 
> long address)
>  * address 0".
>  */
> ret = -EFAULT;
> +   skip = 0;
> switch (vsyscall_nr) {
> case 0:
> +   skip = vsyscall_seccomp(tsk, __NR_gettimeofday);
> +   if (skip)
> +   break;
> +
> if (!write_ok_or_segv(regs->di, sizeof(struct timeval)) ||
> !write_ok_or_segv(regs->si, sizeof(struct timezone)))
> break;
> @@ -234,6 +245,10 @@ bool emulate_vsyscall(struct pt_regs *regs, unsigned 
> long address)
> break;
>
> case 1:
> +   skip = vsyscall_seccomp(tsk, __NR_time);
> +   if (skip)
> +   break;
> +
> if (!write_ok_or_segv(regs->di, sizeof(time_t)))
> break;
>
> @@ -241,6 +256,10 @@ bool emulate_vsyscall(struct pt_regs *regs, unsigned 
> long address)
> break;
>
> case 2:
> +   skip = vsyscall_seccomp(tsk, __NR_getcpu);
> +   if (skip)
> +   break;
> +
> if (!write_ok_or_segv(regs->di, sizeof(unsigned)) ||
> !write_ok_or_segv(regs->si, sizeof(unsigned)))
> break;
> @@ -253,6 +272,9 @@ bool emulate_vsyscall(struct pt_regs *regs, unsigned long 
> address)
>
> current_thread_info()->sig_on_uaccess_error = 
> prev_sig_on_uaccess_error;
>
> +   if (skip)
> +   goto do_ret;
> +
> if (ret == -EFAULT) {
> /* Bad news -- userspace fed a bad pointer to a vsyscall. */
> warn_bad_vsyscall(KERN_INFO, regs,
> @@ -271,6 +293,7 @@ bool emulate_vsyscall(struct pt_regs *regs, unsigned long 
> address)
>
> regs->ax = ret;
>
> +do_ret:
> /* Emulate a ret instruction. */
> regs->ip = caller;
> regs->sp += 8;

Does this work correctly in SECCOMP_RET_TRAP, TRACE, or ERRNO mode?
errno looks okay, but trap and trace still emulate the ret
instruction, which looks like it could confuse debuggers.  (If, on the
other hand, no change is made to the registers, then the debugger will
see a syscall instruction at rip, albeit one that can't actually be
executed due to the nx bit.)

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] x86/vsyscall: allow seccomp filter in vsyscall=emulate

2012-07-13 Thread Andrew Lutomirski
On Thu, Jul 12, 2012 at 10:17 PM, Will Drewry w...@chromium.org wrote:
 If a seccomp filter program is installed, older static binaries and
 distributions with older libc implementations (glibc 2.13 and earlier)
 that rely on vsyscall use will be terminated regardless of the filter
 program policy when executing time, gettimeofday, or getcpu.  This is
 only the case when vsyscall emulation is in use (vsyscall=emulate is the
 default).

 This patch emulates system call entry inside a vsyscall=emulate trap
 such that seccomp can properly evaluate the system call.

 Reported-by: Owen Kibel qme...@gmail.com
 Signed-off-by: Will Drewry w...@chromium.org
 ---
  arch/x86/kernel/vsyscall_64.c |   29 ++---
  1 file changed, 26 insertions(+), 3 deletions(-)

 diff --git a/arch/x86/kernel/vsyscall_64.c b/arch/x86/kernel/vsyscall_64.c
 index 7515cf0..433545f 100644
 --- a/arch/x86/kernel/vsyscall_64.c
 +++ b/arch/x86/kernel/vsyscall_64.c
 @@ -139,6 +139,14 @@ static int addr_to_vsyscall_nr(unsigned long addr)
 return nr;
  }

 +static int vsyscall_seccomp(struct task_struct *tsk, int syscall_nr)
 +{
 +   if (!seccomp_mode(tsk-seccomp))
 +   return 0;
 +   task_pt_regs(tsk)-orig_ax = syscall_nr;
 +   return __secure_computing(syscall_nr);
 +}
 +
  static bool write_ok_or_segv(unsigned long ptr, size_t size)
  {
 /*
 @@ -174,6 +182,7 @@ bool emulate_vsyscall(struct pt_regs *regs, unsigned long 
 address)
 int vsyscall_nr;
 int prev_sig_on_uaccess_error;
 long ret;
 +   int skip;

 /*
  * No point in checking CS -- the only way to get here is a user mode
 @@ -205,9 +214,6 @@ bool emulate_vsyscall(struct pt_regs *regs, unsigned long 
 address)
 }

 tsk = current;
 -   if (seccomp_mode(tsk-seccomp))
 -   do_exit(SIGKILL);
 -
 /*
  * With a real vsyscall, page faults cause SIGSEGV.  We want to
  * preserve that behavior to make writing exploits harder.
 @@ -222,8 +228,13 @@ bool emulate_vsyscall(struct pt_regs *regs, unsigned 
 long address)
  * address 0.
  */
 ret = -EFAULT;
 +   skip = 0;
 switch (vsyscall_nr) {
 case 0:
 +   skip = vsyscall_seccomp(tsk, __NR_gettimeofday);
 +   if (skip)
 +   break;
 +
 if (!write_ok_or_segv(regs-di, sizeof(struct timeval)) ||
 !write_ok_or_segv(regs-si, sizeof(struct timezone)))
 break;
 @@ -234,6 +245,10 @@ bool emulate_vsyscall(struct pt_regs *regs, unsigned 
 long address)
 break;

 case 1:
 +   skip = vsyscall_seccomp(tsk, __NR_time);
 +   if (skip)
 +   break;
 +
 if (!write_ok_or_segv(regs-di, sizeof(time_t)))
 break;

 @@ -241,6 +256,10 @@ bool emulate_vsyscall(struct pt_regs *regs, unsigned 
 long address)
 break;

 case 2:
 +   skip = vsyscall_seccomp(tsk, __NR_getcpu);
 +   if (skip)
 +   break;
 +
 if (!write_ok_or_segv(regs-di, sizeof(unsigned)) ||
 !write_ok_or_segv(regs-si, sizeof(unsigned)))
 break;
 @@ -253,6 +272,9 @@ bool emulate_vsyscall(struct pt_regs *regs, unsigned long 
 address)

 current_thread_info()-sig_on_uaccess_error = 
 prev_sig_on_uaccess_error;

 +   if (skip)
 +   goto do_ret;
 +
 if (ret == -EFAULT) {
 /* Bad news -- userspace fed a bad pointer to a vsyscall. */
 warn_bad_vsyscall(KERN_INFO, regs,
 @@ -271,6 +293,7 @@ bool emulate_vsyscall(struct pt_regs *regs, unsigned long 
 address)

 regs-ax = ret;

 +do_ret:
 /* Emulate a ret instruction. */
 regs-ip = caller;
 regs-sp += 8;

Does this work correctly in SECCOMP_RET_TRAP, TRACE, or ERRNO mode?
errno looks okay, but trap and trace still emulate the ret
instruction, which looks like it could confuse debuggers.  (If, on the
other hand, no change is made to the registers, then the debugger will
see a syscall instruction at rip, albeit one that can't actually be
executed due to the nx bit.)

--Andy
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] x86/vsyscall: allow seccomp filter in vsyscall=emulate

2012-07-13 Thread Andrew Lutomirski
On Fri, Jul 13, 2012 at 10:06 AM, Will Drewry w...@chromium.org wrote:
 If a seccomp filter program is installed, older static binaries and
 distributions with older libc implementations (glibc 2.13 and earlier)
 that rely on vsyscall use will be terminated regardless of the filter
 program policy when executing time, gettimeofday, or getcpu.  This is
 only the case when vsyscall emulation is in use (vsyscall=emulate is the
 default).

 This patch emulates system call entry inside a vsyscall=emulate by
 populating regs-ax and regs-orig_ax with the system call number prior
 to calling into seccomp such that all seccomp-dependencies function
 normally.  Additionally, system call return behavior is emulated in line
 with other vsyscall entrypoints for the trace/trap cases.

 Reported-by: Owen Kibel qme...@gmail.com
 Signed-off-by: Will Drewry w...@chromium.org

 v2: - fixed ip and sp on SECCOMP_RET_TRAP/TRACE (thanks to l...@mit.edu)

 @@ -253,6 +273,12 @@ bool emulate_vsyscall(struct pt_regs *regs, unsigned 
 long address)

 current_thread_info()-sig_on_uaccess_error = 
 prev_sig_on_uaccess_error;

 +   if (skip) {
 +   if ((long)regs-ax = 0L) /* seccomp errno emulation */
 +   goto do_ret;
 +   goto done; /* seccomp trace/trap */
 +   }
 +
 if (ret == -EFAULT) {
 /* Bad news -- userspace fed a bad pointer to a vsyscall. */
 warn_bad_vsyscall(KERN_INFO, regs,
 @@ -271,10 +297,11 @@ bool emulate_vsyscall(struct pt_regs *regs, unsigned 
 long address)

 regs-ax = ret;

 +do_ret:
 /* Emulate a ret instruction. */
 regs-ip = caller;
 regs-sp += 8;
 -
 +done:
 return true;

  sigsegv:
 --
 1.7.9.5


This has the same odd property as the sigsegv path that the faulting
instruction will appear to be the mov, not the syscall.  That seems to
be okay, though -- various pieces of code that try to restart the segv
are okay with that.

Is there any code that assumes that changing rax (i.e. the syscall
number) and restarting a syscall after SIGSYS will invoke the new
syscall?  (The RET_TRACE path might be similar -- does the
ptrace_event(PTRACE_EVENT_SECCOMP, data) in seccomp.c give a debugger
a chance to synchronously cancel or change the syscall?

If those issues aren't problems, then:

Reviewed-by: Andy Lutomirski l...@amacapital.net

(If the syscall number needs to change after the fact in the
SECCOMP_RET_TRAP case, it'll be a mess.)

--Andy
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: /dev/urandom uses uninit bytes, leaks user data

2007-12-21 Thread Andrew Lutomirski
On Dec 20, 2007 3:17 PM, Phillip Susi <[EMAIL PROTECTED]> wrote:
> Andrew Lutomirski wrote:
> > I understand that there's no way that /dev/random can provide good
> > output if there's insufficient entropy.  But it still shouldn't leak
> > arbitrary bits of user data that were never meant to be put into the
> > pool at all.
>
> It doesn't leak it though, it consumes it, and it then vanishes into the
> entropy pool, never to be seen again.

No, it's there, and if there's little enough entropy around it can be
recovered by brute force.

>
> > Step 1: Boot a system without a usable entropy source.
> > Step 2: add some (predictable) "entropy" from userspace which isn't a
> > multiple of 4, so up to three extra bytes get added.
> > Step 3: Read a few bytes of /dev/random and send them over the network.
>
> Only root can do 1 and 2, at which point, it's already game over.

Again, no.  Root could do this accidentally.  Step 1 happens all the
time (see the comments on non-unique UUIDs).  Step 2 just requires a
program to put data which it things is OK to go into the pool next to
data that shouldn't be there in memory.  I'm OK with the entropy pool
being insecure in cases where it cannot possibly be secure.  But it
should not leak information that never belonged in it in the first
place.

(Remember, the entire justification for Linux's model of the entropy
pool seems to be that it should be as secure as possible even against
computationally unbounded attackers or attackers who can find SHA
preimages.  A brute force attack that works sometimes and only
requires 2^24 brute force iterations certainly fits into this threat
model.)

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: /dev/urandom uses uninit bytes, leaks user data

2007-12-21 Thread Andrew Lutomirski
On Dec 20, 2007 3:17 PM, Phillip Susi [EMAIL PROTECTED] wrote:
 Andrew Lutomirski wrote:
  I understand that there's no way that /dev/random can provide good
  output if there's insufficient entropy.  But it still shouldn't leak
  arbitrary bits of user data that were never meant to be put into the
  pool at all.

 It doesn't leak it though, it consumes it, and it then vanishes into the
 entropy pool, never to be seen again.

No, it's there, and if there's little enough entropy around it can be
recovered by brute force.


  Step 1: Boot a system without a usable entropy source.
  Step 2: add some (predictable) entropy from userspace which isn't a
  multiple of 4, so up to three extra bytes get added.
  Step 3: Read a few bytes of /dev/random and send them over the network.

 Only root can do 1 and 2, at which point, it's already game over.

Again, no.  Root could do this accidentally.  Step 1 happens all the
time (see the comments on non-unique UUIDs).  Step 2 just requires a
program to put data which it things is OK to go into the pool next to
data that shouldn't be there in memory.  I'm OK with the entropy pool
being insecure in cases where it cannot possibly be secure.  But it
should not leak information that never belonged in it in the first
place.

(Remember, the entire justification for Linux's model of the entropy
pool seems to be that it should be as secure as possible even against
computationally unbounded attackers or attackers who can find SHA
preimages.  A brute force attack that works sometimes and only
requires 2^24 brute force iterations certainly fits into this threat
model.)

--Andy
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: /dev/urandom uses uninit bytes, leaks user data

2007-12-19 Thread Andrew Lutomirski
On Dec 17, 2007 10:46 PM, Theodore Tso <[EMAIL PROTECTED]> wrote:
> If you have a system with insufficent entropy inputs, you're in
> trouble, of course.  There are "catastrophic reseeds" that attempt to
> mitigrate some of worse attacks, but at the end of the day,
> /dev/random isn't magic.
>

I understand that there's no way that /dev/random can provide good
output if there's insufficient entropy.  But it still shouldn't leak
arbitrary bits of user data that were never meant to be put into the
pool at all.

(My hypothetical attack is a lot hypothetical than I thought at first.
 An attacker does not need to break into the kernel and steal the
state of the pool.  It may be as easy as this to trigger:

Step 1: Boot a system without a usable entropy source.
Step 2: add some (predictable) "entropy" from userspace which isn't a
multiple of 4, so up to three extra bytes get added.
Step 3: Read a few bytes of /dev/random and send them over the network.

An attacker can now try all possibilities of the three extra bytes and
guess them pretty quickly.  No compromise needed.  This is, IMHO, bad.
 (It's one thing for the "random" numbers to be weak.  It's another
thing entirely for them to reveal data that never belonged in the pool
in the first place.)

Actually, perhaps there should be a policy that we try never to reseed
the pool at all until there is enough entropy around to prevent
attacks like these.  (In theory the state of the pool might contain
2^(smallish number) bits of data interesting to the attacker even
without the uninitialized data issue.)  This would make the situation
even worse for low-entropy systems, though.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: /dev/urandom uses uninit bytes, leaks user data

2007-12-19 Thread Andrew Lutomirski
On Dec 17, 2007 10:46 PM, Theodore Tso [EMAIL PROTECTED] wrote:
 If you have a system with insufficent entropy inputs, you're in
 trouble, of course.  There are catastrophic reseeds that attempt to
 mitigrate some of worse attacks, but at the end of the day,
 /dev/random isn't magic.


I understand that there's no way that /dev/random can provide good
output if there's insufficient entropy.  But it still shouldn't leak
arbitrary bits of user data that were never meant to be put into the
pool at all.

(My hypothetical attack is a lot hypothetical than I thought at first.
 An attacker does not need to break into the kernel and steal the
state of the pool.  It may be as easy as this to trigger:

Step 1: Boot a system without a usable entropy source.
Step 2: add some (predictable) entropy from userspace which isn't a
multiple of 4, so up to three extra bytes get added.
Step 3: Read a few bytes of /dev/random and send them over the network.

An attacker can now try all possibilities of the three extra bytes and
guess them pretty quickly.  No compromise needed.  This is, IMHO, bad.
 (It's one thing for the random numbers to be weak.  It's another
thing entirely for them to reveal data that never belonged in the pool
in the first place.)

Actually, perhaps there should be a policy that we try never to reseed
the pool at all until there is enough entropy around to prevent
attacks like these.  (In theory the state of the pool might contain
2^(smallish number) bits of data interesting to the attacker even
without the uninitialized data issue.)  This would make the situation
even worse for low-entropy systems, though.

--Andy
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: A little coding style nugget of joy

2007-09-19 Thread Andrew Lutomirski
On 9/19/07, Andi Kleen <[EMAIL PROTECTED]> wrote:
> > This is a terrible assumption in general (i.e. if filesize % blocksize
> > is close to uniformly distributed).  If you remove one byte and the data
> > is stored with blocksize B, then you either save zero bytes with
> > probability 1-1/B or you save B bytes with probability 1/B.  The
> > expected number of bytes saved is B*1/B=1.  Since expectation is linear,
> > if you remove x bytes, the expected number of bytes saved is x (even if
> > there is more than one byte removed per file).
>
> You didn't calculate the probability of actually saving a full block
> or not (that's the only thing that matters). I assumed it's relatively
> small and can be ignored in practice since the amount of end white
> space is negligible compared to total file size.

Sure I did.  It's roughly 1/B per byte removed ( = 1/4096 ).

--Andy
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: A little coding style nugget of joy

2007-09-19 Thread Andrew Lutomirski
On 9/19/07, Andi Kleen [EMAIL PROTECTED] wrote:
  This is a terrible assumption in general (i.e. if filesize % blocksize
  is close to uniformly distributed).  If you remove one byte and the data
  is stored with blocksize B, then you either save zero bytes with
  probability 1-1/B or you save B bytes with probability 1/B.  The
  expected number of bytes saved is B*1/B=1.  Since expectation is linear,
  if you remove x bytes, the expected number of bytes saved is x (even if
  there is more than one byte removed per file).

 You didn't calculate the probability of actually saving a full block
 or not (that's the only thing that matters). I assumed it's relatively
 small and can be ignored in practice since the amount of end white
 space is negligible compared to total file size.

Sure I did.  It's roughly 1/B per byte removed ( = 1/4096 ).

--Andy
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/