Re: [PATCH] bcache: add cond_resched() in __bch_cache_cmp()

2019-03-07 Thread Vojtech Pavlik
On Thu, Mar 07, 2019 at 11:36:18PM +0800, Coly Li wrote:
> On 2019/3/7 11:06 PM, Shile Zhang wrote:
> > 
> > On 2019/3/7 18:34, Coly Li wrote:
> >> On 2019/3/7 1:15 PM, shile.zh...@linux.alibaba.com wrote:
> >>> From: Shile Zhang 
> >>>
> >>> Reading /sys/fs/bcache//cacheN/priority_stats can take a very long
> >>> time with a huge cache after a long run.
> >>>
> >>> Signed-off-by: Shile Zhang 
> >> Hi Shile,
> >>
> >> Did you test your change? It would be helpful to have more performance
> >> data (what problem you improved).
> > 
> > With a 960 GB SSD cache device, one read of 'priority_stats' costs
> > about 600 ms in our test environment.
> > 
> 
> After the fix, how much time does it take?
> 
> 
> > The perf tool showed that nearly 50% of the CPU time is consumed by
> > 'sort()', which means one sort will hold the CPU for nearly 300 ms.
> > 
> > In our case, the statistics collector reads 'priority_stats'
> > periodically, which triggers scheduling latency jitter for the
> > tasks sharing the same CPU core.
> > 
> 
> Hmm, it seems you just make the sort slower, and nothing else changes.
> Am I right?

Well, it has to make the sort slower, but it'll also avoid hogging the
CPU (on a non-preemptible kernel), avoiding a potential soft lockup
warning and allowing other tasks to run.

-- 
Vojtech Pavlik
VP SUSE Labs


Re: [rfd] saving old mice -- button glitching/debouncing

2018-12-15 Thread Vojtech Pavlik
On Sat, Dec 15, 2018 at 10:47:22AM +0100, Pavel Machek wrote:

> > > b) would it be acceptable if done properly? (cmd line option to
> > > enable, avoiding duplicate/wrong events?)
> > 
> > Well, for one, you shouldn't be using a timer, all the debouncing can be
> > done by math on the event timestamps.
> 
> Not... really, right? You need to send a release some time after the
> button indicates release if a bounce did not happen. It is similar to
> autorepeat needing a timer.

You can send the first release and ignore all presses and releases in a
time window after that.

> Let me gain some experience with the patch. I don't think hardware
> does as heavy debouncing as you describe.

Microswitches vibrate significantly when clicked. Without hardware
debouncing, you'd have many clicks instead of one even on a new mouse.


-- 
Vojtech Pavlik


Re: [rfd] saving old mice -- button glitching/debouncing

2018-12-15 Thread Vojtech Pavlik
On Sat, Dec 15, 2018 at 12:24:37AM +0100, Pavel Machek wrote:

> Patch is obviously not ready; but:
> 
> a) would it be useful to people

Probably not.

Mice already do internal software/hardware debouncing on switches. If you
see duplicate clicks making it all the way to the kernel driver, something
is very wrong with the switch, to the point where it'll soon fail
completely.

Hence the value of doing extra processing in the kernel is very limited.

> b) would it be acceptable if done properly? (cmd line option to
> enable, avoiding duplicate/wrong events?)

Well, for one, you shouldn't be using a timer, all the debouncing can be
done by math on the event timestamps.

You'd also have a hard time distinguishing between doubleclicks and bounces.

Since the mice already filter the quick bounces and there is significant
delay (100ms on USB, similar on PS/2) between updates from the hardware, a
real doubleclick can look absolutely the same as what you observe as a
bounce.

As such, it couldn't be enabled by default.

If you really want to go this way, a userspace solution is the better choice.

-- 
Vojtech Pavlik


Re: [PATCH] Input: iforce - Add the Saitek R440 Force Wheel

2018-11-14 Thread Vojtech Pavlik
On Wed, Nov 14, 2018 at 10:52:40AM +0100, Tim Schumacher wrote:
> Signed-off-by: Tim Schumacher 
> ---
> Please note that I do NOT own this device.
> 
> I'm adding this based on the fact that this is an iforce-based
> device and that the Windows driver for the R440 works for the
> Logitech WingMan Formula Force after replacing the device/vendor
> IDs (I got the vendor/device IDs from there as well).
> 
> Please don't add this patch if adding devices based on that is
> not ok.
> 
> Also (not related to this patch specifically), does anyone know
> what the question marks at the end of some device definitions
> are supposed to mean?

I believe those are unconfirmed devices based on similar reasoning to yours.

> ---
>  drivers/input/joystick/iforce/iforce-main.c | 1 +
>  drivers/input/joystick/iforce/iforce-usb.c  | 1 +
>  2 files changed, 2 insertions(+)
> 
> diff --git a/drivers/input/joystick/iforce/iforce-main.c 
> b/drivers/input/joystick/iforce/iforce-main.c
> index 58d5cfe46526..432deecaeff9 100644
> --- a/drivers/input/joystick/iforce/iforce-main.c
> +++ b/drivers/input/joystick/iforce/iforce-main.c
> @@ -67,6 +67,7 @@ static struct iforce_device iforce_device[] = {
>   { 0x05ef, 0x, "AVB Top Shot Force Feedback Racing Wheel",   
> btn_wheel, abs_wheel, ff_iforce }, //?
>   { 0x061c, 0xc0a4, "ACT LABS Force RS",  
> btn_wheel, abs_wheel, ff_iforce }, //?
>   { 0x061c, 0xc084, "ACT LABS Force RS",  
> btn_wheel, abs_wheel, ff_iforce },
> + { 0x06a3, 0xff04, "Saitek R440 Force Wheel",
> btn_wheel, abs_wheel, ff_iforce }, //?
>   { 0x06f8, 0x0001, "Guillemot Race Leader Force Feedback",   
> btn_wheel, abs_wheel, ff_iforce }, //?
>   { 0x06f8, 0x0001, "Guillemot Jet Leader Force Feedback",
> btn_joystick, abs_joystick_rudder, ff_iforce },
>   { 0x06f8, 0x0004, "Guillemot Force Feedback Racing Wheel",  
> btn_wheel, abs_wheel, ff_iforce }, //?
> diff --git a/drivers/input/joystick/iforce/iforce-usb.c 
> b/drivers/input/joystick/iforce/iforce-usb.c
> index 78073259c9a1..cdea61ae838d 100644
> --- a/drivers/input/joystick/iforce/iforce-usb.c
> +++ b/drivers/input/joystick/iforce/iforce-usb.c
> @@ -214,6 +214,7 @@ static const struct usb_device_id iforce_usb_ids[] = {
>   { USB_DEVICE(0x05ef, 0x) }, /* AVB Top Shot FFB Racing 
> Wheel */
>   { USB_DEVICE(0x061c, 0xc0a4) }, /* ACT LABS Force RS */
>   { USB_DEVICE(0x061c, 0xc084) }, /* ACT LABS Force RS */
> + { USB_DEVICE(0x06a3, 0xff04) }, /* Saitek R440 Force Wheel */
>   { USB_DEVICE(0x06f8, 0x0001) }, /* Guillemot Race Leader Force 
> Feedback */
>   { USB_DEVICE(0x06f8, 0x0003) }, /* Guillemot Jet Leader Force 
> Feedback */
>   { USB_DEVICE(0x06f8, 0x0004) }, /* Guillemot Force Feedback 
> Racing Wheel */
> -- 
> 2.19.1.450.ga4b8ab536
> 

-- 
Vojtech Pavlik
Director SUSE Labs


Re: [PATCH] Input: stop telling users to snail-mail Vojtech

2018-07-26 Thread Vojtech Pavlik
On Tue, Jul 24, 2018 at 11:45:54AM -0700, Dmitry Torokhov wrote:

> I do not think Vojtech wants snail mail these days, even if the
> address were correct (I do not know if it is), so let's remove
> snail-mail instructions from the sources.
 
Thanks. Nobody has ever sent me mail this way, but nevertheless, the
address isn't valid anymore, so removing it makes good sense.

-- 
Vojtech Pavlik
Director SUSE Labs


Re: bcache with existing ext4 filesystem

2017-07-25 Thread Vojtech Pavlik
On Tue, Jul 25, 2017 at 08:43:04AM +0200, Pavel Machek wrote:
> On Tue 2017-07-25 00:51:56, Theodore Ts'o wrote:
> > On Mon, Jul 24, 2017 at 10:04:51PM +0200, Pavel Machek wrote:
> > > Question for you was... Is the first 1KiB of each ext4 filesystem still
> > > free and "reserved for a bootloader"?
> > 
> > Yes.
> 
> Thanks.
> 
> > > If I needed more for bcache superblock (8KiB, IIRC), would that be
> > > easy to accomplish on existing filesystem?
> > 
> > Huh?  Why would the bcache superblock matter when you're talking about
> > the ext4 layout?  The bcache superblock will be on the bcache
> > device/partition, and the ext4 superblock will be on the ext4
> > device/partition.
> 
> I'd like to enable bcache on already existing ext4 partition. AFAICT
> normal situation, even on the backing device, is:
> 
> | 8KiB bcache superblock | 1KiB reserved | ext4 superblock | 400GB data |
> 
> Unfortunately, that would mean shifting 400GB data 8KB forward, and
> compatibility problems. So I'd prefer adding bcache superblock into
> the reserved space, so I can have caching _and_ compatibility with
> grub2 etc (and avoid 400GB move):

The common way to do that is to move the beginning of the partition,
assuming your ext4 lives in a partition.

I don't see how overlapping the ext4 and the bcache backing device
starts would give you what you want, because bcache assumes the
backing device's data starts at an offset.

-- 
Vojtech Pavlik


Re: [PATCH 7/7] DWARF: add the config option

2017-05-08 Thread Vojtech Pavlik
On Sun, May 07, 2017 at 04:48:36PM -0500, Josh Poimboeuf wrote:

> > Can objtool verify the unwinder at each address in the kernel, or is that 
> > an AI-complete problem?
> 
> It can't verify the *unwinder*, but it can verify the data which is fed
> to the unwinder (either DWARF or the structs I proposed above).  For
> each function, it follows every possible code path, and it can keep
> track of the stack pointer while doing so.

In that case, the kernel build process can verify the DWARF data and its
compatibility with the kernel unwinder by running the unwinder against
each kernel code address, verifying the output, and bailing if there is
a bug in the toolchain that affects it.

-- 
Vojtech Pavlik
Director SuSE Labs


Re: [PATCH] Input: joystick: gf2k - change msleep to usleep_range for small msecs

2016-11-28 Thread Vojtech Pavlik
On Tue, Nov 29, 2016 at 01:11:49AM +0530, Aniroop Mathur wrote:

> msleep(1~20) may not do what the caller intends, and will often sleep longer.
> (~20 ms actual sleep for any value given in the 1~20ms range)
> This is not the desired behaviour for many cases like device resume time,
> device suspend time, device enable time, connection time, probe time,
> loops, retry logic, etc
> msleep is built on jiffies / legacy timers which are not precise whereas
> usleep_range is built on top of hrtimers, so the wakeups are precise.
> Thus, change msleep to usleep_range for precise wakeups.
> 
> For example:
> On a machine with tick rate / HZ of 100, msleep(4) will make the process
> sleep for a minimum period of 10 ms, whereas usleep_range(4000, 4100) will
> make sure that the process does not sleep for more than 4100 us or 4.1 ms.

And once more, patch not needed.

> 
> Signed-off-by: Aniroop Mathur <a.mat...@samsung.com>
> ---
>  drivers/input/joystick/gf2k.c | 8 
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/input/joystick/gf2k.c b/drivers/input/joystick/gf2k.c
> index 0f519db..e9d5095 100644
> --- a/drivers/input/joystick/gf2k.c
> +++ b/drivers/input/joystick/gf2k.c
> @@ -42,7 +42,7 @@ MODULE_LICENSE("GPL");
>  
>  #define GF2K_START   400 /* The time we wait for the first bit 
> [400 us] */
>  #define GF2K_STROBE  40  /* The time we wait for the first bit 
> [40 us] */
> -#define GF2K_TIMEOUT 4   /* Wait for everything to settle [4 ms] 
> */
> +#define GF2K_TIMEOUT 4000/* Wait for everything to settle [4000 
> us] */
>  #define GF2K_LENGTH  80  /* Max number of triplets in a packet */
>  
>  /*
> @@ -138,7 +138,7 @@ static void gf2k_trigger_seq(struct gameport *gameport, 
> short *seq)
>   i = 0;
>  do {
>   gameport_trigger(gameport);
> - t = gameport_time(gameport, GF2K_TIMEOUT * 1000);
> + t = gameport_time(gameport, GF2K_TIMEOUT);
>   while ((gameport_read(gameport) & 1) && t) t--;
>  udelay(seq[i]);
>  } while (seq[++i]);
> @@ -259,11 +259,11 @@ static int gf2k_connect(struct gameport *gameport, 
> struct gameport_driver *drv)
>  
>   gf2k_trigger_seq(gameport, gf2k_seq_reset);
>  
> - msleep(GF2K_TIMEOUT);
> + usleep_range(GF2K_TIMEOUT, GF2K_TIMEOUT + 100);
>  
>   gf2k_trigger_seq(gameport, gf2k_seq_digital);
>  
> - msleep(GF2K_TIMEOUT);
> + usleep_range(GF2K_TIMEOUT, GF2K_TIMEOUT + 100);
>  
>   if (gf2k_read_packet(gameport, GF2K_LENGTH, data) < 12) {
>   err = -ENODEV;
> -- 
> 2.6.2
> 

-- 
Vojtech Pavlik


Re: [PATCH] Input: joystick: analog - change msleep to usleep_range for small msecs

2016-11-28 Thread Vojtech Pavlik
On Tue, Nov 29, 2016 at 01:11:31AM +0530, Aniroop Mathur wrote:

> msleep(1~20) may not do what the caller intends, and will often sleep longer.
> (~20 ms actual sleep for any value given in the 1~20ms range)
> This is not the desired behaviour for many cases like device resume time,
> device suspend time, device enable time, connection time, probe time,
> loops, retry logic, etc
> msleep is built on jiffies / legacy timers which are not precise whereas
> usleep_range is built on top of hrtimers, so the wakeups are precise.
> Thus, change msleep to usleep_range for precise wakeups.
> 
> For example:
> On a machine with tick rate / HZ of 100, msleep(3) will make the process
> sleep for a minimum period of 10 ms, whereas usleep_range(3000, 3100) will
> make sure that the process does not sleep for more than 3100 us or 3.1 ms.

Again, not needed; if the MAX_TIME sleeps are longer, nobody cares.

> 
> Signed-off-by: Aniroop Mathur <a.mat...@samsung.com>
> ---
>  drivers/input/joystick/analog.c | 12 ++--
>  1 file changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/input/joystick/analog.c b/drivers/input/joystick/analog.c
> index 3d8ff09..2891704 100644
> --- a/drivers/input/joystick/analog.c
> +++ b/drivers/input/joystick/analog.c
> @@ -88,7 +88,7 @@ MODULE_PARM_DESC(map, "Describes analog joysticks 
> type/capabilities");
>  #define ANALOG_EXTENSIONS0x7ff00
>  #define ANALOG_GAMEPAD   0x8
>  
> -#define ANALOG_MAX_TIME  3   /* 3 ms */
> +#define ANALOG_MAX_TIME  3000/* 3000 us */
>  #define ANALOG_LOOP_TIME 2000/* 2 * loop */
>  #define ANALOG_SAITEK_DELAY  200 /* 200 us */
>  #define ANALOG_SAITEK_TIME   2000/* 2000 us */
> @@ -257,7 +257,7 @@ static int analog_cooked_read(struct analog_port *port)
>   int i, j;
>  
>   loopout = (ANALOG_LOOP_TIME * port->loop) / 1000;
> - timeout = ANALOG_MAX_TIME * port->speed;
> + timeout = (ANALOG_MAX_TIME / 1000) * port->speed;
>  
>   local_irq_save(flags);
>   gameport_trigger(gameport);
> @@ -625,20 +625,20 @@ static int analog_init_port(struct gameport *gameport, 
> struct gameport_driver *d
>  
>   gameport_trigger(gameport);
>   t = gameport_read(gameport);
> - msleep(ANALOG_MAX_TIME);
> + usleep_range(ANALOG_MAX_TIME, ANALOG_MAX_TIME + 100);
>   port->mask = (gameport_read(gameport) ^ t) & t & 0xf;
>   port->fuzz = (port->speed * ANALOG_FUZZ_MAGIC) / port->loop / 
> 1000 + ANALOG_FUZZ_BITS;
>  
>   for (i = 0; i < ANALOG_INIT_RETRIES; i++) {
>   if (!analog_cooked_read(port))
>   break;
> - msleep(ANALOG_MAX_TIME);
> + usleep_range(ANALOG_MAX_TIME, ANALOG_MAX_TIME + 100);
>   }
>  
>   u = v = 0;
>  
> - msleep(ANALOG_MAX_TIME);
> - t = gameport_time(gameport, ANALOG_MAX_TIME * 1000);
> + usleep_range(ANALOG_MAX_TIME, ANALOG_MAX_TIME + 100);
> +     t = gameport_time(gameport, ANALOG_MAX_TIME);
>   gameport_trigger(gameport);
>   while ((gameport_read(port->gameport) & port->mask) && (u < t))
>   u++;
> -- 
> 2.6.2
> 

-- 
Vojtech Pavlik


Re: [PATCH] Input: joystick: sidewinder - change msleep to usleep_range for small msecs

2016-11-28 Thread Vojtech Pavlik
ital */
> - msleep(SW_TIMEOUT);
> + usleep_range(SW_TIMEOUT, SW_TIMEOUT + 100);
>   i = sw_read_packet(gameport, buf, SW_LENGTH, 0);/* 
> Retry reading packet */
> - msleep(SW_TIMEOUT);
> + usleep_range(SW_TIMEOUT, SW_TIMEOUT + 100);
>   dbg("Init 1b: Length %d.", i);
>   if (!i) {   /* No 
> data -> FAIL */
>   err = -ENODEV;
> @@ -636,7 +636,7 @@ static int sw_connect(struct gameport *gameport, struct 
> gameport_driver *drv)
>   dbg("Init 2: Mode %d. ID Length %d.", m, j);
>  
>   if (j <= 0) {   /* Read 
> ID failed. Happens in 1-bit mode on PP */
> - msleep(SW_TIMEOUT);
> + usleep_range(SW_TIMEOUT, SW_TIMEOUT + 100);
>   i = sw_read_packet(gameport, buf, SW_LENGTH, 0);/* 
> Retry reading packet */
>   m |= sw_guess_mode(buf, i);
>   dbg("Init 2b: Mode %d. Length %d.", m, i);
> @@ -644,7 +644,7 @@ static int sw_connect(struct gameport *gameport, struct 
> gameport_driver *drv)
>   err = -ENODEV;
>   goto fail2;
>   }
> - msleep(SW_TIMEOUT);
> + usleep_range(SW_TIMEOUT, SW_TIMEOUT + 100);
>   j = sw_read_packet(gameport, idbuf, SW_LENGTH, i);  /* 
> Retry reading ID */
>       dbg("Init 2c: ID Length %d.", j);
>   }
> @@ -655,7 +655,7 @@ static int sw_connect(struct gameport *gameport, struct 
> gameport_driver *drv)
>  
>   do {
>   k--;
> - msleep(SW_TIMEOUT);
> + usleep_range(SW_TIMEOUT, SW_TIMEOUT + 100);
>   i = sw_read_packet(gameport, buf, SW_LENGTH, 0);/* Read 
> data packet */
>   dbg("Init 3: Mode %d. Length %d. Last %d. Tries %d.", m, i, l, 
> k);
>  
> -- 
> 2.6.2
> 

-- 
Vojtech Pavlik


Re: [RFC PATCH v1.9 12/14] livepatch: create per-task consistency model

2016-04-05 Thread Vojtech Pavlik
On Mon, Apr 04, 2016 at 01:33:07PM -0500, Josh Poimboeuf wrote:
> On Mon, Apr 04, 2016 at 08:27:59PM +0200, Vojtech Pavlik wrote:
> > On Mon, Apr 04, 2016 at 01:21:38PM -0500, Josh Poimboeuf wrote:
> > 
> > > Hm, can you explain why it should be copied from the parent?
> > > 
> > > I'm thinking the above code is correct for today, but it should still be
> > > changed to be more future-proof.
> > > 
> > > Here's my thinking:
> > > 
> > > A forked task starts out with no stack, so if I understand correctly, it
> > > can safely start out in the goal universe, regardless of which universe
> > > its parent belongs to.
> > > 
> > > However, the current ret_from_fork code is a mess, and Andy Lutomirski
> > > has mentioned that he would like to give newly forked tasks a proper
> > > stack such that instead of jumping to ret_from_fork, they would just
> > > return from schedule().  In that case, it would no longer be safe to
> > > start the new task in the goal universe because it could be "sleeping"
> > > on a to-be-patched function.
> > > 
> > > So for proper future proofing, newly forked tasks should be started in
> > > the initial universe (rather than starting in the goal universe or
> > > inheriting the parent's universe).  They can then be transitioned over
> > > to the goal universe like any other task.  How does that sound?
> > 
> > How could a newly forked task start in the old universe if its parent
> > has already been migrated? Any context it inherits will already be from
> > the new universe.
> 
> Can you be more specific about "context" and why inheritance of it would
> be a problem?

Currently a forked task starts out with no stack, and as such it can
start in the goal universe.

If we create a synthetic stack, then we may need to start in the initial
universe, as the synthetic stack would likely be created using initial
universe return addresses. 

If we simply copy the stack of the parent process, which is in my
opinion the safest way, as it makes few assumptions about the
compiler, then it may contain either old or new addresses,
and we need to copy the universe flag along with it.

-- 
Vojtech Pavlik
Director SUSE Labs


Re: [RFC PATCH v1.9 12/14] livepatch: create per-task consistency model

2016-04-05 Thread Vojtech Pavlik
On Mon, Apr 04, 2016 at 01:33:07PM -0500, Josh Poimboeuf wrote:
> On Mon, Apr 04, 2016 at 08:27:59PM +0200, Vojtech Pavlik wrote:
> > On Mon, Apr 04, 2016 at 01:21:38PM -0500, Josh Poimboeuf wrote:
> > 
> > > Hm, can you explain why it should be copied from the parent?
> > > 
> > > I'm thinking the above code is correct for today, but it should still be
> > > changed to be more future-proof.
> > > 
> > > Here's my thinking:
> > > 
> > > A forked task starts out with no stack, so if I understand correctly, it
> > > can safely start out in the goal universe, regardless of which universe
> > > its parent belongs to.
> > > 
> > > However, the current ret_from_fork code is a mess, and Andy Lutomirski
> > > has mentioned that he would like to give newly forked tasks a proper
> > > stack such that instead of jumping to ret_from_fork, they would just
> > > return from schedule().  In that case, it would no longer be safe to
> > > start the new task in the goal universe because it could be "sleeping"
> > > on a to-be-patched function.
> > > 
> > > So for proper future proofing, newly forked tasks should be started in
> > > the initial universe (rather than starting in the goal universe or
> > > inheriting the parent's universe).  They can then be transitioned over
> > > to the goal universe like any other task.  How does that sound?
> > 
> > How could a newly forked task start in the old universe if its parent
> > has already been migrated? Any context it inherits will already be from
> > the new universe.
> 
> Can you be more specific about "context" and why inheritance of it would
> be a problem?

Currently a forked task starts out with no stack, and as such it can
start in the goal universe.

If we create a synthetic stack, then we may need to start in the initial
universe, as the synthetic stack would likely be created using initial
universe return addresses. 

If we simply copy the stack of the parent process, which is in my
opinion the safest way, as it places the fewest assumptions on the
compiler, then it may contain either old or new addresses,
and we need to copy the universe flag along with it.

-- 
Vojtech Pavlik
Director SUSE Labs


Re: [RFC PATCH v1.9 12/14] livepatch: create per-task consistency model

2016-04-04 Thread Vojtech Pavlik
On Mon, Apr 04, 2016 at 01:21:38PM -0500, Josh Poimboeuf wrote:

> Hm, can you explain why it should be copied from the parent?
> 
> I'm thinking the above code is correct for today, but it should still be
> changed to be more future-proof.
> 
> Here's my thinking:
> 
> A forked task starts out with no stack, so if I understand correctly, it
> can safely start out in the goal universe, regardless of which universe
> its parent belongs to.
> 
> However, the current ret_from_fork code is a mess, and Andy Lutomirski
> has mentioned that he would like to give newly forked tasks a proper
> stack such that instead of jumping to ret_from_fork, they would just
> return from schedule().  In that case, it would no longer be safe to
> start the new task in the goal universe because it could be "sleeping"
> on a to-be-patched function.
> 
> So for proper future proofing, newly forked tasks should be started in
> the initial universe (rather than starting in the goal universe or
> inheriting the parent's universe).  They can then be transitioned over
> to the goal universe like any other task.  How does that sound?

How could a newly forked task start in the old universe if its parent
has already been migrated? Any context it inherits will already be from
the new universe.

-- 
Vojtech Pavlik
Director SuSE Labs


Re: [PATCH] livepatch: Update maintainers

2016-03-18 Thread Vojtech Pavlik
On Wed, Mar 16, 2016 at 10:03:36AM -0500, Josh Poimboeuf wrote:

> Seth and Vojtech are no longer active maintainers of livepatch, so
> remove them in favor of Jessica and Miroslav.
> 
> Also add Petr as a designated reviewer.
> 
> Signed-off-by: Josh Poimboeuf <jpoim...@redhat.com>
> ---
>  MAINTAINERS | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 860e306..e04e0a5 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -6583,9 +6583,10 @@ F: drivers/platform/x86/hp_accel.c
>  
>  LIVE PATCHING
>  M:   Josh Poimboeuf <jpoim...@redhat.com>
> -M:   Seth Jennings <sjenn...@redhat.com>
> +M:   Jessica Yu <j...@redhat.com>
>  M:   Jiri Kosina <ji...@kernel.org>
> -M:   Vojtech Pavlik <vojt...@suse.com>
> +M:   Miroslav Benes <mbe...@suse.cz>
> +R:   Petr Mladek <pmla...@suse.com>
>  S:   Maintained
>  F:   kernel/livepatch/
>  F:   include/linux/livepatch.h
> -- 
> 2.4.3

Acked-by: Vojtech Pavlik <vojt...@suse.com>

-- 
Vojtech Pavlik
Director SuSE Labs


[PATCH] bcache: Fix writeback_thread never writing back incomplete stripes.

2015-09-05 Thread Vojtech Pavlik
Fix writeback_thread never finishing writing back all dirty data in bcache when
partial_stripes_expensive is set, and spinning, consuming 100% of CPU instead.

Signed-off-by: Vojtech Pavlik 
---

This is a fix for the current upstream bcache, not the devel branch.

If partial_stripes_expensive is set for a cache set, then writeback_thread
always attempts to write full stripes out back to the backing device first.
However, since it does that based on a bitmap and not a simple linear
search, like the rest of the code of refill_dirty(), it changes the
last_scanned pointer so that it never points to 'end'. refill_dirty() then
never tries to scan from 'start', resulting in the writeback_thread
looping, consuming 100% of CPU, but never making progress in writing out
the incomplete dirty stripes in the cache.

Scanning the tree after not finding enough full stripes fixes the issue.

Incomplete dirty stripes are written to the backing device, the device
eventually reaches a clean state if nothing is dirtying data, and
writeback_thread sleeps. This also fixes the problem that the cache
device cannot be detached in the partial_stripes_expensive scenario.

It may be more efficient to separate the last_scanned field for normal and
stripe scans instead.

 drivers/md/bcache/writeback.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
index f1986bc..6f8b81d 100644
--- a/drivers/md/bcache/writeback.c
+++ b/drivers/md/bcache/writeback.c
@@ -382,6 +382,7 @@ static bool refill_dirty(struct cached_dev *dc)
 		refill_full_stripes(dc);
 		if (array_freelist_empty(&buf->freelist))
 			return false;
+		bch_refill_keybuf(dc->disk.c, buf, &end, dirty_pred);
 	}
 
 	if (bkey_cmp(&buf->last_scanned, &end) >= 0) {
-- 
2.1.4

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v5 02/10] x86: Compile-time asm code validation

2015-06-10 Thread Vojtech Pavlik
On Wed, Jun 10, 2015 at 10:21:36AM -0700, Andy Lutomirski wrote:
> On Jun 10, 2015 5:07 AM, "Josh Poimboeuf"  wrote:
> >
> > Add a new CONFIG_ASM_VALIDATION option which adds an asmvalidate host
> > tool which runs on every compiled .S file.  Its goal is to enforce sane
> > rules on all asm code, so that stack debug metadata (frame/back chain
> > pointers and/or DWARF CFI metadata) can be made reliable.
> >
> > It enforces the following rules:
> >
> > 1. Each callable function must be annotated with the ELF STT_FUNC type.
> >This is typically done using the ENTRY/ENDPROC macros.  If
> >asmvalidate finds a return instruction outside of a function, it
> >flags an error, since that usually indicates callable code which
> >should be annotated accordingly.
> >
> > 2. Each callable function must never leave its own bounds (i.e. with a
> >jump to outside the function) except when returning.
> 
> Won't that break with sibling/tail calls?  GCC can generate those, and
> the ia32_ptregs_common label is an example of such a thing.

It only validates hand-written assembly, so how could it?

-- 
Vojtech Pavlik
Director SUSE Labs


Re: [PATCH v4] Add virtio-input driver.

2015-03-24 Thread Vojtech Pavlik
On Wed, Mar 25, 2015 at 01:51:43PM +1030, Rusty Russell wrote:

> Gerd Hoffmann  writes:
> > virtio-input is basically evdev-events-over-virtio, so this driver isn't
> > much more than reading configuration from config space and forwarding
> > incoming events to the linux input layer.
> >
> > Signed-off-by: Gerd Hoffmann 
> 
> Is the input layer sane?  I've never dealt with it, so I don't know.

I'm rather biased having designed it, but I'd say it's reasonable. It
certainly has limitations and design mistakes, but none are too bad.
One testimony to that is that Android has based its own Input API around it.

> What's it used for?

For all human input devices, like keyboards, mice, touchscreens, etc. 

> Imagine a future virtio standard which incorporates this.  And a Windows
> or FreeBSD implementation of the device and or driver.  How ugly would
> they be?

A windows translation layer is fairly easy, people porting software from
Windows to Linux told me numerous times that adapting isn't hard. I also
believe that at least one of the BSD's has a compatible implementation
these days based on the fact that I was asked to allow copying the
headers in the past.

-- 
Vojtech Pavlik
Director SUSE Labs


Re: live kernel upgrades (was: live kernel patching design)

2015-02-24 Thread Vojtech Pavlik
On Tue, Feb 24, 2015 at 11:23:29AM +0100, Ingo Molnar wrote:

> > Your upgrade proposal is an *enormous* disruption to the 
> > system:
> > 
> > - a latency of "well below 10" seconds is completely
> >   unacceptable to most users who want to patch the kernel 
> >   of a production system _while_ it's in production.
> 
> I think this statement is false for the following reasons.

The statement is very true.

>   - I'd say the majority of system operators of production 
> systems can live with a couple of seconds of delay at a 
> well defined moment of the day or week - with gradual, 
> pretty much open ended improvements in that latency 
> down the line.

In the most usual corporate setting, any noticeable outage, even out of
business hours, requires advance notice and the agreement of all
stakeholders - teams that depend on the system.

If a live patching technology introduces an outage, it's not "live" and
because of these bureaucratic reasons, it will not be used and a regular
reboot will be scheduled instead.

>   - I think your argument ignores the fact that live 
> upgrades would extend the scope of 'users willing to 
> patch the kernel of a production system' _enormously_. 
> 
> For example, I have a production system with this much 
> uptime:
> 
>10:50:09 up 153 days,  3:58, 34 users,  load average: 0.00, 0.02, 0.05
> 
> While currently I'm reluctant to reboot the system to 
> upgrade the kernel (due to a reboot's intrusiveness), 
> and that is why it has achieved a relatively high 
> uptime, but I'd definitely allow the kernel to upgrade 
> at 0:00am just fine. (I'd even give it up to a few 
> minutes, as long as TCP connections don't time out.)
> 
> And I don't think my usecase is special.

I agree that this is useful. But it is a different problem that only
partially overlaps with what we're trying to achieve with live patching.

If you can make full kernel upgrades to work this way, which I doubt is
achievable in the next 10 years due to all the research and
infrastructure needed, then you certainly gain an additional group of
users. And a great tool. A large portion of those that ask for live
patching won't use it, though.

But honestly, I prefer a solution that works for small patches now, than
a solution for unlimited patches sometime in next decade.

> What gradual improvements in live upgrade latency am I 
> talking about?
> 
>  - For example the majority of pure user-space process 
>pages in RAM could be saved from the old kernel over 
>into the new kernel - i.e. they'd stay in place in RAM, 
>but they'd be re-hashed for the new data structures. 
>This avoids a big chunk of checkpointing overhead.

I'd have hoped this would be a given. If you can't preserve memory
contents and have to re-load from disk, you can just as well reboot
entirely; the time needed will not be much more.

>  - Likewise, most of the page cache could be saved from an
>old kernel to a new kernel as well - further reducing
>checkpointing overhead.
> 
>  - The PROT_NONE mechanism of the current NUMA balancing
>code could be used to transparently mark user-space 
>pages as 'checkpointed'. This would reduce system 
>interruption as only 'newly modified' pages would have 
>to be checkpointed when the upgrade happens.
> 
>  - Hardware devices could be marked as 'already in well
>defined state', skipping the more expensive steps of 
>driver initialization.
> 
>  - Possibly full user-space page tables could be preserved 
>over an upgrade: this way user-space execution would be 
>unaffected even in the micro level: cache layout, TLB
>patterns, etc.
> 
> There's lots of gradual speedups possible with such a model 
> IMO.

Yes, as I say above, guaranteeing decades of employment. ;)

> With live kernel patching we run into a brick wall of 
> complexity straight away: we have to analyze the nature of 
> the kernel modification, in the context of live patching, 
> and that only works for the simplest of kernel 
> modifications.

But you're able to _use_ it.

> With live kernel upgrades no such brick wall exists, just 
> about any transition between kernel versions is possible.

The brick wall you run to is "I need to implement full kernel state
serialization before I can do anything at all." That's something that
isn't even clear _how_ to do. Particularly with the Linux kernel's
development model, where internal ABI and structures are always in flux,
it may not even be realistic.

> Granted, with live kernel upgrades it's much more complex 
> to get the 'simple' case into an even rudimentarily working 
> fashion (full userspace state has to be enumerated, 

Re: live kernel upgrades (was: live kernel patching design)

2015-02-24 Thread Vojtech Pavlik
On Tue, Feb 24, 2015 at 11:53:28AM +0100, Ingo Molnar wrote:
> 
> * Jiri Kosina  wrote:
> 
> > [...] We could optimize the kernel the craziest way we 
> > can, but hardware takes its time to reinitialize. And in 
> > most cases, you'd really need to reinitalize it; [...]
> 
> If we want to reinitialize a device, most of the longer 
> initialization latencies during bootup these days involve 
> things like: 'poke hardware, see if there's any response'. 
> Those are mostly going away quickly with modern, 
> well-enumerated hardware interfaces.
> 
> Just try a modprobe of a random hardware driver - most 
> initialization sequences are very fast. (That's how people 
> are able to do cold bootups in less than 1 second.)

Have you ever tried to boot a system with a large (> 100) number of
drives connected over FC? That takes time to discover and you have to do
the discovery as the configuration could have changed while you were not
looking.

Or a machine with terabytes of memory? Just initializing the memory
takes minutes.

Or a desktop with USB? And you have to reinitialize the USB bus and the
state of all the USB devices, because an application might be accessing
files on an USB drive.

> In theory this could also be optimized: we could avoid the 
> reinitialization step through an upgrade via relatively 
> simple means, for example if drivers define their own 
> version and the new kernel's driver checks whether the 
> previous state is from a compatible driver. Then the new 
> driver could do a shorter initialization sequence.

There you're clearly getting into the "so complex to maintain that it'll never
work reliably" territory.

> But I'd only do it only in special cases, where for some 
> reason the initialization sequence takes longer time and it 
> makes sense to share hardware discovery information between 
> two versions of the driver. I'm not convinced such a 
> mechanism is necessary in the general case.

-- 
Vojtech Pavlik
Director SUSE Labs


Re: live kernel upgrades (was: live kernel patching design)

2015-02-24 Thread Vojtech Pavlik
On Tue, Feb 24, 2015 at 10:44:05AM +0100, Ingo Molnar wrote:

> > This is the most common argument that's raised when live 
> > patching is discussed. "Why do need live patching when we 
> > have redundancy?"
> 
> My argument is that if we start off with a latency of 10 
> seconds and improve that gradually, it will be good for 
> everyone with a clear, actionable route for even those who 
> cannot take a 10 seconds delay today.

Sure, we can do it that way. 

Or do it in the other direction.

Today we have a tool (livepatch) in the kernel that can apply trivial
single-function fixes without a measurable disruption to applications.

And we can improve it gradually to expand the range of fixes it can
apply.

Dependent functions can be done by kGraft's lazy migration.

Limited data structure changes can be handled by shadowing.

Major data structure and/or locking changes require stopping the kernel,
and trapping all tasks at the kernel/userspace boundary is clearly the
cleanest way to do that. It comes at a steep latency cost, though.

Full code replacement without change scope consideration requires full
serialization and deserialization of hardware and userspace
interface state, which is something we don't have today and would
require work on every single driver. Possible, but probably a decade of
effort.

With this approach you have something useful at every point and every
piece of effort put in gives you a reward.

> Lets see the use cases:
> 
> > [...] Examples would be legacy applications which can't 
> > run in an active-active cluster and need to be restarted 
> > on failover.
> 
> Most clusters (say web frontends) can take a stoppage of a 
> couple of seconds.

It's easy to find examples of workloads that can be stopped. It doesn't
rule out a significant set of those where stopping them is very
expensive.

> > Another usecase is large HPC clusters, where all nodes 
> > have to run carefully synchronized. Once one gets behind 
> > in a calculation cycle, others have to wait for the 
> > results and the efficiency of the whole cluster goes 
> > down. [...]
> 
> I think calculation nodes on large HPC clusters qualify as 
> the specialized case that I mentioned, where the update 
> latency could be brought down into the 1 second range.
> 
> But I don't think calculation nodes are patched in the 
> typical case: you might want to patch Internet facing 
> frontend systems, the rest is left as undisturbed as 
> possible. So I'm not even sure this is a typical usecase.

They're not patched for security bugs, but stability bugs are an
important issue for multi-month calculations.

> In any case, there's no hard limit on how fast such a 
> kernel upgrade can get in principle, and the folks who care 
> about that latency will sure help out optimizing it and 
> many HPC projects are well funded.

So far, unless you come up with an effective solution, if you're
catching all tasks at the kernel/userspace boundary (the "Kragle"
approach), the service interruption is effectively unbounded due to
tasks in D state.

> > The value of live patching is in near zero disruption.
> 
> Latency is a good attribute of a kernel upgrade mechanism, 
> but it's by far not the only attribute and we should 
> definitely not design limitations into the approach and 
> hurt all the other attributes, just to optimize that single 
> attribute.

It's an attribute I'm not willing to give up. On the other hand, I
definitely wouldn't argue against having modes of operation where the
latency is higher and the tool is more powerful.

> I.e. don't make it a single-issue project.

There is no need to worry about that. 

-- 
Vojtech Pavlik
Director SUSE Labs


Re: live kernel upgrades (was: live kernel patching design)

2015-02-24 Thread Vojtech Pavlik
On Tue, Feb 24, 2015 at 10:44:05AM +0100, Ingo Molnar wrote:

  This is the most common argument that's raised when live 
  patching is discussed. Why do need live patching when we 
  have redundancy?
 
 My argument is that if we start off with a latency of 10 
 seconds and improve that gradually, it will be good for 
 everyone with a clear, actionable route for even those who 
 cannot take a 10 seconds delay today.

Sure, we can do it that way. 

Or do it in the other direction.

Today we have a tool (livepatch) in the kernel that can apply trivial
single-function fixes without a measurable disruption to applications.

And we can improve it gradually to expand the range of fixes it can
apply.

Dependent functions can be done by kGraft's lazy migration.

Limited data structure changes can be handled by shadowing.

Major data structure and/or locking changes require stopping the kernel,
and trapping all tasks at the kernel/userspace boundary is clearly the
cleanest way to do that. I comes at a steep latency cost, though.

Full code replacement without change scope consideration requires full
serialization and deserialization of hardware and userspace
interface state, which is something we don't have today and would
require work on every single driver. Possible, but probably a decade of
effort.

With this approach you have something useful at every point and every
piece of effort put in gives you a rewars.

 Lets see the use cases:
 
  [...] Examples would be legacy applications which can't 
  run in an active-active cluster and need to be restarted 
  on failover.
 
 Most clusters (say web frontends) can take a stoppage of a 
 couple of seconds.

It's easy to find examples of workloads that can be stopped. It doesn't
rule out a significant set of those where stopping them is very
expensive.

  Another usecase is large HPC clusters, where all nodes 
  have to run carefully synchronized. Once one gets behind 
  in a calculation cycle, others have to wait for the 
  results and the efficiency of the whole cluster goes 
  down. [...]
 
 I think calculation nodes on large HPC clusters qualify as 
 the specialized case that I mentioned, where the update 
 latency could be brought down into the 1 second range.
 
 But I don't think calculation nodes are patched in the 
 typical case: you might want to patch Internet facing 
 frontend systems, the rest is left as undisturbed as 
 possible. So I'm not even sure this is a typical usecase.

They're not patched for security bugs, but stability bugs are an
important issue for multi-month calculations.

 In any case, there's no hard limit on how fast such a 
 kernel upgrade can get in principle, and the folks who care 
 about that latency will sure help out optimizing it and 
 many HPC projects are well funded.

So far, unless you come up with an effective solutions, if you're
catching all tasks at the kernel/userspace boundary (the Kragle
approach), the service interruption is effectively unbounded due to
tasks in D state.

  The value of live patching is in near zero disruption.
 
 Latency is a good attribute of a kernel upgrade mechanism, 
 but it's by far not the only attribute and we should 
 definitely not design limitations into the approach and 
 hurt all the other attributes, just to optimize that single 
 attribute.

It's an attribute I'm not willing to give up. On the other hand, I
definitely wouldn't argue against having modes of operation where the
latency is higher and the tool is more powerful.

 I.e. don't make it a single-issue project.

There is no need to worry about that. 

-- 
Vojtech Pavlik
Director SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: live kernel upgrades (was: live kernel patching design)

2015-02-24 Thread Vojtech Pavlik
On Tue, Feb 24, 2015 at 11:23:29AM +0100, Ingo Molnar wrote:

> > Your upgrade proposal is an *enormous* disruption to the
> > system:
> >
> > - a latency of well below 10 seconds is completely
> >   unacceptable to most users who want to patch the kernel
> >   of a production system _while_ it's in production.
 
> I think this statement is false for the following reasons.

The statement is very true.

>  - I'd say the majority of system operators of production
>    systems can live with a couple of seconds of delay at a
>    well defined moment of the day or week - with gradual,
>    pretty much open ended improvements in that latency
>    down the line.

In the most usual corporate setting any noticeable outage, even out of
business hours, requires advance notice and the agreement of all
stakeholders - teams that depend on the system.

If a live patching technology introduces an outage, it's not live and
because of these bureaucratic reasons, it will not be used and a regular
reboot will be scheduled instead.

>  - I think your argument ignores the fact that live
>    upgrades would extend the scope of 'users willing to
>    patch the kernel of a production system' _enormously_.
>
>    For example, I have a production system with this much
>    uptime:
>
>     10:50:09 up 153 days,  3:58, 34 users,  load average: 0.00, 0.02, 0.05
>
>    While currently I'm reluctant to reboot the system to
>    upgrade the kernel (due to a reboot's intrusiveness),
>    and that is why it has achieved a relatively high
>    uptime, but I'd definitely allow the kernel to upgrade
>    at 0:00am just fine. (I'd even give it up to a few
>    minutes, as long as TCP connections don't time out.)
>
>    And I don't think my usecase is special.

I agree that this is useful. But it is a different problem that only
partially overlaps with what we're trying to achieve with live patching.

If you can make full kernel upgrades to work this way, which I doubt is
achievable in the next 10 years due to all the research and
infrastructure needed, then you certainly gain an additional group of
users. And a great tool. A large portion of those that ask for live
patching won't use it, though.

But honestly, I prefer a solution that works for small patches now, than
a solution for unlimited patches sometime in next decade.

> What gradual improvements in live upgrade latency am I
> talking about?
>
>  - For example the majority of pure user-space process
>    pages in RAM could be saved from the old kernel over
>    into the new kernel - i.e. they'd stay in place in RAM,
>    but they'd be re-hashed for the new data structures.
>    This avoids a big chunk of checkpointing overhead.

I'd have hoped this would be a given. If you can't preserve memory
contents and have to re-load from disk, you can just as well reboot
entirely; the time needed will not be much more.

>  - Likewise, most of the page cache could be saved from an
>    old kernel to a new kernel as well - further reducing
>    checkpointing overhead.
>
>  - The PROT_NONE mechanism of the current NUMA balancing
>    code could be used to transparently mark user-space
>    pages as 'checkpointed'. This would reduce system
>    interruption as only 'newly modified' pages would have
>    to be checkpointed when the upgrade happens.
>
>  - Hardware devices could be marked as 'already in well
>    defined state', skipping the more expensive steps of
>    driver initialization.
>
>  - Possibly full user-space page tables could be preserved
>    over an upgrade: this way user-space execution would be
>    unaffected even in the micro level: cache layout, TLB
>    patterns, etc.
>
> There's lots of gradual speedups possible with such a model
> IMO.

Yes, as I say above, guaranteeing decades of employment. ;)

> With live kernel patching we run into a brick wall of
> complexity straight away: we have to analyze the nature of
> the kernel modification, in the context of live patching,
> and that only works for the simplest of kernel
> modifications.

But you're able to _use_ it.

> With live kernel upgrades no such brick wall exists, just
> about any transition between kernel versions is possible.

The brick wall you run into is "I need to implement full kernel state
serialization before I can do anything at all". That's something that
isn't even clear _how_ to do. Particularly with Linux kernel's
development model where internal ABI and structures are always in flux
it may not even be realistic.

> Granted, with live kernel upgrades it's much more complex
> to get the 'simple' case into an even rudimentarily working
> fashion (full userspace state has to be enumerated, saved
> and restored), but once we are there, it's a whole new
> category of goodness and it probably covers 90%+ of the
> live kernel patching usecases on day 1 already ...

Feel free to start working on it. I'll stick with live patching.

-- 
Vojtech Pavlik
Director SUSE Labs

Re: live kernel upgrades (was: live kernel patching design)

2015-02-24 Thread Vojtech Pavlik
On Tue, Feb 24, 2015 at 11:53:28AM +0100, Ingo Molnar wrote:
 
> * Jiri Kosina jkos...@suse.cz wrote:
>
> > [...] We could optimize the kernel the craziest way we
> > can, but hardware takes its time to reinitialize. And in
> > most cases, you'd really need to reinitialize it; [...]
 
> If we want to reinitialize a device, most of the longer
> initialization latencies during bootup these days involve
> things like: 'poke hardware, see if there's any response'.
> Those are mostly going away quickly with modern,
> well-enumerated hardware interfaces.
>
> Just try a modprobe of a random hardware driver - most
> initialization sequences are very fast. (That's how people
> are able to do cold bootups in less than 1 second.)

Have you ever tried to boot a system with a large (> 100) number of
drives connected over FC? That takes time to discover and you have to do
the discovery as the configuration could have changed while you were not
looking.

Or a machine with terabytes of memory? Just initializing the memory
takes minutes.

Or a desktop with USB? And you have to reinitialize the USB bus and the
state of all the USB devices, because an application might be accessing
files on an USB drive.

> In theory this could also be optimized: we could avoid the
> reinitialization step through an upgrade via relatively
> simple means, for example if drivers define their own
> version and the new kernel's driver checks whether the
> previous state is from a compatible driver. Then the new
> driver could do a shorter initialization sequence.

There you're clearly getting into the "so complex to maintain that it'll
never work reliably" territory.

> But I'd do it only in special cases, where for some
> reason the initialization sequence takes longer time and it
> makes sense to share hardware discovery information between
> two versions of the driver. I'm not convinced such a
> mechanism is necessary in the general case.

-- 
Vojtech Pavlik
Director SUSE Labs


Re: live kernel upgrades (was: live kernel patching design)

2015-02-23 Thread Vojtech Pavlik
On Mon, Feb 23, 2015 at 11:42:17AM +0100, Richard Weinberger wrote:

> > Of course, if you are random Joe User, you can do whatever you want, i.e.
> > also compile your own home-brew patches and apply them randomly and brick
> > your system that way. But that's in no way different to what you as Joe
> > User can do today; there is nothing that will prevent you from shooting
> > yourself in a foot if you are creative.
> 
> Sorry if I ask something that got already discussed, I did not follow
> the whole live-patching discussion.
> 
> How much of the userspace tools will be public available?
> With live-patching mainline the kernel offers the mechanism, but
> random Joe user still needs
> the tools to create good live patches.

All the tools for kGraft and kpatch are available in public git
repositories.

Also, while kGraft has tools to automate the generation of patches,
these are generally not required to create a patch.

-- 
Vojtech Pavlik
Director SUSE Labs


Re: live kernel upgrades (was: live kernel patching design)

2015-02-22 Thread Vojtech Pavlik
On Sun, Feb 22, 2015 at 03:01:48PM -0800, Andrew Morton wrote:

> On Sun, 22 Feb 2015 20:13:28 +0100 (CET) Jiri Kosina  wrote:
> 
> > But if you ask the folks who are hungry for live bug patching, they 
> > wouldn't care.
> > 
> > You mentioned "10 seconds", that's more or less equal to infinity to them. 
> 
> 10 seconds outage is unacceptable, but we're running our service on a
> single machine with no failover.  Who is doing this??

This is the most common argument that's raised when live patching is
discussed. "Why do we need live patching when we have redundancy?"

People who are asking for live patching typically do have failover in
place, but prefer not to have to use it when they don't have to.

In many cases, the failover just can't be made transparent to the
outside world and there is a short outage. Examples would be legacy
applications which can't run in an active-active cluster and need to be
restarted on failover. Or trading systems, where the calculations must
be strictly serialized and response times are counted in tens of
microseconds. 

Another usecase is large HPC clusters, where all nodes have to run
carefully synchronized. Once one gets behind in a calculation cycle,
others have to wait for the results and the efficiency of the whole
cluster goes down. There are people who run realtime on them for
that reason. Dumping all data and restarting the HPC cluster takes a lot
of time and many nodes (out of tens of thousands) may not come back up,
making the restore from media difficult. Doing a rolling upgrade causes
the nodes one by one stall by 10+ seconds, which times 10k is a long
time, too.

And even the case where you have a perfect setup with everything
redundant and with instant failover does benefit from live patching.
Since you have to plan for failure, you have to plan for failure while
patching, too. With live patching you need 2 servers minimum (or N+1),
without you need 3 (or N+2), as one will be offline during the
upgrade process.

10 seconds of outage may be acceptable in a disaster scenario. Not
necessarily for a regular update scenario.

The value of live patching is in near zero disruption.

-- 
Vojtech Pavlik
Director SUSE Labs



Re: live patching design (was: Re: [PATCH 1/3] sched: add sched_task_call())

2015-02-20 Thread Vojtech Pavlik
On Fri, Feb 20, 2015 at 08:49:01PM +0100, Ingo Molnar wrote:

> > ... the choice the sysadmins have here is either have the 
> > system running in an intermediate state, or have the 
> > system completely dead for the *same time*. Because to 
> > finish the transition successfully, all the tasks have to 
> > be woken up in any case.
> 
> That statement is false: an 'intermediate state' system 
> where 'new' tasks are still running is still running and 
> will interfere with the resolution of 'old' tasks.

Can you suggest a way in which they would interfere? The transition happens
on entering or returning from a syscall, there is no influence between
individual tasks.

If you mean that the patch author has to consider the fact that both old
and new code will be running simultaneously, then yes, they have to.

> > But I do get your point; what you are basically saying is 
> > that your preference is what kgraft is doing, and option 
> > to allow for a global synchronization point should be 
> > added in parallel to the gradual lazy migration.
> 
> I think you misunderstood: the 'simple' method I outlined 
> does not just 'synchronize', it actually executes the live 
> patching atomically, once all tasks are gathered and we 
> know they are _all_ in a safe state.

The 'simple' method has to catch and freeze all tasks one by one in
syscall entry/exit, at the kernel/userspace boundary, until all are
frozen and then patch the system atomically. 

This means that each and every sleeping task in the system has to be
woken up in some way (sending a signal ...) to exit from a syscall it is
sleeping in. Same for CPU hogs. All kernel threads need to be parked.

This is exactly what you need to do for kGraft to complete patching.

This may take a significant amount of time to achieve and you won't be
able to use a userspace script to send the signals, because it'd be
frozen immediately.

> I.e. it's in essence the strong stop-all atomic patching 
> model of 'kpatch', combined with the reliable avoidance of 
> kernel stacks that 'kgraft' uses.

> That should be the starting point, because it's the most 
> reliable method.

In the consistency models discussion, this was marked the
"LEAVE_KERNEL+SWITCH_KERNEL" model. It's indeed the strongest model of
all, but also comes at the highest cost in terms of impact on running
tasks. It's so high (the interruption may be seconds or more) that it
was deemed not worth implementing.

-- 
Vojtech Pavlik
Director SUSE Labs



Re: [PATCH 1/3] sched: add sched_task_call()

2015-02-19 Thread Vojtech Pavlik
On Thu, Feb 19, 2015 at 11:32:55AM -0600, Josh Poimboeuf wrote:
> On Thu, Feb 19, 2015 at 06:19:29PM +0100, Vojtech Pavlik wrote:
> > On Thu, Feb 19, 2015 at 11:03:53AM -0600, Josh Poimboeuf wrote:
> > > On Thu, Feb 19, 2015 at 05:33:59PM +0100, Vojtech Pavlik wrote:
> > > > On Thu, Feb 19, 2015 at 10:24:29AM -0600, Josh Poimboeuf wrote:
> > > > 
> > > > > > No, these tasks will _never_ make syscalls. So you need to guarantee
> > > > > > they don't accidentally enter the kernel while you flip them. 
> > > > > > Something
> > > > > > like so should do.
> > > > > > 
> > > > > > You set TIF_ENTER_WAIT on them, check they're still in userspace, 
> > > > > > flip
> > > > > > them then clear TIF_ENTER_WAIT.
> > > > > 
> > > > > Ah, that's a good idea.  But how do we check if they're in user space?
> > > > 
> > > > I don't see the benefit in holding them in a loop - you can just as well
> > > > flip them from the syscall code as kGraft does.
> > > 
> > > But we were talking specifically about HPC tasks which never make
> > > syscalls.
> > 
> > Yes. I'm saying that rather than guaranteeing they don't enter the
> > kernel (by having them spin) you can flip them in case they try to do
> > that instead. That solves the race condition just as well.
> 
> Ok, gotcha.
> 
> We'd still need a safe way to check if they're in user space though.

Having a safe way would be very nice and actually quite useful in other
cases, too.

For this specific purpose, however, we don't need a very safe way.
We don't require atomicity in any way, we don't mind even if it
creates false negatives, only false positives would be bad.

kGraft looks at the stacktrace of CPU hogs and if it finds no kernel
addresses there, it assumes userspace. Not very nice, but does the job.

> How about with a TIF_IN_USERSPACE thread flag?  It could be cleared/set
> right at the border.  Then for running tasks it's as simple as:
> 
> if (test_tsk_thread_flag(task, TIF_IN_USERSPACE))
>   klp_switch_task_universe(task);

-- 
Vojtech Pavlik
Director SUSE Labs


Re: [PATCH 1/3] sched: add sched_task_call()

2015-02-19 Thread Vojtech Pavlik
On February 19, 2015 6:32:55 PM CET, Josh Poimboeuf  wrote:

>> Yes. I'm saying that rather than guaranteeing they don't enter the
>> kernel (by having them spin) you can flip them in case they try to do
>> that instead. That solves the race condition just as well.
>
>Ok, gotcha.
>
>We'd still need a safe way to check if they're in user space though.
>
>How about with a TIF_IN_USERSPACE thread flag?  It could be cleared/set
>right at the border.  Then for running tasks it's as simple as:
>
>if (test_tsk_thread_flag(task, TIF_IN_USERSPACE))
>   klp_switch_task_universe(task);

The s390x arch used to have a TIF_SYSCALL, which was doing exactly that (well, 
negated). I think it would work well, but it isn't entirely for free: two 
instructions per syscall and an extra TIF bit, which are (near) exhausted on 
some archs.

-- 
Vojtech Pavlik
Director SuSE Labs


Re: [PATCH 1/3] sched: add sched_task_call()

2015-02-19 Thread Vojtech Pavlik
On Thu, Feb 19, 2015 at 11:03:53AM -0600, Josh Poimboeuf wrote:
> On Thu, Feb 19, 2015 at 05:33:59PM +0100, Vojtech Pavlik wrote:
> > On Thu, Feb 19, 2015 at 10:24:29AM -0600, Josh Poimboeuf wrote:
> > 
> > > > No, these tasks will _never_ make syscalls. So you need to guarantee
> > > > they don't accidentally enter the kernel while you flip them. Something
> > > > like so should do.
> > > > 
> > > > You set TIF_ENTER_WAIT on them, check they're still in userspace, flip
> > > > them then clear TIF_ENTER_WAIT.
> > > 
> > > Ah, that's a good idea.  But how do we check if they're in user space?
> > 
> > I don't see the benefit in holding them in a loop - you can just as well
> > flip them from the syscall code as kGraft does.
> 
> But we were talking specifically about HPC tasks which never make
> syscalls.

Yes. I'm saying that rather than guaranteeing they don't enter the
kernel (by having them spin) you can flip them in case they try to do
that instead. That solves the race condition just as well.

-- 
Vojtech Pavlik
Director SUSE Labs


Re: [PATCH 1/3] sched: add sched_task_call()

2015-02-19 Thread Vojtech Pavlik
On Thu, Feb 19, 2015 at 10:24:29AM -0600, Josh Poimboeuf wrote:

> > No, these tasks will _never_ make syscalls. So you need to guarantee
> > they don't accidentally enter the kernel while you flip them. Something
> > like so should do.
> > 
> > You set TIF_ENTER_WAIT on them, check they're still in userspace, flip
> > them then clear TIF_ENTER_WAIT.
> 
> Ah, that's a good idea.  But how do we check if they're in user space?

I don't see the benefit in holding them in a loop - you can just as well
flip them from the syscall code as kGraft does.

> > I still absolutely hate you need to disturb userspace like that. Signals
> > are quite visible and perturb userspace state.
> 
> Yeah, SIGSTOP on a sleeping task can be intrusive to user space if it
> results in EINTR being returned from a system call.  It's not ideal, but
> it's much less intrusive than something like suspend.
> 
> But anyway we leave it up to the user to decide whether they want to
> take that risk, or wait for the task to wake up on its own, or cancel
> the patch.
> 
> > Also, you cannot SIGCONT a task that was SIGSTOP'ed by userspace for
> > what they thought was a good reason. You'd wreck their state.
> 
> Hm, that's a good point.  Maybe use the freezer instead of signals?
> 
> (Again this would only be for those user tasks which are sleeping on a
> patched function)
> 
> > > But now I'm thinking that kthreads will almost never be a problem.  Most
> > > kthreads are basically this:
> > 
> > You guys are way too optimistic; maybe its because I've worked on
> > realtime stuff too much, but I'm always looking at worst cases. If you
> > can't handle those, I feel you might as well not bother :-)
> 
> Well, I think we're already resigned to the fact that live patching
> won't work for every patch, every time.  And that the patch author must
> know what they're doing, and must do it carefully.
> 
> Our target worst case is that the patching fails gracefully and the user
> can't patch their system with that particular patch.  Or that the system
> remains in a partially patched state forever, if the user is ok with
> that.
> 
> > > Patching thread_fn wouldn't be possible unless we killed the thread.
> > 
> > It is, see kthread_park().
> 
> When the kthread returns from kthread_parkme(), it'll still be running
> the old thread_fn code, regardless of whether we flipped the task's
> patch state.

We could instrument kthread_parkme() to be able to return to a different
function, but it'd be a bit crazy indeed.

-- 
Vojtech Pavlik
Director SUSE Labs


Re: [RFC PATCH 6/9] livepatch: create per-task consistency model

2015-02-19 Thread Vojtech Pavlik
On Thu, Feb 19, 2015 at 10:52:51AM +0100, Peter Zijlstra wrote:

> On Wed, Feb 18, 2015 at 09:44:44PM +0100, Vojtech Pavlik wrote:
> > For live patching it doesn't matter whether code is running, sleeping or
> > frozen.
> > 
> > What matters is whether there is state before patching that may not be
> > valid after patching.
> > 
> > For userspace tasks, the exit from a syscall is a perfect moment for
> > switching to the "after" state, as all stacks, and thus all local
> > variables are gone and no local state exists in the kernel for the
> > thread.
> > 
> > The freezer is a logical choice for kernel threads, however, given that
> > kernel threads have no defined entry/exit point and execute within a
> > single main function, local variables stay and thus local state persists
> > from before to after freezing.
> > 
> > Defining that no local state within a kernel thread may be relied upon
> > after exiting from the freezer is certainly possible, and is already
> > true for many kernel threads.
> > 
> > It isn't a given property of the freezer itself, though. And isn't
> > obvious for author of new kernel threads either.
> > 
> > The ideal solution would be to convert the majority of kernel threads to
> > workqueues, because then there is a defined entry/exit point over which
> > state isn't transferred. That is a lot of work, though, and has other
> > drawbacks, particularly in the realtime space.
> 
> kthread_park() functionality seems to be exactly what you want.

It might be exactly that, indeed. The requirement of not just cleaning
up, but also not using contents of local variables from before parking
would need to be documented.

And kernel threads would need to start using it, too. I have been able
to find one instance where this functionality is actually used. So it is
again a matter of a massive patch adding that, like with the approach of
converting kernel threads to workqueues.

By the way, if kthread_park() was implemented all through the kernel,
would we still need the freezer for kernel threads at all? Since parking
seems to be stronger than freezing, it could also be used for that
purpose.

Vojtech


Re: [PATCH 1/3] sched: add sched_task_call()

2015-02-19 Thread Vojtech Pavlik
On Thu, Feb 19, 2015 at 11:32:55AM -0600, Josh Poimboeuf wrote:
> On Thu, Feb 19, 2015 at 06:19:29PM +0100, Vojtech Pavlik wrote:
> > On Thu, Feb 19, 2015 at 11:03:53AM -0600, Josh Poimboeuf wrote:
> > > On Thu, Feb 19, 2015 at 05:33:59PM +0100, Vojtech Pavlik wrote:
> > > > On Thu, Feb 19, 2015 at 10:24:29AM -0600, Josh Poimboeuf wrote:
> > > > 
> > > > > > No, these tasks will _never_ make syscalls. So you need to guarantee
> > > > > > they don't accidentally enter the kernel while you flip them. Something
> > > > > > like so should do.
> > > > > > 
> > > > > > You set TIF_ENTER_WAIT on them, check they're still in userspace, flip
> > > > > > them then clear TIF_ENTER_WAIT.
> > > > > 
> > > > > Ah, that's a good idea.  But how do we check if they're in user space?
> > > > 
> > > > I don't see the benefit in holding them in a loop - you can just as well
> > > > flip them from the syscall code as kGraft does.
> > > 
> > > But we were talking specifically about HPC tasks which never make
> > > syscalls.
> > 
> > Yes. I'm saying that rather than guaranteeing they don't enter the
> > kernel (by having them spin) you can flip them in case they try to do
> > that instead. That solves the race condition just as well.
> 
> Ok, gotcha.
> 
> We'd still need a safe way to check if they're in user space though.

Having a safe way would be very nice and actually quite useful in other
cases, too.

For this specific purpose, however, we don't need a very safe way. We
don't require atomicity in any way, and we don't mind even if it
creates false negatives; only false positives would be bad.

kGraft looks at the stacktrace of CPU hogs and if it finds no kernel
addresses there, it assumes userspace. Not very nice, but does the job.

> How about with a TIF_IN_USERSPACE thread flag?  It could be cleared/set
> right at the border.  Then for running tasks it's as simple as:
> 
> if (test_tsk_thread_flag(task, TIF_IN_USERSPACE))
>   klp_switch_task_universe(task);

-- 
Vojtech Pavlik
Director SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH 6/9] livepatch: create per-task consistency model

2015-02-19 Thread Vojtech Pavlik
On Thu, Feb 19, 2015 at 10:52:51AM +0100, Peter Zijlstra wrote:

> On Wed, Feb 18, 2015 at 09:44:44PM +0100, Vojtech Pavlik wrote:
> > For live patching it doesn't matter whether code is running, sleeping or
> > frozen.
> > 
> > What matters is whether there is state before patching that may not be
> > valid after patching.
> > 
> > For userspace tasks, the exit from a syscall is a perfect moment for
> > switching to the "after" state, as all stacks, and thus all local
> > variables are gone and no local state exists in the kernel for the
> > thread.
> > 
> > The freezer is a logical choice for kernel threads, however, given that
> > kernel threads have no defined entry/exit point and execute within a
> > single main function, local variables stay and thus local state persists
> > from before to after freezing.
> > 
> > Defining that no local state within a kernel thread may be relied upon
> > after exiting from the freezer is certainly possible, and is already
> > true for many kernel threads.
> > 
> > It isn't a given property of the freezer itself, though. And it isn't
> > obvious to authors of new kernel threads either.
> > 
> > The ideal solution would be to convert the majority of kernel threads to
> > workqueues, because then there is a defined entry/exit point over which
> > state isn't transferred. That is a lot of work, though, and has other
> > drawbacks, particularly in the realtime space.
> 
> kthread_park() functionality seems to be exactly what you want.

It might be exactly that, indeed. The requirement of not just cleaning
up, but also not using contents of local variables from before parking
would need to be documented.

And kernel threads would need to start using it, too. I have been able
to find one instance where this functionality is actually used. So it is
again a matter of a massive patch adding that, like with the approach of
converting kernel threads to workqueues.

By the way, if kthread_park() was implemented all through the kernel,
would we still need the freezer for kernel threads at all? Since parking
seems to be stronger than freezing, it could also be used for that
purpose.
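
A sketch of what that documented rule would look like in practice
(illustrative kernel-style C, not runnable on its own; do_work() is an
invented placeholder):

```c
/* Parking-aware kthread main loop. The rule that would need
 * documenting: nothing computed before kthread_parkme() returns may be
 * relied upon afterwards, which makes the park point behave like a
 * clean entry/exit point. */
static int my_thread_fn(void *data)
{
	while (!kthread_should_stop()) {
		if (kthread_should_park())
			kthread_parkme();  /* sleeps until kthread_unpark() */

		do_work(data);             /* must not cache state across parkme() */
	}
	return 0;
}
```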

Vojtech


Re: [PATCH 1/3] sched: add sched_task_call()

2015-02-19 Thread Vojtech Pavlik
On Thu, Feb 19, 2015 at 10:24:29AM -0600, Josh Poimboeuf wrote:

> > No, these tasks will _never_ make syscalls. So you need to guarantee
> > they don't accidentally enter the kernel while you flip them. Something
> > like so should do.
> > 
> > You set TIF_ENTER_WAIT on them, check they're still in userspace, flip
> > them then clear TIF_ENTER_WAIT.
> 
> Ah, that's a good idea.  But how do we check if they're in user space?

I don't see the benefit in holding them in a loop - you can just as well
flip them from the syscall code as kGraft does.

> > I still absolutely hate you need to disturb userspace like that. Signals
> > are quite visible and perturb userspace state.
> 
> Yeah, SIGSTOP on a sleeping task can be intrusive to user space if it
> results in EINTR being returned from a system call.  It's not ideal, but
> it's much less intrusive than something like suspend.
> 
> But anyway we leave it up to the user to decide whether they want to
> take that risk, or wait for the task to wake up on its own, or cancel
> the patch.
> 
> > Also, you cannot SIGCONT a task that was SIGSTOP'ed by userspace for
> > what they thought was a good reason. You'd wreck their state.
> 
> Hm, that's a good point.  Maybe use the freezer instead of signals?
> 
> (Again this would only be for those user tasks which are sleeping on a
> patched function)
> 
> > > But now I'm thinking that kthreads will almost never be a problem.  Most
> > > kthreads are basically this:
> > 
> > You guys are way too optimistic; maybe its because I've worked on
> > realtime stuff too much, but I'm always looking at worst cases. If you
> > can't handle those, I feel you might as well not bother :-)
> 
> Well, I think we're already resigned to the fact that live patching
> won't work for every patch, every time.  And that the patch author must
> know what they're doing, and must do it carefully.
> 
> Our target worst case is that the patching fails gracefully and the user
> can't patch their system with that particular patch.  Or that the system
> remains in a partially patched state forever, if the user is ok with
> that.
> 
> > > Patching thread_fn wouldn't be possible unless we killed the thread.
> > 
> > It is, see kthread_park().
> 
> When the kthread returns from kthread_parkme(), it'll still be running
> the old thread_fn code, regardless of whether we flipped the task's
> patch state.

We could instrument kthread_parkme() to be able to return to a different
function, but it'd be a bit crazy indeed.

-- 
Vojtech Pavlik
Director SUSE Labs


Re: [PATCH 1/3] sched: add sched_task_call()

2015-02-19 Thread Vojtech Pavlik
On Thu, Feb 19, 2015 at 11:03:53AM -0600, Josh Poimboeuf wrote:
> On Thu, Feb 19, 2015 at 05:33:59PM +0100, Vojtech Pavlik wrote:
> > On Thu, Feb 19, 2015 at 10:24:29AM -0600, Josh Poimboeuf wrote:
> > 
> > > > No, these tasks will _never_ make syscalls. So you need to guarantee
> > > > they don't accidentally enter the kernel while you flip them. Something
> > > > like so should do.
> > > > 
> > > > You set TIF_ENTER_WAIT on them, check they're still in userspace, flip
> > > > them then clear TIF_ENTER_WAIT.
> > > 
> > > Ah, that's a good idea.  But how do we check if they're in user space?
> > 
> > I don't see the benefit in holding them in a loop - you can just as well
> > flip them from the syscall code as kGraft does.
> 
> But we were talking specifically about HPC tasks which never make
> syscalls.

Yes. I'm saying that rather than guaranteeing they don't enter the
kernel (by having them spin) you can flip them in case they try to do
that instead. That solves the race condition just as well.

-- 
Vojtech Pavlik
Director SUSE Labs


Re: [PATCH 1/3] sched: add sched_task_call()

2015-02-19 Thread Vojtech Pavlik
On February 19, 2015 6:32:55 PM CET, Josh Poimboeuf jpoim...@redhat.com wrote:

> > Yes. I'm saying that rather than guaranteeing they don't enter the
> > kernel (by having them spin) you can flip them in case they try to do
> > that instead. That solves the race condition just as well.
> 
> Ok, gotcha.
> 
> We'd still need a safe way to check if they're in user space though.
> 
> How about with a TIF_IN_USERSPACE thread flag?  It could be cleared/set
> right at the border.  Then for running tasks it's as simple as:
> 
> if (test_tsk_thread_flag(task, TIF_IN_USERSPACE))
>    klp_switch_task_universe(task);

The s390x arch used to have a TIF_SYSCALL, which was doing exactly that (well, 
negated). I think it would work well, but it isn't entirely for free: two 
instructions per syscall and an extra TIF bit, which are (near) exhausted on 
some archs.

-- 
Vojtech Pavlik
Director SuSE Labs


Re: [RFC PATCH 6/9] livepatch: create per-task consistency model

2015-02-18 Thread Vojtech Pavlik
On Wed, Feb 18, 2015 at 09:17:55PM +0100, Ingo Molnar wrote:
> 
> * Jiri Kosina  wrote:
> 
> > On Thu, 12 Feb 2015, Peter Zijlstra wrote:
> > 
> > > And what's wrong with using known good spots like the freezer?
> > 
> > Quoting Tejun from the thread Jiri Slaby likely had on 
> > mind:
> > 
> > "The fact that they may coincide often can be useful as a 
> > guideline or whatever but I'm completely against just 
> > mushing it together when it isn't correct.  This kind of 
> > things quickly lead to ambiguous situations where people 
> > are not sure about the specific semantics or guarantees 
> > of the construct and implement weird voodoo code followed 
> > by voodoo fixes.  We already had a full round of that 
> > with the kernel freezer itself, where people thought that 
> > the freezer magically makes PM work properly for a 
> > subsystem.  Let's please not do that again."
> 
> I don't follow this vague argument.
> 
> The concept of 'freezing' all userspace execution is pretty 
> unambiguous: tasks that are running are trapped out at 
> known safe points such as context switch points or syscall 
> entry. Once all tasks have stopped, the system is frozen in 
> the sense that only the code we want is running, so you can 
> run special code without worrying about races.
> 
> What's the problem with that? Why would it be fundamentally 
> unsuitable for live patching?

For live patching it doesn't matter whether code is running, sleeping or
frozen.

What matters is whether there is state before patching that may not be
valid after patching.

For userspace tasks, the exit from a syscall is a perfect moment for
switching to the "after" state, as all stacks, and thus all local
variables are gone and no local state exists in the kernel for the
thread.

The freezer is a logical choice for kernel threads, however, given that
kernel threads have no defined entry/exit point and execute within a
single main function, local variables stay and thus local state persists
from before to after freezing.

Defining that no local state within a kernel thread may be relied upon
after exiting from the freezer is certainly possible, and is already
true for many kernel threads.

It isn't a given property of the freezer itself, though. And it isn't
obvious to authors of new kernel threads either.

The ideal solution would be to convert the majority of kernel threads to
workqueues, because then there is a defined entry/exit point over which
state isn't transferred. That is a lot of work, though, and has other
drawbacks, particularly in the realtime space.
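
The contrast can be sketched like this (illustrative kernel-style C,
not runnable on its own; setup() and process() are invented
placeholders):

```c
/* A typical kthread main loop: 'cached' is created before the first
 * freeze point and survives every freeze/thaw cycle, so the freezer
 * alone provides no moment at which pre-patch local state is gone. */
static int my_kthread_fn(void *data)
{
	struct foo *cached = setup(data);

	while (!kthread_should_stop()) {
		try_to_freeze();     /* may freeze and thaw here... */
		process(cached);     /* ...but 'cached' persists across it */
	}
	return 0;
}

/* The workqueue equivalent: all local state dies when each work item
 * returns, giving a defined point with no state carried over. */
static void my_work_fn(struct work_struct *work)
{
	struct foo *cached = setup(work);

	process(cached);
}	/* 'cached' is gone here - a natural switch point */
```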

-- 
Vojtech Pavlik
Director SUSE Labs




Re: [PATCHv7 0/3] Kernel Live Patching

2014-12-17 Thread Vojtech Pavlik
On Wed, Dec 17, 2014 at 01:22:21PM +0530, Balbir Singh wrote:
> On Wed, Dec 17, 2014 at 12:16 PM, Jiri Kosina  wrote:
> > On Wed, 17 Dec 2014, Balbir Singh wrote:
> >
> >> >> Could you describe what this does to signing? I presume the patched
> >> >> module should cause a taint on module signing?
> >> >
> >> > Hmm, why should it?
> >>
> >> I wanted to clarify it from a different perspective
> >>
> >> If the base image is signed by X and the patched module is signed by
> >> Y, is that supported. What does it imply in the case of live-patching?
> >
> > Why should that matter? Both are signed by keys that kernel is configured
> > to trust, which makes them equal (even though they are technically
> > different).
> >
> 
> I am not sure they are equal, others can comment

Since any loaded kernel module can do virtually anything on a machine,
there can only be one level of trust. As such, all trusted keys are
equally trusted.

-- 
Vojtech Pavlik
Director SUSE Labs




Re: [PATCHv3 2/3] kernel: add support for live patching

2014-11-24 Thread Vojtech Pavlik
On Mon, Nov 24, 2014 at 02:26:08PM +0100, Thomas Gleixner wrote:
> On Mon, 24 Nov 2014, Jiri Kosina wrote:
> > On Mon, 24 Nov 2014, Thomas Gleixner wrote:
> > > How is determined whether a change can be applied w/o a consistency
> > > mechanism or not?
> > 
> > By a human being producing the "live patch" code.
> > 
> > If the semantics of the patch requires consistency mechanism to be applied 
> > (such as "old and new function must not run in parallel, because locking 
> > rules would be violated", or "return value from a function that is being 
> > called in a loop is changing its meaning", etc.), then this first naive 
> > implementation simply can't be used.
> > 
> > For simple things though, such as "add a missing bounds check to sys_foo() 
> > prologue and return -EINVAL if out-of-bounds", this is sufficient.
> > 
> > It's being designed in a way that more advanced consistency models (such 
> > as the ones kgraft and kpatch are currently implementing) can be built on 
> > top of it.
> > 
> > The person writing the patch would always need to understand what he is 
> > doing to be able to pick correct consistency model to be used. I 
> > personally think this is a good thing -- this is nothing where we should 
> > be relying on any kinds of tools.
> 
> But why want we to provide a mechanism which has no consistency
> enforcement at all?
> 
> Surely you can argue that the person who is doing that is supposed to
> know what he's doing, but what's the downside of enforcing consistency
> mechanisms on all live code changes?

The consistency engine implementing the consistency model is the most
complex part of the live patching technology. We want to have something
small, easy to understand pushed out first, to build on top of that.

Plus we're still discussing which exact consistency model to use for
upstream live patching (there are many considerations) and whether one
is enough, or whether an engine that can do more than one is required.

The consistency models of kpatch and kGraft aren't directly compatible.

I think we're on a good way towards a single model, but we'll see when
it's implemented within the live patching framework just posted.

-- 
Vojtech Pavlik
Director SUSE Labs


Re: [PATCHv3 2/3] kernel: add support for live patching

2014-11-24 Thread Vojtech Pavlik
On Mon, Nov 24, 2014 at 12:13:20PM +0100, Thomas Gleixner wrote:

> > This commit introduces code for the live patching core.  It implements
> > an ftrace-based mechanism and kernel interface for doing live patching
> > of kernel and kernel module functions.
> > 
> > It represents the greatest common functionality set between kpatch and
> > kgraft and can accept patches built using either method.
> > 
> > This first version does not implement any consistency mechanism that
> > ensures that old and new code do not run together.  In practice, ~90% of
> > CVEs are safe to apply in this way, since they simply add a conditional
> > check.  However, any function change that can not execute safely with
> > the old version of the function can _not_ be safely applied in this
> > version.
> 
> To be honest this sounds frightening.
> 
> How is determined whether a change can be applied w/o a consistency
> mechanism or not?

If there are any syntactic (function prototype - arguments, return value
type) or semantic dependencies (data value semantics, locking order,
...) between multiple functions the patch changes, then it cannot be
applied.

If the changes are small and localized, don't depend on the order in
which individual functions are replaced, when both new and old code can
run in parallel on different CPUs or in sequence in a single thread
safely, then it can be applied.

-- 
Vojtech Pavlik
Director SUSE Labs






Re: [PATCH 0/2] Kernel Live Patching

2014-11-13 Thread Vojtech Pavlik
> > ...but it'd make the switch per-thread like
> > kGraft does, not for the whole system, when the respective counters
> > reach zero.
> 
> I'm not sure what happens if a process sleeps on the patched-set?

Then the patching process will be stuck until it is woken up somehow.
But it's still much better to only have to care about processes sleeping
on the patched-set than about processes sleeping anywhere (kGraft).

> If we switch the other threads, when this sleeping thread wakes up
> that will see the old functions (and old data).

Yes, until the patching process is complete, data must be kept in the
old format, even by new functions.

> So I think we need both SWITCH_THREAD and SWITCH_KERNEL options in
> that case.

With data shadowing that's not required. It still may be worth having
it.
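
Data shadowing as referred to here can be sketched as follows (every
name below is invented for illustration; it is a conceptual sketch, not
an implementation):

```c
/* New functions attach extended state to existing objects out of line,
 * so old-format data stays valid for unpatched code while patched code
 * finds its additions via a lookup. */
struct foo_shadow {
	int new_field;                    /* state added by the patch */
};

static int patched_foo_op(struct foo *obj)
{
	struct foo_shadow *sh = shadow_get_or_alloc(obj);  /* invented helper */

	if (!sh)
		return -ENOMEM;
	sh->new_field++;                  /* new code uses shadow state */
	return do_old_op(obj);            /* old-format data untouched */
}
```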

> What I'm thinking is to merge the code (technique) of both and
> allow to choose the "switch-timing" based on the patch's consistency
> requirement.

That's what I'm thinking about, too. But I'm also thinking, "this will
be complex, is it really needed"?

> Anyway, I'd like to support for this effort from kernel side.
> At least I have to solve ftrace regs conflict by IPMODIFY flag and
> a headache kretprobe failure case by sharing per-thread retstack
> with ftrace-callgraph.

While I don't know enough about the IPMODIFY flag, I wholeheartedly
support sharing the return stacks between kprobes and ftrace graph
caller.

-- 
Vojtech Pavlik
Director SUSE Labs




Re: Re: Re: [PATCH 0/2] Kernel Live Patching

2014-11-12 Thread Vojtech Pavlik
...the patched set. (Or you can
just include it manually in the set.)

Then, you can be sure that the place which calls func() is not on the
stack when patching. This way, in your classification, PATCH_KERNEL can
be as good as PATCH_THREAD. In my classification, I'm saying that
LEAVE_PATCHED_SET is as good as LEAVE_KERNEL.

> >> (*) Instead of checking stacks, at first, wait for all threads leaving
> >> the kernel once, after that, wait for refcount becomes zero and switch
> >> all the patched functions.
> > 
> > This is a very beautiful idea.
> > 
> > It does away with both the stack parsing and the kernel stopping,
> > achieving kGraft's goals, while preserving kpatch's consistency model.
> > 
> > Sadly, it combines the disadvantages of both kpatch and kGraft: From
> > kpatch it takes the inability to patch functions where threads are
> > sleeping often and as such never leave them at once. From kGraft it
> > takes the need to annotate kernel threads and wake sleepers from
> > userspace.
> 
> But how frequently the former case happens? It seems very very rare.
> And if we aim to enable both kpatch mode and kGraft mode in the kernel,
> anyway we'll have something for the latter cases.

The kpatch problem case isn't that rare. It just happened with a CVE in
futexes recently. It will happen if you try to patch anything that is on
the stack when a TTY or TCP read is waiting for data as another example. 

The kGraft problem case will happen when you load a 3rd party module
with a non-annotated kernel thread. Or a different problem will happen
when you have an application sleeping that will exit when receiving any
signal.

Both the cases can be handled with tricks and workarounds. But it'd be
much nicer to have a patching engine that is reliable.

> > So while it is beautiful, it's less practical than either kpatch or
> > kGraft alone. 
> 
> Ah, sorry for confusing, I don't tend to integrate kpatch and kGraft.
> Actually, it is just about modifying kpatch, since it may shorten
> stack-checking time.
> This means that does not change the consistency model.
> We certainly need both of kGraft mode and kpatch mode.

What I'm proposing is a LEAVE_PATCHED_SET + SWITCH_THREAD mode. It's
less consistency, but it is enough. And it is more reliable (likely to
succeed in finite time) than either kpatch or kGraft.

It'd be mostly based on your refcounting code, including stack
checking (when a process sleeps, counter gets set based on number of
patched functions on the stack), possibly including setting the counter
to 0 on syscall entry/exit, but it'd make the switch per-thread like
kGraft does, not for the whole system, when the respective counters
reach zero.
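
In rough pseudo-C, the proposed scheme could look like this (every name
below - patched_count, count_patched_funcs_on_stack(),
klp_switch_task_universe() - is invented for illustration):

```c
/* Sketch of LEAVE_PATCHED_SET + SWITCH_THREAD, not an implementation. */

/* When a task goes to sleep, record how many functions from the
 * patched set are on its stack; syscall entry/exit resets it to 0. */
void klp_note_sleep(struct task_struct *task)
{
	task->patched_count = count_patched_funcs_on_stack(task);
}

/* Any task whose count has reached zero is outside the patched set and
 * can be switched individually, kGraft-style, with no whole-system
 * stop. */
void klp_try_switch(struct task_struct *task)
{
	if (task->patched_count == 0)
		klp_switch_task_universe(task);
}
```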

This handles the frequent sleeper case, it doesn't need annotated kernel
thread main loops, it will not need the user to wake up every process in
the system unless it sleeps in a patched function.

And it can handle all the patches that kpatch and kGraft can (it needs
shadowing for some).

> > Yes, this is what I call 'extending the patched set'. You can do that
> > either by deliberately changing the prototype of the patched function
> > being called, which causes the calling function to be considered
> > different, or just add it to the set of functions considered manually.
> 
> I'd prefer latter one :) or just gives hints of watching targets.

Me too.


-- 
Vojtech Pavlik
Director SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Re: Re: [PATCH 0/2] Kernel Live Patching

2014-11-12 Thread Vojtech Pavlik


Re: Re: [PATCH 0/2] Kernel Live Patching

2014-11-11 Thread Vojtech Pavlik
On Tue, Nov 11, 2014 at 10:24:03AM +0900, Masami Hiramatsu wrote:

> Hmm, I doubt this can cover all. what I'm thinking is a combination of
> LEAVE_KERNEL and SWITCH_KERNEL by using my refcounting and kGraft's
> per-thread "new universe" flagging(*). It switches all threads but not
> change entire kernel as kexec does.

While your approach described below indeed forces all threads to leave
the kernel once, to initialize the reference counts, this can be considered
a preparatory phase before the actual patching begins.

The actual consistency model remains unchanged from what kpatch offers
today, which guarantees that at the time of switch, no execution thread
is inside the set of patched functions and that the switch happens at
once for all threads. Hence I'd still classify the consistency model
offered as LEAVE_PATCHED_SET SWITCH_KERNEL.

> So, I think the patch may be classified by following four types
> 
> PATCH_FUNCTION - Patching per function. This ignores context, just
>change the function.
>User must ensure that the new function can co-exist
>with old functions on the same context (e.g. recursive
>call can cause inconsistency).
> 
> PATCH_THREAD - Patching per thread. If a thread leave the kernel,
>changes are applied for that thread.
>User must ensure that the new functions can co-exist
>with old functions per-thread. Inter-thread shared
>data acquisition(locks) should not be involved.
> 
> PATCH_KERNEL - Patching all threads. This wait for all threads leave the
>all target functions.
>User must ensure that the new functions can co-exist
>with old functions on a thread (note that if there is a
>loop, old one can be called first n times, and new one
>can be called afterwords).(**)

Yes, but only when the function calling it is not included in the
patched set, which is only a problem for semantic changes accompanied by
no change in the function prototype. This can be avoided by changing
the prototype deliberately.
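A deliberate prototype change might look like this in a patch. The example is hypothetical; the point is only that the caller's code changes too, pulling it into the patched set:

```c
#include <assert.h>

/* original function: a semantic change with an unchanged prototype
 * would be invisible to its callers */
static int bar_old(void)
{
	return 42;
}

/*
 * Patched function: adding a dummy argument changes the prototype on
 * purpose, so every caller must be rebuilt as part of the patch and
 * is then tracked by the consistency engine as well.
 */
static int bar_new(int dummy)
{
	(void)dummy;
	return 42 + 1;	/* the actual semantic change */
}

/* the caller is now different code, hence part of the patched set */
static int foo_new(void)
{
	return bar_new(0);
}
```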

> RENEW_KERNEL - Renew entire kernel and reset internally. No patch limitation,
>but involving kernel resetting. This may take a time.

And involves recording the userspace-kernel interface state exactly. The
interface is fairly large, so this can become hairy.

> (*) Instead of checking stacks, at first, wait for all threads leaving
> the kernel once, after that, wait for refcount becomes zero and switch
> all the patched functions.

This is a very beautiful idea.

It does away with both the stack parsing and the kernel stopping,
achieving kGraft's goals, while preserving kpatch's consistency model.

Sadly, it combines the disadvantages of both kpatch and kGraft: From
kpatch it takes the inability to patch functions where threads are
sleeping often and as such never leave them at once. From kGraft it
takes the need to annotate kernel threads and wake sleepers from
userspace.

So while it is beautiful, it's less practical than either kpatch or
kGraft alone. 

> (**) For the loops, if it is a simple loop or some simple lock calls,
> we can wait for all threads leave the caller function to avoid inconsistency
> by using refcounting.

Yes, this is what I call 'extending the patched set'. You can do that
either by deliberately changing the prototype of the patched function
being called, which causes the calling function to be considered
different, or just add it to the set of functions considered manually.

> > There would be the refcounting engine, counting entries/exits of the
> > area of interest (nothing for LEAVE_FUNCTION, patched functions for
> > LEAVE_PATCHED_SET - same as Masami's work now, or syscall entry/exit for
> > LEAVE_KERNEL), and it'd do the counting either per thread, flagging a
> > thread as 'new universe' when the count goes to zero, or flipping a
> > 'new universe' switch for the whole kernel when the count goes down to zero.
> 
> Ah, that's similar thing what I'd like to try next :)

Cool.

> Sorry, here is an off-topic talk.  I think a problem of kGraft's
> LEAVE_KERNEL work is that the sleeping processes. To ensure all the
> threads are changing to new universe, we need to wakeup all the
> threads, or we need stack-dumping to find someone is sleeping on the
> target functions. What would the kGraft do for this issue?

Yes, kGraft uses a userspace helper to find such sleepers and wake them
by sending a SIGUSR1 or SIGSTOP/SIGCONT. It's one of the disadvantages
of kGraft that sleeper threads have to be handled possibly case by case.

Also, kernel threads are problematic for kGraft (as you may have seen in
earlier kGraft upstream submissions) as they never leave the kernel.

-- 
Vojtech Pavlik

Re: [PATCH 0/2] Kernel Live Patching

2014-11-11 Thread Vojtech Pavlik
 functions in the list.

So you would not be skipping the stack checking entirely, just allowing
certain functions from the patched set to be on the stack while you
switch to the new universe.

That indeed would make it a mixed SWITCH_FUNCTION/SWITCH_KERNEL
situation. 

The big caveat in such a situation is that you must not change the
calling convention or semantics of any function called directly from the
function you skipped the stack check for. As doing so would crash the
kernel when an old call calls a new function.

> > But there are a few (probably much less than 10%) cases like the locking
> > one I used above, where SWITCH_THREAD just isn't going to cut it and for
> > those I would need SWITCH_KERNEL or get very creative with refactoring
> > the patch to do things differently.
> 
> I'm not opposed to having both if necessary.  But I think the code would
> be _much_ simpler if we could agree on a single consistency model that
> can be used in all cases.  Plus there wouldn't be such a strong
> requirement to get incremental patching to work safely (which will add
> more complexity).
> 
> I actually agree with you that LEAVE_PATCHED_SET + SWITCH_THREAD is
> pretty nice.

Cool! Do you see it as the next step consistency model we would focus on
implementing in livepatch after the null model is complete and upstream?

(That wouldn't preclude extending it or implementing more models later.)

> So I'd like to hear more about cases where you think we _need_
> SWITCH_KERNEL.  As I mentioned above, I think many of those cases can be
> solved by using data structure versioning with shadow data fields.

I have tried, but so far I can't find a situation where we would
absolutely need SWITCH_KERNEL, assuming we have LEAVE_PATCHED_SET +
SWITCH_THREAD + TRANSFORM_ON_ACCESS.
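TRANSFORM_ON_ACCESS with shadow data can be pictured with a toy side table. This is illustrative only; a real implementation (such as kpatch's shadow variables) uses a proper hash table with locking:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * A patch cannot grow a struct that is already allocated, so an added
 * field lives out of band, keyed by the object's address, and is
 * created lazily the first time the new code touches the object.
 * Fixed-size linear table for brevity; assumes it never fills up.
 */
#define SHADOW_SLOTS 64

struct shadow_entry {
	const void *obj;	/* object this shadow state belongs to */
	int retries;		/* the field the patch wants to add */
	bool used;
};

static struct shadow_entry shadow_tab[SHADOW_SLOTS];

static struct shadow_entry *shadow_get(const void *obj)
{
	struct shadow_entry *free_slot = NULL;

	for (int i = 0; i < SHADOW_SLOTS; i++) {
		if (shadow_tab[i].used && shadow_tab[i].obj == obj)
			return &shadow_tab[i];
		if (!shadow_tab[i].used && !free_slot)
			free_slot = &shadow_tab[i];
	}
	/* first access by new code: transform by attaching fresh state */
	free_slot->obj = obj;
	free_slot->retries = 0;
	free_slot->used = true;
	return free_slot;
}
```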

> > > Why would LEAVE_PATCHED_SET SWITCH_THREAD finish much quicker than
> > > LEAVE_PATCHED_SET SWITCH_KERNEL?  Wouldn't they be about the same?
> > 
> > Because with LEAVE_PATCHED_SET SWITCH_THREAD you're waiting for each
> > thread to leave the patched set and when they each have done that at
> > least once, you're done. Even if some are already back within the set.
> 
> Ok, so you're talking about the case when you're trying to patch a
> function which is always active.  Agreed :-)

Yes.

> > With LEAVE_PATCHED_SET SWITCH_KERNEL, you have to find the perfect
> > moment when all of the threads are outside of the patched set at the
> > same time. Depending on how often the functions are used and how large
> > the set is, reaching that moment will get harder.
> 
> Yeah, I think this is the biggest drawback of SWITCH_KERNEL.  More
> likely to fail (or never succeed).

If some threads are sleeping in a loop inside the patched set:

With SWITCH_THREAD you can wake (eg. by a signal) the threads from
userspace as a last resort and that will complete your patching.

With SWITCH_KERNEL you'd somehow have to wake them at the same time
hoping they also leave the patched set together. That's unlikely to
happen when many threads are involved.

In addition the "in progress" behavior is nicer for SWITCH_THREAD, as
any new thread will already be running patched code. With SWITCH_KERNEL,
you're waiting with applying the fix until the perfect moment.

-- 
Vojtech Pavlik
Director SUSE Labs





Re: [PATCH 0/2] Kernel Live Patching

2014-11-08 Thread Vojtech Pavlik
On Fri, Nov 07, 2014 at 09:45:53PM -0600, Josh Poimboeuf wrote:

> On Fri, Nov 07, 2014 at 10:27:35PM +0100, Vojtech Pavlik wrote:
> > On Fri, Nov 07, 2014 at 09:45:00AM -0600, Josh Poimboeuf wrote:
> > 
> > > > LEAVE_FUNCTION
> > > > LEAVE_PATCHED_SET
> > > > LEAVE_KERNEL
> > > > 
> > > > SWITCH_FUNCTION
> > > > SWITCH_THREAD
> > > > SWITCH_KERNEL
> > > > 
> > 
> > There would be the refcounting engine, counting entries/exits of the
> > area of interest (nothing for LEAVE_FUNCTION, patched functions for
> > LEAVE_PATCHED_SET - same as Masami's work now, or syscall entry/exit
> > for LEAVE_KERNEL), and it'd do the counting either per thread,
> > flagging a thread as 'new universe' when the count goes to zero, or
> > flipping a 'new universe' switch for the whole kernel when the count
> > goes down to zero.
> > 
> > A patch would have flags which specify a combination of the above
> > properties that are needed for successful patching of that specific
> > patch.
> 
> Would it really be necessary to support all possible permutations?  I
> think that would add a lot of complexity to the code.  Especially if we
> have to support LEAVE_KERNEL, which adds a lot of interdependencies with
> the rest of the kernel (kthreads, syscall, irq, etc).
> 
> It seems to me that the two most interesting options are:
> - LEAVE_PATCHED_SET + SWITCH_THREAD (Masami-kGraft)
> - LEAVE_PATCHED_SET + SWITCH_KERNEL (kpatch and/or Masami-kpatch)

I agree here.

In fact LEAVE_KERNEL can be approximated by extending the patched
set as required to include functions which are not changed per se, but
are "contaminated" by propagation of semantic changes in the calling
convention, and/or data format.

This applies to cases like this (needs LEAVE_KERNEL or extending patched
set beyond changed functions):

---

int bar() {
[...]
-   return x;
+   return x + 1;
}

foo() {
int ret = bar();
do {
wait_interruptible();
} while (ret == bar());
}

---

Or like this (needs SWITCH_KERNEL, so it won't work with kGraft, and also
extending the patched set, so it will not work with kpatch as it stands today):

---

void lock_a()
{
-   spin_lock(x);
+   spin_lock(y);
}
void lock_b()
{
-   spin_lock(y);
+   spin_lock(x);
}
void unlock_a()
{
-   spin_unlock(x);
+   spin_unlock(y);
}
void unlock_b()
{
-   spin_unlock(y);
+   spin_unlock(x);
}

void foo()
{
lock_a();
lock_b();
[...]
unlock_b();
unlock_a();
}
---


So once we can extend the patched set as needed (manually, after
review), we can handle all the cases that LEAVE_KERNEL offers, making
LEAVE_KERNEL unnecessary.

It'd be nice if we didn't require actually patching those functions, but
only included them in the set we have to leave to proceed.

The remaining question is the performance impact of using a large set
with refcounting. LEAVE_KERNEL's impact as implemented in kGraft is
too small to measure: about 16 added instructions executed per thread.

> I do see some appeal for the choose-your-own-consistency-model idea.
> But it wouldn't be practical for cumulative patch upgrades, which I
> think we both agreed at LPC seems to be safer than incremental
> patching.  If you only ever have one patch module loaded at any given
> time, you only get one consistency model anyway.

> In order for multiple consistency models to be practical, I think we'd
> need to figure out how to make incremental patching safe.

I believe we have to get incremental patching working anyway as it is a
valid usecase for many users, just not for major distributions.

And we may want to take a look at how to mark parts of a cumulative
patch with different consistency models, when we combine eg. the recent
futex CVE patch (not working with SWITCH_KERNEL) and anything requiring
SWITCH_KERNEL.

> > For data layout and semantic changes, there are two approaches:
> > 
> > 1) TRANSFORM_WORLD
 
> I'm kind of surprised to hear that Ksplice does this.  I had
> considered this approach, but it sounds really tricky, if not
> impossible in many cases.

By the way, the 

Re: [PATCH 0/2] Kernel Live Patching

2014-11-08 Thread Vojtech Pavlik

By the way, the ability to do this is basically the only advantage of
actually stopping the kernel.
 
 Ahem, this would be an opportune time for a Ksplice person to chime in
 with their experiences...

Indeed. :)

  2) TRANSFORM_ON_ACCESS
 
 This is a variation on what we've been doing with kpatch, using shadow
 data fields to add data and/or version metadata

Re: [PATCH 0/2] Kernel Live Patching

2014-11-07 Thread Vojtech Pavlik
t null model) and while it
needs two steps (patch, then enable conversion), it doesn't require two
rounds of patching. Also, you don't have to consider oldfunc/newdata as
that will never happen. oldfunc/olddata obviously works, so you only
have to look at newfunc/olddata and newfunc/newdata as the
transformation goes on.

I don't see either of these as really that much simpler. But I do see value
in offering both.

> On the other hand, SWITCH_KERNEL doesn't have those problems.  It does
> have the problem you mentioned, roughly 2% of the time, where it can't
> patch functions which are always in use.  But in that case we can skip
> the backtrace check ~90% of the time.  

An interesting bit is that when you skip the backtrace check you're
actually reverting to LEAVE_FUNCTION SWITCH_FUNCTION, forfeiting all
consistency and not LEAVE_FUNCTION SWITCH_KERNEL as one would expect.

Hence for those 2% of cases (going with your number, because it's a
guess anyway) LEAVE_PATCHED_SET SWITCH_THREAD would in fact be a safer
option.

> So it's really maybe something
> like 0.2% of patches which can't be patched with SWITCH_KERNEL.  But
> even then I think we could overcome that by getting creative, e.g. using
> the multiple patch approach.
> 
> So my perspective is that SWITCH_THREAD causes big headaches 10% of the
> time, whereas SWITCH_KERNEL causes small headaches 1.8% of the time, and
> big headaches 0.2% of the time :-)

My preferred way would be to go with SWITCH_THREAD for the simpler stuff
and do a SWITCH_KERNEL for the 10% of complex patches. This is because
(LEAVE_PATCHED_SET) SWITCH_THREAD finishes much quicker. But I'm biased
there. ;)

It seems more and more to me that we will actually want the more
powerful engine coping with the various options.

-- 
Vojtech Pavlik
Director SUSE Labs


Re: [PATCH 0/2] Kernel Live Patching

2014-11-07 Thread Vojtech Pavlik
On Fri, Nov 07, 2014 at 07:11:53AM -0600, Josh Poimboeuf wrote:

> 2. Add consistency model(s) (e.g. kpatch stop_machine, kGraft per-task
>consistency, Masami's per task ref counting)

I have given some thought to the consistency models and how they differ
and how they potentially could be unified.

I have to thank Masami, because his rewrite of the kpatch model based on
refcounting is what brought it closer to the kGraft model and thus
allowed me to find the parallels.

Let me start by defining the properties of the patching consistency
model. First, what entity the execution must be outside of to be able to
make the switch, ordered from weakest to strongest:

LEAVE_FUNCTION
- execution has to leave a patched function to switch
  to the new implementation

LEAVE_PATCHED_SET
- execution has to leave the set of patched functions
  to switch to the new implementation

LEAVE_KERNEL
- execution has to leave the entire kernel to switch
  to the new implementation

Then, what entity the switch happens for. Again, from weakest to strongest:

SWITCH_FUNCTION
- the switch to the new implementation happens on a per-function
   basis

SWITCH_THREAD
- the switch to the new implementation is per-thread.

SWITCH_KERNEL
- the switch to the new implementation happens at once for
  the whole kernel

Now with those definitions:

livepatch (null model), as is, is LEAVE_FUNCTION and SWITCH_FUNCTION

kpatch, masami-refcounting and Ksplice are LEAVE_PATCHED_SET and 
SWITCH_KERNEL

kGraft is LEAVE_KERNEL and SWITCH_THREAD

CRIU/kexec is LEAVE_KERNEL and SWITCH_KERNEL
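
For illustration, the classification above can be written down as a small table. This is a sketch in C; the enum and struct names are my own shorthand, not identifiers from any of the implementations:

```c
#include <assert.h>

/* The two axes defined above, ordered weakest to strongest. */
enum leave_scope  { LEAVE_FUNCTION = 1, LEAVE_PATCHED_SET, LEAVE_KERNEL };
enum switch_scope { SWITCH_FUNCTION = 1, SWITCH_THREAD, SWITCH_KERNEL };

struct model {
	const char *name;
	enum leave_scope leave;
	enum switch_scope sw;
};

/* Classification of the approaches named in this mail. */
static const struct model models[] = {
	{ "livepatch (null model)",            LEAVE_FUNCTION,    SWITCH_FUNCTION },
	{ "kpatch/masami-refcounting/Ksplice", LEAVE_PATCHED_SET, SWITCH_KERNEL },
	{ "kGraft",                            LEAVE_KERNEL,      SWITCH_THREAD },
	{ "CRIU/kexec",                        LEAVE_KERNEL,      SWITCH_KERNEL },
};
```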

By blending kGraft and masami-refcounting, we could create a consistency
engine capable of almost any combination of these properties and thus
all the consistency models.

However, I'm currently thinking that the most interesting model is
LEAVE_PATCHED_SET and SWITCH_THREAD, as it is reliable, fast converging,
doesn't require annotating kernel threads, nor does it fail with frequent
sleepers like futexes. 

It provides the least consistency that is required to be able to change
the calling convention of functions and still allows for semantic
dependencies.

What do you think?



PS.: Livepatch's null model isn't in fact the weakest possible, as it still
guarantees executing complete intact functions, this thanks to ftrace.
That is much more than what would direct overwriting of the function in
memory achieve.

This is also the reason why Ksplice is locked to a very specific
consistency model. Ksplice can patch only when the kernel is stopped and
the model is built from that.

masami-refcounting, kpatch, kGraft, and livepatch have a lot more freedom,
thanks to ftrace, in what the consistency model should look like.

PPS.: I haven't included any handling of changed data structures in
this, that's another set of properties.

-- 
Vojtech Pavlik
Director SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/2] Kernel Live Patching

2014-11-07 Thread Vojtech Pavlik
On Fri, Nov 07, 2014 at 06:31:54AM -0600, Josh Poimboeuf wrote:
> On Thu, Nov 06, 2014 at 09:24:23PM +0100, Vojtech Pavlik wrote:
> > On Thu, Nov 06, 2014 at 10:58:57AM -0800, Christoph Hellwig wrote:
> > 
> > > On Thu, Nov 06, 2014 at 07:51:57PM +0100, Vojtech Pavlik wrote:
> > > > I don't think this specific example was generated. 
> > > > 
> > > > I also don't think including the whole kpatch automation into the kernel
> > > > tree is a viable development model for it. (Same would apply for kGraft
> > > > automation.)
> > > 
> > > Why?  We (IMHO incorrectly) used the argument of tight coupling to put
> > > perf into the kernel tree.  Generating kernel live patches is way more
> > > integrated that it absolutely has to go into the tree to be able to do
> > > proper development on it in an integrated fashion.
> > 
> > One reason is that there are currently at least two generators using
> > very different methods of generation (in addition to the option of doing
> > the patch module by hand), and neither of them are currently in a state
> > where they would be ready for inclusion into the kernel (although the
> > kpatch one is clearly closer to that).
> 
> What generator does kGraft have?  Is that the one that generates the
> source patch, or is there one that generates a binary patch module?

The generator for kGraft:

* extracts a list of changed functions from a patch (rather naïvely so 
far)
* uses DWARF debuginfo of the old kernel to handle things like inlining
  and create a complete list of functions that need to be replaced
* compiles the kernel with -fdata-sections -ffunction-sections
* uses a modified objcopy to extract functions from the kernel
  into a single .o file
* creates a stub .c file that references those functions
* compiles the .c and links with the .o to build a .ko

The main difference is in that the kGraft generator doesn't try to
compare the old and new binary objects, but rather works with function
lists and the DWARF info of the old code and extracts new functions from
the new binary.

However, as I said before, we have found enough trouble around e.g.
IPA-SRA and other optimizations that make any automated approach fragile
and in our view more effort than benefit. Hence, we intend to use the
manual way of creating live patches until proven that we were wrong in
this assessment. :)

-- 
Vojtech Pavlik
Director SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [PATCH 0/2] Kernel Live Patching

2014-11-07 Thread Vojtech Pavlik
> you only
> have to look at newfunc/olddata and newfunc/newdata as the
> transformation goes on.

I don't see either of these as really that much simpler. But I do see value
in offering both.

> On the other hand, SWITCH_KERNEL doesn't have those problems.  It does
> have the problem you mentioned, roughly 2% of the time, where it can't
> patch functions which are always in use.  But in that case we can skip
> the backtrace check ~90% of the time.

An interesting bit is that when you skip the backtrace check you're
actually reverting to LEAVE_FUNCTION SWITCH_FUNCTION, forfeiting all
consistency and not LEAVE_FUNCTION SWITCH_KERNEL as one would expect.

Hence for those 2% of cases (going with your number, because it's a
guess anyway) LEAVE_PATCHED_SET SWITCH_THREAD would in fact be a safer
option.

> So it's really maybe something
> like 0.2% of patches which can't be patched with SWITCH_KERNEL.  But
> even then I think we could overcome that by getting creative, e.g. using
> the multiple patch approach.
> 
> So my perspective is that SWITCH_THREAD causes big headaches 10% of the
> time, whereas SWITCH_KERNEL causes small headaches 1.8% of the time, and
> big headaches 0.2% of the time :-)

My preferred way would be to go with SWITCH_THREAD for the simpler stuff
and do a SWITCH_KERNEL for the 10% of complex patches. This is because
(LEAVE_PATCHED_SET) SWITCH_THREAD finishes much quicker. But I'm biased
there. ;)

It seems more and more to me that we will actually want the more
powerful engine coping with the various options.
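
A toy model of that combination — migrate a thread (SWITCH_THREAD) as soon as its stack is clear of the patched set (LEAVE_PATCHED_SET) — can be sketched in C. The function names and stacks here are made up for illustration; a real implementation walks actual kernel stacks:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* The set of functions replaced by the live patch. */
static const char *patched_set[] = { "vfs_read", "vfs_write" };

static int in_patched_set(const char *fn)
{
	size_t i;

	for (i = 0; i < sizeof(patched_set) / sizeof(patched_set[0]); i++)
		if (strcmp(fn, patched_set[i]) == 0)
			return 1;
	return 0;
}

/*
 * LEAVE_PATCHED_SET: a thread may migrate to the new implementations
 * once no function from the patched set appears on its stack. Note
 * that a thread sleeping outside the set (e.g. in a futex wait) can
 * migrate immediately -- it does not block convergence.
 */
static int can_switch(const char *const *stack, size_t depth)
{
	size_t i;

	for (i = 0; i < depth; i++)
		if (in_patched_set(stack[i]))
			return 0;
	return 1;
}
```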

-- 
Vojtech Pavlik
Director SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/2] Kernel Live Patching

2014-11-06 Thread Vojtech Pavlik
On Thu, Nov 06, 2014 at 10:58:57AM -0800, Christoph Hellwig wrote:

> On Thu, Nov 06, 2014 at 07:51:57PM +0100, Vojtech Pavlik wrote:
> > I don't think this specific example was generated. 
> > 
> > I also don't think including the whole kpatch automation into the kernel
> > tree is a viable development model for it. (Same would apply for kGraft
> > automation.)
> 
> Why?  We (IMHO incorrectly) used the argument of tight coupling to put
> perf into the kernel tree.  Generating kernel live patches is way more
> integrated that it absolutely has to go into the tree to be able to do
> proper development on it in an integrated fashion.

One reason is that there are currently at least two generators using
very different methods of generation (in addition to the option of doing
the patch module by hand), and neither of them are currently in a state
where they would be ready for inclusion into the kernel (although the
kpatch one is clearly closer to that).

A generator is not required for using the infrastructure and is merely a
means of preparing the live patch with less effort.

I'm not opposed at all to adding the generator(s) eventually.

However, given that their use is optional, I would prefer not to have to
wait on finishing and cleaning up the generator(s) to include the
in-kernel live patching infrastructure.

-- 
Vojtech Pavlik
Director SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/2] Kernel Live Patching

2014-11-06 Thread Vojtech Pavlik
On Thu, Nov 06, 2014 at 10:44:46AM -0800, Christoph Hellwig wrote:

> On Thu, Nov 06, 2014 at 08:39:06AM -0600, Seth Jennings wrote:
> > An example patch module can be found here:
> > https://github.com/spartacus06/livepatch/blob/master/patch/patch.c
> 
> Please include the generator for this patch in the kernel tree.
> Providing interfaces for out of tree modules (or generators) is not
> how the kernel internal APIs work.

I don't think this specific example was generated. 

I also don't think including the whole kpatch automation into the kernel
tree is a viable development model for it. (Same would apply for kGraft
automation.)

So I believe including one or more human-produced examples makes the most sense.

-- 
Vojtech Pavlik
Director SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/2] kernel: add support for live patching

2014-11-06 Thread Vojtech Pavlik
On Thu, Nov 06, 2014 at 10:20:49AM -0600, Seth Jennings wrote:

> Yes, I should explain it.
> 
> This is something that is currently only used in the kpatch approach.
> It allows the patching core to do dynamic relocations on the new
> function code, similar to what the kernel module linker does, but this
> works for non-exported symbols as well.
> 
> This is so the patch module doesn't have to do a kallsyms lookup on
> every non-exported symbol that the new functions use.
> 
> The fields of the dynrela structure are those of a normal ELF rela
> entry, except for the "external" field, which conveys information about
> where the core module should go looking for the symbol referenced in the
> dynrela entry.
> 
> Josh was under the impression that Vojtech was ok with putting the
> dynrela stuff in the core.  Is that not correct (misunderstanding)?

Yes, that is correct, as obviously the kpatch way of generating patches
by extracting code from a compiled kernel would not be viable without
it.

For our own kGraft usage we're choosing to compile patches from C
source, and there we can simply replace the function calls by calls via
pointer looked up through kallsyms.

However, kGraft also has tools to create patches in an automated way,
where the individual functions are extracted from the compiled patched
kernel using a modified objcopy, and this hits exactly the same
issue of having to do relocation of unexported symbols if any are
referenced.

So no objection to the idea. We'll have to look more into the code to
comment on the implementation of the dynrela stuff.
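
For illustration, such an entry could be modeled on a standard Elf64_Rela with the extra flag described above. The struct and field names below are hypothetical, not kpatch's actual definitions:

```c
#include <stdint.h>

/*
 * Hypothetical dynamic-relocation entry: the three fields of a normal
 * Elf64_Rela, plus a flag telling the patching core whether the symbol
 * lives outside the patched object and must be resolved some other way
 * (e.g. via a kallsyms lookup for an unexported symbol).
 */
struct dynrela {
	uint64_t offset;   /* where in the new code to apply the fix-up */
	uint64_t info;     /* symbol index and relocation type, ELF-style */
	int64_t  addend;   /* constant addend, as in Elf64_Rela */
	int      external; /* non-zero: resolve outside this module */
};

/* Apply one relocation: here just a toy 64-bit absolute write. */
static void apply_dynrela(unsigned char *base, const struct dynrela *r,
			  uint64_t resolved)
{
	*(uint64_t *)(base + r->offset) = resolved + r->addend;
}
```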

-- 
Vojtech Pavlik
Director SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/




Re: [PATCH v3 0/2] s390 vs. kprobes on ftrace

2014-10-22 Thread Vojtech Pavlik
On Wed, Oct 22, 2014 at 10:26:25AM +0200, Heiko Carstens wrote:

> > I can confirm that kGraft works well on top of current mainline with
> > this patch added.
> > 
> > Another reason for a performance impact when kGraft is enabled is that
> > kGraft still adds two instructions to the syscall path on s390x, as
> > there is no space left for a kgraft TIF in the first eight bits of
> > thread info flags. Renumbering the thread info flags such that _TIF_WORK
> > occupies the first eight bits and TIF_TRACE the next eight would fix
> > that problem: Do you believe it is feasible?
> 
> Hi Vojtech,
> 
> I think you're talking about the SLES12 kernel? 

I'm working with upstream git right now, to make sure kGraft keeps up
with it.

> There you can simply move the TIF_SYSCALL bit to the same byte where
> your TIF_KGR bit resides.

On 3.12 (SLE12) this would allow me to clear the TIF_KGR in a single
instruction when exiting a syscall simply by extending the mask. 

I need to clear it before entering a syscall, too, and TIF_SYSCALL is
set there, not cleared.

What could work is moving TIF_SYSCALL and TIF_KGR to the same byte as
TIF_TRACE. Then I could use the TIF_TRACE check to also clear the
TIF_KGR bit in sysc_tracesys without adding an instruction to the hot
path.

> Upstream is a bit different since the TIF_SYSCALL bit is already gone (got
> replaced with an s390 specific "PIF" bit). However the free TIF bit got
> already eaten up by uprobes..

Yes, and all the 8 bits are eaten now in upstream. That's what got me
thinking about separating TIF_WORK and TIF_TRACE into separate bytes, with
no other flags in them. The syscall code would then just check the
whole byte for zero instead of doing test-under-mask.
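
The grouping being discussed can be sketched like this; the bit assignments are illustrative only, not the real s390 thread-info layout:

```c
/*
 * Illustrative layout: all work flags packed into the low byte, all
 * trace flags into the next byte. Each group can then be tested with
 * a single byte-wide "is it zero?" compare instead of a separate
 * test-under-mask per scattered flag.
 */
#define TIF_NEED_RESCHED  (1u << 0)   /* work flags: bits 0-7 */
#define TIF_SIGPENDING    (1u << 1)
#define TIF_WORK_MASK     0x00ffu

#define TIF_SYSCALL_TRACE (1u << 8)   /* trace flags: bits 8-15 */
#define TIF_SYSCALL_AUDIT (1u << 9)
#define TIF_TRACE_MASK    0xff00u

/* One check per group on the syscall path. */
static int has_work(unsigned int flags)
{
	return (flags & TIF_WORK_MASK) != 0;
}

static int has_trace(unsigned int flags)
{
	return (flags & TIF_TRACE_MASK) != 0;
}
```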

> However we can think of a better solution for upstream if the combined
> solution of kGraft/kpatch is ready to be merged.

We may, indeed, end up doing things very differently there.

At least initially the plan is to go entirely without enforcing any kind
of consistency when patching, so TIF_KGR and the lazy migration will not
exist, and neither will we be stopping the kernel or checking stacks.

This will make applying patches bigger than a single function tricky,
but it's a good initial goal.

-- 
Vojtech Pavlik
Director SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/




Re: [PATCH v3 0/2] s390 vs. kprobes on ftrace

2014-10-21 Thread Vojtech Pavlik
Hello Heiko,

I can confirm that kGraft works well on top of current mainline with
this patch added.

Another reason for a performance impact when kGraft is enabled is that
kGraft still adds two instructions to the syscall path on s390x, as
there is no space left for a kgraft TIF in the first eight bits of
thread info flags. Renumbering the thread info flags such that _TIF_WORK
occupies the first eight bits and TIF_TRACE the next eight would fix
that problem. Do you believe it is feasible?

Vojtech

On Tue, Oct 21, 2014 at 10:30:27AM +0200, Heiko Carstens wrote:
> v3:
> Changed patch 1/2 to incorporate feedback from Steven Rostedt and
> Masami Hiramatsu: rename helper function check_ftrace_location()
> to arch_check_ftrace_location() and convert it to a weak function,
> so architectures can override it without the need for new config
> option.
> 
> v2:
> Changed patch 1/2 to incorporate feedback from Masami Hiramatsu, and
> introduce a new helper function check_ftrace_location().
> The requested ftracetest has been sent as an own patch set, since it
> has no dependency to these patches.
> 
> v1:
> We would like to implement an architecture specific variant of "kprobes
> on ftrace" without using the current HAVE_KPROBES_ON_FTRACE infrastructure
> which is currently only used by x86.
> 
> The rationale for these two patches is:
> - we want to patch the first instruction of the mcount code block to
>   reduce the overhead of the function tracer
> - we'd like to keep the ftrace_caller function as simple as possible and
>   not require it to generate a 100% valid pt_regs structure as required
>   by the combination of DYNAMIC_FTRACE_WITH_REGS and HAVE_KPROBES_ON_FTRACE.
>   This allows us to not generate the psw mask field in the pt_regs
>   structure on each function tracer enabled function, which otherwise would
>   be very expensive. Besides that program check generated pt_regs contents
>   are "more" accurate than program generated ones and don't require any
>   maintenance.
>   And also we can keep the ftrace and kprobes backends quite separated.
> 
> In order to make this work a small common code change is necessary which
> removes a check if kprobe is being placed on an ftrace location (see
> first patch).
> 
> If possible, I'd like to have an ACK from at least one of the kprobes
> maintainers for the first patch and bring it upstream via the s390 tree.
> 
> Thanks,
> Heiko
> 
> Heiko Carstens (2):
>   kprobes: introduce weak arch_check_ftrace_location() helper function
>   s390/ftrace,kprobes: allow to patch first instruction
> 
>  arch/s390/include/asm/ftrace.h  |  52 ++--
>  arch/s390/include/asm/kprobes.h |   1 +
>  arch/s390/include/asm/lowcore.h |   4 +-
>  arch/s390/include/asm/pgtable.h |  12 
>  arch/s390/kernel/asm-offsets.c  |   1 -
>  arch/s390/kernel/early.c|   4 --
>  arch/s390/kernel/ftrace.c   | 132 
> +---
>  arch/s390/kernel/kprobes.c  |  92 
>  arch/s390/kernel/mcount.S   |   1 +
>  arch/s390/kernel/setup.c|   2 -
>  arch/s390/kernel/smp.c  |   1 -
>  include/linux/kprobes.h |   1 +
>  kernel/kprobes.c|  18 +++---
>  scripts/recordmcount.c  |   2 +-
>  scripts/recordmcount.pl |   2 +-
>  15 files changed, 226 insertions(+), 99 deletions(-)
> 
> -- 
> 1.8.5.5
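
The weak-helper pattern from patch 1/2 can be demonstrated in plain userspace C (the kernel's __weak macro expands to the same GCC attribute). When no strong override is linked in, the weak default is the one that runs; an architecture overrides it simply by providing a strong definition of the same symbol in its own object file:

```c
/*
 * Generic code supplies a weak default. An architecture that needs
 * different behaviour defines a strong (non-weak) function with the
 * same name in a separate translation unit, and the linker picks the
 * strong one. With no override present, the default below is used.
 */
__attribute__((weak)) int arch_check_ftrace_location(unsigned long addr)
{
	(void)addr;
	return 0;  /* generic answer: nothing special about this address */
}
```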
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/




Re: [PATCH 1/2] SOUND: kill gameport bits

2014-08-28 Thread Vojtech Pavlik
On Thu, Aug 28, 2014 at 10:03:55PM +0200, Clemens Ladisch wrote:
> Takashi Iwai wrote:
> > did anyone test the patch at all...?
> 
> Appears to work.  The ymfpci gameport seems to be somewhat unreliable:
> 
>  analog.c: 100 out of 17347 reads (0%) on pci:06:06.1/gameport0 failed
>  analog.c: 122 out of  reads (10%) on pci:06:07.0/gameport0 failed

The analog.c gameport read routine is unreliable by design. 

The 558 chip is not an ADC, it's a one-shot timer from 1971. The analog
position of the joystick is measured by timing bit changes on the
gameport.

analog.c does that without disabling interrupts, as the read can take
several milliseconds. analog.c instead detects when an interrupt influenced
the measurement too much and retries.

The retries are counted and reported.

10% is a largeish number, but still something the analog.c driver can
cope with and give reliable results. 

> There is still some dependence of the CPU speed on the loop counts
> calculated by gameport_time(), but it's not very high; the following are
> for 3200 and 800 MHz, 5 times each:
> 
>  gameport gameport7: EMU10K1 is pci:06:06.1/gameport0, io 0xe480, speed 
> 651kHz
>  gameport gameport8: EMU10K1 is pci:06:06.1/gameport0, io 0xe480, speed 
> 651kHz
>  gameport gameport9: EMU10K1 is pci:06:06.1/gameport0, io 0xe480, speed 
> 651kHz
>  gameport gameport10: EMU10K1 is pci:06:06.1/gameport0, io 0xe480, speed 
> 651kHz
>  gameport gameport11: EMU10K1 is pci:06:06.1/gameport0, io 0xe480, speed 
> 651kHz
>  gameport gameport12: EMU10K1 is pci:06:06.1/gameport0, io 0xe480, speed 
> 612kHz
>  gameport gameport13: EMU10K1 is pci:06:06.1/gameport0, io 0xe480, speed 
> 612kHz
>  gameport gameport14: EMU10K1 is pci:06:06.1/gameport0, io 0xe480, speed 
> 612kHz
>  gameport gameport15: EMU10K1 is pci:06:06.1/gameport0, io 0xe480, speed 
> 611kHz
>  gameport gameport16: EMU10K1 is pci:06:06.1/gameport0, io 0xe480, speed 
> 612kHz

The gameport speed is the speed of the i/o to the port. It may change as
frequencies in the system change. It's used for timeouts on digital
gameport protocols only and thus a variation of less than 20% shouldn't cause
trouble.

The analog.c driver uses its own timing calibration to make the
analog_cooked_read() reliable even when the speed of the i/o operations
changes.

What is important is that the GET_TIME() macro is fast (0.1 usec or
less), precise (sub-microsecond resolution) and reliable. Digital
protocols also rely on udelay() and mdelay() waiting precisely for the
amount of time specified.

-- 
Vojtech Pavlik
Director SuSE Labs


Re: [PATCH 1/2] SOUND: kill gameport bits

2014-08-20 Thread Vojtech Pavlik
On Tue, Aug 19, 2014 at 10:18:15PM -0700, Dmitry Torokhov wrote:

> Are you actively testing gameport interfaces with real joysticks/gamepads on
> these cards? And what software is still in use that runs on these old boxes
> (with mainline kernel)?

I still do have a huge box of gameport hardware in my office, it's just
that I haven't opened it for a number of years. However, if this thread
spurred enough interest in gameport devices, I would be willing to open
it and do the needed fixes.

If not, I think dropping it makes sense. I would still shed a tear for all
those weird devices in the box, and possibly design an ATMega-based
USB<->Gameport adapter that actually works and supports even the digital
joysticks.

> > Also, I'm left wondering why e.g. my Athlon XP system (a very popular
> > choice for longer times) would be affected by Cpufreq...
> > And there are no details on how exactly cpufreq is a problem or how this
> > timing issue could be fixed...
> 
> If you take a look at gameport_measure_speed() in gameport.c you will see that
> it counts cycles for timing, which obviously does not work that well when CPU
> frequency changes.
> 
> The bugs have been opened in bugzilla/reported on lists ages ago but nobody
> stepped up to fix that.

It wouldn't be hard to fix: that code was developed when the timing
infrastructure in the kernel was non-existent; making use of it today
would make things a lot easier.

> > The obvious workaround for such an ensuing dearth of hardware support
> > could be USB 15-pin gameport adapters - but are they even supported on
> > Linux? Haven't seen info on this...
> > And even if supported, these adapters (at least the non-perfect ones, as
> > can be seen from reviews on a well-known online shop site) are said to
> > be hit-or-miss.
> > 
> > http://www.flightsim.com/vbfs/showthread.php?238938-joystick-GamePort-to-USB-adapter-question
> > http://reviews.thesource.ca/9026/2600164/nexxtech-usb-gameport-adapter-reviews/reviews.htm
> > 
> 
> They have better chance of being supported ;) I had a couple a few years back
> and they did work for me.

They do work for analog joysticks if you don't want any extended
functionality. I have a couple in said box.

-- 
Vojtech Pavlik
Director SuSE Labs


[PATCH] s390: add support for DYNAMIC_FTRACE_WITH_REGS

2014-07-03 Thread Vojtech Pavlik
Add support for DYNAMIC_FTRACE_WITH_REGS to 64-bit and 31-bit s390
architectures. This is required for kGraft and kpatch to work on s390.

It's done by adding a _regs variant of ftrace_caller that preserves
registers and puts them on the stack in a struct pt_regs layout, and
allows modification of the return address by changing the PSW (instruction
pointer) member of struct pt_regs.

Signed-off-by: Vojtech Pavlik 
Reviewed-by: Jiri Kosina 
Cc: Steven Rostedt 

---

 arch/s390/Kconfig   |1 
 arch/s390/include/asm/ftrace.h  |4 ++
 arch/s390/include/asm/lowcore.h |   12 ---
 arch/s390/kernel/asm-offsets.c  |1 
 arch/s390/kernel/early.c|3 +
 arch/s390/kernel/ftrace.c   |   68 +++-
 arch/s390/kernel/mcount.S   |   43 +
 arch/s390/kernel/mcount64.S |   38 ++
 arch/s390/kernel/setup.c|1 
 arch/s390/kernel/smp.c  |1 
 include/linux/ftrace.h  |1 
 11 files changed, 160 insertions(+), 13 deletions(-)

diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
index bb63499..1de14d3 100644
--- a/arch/s390/Kconfig
+++ b/arch/s390/Kconfig
@@ -113,6 +113,7 @@ config S390
select HAVE_C_RECORDMCOUNT
select HAVE_DEBUG_KMEMLEAK
select HAVE_DYNAMIC_FTRACE
+   select HAVE_DYNAMIC_FTRACE_WITH_REGS
select HAVE_FTRACE_MCOUNT_RECORD
select HAVE_FUNCTION_GRAPH_TRACER
select HAVE_FUNCTION_TRACER
diff --git a/arch/s390/include/asm/ftrace.h b/arch/s390/include/asm/ftrace.h
index bf246da..df87fd2 100644
--- a/arch/s390/include/asm/ftrace.h
+++ b/arch/s390/include/asm/ftrace.h
@@ -17,6 +17,10 @@ static inline unsigned long ftrace_call_adjust(unsigned long 
addr)
 
 #endif /* __ASSEMBLY__ */
 
+#ifdef CONFIG_DYNAMIC_FTRACE
+#define ARCH_SUPPORTS_FTRACE_OPS 1
+#endif
+
 #ifdef CONFIG_64BIT
 #define MCOUNT_INSN_SIZE  12
 #else
diff --git a/arch/s390/include/asm/lowcore.h b/arch/s390/include/asm/lowcore.h
index 4349197..31b064a 100644
--- a/arch/s390/include/asm/lowcore.h
+++ b/arch/s390/include/asm/lowcore.h
@@ -142,9 +142,10 @@ struct _lowcore {
__u32   percpu_offset;  /* 0x02f0 */
__u32   machine_flags;  /* 0x02f4 */
__u32   ftrace_func;/* 0x02f8 */
-   __u32   spinlock_lockval;   /* 0x02fc */
+   __u32   ftrace_regs_func;   /* 0x02fc */
+   __u32   spinlock_lockval;   /* 0x0300 */
 
-   __u8pad_0x0300[0x0e00-0x0300];  /* 0x0300 */
+   __u8pad_0x0304[0x0e00-0x0304];  /* 0x0304 */
 
/*
 * 0xe00 contains the address of the IPL Parameter Information
@@ -287,9 +288,10 @@ struct _lowcore {
__u64   vdso_per_cpu_data;  /* 0x0380 */
__u64   machine_flags;  /* 0x0388 */
__u64   ftrace_func;/* 0x0390 */
-   __u64   gmap;   /* 0x0398 */
-   __u32   spinlock_lockval;   /* 0x03a0 */
-   __u8pad_0x03a0[0x0400-0x03a4];  /* 0x03a4 */
+   __u64   ftrace_regs_func;   /* 0x0398 */
+   __u64   gmap;   /* 0x03a0 */
+   __u32   spinlock_lockval;   /* 0x03a8 */
+   __u8pad_0x03ac[0x0400-0x03ac];  /* 0x03ac */
 
/* Per cpu primary space access list */
__u32   paste[16];  /* 0x0400 */
diff --git a/arch/s390/kernel/asm-offsets.c b/arch/s390/kernel/asm-offsets.c
index afe1715..12ef6ce 100644
--- a/arch/s390/kernel/asm-offsets.c
+++ b/arch/s390/kernel/asm-offsets.c
@@ -150,6 +150,7 @@ int main(void)
DEFINE(__LC_MCCK_CLOCK, offsetof(struct _lowcore, mcck_clock));
DEFINE(__LC_MACHINE_FLAGS, offsetof(struct _lowcore, machine_flags));
DEFINE(__LC_FTRACE_FUNC, offsetof(struct _lowcore, ftrace_func));
+   DEFINE(__LC_FTRACE_REGS_FUNC, offsetof(struct _lowcore, 
ftrace_regs_func));
DEFINE(__LC_DUMP_REIPL, offsetof(struct _lowcore, ipib));
BLANK();
DEFINE(__LC_CPU_TIMER_SAVE_AREA, offsetof(struct _lowcore, 
cpu_timer_save_area));
diff --git a/arch/s390/kernel/early.c b/arch/s390/kernel/early.c
index 0dff972..2ab52d2 100644
--- a/arch/s390/kernel/early.c
+++ b/arch/s390/kernel/early.c
@@ -493,5 +493,8 @@ void __init startup_init(void)
 #ifdef CONFIG_DYNAMIC_FTRACE
S390_lowcore.ftrace_func = (unsigned long)ftrace_caller;
 #endif
+#ifdef CONFIG_DYNAMIC_FTRACE_WITH_REGS
+   S390_lowcore.ftrace_regs_func = (unsigned long)ftrace_regs_caller;
+#endif
lockdep_on();
 }
diff --git a/arch/s390/kernel/ftrace.c b/arch/s390/kernel/ftrace.c
index 54d6493..7d1ec79 100644
--- a/arch/s390/kernel/ftrace.c
+++ b/arch/s390/kernel/ftrace.c
@@ -21,6 +21,7 @@
 
 void ftrace_disable_code(void);
 void ftrace_enable_insn(void);
+void ftrace_enable_regs_insn(void);
 
 #ifdef CONFIG_64BIT
 /*
@@ -56,7 +57,13 @@ asm(
"


Re: [PATCH 00/21] kGraft

2014-06-26 Thread Vojtech Pavlik
On Wed, Jun 25, 2014 at 05:54:50PM +0200, Jiri Kosina wrote:
> > -   wait_event_freezable(khubd_wait,
> > +   wait_event_freezable(khubd_wait,
> >   ({ kgr_task_safe(current);
> > 
> > The changes are somewhat ugly with all the kgraft crap leaking into plces
> > like jbd and freezer and usb. That says to me its not well isolated and
> > there must be a better way of hiding that kgr stuff so we don't have to
> > kepe putting more code into all the kthreads, and inevitably missing them
> > or have people getting it wrong.
> 
> Tejun commented on this very issue during the first RFC submission a 
> couple weeks ago. Part of his proposal was actually to take this as a good 
> pretext to review the existing kernel threads, as the idea is that a big 
> portion of those could easily be converted to workqueues.

In the past few years there has been a significant proliferation of kernel
threads just for cases where something needs to be done from a process
context now and then.

This is underlined by the fact that on an average installation there are
more kernel threads than real userspace processes.

Converting these bits to workqueues would then also take care of their
interaction with the freezer and kthread_stop. The main loop code wouldn't have
to be replicated over and over with slight variations. kgr_task_safe
would then plug into that rather elegantly.

In the absence of this rework, kGraft hijacks the freezer and
kthread_stop to pinpoint the 'end' of the main loop of most kernel
threads and annotates the rest that doesn't handle freezing or stopping
with kgr_task_safe directly.


> It of course has its own implications, such as not being able to 
> prioritize that easily, but there is probably a lot of low-hanging fruits 
> where driver authors and their grandmas have been creating kthreads where 
> workqueue would suffice.

Indeed, on the other hand, there are enough workqueues today and on
realtime systems the need for prioritizing them exists already.

> But it's a very long term goal, and it's probably impractical to make 
> kGraft dependent on it.

Should the kernel thread changes turn out to be a big issue, we could do the
initial submission without them and depend on the kthread->workqueue
rework. Once we add support for multiple patches in progress, the fact
that we don't support kernel threads would not be as painful.
 
> > You add instructions to various hotpaths (despite the CONFIG
> > comment).
> 
> Kernel entry is hidden behind _TIF_WORK_SYSCALL_ENTRY and
> _TIF_ALLWORK_MASK, so it doesn't add any overhead if kGraft is not
> enabled. What other hot paths do you refer to?

As Jiří says, I don't see any hot path where kGraft would add
instructions: Kernel exit and entry are handled outside the hot path
in the _TIF_WORK_SYSCALL_ENTRY and _TIF_ALLWORK_MASK handlers. [patch 15/21]

It adds instructions into the main loops of kernel threads, but those
can hardly be called hot paths as they mostly sleep for long times.

-- 
Vojtech Pavlik
Director SUSE Labs



Re: 64bit x86: NMI nesting still buggy?

2014-05-21 Thread Vojtech Pavlik
On Wed, May 21, 2014 at 04:20:55PM +0200, Borislav Petkov wrote:
> On Wed, May 21, 2014 at 03:42:24PM +0200, Jiri Kosina wrote:
> > Alright, Andy's iret optimization efforts do immediately bring a
> > followup question -- why is this not a problem with iret-based return
> > from #MC possibly interrupting NMI?
> 
> Yeah, and frankly, I don't see this nesting fun at all protected against
> a #MC interrupting it at any point actually. Because once the #MC
> handler returns, it goes into paranoid_exit and that place doesn't
> account for NMIs at all, AFAICS.
> 
> Which would mean:
> 
> * NMI goes off
> * MCE happens, we switch to machine_check which is paranoidzeroentry
> * #MC handler is done -> paranoid_exit -> IRET
> -> boom! Or if not "boom", at least, the NMI gets forgotten.
> 
> Am I missing something?

I think that to get a full BOOM you need a slightly more complex sequence, namely:

* NMI triggered
* NMI handler starts
* MCE happens
* Second NMI triggered and queued
* handler done, IRET
* Second NMI handler starts and overwrites NMI return address on stack
* Second NMI handler ends
* First NMI handler ends and goes into an infinite IRET loop, always
  returning to the beginning of itself

But you do have all the ingredients.

And I don't see any other way out than not calling IRET for MCEs.

-- 
Vojtech Pavlik
Director SUSE Labs



Re: Re: [RFC PATCH 0/2] kpatch: dynamic kernel patching

2014-05-17 Thread Vojtech Pavlik
On Sat, May 17, 2014 at 03:09:57AM +0900, Masami Hiramatsu wrote:
> (2014/05/17 1:27), Jiri Kosina wrote:
> > On Tue, 6 May 2014, Steven Rostedt wrote:
> > 
> >>> However, I also think if users can accept such freezing wait-time,
> >>> it means they can also accept kexec based "checkpoint-restart" patching.
> >>> So, I think the final goal of the kpatch will be live patching without
> >>> stopping the machine. I'm discussing the issue on github #138, but that is
> >>> off-topic. :)
> >>
> >> I agree with Ingo too. Being conservative at first is the right
> >> approach here. We should start out with a stop_machine making sure that
> >> everything is sane before we continue. Sure, that's not much different
> >> than a kexec, but lets take things one step at a time.
> >>
> >> ftrace did the stop_machine (and still does for some archs), and slowly
> >> moved to a more efficient method. kpatch/kgraft should follow suit.
> > 
> > I don't really agree here.
> > 
> > I actually believe that "lazy" switching kgraft is doing provides a little 
> > bit more in the sense of consistency than stop_machine()-based aproach.
> > 
> > Consider this scenario:
> > 
> > void foo()
> > {
> > for (i=0; i<1; i++) {
> > bar(i);
> > something_else(i);
> > }
> > }
> 
> In this case, I'd recommend you to add foo() to replacing target
> as dummy. Then, kpatch can ensure foo() is actually not running. :)

The problem is: Where do you stop?

Adding the whole list of functions that transitively may ever call bar()
can be pretty much the whole kernel. And given that you can be calling
via pointer, you can't often even tell who is calling bar().

So yes, a developer could manually include foo() in the to-be-patched
list, so that it is checked that foo() isn't being executed while
patching.

But that would have to be done entirely manually after thoroughly
understanding the implications of the patch.

> > Let's say you want to live-patch bar(). With stop_machine()-based aproach, 
> > you can easily end-up with old bar() and new bar() being called in two 
> > consecutive iterations before the loop is even exited, right? (especially 
> > on preemptible kernel, or if something_else() goes to sleep).
> > 
> > With lazy-switching implemented in kgraft, this can never happen.
> 
> And I guess similar thing may happen with kgraft. If old function and
> new function share a non-auto variable and they modify it different way,
> the result will be unexpected by the mutual interference.

By non-auto I assume you mean globally accessible variables or data
structures.

Yes, with kGraft the new function has to be backward compatible with
pre-patch global data layout and semantics.

The parameters, return value and their semantics are free to change,
though, and kGraft guarantees consistency there.

-- 
Vojtech Pavlik
SUSE Labs


Re: [RFC 09/16] kgr: mark task_safe in some kthreads

2014-05-14 Thread Vojtech Pavlik
On Wed, May 14, 2014 at 04:59:05PM +0200, Jiri Slaby wrote:

> I see the worst case scenario. (For curious readers, it is for example
> this kthread body:
> while (1) {
>   some_paired_call(); /* invokes pre-patched code */
>   if (kthread_should_stop()) { /* kgraft switches to the new code */
> its_paired_function(); /* invokes patched code (wrong) */
> break;
>   }
>   its_paired_function(); /* the same (wrong) */
> })
> 
> What to do with that now? We have come up with a couple possibilities.
> Would you consider try_to_freeze() a good state-defining function? As it
> is called when a kthread expects weird things can happen, it should be
> safe to switch to the patched version in our opinion.
> 
> The other possibility is to patch every kthread loop (~300) and insert
> kgr_task_safe() semi-manually at some proper place.
> 
> Or if you have any other suggestions we would appreciate that?

A heretic idea would be to convert all kernel threads into functions
that do not sleep and exit after a single iteration and are called from
a central kthread main loop function. That would get all of
kthread_should_stop() and try_to_freeze() and kgr_task_safe() nicely
into one place and at the same time put enough constraint on what the
thread function can do to prevent it from breaking the assumptions of
each of these calls. 

-- 
Vojtech Pavlik
SUSE Labs


Re: 64bit x86: NMI nesting still buggy?

2014-05-01 Thread Vojtech Pavlik
On Tue, Apr 29, 2014 at 05:24:32PM +0200, Jiri Kosina wrote:
> On Tue, 29 Apr 2014, Steven Rostedt wrote:
> 
> > > According to 38.4 of [1], when SMM mode is entered while the CPU is 
> > > handling NMI, the end result might be that upon exit from SMM, NMIs will 
> > > be re-enabled and latched NMI delivered as nested [2].
> > 
> > Note, if this were true, then the x86_64 hardware would be extremely
> > buggy. That's because NMIs are not made to be nested. If SMM's come in
> > during an NMI and re-enables the NMI, then *all* software would break.
> > That would basically make NMIs useless.
> > 
> > The only time I've ever witness problems (and I stress NMIs all the
> > time), is when the NMI itself does a fault. Which my patch set handles
> > properly. 
> 
> Yes, it indeed does. 
> 
> In the scenario I have outlined, the race window is extremely small, plus 
> NMIs don't happen that often, plus SMIs don't happen that often, plus 
> (hopefully) many BIOSes don't enable NMIs upon SMM exit.
> 
> The problem is, that Intel documentation is clear in this respect, and 
> explicitly states it can happen. And we are violating that, which makes me 
> rather nervous -- it'd be very nice to know what is the background of 38.4 
> section text in the Intel docs.

If we cannot disable IST for NMI on x86_64, because it'd break SYSCALL,
and thus cannot handle this situation well, then we should at least try
to detect it post-factum.

In the NMI handler, after ascertaining that the first NMI is executing
(in_nmi not yet set), we check the return address stored on the stack.

If it points anywhere inside the NMI handler (in reality only in the
space between the NMI handler start and the check), an SMI-triggered
nested NMI has happened.

Then we should be able to at least report it before dying.

If it doesn't ever happen: Great, this wasn't a real concern. If it
does, we can pester BIOS vendors.

-- 
Vojtech Pavlik
Director SuSE Labs

