Re: [PATCH] bcache: add cond_resched() in __bch_cache_cmp()
On Thu, Mar 07, 2019 at 11:36:18PM +0800, Coly Li wrote:
> On 2019/3/7 11:06 下午, Shile Zhang wrote:
> >
> > On 2019/3/7 18:34, Coly Li wrote:
> >> On 2019/3/7 1:15 下午, shile.zh...@linux.alibaba.com wrote:
> >>> From: Shile Zhang
> >>>
> >>> Read /sys/fs/bcache//cacheN/priority_stats can take very long
> >>> time with huge cache after long run.
> >>>
> >>> Signed-off-by: Shile Zhang
> >>
> >> Hi Shile,
> >>
> >> Do you test your change ? It will be helpful with more performance data
> >> (what problem that you improved).
> >
> > In case of 960GB SSD cache device, once read of the 'priority_stats'
> > costs about 600ms in our test environment.
>
> After the fix, how much time it takes ?
>
> > The perf tool shown that near 50% CPU time consumed by 'sort()', this
> > means once sort will hold the CPU near 300ms.
> >
> > In our case, the statistics collector reads the 'priority_stats'
> > periodically, it will trigger the schedule latency jitters of the
> > task which shared same CPU core.
>
> Hmm, it seems you just make the sort slower, and nothing more changes.
> Am I right ?

Well, it has to make the sort slower, but it'll also avoid hogging the
CPU (on a non-preemptible kernel), avoiding a potential soft lockup
warning and allowing other tasks to run.

--
Vojtech Pavlik
VP SUSE Labs
Re: [rfd] saving old mice -- button glitching/debouncing
On Sat, Dec 15, 2018 at 10:47:22AM +0100, Pavel Machek wrote:
> > > b) would it be acceptable if done properly? (cmd line option to
> > > enable, avoiding duplicate/wrong events?)
> >
> > Well, for one, you shouldn't be using a timer, all the debouncing can be
> > done by math on the event timestamps.
>
> Not... really, right? You need to send an release some time after
> button indicates release if bounce did not happen. It is similar to
> autorepeat needing a timer.

You can send the first release and ignore all presses and releases in a
time window after that.

> Let me gain some experience with the patch. I don't think hardware
> does as heavy debouncing as you describe.

Microswitches vibrate significantly when clicked. Without hardware
debouncing, you'd have many clicks instead of one even on a new mouse.

--
Vojtech Pavlik
Re: [rfd] saving old mice -- button glitching/debouncing
On Sat, Dec 15, 2018 at 12:24:37AM +0100, Pavel Machek wrote:
> Patch is obviously not ready; but:
>
> a) would it be useful to people

Probably not. Mice already do internal software/hardware debouncing on
switches. If you see duplicate clicks making it all the way to the
kernel driver, something is very wrong with the switch, to the point
where it'll soon fail completely. Hence the value of doing extra
processing in the kernel is very limited.

> b) would it be acceptable if done properly? (cmd line option to
> enable, avoiding duplicate/wrong events?)

Well, for one, you shouldn't be using a timer, all the debouncing can be
done by math on the event timestamps.

You'd also have a hard time distinguishing between doubleclicks and
bounces. Since the mice already filter the quick bounces and there is
significant delay (100ms on USB, similar on PS/2) between updates from
the hardware, a real doubleclick can look absolutely the same as what
you observe as a bounce. As such, it couldn't be enabled by default.

If you really want to go this way, a userspace solution is the better
choice.

--
Vojtech Pavlik
Re: [PATCH] Input: iforce - Add the Saitek R440 Force Wheel
On Wed, Nov 14, 2018 at 10:52:40AM +0100, Tim Schumacher wrote:
> Signed-off-by: Tim Schumacher
> ---
> Please note that I do NOT own this device.
>
> I'm adding this based on the fact that this is an iforce-based
> device and that the Windows driver for the R440 works for the
> Logitech WingMan Formula Force after replacing the device/vendor
> IDs (I got the vendor/device IDs from there as well).
>
> Please don't add this patch if adding devices based on that is
> not ok.
>
> Also (not related to this patch specifically), does anyone know
> what the question marks at the end of some device definitions
> are supposed to mean?

I believe those are unconfirmed devices based on similar reasoning to
yours.

> ---
>  drivers/input/joystick/iforce/iforce-main.c | 1 +
>  drivers/input/joystick/iforce/iforce-usb.c  | 1 +
>  2 files changed, 2 insertions(+)
>
> diff --git a/drivers/input/joystick/iforce/iforce-main.c b/drivers/input/joystick/iforce/iforce-main.c
> index 58d5cfe46526..432deecaeff9 100644
> --- a/drivers/input/joystick/iforce/iforce-main.c
> +++ b/drivers/input/joystick/iforce/iforce-main.c
> @@ -67,6 +67,7 @@ static struct iforce_device iforce_device[] = {
>  { 0x05ef, 0x, "AVB Top Shot Force Feedback Racing Wheel", btn_wheel, abs_wheel, ff_iforce }, //?
>  { 0x061c, 0xc0a4, "ACT LABS Force RS", btn_wheel, abs_wheel, ff_iforce }, //?
>  { 0x061c, 0xc084, "ACT LABS Force RS", btn_wheel, abs_wheel, ff_iforce },
> + { 0x06a3, 0xff04, "Saitek R440 Force Wheel", btn_wheel, abs_wheel, ff_iforce }, //?
>  { 0x06f8, 0x0001, "Guillemot Race Leader Force Feedback", btn_wheel, abs_wheel, ff_iforce }, //?
>  { 0x06f8, 0x0001, "Guillemot Jet Leader Force Feedback", btn_joystick, abs_joystick_rudder, ff_iforce },
>  { 0x06f8, 0x0004, "Guillemot Force Feedback Racing Wheel", btn_wheel, abs_wheel, ff_iforce }, //?
>
> diff --git a/drivers/input/joystick/iforce/iforce-usb.c b/drivers/input/joystick/iforce/iforce-usb.c
> index 78073259c9a1..cdea61ae838d 100644
> --- a/drivers/input/joystick/iforce/iforce-usb.c
> +++ b/drivers/input/joystick/iforce/iforce-usb.c
> @@ -214,6 +214,7 @@ static const struct usb_device_id iforce_usb_ids[] = {
>  { USB_DEVICE(0x05ef, 0x) },  /* AVB Top Shot FFB Racing Wheel */
>  { USB_DEVICE(0x061c, 0xc0a4) }, /* ACT LABS Force RS */
>  { USB_DEVICE(0x061c, 0xc084) }, /* ACT LABS Force RS */
> + { USB_DEVICE(0x06a3, 0xff04) }, /* Saitek R440 Force Wheel */
>  { USB_DEVICE(0x06f8, 0x0001) }, /* Guillemot Race Leader Force Feedback */
>  { USB_DEVICE(0x06f8, 0x0003) }, /* Guillemot Jet Leader Force Feedback */
>  { USB_DEVICE(0x06f8, 0x0004) }, /* Guillemot Force Feedback Racing Wheel */
> --
> 2.19.1.450.ga4b8ab536

--
Vojtech Pavlik
Director SUSE Labs
Re: [PATCH] Input: stop telling users to snail-mail Vojtech
On Tue, Jul 24, 2018 at 11:45:54AM -0700, Dmitry Torokhov wrote:
> I do not think Vojtech wants snail mail these days, even if the
> address were correct (I do not know if it is), so let's remove
> snail-mail instructions from the sources.

Thanks. Nobody has ever sent me mail this way, but nevertheless, the
address isn't valid anymore, so removing it makes good sense.

--
Vojtech Pavlik
Director SUSE Labs
Re: bcache with existing ext4 filesystem
On Tue, Jul 25, 2017 at 08:43:04AM +0200, Pavel Machek wrote:
> On Tue 2017-07-25 00:51:56, Theodore Ts'o wrote:
> > On Mon, Jul 24, 2017 at 10:04:51PM +0200, Pavel Machek wrote:
> > > Question for you was... Is the first 1KiB of each ext4 filesystem still
> > > free and "reserved for a bootloader"?
> >
> > Yes.
>
> Thanks.
>
> > > If I needed more for bcache superblock (8KiB, IIRC), would that be
> > > easy to accomplish on existing filesystem?
> >
> > Huh? Why would the bcache superblock matter when you're talking about
> > the ext4 layout? The bcache superblock will be on the bcache
> > device/partition, and the ext4 superblock will be on the ext4
> > device/partition.
>
> I'd like to enable bcache on already existing ext4 partition. AFAICT
> normal situation, even on the backing device, is:
>
> | 8KiB bcache superblock | 1KiB reserved | ext4 superblock | 400GB data |
>
> Unfortunately, that would mean shifting 400GB data 8KB forward, and
> compatibility problems. So I'd prefer adding bcache superblock into
> the reserved space, so I can have caching _and_ compatibility with
> grub2 etc (and avoid 400GB move):

The common way to do that is to move the beginning of the partition,
assuming your ext4 lives in a partition.

I don't see how overlapping the ext4 and the bcache backing device
starts would give you what you want, because bcache assumes the backing
device data starts with an offset.

--
Vojtech Pavlik
Re: [PATCH 7/7] DWARF: add the config option
On Sun, May 07, 2017 at 04:48:36PM -0500, Josh Poimboeuf wrote:
> > Can objtool verify the unwinder at each address in the kernel, or is that
> > an AI-complete problem?
>
> It can't verify the *unwinder*, but it can verify the data which is fed
> to the unwinder (either DWARF or the structs I proposed above). For
> each function, it follows every possible code path, and it can keep
> track of the stack pointer while doing so.

In that case, the kernel build process can verify the DWARF data and its
compatibility with the kernel unwinder by running the unwinder against
each kernel code address, verifying the output, and bailing if there is
a bug in the toolchain that affects it.

--
Vojtech Pavlik
Director SuSE Labs
Re: [PATCH] Input: joystick: gf2k - change msleep to usleep_range for small msecs
On Tue, Nov 29, 2016 at 01:11:49AM +0530, Aniroop Mathur wrote:
> msleep(1~20) may not do what the caller intends, and will often sleep longer.
> (~20 ms actual sleep for any value given in the 1~20ms range)
> This is not the desired behaviour for many cases like device resume time,
> device suspend time, device enable time, connection time, probe time,
> loops, retry logic, etc
> msleep is built on jiffies / legacy timers which are not precise whereas
> usleep_range is build on top of hrtimers so the wakeups are precise.
> Thus, change msleep to usleep_range for precise wakeups.
>
> For example:
> On a machine with tick rate / HZ as 100, msleep(4) will make the process to
> sleep for a minimum period of 10 ms whereas usleep_range(4000, 4100) will make
> sure that the process does not sleep for more than 4100 us or 4.1ms

And once more, patch not needed.

> Signed-off-by: Aniroop Mathur <a.mat...@samsung.com>
> ---
>  drivers/input/joystick/gf2k.c | 8
>  1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/input/joystick/gf2k.c b/drivers/input/joystick/gf2k.c
> index 0f519db..e9d5095 100644
> --- a/drivers/input/joystick/gf2k.c
> +++ b/drivers/input/joystick/gf2k.c
> @@ -42,7 +42,7 @@ MODULE_LICENSE("GPL");
>
>  #define GF2K_START  400  /* The time we wait for the first bit [400 us] */
>  #define GF2K_STROBE 40   /* The time we wait for the first bit [40 us] */
> -#define GF2K_TIMEOUT 4    /* Wait for everything to settle [4 ms] */
> +#define GF2K_TIMEOUT 4000 /* Wait for everything to settle [4000 us] */
>  #define GF2K_LENGTH 80   /* Max number of triplets in a packet */
>
> @@ -138,7 +138,7 @@ static void gf2k_trigger_seq(struct gameport *gameport, short *seq)
>  i = 0;
>  do {
>      gameport_trigger(gameport);
> -    t = gameport_time(gameport, GF2K_TIMEOUT * 1000);
> +    t = gameport_time(gameport, GF2K_TIMEOUT);
>      while ((gameport_read(gameport) & 1) && t) t--;
>      udelay(seq[i]);
>  } while (seq[++i]);
> @@ -259,11 +259,11 @@ static int gf2k_connect(struct gameport *gameport, struct gameport_driver *drv)
>
>  gf2k_trigger_seq(gameport, gf2k_seq_reset);
>
> -    msleep(GF2K_TIMEOUT);
> +    usleep_range(GF2K_TIMEOUT, GF2K_TIMEOUT + 100);
>
>  gf2k_trigger_seq(gameport, gf2k_seq_digital);
>
> -    msleep(GF2K_TIMEOUT);
> +    usleep_range(GF2K_TIMEOUT, GF2K_TIMEOUT + 100);
>
>  if (gf2k_read_packet(gameport, GF2K_LENGTH, data) < 12) {
>      err = -ENODEV;
> --
> 2.6.2

--
Vojtech Pavlik
Re: [PATCH] Input: joystick: analog - change msleep to usleep_range for small msecs
On Tue, Nov 29, 2016 at 01:11:31AM +0530, Aniroop Mathur wrote:
> msleep(1~20) may not do what the caller intends, and will often sleep longer.
> (~20 ms actual sleep for any value given in the 1~20ms range)
> This is not the desired behaviour for many cases like device resume time,
> device suspend time, device enable time, connection time, probe time,
> loops, retry logic, etc
> msleep is built on jiffies / legacy timers which are not precise whereas
> usleep_range is build on top of hrtimers so the wakeups are precise.
> Thus, change msleep to usleep_range for precise wakeups.
>
> For example:
> On a machine with tick rate / HZ as 100, msleep(3) will make the process to
> sleep for a minimum period of 10 ms whereas usleep_range(3000, 3100) will make
> sure that the process does not sleep for more than 3100 us or 3.1ms

Again, not needed, if the MAX_TIME sleeps are longer, nobody cares.

> Signed-off-by: Aniroop Mathur <a.mat...@samsung.com>
> ---
>  drivers/input/joystick/analog.c | 12 ++--
>  1 file changed, 6 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/input/joystick/analog.c b/drivers/input/joystick/analog.c
> index 3d8ff09..2891704 100644
> --- a/drivers/input/joystick/analog.c
> +++ b/drivers/input/joystick/analog.c
> @@ -88,7 +88,7 @@ MODULE_PARM_DESC(map, "Describes analog joysticks type/capabilities");
>  #define ANALOG_EXTENSIONS 0x7ff00
>  #define ANALOG_GAMEPAD 0x8
>
> -#define ANALOG_MAX_TIME 3    /* 3 ms */
> +#define ANALOG_MAX_TIME 3000 /* 3000 us */
>  #define ANALOG_LOOP_TIME 2000   /* 2 * loop */
>  #define ANALOG_SAITEK_DELAY 200 /* 200 us */
>  #define ANALOG_SAITEK_TIME 2000 /* 2000 us */
> @@ -257,7 +257,7 @@ static int analog_cooked_read(struct analog_port *port)
>  int i, j;
>
>  loopout = (ANALOG_LOOP_TIME * port->loop) / 1000;
> -    timeout = ANALOG_MAX_TIME * port->speed;
> +    timeout = (ANALOG_MAX_TIME / 1000) * port->speed;
>
>  local_irq_save(flags);
>  gameport_trigger(gameport);
> @@ -625,20 +625,20 @@ static int analog_init_port(struct gameport *gameport, struct gameport_driver *d
>
>  gameport_trigger(gameport);
>  t = gameport_read(gameport);
> -    msleep(ANALOG_MAX_TIME);
> +    usleep_range(ANALOG_MAX_TIME, ANALOG_MAX_TIME + 100);
>  port->mask = (gameport_read(gameport) ^ t) & t & 0xf;
>  port->fuzz = (port->speed * ANALOG_FUZZ_MAGIC) / port->loop / 1000 + ANALOG_FUZZ_BITS;
>
>  for (i = 0; i < ANALOG_INIT_RETRIES; i++) {
>      if (!analog_cooked_read(port))
>          break;
> -    msleep(ANALOG_MAX_TIME);
> +    usleep_range(ANALOG_MAX_TIME, ANALOG_MAX_TIME + 100);
>  }
>
>  u = v = 0;
>
> -    msleep(ANALOG_MAX_TIME);
> -    t = gameport_time(gameport, ANALOG_MAX_TIME * 1000);
> +    usleep_range(ANALOG_MAX_TIME, ANALOG_MAX_TIME + 100);
> +    t = gameport_time(gameport, ANALOG_MAX_TIME);
>  gameport_trigger(gameport);
>  while ((gameport_read(port->gameport) & port->mask) && (u < t))
>      u++;
> --
> 2.6.2

--
Vojtech Pavlik
Re: [PATCH] Input: joystick: sidewinder - change msleep to usleep_range for small msecs
ital */
> -    msleep(SW_TIMEOUT);
> +    usleep_range(SW_TIMEOUT, SW_TIMEOUT + 100);
>      i = sw_read_packet(gameport, buf, SW_LENGTH, 0); /* Retry reading packet */
> -    msleep(SW_TIMEOUT);
> +    usleep_range(SW_TIMEOUT, SW_TIMEOUT + 100);
>      dbg("Init 1b: Length %d.", i);
>      if (!i) {                                        /* No data -> FAIL */
>          err = -ENODEV;
> @@ -636,7 +636,7 @@ static int sw_connect(struct gameport *gameport, struct gameport_driver *drv)
>      dbg("Init 2: Mode %d. ID Length %d.", m, j);
>
>      if (j <= 0) {       /* Read ID failed. Happens in 1-bit mode on PP */
> -        msleep(SW_TIMEOUT);
> +        usleep_range(SW_TIMEOUT, SW_TIMEOUT + 100);
>          i = sw_read_packet(gameport, buf, SW_LENGTH, 0); /* Retry reading packet */
>          m |= sw_guess_mode(buf, i);
>          dbg("Init 2b: Mode %d. Length %d.", m, i);
> @@ -644,7 +644,7 @@ static int sw_connect(struct gameport *gameport, struct gameport_driver *drv)
>          err = -ENODEV;
>          goto fail2;
>      }
> -    msleep(SW_TIMEOUT);
> +    usleep_range(SW_TIMEOUT, SW_TIMEOUT + 100);
>      j = sw_read_packet(gameport, idbuf, SW_LENGTH, i); /* Retry reading ID */
>      dbg("Init 2c: ID Length %d.", j);
>  }
> @@ -655,7 +655,7 @@ static int sw_connect(struct gameport *gameport, struct gameport_driver *drv)
>
>  do {
>      k--;
> -    msleep(SW_TIMEOUT);
> +    usleep_range(SW_TIMEOUT, SW_TIMEOUT + 100);
>      i = sw_read_packet(gameport, buf, SW_LENGTH, 0); /* Read data packet */
>      dbg("Init 3: Mode %d. Length %d. Last %d. Tries %d.", m, i, l, k);
>
> --
> 2.6.2

--
Vojtech Pavlik
Re: [RFC PATCH v1.9 12/14] livepatch: create per-task consistency model
On Mon, Apr 04, 2016 at 01:33:07PM -0500, Josh Poimboeuf wrote:
> On Mon, Apr 04, 2016 at 08:27:59PM +0200, Vojtech Pavlik wrote:
> > On Mon, Apr 04, 2016 at 01:21:38PM -0500, Josh Poimboeuf wrote:
> >
> > > Hm, can you explain why it should be copied from the parent?
> > >
> > > I'm thinking the above code is correct for today, but it should still be
> > > changed to be more future-proof.
> > >
> > > Here's my thinking:
> > >
> > > A forked task starts out with no stack, so if I understand correctly, it
> > > can safely start out in the goal universe, regardless of which universe
> > > its parent belongs to.
> > >
> > > However, the current ret_from_fork code is a mess, and Andy Lutomirski
> > > has mentioned that he would like to give newly forked tasks a proper
> > > stack such that instead of jumping to ret_from_fork, they would just
> > > return from schedule(). In that case, it would no longer be safe to
> > > start the new task in the goal universe because it could be "sleeping"
> > > on a to-be-patched function.
> > >
> > > So for proper future proofing, newly forked tasks should be started in
> > > the initial universe (rather than starting in the goal universe or
> > > inheriting the parent's universe). They can then be transitioned over
> > > to the goal universe like any other task. How does that sound?
> >
> > How could a newly forked task start in the old universe if its parent
> > has already been migrated? Any context it inherits will already be from
> > the new universe.
>
> Can you be more specific about "context" and why inheritance of it would
> be a problem?

Currently a forked task starts out with no stack, and as such it can
start in the goal universe.

If we create a synthetic stack, then we may need to start in the initial
universe, as the synthetic stack would likely be created using initial
universe return addresses.

If we simply copy the stack of the parent process, which is in my
opinion the safest way, as it places the fewest assumptions on the
compiler, then it may contain either old or new addresses and we need to
copy the universe flag along.

--
Vojtech Pavlik
Director SUSE Labs
Re: [RFC PATCH v1.9 12/14] livepatch: create per-task consistency model
On Mon, Apr 04, 2016 at 01:33:07PM -0500, Josh Poimboeuf wrote: > On Mon, Apr 04, 2016 at 08:27:59PM +0200, Vojtech Pavlik wrote: > > On Mon, Apr 04, 2016 at 01:21:38PM -0500, Josh Poimboeuf wrote: > > > > > Hm, can you explain why it should be copied from the parent? > > > > > > I'm thinking the above code is correct for today, but it should still be > > > changed to be more future-proof. > > > > > > Here's my thinking: > > > > > > A forked task starts out with no stack, so if I understand correctly, it > > > can safely start out in the goal universe, regardless of which universe > > > its parent belongs to. > > > > > > However, the current ret_from_fork code is a mess, and Andy Lutomirski > > > has mentioned that he would like to give newly forked tasks a proper > > > stack such that instead of jumping to ret_from_fork, they would just > > > return from schedule(). In that case, it would no longer be safe to > > > start the new task in the goal universe because it could be "sleeping" > > > on a to-be-patched function. > > > > > > So for proper future proofing, newly forked tasks should be started in > > > the initial universe (rather than starting in the goal universe or > > > inheriting the parent's universe). They can then be transitioned over > > > to the goal universe like any other task. How does that sound? > > > > How could a newly forked task start in the old universe if its parent > > has already been migrated? Any context it inherits will already be from > > the new universe. > > Can you be more specific about "context" and why inheritance of it would > be a problem? Currently a forked task starts out with no stack, and as such it can start in the goal universe. If we create a synthetic stack, then we may need to start in the initial universe, as the synthetic stack would likely be created using initial universe return addresses. 
If we simply copy the stack of the parent process, which is in my opinion the safest way, as it makes few assumptions about the compiler, then it may contain either old or new addresses and we need to copy the universe flag along. -- Vojtech Pavlik Director SUSE Labs
Re: [RFC PATCH v1.9 12/14] livepatch: create per-task consistency model
On Mon, Apr 04, 2016 at 01:21:38PM -0500, Josh Poimboeuf wrote: > Hm, can you explain why it should be copied from the parent? > > I'm thinking the above code is correct for today, but it should still be > changed to be more future-proof. > > Here's my thinking: > > A forked task starts out with no stack, so if I understand correctly, it > can safely start out in the goal universe, regardless of which universe > its parent belongs to. > > However, the current ret_from_fork code is a mess, and Andy Lutomirski > has mentioned that he would like to give newly forked tasks a proper > stack such that instead of jumping to ret_from_fork, they would just > return from schedule(). In that case, it would no longer be safe to > start the new task in the goal universe because it could be "sleeping" > on a to-be-patched function. > > So for proper future proofing, newly forked tasks should be started in > the initial universe (rather than starting in the goal universe or > inheriting the parent's universe). They can then be transitioned over > to the goal universe like any other task. How does that sound? How could a newly forked task start in the old universe if its parent has already been migrated? Any context it inherits will already be from the new universe. -- Vojtech Pavlik Director SuSE Labs
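The inheritance semantics being argued for here can be shown with a toy userspace model (a sketch only; `struct task`, `fork_task()` and the flag name are invented and are not the kernel's data structures): if the child's stack is a copy of the parent's, its saved return addresses already belong to whichever universe the parent is in, so the per-task flag has to be copied along rather than reset to the initial or goal state.

```c
#include <assert.h>

/* Toy model only: a per-task "universe" flag as discussed in the
 * thread. All names are invented for illustration. */
enum universe { UNIV_OLD, UNIV_NEW };

struct task {
	enum universe univ;	/* which set of functions this task uses */
};

/* Fork with a stack copied from the parent: the child's return
 * addresses come from the parent's universe, so the flag must be
 * inherited, whatever universe the parent has been migrated to. */
static struct task fork_task(const struct task *parent)
{
	struct task child = { .univ = parent->univ };
	return child;
}
```

A child forked from a migrated parent thus starts migrated, and one forked from a not-yet-migrated parent starts in the old universe and is migrated later like any other task.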
Re: [PATCH] livepatch: Update maintainers
On Wed, Mar 16, 2016 at 10:03:36AM -0500, Josh Poimboeuf wrote: > Seth and Vojtech are no longer active maintainers of livepatch, so > remove them in favor of Jessica and Miroslav. > > Also add Petr as a designated reviewer. > > Signed-off-by: Josh Poimboeuf <jpoim...@redhat.com> > --- > MAINTAINERS | 5 +++-- > 1 file changed, 3 insertions(+), 2 deletions(-) > > diff --git a/MAINTAINERS b/MAINTAINERS > index 860e306..e04e0a5 100644 > --- a/MAINTAINERS > +++ b/MAINTAINERS > @@ -6583,9 +6583,10 @@ F: drivers/platform/x86/hp_accel.c > > LIVE PATCHING > M: Josh Poimboeuf <jpoim...@redhat.com> > -M: Seth Jennings <sjenn...@redhat.com> > +M: Jessica Yu <j...@redhat.com> > M: Jiri Kosina <ji...@kernel.org> > -M: Vojtech Pavlik <vojt...@suse.com> > +M: Miroslav Benes <mbe...@suse.cz> > +R: Petr Mladek <pmla...@suse.com> > S: Maintained > F: kernel/livepatch/ > F: include/linux/livepatch.h > -- > 2.4.3 Acked-by: Vojtech Pavlik <vojt...@suse.com> -- Vojtech Pavlik Director SuSE Labs
[PATCH] bcache: Fix writeback_thread never writing back incomplete stripes.
Fix writeback_thread never finishing writing back all dirty data in bcache when partial_stripes_expensive is set, and spinning, consuming 100% of CPU instead. Signed-off-by: Vojtech Pavlik --- This is a fix for the current upstream bcache, not the devel branch. If partial_stripes_expensive is set for a cache set, then writeback_thread always attempts to write full stripes out back to the backing device first. However, since it does that based on a bitmap and not a simple linear search, like the rest of the code of refill_dirty(), it changes the last_scanned pointer so that it never points to 'end'. refill_dirty() then never tries to scan from 'start', resulting in the writeback_thread looping, consuming 100% of CPU, but never making progress in writing out the incomplete dirty stripes in the cache. Scanning the tree after not finding enough full stripes fixes the issue. Incomplete dirty stripes are written to the backing device, the device eventually reaches a clean state if there is nothing dirtying data and writeback_thread sleeps. This also fixes the problem of the cache device being impossible to detach in the partial_stripes_expensive scenario. It may be more efficient to separate the last_scanned field for normal and stripe scans instead.

 drivers/md/bcache/writeback.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
index f1986bc..6f8b81d 100644
--- a/drivers/md/bcache/writeback.c
+++ b/drivers/md/bcache/writeback.c
@@ -382,6 +382,7 @@ static bool refill_dirty(struct cached_dev *dc)
 		refill_full_stripes(dc);
 		if (array_freelist_empty(&buf->freelist))
 			return false;
+		bch_refill_keybuf(dc->disk.c, buf, &end, dirty_pred);
 	}

 	if (bkey_cmp(&buf->last_scanned, &end) >= 0) {
--
2.1.4
-- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
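The failure mode the changelog describes can be sketched with a minimal userspace model (assumptions: the key space is flattened to integers 0..END, position 42 is an arbitrary mid-device value, and the function names only loosely mirror the bcache ones): the stripe pass leaves last_scanned somewhere inside the device, the completion check never fires, and the caller spins; the added linear pass lets the scan reach 'end'.

```c
#include <assert.h>
#include <stdbool.h>

/* Minimal model of the refill_dirty() loop condition; not bcache code. */
#define END 100

struct keybuf_model {
	int last_scanned;
};

/* Models refill_full_stripes(): scans by stripe bitmap, so it leaves
 * last_scanned somewhere inside the device, never at END. */
static void full_stripe_pass(struct keybuf_model *buf)
{
	buf->last_scanned = 42;	/* arbitrary mid-device position */
}

/* Models bch_refill_keybuf(): a linear scan that runs to the end key. */
static void linear_pass(struct keybuf_model *buf)
{
	buf->last_scanned = END;
}

/* Models the tail of refill_dirty(): only when the scan has reached
 * 'end' does it wrap back to 'start' and report the scan complete. */
static bool refill_dirty_model(struct keybuf_model *buf, bool fixed)
{
	full_stripe_pass(buf);
	if (fixed)
		linear_pass(buf);	/* the one-line fix */

	if (buf->last_scanned >= END) {
		buf->last_scanned = 0;
		return true;	/* scan complete: writeback can sleep */
	}
	return false;		/* caller retries at once: busy loop */
}
```

Without the fix the function can never report completion, which is the 100% CPU spin; with it, each full cycle ends with the scan at 'end' and the thread can go to sleep.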
Re: [PATCH v5 02/10] x86: Compile-time asm code validation
On Wed, Jun 10, 2015 at 10:21:36AM -0700, Andy Lutomirski wrote: > On Jun 10, 2015 5:07 AM, "Josh Poimboeuf" wrote: > > > > Add a new CONFIG_ASM_VALIDATION option which adds an asmvalidate host > > tool which runs on every compiled .S file. Its goal is to enforce sane > > rules on all asm code, so that stack debug metadata (frame/back chain > > pointers and/or DWARF CFI metadata) can be made reliable. > > > > It enforces the following rules: > > > > 1. Each callable function must be annotated with the ELF STT_FUNC type. > > This is typically done using the ENTRY/ENDPROC macros. If > > asmvalidate finds a return instruction outside of a function, it > > flags an error, since that usually indicates callable code which > > should be annotated accordingly. > > > > 2. Each callable function must never leave its own bounds (i.e. with a > > jump to outside the function) except when returning. > > Won't that break with sibling/tail calls? GCC can generate those, and > the ia32_ptregs_common label is an example of such a thing. It only validates hand-written assembly, so how could it? -- Vojtech Pavlik Director SUSE Labs
Re: [PATCH v4] Add virtio-input driver.
On Wed, Mar 25, 2015 at 01:51:43PM +1030, Rusty Russell wrote: > Gerd Hoffmann writes: > > virtio-input is basically evdev-events-over-virtio, so this driver isn't > > much more than reading configuration from config space and forwarding > > incoming events to the linux input layer. > > > > Signed-off-by: Gerd Hoffmann > > Is the input layer sane? I've never dealt with it, so I don't know. I'm rather biased having designed it, but I'd say it's reasonable. It certainly has limitations and design mistakes, but none are too bad. One testimony to that is that Android has based its own Input API on it. > What's it used for? For all human input devices, like keyboards, mice, touchscreens, etc. > Imagine a future virtio standard which incorporates this. And a Windows > or FreeBSD implementation of the device and or driver. How ugly would > they be? A Windows translation layer is fairly easy; people porting software from Windows to Linux have told me numerous times that adapting isn't hard. I also believe that at least one of the BSDs has a compatible implementation these days, based on the fact that I was asked to allow copying the headers in the past. -- Vojtech Pavlik Director SUSE Labs
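For readers unfamiliar with the input layer under discussion: the evdev event stream that virtio-input forwards is a sequence of (type, code, value) triplets. The sketch below mirrors that shape in a simplified struct — the real struct input_event in <linux/input.h> also carries a timestamp, and ev_describe() is an invented helper, not a kernel API. The EV_* values match the evdev protocol's event-type codes.

```c
#include <stdint.h>
#include <string.h>

/* Simplified mirror of the evdev event triplet; values match the
 * EV_* event-type constants of the evdev protocol. */
enum { EV_SYN = 0x00, EV_KEY = 0x01, EV_REL = 0x02, EV_ABS = 0x03 };

struct ev_model {
	uint16_t type;	/* event class: sync, key, relative axis, ... */
	uint16_t code;	/* which key/axis within the class */
	int32_t value;	/* press/release state or axis delta/position */
};

/* Invented helper: name the class of one event, the way a consumer of
 * the event stream (virtio or otherwise) would dispatch on it. */
static const char *ev_describe(const struct ev_model *ev)
{
	switch (ev->type) {
	case EV_SYN: return "sync";
	case EV_KEY: return ev->value ? "key press" : "key release";
	case EV_REL: return "relative motion";
	case EV_ABS: return "absolute position";
	default:     return "other";
	}
}
```

This uniform triplet encoding is what makes the layer easy to tunnel over a transport like virtio or to translate to other operating systems' input APIs.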
Re: live kernel upgrades (was: live kernel patching design)
On Tue, Feb 24, 2015 at 11:23:29AM +0100, Ingo Molnar wrote: > > Your upgrade proposal is an *enormous* disruption to the > > system: > > > > - a latency of "well below 10" seconds is completely > > unacceptable to most users who want to patch the kernel > > of a production system _while_ it's in production. > > I think this statement is false for the following reasons. The statement is very true. > - I'd say the majority of system operators of production > systems can live with a couple of seconds of delay at a > well defined moment of the day or week - with gradual, > pretty much open ended improvements in that latency > down the line. In the most usual corporate setting any noticeable outage, even out of business hours, requires advance notice, and an agreement of all stakeholders - teams that depend on the system. If a live patching technology introduces an outage, it's not "live" and because of these bureaucratic reasons, it will not be used and a regular reboot will be scheduled instead. > - I think your argument ignores the fact that live > upgrades would extend the scope of 'users willing to > patch the kernel of a production system' _enormously_. > > For example, I have a production system with this much > uptime: > > 10:50:09 up 153 days, 3:58, 34 users, load average: 0.00, 0.02, 0.05 > > While currently I'm reluctant to reboot the system to > upgrade the kernel (due to a reboot's intrusiveness), > and that is why it has achieved a relatively high > uptime, but I'd definitely allow the kernel to upgrade > at 0:00am just fine. (I'd even give it up to a few > minutes, as long as TCP connections don't time out.) > > And I don't think my usecase is special. I agree that this is useful. But it is a different problem that only partially overlaps with what we're trying to achieve with live patching. 
If you can make full kernel upgrades work this way, which I doubt is achievable in the next 10 years due to all the research and infrastructure needed, then you certainly gain an additional group of users. And a great tool. A large portion of those that ask for live patching won't use it, though. But honestly, I prefer a solution that works for small patches now over a solution for unlimited patches sometime in the next decade. > What gradual improvements in live upgrade latency am I > talking about? > > - For example the majority of pure user-space process > pages in RAM could be saved from the old kernel over > into the new kernel - i.e. they'd stay in place in RAM, > but they'd be re-hashed for the new data structures. > This avoids a big chunk of checkpointing overhead. I'd have hoped this would be a given. If you can't preserve memory contents and have to re-load from disk, you can just as well reboot entirely; the time needed will not be much more. > - Likewise, most of the page cache could be saved from an > old kernel to a new kernel as well - further reducing > checkpointing overhead. > > - The PROT_NONE mechanism of the current NUMA balancing > code could be used to transparently mark user-space > pages as 'checkpointed'. This would reduce system > interruption as only 'newly modified' pages would have > to be checkpointed when the upgrade happens. > > - Hardware devices could be marked as 'already in well > defined state', skipping the more expensive steps of > driver initialization. > > - Possibly full user-space page tables could be preserved > over an upgrade: this way user-space execution would be > unaffected even in the micro level: cache layout, TLB > patterns, etc. > > There's lots of gradual speedups possible with such a model > IMO. Yes, as I say above, guaranteeing decades of employment. 
;) > With live kernel patching we run into a brick wall of > complexity straight away: we have to analyze the nature of > the kernel modification, in the context of live patching, > and that only works for the simplest of kernel > modifications. But you're able to _use_ it. > With live kernel upgrades no such brick wall exists, just > about any transition between kernel versions is possible. The brick wall you run into is "I need to implement full kernel state serialization before I can do anything at all." That's something that isn't even clear _how_ to do. Particularly with the Linux kernel's development model, where internal ABI and structures are always in flux, it may not even be realistic. > Granted, with live kernel upgrades it's much more complex > to get the 'simple' case into an even rudimentarily working > fashion (full userspace state has to be enumerated, > saved and restored), but once we are there, it's a whole new > category of goodness and it probably covers 90%+ of the live > kernel patching usecases on day 1 already ... Feel free to start working on it. I'll stick with live patching. -- Vojtech Pavlik Director SUSE Labs
Re: live kernel upgrades (was: live kernel patching design)
On Tue, Feb 24, 2015 at 11:53:28AM +0100, Ingo Molnar wrote: > > * Jiri Kosina wrote: > > > [...] We could optimize the kernel the craziest way we > > can, but hardware takes its time to reinitialize. And in > > most cases, you'd really need to reinitialize it; [...] > > If we want to reinitialize a device, most of the longer > initialization latencies during bootup these days involve > things like: 'poke hardware, see if there's any response'. > Those are mostly going away quickly with modern, > well-enumerated hardware interfaces. > > Just try a modprobe of a random hardware driver - most > initialization sequences are very fast. (That's how people > are able to do cold bootups in less than 1 second.) Have you ever tried to boot a system with a large (> 100) number of drives connected over FC? That takes time to discover and you have to do the discovery as the configuration could have changed while you were not looking. Or a machine with terabytes of memory? Just initializing the memory takes minutes. Or a desktop with USB? And you have to reinitialize the USB bus and the state of all the USB devices, because an application might be accessing files on a USB drive. > In theory this could also be optimized: we could avoid the > reinitialization step through an upgrade via relatively > simple means, for example if drivers define their own > version and the new kernel's driver checks whether the > previous state is from a compatible driver. Then the new > driver could do a shorter initialization sequence. There you're clearly getting into the "so complex to maintain that it'll never work reliably" territory. > But I'd only do it only in special cases, where for some > reason the initialization sequence takes longer time and it > makes sense to share hardware discovery information between > two versions of the driver. I'm not convinced such a > mechanism is necessary in the general case. 
-- Vojtech Pavlik Director SUSE Labs
Re: live kernel upgrades (was: live kernel patching design)
On Tue, Feb 24, 2015 at 10:44:05AM +0100, Ingo Molnar wrote: > > This is the most common argument that's raised when live > > patching is discussed. "Why do we need live patching when we > > have redundancy?" > > My argument is that if we start off with a latency of 10 > seconds and improve that gradually, it will be good for > everyone with a clear, actionable route for even those who > cannot take a 10 seconds delay today. Sure, we can do it that way. Or do it in the other direction. Today we have a tool (livepatch) in the kernel that can apply trivial single-function fixes without a measurable disruption to applications. And we can improve it gradually to expand the range of fixes it can apply. Dependent functions can be done by kGraft's lazy migration. Limited data structure changes can be handled by shadowing. Major data structure and/or locking changes require stopping the kernel, and trapping all tasks at the kernel/userspace boundary is clearly the cleanest way to do that. It comes at a steep latency cost, though. Full code replacement without change scope consideration requires full serialization and deserialization of hardware and userspace interface state, which is something we don't have today and would require work on every single driver. Possible, but probably a decade of effort. With this approach you have something useful at every point and every piece of effort put in gives you a reward. > Let's see the use cases: > > > [...] Examples would be legacy applications which can't > > run in an active-active cluster and need to be restarted > > on failover. > > Most clusters (say web frontends) can take a stoppage of a > couple of seconds. It's easy to find examples of workloads that can be stopped. It doesn't rule out a significant set of those where stopping them is very expensive. > > Another usecase is large HPC clusters, where all nodes > > have to run carefully synchronized. 
Once one gets behind > > in a calculation cycle, others have to wait for the > > results and the efficiency of the whole cluster goes > > down. [...] > > I think calculation nodes on large HPC clusters qualify as > the specialized case that I mentioned, where the update > latency could be brought down into the 1 second range. > > But I don't think calculation nodes are patched in the > typical case: you might want to patch Internet facing > frontend systems, the rest is left as undisturbed as > possible. So I'm not even sure this is a typical usecase. They're not patched for security bugs, but stability bugs are an important issue for multi-month calculations. > In any case, there's no hard limit on how fast such a > kernel upgrade can get in principle, and the folks who care > about that latency will sure help out optimizing it and > many HPC projects are well funded. So far, unless you come up with an effective solution, if you're catching all tasks at the kernel/userspace boundary (the "Kragle" approach), the service interruption is effectively unbounded due to tasks in D state. > > The value of live patching is in near zero disruption. > > Latency is a good attribute of a kernel upgrade mechanism, > but it's by far not the only attribute and we should > definitely not design limitations into the approach and > hurt all the other attributes, just to optimize that single > attribute. It's an attribute I'm not willing to give up. On the other hand, I definitely wouldn't argue against having modes of operation where the latency is higher and the tool is more powerful. > I.e. don't make it a single-issue project. There is no need to worry about that. -- Vojtech Pavlik Director SUSE Labs
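The kGraft-style lazy migration mentioned above can be illustrated with a toy model (all names invented; this is not the kernel implementation): each task carries a flag, the patched call site dispatches on that flag, and the flag only flips at a safe point such as the return to userspace, so no single task ever mixes old and new semantics within one kernel entry.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of lazy per-task migration; names are invented. */
enum universe { UNIV_OLD, UNIV_NEW };

struct task_model {
	enum universe univ;
	bool in_kernel;		/* currently inside a syscall? */
};

static int func_old(void) { return 1; }	/* original function */
static int func_new(void) { return 2; }	/* patched replacement */

/* The redirected call site: each task consistently sees the version
 * matching its own flag until it is migrated. */
static int patched_call(const struct task_model *t)
{
	return t->univ == UNIV_NEW ? func_new() : func_old();
}

/* Safe point: a task returning to userspace holds no kernel state,
 * so it can be migrated without inspecting its stack. */
static void return_to_userspace(struct task_model *t)
{
	t->in_kernel = false;
	t->univ = UNIV_NEW;
}
```

The latency cost discussed in the thread comes from tasks that never reach such a safe point - e.g. tasks sleeping in D state - which is why trapping everything at the boundary gives an effectively unbounded completion time.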
Re: live kernel upgrades (was: live kernel patching design)
On Tue, Feb 24, 2015 at 10:44:05AM +0100, Ingo Molnar wrote: This is the most common argument that's raised when live patching is discussed. Why do need live patching when we have redundancy? My argument is that if we start off with a latency of 10 seconds and improve that gradually, it will be good for everyone with a clear, actionable route for even those who cannot take a 10 seconds delay today. Sure, we can do it that way. Or do it in the other direction. Today we have a tool (livepatch) in the kernel that can apply trivial single-function fixes without a measurable disruption to applications. And we can improve it gradually to expand the range of fixes it can apply. Dependent functions can be done by kGraft's lazy migration. Limited data structure changes can be handled by shadowing. Major data structure and/or locking changes require stopping the kernel, and trapping all tasks at the kernel/userspace boundary is clearly the cleanest way to do that. I comes at a steep latency cost, though. Full code replacement without change scope consideration requires full serialization and deserialization of hardware and userspace interface state, which is something we don't have today and would require work on every single driver. Possible, but probably a decade of effort. With this approach you have something useful at every point and every piece of effort put in gives you a rewars. Lets see the use cases: [...] Examples would be legacy applications which can't run in an active-active cluster and need to be restarted on failover. Most clusters (say web frontends) can take a stoppage of a couple of seconds. It's easy to find examples of workloads that can be stopped. It doesn't rule out a significant set of those where stopping them is very expensive. Another usecase is large HPC clusters, where all nodes have to run carefully synchronized. Once one gets behind in a calculation cycle, others have to wait for the results and the efficiency of the whole cluster goes down. 
[...] I think calculation nodes on large HPC clusters qualify as the specialized case that I mentioned, where the update latency could be brought down into the 1 second range. But I don't think calculation nodes are patched in the typical case: you might want to patch Internet facing frontend systems, the rest is left as undisturbed as possible. So I'm not even sure this is a typical usecase. They're not patched for security bugs, but stability bugs are an important issue for multi-month calculations. In any case, there's no hard limit on how fast such a kernel upgrade can get in principle, and the folks who care about that latency will sure help out optimizing it and many HPC projects are well funded. So far, unless you come up with an effective solutions, if you're catching all tasks at the kernel/userspace boundary (the Kragle approach), the service interruption is effectively unbounded due to tasks in D state. The value of live patching is in near zero disruption. Latency is a good attribute of a kernel upgrade mechanism, but it's by far not the only attribute and we should definitely not design limitations into the approach and hurt all the other attributes, just to optimize that single attribute. It's an attribute I'm not willing to give up. On the other hand, I definitely wouldn't argue against having modes of operation where the latency is higher and the tool is more powerful. I.e. don't make it a single-issue project. There is no need to worry about that. -- Vojtech Pavlik Director SUSE Labs -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: live kernel upgrades (was: live kernel patching design)
On Tue, Feb 24, 2015 at 11:23:29AM +0100, Ingo Molnar wrote: Your upgrade proposal is an *enormous* disruption to the system: - a latency of well below 10 seconds is completely unacceptable to most users who want to patch the kernel of a production system _while_ it's in production. I think this statement is false for the following reasons. The statement is very true. - I'd say the majority of system operators of production systems can live with a couple of seconds of delay at a well defined moment of the day or week - with gradual, pretty much open ended improvements in that latency down the line. In the most usual corporate setting any noticeable outage, even out of business hours, requires an ahead notice, and an agreement of all stakeholders - teams that depend on the system. If a live patching technology introduces an outage, it's not live and because of these bureaucratic reasons, it will not be used and a regular reboot will be scheduled instead. - I think your argument ignores the fact that live upgrades would extend the scope of 'users willing to patch the kernel of a production system' _enormously_. For example, I have a production system with this much uptime: 10:50:09 up 153 days, 3:58, 34 users, load average: 0.00, 0.02, 0.05 While currently I'm reluctant to reboot the system to upgrade the kernel (due to a reboot's intrusiveness), and that is why it has achieved a relatively high uptime, but I'd definitely allow the kernel to upgrade at 0:00am just fine. (I'd even give it up to a few minutes, as long as TCP connections don't time out.) And I don't think my usecase is special. I agree that this is useful. But it is a different problem that only partially overlaps with what we're trying to achieve with live patching. If you can make full kernel upgrades to work this way, which I doubt is achievable in the next 10 years due to all the research and infrastructure needed, then you certainly gain an additional group of users. And a great tool. 
A large portion of those that ask for live patching won't use it, though. But honestly, I prefer a solution that works for small patches now, than a solution for unlimited patches sometime in next decade. What gradual improvements in live upgrade latency am I talking about? - For example the majority of pure user-space process pages in RAM could be saved from the old kernel over into the new kernel - i.e. they'd stay in place in RAM, but they'd be re-hashed for the new data structures. This avoids a big chunk of checkpointing overhead. I'd have hoped this would be a given. If you can't preserve memory contents and have to re-load from disk, you can just as well reboot entirely, the time needed will not be much more.. - Likewise, most of the page cache could be saved from an old kernel to a new kernel as well - further reducing checkpointing overhead. - The PROT_NONE mechanism of the current NUMA balancing code could be used to transparently mark user-space pages as 'checkpointed'. This would reduce system interruption as only 'newly modified' pages would have to be checkpointed when the upgrade happens. - Hardware devices could be marked as 'already in well defined state', skipping the more expensive steps of driver initialization. - Possibly full user-space page tables could be preserved over an upgrade: this way user-space execution would be unaffected even in the micro level: cache layout, TLB patterns, etc. There's lots of gradual speedups possible with such a model IMO. Yes, as I say above, guaranteeing decades of employment. ;) With live kernel patching we run into a brick wall of complexity straight away: we have to analyze the nature of the kernel modification, in the context of live patching, and that only works for the simplest of kernel modifications. But you're able to _use_ it. With live kernel upgrades no such brick wall exists, just about any transition between kernel versions is possible. 
The brick wall you run into is "I need to implement full kernel state serialization before I can do anything at all." That's something where it isn't even clear _how_ to do it. Particularly with the Linux kernel's development model, where internal ABIs and structures are always in flux, it may not even be realistic.

> Granted, with live kernel upgrades it's much more complex to get the
> 'simple' case into an even rudimentarily working fashion (full
> userspace state has to be enumerated, saved and restored), but once we
> are there, it's a whole new category of goodness and it probably
> covers 90%+ of the live kernel patching usecases on day 1 already ...

Feel free to start working on it. I'll stick with live patching.

--
Vojtech Pavlik
Director SUSE Labs
Re: live kernel upgrades (was: live kernel patching design)
On Tue, Feb 24, 2015 at 11:53:28AM +0100, Ingo Molnar wrote:

> * Jiri Kosina <jkos...@suse.cz> wrote:
>
> [...]
>
> > We could optimize the kernel the craziest way we can, but hardware takes its time to reinitialize. And in most cases, you'd really need to reinitialize it; [...]
>
> If we want to reinitialize a device, most of the longer initialization latencies during bootup these days involve things like: 'poke hardware, see if there's any response'. Those are mostly going away quickly with modern, well-enumerated hardware interfaces. Just try a modprobe of a random hardware driver - most initialization sequences are very fast. (That's how people are able to do cold bootups in less than 1 second.)

Have you ever tried to boot a system with a large (> 100) number of drives connected over FC? That takes time to discover, and you have to do the discovery, as the configuration could have changed while you were not looking. Or a machine with terabytes of memory? Just initializing the memory takes minutes. Or a desktop with USB? And you have to reinitialize the USB bus and the state of all the USB devices, because an application might be accessing files on a USB drive.

> In theory this could also be optimized: we could avoid the reinitialization step through an upgrade via relatively simple means, for example if drivers define their own version and the new kernel's driver checks whether the previous state is from a compatible driver. Then the new driver could do a shorter initialization sequence.

There you're clearly getting into the "so complex to maintain that it'll never work reliably" territory.

> But I'd do it only in special cases, where for some reason the initialization sequence takes a longer time and it makes sense to share hardware discovery information between two versions of the driver.

I'm not convinced such a mechanism is necessary in the general case.
--
Vojtech Pavlik
Director SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: live kernel upgrades (was: live kernel patching design)
On Mon, Feb 23, 2015 at 11:42:17AM +0100, Richard Weinberger wrote:

> > Of course, if you are random Joe User, you can do whatever you want, i.e. also compile your own home-brew patches and apply them randomly and brick your system that way. But that's in no way different to what you as Joe User can do today; there is nothing that will prevent you from shooting yourself in a foot if you are creative.
>
> Sorry if I ask something that got already discussed, I did not follow the whole live-patching discussion.
>
> How much of the userspace tools will be publicly available? With live-patching mainline the kernel offers the mechanism, but random Joe user still needs the tools to create good live patches.

All the tools for kGraft and kpatch are available in public git repositories. Also, while kGraft has tools to automate the generation of patches, these are generally not required to create a patch.

--
Vojtech Pavlik
Director SUSE Labs
Re: live kernel upgrades (was: live kernel patching design)
On Sun, Feb 22, 2015 at 03:01:48PM -0800, Andrew Morton wrote:

> On Sun, 22 Feb 2015 20:13:28 +0100 (CET) Jiri Kosina wrote:
>
> > But if you ask the folks who are hungry for live bug patching, they wouldn't care.
> >
> > You mentioned "10 seconds", that's more or less equal to infinity to them.
>
> "10 seconds outage is unacceptable, but we're running our service on a single machine with no failover." Who is doing this??

This is the most common argument that's raised when live patching is discussed: "Why do we need live patching when we have redundancy?"

People who are asking for live patching typically do have failover in place, but prefer not to have to use it when they don't have to. In many cases, the failover just can't be made transparent to the outside world and there is a short outage. Examples would be legacy applications which can't run in an active-active cluster and need to be restarted on failover. Or trading systems, where the calculations must be strictly serialized and response times are counted in tens of microseconds.

Another usecase is large HPC clusters, where all nodes have to run carefully synchronized. Once one gets behind in a calculation cycle, others have to wait for the results and the efficiency of the whole cluster goes down. There are people who run realtime on them for that reason. Dumping all data and restarting the HPC cluster takes a lot of time, and many nodes (out of tens of thousands) may not come back up, making the restore from media difficult. Doing a rolling upgrade stalls the nodes one by one for 10+ seconds, which, times 10k, is a long time, too.

And even the case where you have a perfect setup with everything redundant and with instant failover does benefit from live patching. Since you have to plan for failure, you have to plan for failure while patching, too. With live patching you need 2 servers minimum (or N+1); without, you need 3 (or N+2), as one will be offline during the upgrade process.
10 seconds of outage may be acceptable in a disaster scenario, but not necessarily for a regular update scenario. The value of live patching is in near-zero disruption.

--
Vojtech Pavlik
Director SUSE Labs
Re: live patching design (was: Re: [PATCH 1/3] sched: add sched_task_call())
On Fri, Feb 20, 2015 at 08:49:01PM +0100, Ingo Molnar wrote:

> > ... the choice the sysadmins have here is either have the system running in an intermediate state, or have the system completely dead for the *same time*. Because to finish the transition successfully, all the tasks have to be woken up in any case.
>
> That statement is false: an 'intermediate state' system where 'new' tasks are still running is still running and will interfere with the resolution of 'old' tasks.

Can you suggest a way in which they would interfere? The transition happens on entering or returning from a syscall; there is no influence between individual tasks. If you mean that the patch author has to consider the fact that both old and new code will be running simultaneously, then yes, they have to.

> > But I do get your point; what you are basically saying is that your preference is what kgraft is doing, and an option to allow for a global synchronization point should be added in parallel to the gradual lazy migration.
>
> I think you misunderstood: the 'simple' method I outlined does not just 'synchronize', it actually executes the live patching atomically, once all tasks are gathered and we know they are _all_ in a safe state.

The 'simple' method has to catch and freeze all tasks one by one in syscall entry/exit, at the kernel/userspace boundary, until all are frozen, and then patch the system atomically.

This means that each and every sleeping task in the system has to be woken up in some way (sending a signal ...) to exit from a syscall it is sleeping in. Same for CPU hogs. All kernel threads need to be parked.

This is exactly what you need to do for kGraft to complete patching.

This may take a significant amount of time to achieve, and you won't be able to use a userspace script to send the signals, because it'd be frozen immediately.

> I.e.
> it's in essence the strong stop-all atomic patching model of 'kpatch', combined with the reliable avoidance of kernel stacks that 'kgraft' uses.
>
> That should be the starting point, because it's the most reliable method.

In the consistency models discussion, this was marked the "LEAVE_KERNEL+SWITCH_KERNEL" model. It's indeed the strongest model of all, but it also comes at the highest cost in terms of impact on running tasks. It's so high (the interruption may be seconds or more) that it was deemed not worth implementing.

--
Vojtech Pavlik
Director SUSE Labs
Re: [PATCH 1/3] sched: add sched_task_call()
On Thu, Feb 19, 2015 at 11:32:55AM -0600, Josh Poimboeuf wrote:

> On Thu, Feb 19, 2015 at 06:19:29PM +0100, Vojtech Pavlik wrote:
> > On Thu, Feb 19, 2015 at 11:03:53AM -0600, Josh Poimboeuf wrote:
> > > On Thu, Feb 19, 2015 at 05:33:59PM +0100, Vojtech Pavlik wrote:
> > > > On Thu, Feb 19, 2015 at 10:24:29AM -0600, Josh Poimboeuf wrote:
> > > > > > No, these tasks will _never_ make syscalls. So you need to guarantee they don't accidentally enter the kernel while you flip them. Something like so should do.
> > > > > >
> > > > > > You set TIF_ENTER_WAIT on them, check they're still in userspace, flip them then clear TIF_ENTER_WAIT.
> > > > >
> > > > > Ah, that's a good idea. But how do we check if they're in user space?
> > > >
> > > > I don't see the benefit in holding them in a loop - you can just as well flip them from the syscall code as kGraft does.
> > >
> > > But we were talking specifically about HPC tasks which never make syscalls.
> >
> > Yes. I'm saying that rather than guaranteeing they don't enter the kernel (by having them spin) you can flip them in case they try to do that instead. That solves the race condition just as well.
>
> Ok, gotcha.
>
> We'd still need a safe way to check if they're in user space though.

Having a safe way would be very nice and actually quite useful in other cases, too. For this specific purpose, however, we don't need a very safe way. We don't require atomicity in any way; we don't mind even if it creates false negatives; only false positives would be bad.

kGraft looks at the stacktrace of CPU hogs, and if it finds no kernel addresses there, it assumes userspace. Not very nice, but it does the job.

> How about with a TIF_IN_USERSPACE thread flag? It could be cleared/set right at the border.
> Then for running tasks it's as simple as:
>
>	if (test_tsk_thread_flag(task, TIF_IN_USERSPACE))
>		klp_switch_task_universe(task);

--
Vojtech Pavlik
Director SUSE Labs
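The "no kernel addresses in the stacktrace" heuristic described above reduces to an address-range test. Below is a minimal, self-contained sketch of that test as ordinary userspace C; all names are invented for the illustration, the boundary constant assumes the x86_64 upper-half kernel mapping, and a real implementation would walk a stacktrace captured from the task rather than a plain array:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed boundary: on x86_64, kernel text/data live in the upper
 * canonical half of the address space.  Other architectures differ. */
#define KERNEL_ADDR_START 0xffff800000000000UL

/* Return true when no address in the captured trace falls into the
 * kernel range, i.e. the task is assumed to be running in userspace. */
static bool trace_looks_userspace(const uint64_t *trace, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (trace[i] >= KERNEL_ADDR_START)
			return false;	/* kernel address: not safe to flip */
	return true;
}
```

As noted in the mail, this tolerates false negatives (a userspace task mistaken for in-kernel merely delays its flip); only a false positive would be harmful, which is why the check errs toward returning false.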
Re: [PATCH 1/3] sched: add sched_task_call()
On February 19, 2015 6:32:55 PM CET, Josh Poimboeuf wrote:

> > Yes. I'm saying that rather than guaranteeing they don't enter the kernel (by having them spin) you can flip them in case they try to do that instead. That solves the race condition just as well.
>
> Ok, gotcha.
>
> We'd still need a safe way to check if they're in user space though.
>
> How about with a TIF_IN_USERSPACE thread flag? It could be cleared/set right at the border. Then for running tasks it's as simple as:
>
>	if (test_tsk_thread_flag(task, TIF_IN_USERSPACE))
>		klp_switch_task_universe(task);

The s390x arch used to have a TIF_SYSCALL, which was doing exactly that (well, negated). I think it would work well, but it isn't entirely for free: two instructions per syscall and an extra TIF bit, which are (near) exhausted on some archs.

--
Vojtech Pavlik
Director SuSE Labs
Re: [PATCH 1/3] sched: add sched_task_call()
On Thu, Feb 19, 2015 at 11:03:53AM -0600, Josh Poimboeuf wrote:

> On Thu, Feb 19, 2015 at 05:33:59PM +0100, Vojtech Pavlik wrote:
> > On Thu, Feb 19, 2015 at 10:24:29AM -0600, Josh Poimboeuf wrote:
> > > > No, these tasks will _never_ make syscalls. So you need to guarantee they don't accidentally enter the kernel while you flip them. Something like so should do.
> > > >
> > > > You set TIF_ENTER_WAIT on them, check they're still in userspace, flip them then clear TIF_ENTER_WAIT.
> > >
> > > Ah, that's a good idea. But how do we check if they're in user space?
> >
> > I don't see the benefit in holding them in a loop - you can just as well flip them from the syscall code as kGraft does.
>
> But we were talking specifically about HPC tasks which never make syscalls.

Yes. I'm saying that rather than guaranteeing they don't enter the kernel (by having them spin) you can flip them in case they try to do that instead. That solves the race condition just as well.

--
Vojtech Pavlik
Director SUSE Labs
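To make the "flip them from the syscall code" idea concrete: below is a toy userspace model in plain C of the kGraft-style lazy switch, in which each task keeps using the old implementation until it crosses a (simulated) syscall boundary. This is not kernel code; every name here (impl_old, patched_call, struct task, and so on) is invented for the illustration:

```c
#include <assert.h>
#include <stdbool.h>

/* Two versions of a patched function. */
static int impl_old(void) { return 1; }	/* original (buggy) code */
static int impl_new(void) { return 2; }	/* patched replacement */

/* Each task carries a per-task "universe" flag, the analogue of the
 * flag kGraft checks to decide which implementation a task sees. */
struct task {
	bool new_universe;
};

/* Call site of the patched function: dispatch by the task's flag. */
static int patched_call(struct task *t)
{
	return t->new_universe ? impl_new() : impl_old();
}

/* What the syscall entry/exit hook would do: flip the task lazily,
 * at a point where it holds no kernel state of its own. */
static void syscall_boundary(struct task *t)
{
	t->new_universe = true;
}
```

Until a given task crosses the boundary, old and new implementations run side by side, which is exactly why the patch author has to account for both being live at once.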
Re: [PATCH 1/3] sched: add sched_task_call()
On Thu, Feb 19, 2015 at 10:24:29AM -0600, Josh Poimboeuf wrote:

> > No, these tasks will _never_ make syscalls. So you need to guarantee they don't accidentally enter the kernel while you flip them. Something like so should do.
> >
> > You set TIF_ENTER_WAIT on them, check they're still in userspace, flip them then clear TIF_ENTER_WAIT.
>
> Ah, that's a good idea. But how do we check if they're in user space?

I don't see the benefit in holding them in a loop - you can just as well flip them from the syscall code as kGraft does.

> > I still absolutely hate you need to disturb userspace like that. Signals are quite visible and perturb userspace state.
>
> Yeah, SIGSTOP on a sleeping task can be intrusive to user space if it results in EINTR being returned from a system call. It's not ideal, but it's much less intrusive than something like suspend.
>
> But anyway we leave it up to the user to decide whether they want to take that risk, or wait for the task to wake up on its own, or cancel the patch.
>
> > Also, you cannot SIGCONT a task that was SIGSTOP'ed by userspace for what they thought was a good reason. You'd wreck their state.
>
> Hm, that's a good point. Maybe use the freezer instead of signals?
>
> (Again this would only be for those user tasks which are sleeping on a patched function)
>
> > > But now I'm thinking that kthreads will almost never be a problem. Most kthreads are basically this:
> >
> > You guys are way too optimistic; maybe its because I've worked on realtime stuff too much, but I'm always looking at worst cases. If you can't handle those, I feel you might as well not bother :-)
>
> Well, I think we're already resigned to the fact that live patching won't work for every patch, every time. And that the patch author must know what they're doing, and must do it carefully.
> Our target worst case is that the patching fails gracefully and the user can't patch their system with that particular patch. Or that the system remains in a partially patched state forever, if the user is ok with that.
>
> > > Patching thread_fn wouldn't be possible unless we killed the thread.
> >
> > It is, see kthread_park().
>
> When the kthread returns from kthread_parkme(), it'll still be running the old thread_fn code, regardless of whether we flipped the task's patch state.

We could instrument kthread_parkme() to be able to return to a different function, but it'd be a bit crazy indeed.

--
Vojtech Pavlik
Director SUSE Labs
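The "most kthreads are basically this" pattern referred to above is roughly the following shape (a kernel-style sketch, not compilable on its own; do_some_work() is a stand-in for the thread's real body). It also shows why returning from kthread_parkme() lands back in the old thread_fn:

```c
/* Rough shape of a typical kthread main loop (illustrative sketch).
 * kthread_should_stop()/kthread_should_park()/kthread_parkme() are the
 * real kernel APIs; do_some_work() is a placeholder. */
static int thread_fn(void *data)
{
	while (!kthread_should_stop()) {
		if (kthread_should_park())
			kthread_parkme();	/* parked here; unparking
						 * resumes this same (old)
						 * thread_fn */
		do_some_work(data);
		schedule_timeout_interruptible(HZ);
	}
	return 0;
}
```

Since the loop never returns while the thread lives, parking only quiesces the thread at a known point; it cannot by itself move execution into a newly patched thread_fn.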
Re: [RFC PATCH 6/9] livepatch: create per-task consistency model
On Thu, Feb 19, 2015 at 10:52:51AM +0100, Peter Zijlstra wrote:

> On Wed, Feb 18, 2015 at 09:44:44PM +0100, Vojtech Pavlik wrote:
> > For live patching it doesn't matter whether code is running, sleeping or frozen.
> >
> > What matters is whether there is state before patching that may not be valid after patching.
> >
> > For userspace tasks, the exit from a syscall is a perfect moment for switching to the "after" state, as all stacks, and thus all local variables, are gone and no local state exists in the kernel for the thread.
> >
> > The freezer is a logical choice for kernel threads; however, given that kernel threads have no defined entry/exit point and execute within a single main function, local variables stay and thus local state persists from before to after freezing.
> >
> > Defining that no local state within a kernel thread may be relied upon after exiting from the freezer is certainly possible, and is already true for many kernel threads.
> >
> > It isn't a given property of the freezer itself, though. And it isn't obvious for authors of new kernel threads either.
> >
> > The ideal solution would be to convert the majority of kernel threads to workqueues, because then there is a defined entry/exit point over which state isn't transferred. That is a lot of work, though, and has other drawbacks, particularly in the realtime space.
>
> kthread_park() functionality seems to be exactly what you want.

It might be exactly that, indeed. The requirement of not just cleaning up, but also not using the contents of local variables from before parking, would need to be documented. And kernel threads would need to start using it, too. I have been able to find one instance where this functionality is actually used. So it is again a matter of a massive patch adding that, like with the approach of converting kernel threads to workqueues.
By the way, if kthread_park() was implemented all through the kernel, would we still need the freezer for kernel threads at all? Since parking seems to be stronger than freezing, it could also be used for that purpose.

Vojtech
Re: [PATCH 1/3] sched: add sched_task_call()
On Thu, Feb 19, 2015 at 11:32:55AM -0600, Josh Poimboeuf wrote: On Thu, Feb 19, 2015 at 06:19:29PM +0100, Vojtech Pavlik wrote: On Thu, Feb 19, 2015 at 11:03:53AM -0600, Josh Poimboeuf wrote: On Thu, Feb 19, 2015 at 05:33:59PM +0100, Vojtech Pavlik wrote: On Thu, Feb 19, 2015 at 10:24:29AM -0600, Josh Poimboeuf wrote: No, these tasks will _never_ make syscalls. So you need to guarantee they don't accidentally enter the kernel while you flip them. Something like so should do. You set TIF_ENTER_WAIT on them, check they're still in userspace, flip them then clear TIF_ENTER_WAIT. Ah, that's a good idea. But how do we check if they're in user space? I don't see the benefit in holding them in a loop - you can just as well flip them from the syscall code as kGraft does. But we were talking specifically about HPC tasks which never make syscalls. Yes. I'm saying that rather than guaranteeing they don't enter the kernel (by having them spin) you can flip them in case they try to do that instead. That solves the race condition just as well. Ok, gotcha. We'd still need a safe way to check if they're in user space though. Having a safe way would be very nice and actually quite useful in other cases, too. For this specific purpose, however, we don't need a very safe way, though. We don't require atomicity in any way, we don't mind even if it creates false negatives, only false positives would be bad. kGraft looks at the stacktrace of CPU hogs and if it finds no kernel addresses there, it assumes userspace. Not very nice, but does the job. How about with a TIF_IN_USERSPACE thread flag? It could be cleared/set right at the border. 
Then for running tasks it's as simple as: if (test_tsk_thread_flag(task, TIF_IN_USERSPACE)) klp_switch_task_universe(task); -- Vojtech Pavlik Director SUSE Labs -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH 6/9] livepatch: create per-task consistency model
On Thu, Feb 19, 2015 at 10:52:51AM +0100, Peter Zijlstra wrote: On Wed, Feb 18, 2015 at 09:44:44PM +0100, Vojtech Pavlik wrote: For live patching it doesn't matter whether code is running, sleeping or frozen. What matters is whether there is state before patching that may not be valid after patching. For userspace tasks, the exit from a syscall is a perfect moment for switching to the after state, as all stacks, and thus all local variables are gone and no local state exists in the kernel for the thread. The freezer is a logical choice for kernel threads, however, given that kernel threads have no defined entry/exit point and execute within a single main function, local variables stay and thus local state persists from before to after freezing. Defining that no local state within a kernel thread may be relied upon after exiting from the freezer is certainly possible, and is already true for many kernel threads. It isn't a given property of the freezer itself, though. And isn't obvious for author of new kernel threads either. The ideal solution would be to convert the majority of kernel threads to workqueues, because then there is a defined entry/exit point over which state isn't transferred. That is a lot of work, though, and has other drawbacks, particularly in the realtime space. kthread_park() functionality seems to be exactly what you want. It might be exactly that, indeed. The requrement of not just cleaning up, but also not using contents of local variables from before parking would need to be documented. And kernel threads would need to start using it, too. I have been able to find one instance where this functionality is actually used. So it is again a matter of a massive patch adding that, like with the approach of converting kernel threads to workqueues. By the way, if kthread_park() was implemented all through the kernel, would we still need the freezer for kernel threads at all? 
Since parking seems to be stronger than freezing, it could also be used for that purpose. Vojtech -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/3] sched: add sched_task_call()
On Thu, Feb 19, 2015 at 10:24:29AM -0600, Josh Poimboeuf wrote:

No, these tasks will _never_ make syscalls.

So you need to guarantee they don't accidentally enter the kernel while
you flip them. Something like so should do. You set TIF_ENTER_WAIT on
them, check they're still in userspace, flip them then clear
TIF_ENTER_WAIT.

Ah, that's a good idea. But how do we check if they're in user space?

I don't see the benefit in holding them in a loop - you can just as well
flip them from the syscall code as kGraft does.

I still absolutely hate you need to disturb userspace like that. Signals
are quite visible and perturb userspace state.

Yeah, SIGSTOP on a sleeping task can be intrusive to user space if it
results in EINTR being returned from a system call. It's not ideal, but
it's much less intrusive than something like suspend.

But anyway we leave it up to the user to decide whether they want to
take that risk, or wait for the task to wake up on its own, or cancel
the patch.

Also, you cannot SIGCONT a task that was SIGSTOP'ed by userspace for
what they thought was a good reason. You'd wreck their state.

Hm, that's a good point. Maybe use the freezer instead of signals?
(Again this would only be for those user tasks which are sleeping on a
patched function)

But now I'm thinking that kthreads will almost never be a problem. Most
kthreads are basically this:

You guys are way too optimistic; maybe its because I've worked on
realtime stuff too much, but I'm always looking at worst cases. If you
can't handle those, I feel you might as well not bother :-)

Well, I think we're already resigned to the fact that live patching
won't work for every patch, every time. And that the patch author must
know what they're doing, and must do it carefully. Our target worst case
is that the patching fails gracefully and the user can't patch their
system with that particular patch. Or that the system remains in a
partially patched state forever, if the user is ok with that.

Patching thread_fn wouldn't be possible unless we killed the thread.

It is, see kthread_park().

When the kthread returns from kthread_parkme(), it'll still be running
the old thread_fn code, regardless of whether we flipped the task's
patch state. We could instrument kthread_parkme() to be able to return
to a different function, but it'd be a bit crazy indeed.

-- 
Vojtech Pavlik
Director SUSE Labs
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/3] sched: add sched_task_call()
On Thu, Feb 19, 2015 at 11:03:53AM -0600, Josh Poimboeuf wrote:
> On Thu, Feb 19, 2015 at 05:33:59PM +0100, Vojtech Pavlik wrote:
> > On Thu, Feb 19, 2015 at 10:24:29AM -0600, Josh Poimboeuf wrote:
> > > > > No, these tasks will _never_ make syscalls.
> > > >
> > > > So you need to guarantee they don't accidentally enter the kernel
> > > > while you flip them. Something like so should do. You set
> > > > TIF_ENTER_WAIT on them, check they're still in userspace, flip
> > > > them then clear TIF_ENTER_WAIT.
> > >
> > > Ah, that's a good idea. But how do we check if they're in user
> > > space?
> >
> > I don't see the benefit in holding them in a loop - you can just as
> > well flip them from the syscall code as kGraft does.
>
> But we were talking specifically about HPC tasks which never make
> syscalls.

Yes. I'm saying that rather than guaranteeing they don't enter the
kernel (by having them spin) you can flip them in case they try to do
that instead. That solves the race condition just as well.

-- 
Vojtech Pavlik
Director SUSE Labs
Re: [PATCH 1/3] sched: add sched_task_call()
On February 19, 2015 6:32:55 PM CET, Josh Poimboeuf <jpoim...@redhat.com> wrote:
> > Yes. I'm saying that rather than guaranteeing they don't enter the
> > kernel (by having them spin) you can flip them in case they try to do
> > that instead. That solves the race condition just as well.
>
> Ok, gotcha. We'd still need a safe way to check if they're in user
> space though.
>
> How about with a TIF_IN_USERSPACE thread flag? It could be cleared/set
> right at the border. Then for running tasks it's as simple as:
>
>	if (test_tsk_thread_flag(task, TIF_IN_USERSPACE))
>		klp_switch_task_universe(task);

The s390x arch used to have a TIF_SYSCALL, which was doing exactly that
(well, negated). I think it would work well, but it isn't entirely for
free: two instructions per syscall and an extra TIF bit, which are
(near) exhausted on some archs.

-- 
Vojtech Pavlik
Director SuSE Labs
Re: [RFC PATCH 6/9] livepatch: create per-task consistency model
On Wed, Feb 18, 2015 at 09:17:55PM +0100, Ingo Molnar wrote:
> * Jiri Kosina wrote:
>
> > On Thu, 12 Feb 2015, Peter Zijlstra wrote:
> >
> > > And what's wrong with using known good spots like the freezer?
> >
> > Quoting Tejun from the thread Jiri Slaby likely had on mind:
> >
> > "The fact that they may coincide often can be useful as a guideline
> > or whatever but I'm completely against just mushing it together when
> > it isn't correct. This kind of things quickly lead to ambiguous
> > situations where people are not sure about the specific semantics or
> > guarantees of the construct and implement weird voodoo code followed
> > by voodoo fixes. We already had a full round of that with the kernel
> > freezer itself, where people thought that the freezer magically makes
> > PM work properly for a subsystem. Let's please not do that again."
>
> I don't follow this vague argument.
>
> The concept of 'freezing' all userspace execution is pretty
> unambiguous: tasks that are running are trapped out at known safe
> points such as context switch points or syscall entry. Once all tasks
> have stopped, the system is frozen in the sense that only the code we
> want is running, so you can run special code without worrying about
> races.
>
> What's the problem with that? Why would it be fundamentally unsuitable
> for live patching?

For live patching it doesn't matter whether code is running, sleeping or
frozen. What matters is whether there is state before patching that may
not be valid after patching.

For userspace tasks, the exit from a syscall is a perfect moment for
switching to the "after" state, as all stacks, and thus all local
variables are gone and no local state exists in the kernel for the
thread.

The freezer is a logical choice for kernel threads, however, given that
kernel threads have no defined entry/exit point and execute within a
single main function, local variables stay and thus local state persists
from before to after freezing.

Defining that no local state within a kernel thread may be relied upon
after exiting from the freezer is certainly possible, and is already
true for many kernel threads. It isn't a given property of the freezer
itself, though. And it isn't obvious for authors of new kernel threads
either.

The ideal solution would be to convert the majority of kernel threads to
workqueues, because then there is a defined entry/exit point over which
state isn't transferred. That is a lot of work, though, and has other
drawbacks, particularly in the realtime space.

-- 
Vojtech Pavlik
Director SUSE Labs
Re: [PATCHv7 0/3] Kernel Live Patching
On Wed, Dec 17, 2014 at 01:22:21PM +0530, Balbir Singh wrote:
> On Wed, Dec 17, 2014 at 12:16 PM, Jiri Kosina wrote:
> > On Wed, 17 Dec 2014, Balbir Singh wrote:
> >
> >> >> Could you describe what this does to signing? I presume the patched
> >> >> module should cause a taint on module signing?
> >> >
> >> > Hmm, why should it?
> >>
> >> I wanted to clarify it from a different perspective
> >>
> >> If the base image is signed by X and the patched module is signed by
> >> Y, is that supported. What does it imply in the case of live-patching?
> >
> > Why should that matter? Both are signed by keys that kernel is
> > configured to trust, which makes them equal (even though they are
> > technically different).
>
> I am not sure they are equal, others can comment

Since any loaded kernel module can do virtually anything on a machine,
there can only be one level of trust. As such, all trusted keys are
equally trusted.

-- 
Vojtech Pavlik
Director SUSE Labs
Re: [PATCHv3 2/3] kernel: add support for live patching
On Mon, Nov 24, 2014 at 02:26:08PM +0100, Thomas Gleixner wrote:
> On Mon, 24 Nov 2014, Jiri Kosina wrote:
> > On Mon, 24 Nov 2014, Thomas Gleixner wrote:
> > > How is determined whether a change can be applied w/o a consistency
> > > mechanism or not?
> >
> > By a human being producing the "live patch" code.
> >
> > If the semantics of the patch requires consistency mechanism to be
> > applied (such as "old and new function must not run in parallel,
> > because locking rules would be violated", or "return value from a
> > function that is being called in a loop is changing its meaning",
> > etc.), then this first naive implementation simply can't be used.
> >
> > For simple things though, such as "add a missing bounds check to
> > sys_foo() prologue and return -EINVAL if out-of-bounds", this is
> > sufficient.
> >
> > It's being designed in a way that more advanced consistency models
> > (such as the ones kgraft and kpatch are currently implementing) can
> > be built on top of it.
> >
> > The person writing the patch would always need to understand what he
> > is doing to be able to pick correct consistency model to be used. I
> > personally think this is a good thing -- this is nothing where we
> > should be relying on any kinds of tools.
>
> But why want we to provide a mechanism which has no consistency
> enforcement at all?
>
> Surely you can argue that the person who is doing that is supposed to
> know what he's doing, but what's the downside of enforcing consistency
> mechanisms on all live code changes?

The consistency engine implementing the consistency model is the most
complex part of the live patching technology. We want to have something
small, easy to understand pushed out first, to build on top of that.

Plus we're still discussing which exact consistency model to use for
upstream live patching (there are many considerations) and whether one
is enough, or whether an engine that can do more than one is required.
The consistency models of kpatch and kGraft aren't directly compatible.

I think we're on a good way towards a single model, but we'll see when
it's implemented within the live patching framework just posted.

-- 
Vojtech Pavlik
Director SUSE Labs
Re: [PATCHv3 2/3] kernel: add support for live patching
On Mon, Nov 24, 2014 at 12:13:20PM +0100, Thomas Gleixner wrote:
> > This commit introduces code for the live patching core. It implements
> > an ftrace-based mechanism and kernel interface for doing live
> > patching of kernel and kernel module functions.
> >
> > It represents the greatest common functionality set between kpatch
> > and kgraft and can accept patches built using either method.
> >
> > This first version does not implement any consistency mechanism that
> > ensures that old and new code do not run together. In practice, ~90%
> > of CVEs are safe to apply in this way, since they simply add a
> > conditional check. However, any function change that can not execute
> > safely with the old version of the function can _not_ be safely
> > applied in this version.
>
> To be honest this sounds frightening.
>
> How is determined whether a change can be applied w/o a consistency
> mechanism or not?

If there are any syntactic (function prototype - arguments, return value
type) or semantic dependencies (data value semantics, locking order,
...) between multiple functions the patch changes, then it cannot be
applied.

If the changes are small and localized, don't depend on the order in
which individual functions are replaced, when both new and old code can
run in parallel on different CPUs or in sequence in a single thread
safely, then it can be applied.

-- 
Vojtech Pavlik
Director SUSE Labs
Re: [PATCH 0/2] Kernel Live Patching
> > but it'd make the switch per-thread like kGraft does, not for the
> > whole system, when the respective counters reach zero.
>
> I'm not sure what happens if a process sleeps on the patched-set?

Then the patching process will be stuck until it is woken up somehow.
But it's still much better to only have to care about processes sleeping
on the patched-set than about processes sleeping anywhere (kGraft).

> If we switch the other threads, when this sleeping thread wakes up
> that will see the old functions (and old data).

Yes, until the patching process is complete, data must be kept in the
old format, even by new functions.

> So I think we need both SWITCH_THREAD and SWITCH_KERNEL options in
> that case.

With data shadowing that's not required. It still may be worth having
it.

> What I'm thinking is to merge the code (technique) of both and allow
> to choose the "switch-timing" based on the patch's consistency
> requirement.

That's what I'm thinking about, too. But I'm also thinking, "this will
be complex, is it really needed"?

> Anyway, I'd like to support for this effort from kernel side.
> At least I have to solve ftrace regs conflict by IPMODIFY flag and
> a headache kretprobe failure case by sharing per-thread retstack
> with ftrace-callgraph.

While I don't know enough about the IPMODIFY flag, I wholeheartedly
support sharing the return stacks between kprobes and ftrace graph
caller.

-- 
Vojtech Pavlik
Director SUSE Labs
Re: Re: Re: [PATCH 0/2] Kernel Live Patching
> the patched set. (Or you can just include it manually in the set.)
> Then, you can be sure that the place which calls func() is not on the
> stack when patching. This way, in your classification, PATCH_KERNEL
> can be as good as PATCH_THREAD.

In my classification, I'm saying that LEAVE_PATCHED_SET is as good as
LEAVE_KERNEL.

> >> (*) Instead of checking stacks, at first, wait for all threads
> >> leaving the kernel once, after that, wait for refcount becomes zero
> >> and switch all the patched functions.
> >
> > This is a very beautiful idea.
> >
> > It does away with both the stack parsing and the kernel stopping,
> > achieving kGraft's goals, while preserving kpatch's consistency
> > model.
> >
> > Sadly, it combines the disadvantages of both kpatch and kGraft: From
> > kpatch it takes the inability to patch functions where threads are
> > sleeping often and as such never leave them at once. From kGraft it
> > takes the need to annotate kernel threads and wake sleepers from
> > userspace.
>
> But how frequently the former case happens? It seems very very rare.
> And if we aim to enable both kpatch mode and kGraft mode in the kernel,
> anyway we'll have something for the latter cases.

The kpatch problem case isn't that rare. It just happened with a CVE in
futexes recently. It will happen if you try to patch anything that is on
the stack when a TTY or TCP read is waiting for data, as another
example.

The kGraft problem case will happen when you load a 3rd party module
with a non-annotated kernel thread. Or a different problem will happen
when you have an application sleeping that will exit when receiving any
signal.

Both the cases can be handled with tricks and workarounds. But it'd be
much nicer to have a patching engine that is reliable.

> > So while it is beautiful, it's less practical than either kpatch or
> > kGraft alone.
>
> Ah, sorry for confusing, I don't tend to integrate kpatch and kGraft.
> Actually, it is just about modifying kpatch, since it may shorten
> stack-checking time.
> This means that does not change the consistency model.
> We certainly need both of kGraft mode and kpatch mode.

What I'm proposing is a LEAVE_PATCHED_SET + SWITCH_THREAD mode. It's
less consistency, but it is enough. And it is more reliable (likely to
succeed in finite time) than either kpatch or kGraft.

It'd be mostly based on your refcounting code, including stack checking
(when a process sleeps, counter gets set based on number of patched
functions on the stack), possibly including setting the counter to 0 on
syscall entry/exit, but it'd make the switch per-thread like kGraft
does, not for the whole system, when the respective counters reach zero.

This handles the frequent sleeper case, it doesn't need annotated kernel
thread main loops, it will not need the user to wake up every process in
the system unless it sleeps in a patched function. And it can handle all
the patches that kpatch and kGraft can (it needs shadowing for some).

> > Yes, this is what I call 'extending the patched set'. You can do that
> > either by deliberately changing the prototype of the patched function
> > being called, which causes the calling function to be considered
> > different, or just add it to the set of functions considered
> > manually.
>
> I'd prefer latter one :) or just gives hints of watching targets.

Me too.

-- 
Vojtech Pavlik
Director SUSE Labs
Re: Re: [PATCH 0/2] Kernel Live Patching
On Tue, Nov 11, 2014 at 10:24:03AM +0900, Masami Hiramatsu wrote: > Hmm, I doubt this can cover all. what I'm thinking is a combination of > LEAVE_KERNEL and SWITCH_KERNEL by using my refcounting and kGraft's > per-thread "new universe" flagging(*). It switches all threads but not > change entire kernel as kexec does. While your approach described below indeed forces all threads to leave kernel once, to initialize the reference counts, this can be considered a preparatory phase before the actual patching begins. The actual consistency model remains unchanged from what kpatch offers today, which guarantees that at the time of switch, no execution thread is inside the set of patched functions and that the switch happens at once for all threads. Hence I'd still classify the consistency model offered as LEAVE_PATCHED_SET SWITCH_KERNEL. > So, I think the patch may be classified by following four types > > PATCH_FUNCTION - Patching per function. This ignores context, just >change the function. >User must ensure that the new function can co-exist >with old functions on the same context (e.g. recursive >call can cause inconsistency). > > PATCH_THREAD - Patching per thread. If a thread leave the kernel, >changes are applied for that thread. >User must ensure that the new functions can co-exist >with old functions per-thread. Inter-thread shared >data acquisition(locks) should not be involved. > > PATCH_KERNEL - Patching all threads. This wait for all threads leave the >all target functions. >User must ensure that the new functions can co-exist >with old functions on a thread (note that if there is a >loop, old one can be called first n times, and new one >can be called afterwords).(**) Yes, but only when the function calling it is not included in the patched set, which is only a problem for semantic changes accompanied by no change in the function prototyppe. This can be avoided by changing the prototype deliberately. 
> RENEW_KERNEL - Renew entire kernel and reset internally. No patch limitation, >but involving kernel resetting. This may take a time. And involves recording the userspace-kernel interface state exactly. The interface is fairly large, so this can become hairy. > (*) Instead of checking stacks, at first, wait for all threads leaving > the kernel once, after that, wait for refcount becomes zero and switch > all the patched functions. This is a very beautiful idea. It does away with both the stack parsing and the kernel stopping, achieving kGraft's goals, while preserving kpatch's consistency model. Sadly, it combines the disadvantages of both kpatch and kGraft: From kpatch it takes the inability to patch functions where threads are sleeping often and as such never leave them at once. From kGraft it takes the need to annotate kernel threads and wake sleepers from userspace. So while it is beautiful, it's less practical than either kpatch or kGraft alone. > (**) For the loops, if it is a simple loop or some simple lock calls, > we can wait for all threads leave the caller function to avoid inconsistency > by using refcounting. Yes, this is what I call 'extending the patched set'. You can do that either by deliberately changing the prototype of the patched function being called, which causes the calling function to be considered different, or just add it to the set of functions considered manually. > > There would be the refcounting engine, counting entries/exits of the > > area of interest (nothing for LEAVE_FUNCTION, patched functions for > > LEAVE_PATCHED_SET - same as Masami's work now, or syscall entry/exit for > > LEAVE_KERNEL), and it'd do the counting either per thread, flagging a > > thread as 'new universe' when the count goes to zero, or flipping a > > 'new universe' switch for the whole kernel when the count goes down to zero. > > Ah, that's similar thing what I'd like to try next :) Cool. > Sorry, here is an off-topic talk. 
> I think a problem of kGraft's LEAVE_KERNEL work is the sleeping processes. To ensure all the threads are changing to the new universe, we need to wake up all the threads, or we need stack-dumping to find someone sleeping on the target functions. What would kGraft do for this issue? Yes, kGraft uses a userspace helper to find such sleepers and wake them by sending a SIGUSR1 or SIGSTOP/SIGCONT. It's one of the disadvantages of kGraft that sleeping threads may have to be handled case by case. Also, kernel threads are problematic for kGraft (as you may have seen in earlier kGraft upstream submissions) as they never leave the kernel. -- Vojtech Pavlik
Re: [PATCH 0/2] Kernel Live Patching
functions in the list. So you would not be skipping the stack checking entirely, just allowing certain functions from the patched set to be on the stack while you switch to the new universe. That indeed would make it a mixed SWITCH_FUNCTION/SWITCH_KERNEL situation. The big caveat in such a situation is that you must not change the calling convention or semantics of any function called directly from the function you skipped the stack check for, as doing so would crash the kernel when an old call reaches a new function. > > But there are a few (probably much less than 10%) cases like the locking > > one I used above, where SWITCH_THREAD just isn't going to cut it and for > > those I would need SWITCH_KERNEL or get very creative with refactoring > > the patch to do things differently. > > I'm not opposed to having both if necessary. But I think the code would > be _much_ simpler if we could agree on a single consistency model that > can be used in all cases. Plus there wouldn't be such a strong > requirement to get incremental patching to work safely (which will add > more complexity). > > I actually agree with you that LEAVE_PATCHED_SET + SWITCH_THREAD is > pretty nice. Cool! Do you see it as the next-step consistency model we would focus on implementing in livepatch after the null model is complete and upstream? (That wouldn't preclude extending it or implementing more models later.) > So I'd like to hear more about cases where you think we _need_ > SWITCH_KERNEL. As I mentioned above, I think many of those cases can be > solved by using data structure versioning with shadow data fields. I have tried, but so far I can't find a situation where we would absolutely need SWITCH_KERNEL, assuming we have LEAVE_PATCHED_SET + SWITCH_THREAD + TRANSFORM_ON_ACCESS. > > > Why would LEAVE_PATCHED_SET SWITCH_THREAD finish much quicker than > > > LEAVE_PATCHED_SET SWITCH_KERNEL? Wouldn't they be about the same?
> > Because with LEAVE_PATCHED_SET SWITCH_THREAD you're waiting for each > > thread to leave the patched set and when they each have done that at > > least once, you're done. Even if some are already back within the set. Ok, so you're talking about the case when you're trying to patch a function which is always active. Agreed :-) Yes. > > With LEAVE_PATCHED_SET SWITCH_KERNEL, you have to find the perfect > > moment when all of the threads are outside of the patched set at the > > same time. Depending on how often the functions are used and how large > > the set is, reaching that moment will get harder. > > Yeah, I think this is the biggest drawback of SWITCH_KERNEL. More > likely to fail (or never succeed). If some threads are sleeping in a loop inside the patched set: With SWITCH_THREAD you can wake the threads (e.g. by a signal) from userspace as a last resort and that will complete your patching. With SWITCH_KERNEL you'd somehow have to wake them all at the same time, hoping they also leave the patched set together. That's unlikely to happen when many threads are involved. In addition, the "in progress" behavior is nicer for SWITCH_THREAD, as any new thread will already be running patched code. With SWITCH_KERNEL, you're waiting with applying the fix until the perfect moment. -- Vojtech Pavlik Director SUSE Labs -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/2] Kernel Live Patching
On Fri, Nov 07, 2014 at 09:45:53PM -0600, Josh Poimboeuf wrote: > On Fri, Nov 07, 2014 at 10:27:35PM +0100, Vojtech Pavlik wrote: > > On Fri, Nov 07, 2014 at 09:45:00AM -0600, Josh Poimboeuf wrote: > > > > > > LEAVE_FUNCTION > > > > LEAVE_PATCHED_SET > > > > LEAVE_KERNEL > > > > > > > > SWITCH_FUNCTION > > > > SWITCH_THREAD > > > > SWITCH_KERNEL > > > > > > > > There would be the refcounting engine, counting entries/exits of the > > area of interest (nothing for LEAVE_FUNCTION, patched functions for > > LEAVE_PATCHED_SET - same as Masami's work now, or syscall entry/exit > > for LEAVE_KERNEL), and it'd do the counting either per thread, > > flagging a thread as 'new universe' when the count goes to zero, or > > flipping a 'new universe' switch for the whole kernel when the count > > goes down to zero. > > > > A patch would have flags which specify a combination of the above > > properties that are needed for successful patching of that specific > > patch. > > Would it really be necessary to support all possible permutations? I > think that would add a lot of complexity to the code. Especially if we > have to support LEAVE_KERNEL, which adds a lot of interdependencies with > the rest of the kernel (kthreads, syscall, irq, etc). > > It seems to me that the two most interesting options are: > - LEAVE_PATCHED_SET + SWITCH_THREAD (Masami-kGraft) > - LEAVE_PATCHED_SET + SWITCH_KERNEL (kpatch and/or Masami-kpatch) I agree here. In fact LEAVE_KERNEL can be approximated by extending the patched set as required to include functions which are not changed per se, but are "contaminated" by propagation of semantic changes in the calling convention and/or data format. This applies to cases like this (needs LEAVE_KERNEL or extending the patched set beyond the changed functions):

---
int bar()
{
	[...]
-	return x;
+	return x + 1;
}

foo()
{
	int ret = bar();

	do {
		wait_interruptible();
	} while (ret == bar());
}
---

Or like this (needs SWITCH_KERNEL, so won't work with kGraft, but also extending the patched set, so will not work with kpatch as it stands today):

---
void lock_a()
{
-	spin_lock(x);
+	spin_lock(y);
}

void lock_b()
{
-	spin_lock(y);
+	spin_lock(x);
}

void unlock_a()
{
-	spin_unlock(x);
+	spin_unlock(y);
}

void unlock_b()
{
-	spin_unlock(y);
+	spin_unlock(x);
}

void foo()
{
	lock_a();
	lock_b();
	[...]
	unlock_b();
	unlock_a();
}
---

So once we can extend the patched set as needed (manually, after review), we can handle all the cases that LEAVE_KERNEL offers, making LEAVE_KERNEL unnecessary. It'd be nice if we didn't require actually patching those functions, only include them in the set we have to leave to proceed. The remaining question is the performance impact of using a large set with refcounting. LEAVE_KERNEL's impact as implemented in kGraft is beyond being measurable; it's about 16 added instructions executed for each thread. > I do see some appeal for the choose-your-own-consistency-model idea. > But it wouldn't be practical for cumulative patch upgrades, which I > think we both agreed at LPC seems to be safer than incremental > patching. If you only ever have one patch module loaded at any given > time, you only get one consistency model anyway. > In order for multiple consistency models to be practical, I think we'd > need to figure out how to make incremental patching safe. I believe we have to get incremental patching working anyway as it is a valid usecase for many users, just not for major distributions. And we may want to take a look at how to mark parts of a cumulative patch with different consistency models, when we combine e.g. the recent futex CVE patch (not working with SWITCH_KERNEL) and anything requiring SWITCH_KERNEL.
> > For data layout and semantic changes, there are two approaches: > > > > 1) TRANSFORM_WORLD > I'm kind of surprised to hear that Ksplice does this. I had > considered this approach, but it sounds really tricky, if not > impossible in many cases. By the way, the ability to do this is basically the only advantage of actually stopping the kernel. > Ahem, this would be an opportune time for a Ksplice person to chime in > with their experiences... Indeed. :) > > 2) TRANSFORM_ON_ACCESS > This is a variation on what we've been doing with kpatch, using shadow > data fields to add data and/or version metadata
Re: [PATCH 0/2] Kernel Live Patching
t null model) and while it needs two steps (patch, then enable conversion), it doesn't require two rounds of patching. Also, you don't have to consider oldfunc/newdata as that will never happen. oldfunc/olddata obviously works, so you only have to look at newfunc/olddata and newfunc/newdata as the transformation goes on. I don't see either of these as really that much simpler. But I do see value in offering both. > On the other hand, SWITCH_KERNEL doesn't have those problems. It does > have the problem you mentioned, roughly 2% of the time, where it can't > patch functions which are always in use. But in that case we can skip > the backtrace check ~90% of the time. An interesting bit is that when you skip the backtrace check you're actually reverting to LEAVE_FUNCTION SWITCH_FUNCTION, forfeiting all consistency, and not LEAVE_FUNCTION SWITCH_KERNEL as one would expect. Hence for those 2% of cases (going with your number, because it's a guess anyway) LEAVE_PATCHED_SET SWITCH_THREAD would in fact be the safer option. > So it's really maybe something > like 0.2% of patches which can't be patched with SWITCH_KERNEL. But > even then I think we could overcome that by getting creative, e.g. using > the multiple patch approach. > > So my perspective is that SWITCH_THREAD causes big headaches 10% of the > time, whereas SWITCH_KERNEL causes small headaches 1.8% of the time, and > big headaches 0.2% of the time :-) My preferred way would be to go with SWITCH_THREAD for the simpler stuff and do a SWITCH_KERNEL for the 10% of complex patches, because (LEAVE_PATCHED_SET) SWITCH_THREAD finishes much quicker. But I'm biased there. ;) It seems more and more to me that we will actually want the more powerful engine coping with the various options.
-- Vojtech Pavlik Director SUSE Labs
Re: [PATCH 0/2] Kernel Live Patching
On Fri, Nov 07, 2014 at 07:11:53AM -0600, Josh Poimboeuf wrote: > 2. Add consistency model(s) (e.g. kpatch stop_machine, kGraft per-task > consistency, Masami's per task ref counting) I have given some thought to the consistency models and how they differ and how they potentially could be unified. I have to thank Masami, because his rewrite of the kpatch model based on refcounting is what brought it closer to the kGraft model and thus allowed me to find the parallels. Let me start by defining the properties of the patching consistency model. First, what entity the execution must be outside of to be able to make the switch, ordered from weakest to strongest:

LEAVE_FUNCTION - execution has to leave a patched function to switch to
	the new implementation

LEAVE_PATCHED_SET - execution has to leave the set of patched functions
	to switch to the new implementation

LEAVE_KERNEL - execution has to leave the entire kernel to switch to the
	new implementation

Then, what entity the switch happens for. Again, from weakest to strongest:

SWITCH_FUNCTION - the switch to the new implementation happens on a
	per-function basis

SWITCH_THREAD - the switch to the new implementation is per-thread

SWITCH_KERNEL - the switch to the new implementation happens at once for
	the whole kernel

Now with those definitions:

livepatch (null model), as is, is LEAVE_FUNCTION and SWITCH_FUNCTION
kpatch, masami-refcounting and Ksplice are LEAVE_PATCHED_SET and SWITCH_KERNEL
kGraft is LEAVE_KERNEL and SWITCH_THREAD
CRIU/kexec is LEAVE_KERNEL and SWITCH_KERNEL

By blending kGraft and masami-refcounting, we could create a consistency engine capable of almost any combination of these properties and thus all the consistency models. However, I'm currently thinking that the most interesting model is LEAVE_PATCHED_SET and SWITCH_THREAD, as it is reliable, fast converging, doesn't require annotating kernel threads and doesn't fail with frequent sleepers like futexes.
It provides the least consistency that is required to be able to change the calling convention of functions and still allows for semantic dependencies. What do you think? PS: Livepatch's null model isn't in fact the weakest possible, as it still guarantees executing complete, intact functions, thanks to ftrace. That is much more than what direct overwriting of the function in memory would achieve. This is also the reason why Ksplice is locked to a very specific consistency model: Ksplice can patch only when the kernel is stopped and the model is built from that. masami-refcounting, kpatch, kGraft and livepatch have a lot more freedom, thanks to ftrace, in what the consistency model should look like. PPS: I haven't included any handling of changed data structures in this; that's another set of properties. -- Vojtech Pavlik Director SUSE Labs
Re: [PATCH 0/2] Kernel Live Patching
On Fri, Nov 07, 2014 at 06:31:54AM -0600, Josh Poimboeuf wrote: > On Thu, Nov 06, 2014 at 09:24:23PM +0100, Vojtech Pavlik wrote: > > On Thu, Nov 06, 2014 at 10:58:57AM -0800, Christoph Hellwig wrote: > > > > > On Thu, Nov 06, 2014 at 07:51:57PM +0100, Vojtech Pavlik wrote: > > > > I don't think this specific example was generated. > > > > > > > > I also don't think including the whole kpatch automation into the kernel > > > > tree is a viable development model for it. (Same would apply for kGraft > > > > automation.) > > > > > > Why? We (IMHO incorrectly) used the argument of tight coupling to put > > > perf into the kernel tree. Generating kernel live patches is way more > > > integrated that it absolutely has to go into the tree to be able to do > > > proper development on it in an integrated fashion. > > > > One reason is that there are currently at least two generators using > > very different methods of generation (in addition to the option of doing > > the patch module by hand), and neither of them are currently in a state > > where they would be ready for inclusion into the kernel (although the > > kpatch one is clearly closer to that). > > What generator does kGraft have? Is that the one that generates the > source patch, or is there one that generates a binary patch module? 
The generator for kGraft:

* extracts a list of changed functions from a patch (rather naïvely so far)
* uses DWARF debuginfo of the old kernel to handle things like inlining and create a complete list of functions that need to be replaced
* compiles the kernel with -fdata-sections -ffunction-sections
* uses a modified objcopy to extract functions from the kernel into a single .o file
* creates a stub .c file that references those functions
* compiles the .c and links with the .o to build a .ko

The main difference is that the kGraft generator doesn't try to compare the old and new binary objects, but rather works with function lists and the DWARF info of the old code, and extracts the new functions from the new binary. However, as I said before, we have found enough trouble around e.g. IPA-SRA and other optimizations that make any automated approach fragile and in our view more effort than benefit. Hence, we intend to use the manual way of creating live patches until proven that we were wrong in this assessment. :) -- Vojtech Pavlik Director SUSE Labs
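The section-extraction step in that pipeline can be demonstrated with stock tools on a toy object. This is only an illustration of the -ffunction-sections plus objcopy idea on a small C file; the real generator uses a modified objcopy against the whole kernel, and all file and function names here are placeholders.

```shell
# With -ffunction-sections, each function lands in its own
# .text.<name> section, so a single function can be pulled out.
cat > demo.c <<'EOF'
int foo(int x) { return x + 1; }
int bar(int x) { return x * 2; }
EOF
gcc -c -ffunction-sections demo.c -o demo.o

# Extract only bar()'s section into a new object:
objcopy --only-section=.text.bar demo.o bar_only.o

# The resulting object contains .text.bar and no .text.foo:
objdump -h bar_only.o | grep -c 'text\.bar'
```

The generator then links such extracted objects together with a stub .c into a .ko, as described above.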
Re: [PATCH 0/2] Kernel Live Patching
On Fri, Nov 07, 2014 at 07:11:53AM -0600, Josh Poimboeuf wrote: 2. Add consistency model(s) (e.g. kpatch stop_machine, kGraft per-task consistency, Masami's per task ref counting) I have given some thought to the consistency models and how they differ and how they potentially could be unified. I have to thank Masami, because his rewrite of the kpatch model based on refcounting is what brought it closer to the kGraft model and thus allowed me to find the parallels. Let me start by defining the properties of the patching consistency model. First, what entity the execution must be outside of to be able to make the switch, ordered from weakest to strongest: LEAVE_FUNCTION - execution has to leave a patched function to switch to the new implementation LEAVE_PATCHED_SET - execution has to leave the set of patched functions to switch to the new implementation LEAVE_KERNEL - execution has to leave the entire kernel to switch to the new implementation Then, what entity the switch happens for. Again, from weakest to strongest: SWITCH_FUNCTION - the switch to the new implementation happens on a per-function basis SWITCH_THREAD - the switch to the new implementation is per-thread. SWITCH_KERNEL - the switch to the new implementation happens at once for the whole kernel Now with those definitions: livepatch (null model), as is, is LEAVE_FUNCTION and SWITCH_FUNCTION kpatch, masami-refcounting and Ksplice are LEAVE_PATCHED_SET and SWITCH_KERNEL kGraft is LEAVE_KERNEL and SWITCH_THREAD CRIU/kexec is LEAVE_KERNEL and SWITCH_KERNEL By blending kGraft and masami-refcounting, we could create a consistency engine capable of almost any combination of these properties and thus all the consistency models. However, I'm currently thinking that the most interesting model is LEAVE_PATCHED_SET and SWITCH_THREAD, as it is reliable, fast converging, doesn't require annotating kernel threads nor fails with frequent sleepers like futexes. 
It provides the least consistency that is required to be able to change the calling convention of functions and still allows for semantic dependencies. What do you think?

PS: Livepatch's null model isn't in fact the weakest possible, as it still guarantees executing complete intact functions, thanks to ftrace. That is much more than what direct overwriting of the function in memory would achieve. This is also the reason why Ksplice is locked to a very specific consistency model: Ksplice can patch only when the kernel is stopped and the model is built from that. masami-refcounting, kpatch, kGraft and livepatch have a lot more freedom, thanks to ftrace, in what the consistency model should look like.

PPS: I haven't included any handling of changed data structures in this; that's another set of properties.

-- Vojtech Pavlik Director SUSE Labs
Re: [PATCH 0/2] Kernel Live Patching
you only have to look at newfunc/olddata and newfunc/newdata as the transformation goes on. I don't see either of these as really that much simpler. But I do see value in offering both.

> On the other hand, SWITCH_KERNEL doesn't have those problems. It does
> have the problem you mentioned, roughly 2% of the time, where it can't
> patch functions which are always in use. But in that case we can skip
> the backtrace check ~90% of the time.

An interesting bit is that when you skip the backtrace check you're actually reverting to LEAVE_FUNCTION SWITCH_FUNCTION, forfeiting all consistency, and not LEAVE_FUNCTION SWITCH_KERNEL as one would expect. Hence for those 2% of cases (going with your number, because it's a guess anyway) LEAVE_PATCHED_SET SWITCH_THREAD would in fact be a safer option.

> So it's really maybe something like 0.2% of patches which can't be
> patched with SWITCH_KERNEL. But even then I think we could overcome that
> by getting creative, e.g. using the multiple patch approach.
>
> So my perspective is that SWITCH_THREAD causes big headaches 10% of the
> time, whereas SWITCH_KERNEL causes small headaches 1.8% of the time, and
> big headaches 0.2% of the time :-)

My preferred way would be to go with SWITCH_THREAD for the simpler stuff and do a SWITCH_KERNEL for the 10% of complex patches. This because (LEAVE_PATCHED_SET) SWITCH_THREAD finishes much quicker. But I'm biased there. ;)

It seems more and more to me that we will actually want the more powerful engine coping with the various options.

-- Vojtech Pavlik Director SUSE Labs
Re: [PATCH 0/2] Kernel Live Patching
On Thu, Nov 06, 2014 at 10:58:57AM -0800, Christoph Hellwig wrote:

> On Thu, Nov 06, 2014 at 07:51:57PM +0100, Vojtech Pavlik wrote:
> > I don't think this specific example was generated.
> >
> > I also don't think including the whole kpatch automation into the kernel
> > tree is a viable development model for it. (Same would apply for kGraft
> > automation.)
>
> Why? We (IMHO incorrectly) used the argument of tight coupling to put
> perf into the kernel tree. Generating kernel live patches is way more
> integrated that it absolutely has to go into the tree to be able to do
> proper development on it in an integrated fashion.

One reason is that there are currently at least two generators using very different methods of generation (in addition to the option of doing the patch module by hand), and neither of them are currently in a state where they would be ready for inclusion into the kernel (although the kpatch one is clearly closer to that).

A generator is not required for using the infrastructure and is merely a means of preparing the live patch with less effort.

I'm not opposed at all to adding the generator(s) eventually. However, given that their use is optional, I would prefer not to have to wait on finishing and cleaning up the generator(s) to include the in-kernel live patching infrastructure.

-- Vojtech Pavlik Director SUSE Labs
Re: [PATCH 0/2] Kernel Live Patching
On Thu, Nov 06, 2014 at 10:44:46AM -0800, Christoph Hellwig wrote: > On Thu, Nov 06, 2014 at 08:39:06AM -0600, Seth Jennings wrote: > > An example patch module can be found here: > > https://github.com/spartacus06/livepatch/blob/master/patch/patch.c > > Please include the generator for this patch in the kernel tree. > Providing interfaces for out of tree modules (or generators) is not > how the kernel internal APIs work. I don't think this specific example was generated. I also don't think including the whole kpatch automation into the kernel tree is a viable development model for it. (Same would apply for kGraft automation.) So I believe including one or more human produced examples makes most sense. -- Vojtech Pavlik Director SUSE Labs
Re: [PATCH 2/2] kernel: add support for live patching
On Thu, Nov 06, 2014 at 10:20:49AM -0600, Seth Jennings wrote:

> Yes, I should explain it.
>
> This is something that is currently only used in the kpatch approach.
> It allows the patching core to do dynamic relocations on the new
> function code, similar to what the kernel module linker does, but this
> works for non-exported symbols as well.
>
> This is so the patch module doesn't have to do a kallsyms lookup on
> every non-exported symbol that the new functions use.
>
> The fields of the dynrela structure are those of a normal ELF rela
> entry, except for the "external" field, which conveys information about
> where the core module should go looking for the symbol referenced in the
> dynrela entry.
>
> Josh was under the impression that Vojtech was ok with putting the
> dynrela stuff in the core. Is that not correct (misunderstanding)?

Yes, that is correct, as obviously the kpatch way of generating patches by extracting code from a compiled kernel would not be viable without it.

For our own kGraft usage we're choosing to compile patches from C source, and there we can simply replace the function calls by calls via pointer looked up through kallsyms. However, kGraft also has tools to create patches in an automated way, where the individual functions are extracted from the compiled patched kernel using a modified objcopy, and this is hitting exactly the same issue of having to do relocation of unexported symbols if any are referenced.

So no objection to the idea. We'll have to look more into the code to comment on the implementation of the dynrela stuff.

-- Vojtech Pavlik Director SUSE Labs
Re: [PATCH v3 0/2] s390 vs. kprobes on ftrace
On Wed, Oct 22, 2014 at 10:26:25AM +0200, Heiko Carstens wrote:

> > I can confirm that kGraft works well on top of current mainline with
> > this patch added.
> >
> > Another reason for a performance impact when kGraft is enabled is that
> > kGraft still adds two instructions to the syscall path on s390x, as
> > there is no space left for a kgraft TIF in the first eight bits of
> > thread info flags. Renumbering the thread info flags such that _TIF_WORK
> > occupies the first eight bits and TIF_TRACE the next eight would fix
> > that problem. Do you believe it is feasible?
>
> Hi Vojtech,
>
> I think you're talking about the SLES12 kernel?

I'm working with upstream git right now, to make sure kGraft keeps up with it.

> There you can simply move the TIF_SYSCALL bit to the same byte where
> your TIF_KGR bit resides.

On 3.12 (SLE12) this would allow me to clear the TIF_KGR in a single instruction when exiting a syscall simply by extending the mask. I need to clear it before entering a syscall, too, and TIF_SYSCALL is set there, not cleared.

What could work is moving TIF_SYSCALL and TIF_KGR to the same byte as TIF_TRACE. Then I could use the TIF_TRACE check to also clear the TIF_KGR bit in sysc_tracesys without adding an instruction to the hot path.

> Upstream is a bit different since the TIF_SYSCALL bit is already gone (got
> replaced with an s390 specific "PIF" bit). However the free TIF bit got
> already eaten up by uprobes..

Yes, and all the 8 bits are eaten now in upstream. That's what got me thinking about separating TIF_WORK and TIF_TRACE into separate bytes, with no other flags in them. The syscall code would then just check the whole byte for zero instead of doing test-under-mask.

> However we can think of a better solution for upstream if the combined
> solution of kGraft/kpatch is ready to be merged.

We may, indeed, end up doing things very differently there. 
At least initially the plan is to go entirely without enforcing any kind of consistency when patching, so TIF_KGR and the lazy migration will not exist, and neither will we be stopping the kernel or checking stacks. This will make applying patches bigger than a single function tricky, but it's a good initial goal.

-- Vojtech Pavlik Director SUSE Labs
Re: [PATCH v3 0/2] s390 vs. kprobes on ftrace
Hello Heiko, I can confirm that kGraft works well on top of current mainline with this patch added. Another reason for a performance impact when kGraft is enabled is that kGraft still adds two instructions to the syscall path on s390x, as there is no space left for a kgraft TIF in the first eight bits of thread info flags. Renumbering the thread info flags such that _TIF_WORK occupies the first eight bits and TIF_TRACE the next eight would fix that problem: Do you believe it is feasible? Vojtech On Tue, Oct 21, 2014 at 10:30:27AM +0200, Heiko Carstens wrote: > v3: > Changed patch 1/2 to incorporate feedback from Steven Rostedt and > Masami Hiramatsu: rename helper function check_ftrace_location() > to arch_check_ftrace_location() and convert it to a weak function, > so architectures can override it without the need for new config > option. > > v2: > Changed patch 1/2 to incorporate feedback from Masami Hiramatsu, and > introduce a new helper function check_ftrace_location(). > The requested ftracetest has been sent as an own patch set, since it > has no dependency to these patches. > > v1: > We would like to implement an architecture specific variant of "kprobes > on ftrace" without using the current HAVE_KPROBES_ON_FTRACE infrastructure > which is currently only used by x86. > > The rationale for these two patches is: > - we want to patch the first instruction of the mcount code block to > reduce the overhead of the function tracer > - we'd like to keep the ftrace_caller function as simple as possible and > not require it to generate a 100% valid pt_regs structure as required > by the combination of DYNAMIC_FTRACE_WITH_REGS and HAVE_KPROBES_ON_FTRACE. > This allows us to not generate the psw mask field in the pt_regs > structure on each function tracer enabled function, which otherwise would > be very expensive. Besides that program check generated pt_regs contents > are "more" accurate than program generated ones and don't require any > maintenance. 
> And also we can keep the ftrace and kprobes backends quite separated. > > In order to make this work a small common code change is necessary which > removes a check if kprobe is being placed on an ftrace location (see > first patch). > > If possible, I'd like to have an ACK from at least one of the kprobes > maintainers for the first patch and bring it upstream via the s390 tree. > > Thanks, > Heiko > > Heiko Carstens (2): > kprobes: introduce weak arch_check_ftrace_location() helper function > s390/ftrace,kprobes: allow to patch first instruction > > arch/s390/include/asm/ftrace.h | 52 ++-- > arch/s390/include/asm/kprobes.h | 1 + > arch/s390/include/asm/lowcore.h | 4 +- > arch/s390/include/asm/pgtable.h | 12 > arch/s390/kernel/asm-offsets.c | 1 - > arch/s390/kernel/early.c| 4 -- > arch/s390/kernel/ftrace.c | 132 > +--- > arch/s390/kernel/kprobes.c | 92 > arch/s390/kernel/mcount.S | 1 + > arch/s390/kernel/setup.c| 2 - > arch/s390/kernel/smp.c | 1 - > include/linux/kprobes.h | 1 + > kernel/kprobes.c| 18 +++--- > scripts/recordmcount.c | 2 +- > scripts/recordmcount.pl | 2 +- > 15 files changed, 226 insertions(+), 99 deletions(-) > > -- > 1.8.5.5
Re: [PATCH 1/2] SOUND: kill gameport bits
On Thu, Aug 28, 2014 at 10:03:55PM +0200, Clemens Ladisch wrote:

> Takashi Iwai wrote:
> > did anyone test the patch at all...?
>
> Appears to work. The ymfpci gameport seems to be somewhat unreliable:
>
> analog.c: 100 out of 17347 reads (0%) on pci:06:06.1/gameport0 failed
> analog.c: 122 out of reads (10%) on pci:06:07.0/gameport0 failed

The analog.c gameport read routine is unreliable by design. The 558 chip is not an ADC, it's a one-shot timer from 1971. The analog position of the joystick is measured by timing bit changes on the gameport. analog.c does that without disabling interrupts, as the read can take several milliseconds. Instead, analog.c detects when an interrupt influenced the measurement too much and retries. The retries are counted and reported. 10% is a largeish number, but still something the analog.c driver can cope with and give reliable results.

> There is still some dependence of the CPU speed on the loop counts
> calculated by gameport_time(), but it's not very high; the following are
> for 3200 and 800 MHz, 5 times each:
>
> gameport gameport7: EMU10K1 is pci:06:06.1/gameport0, io 0xe480, speed 651kHz
> gameport gameport8: EMU10K1 is pci:06:06.1/gameport0, io 0xe480, speed 651kHz
> gameport gameport9: EMU10K1 is pci:06:06.1/gameport0, io 0xe480, speed 651kHz
> gameport gameport10: EMU10K1 is pci:06:06.1/gameport0, io 0xe480, speed 651kHz
> gameport gameport11: EMU10K1 is pci:06:06.1/gameport0, io 0xe480, speed 651kHz
> gameport gameport12: EMU10K1 is pci:06:06.1/gameport0, io 0xe480, speed 612kHz
> gameport gameport13: EMU10K1 is pci:06:06.1/gameport0, io 0xe480, speed 612kHz
> gameport gameport14: EMU10K1 is pci:06:06.1/gameport0, io 0xe480, speed 612kHz
> gameport gameport15: EMU10K1 is pci:06:06.1/gameport0, io 0xe480, speed 611kHz
> gameport gameport16: EMU10K1 is pci:06:06.1/gameport0, io 0xe480, speed 612kHz

The gameport speed is the speed of the i/o to the port. It may change as frequencies in the system change. 
It's used for timeouts on digital gameport protocols only and thus a variation of less than 20% shouldn't cause trouble. The analog.c driver uses its own timing calibration to make the analog_cooked_read() reliable even when the speed of the i/o operations changes. What is important is that the GET_TIME() macro is fast (0.1 usec or less), precise (sub-microsecond resolution) and reliable. Digital protocols also rely on udelay() and mdelay() waiting precisely for the amount of time specified. -- Vojtech Pavlik Director SuSE Labs
Re: [PATCH 1/2] SOUND: kill gameport bits
On Tue, Aug 19, 2014 at 10:18:15PM -0700, Dmitry Torokhov wrote:

> Are you actively testing gameport interfaces with real joysticks/gamepads on
> these cards? And what software is still in use that runs on these old boxes
> (with mainline kernel)?

I still do have a huge box of gameport hardware in my office, it's just that I haven't opened it for a number of years. However, if this thread spurred enough interest in gameport devices, I would be willing to open it and do the needed fixes. If not, I think dropping makes sense. I still would shed a tear for all those weird devices in the box, and possibly design an ATMega-based USB<->Gameport adapter that actually works and supports even the digital joysticks.

> > Also, I'm left wondering why e.g. my Athlon XP system (a very popular
> > choice for longer times) would be affected by Cpufreq...
> > And there are no details on how exactly cpufreq is a problem or how this
> > timing issue could be fixed...
>
> If you take a look at gameport_measure_speed() in gameport.c you will see that
> it counts cycles for timing, which obviously does not work that well when CPU
> frequency changes.
>
> The bugs have been opened in bugzilla/reported on lists ages ago but nobody
> stepped up to fix that.

It wouldn't be hard to fix: that code was developed when the timing infrastructure in the kernel was non-existent; making use of it today would make things a lot easier.

> > The obvious workaround for such an ensuing dearth of hardware support
> > could be USB 15-pin gameport adapters - but are they even supported on
> > Linux? Haven't seen info on this...
> > And even if supported, these adapters (at least the non-perfect ones, as
> > can be seen from reviews on a well-known online shop site) are said to
> > be hit-or-miss. 
> > > > http://www.flightsim.com/vbfs/showthread.php?238938-joystick-GamePort-to-USB-adapter-question > > http://reviews.thesource.ca/9026/2600164/nexxtech-usb-gameport-adapter-reviews/reviews.htm > > They have better chance of being supported ;) I had a couple a few years back > and they did work for me. They do work for analog joysticks if you don't want any extended functionality. I have a couple in said box. -- Vojtech Pavlik Director SuSE Labs
[PATCH] s390: add support for DYNAMIC_FTRACE_WITH_REGS
Add support for DYNAMIC_FTRACE_WITH_REGS to 64-bit and 31-bit s390
architectures. This is required for kGraft and kpatch to work on s390.

It's done by adding a _regs variant of ftrace_caller that preserves
registers and puts them on stack in a struct pt_regs layout and allows
modification of the return address by changing the PSW (instruction
pointer) member of struct pt_regs.

Signed-off-by: Vojtech Pavlik
Reviewed-by: Jiri Kosina
Cc: Steven Rostedt
---
 arch/s390/Kconfig               |    1 
 arch/s390/include/asm/ftrace.h  |    4 ++
 arch/s390/include/asm/lowcore.h |   12 ---
 arch/s390/kernel/asm-offsets.c  |    1 
 arch/s390/kernel/early.c        |    3 +
 arch/s390/kernel/ftrace.c       |   68 +++-
 arch/s390/kernel/mcount.S       |   43 +
 arch/s390/kernel/mcount64.S     |   38 ++
 arch/s390/kernel/setup.c        |    1 
 arch/s390/kernel/smp.c          |    1 
 include/linux/ftrace.h          |    1 
 11 files changed, 160 insertions(+), 13 deletions(-)

diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
index bb63499..1de14d3 100644
--- a/arch/s390/Kconfig
+++ b/arch/s390/Kconfig
@@ -113,6 +113,7 @@ config S390
 	select HAVE_C_RECORDMCOUNT
 	select HAVE_DEBUG_KMEMLEAK
 	select HAVE_DYNAMIC_FTRACE
+	select HAVE_DYNAMIC_FTRACE_WITH_REGS
 	select HAVE_FTRACE_MCOUNT_RECORD
 	select HAVE_FUNCTION_GRAPH_TRACER
 	select HAVE_FUNCTION_TRACER
diff --git a/arch/s390/include/asm/ftrace.h b/arch/s390/include/asm/ftrace.h
index bf246da..df87fd2 100644
--- a/arch/s390/include/asm/ftrace.h
+++ b/arch/s390/include/asm/ftrace.h
@@ -17,6 +17,10 @@ static inline unsigned long ftrace_call_adjust(unsigned long addr)
 
 #endif /* __ASSEMBLY__ */
 
+#ifdef CONFIG_DYNAMIC_FTRACE
+#define ARCH_SUPPORTS_FTRACE_OPS 1
+#endif
+
 #ifdef CONFIG_64BIT
 #define MCOUNT_INSN_SIZE 12
 #else
diff --git a/arch/s390/include/asm/lowcore.h b/arch/s390/include/asm/lowcore.h
index 4349197..31b064a 100644
--- a/arch/s390/include/asm/lowcore.h
+++ b/arch/s390/include/asm/lowcore.h
@@ -142,9 +142,10 @@ struct _lowcore {
 	__u32	percpu_offset;			/* 0x02f0 */
 	__u32	machine_flags;			/* 0x02f4 */
 	__u32	ftrace_func;			/* 0x02f8 */
-	__u32	spinlock_lockval;		/* 0x02fc */
+	__u32	ftrace_regs_func;		/* 0x02fc */
+	__u32	spinlock_lockval;		/* 0x0300 */
 
-	__u8	pad_0x0300[0x0e00-0x0300];	/* 0x0300 */
+	__u8	pad_0x0304[0x0e00-0x0304];	/* 0x0304 */
 
 	/*
 	 * 0xe00 contains the address of the IPL Parameter Information
@@ -287,9 +288,10 @@ struct _lowcore {
 	__u64	vdso_per_cpu_data;		/* 0x0380 */
 	__u64	machine_flags;			/* 0x0388 */
 	__u64	ftrace_func;			/* 0x0390 */
-	__u64	gmap;				/* 0x0398 */
-	__u32	spinlock_lockval;		/* 0x03a0 */
-	__u8	pad_0x03a0[0x0400-0x03a4];	/* 0x03a4 */
+	__u64	ftrace_regs_func;		/* 0x0398 */
+	__u64	gmap;				/* 0x03a0 */
+	__u32	spinlock_lockval;		/* 0x03a8 */
+	__u8	pad_0x03ac[0x0400-0x03ac];	/* 0x03ac */
 
 	/* Per cpu primary space access list */
 	__u32	paste[16];			/* 0x0400 */
diff --git a/arch/s390/kernel/asm-offsets.c b/arch/s390/kernel/asm-offsets.c
index afe1715..12ef6ce 100644
--- a/arch/s390/kernel/asm-offsets.c
+++ b/arch/s390/kernel/asm-offsets.c
@@ -150,6 +150,7 @@ int main(void)
 	DEFINE(__LC_MCCK_CLOCK, offsetof(struct _lowcore, mcck_clock));
 	DEFINE(__LC_MACHINE_FLAGS, offsetof(struct _lowcore, machine_flags));
 	DEFINE(__LC_FTRACE_FUNC, offsetof(struct _lowcore, ftrace_func));
+	DEFINE(__LC_FTRACE_REGS_FUNC, offsetof(struct _lowcore, ftrace_regs_func));
 	DEFINE(__LC_DUMP_REIPL, offsetof(struct _lowcore, ipib));
 	BLANK();
 	DEFINE(__LC_CPU_TIMER_SAVE_AREA, offsetof(struct _lowcore, cpu_timer_save_area));
diff --git a/arch/s390/kernel/early.c b/arch/s390/kernel/early.c
index 0dff972..2ab52d2 100644
--- a/arch/s390/kernel/early.c
+++ b/arch/s390/kernel/early.c
@@ -493,5 +493,8 @@ void __init startup_init(void)
 #ifdef CONFIG_DYNAMIC_FTRACE
 	S390_lowcore.ftrace_func = (unsigned long)ftrace_caller;
 #endif
+#ifdef CONFIG_DYNAMIC_FTRACE_WITH_REGS
+	S390_lowcore.ftrace_regs_func = (unsigned long)ftrace_regs_caller;
+#endif
 	lockdep_on();
 }
diff --git a/arch/s390/kernel/ftrace.c b/arch/s390/kernel/ftrace.c
index 54d6493..7d1ec79 100644
--- a/arch/s390/kernel/ftrace.c
+++ b/arch/s390/kernel/ftrace.c
@@ -21,6 +21,7 @@
 
 void ftrace_disable_code(void);
 void ftrace_enable_insn(void);
+void ftrace_enable_regs_insn(void);
 
 #ifdef CONFIG_64BIT
 /*
@@ -56,7 +57,13 @@
 asm(
 "
Re: [PATCH 00/21] kGraft
On Wed, Jun 25, 2014 at 05:54:50PM +0200, Jiri Kosina wrote:

> > -	wait_event_freezable(khubd_wait,
> > +	wait_event_freezable(khubd_wait,
> > 		({ kgr_task_safe(current);
> >
> > The changes are somewhat ugly with all the kgraft crap leaking into places
> > like jbd and freezer and usb. That says to me its not well isolated and
> > there must be a better way of hiding that kgr stuff so we don't have to
> > keep putting more code into all the kthreads, and inevitably missing them
> > or have people getting it wrong.
>
> Tejun commented on this very issue during the first RFC submission a
> couple weeks ago. Part of his proposal was actually to take this as a good
> pretext to review the existing kernel threads, as the idea is that a big
> portion of those could easily be converted to workqueues.

In the past few years there was a significant proliferation of kernel
threads just for cases where something needs to be done from a process
context now and then. This is underlined by the fact that on an average
installation there are more kernel threads than real userspace processes.

Converting these bits to workqueues would then also take care of their
interaction with the freezer and kthread_stop. The main loop code wouldn't
have to be replicated over and over with slight variations. kgr_task_safe
would then plug into that rather elegantly.

In the absence of this rework, kGraft hijacks the freezer and kthread_stop
to pinpoint the 'end' of the main loop of most kernel threads and
annotates the rest that doesn't handle freezing or stopping with
kgr_task_safe directly.

> It of course has its own implications, such as not being able to
> prioritize that easily, but there is probably a lot of low-hanging fruit
> where driver authors and their grandmas have been creating kthreads where
> a workqueue would suffice.

Indeed; on the other hand, there are enough workqueues today and on
realtime systems the need for prioritizing them exists already.

> But it's a very long term goal, and it's probably impractical to make
> kGraft dependent on it.

Should the kernel thread changes turn out to be a big issue, we could do
the initial submission without them and depend on the kthread->workqueue
rework. Once we add support for multiple patches in progress, the fact
that we don't support kernel threads would not be as painful.

> > You add instructions to various hotpaths (despite the CONFIG
> > comment).
>
> Kernel entry is hidden behind _TIF_WORK_SYSCALL_ENTRY and
> _TIF_ALLWORK_MASK, so it doesn't add any overhead if kGraft is not
> enabled. What other hot paths do you refer to?

As Jiří says, I don't see any hot path where kGraft would add
instructions: Kernel exit and entry are handled outside the hot path in
the _TIF_WORK_SYSCALL_ENTRY and _TIF_ALLWORK_MASK handlers. [patch 15/21]

It adds instructions into the main loops of kernel threads, but those can
hardly be called hot paths as they mostly sleep for long times.

-- 
Vojtech Pavlik
Director SUSE Labs
Re: 64bit x86: NMI nesting still buggy?
On Wed, May 21, 2014 at 04:20:55PM +0200, Borislav Petkov wrote:

> On Wed, May 21, 2014 at 03:42:24PM +0200, Jiri Kosina wrote:
> > Alright, Andy's iret optimization efforts do immediately bring a
> > followup question -- why is this not a problem with iret-based return
> > from #MC possibly interrupting NMI?
>
> Yeah, and frankly, I don't see this nesting fun at all protected against
> a #MC interrupting it at any point actually. Because once the #MC
> handler returns, it goes into paranoid_exit and that place doesn't
> account for NMIs at all, AFAICS.
>
> Which would mean:
>
> * NMI goes off
> * MCE happens, we switch to machine_check which is paranoidzeroentry
> * #MC handler is done -> paranoid_exit -> IRET
> -> boom! Or if not "boom", at least, the NMI gets forgotten.
>
> Am I missing something?

I think to get a full BOOM you need a bit more complex process, namely:

* NMI triggered
* NMI handler starts
* MCE happens
* Second NMI triggered and queued
* #MC handler done, IRET
* Second NMI handler starts and overwrites NMI return address on stack
* Second NMI handler ends
* First NMI handler ends and goes into an infinite IRET loop, always
  returning to the beginning of itself

But you do have all the ingredients. And I don't see any other way out
than not calling IRET for MCEs.

-- 
Vojtech Pavlik
Director SUSE Labs
Re: Re: [RFC PATCH 0/2] kpatch: dynamic kernel patching
On Sat, May 17, 2014 at 03:09:57AM +0900, Masami Hiramatsu wrote:

> (2014/05/17 1:27), Jiri Kosina wrote:
> > On Tue, 6 May 2014, Steven Rostedt wrote:
> >
> > > > However, I also think if users can accept such freezing wait-time,
> > > > it means they can also accept kexec based "checkpoint-restart" patching.
> > > > So, I think the final goal of the kpatch will be live patching without
> > > > stopping the machine. I'm discussing the issue on github #138, but that
> > > > is off-topic. :)
> > >
> > > I agree with Ingo too. Being conservative at first is the right
> > > approach here. We should start out with a stop_machine making sure that
> > > everything is sane before we continue. Sure, that's not much different
> > > than a kexec, but lets take things one step at a time.
> > >
> > > ftrace did the stop_machine (and still does for some archs), and slowly
> > > moved to a more efficient method. kpatch/kgraft should follow suit.
> >
> > I don't really agree here.
> >
> > I actually believe that the "lazy" switching kgraft is doing provides a
> > little bit more in the sense of consistency than the stop_machine()-based
> > approach.
> >
> > Consider this scenario:
> >
> > void foo()
> > {
> > 	for (i=0; i<1; i++) {
> > 		bar(i);
> > 		something_else(i);
> > 	}
> > }
>
> In this case, I'd recommend you to add foo() to the replacing target
> as a dummy. Then, kpatch can ensure foo() is actually not running. :)

The problem is: Where do you stop? Adding the whole list of functions that
transitively may ever call bar() can be pretty much the whole kernel. And
given that you can be calling via pointer, you can't often even tell who
is calling bar().

So yes, a developer could manually include foo() in the to-be-patched
list, so that it is checked that foo() isn't being executed while
patching. But that would have to be done entirely manually after
thoroughly understanding the implications of the patch.

> > Let's say you want to live-patch bar(). With the stop_machine()-based
> > approach, you can easily end up with old bar() and new bar() being called
> > in two consecutive iterations before the loop is even exited, right?
> > (especially on a preemptible kernel, or if something_else() goes to sleep).
> >
> > With the lazy switching implemented in kgraft, this can never happen.
>
> And I guess a similar thing may happen with kgraft. If the old function
> and the new function share a non-auto variable and they modify it in
> different ways, the result will be unexpected by the mutual interference.

By non-auto I assume you mean globally accessible variables or data
structures. Yes, with kGraft the new function has to be backward
compatible with the pre-patch global data layout and semantics. The
parameters, return value and their semantics are free to change, though,
and kGraft guarantees consistency there.

-- 
Vojtech Pavlik
SUSE Labs
Re: [RFC 09/16] kgr: mark task_safe in some kthreads
On Wed, May 14, 2014 at 04:59:05PM +0200, Jiri Slaby wrote:

> I see the worst case scenario. (For curious readers, it is for example
> this kthread body:
>
> while (1) {
> 	some_paired_call(); /* invokes pre-patched code */
> 	if (kthread_should_stop()) { /* kgraft switches to the new code */
> 		its_paired_function(); /* invokes patched code (wrong) */
> 		break;
> 	}
> 	its_paired_function(); /* the same (wrong) */
> })
>
> What to do with that now? We have come up with a couple possibilities.
> Would you consider try_to_freeze() a good state-defining function? As it
> is called when a kthread expects weird things can happen, it should be
> safe to switch to the patched version in our opinion.
>
> The other possibility is to patch every kthread loop (~300) and insert
> kgr_task_safe() semi-manually at some proper place.
>
> Or if you have any other suggestions we would appreciate that?

A heretic idea would be to convert all kernel threads into functions that
do not sleep and exit after a single iteration and are called from a
central kthread main loop function.

That would get all of kthread_should_stop() and try_to_freeze() and
kgr_task_safe() nicely into one place and at the same time put enough
constraint on what the thread function can do to prevent it from breaking
the assumptions of each of these calls.

-- 
Vojtech Pavlik
SUSE Labs
Re: 64bit x86: NMI nesting still buggy?
On Tue, Apr 29, 2014 at 05:24:32PM +0200, Jiri Kosina wrote:

> On Tue, 29 Apr 2014, Steven Rostedt wrote:
>
> > > According to 38.4 of [1], when SMM mode is entered while the CPU is
> > > handling NMI, the end result might be that upon exit from SMM, NMIs will
> > > be re-enabled and latched NMI delivered as nested [2].
> >
> > Note, if this were true, then the x86_64 hardware would be extremely
> > buggy. That's because NMIs are not made to be nested. If SMM's come in
> > during an NMI and re-enables the NMI, then *all* software would break.
> > That would basically make NMIs useless.
> >
> > The only time I've ever witnessed problems (and I stress NMIs all the
> > time) is when the NMI itself does a fault. Which my patch set handles
> > properly.
>
> Yes, it indeed does.
>
> In the scenario I have outlined, the race window is extremely small, plus
> NMIs don't happen that often, plus SMIs don't happen that often, plus
> (hopefully) many BIOSes don't enable NMIs upon SMM exit.
>
> The problem is that the Intel documentation is clear in this respect, and
> explicitly states it can happen. And we are violating that, which makes me
> rather nervous -- it'd be very nice to know what is the background of the
> section 38.4 text in the Intel docs.

If we cannot disable IST for NMI on x86_64, because it'd break SYSCALL,
and thus cannot handle this situation well, then we should at least try to
detect it post-factum.

In the NMI handler, after ascertaining that the first NMI is executing
(in_nmi not yet set), we check the return address stored on the stack. If
it points anywhere inside the NMI handler (in reality only in the space
between the NMI handler start and the check), an SMI-triggered nested NMI
has happened. Then we should be able to at least report it before dying.

If it doesn't ever happen: Great, this wasn't a real concern. If it does,
we can pester BIOS vendors.
-- 
Vojtech Pavlik
Director SuSE Labs