Re: [RFC PATCH] thermal: Schedule a backup thermal shutdown workqueue after a known period of time to tackle failed poweroff
can we have a shorter title? On Tue, Dec 29, 2015 at 02:46:49PM +0530, Keerthy wrote: > Hi Nishanth, > > > > >I am not sure if this #ifdeffery is even needed. > > > > > >Eduardo, Rui: If this is not the suggested technique, maybe you guys > >could suggest how we could handle a case where userspace might be > >hungup due to some reason and a case where a critical temperature > >event in the middle of device probe was triggered? Orderly power off is supposed to take care of this. Looking at the code, it will force a shutdown in case execution of userland command fails: static int __orderly_poweroff(bool force) { int ret; ret = run_cmd(poweroff_cmd); if (ret && force) { pr_warn("Failed to start orderly shutdown: forcing the issue\n"); /* * I guess this should try to kick off some daemon to sync and * poweroff asap. Or not even bother syncing if we're doing an * emergency shutdown? */ emergency_sync(); kernel_power_off(); } > > > >Obviously, we'd like to take into consideration userspace latencies as > >well- but that is very specific to fs being run.. not really a simple > >problem, IMHO.. > > -- To unsubscribe from this list: send the line "unsubscribe linux-omap" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH] thermal: Schedule a backup thermal shutdown workqueue after a known period of time to tackle failed poweroff
On 12/31/2015 12:20 PM, Eduardo Valentin wrote: > On Thu, Dec 31, 2015 at 11:47:57AM -0600, Nishanth Menon wrote: >> On 12/31/2015 11:29 AM, Eduardo Valentin wrote: >>> can we have a shorter title? >>> >>> >>> Orderly power off is supposed to take care of this. Looking at the code, >>> it will force a shutdown in case execution of userland command fails: >>> >>> static int __orderly_poweroff(bool force) >>> { >>> int ret; >>> >>> ret = run_cmd(poweroff_cmd); >>> >>> if (ret && force) { >>> pr_warn("Failed to start orderly shutdown: forcing the >>> issue\n"); >>> >>> /* >>> * I guess this should try to kick off some daemon to sync >>> and >>> * poweroff asap. Or not even bother syncing if we're >>> doing an >>> * emergency shutdown? >>> */ >>> emergency_sync(); >>> kernel_power_off(); >>> } >> >> Yes, it will *IF* userspace fails. the condition that I had tracked >> was before identifying the following fix[1] - Example fail is here[2] >> > > OK. But still, why other users of orderly_poweroff do not > deserve to be fixed, then? > I'd agree as well.. I guess the comment from Robin Holtanticipated something like this will eventually occur. "* I guess this should try to kick off some daemon to sync and * poweroff asap. Or not even bother syncing if we're doing an * emergency shutdown? " Keerthy - would you spin this as a generic fix? >> >> I hope this explains the problem. >> >> [1] >> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=00917b5c55aeb01322d5ab51af8c025b82959224 >> [2] http://pastebin.ubuntu.com/14326688/ >> >> [3] >> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/arch/arm/boot/dts/am57xx-beagle-x15.dts#n738 >> >> -- >> Regards, >> Nishanth Menon -- Regards, Nishanth Menon -- To unsubscribe from this list: send the line "unsubscribe linux-omap" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH] thermal: Schedule a backup thermal shutdown workqueue after a known period of time to tackle failed poweroff
On Mon, Dec 21, 2015 at 11:16:18AM +0530, Keerthy wrote: > In few rare conditions like during boot up the orderly_poweroff > function might not be able to complete fully leaving the device > running at dangerously high temperatures. Hence adding a backup > workqueue to act after a known period of time and poweroff the device. I really wish a better description of what is going on. I am not really sure why thermal subsystem must deal with the case of a failing kernel API. If orderly power off is failing, why don't we fix it instead? What are the failing conditions? few rare conditions seams very odd. How does it fail? Why fixing it in thermal? Other users of it don't deserver the same fix? > } > } > -- > 1.9.1 > -- To unsubscribe from this list: send the line "unsubscribe linux-omap" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH] thermal: Schedule a backup thermal shutdown workqueue after a known period of time to tackle failed poweroff
On Thu, Dec 31, 2015 at 11:47:57AM -0600, Nishanth Menon wrote: > On 12/31/2015 11:29 AM, Eduardo Valentin wrote: > > can we have a shorter title? > > > > > > Orderly power off is supposed to take care of this. Looking at the code, > > it will force a shutdown in case execution of userland command fails: > > > > static int __orderly_poweroff(bool force) > > { > > int ret; > > > > ret = run_cmd(poweroff_cmd); > > > > if (ret && force) { > > pr_warn("Failed to start orderly shutdown: forcing the > > issue\n"); > > > > /* > > * I guess this should try to kick off some daemon to sync > > and > > * poweroff asap. Or not even bother syncing if we're > > doing an > > * emergency shutdown? > > */ > > emergency_sync(); > > kernel_power_off(); > > } > > Yes, it will *IF* userspace fails. the condition that I had tracked > was before identifying the following fix[1] - Example fail is here[2] > OK. But still, why other users of orderly_poweroff do not deserve to be fixed, then? > > I hope this explains the problem. > > [1] > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=00917b5c55aeb01322d5ab51af8c025b82959224 > [2] http://pastebin.ubuntu.com/14326688/ > > [3] > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/arch/arm/boot/dts/am57xx-beagle-x15.dts#n738 > > -- > Regards, > Nishanth Menon -- To unsubscribe from this list: send the line "unsubscribe linux-omap" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH] thermal: Schedule a backup thermal shutdown workqueue after a known period of time to tackle failed poweroff
On 12/31/2015 11:29 AM, Eduardo Valentin wrote: > can we have a shorter title? > > On Tue, Dec 29, 2015 at 02:46:49PM +0530, Keerthy wrote: >> Hi Nishanth, >> > > >>> >>> I am not sure if this #ifdeffery is even needed. >>> >>> >>> Eduardo, Rui: If this is not the suggested technique, maybe you guys >>> could suggest how we could handle a case where userspace might be >>> hungup due to some reason and a case where a critical temperature >>> event in the middle of device probe was triggered? > > Orderly power off is supposed to take care of this. Looking at the code, > it will force a shutdown in case execution of userland command fails: > > static int __orderly_poweroff(bool force) > { > int ret; > > ret = run_cmd(poweroff_cmd); > > if (ret && force) { > pr_warn("Failed to start orderly shutdown: forcing the > issue\n"); > > /* > * I guess this should try to kick off some daemon to sync and > * poweroff asap. Or not even bother syncing if we're doing > an > * emergency shutdown? > */ > emergency_sync(); > kernel_power_off(); > } Yes, it will *IF* userspace fails. the condition that I had tracked was before identifying the following fix[1] - Example fail is here[2] In this case, tmp102 is setup for X15 as [3] - and built as a module. as the kernel startsup filesystem and starts a modprobe of all modules via udev rules, the probe of tmp102 detects (falsely) a critical temperature condition. Shutdown attempt in the middle of driver probe is always a tricky business. As we look at the log in [2], Line 472 > thermal thermal_zone3: critical temperature reached(108 C),shutting down We have userspace trigger for shutdown taking place. Line 495: INIT: Sending processes the TERM signal userspace starts shutting down services. (but note that probe for other devices were either in progress or queued up to complete).. at line 647 - we are in a weird place -> sysrq shows that system is idled and userspace is shutdown and system is still active. In this case, we entered the case thanks to a driver bug, but if this situation was a real world temperature scenario, then we'd probably in an overtemp scenario, then device damage could take place OR something much worse. The only alternative is to run a parallel thread in case userspace fails to complete the job in some given period of time - due to what ever be the condition triggering the problem. I hope this explains the problem. [1] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=00917b5c55aeb01322d5ab51af8c025b82959224 [2] http://pastebin.ubuntu.com/14326688/ [3] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/arch/arm/boot/dts/am57xx-beagle-x15.dts#n738 -- Regards, Nishanth Menon -- To unsubscribe from this list: send the line "unsubscribe linux-omap" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH] thermal: Schedule a backup thermal shutdown workqueue after a known period of time to tackle failed poweroff
Hi Nishanth, On Monday 28 December 2015 11:11 PM, Nishanth Menon wrote: On 12/20/2015 11:46 PM, Keerthy wrote: +linux-pm. Thanks for looping! In few rare conditions like during boot up the orderly_poweroff function might not be able to complete fully leaving the device running at dangerously high temperatures. Hence adding a backup workqueue to act after a known period of time and poweroff the device. Suggested-by: Nishanth MenonReported-by: Nishanth Menon The specific case I hit was a critical temperature event as soon as the hwmon device was probed (the driver tmp102 was a kernel module). Signed-off-by: Keerthy --- drivers/thermal/Kconfig| 9 + drivers/thermal/thermal_core.c | 26 ++ 2 files changed, 35 insertions(+) diff --git a/drivers/thermal/Kconfig b/drivers/thermal/Kconfig index 8cc4ac6..25584ee 100644 --- a/drivers/thermal/Kconfig +++ b/drivers/thermal/Kconfig @@ -92,6 +92,15 @@ config THERMAL_DEFAULT_GOV_POWER_ALLOCATOR endchoice +config THERMAL_BKUP_SHUTDOWN_DELAY_MS +int "Thermal shutdown backup delay in milli-seconds" +depends on THERMAL +default "5000" +---help--- +The number of milliseconds to delay after orderly_poweroff call. Probably needs a rephrase. Delay in milliseconds before the backup thermal shutdown is triggered. + +Default: 5000 (5 seconds) + config THERMAL_GOV_FAIR_SHARE bool "Fair-share thermal governor" help diff --git a/drivers/thermal/thermal_core.c b/drivers/thermal/thermal_core.c index d9e525c..b793857 100644 --- a/drivers/thermal/thermal_core.c +++ b/drivers/thermal/thermal_core.c @@ -61,6 +61,12 @@ static DEFINE_MUTEX(thermal_governor_lock); static struct thermal_governor *def_governor; +#ifdef CONFIG_THERMAL_BKUP_SHUTDOWN_DELAY_MS +#define BKUP_SHUTDOWN_DELAY CONFIG_THERMAL_BKUP_SHUTDOWN_DELAY_MS +#else +#define BKUP_SHUTDOWN_DELAY 5000 +#endif + I am not sure if this #ifdeffery is even needed. Eduardo, Rui: If this is not the suggested technique, maybe you guys could suggest how we could handle a case where userspace might be hungup due to some reason and a case where a critical temperature event in the middle of device probe was triggered? Obviously, we'd like to take into consideration userspace latencies as well- but that is very specific to fs being run.. not really a simple problem, IMHO.. static struct thermal_governor *__find_governor(const char *name) { struct thermal_governor *pos; @@ -423,6 +429,18 @@ static void handle_non_critical_trips(struct thermal_zone_device *tz, def_governor->throttle(tz, trip); } +static void bkup_shutdown_func(struct work_struct *unused); +static DECLARE_DELAYED_WORK(bkup_shutdown_work, bkup_shutdown_func); + +static void bkup_shutdown_func(struct work_struct *work) +{ + pr_warn("Orderly_poweroff has failed! Attempting kernel_power_off\n"); + kernel_power_off(); + + pr_warn("kernel_power_off has failed! Attempting emergency_restart\n"); + emergency_restart(); I think documentation is necessary that we are hoping for bootloader to be able to detect and halt as needed -> risk here obviously is an infinite reboot loop :( . Agreed. +} + static void handle_critical_trips(struct thermal_zone_device *tz, int trip, enum thermal_trip_type trip_type) { @@ -443,6 +461,14 @@ static void handle_critical_trips(struct thermal_zone_device *tz, dev_emerg(>device, "critical temperature reached(%d C),shutting down\n", tz->temperature / 1000); + /** needs to be kernel doc style? +* This is a backup option to shutdown the +* system in case orderly_poweroff +* fails Maybe adjust to 80char? Okay. +*/ + schedule_delayed_work(_shutdown_work, + msecs_to_jiffies(BKUP_SHUTDOWN_DELAY)); + orderly_poweroff(true); } } I will wait for Eduardo/Rui's inputs before posting the next version. Best Regards, Keerthy -- To unsubscribe from this list: send the line "unsubscribe linux-omap" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH] thermal: Schedule a backup thermal shutdown workqueue after a known period of time to tackle failed poweroff
On 12/20/2015 11:46 PM, Keerthy wrote: +linux-pm. > In few rare conditions like during boot up the orderly_poweroff > function might not be able to complete fully leaving the device > running at dangerously high temperatures. Hence adding a backup > workqueue to act after a known period of time and poweroff the device. > > Suggested-by: Nishanth Menon> Reported-by: Nishanth Menon The specific case I hit was a critical temperature event as soon as the hwmon device was probed (the driver tmp102 was a kernel module). > Signed-off-by: Keerthy > --- > drivers/thermal/Kconfig| 9 + > drivers/thermal/thermal_core.c | 26 ++ > 2 files changed, 35 insertions(+) > > diff --git a/drivers/thermal/Kconfig b/drivers/thermal/Kconfig > index 8cc4ac6..25584ee 100644 > --- a/drivers/thermal/Kconfig > +++ b/drivers/thermal/Kconfig > @@ -92,6 +92,15 @@ config THERMAL_DEFAULT_GOV_POWER_ALLOCATOR > > endchoice > > +config THERMAL_BKUP_SHUTDOWN_DELAY_MS > +int "Thermal shutdown backup delay in milli-seconds" > +depends on THERMAL > +default "5000" > +---help--- > +The number of milliseconds to delay after orderly_poweroff call. Probably needs a rephrase. > + > +Default: 5000 (5 seconds) > + > config THERMAL_GOV_FAIR_SHARE > bool "Fair-share thermal governor" > help > diff --git a/drivers/thermal/thermal_core.c b/drivers/thermal/thermal_core.c > index d9e525c..b793857 100644 > --- a/drivers/thermal/thermal_core.c > +++ b/drivers/thermal/thermal_core.c > @@ -61,6 +61,12 @@ static DEFINE_MUTEX(thermal_governor_lock); > > static struct thermal_governor *def_governor; > > +#ifdef CONFIG_THERMAL_BKUP_SHUTDOWN_DELAY_MS > +#define BKUP_SHUTDOWN_DELAY CONFIG_THERMAL_BKUP_SHUTDOWN_DELAY_MS > +#else > +#define BKUP_SHUTDOWN_DELAY 5000 > +#endif > + I am not sure if this #ifdeffery is even needed. Eduardo, Rui: If this is not the suggested technique, maybe you guys could suggest how we could handle a case where userspace might be hungup due to some reason and a case where a critical temperature event in the middle of device probe was triggered? Obviously, we'd like to take into consideration userspace latencies as well- but that is very specific to fs being run.. not really a simple problem, IMHO.. > static struct thermal_governor *__find_governor(const char *name) > { > struct thermal_governor *pos; > @@ -423,6 +429,18 @@ static void handle_non_critical_trips(struct > thermal_zone_device *tz, > def_governor->throttle(tz, trip); > } > > +static void bkup_shutdown_func(struct work_struct *unused); > +static DECLARE_DELAYED_WORK(bkup_shutdown_work, bkup_shutdown_func); > + > +static void bkup_shutdown_func(struct work_struct *work) > +{ > + pr_warn("Orderly_poweroff has failed! Attempting kernel_power_off\n"); > + kernel_power_off(); > + > + pr_warn("kernel_power_off has failed! Attempting emergency_restart\n"); > + emergency_restart(); I think documentation is necessary that we are hoping for bootloader to be able to detect and halt as needed -> risk here obviously is an infinite reboot loop :( . > +} > + > static void handle_critical_trips(struct thermal_zone_device *tz, > int trip, enum thermal_trip_type trip_type) > { > @@ -443,6 +461,14 @@ static void handle_critical_trips(struct > thermal_zone_device *tz, > dev_emerg(>device, > "critical temperature reached(%d C),shutting down\n", > tz->temperature / 1000); > + /** needs to be kernel doc style? > + * This is a backup option to shutdown the > + * system in case orderly_poweroff > + * fails Maybe adjust to 80char? > + */ > + schedule_delayed_work(_shutdown_work, > + msecs_to_jiffies(BKUP_SHUTDOWN_DELAY)); > + > orderly_poweroff(true); > } > } > -- Regards, Nishanth Menon -- To unsubscribe from this list: send the line "unsubscribe linux-omap" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html