Re: [PATCH] memory cgroup: update root memory cgroup when node is onlined
At 09/14/2012 09:36 AM, Hugh Dickins Wrote: > On Thu, 13 Sep 2012, Johannes Weiner wrote: >> On Thu, Sep 13, 2012 at 03:14:28PM +0800, Wen Congyang wrote: >>> root_mem_cgroup->info.nodeinfo is initialized when the system boots. >>> But NODE_DATA(nid) is null if the node is not onlined, so >>> root_mem_cgroup->info.nodeinfo[nid]->zoneinfo[zone].lruvec.zone contains >>> an invalid pointer. If we use numactl to bind a program to the node >>> after onlining the node and its memory, it will cause the kernel >>> panicked: >> >> Is there any chance we could get rid of the zone backpointer in lruvec >> again instead? > > It could be done, but it would make me sad :( > >> Adding new nodes is a rare event and so updating every >> single memcg in the system might be just borderline crazy. > > Not horribly crazy, but rather ugly, yes. > >> But can't >> we just go back to passing the zone along with the lruvec down >> vmscan.c paths? I agree it's ugly to pass both, given their >> relationship. But I don't think the backpointer is any cleaner but in >> addition less robust. > > It's like how we use vma->mm: we could change everywhere to pass mm with > vma, but it looks cleaner and cuts down on long arglists to have mm in vma. >>From past experience, one of the things I worried about was adding extra > args to the reclaim stack. > >> >> That being said, the crashing code in particular makes me wonder: >> >> static __always_inline void add_page_to_lru_list(struct page *page, >> struct lruvec *lruvec, enum lru_list lru) >> { >> int nr_pages = hpage_nr_pages(page); >> mem_cgroup_update_lru_size(lruvec, lru, nr_pages); >> list_add(>lru, >lists[lru]); >> __mod_zone_page_state(lruvec_zone(lruvec), NR_LRU_BASE + lru, nr_pages); >> } >> >> Why did we ever pass zone in here and then felt the need to replace it >> with lruvec->zone in fa9add6 "mm/memcg: apply add/del_page to lruvec"? 
>> A page does not roam between zones, its zone is a static property that >> can be retrieved with page_zone(). > > Just as in vmscan.c, we have the lruvec to hand, and that's what we > mainly want to operate upon, but there is also some need for zone. > > (Both Konstantin and I were looking towards the day when we move the > lru_lock into the lruvec, removing more dependence on "zone". Pretty > much the only reason that hasn't happened yet, is that we have not found > time to make a performance case convincingly - but that's another topic.) > > Yes, page_zone(page) is a static property of the page, but it's not > necessarily cheap to evaluate: depends on how complex the memory model > and the spare page flags space, doesn't it? We both preferred to > derive zone from lruvec where convenient. > > How do you feel about this patch, and does it work for you guys? > > You'd be right if you guessed that I started out without the > mem_cgroup_zone_lruvec part of it, but oops in get_scan_count > told me that's needed too. > > Description to be filled in later: would it be needed for -stable, > or is onlining already broken in other ways that you're now fixing up? > > Reported-by: Tang Chen > Signed-off-by: Hugh Dickins Hi, all: What about the status of this patch? 
Thanks Wen Congyang > --- > > include/linux/mmzone.h |2 - > mm/memcontrol.c| 40 --- > mm/mmzone.c|6 - > mm/page_alloc.c|2 - > 4 files changed, 36 insertions(+), 14 deletions(-) > > --- 3.6-rc5/include/linux/mmzone.h2012-08-03 08:31:26.892842267 -0700 > +++ linux/include/linux/mmzone.h 2012-09-13 17:07:51.893772372 -0700 > @@ -744,7 +744,7 @@ extern int init_currently_empty_zone(str >unsigned long size, >enum memmap_context context); > > -extern void lruvec_init(struct lruvec *lruvec, struct zone *zone); > +extern void lruvec_init(struct lruvec *lruvec); > > static inline struct zone *lruvec_zone(struct lruvec *lruvec) > { > --- 3.6-rc5/mm/memcontrol.c 2012-08-03 08:31:27.060842270 -0700 > +++ linux/mm/memcontrol.c 2012-09-13 17:46:36.870804625 -0700 > @@ -1061,12 +1061,25 @@ struct lruvec *mem_cgroup_zone_lruvec(st > struct mem_cgroup *memcg) > { > struct mem_cgroup_per_zone *mz; > + struct lruvec *lruvec; > > - if (mem_cgroup_disabled()) > - return >lruvec; > + if (mem_cgroup_disabled()) { > + lruvec = >lruvec; > + goto out; > + } > > mz = mem_cgroup_zoneinfo(memcg, zone_to_nid(zone), zone_idx(zone)); > - return >lruvec; > + lruvec = >lruvec; > +out: > + /* > + * Since a node can be onlined after the mem_cgroup was created, > + * we have to be prepared to initialize lruvec->zone here. > + */ > + if (unlikely(lruvec->zone != zone)) { > + VM_BUG_ON(lruvec->zone); > + lruvec->zone = zone; > +
Re: [PATCH v2 09/13] ARM: davinci - update the dm644x soc code to use common clk drivers
Hi Murali, On 10/15/2012 9:21 PM, Karicheri, Muralidharan wrote: > --Cut > >>> Subject: Re: [PATCH v2 09/13] ARM: davinci - update the dm644x soc code to >>> use >>> common clk drivers >>> >> You have chosen to keep all clock related data in platform files >> while using the common clock framework to provide just the >> infrastructure. If you look at how mxs and spear have been migrated, >> they have >>> migrated the soc specific clock data to drivers/clk as well. >> See "drivers/clk/spear/spear3xx_clock.c" or >> "drivers/clk/mxs/clk-imx23.c I have to disagree on this one. I had investigated these code already and came up with a way that we can re-use code across all of the davinci platforms as well as other architectures that re-uses the clk hardware IPs. >>> >>> Which code you are talking about here? Even if you introduce clk-dm644x.c, >>> clk- >>> keystone.c etc in drivers/clk/davinci/ you can reuse the code you introduce >>> in patches 1- >>> 3. I cant see how that will be prevented. > > I was talking about re-use of davinci_common_clk_init in > drivers/clk/davinci/davinci-clock.c. > This is meant to be re-used across all of the DaVinci devices. > >>> spear3xx_clock.c has initialization code for each of the platforms and so is the case with imx23.c. >>> >>> By each of the platforms, you mean they all cater to a family of devices? >>> This depends on >>> how close together the family of devices are. >>> Otherwise, there would be a file per soc. DM644x also represents a family >>> for that matter. >>> By using platform_data approach, we are able to define clks for each of the SoC and >>> then use davinci_common_clk_init() to do initialize the clk drivers based >>> on platform >>> data. >>> >>> You need to define and register the clocks present on each SoC either which >>> way. I don't >>> see why just the platform_data approach allows this. 
>>> And looking closely, you have defined platform data, but don't actually >>> have a platform >>> device, making things more confusing. >>> > > Ok. There are multiple ways to implement this software. We had discussed this > internally and picked the platform_data approach. The clk drivers are written > not > following the platform driver model. But I don't see why we can't use > platform data > to configure this drivers. Down below, you have made two interesting points, > one is > ARM code reduction. This patch already does this by moving the API that > initializes > the clk drivers (davinci_common_clk_init()) out of ARM to > drivers/clk/davinci. So > this + removal of existing clk driver under arm/mach-davinci/clock.[ch], we > have > achieved this goal. The second point is the moving of SoC specific clk data > out of SoC > code to drive. Are you 100% sure this is the right thing to do for these > drivers. If so, > I can start working on this change right away. As I am working on this as a > background > activity, I want to be double or triple sure before doing the rework of these > patches :). > So please confirm. Yes, this is the right way to go. And I don't see it as something breaking new ground since there are already multiple SoCs in mainline which are following this same approach. May be to start with just convert one SoC and send for review. Thanks for taking this up and helping clean-up mach-davinci. Regards, Sekhar -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH 1/3] mm: teach mm by current context info to not do I/O during memory allocation
On Tue, Oct 16, 2012 at 09:56:48AM +0800, Ming Lei wrote: > On Mon, Oct 15, 2012 at 11:47 PM, Minchan Kim wrote: > > On Mon, Oct 15, 2012 at 01:14:17PM +0800, Ming Lei wrote: > >> This patch introduces PF_MEMALLOC_NOIO on process flag('flags' field of > >> 'struct task_struct'), so that the flag can be set by one task > >> to avoid doing I/O inside memory allocation in the task's context. > >> > >> The patch trys to solve one deadlock problem caused by block device, > >> and the problem can be occured at least in the below situations: > >> > >> - during block device runtime resume situation, if memory allocation > >> with GFP_KERNEL is called inside runtime resume callback of any one > >> of its ancestors(or the block device itself), the deadlock may be > >> triggered inside the memory allocation since it might not complete > >> until the block device becomes active and the involed page I/O finishes. > >> The situation is pointed out first by Alan Stern. It is not a good > >> approach to convert all GFP_KERNEL in the path into GFP_NOIO because > >> several subsystems may be involved(for example, PCI, USB and SCSI may > >> be involved for usb mass stoarage device) > > > > Couldn't we expand pm_restrict_gfp_mask to cover resume path as well as > > suspend path? > > IMO, we could, but it is not good and might trigger memory allocation problem. > > pm_restrict_gfp_mask uses the global variable of gfp_allowed_mask to > avoid allocating page with GFP_IOFS in all contexts during system sleep, > when processes have been frozen. > > But during runtime PM, the whole system is running and all processes are > runnable. Also runtime PM is per device and the whole system may have > lots of devices, so taking the global gfp_allowed_mask may keep page > allocation with ~GFP_IOFS for a considerable proportion of system > running time, then alloc_page() will return failure easier. 
>
> The above deadlock problem may be fixed by allocating memory with
> ~GFP_IOFS only in the context of calling runtime_resume, and that is
> the idea of the patch.

Fair enough, but it wouldn't be a good idea to add a new unlikely branch
to the allocator's fast path. Please move the check into the slow path,
for example into __alloc_pages_slowpath.

>
> >
> >>
> >> - during error handling situation of usb mass storage device, USB
> >> bus reset will be put on the device, so there shouldn't have any
> >> memory allocation with GFP_KERNEL during USB bus reset, otherwise
> >> the deadlock similar with above may be triggered. Unfortunately, any
> >> usb device may include one mass storage interface in theory, so it
> >> requires all usb interface drivers to handle the situation. In fact,
> >> most usb drivers don't know how to handle bus reset on the device
> >> and don't provide .pre_reset() and .post_reset() callback at all, so
> >> USB core has to unbind and bind driver for these devices. So it
> >> is still not practical to resort to GFP_NOIO for solving the problem.
> >
> > I hope this case could be handled by usb core like usb_restrict_gfp_mask
> > rather than adding new branch on fast path.
>
> See above, applying the global gfp_allowed_mask is not good.
>
>
> Thanks,
> --
> Ming Lei

--
Kind Regards,
Minchan Kim
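The request above, keeping the per-task check out of the allocator's fast path, can be illustrated with a small userspace model. This is only a sketch under stated assumptions: every `sk_`/`SK_` name and every flag value below is hypothetical and chosen for illustration; none of it is the kernel's actual definition or the code from the patch.

```c
#include <stdbool.h>

/* Hypothetical flag values for illustration only; not the kernel's. */
#define SK_GFP_IO           0x1u
#define SK_GFP_FS           0x2u
#define SK_GFP_WAIT         0x4u
#define SK_GFP_KERNEL       (SK_GFP_WAIT | SK_GFP_IO | SK_GFP_FS)
#define SK_PF_MEMALLOC_NOIO 0x1u

/* Stand-in for the 'flags' field of a task structure. */
struct sk_task {
	unsigned int flags;
};

/* If the task asked for no I/O, strip the I/O and FS bits from the mask. */
static unsigned int sk_memalloc_noio_flags(const struct sk_task *t,
					   unsigned int gfp)
{
	if (t->flags & SK_PF_MEMALLOC_NOIO)
		gfp &= ~(SK_GFP_IO | SK_GFP_FS);
	return gfp;
}

/*
 * Models an allocator entry point: when the fast path succeeds, the
 * task flag is never examined; only the slow path pays for the check.
 */
static unsigned int sk_alloc_effective_gfp(const struct sk_task *t,
					   unsigned int gfp,
					   bool fast_path_ok)
{
	if (fast_path_ok)
		return gfp;	/* fast path: no extra branch on task flags */
	return sk_memalloc_noio_flags(t, gfp);	/* slow path only */
}
```

In this model the restriction applies only to the task that set the flag, unlike a global mask such as gfp_allowed_mask, which would affect every allocation in the system.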
Re: [PATCH v9 05/12] x86, hotplug, suspend: Online CPU0 for suspend or hibernate
On 10/16/2012 02:20 AM, Rafael J. Wysocki wrote: > On Friday 12 of October 2012 09:09:42 Fenghua Yu wrote: >> From: Fenghua Yu >> >> Because x86 BIOS requires CPU0 to resume from sleep, suspend or hibernate >> can't >> be executed if CPU0 is detected offline. To make suspend or hibernate and >> further resume succeed, CPU0 must be online. >> >> Signed-off-by: Fenghua Yu >> --- >> arch/x86/power/cpu.c | 44 >> 1 files changed, 44 insertions(+), 0 deletions(-) >> >> diff --git a/arch/x86/power/cpu.c b/arch/x86/power/cpu.c >> index 218cdb1..adde775 100644 >> --- a/arch/x86/power/cpu.c >> +++ b/arch/x86/power/cpu.c >> @@ -237,3 +237,47 @@ void restore_processor_state(void) >> #ifdef CONFIG_X86_32 >> EXPORT_SYMBOL(restore_processor_state); >> #endif >> + >> +/* >> + * When bsp_check() is called in hibernate and suspend, cpu hotplug >> + * is disabled already. So it's unnessary to handle race condition between >> + * cpumask query and cpu hotplug. >> + */ >> +static int bsp_check(void) >> +{ >> +if (cpumask_first(cpu_online_mask) != 0) { >> +pr_warn("CPU0 is offline.\n"); >> +return -ENODEV; >> +} >> + >> +return 0; >> +} >> + >> +static int bsp_pm_callback(struct notifier_block *nb, unsigned long action, >> + void *ptr) >> +{ >> +int ret = 0; >> + >> +switch (action) { >> +case PM_SUSPEND_PREPARE: >> +case PM_HIBERNATION_PREPARE: >> +ret = bsp_check(); >> +break; >> +default: >> +break; >> +} >> +return notifier_from_errno(ret); >> +} >> + > > I wonder if there's anything preventing CPU0 from becoming offline after > you've > done this check and before user space is frozen? > Hi Rafael, bsp_pm_callback runs as a low priority notifier callback, specifically with lower priority than the cpu_hotplug_pm_callback (as mentioned in the comment below). And cpu_hotplug_pm_callback disables regular CPU hotplug (till the suspend/resume sequence is complete).. So there is no chance for CPU0 to become offline after that. 
Or, are you thinking of some other scenario where CPU0 can go offline?

Regards,
Srivatsa S. Bhat

>
>
>> +static int __init bsp_pm_check_init(void)
>> +{
>> +	/*
>> +	 * Set this bsp_pm_callback as lower priority than
>> +	 * cpu_hotplug_pm_callback. So cpu_hotplug_pm_callback will be called
>> +	 * earlier to disable cpu hotplug before bsp online check.
>> +	 */
>> +	pm_notifier(bsp_pm_callback, -INT_MAX);
>> +	return 0;
>> +}
>> +
>> +core_initcall(bsp_pm_check_init);
>>
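The ordering argument above can be illustrated with a small userspace model of a priority-sorted notifier chain. It is a sketch only: all `sk_` names are hypothetical, and the chain is assumed to call higher-priority callbacks first, with the hotplug callback registered at priority 0 as an assumption for the example.

```c
#include <limits.h>
#include <stddef.h>

#define SK_MAX_NOTIFIERS 8

typedef int (*sk_notifier_fn)(void);

struct sk_notifier {
	sk_notifier_fn fn;
	int priority;
};

struct sk_chain {
	struct sk_notifier entries[SK_MAX_NOTIFIERS];
	size_t count;
};

/* Insert a callback, keeping the array sorted highest priority first. */
static int sk_chain_register(struct sk_chain *c, sk_notifier_fn fn,
			     int priority)
{
	size_t i;

	if (c->count >= SK_MAX_NOTIFIERS)
		return -1;
	i = c->count;
	while (i > 0 && c->entries[i - 1].priority < priority) {
		c->entries[i] = c->entries[i - 1];
		i--;
	}
	c->entries[i].fn = fn;
	c->entries[i].priority = priority;
	c->count++;
	return 0;
}

/* Call every callback in order; stop at the first non-zero result. */
static int sk_chain_call(struct sk_chain *c)
{
	for (size_t i = 0; i < c->count; i++) {
		int ret = c->entries[i].fn();

		if (ret)
			return ret;
	}
	return 0;
}

static int sk_hotplug_disabled;			/* set by the hotplug callback */
static int sk_cpu0_online = 1;			/* modelled CPU0 state */
static int sk_check_saw_hotplug_disabled;	/* recorded by the BSP check */

/* Models cpu_hotplug_pm_callback: disables regular CPU hotplug. */
static int sk_cpu_hotplug_pm_callback(void)
{
	sk_hotplug_disabled = 1;
	return 0;
}

/* Models bsp_pm_callback: checks CPU0 and records what it observed. */
static int sk_bsp_pm_callback(void)
{
	sk_check_saw_hotplug_disabled = sk_hotplug_disabled;
	return sk_cpu0_online ? 0 : -19;	/* -19 stands in for -ENODEV */
}
```

Registering the BSP check with priority -INT_MAX places it last in this model, so by the time it runs hotplug has already been disabled, regardless of registration order.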
Re: [PATCH 04/10] ASoC: imx: Don't use {en,dis}able_fiq() calls
On Mon, Oct 15, 2012 at 02:51:28PM -0700, Anton Vorontsov wrote:
> The driver uses platform-specific mxc_set_irq_fiq() with the VIRQ cookie
> passed to it, so it's pretty clear that the driver is absolutely sure
> that the FIQ is routed via platform-specific IC, and that the cookie can
> be used to mask/unmask FIQs. So, let's switch to the genirq routines,
> since we're about to remove FIQ-specific variants.

Acked-by: Mark Brown
Re: [PATCH 1/2] regulator: core: Check before enabling regulator while setting constraints.
On Tue, Oct 16, 2012 at 10:54:19AM +0530, Yadwinder Singh Brar wrote:
> This patch adds check, whether regulator is already enabled before enabling it
> while setting machine constraints. Since some PMICs have same register bits
> for setting opmode and enabling/disabling the regulator, so it will overwrite
> the settings (if any) done by set_mode/set_suspend_mode callbacks when it
> enables regulator without checking previous status.

This sounds like a bug in the driver; these ops are supposed to be
repeatable at will. The driver needs to remember the mode setting when
doing enable or disable, and setting the mode should not change the
enable status.
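What the reply asks of the driver can be sketched as follows. This is a minimal userspace model, assuming a hypothetical 2-bit register field that encodes both "off" and the operating mode; the `sk_` names and bit values are invented for illustration and are not the max77686 register layout or the regulator API.

```c
#include <stdint.h>

/* Hypothetical field: one 2-bit value is both the enable and the mode. */
#define SK_OPMODE_MASK   0x3u
#define SK_OPMODE_OFF    0x0u
#define SK_OPMODE_LPM    0x2u
#define SK_OPMODE_NORMAL 0x3u

struct sk_regulator {
	uint8_t reg;	/* models the hardware register */
	uint8_t opmode;	/* cached mode, applied whenever the output is on */
};

/* Update only the shared enable/mode field of the modelled register. */
static void sk_write_field(struct sk_regulator *r, uint8_t val)
{
	r->reg = (uint8_t)((r->reg & ~SK_OPMODE_MASK) | (val & SK_OPMODE_MASK));
}

static int sk_is_enabled(const struct sk_regulator *r)
{
	return (r->reg & SK_OPMODE_MASK) != SK_OPMODE_OFF;
}

/* Enable re-applies the remembered mode, so it can be repeated at will. */
static void sk_enable(struct sk_regulator *r)
{
	sk_write_field(r, r->opmode);
}

/* Disable clears the field but leaves the cached mode untouched. */
static void sk_disable(struct sk_regulator *r)
{
	sk_write_field(r, SK_OPMODE_OFF);
}

/* Setting the mode never changes whether the regulator is enabled. */
static int sk_set_mode(struct sk_regulator *r, uint8_t mode)
{
	if (mode != SK_OPMODE_LPM && mode != SK_OPMODE_NORMAL)
		return -1;
	r->opmode = mode;
	if (sk_is_enabled(r))
		sk_write_field(r, mode);
	return 0;
}
```

With the mode cached in the driver, the core can call enable without first checking the current state, which is the behaviour the reply expects from these ops.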
Re: [PATCH] Fix scheduling-while-atomic problem in console_cpu_notify()
On 10/16/2012 10:05 AM, Paul E. McKenney wrote:
> On Mon, Oct 15, 2012 at 05:31:28PM -0700, Paul E. McKenney wrote:
>> The console_cpu_notify() function runs with interrupts disabled in
>> the CPU_DEAD case. It therefore cannot block, for example, as will
>> happen when it calls console_lock(). Therefore, remove the CPU_DEAD
>> leg of the switch statement to avoid this problem.
>>
>> Signed-off-by: Paul E. McKenney
>
> s/CPU_DEAD/CPU_DYING/
>
> Apparently it is a bad idea to compose and send a patch while in a
> C++ standards committee meeting where people are arguing about async
> futures... Fixed patch below.
>
> 							Thanx, Paul
>
>
>
> printk: Fix scheduling-while-atomic problem in console_cpu_notify()
>
> The console_cpu_notify() function runs with interrupts disabled in
> the CPU_DYING case. It therefore cannot block, for example, as will
> happen when it calls console_lock(). Therefore, remove the CPU_DYING
> leg of the switch statement to avoid this problem.
>
> Signed-off-by: Paul E. McKenney
>

Reviewed-by: Srivatsa S. Bhat

Regards,
Srivatsa S. Bhat

> diff --git a/kernel/printk.c b/kernel/printk.c
> index 66a2ea3..2d607f4 100644
> --- a/kernel/printk.c
> +++ b/kernel/printk.c
> @@ -1890,7 +1890,6 @@ static int __cpuinit console_cpu_notify(struct notifier_block *self,
>  	switch (action) {
>  	case CPU_ONLINE:
>  	case CPU_DEAD:
> -	case CPU_DYING:
>  	case CPU_DOWN_FAILED:
>  	case CPU_UP_CANCELED:
>  		console_lock();
>
[PATCH 2/2] regulator: max77686: Add set_suspend_disable/set_suspend_mode callbacks.
This patch implements set_suspend_disable callback for BUCKs which support only switch ON/OFF modes during system suspend state, and set_suspend_mode callbacks for LDOs which also suport Low power mode and switch ON/OFF modes. Signed-off-by: Yadwinder Singh Brar --- drivers/regulator/max77686.c | 142 +++-- 1 files changed, 135 insertions(+), 7 deletions(-) diff --git a/drivers/regulator/max77686.c b/drivers/regulator/max77686.c index 2a67d08..e83db38 100644 --- a/drivers/regulator/max77686.c +++ b/drivers/regulator/max77686.c @@ -69,6 +69,76 @@ struct max77686_data { struct regulator_dev *rdev[MAX77686_REGULATORS]; }; +/* Some BUCKS supports Normal[ON/OFF] mode during suspend */ +static int max77686_buck_set_suspend_disable(struct regulator_dev *rdev) +{ + unsigned int val; + + if (rdev->desc->id == MAX77686_BUCK1) + val = 0x1; + else + val = 0x1 << MAX77686_OPMODE_BUCK234_SHIFT; + + return regmap_update_bits(rdev->regmap, rdev->desc->enable_reg, + rdev->desc->enable_mask, + val); +} + +/* Some LDOs supports [LPM/Normal]ON mode during suspend state */ +static int max77686_set_suspend_mode(struct regulator_dev *rdev, +unsigned int mode) +{ + unsigned int val; + + /* BUCK[5-9] doesn't support this feature */ + if (rdev->desc->id >= MAX77686_BUCK5) + return 0; + + switch (mode) { + case REGULATOR_MODE_IDLE: /* ON in LP Mode */ + val = 0x2 << MAX77686_OPMODE_SHIFT; + break; + case REGULATOR_MODE_NORMAL: /* ON in Normal Mode */ + val = 0x3 << MAX77686_OPMODE_SHIFT; + break; + default: + pr_warn("%s: regulator_suspend_mode : 0x%x not supported\n", + rdev->desc->name, mode); + return -EINVAL; + } + + return regmap_update_bits(rdev->regmap, rdev->desc->enable_reg, + rdev->desc->enable_mask, + val); +} + +/* Some LDOs supports LPM-ON/OFF/Normal-ON mode during suspend state */ +static int max77686_ldo_set_suspend_mode(struct regulator_dev *rdev, +unsigned int mode) +{ + unsigned int val; + + switch (mode) { + case REGULATOR_MODE_STANDBY:/* switch off */ + val = 0x1 << 
MAX77686_OPMODE_SHIFT; + break; + case REGULATOR_MODE_IDLE: /* ON in LP Mode */ + val = 0x2 << MAX77686_OPMODE_SHIFT; + break; + case REGULATOR_MODE_NORMAL: /* ON in Normal Mode */ + val = 0x3 << MAX77686_OPMODE_SHIFT; + break; + default: + pr_warn("%s: regulator_suspend_mode : 0x%x not supported\n", + rdev->desc->name, mode); + return -EINVAL; + } + + return regmap_update_bits(rdev->regmap, rdev->desc->enable_reg, + rdev->desc->enable_mask, + val); +} + static int max77686_set_ramp_delay(struct regulator_dev *rdev, int ramp_delay) { unsigned int ramp_value = RAMP_RATE_NO_CTRL; @@ -103,6 +173,31 @@ static struct regulator_ops max77686_ops = { .get_voltage_sel= regulator_get_voltage_sel_regmap, .set_voltage_sel= regulator_set_voltage_sel_regmap, .set_voltage_time_sel = regulator_set_voltage_time_sel, + .set_suspend_mode = max77686_set_suspend_mode, +}; + +static struct regulator_ops max77686_ldo_ops = { + .list_voltage = regulator_list_voltage_linear, + .map_voltage= regulator_map_voltage_linear, + .is_enabled = regulator_is_enabled_regmap, + .enable = regulator_enable_regmap, + .disable= regulator_disable_regmap, + .get_voltage_sel= regulator_get_voltage_sel_regmap, + .set_voltage_sel= regulator_set_voltage_sel_regmap, + .set_voltage_time_sel = regulator_set_voltage_time_sel, + .set_suspend_mode = max77686_ldo_set_suspend_mode, +}; + +static struct regulator_ops max77686_buck1_ops = { + .list_voltage = regulator_list_voltage_linear, + .map_voltage= regulator_map_voltage_linear, + .is_enabled = regulator_is_enabled_regmap, + .enable = regulator_enable_regmap, + .disable= regulator_disable_regmap, + .get_voltage_sel= regulator_get_voltage_sel_regmap, + .set_voltage_sel= regulator_set_voltage_sel_regmap, + .set_voltage_time_sel = regulator_set_voltage_time_sel, + .set_suspend_disable=
Re: [RFC][PATCH] perf: Add a few generic stalled-cycles events
On 10/15/2012 10:53 PM, Arun Sharma wrote:
> On 10/15/12 8:55 AM, Robert Richter wrote:
>
> [..]
>> Perf tool works then out-of-the-box with:
>>
>>  $ perf record -e cpu/stalled-cycles-fixed-point/ ...
>>
>> The event string can easily be reused by other architectures as a
>> quasi standard.
>
> I like Robert's proposal better. It's hard to model all the stall events
> (eg: instruction decoder related stalls on x86) in a hardware
> independent way.
>
> Another area to think about: software engineers are generally busy and
> have a limited amount of time to devote to hardware event based
> optimizations. The most common question I hear is: what is the expected
> perf gain if I fix this? It's hard to answer that with just the stall
> events.
>

Hardware event based optimization is a very important aspect of real
world application tuning. CPI stack analysis is a good reason why perf
should have stall events as generic ones. But it is not clear to me when
a new generic event should be added to linux/perf_event.h and when we
should go with the sysfs interface instead. Could you please elaborate
on this?

Regards
Anshuman
[PATCH 1/2] regulator: core: Check before enabling regulator while setting constraints.
This patch adds a check for whether the regulator is already enabled
before enabling it while setting machine constraints. Some PMICs use the
same register bits for setting the opmode and for enabling/disabling the
regulator, so enabling the regulator without checking its previous status
overwrites any settings made by the set_mode/set_suspend_mode callbacks.

Signed-off-by: Yadwinder Singh Brar
---
 drivers/regulator/core.c |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/drivers/regulator/core.c b/drivers/regulator/core.c
index f7c74db..9e3a0c7 100644
--- a/drivers/regulator/core.c
+++ b/drivers/regulator/core.c
@@ -958,6 +958,9 @@ static int set_machine_constraints(struct regulator_dev *rdev,
 	 */
 	if ((rdev->constraints->always_on || rdev->constraints->boot_on) &&
 	    ops->enable) {
+		if (ops->is_enabled && ops->is_enabled(rdev))
+			goto enabled;
+
 		ret = ops->enable(rdev);
 		if (ret < 0) {
 			rdev_err(rdev, "failed to enable\n");
@@ -965,6 +968,7 @@ static int set_machine_constraints(struct regulator_dev *rdev,
 		}
 	}
 
+enabled:
 	if (rdev->constraints->ramp_delay && ops->set_ramp_delay) {
 		ret = ops->set_ramp_delay(rdev, rdev->constraints->ramp_delay);
 		if (ret < 0) {
-- 
1.7.0.4
[PATCH v2] ACPI: move acpi_no_s4_hw_signature() declaration into #ifdef CONFIG_HIBERNATION
acpi_no_s4_hw_signature() is defined inside an #ifdef CONFIG_HIBERNATION
block, but the current code puts its declaration inside an
#ifdef CONFIG_PM_SLEEP block.

I hit this issue when I turned off the PM_SLEEP config manually:

  arch/x86/kernel/acpi/sleep.c:100:4: error: implicit declaration of function ‘acpi_no_s4_hw_signature’ [-Werror=implicit-function-declaration]

v2: use a better title and add the build error message, as suggested by Fengguang

Signed-off-by: Yuanhan Liu
Reviewed-by: Fengguang Wu
---
 include/linux/acpi.h |    5 ++++-
 1 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/include/linux/acpi.h b/include/linux/acpi.h
index 90be989..a468429 100644
--- a/include/linux/acpi.h
+++ b/include/linux/acpi.h
@@ -257,8 +257,11 @@ int acpi_check_region(resource_size_t start, resource_size_t n,
 
 int acpi_resources_are_enforced(void);
 
-#ifdef CONFIG_PM_SLEEP
+#ifdef CONFIG_HIBERNATION
 void __init acpi_no_s4_hw_signature(void);
+#endif
+
+#ifdef CONFIG_PM_SLEEP
 void __init acpi_old_suspend_ordering(void);
 void __init acpi_nvs_nosave(void);
 #endif /* CONFIG_PM_SLEEP */
-- 
1.7.7.6
Re: [PATCH] ACPI: fix the wrong #ifdef for acpi_no_s4_hw_signature
On Tue, Oct 16, 2012 at 12:27:13PM +0800, Fengguang Wu wrote:
> The title could be made more descriptive:
>
> ACPI: move acpi_no_s4_hw_signature() declaration into #ifdef CONFIG_HIBERNATION

Yes, much better.

>
> On Tue, Oct 16, 2012 at 12:05:03PM +0800, Yuanhan Liu wrote:
> > acpi_no_s4_hw_signature is defined in #ifdef CONFIG_HIBERNATION block,
> > but the current code puts the declaration in #ifdef CONFIG_PM_SLEEP block.
>
> And it's better to always include the original build error/warning
> messages when fixing build problems.

Got it. Will send out v2 soon.

Thanks,
Yuanhan Liu

>
> Otherwise looks good to me.
>
> Reviewed-by: Fengguang Wu
>
> > Signed-off-by: Yuanhan Liu
> > ---
> >  include/linux/acpi.h |    5 ++++-
> >  1 files changed, 4 insertions(+), 1 deletions(-)
> >
> > diff --git a/include/linux/acpi.h b/include/linux/acpi.h
> > index 90be989..a468429 100644
> > --- a/include/linux/acpi.h
> > +++ b/include/linux/acpi.h
> > @@ -257,8 +257,11 @@ int acpi_check_region(resource_size_t start, resource_size_t n,
> >
> >  int acpi_resources_are_enforced(void);
> >
> > -#ifdef CONFIG_PM_SLEEP
> > +#ifdef CONFIG_HIBERNATION
> >  void __init acpi_no_s4_hw_signature(void);
> > +#endif
> > +
> > +#ifdef CONFIG_PM_SLEEP
> >  void __init acpi_old_suspend_ordering(void);
> >  void __init acpi_nvs_nosave(void);
> >  #endif /* CONFIG_PM_SLEEP */
> > --
> > 1.7.7.6
Re: [PATCH v1 0/2] x86, apic: Disable BSP if boot cpu is AP
From: HATAYAMA Daisuke Subject: Re: [PATCH v1 0/2] x86, apic: Disable BSP if boot cpu is AP Date: Tue, 16 Oct 2012 14:03:13 +0900 > From: "Yu, Fenghua" > Subject: RE: [PATCH v1 0/2] x86, apic: Disable BSP if boot cpu is AP > Date: Tue, 16 Oct 2012 04:51:36 + > >>> -Original Message- >>> From: HATAYAMA Daisuke [mailto:d.hatay...@jp.fujitsu.com] >>> Sent: Monday, October 15, 2012 9:35 PM >>> To: linux-kernel@vger.kernel.org; ke...@lists.infradead.org; >>> x...@kernel.org >>> Cc: mi...@elte.hu; t...@linutronix.de; h...@zytor.com; Brown, Len; Yu, >>> Fenghua; vgo...@redhat.com; ebied...@xmission.com; >>> grant.lik...@secretlab.ca; rob.herr...@calxeda.com; >>> d.hatay...@jp.fujitsu.com >>> Subject: [PATCH v1 0/2] x86, apic: Disable BSP if boot cpu is AP >>> >>> Multiple CPUs are useful for CPU-bound processing like compression and >>> I do want to use compression to generate crash dump quickly. But now >>> we cannot wakeup the 2nd and later cpus in the kdump 2nd kernel if >>> crash happens on AP. If crash happens on AP, kexec enters the 2nd >>> kernel with the AP, and there BSP in the 1st kernel is expected to be >>> haling in the 1st kernel or possibly in any fatal system error state. >>> >>> To wake up AP, we use the method called INIT-INIT-SIPI. INIT causes >>> BSP to jump into BIOS init code. A typical visible behaviour is hang >>> or immediate reset, depending on the BIOS init code. >>> >>> AP can be initiated by INIT even in a fatal state: MP spec explains >>> that processor-specific INIT can be used to recover AP from a fatal >>> system error. On the other hand, there's no method for BSP to recover; >>> it might be possible to do so by NMI plus any hand-coded reset code >>> that is carefully designed, but at least I have no idea in this >>> direction now. >> >> In my BSP hotplug patchset, BPS is waken up by NMI. The patchset is >> not in tip tree yet. 
>>
>> BSP hotplug patchset can be found at https://lkml.org/lkml/2012/10/12/336
>>
>>>
>>> Therefore, the idea I do in this patch set is simply to disable BSP if
>>> boot cpu is AP.
>>>
>>
>> The BSP hotplug patchset will be useful for your goal. With the BSP hotplug
>> patchset, you can wake up BSP and don't need to disable it.
>>
>>> My motivation is to use multiple CPUs in order to quickly generate
>>> crash dump on the machine with huge amount of memory. I assume such
>>> machine tends to also have a lot of CPUs. So disabling one CPU would
>>> be no problem.
>>
>> Luckily you don't need to disable any CPU to achieve your goal with
>> the BSP hotplug patchset:)
>>
>> On a dual core/single thread machine, this means you get 100% performance
>> boost with BSP's help.
>>
>> Plus crash dump kernel code is better structured by not treating BSP
>> specially.
>>
>
> Hello Fenghua.
>
> I've of course noticed your patch set and locally tested, but I saw
> NMI to BSP failed in the 2nd kernel. I'll send a log to you later.
>
> BTW, I tested with your previous v8 patch set. Did you change
> something during v8 to v9 relevant to this issue?
>

I forgot to mention one thing: your patch set identifies the BSP as CPU#0.
CPU#0 is assigned to the boot cpu, which can be an AP in the kdump 2nd
kernel, so identifying the BSP by CPU#0 is not enough here. I have a local
patch set based on your v8 patches that does this, but NMI to BSP still
failed. I guess this comes from the difference in BSP states: halting in
play dead in your NMI method, versus halting in the 1st kernel on crash,
or possibly in a fatal system error state, in the actual situation.

Thanks.
HATAYAMA, Daisuke
Re: [GIT PULL] Disintegrate UAPI for xtensa [ver #2]
On Tue, Oct 9, 2012 at 1:16 PM, David Howells wrote:
> Can you merge the following branch into the xtensa tree please.
>
> This is to complete part of the UAPI disintegration for which the preparatory
> patches were pulled recently.
>
> Now that the fixups and the asm-generic chunk have been merged, I've
> regenerated the patches to get rid of those dependencies and to take account of
> any changes made so far in the merge window. If you have already pulled the
> older version of the branch aimed at you, then please feel free to ignore this
> request.
>
> The following changes since commit 9e2d8656f5e8aa214e66b462680cf86b210b74a8:
>
>   Merge branch 'akpm' (Andrew's patch-bomb) (2012-10-09 16:23:15 +0900)
>
> are available in the git repository at:
>
>
>   git://git.infradead.org/users/dhowells/linux-headers.git tags/disintegrate-xtensa-20121009
>
> for you to fetch changes up to 91a0696e40414e9f1554cd91060f6b404d545cb3:
>
>   UAPI: (Scripted) Disintegrate arch/xtensa/include/asm (2012-10-09 09:47:57 +0100)
>
>
> UAPI Disintegration 2012-10-09
>
>
> David Howells (1):
>       UAPI: (Scripted) Disintegrate arch/xtensa/include/asm

Thanks, applied to the xtensa_next tree.

-- Max
Re: [PATCH 1/2] ASoC: Ux500: Dispose of device nodes correctly
On Mon, Oct 15, 2012 at 02:13:25PM +0100, Lee Jones wrote:
> When of_parse_phandle() is used to find a device node, its
> reference count is incremented by the helper. Once we're
> finished with them, it's our responsibility to ensure they
> are freed in the correct manner.

Applied both, thanks.
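The ownership rule described in the quoted commit message can be modelled in plain C. What follows is only a userspace sketch, not kernel code: struct node, node_get(), node_put(), parse_phandle() and use_codec() are hypothetical stand-ins that mimic the get/put discipline of of_parse_phandle() and of_node_put(), and the actual Ux500 patch is not reproduced here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a device node with a reference count. */
struct node {
	int refcount;
	struct node *target;	/* node that this node's phandle points at */
};

/* Take a reference; mirrors the role of of_node_get(). */
static struct node *node_get(struct node *n)
{
	if (n)
		n->refcount++;
	return n;
}

/* Drop a reference; mirrors the role of of_node_put(). */
static void node_put(struct node *n)
{
	if (n) {
		assert(n->refcount > 0);
		n->refcount--;
	}
}

/* Models of_parse_phandle(): the returned node carries an extra reference. */
static struct node *parse_phandle(struct node *np)
{
	return np ? node_get(np->target) : NULL;
}

/* The caller owns the reference and must drop it on every exit path. */
static int use_codec(struct node *np)
{
	struct node *codec = parse_phandle(np);

	if (!codec)
		return -1;
	/* ... work with codec here ... */
	node_put(codec);	/* balance the reference taken by the lookup */
	return 0;
}
```

After use_codec() returns, the target's count is back where it started; forgetting the node_put() call would leave it one too high, which is the leak the patch description is about.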
RE: [PATCH v1 0/2] x86, apic: Disable BSP if boot cpu is AP
> >> My motivation is to use multiple CPUs in order to quickly generate
> >> crash dump on the machine with huge amount of memory. I assume such
> >> machine tends to also have a lot of CPUs. So disabling one CPU would
> >> be no problem.
> >
> > Luckily you don't need to disable any CPU to achieve your goal with
> > the BSP hotplug patchset :)
> >
> > On a dual core/single thread machine, this means you get 100% performance
> > boost with BSP's help.
> >
> > Plus crash dump kernel code is better structured by not treating BSP
> > specially.
> >
>
> Hello Fenghua.
>
> I've of course noticed your patch set and locally tested, but I saw
> NMI to BSP failed in the 2nd kernel. I'll send a log to you later.
>
> BTW, I tested with your previous v8 patch set. Did you change
> something during v8 to v9 relevant to this issue?

In patch 0/12 of v9, I describe what changed in v9 on top of v8:

v9: Add Intel vendor check to support the feature on Intel platforms only.

Did you see the BSP wake-up failure on the latest tip tree? There is an RCU
regression which prevents the BSP from waking up in 3.6.0. The issue was
fixed on 10/12; the commit is a4fbe35a. Please make sure your tip tree has
this commit.

Thanks.

-Fenghua
Re: [PATCH 1/2] regulator: gpio-regulator: Allow use of GPIO controlled regulators though DT
On Mon, Oct 15, 2012 at 02:16:59PM +0100, Lee Jones wrote:
> Here we provide the GPIO Regulator driver with Device Tree capability, so
> that when a platform is booting with DT instead of platform data we can
> still make full use of it.

I haven't looked at the patch yet, but patch 2 doesn't seem to have appeared?
Re: [PATCH 25/25] xtensa: Use Kbuild infrastructure to handle asm-generic headers
On Sat, Oct 13, 2012 at 6:26 AM, Steven Rostedt wrote:
> From: Steven Rostedt
>
> Use Kbuild infrastructure to handle the asm-generic headers
> and remove the wrapper headers that call them.
>
> This only affects headers that do nothing but include the generic
> equivalent. It does not touch any header that does a little more.
>
> Cc: linux-kbu...@vger.kernel.org
> Cc: linux-xte...@linux-xtensa.org
> Cc: Chris Zankel
> Cc: Max Filippov
> Signed-off-by: Steven Rostedt

Thanks, rebased on top of UAPI changes and applied to the xtensa_next tree.

-- Max
Re: mpol_to_str revisited.
On Mon, Oct 15, 2012 at 11:58 PM, David Rientjes wrote:
> On Mon, 15 Oct 2012, KOSAKI Motohiro wrote:
>
>> I don't think 80de7c3138ee9fd86a98696fd2cf7ad89b995d0a is right fix.
>
> It's certainly not a complete fix, but I think it's a much better result
> of the race, i.e. we don't panic anymore, we simply fail the read()
> instead.

Even though 80de7c3138ee9fd86a98696fd2cf7ad89b995d0a itself is simple, it
pushes complexity onto the callers. That's not good and not worthwhile.

>> we should
>> close a race (or kill remain ref count leak) if we still have.
>
> As I mentioned earlier in the thread, the read() is done here on a task
> while only a reference to the task_struct is taken and we do not hold
> task_lock() which is required for task->mempolicy. Once that is fixed,
> mpol_to_str() should never be called for !task->mempolicy so it will never
> need to return -EINVAL in such a condition.

I agree that's obviously a bug and we should fix it.
Re: CPU utilization between physical CPU and virtual CPU in KVM
Can anybody help with this, or offer a few clues? Thanks!

On Mon, Oct 8, 2012 at 3:01 PM, Dennis Chen wrote:
> Hi All,
>
> I am confused by the following observed scenario:
>
> In my 4-CPU (KVM supported, 2 cores with 2 threads each) host
> machine box, I create only one VM with 3 vCPUs through virsh/libvirt
> tools and also I pin this VM process to the physical processor 3. I
> guess the CPU utilization for the processor 3 will not exceed 100%,
> then I create 3 processes (dead loop-- while(1);) and bind each of them
> to vCPU[0-2] respectively, through the "top -c" command in VM
> environment, I can see the CPU utilization for each of the vCPUs is
> about 100%, but interestingly, I found that the CPU utilization of
> processor 3 in the host machine is about 300% with the "top -c" command.
> Why can a single process bound to one CPU get ~300% cpu bandwidth
> in this case? Does the kernel scheduler dispatch the idle cycle
> capacity of the CPUs to the virtual CPUs of the VM, in other words, does
> the scheduler know the vCPU info in the VM process?
>
> For the same case, if I create another 4 new dead-loop processes and
> bind them to the physical CPU[0-3] equally, then I find the vCPU0/1 in
> VM will not be 100%, e.g. 32%. (I think the scheduler in the guest OS
> doesn't know it's running in a virtual environment, so the utilization
> of the vCPU will not change to adapt to the physical processor
> utilization, but it did, why?)
Re: [PATCH v1 0/2] x86, apic: Disable BSP if boot cpu is AP
From: "Yu, Fenghua"
Subject: RE: [PATCH v1 0/2] x86, apic: Disable BSP if boot cpu is AP
Date: Tue, 16 Oct 2012 04:51:36 +

>> -----Original Message-----
>> From: HATAYAMA Daisuke [mailto:d.hatay...@jp.fujitsu.com]
>> Sent: Monday, October 15, 2012 9:35 PM
>> To: linux-kernel@vger.kernel.org; ke...@lists.infradead.org;
>> x...@kernel.org
>> Cc: mi...@elte.hu; t...@linutronix.de; h...@zytor.com; Brown, Len; Yu,
>> Fenghua; vgo...@redhat.com; ebied...@xmission.com;
>> grant.lik...@secretlab.ca; rob.herr...@calxeda.com;
>> d.hatay...@jp.fujitsu.com
>> Subject: [PATCH v1 0/2] x86, apic: Disable BSP if boot cpu is AP
>>
>> Multiple CPUs are useful for CPU-bound processing like compression and
>> I do want to use compression to generate crash dump quickly. But now
>> we cannot wake up the 2nd and later cpus in the kdump 2nd kernel if
>> crash happens on AP. If crash happens on AP, kexec enters the 2nd
>> kernel with the AP, and there BSP in the 1st kernel is expected to be
>> halting in the 1st kernel or possibly in any fatal system error state.
>>
>> To wake up AP, we use the method called INIT-INIT-SIPI. INIT causes
>> BSP to jump into BIOS init code. A typical visible behaviour is hang
>> or immediate reset, depending on the BIOS init code.
>>
>> AP can be initiated by INIT even in a fatal state: MP spec explains
>> that processor-specific INIT can be used to recover AP from a fatal
>> system error. On the other hand, there's no method for BSP to recover;
>> it might be possible to do so by NMI plus any hand-coded reset code
>> that is carefully designed, but at least I have no idea in this
>> direction now.
>
> In my BSP hotplug patchset, BSP is woken up by NMI. The patchset is
> not in tip tree yet.
>
> BSP hotplug patchset can be found at https://lkml.org/lkml/2012/10/12/336
>
>>
>> Therefore, the idea I do in this patch set is simply to disable BSP if
>> boot cpu is AP.
>>
>
> The BSP hotplug patchset will be useful for your goal. With the BSP hotplug
> patchset, you can wake up BSP and don't need to disable it.
>
>> My motivation is to use multiple CPUs in order to quickly generate
>> crash dump on the machine with huge amount of memory. I assume such
>> machine tends to also have a lot of CPUs. So disabling one CPU would
>> be no problem.
>
> Luckily you don't need to disable any CPU to achieve your goal with
> the BSP hotplug patchset :)
>
> On a dual core/single thread machine, this means you get 100% performance
> boost with BSP's help.
>
> Plus crash dump kernel code is better structured by not treating BSP
> specially.
>

Hello Fenghua.

I've of course noticed your patch set and tested it locally, but I saw
NMI to BSP fail in the 2nd kernel. I'll send a log to you later.

BTW, I tested with your previous v8 patch set. Did you change
anything between v8 and v9 that is relevant to this issue?

Thanks.
HATAYAMA, Daisuke
RE: [PATCH v1 0/2] x86, apic: Disable BSP if boot cpu is AP
> -----Original Message-----
> From: HATAYAMA Daisuke [mailto:d.hatay...@jp.fujitsu.com]
> Sent: Monday, October 15, 2012 9:35 PM
> To: linux-kernel@vger.kernel.org; ke...@lists.infradead.org;
> x...@kernel.org
> Cc: mi...@elte.hu; t...@linutronix.de; h...@zytor.com; Brown, Len; Yu,
> Fenghua; vgo...@redhat.com; ebied...@xmission.com;
> grant.lik...@secretlab.ca; rob.herr...@calxeda.com;
> d.hatay...@jp.fujitsu.com
> Subject: [PATCH v1 0/2] x86, apic: Disable BSP if boot cpu is AP
>
> Multiple CPUs are useful for CPU-bound processing like compression and
> I do want to use compression to generate crash dump quickly. But now
> we cannot wake up the 2nd and later cpus in the kdump 2nd kernel if
> crash happens on AP. If crash happens on AP, kexec enters the 2nd
> kernel with the AP, and there BSP in the 1st kernel is expected to be
> halting in the 1st kernel or possibly in any fatal system error state.
>
> To wake up AP, we use the method called INIT-INIT-SIPI. INIT causes
> BSP to jump into BIOS init code. A typical visible behaviour is hang
> or immediate reset, depending on the BIOS init code.
>
> AP can be initiated by INIT even in a fatal state: MP spec explains
> that processor-specific INIT can be used to recover AP from a fatal
> system error. On the other hand, there's no method for BSP to recover;
> it might be possible to do so by NMI plus any hand-coded reset code
> that is carefully designed, but at least I have no idea in this
> direction now.

In my BSP hotplug patchset, the BSP is woken up by NMI. The patchset is
not in the tip tree yet.

BSP hotplug patchset can be found at https://lkml.org/lkml/2012/10/12/336

>
> Therefore, the idea I do in this patch set is simply to disable BSP if
> boot cpu is AP.
>

The BSP hotplug patchset will be useful for your goal. With the BSP hotplug
patchset, you can wake up the BSP and don't need to disable it.

> My motivation is to use multiple CPUs in order to quickly generate
> crash dump on the machine with huge amount of memory. I assume such
> machine tends to also have a lot of CPUs. So disabling one CPU would
> be no problem.

Luckily you don't need to disable any CPU to achieve your goal with
the BSP hotplug patchset :)

On a dual core/single thread machine, this means you get a 100% performance
boost with the BSP's help.

Plus, the crash dump kernel code is better structured by not treating the
BSP specially.

Thanks.

-Fenghua
Re: [PATCH 05/16] f2fs: add checkpoint operations
On Sat, 13 Oct 2012 00:49:06 +0900 Jaegeuk Kim wrote:

> 2012-10-11 (Thu), 09:24 +1100, NeilBrown:
> > On Fri, 05 Oct 2012 20:59:29 +0900 김재극 wrote:
> >
> > > +static void do_checkpoint(struct f2fs_sb_info *sbi, bool is_umount)
> > > +{
> > > +	struct f2fs_checkpoint *ckpt = F2FS_CKPT(sbi);
> > > +	nid_t last_nid = 0;
> > > +	int nat_upd_blkoff[3];
> > > +	block_t start_blk;
> > > +	struct page *cp_page;
> > > +	unsigned int data_sum_blocks, orphan_blocks;
> > > +	void *kaddr;
> > > +	__u32 crc32 = 0;
> > > +	int i;
> > > +
> > > +	/* Flush all the NAT/SIT pages */
> > > +	while (get_pages(sbi, F2FS_DIRTY_META))
> > > +		sync_meta_pages(sbi, META, LONG_MAX);
> > > +
> > > +	next_free_nid(sbi, &last_nid);
> > > +
> > > +	/*
> > > +	 * modify checkpoint
> > > +	 * version number is already updated
> > > +	 */
> > > +	ckpt->elapsed_time = cpu_to_le64(get_mtime(sbi));
> > > +	ckpt->valid_block_count = cpu_to_le64(valid_user_blocks(sbi));
> > > +	ckpt->free_segment_count = cpu_to_le32(free_segments(sbi));
> > > +	for (i = 0; i < 3; i++) {
> > > +		ckpt->cur_node_segno[i] =
> > > +			cpu_to_le32(curseg_segno(sbi, i + CURSEG_HOT_NODE));
> > > +		ckpt->cur_node_blkoff[i] =
> > > +			cpu_to_le16(curseg_blkoff(sbi, i + CURSEG_HOT_NODE));
> > > +		nat_upd_blkoff[i] = NM_I(sbi)->nat_upd_blkoff[i];
> > > +		ckpt->nat_upd_blkoff[i] = cpu_to_le16(nat_upd_blkoff[i]);
> > > +		ckpt->alloc_type[i + CURSEG_HOT_NODE] =
> > > +			curseg_alloc_type(sbi, i + CURSEG_HOT_NODE);
> > > +	}
> > > +	for (i = 0; i < 3; i++) {
> > > +		ckpt->cur_data_segno[i] =
> > > +			cpu_to_le32(curseg_segno(sbi, i + CURSEG_HOT_DATA));
> > > +		ckpt->cur_data_blkoff[i] =
> > > +			cpu_to_le16(curseg_blkoff(sbi, i + CURSEG_HOT_DATA));
> > > +		ckpt->alloc_type[i + CURSEG_HOT_DATA] =
> > > +			curseg_alloc_type(sbi, i + CURSEG_HOT_DATA);
> > > +	}
> > > +
> > > +	ckpt->valid_node_count = cpu_to_le32(valid_node_count(sbi));
> > > +	ckpt->valid_inode_count = cpu_to_le32(valid_inode_count(sbi));
> > > +	ckpt->next_free_nid = cpu_to_le32(last_nid);
> > > +
> > > +	/* 2 cp + n data seg summary + orphan inode blocks */
> > > +	data_sum_blocks = npages_for_summary_flush(sbi);
> > > +	if (data_sum_blocks < 3)
> > > +		ckpt->ckpt_flags |= CP_COMPACT_SUM_FLAG;
> > > +	else
> > > +		ckpt->ckpt_flags &= (~CP_COMPACT_SUM_FLAG);
> > > +
> > > +	orphan_blocks = (sbi->n_orphans + F2FS_ORPHANS_PER_BLOCK - 1)
> > > +					/ F2FS_ORPHANS_PER_BLOCK;
> > > +	ckpt->cp_pack_start_sum = 1 + orphan_blocks;
> > > +	ckpt->cp_pack_total_block_count = 2 + data_sum_blocks + orphan_blocks;
> >
> > This looks a bit weird to me, though I might be misunderstanding something.
> >
> > data_sum_blocks is either 1, 2, or 3.
> > "3" actually means "at least 3".
> >
> > If it is 3, you choose not to set CP_COMPACT_SUM_FLAG. In that case the NAT
> > and SIT journal entries go into SSA blocks, not into the checkpoint at all.
> > So in that case, zero blocks of the checkpoint are used for journalling.
> > Yet you still add data_sum_blocks (==3) to the cp_pack_total_block_count
> > (and later to the start block).
> > Is that really what you want to do? Leave 3 empty blocks?
> >
> > I would suggest changing npages_for_summary_flush to return 0 if the number
> > of blocks needed would be more than three, and set CP_COMPACT_SUM_FLAG only
> > when data_sum_blocks > 0.
> >
> > I don't know if you would need to make a corresponding change to the
> > recovery code, I haven't fully examined that yet.
>
> Ok, let me explain about CP_COMPACT_SUM_FLAG.
> Let's assume that there are some journal entries and data summaries.
> Note that this scenario is not from the umount procedure.
>
> Basically f2fs writes three data summary blocks for current active logs
> inside the checkpoint pack.
> And NAT and SIT journal entries are stored in hot and cold data summary
> blocks.
> So, if the CP_COMPACT_SUM_FLAG is not set, f2fs writes the checkpoint
> pack like this.
>
>  [CP 0]
>  [Orphan blocks]
>  [Hot sum block w/ NAT journal]
>  [Warm sum block]
>  [Cold sum block w/ SIT journal]
>  [CP 0']
>
> But, if the CP_COMPACT_SUM_FLAG is set, the checkpoint pack consists of
> 1 or 2 summary blocks as follows.
>
>  [CP 0]
>  [Orphan blocks]
>  [summary entries w/ NAT and SIT journal]
>  [CP 0']
>
> or,
>
>  [CP 0]
>  [Orphan blocks]
>  [summary entries]
>  [summary entries w/ NAT and SIT journal]
>  [CP 0']
>
> So, I think it needs no change.
> Any idea?
>

Thanks, I see.
I missed the fact that the current data summary blocks are always written
to the checkpoint area - I assumed they were being written back to the SSA.
So it makes sense now and you are right - no change needed.

Thanks,
NeilBrown

> >
> > Regards,
> > NeilBrown
>
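The block arithmetic discussed in this exchange can be checked with a small standalone sketch. It only restates the three assignments from the quoted do_checkpoint() hunk in userspace C; struct cp_layout and cp_pack_layout() are hypothetical names, and orphans_per_block is passed as a parameter because the real value of F2FS_ORPHANS_PER_BLOCK is not given in the thread.

```c
#include <assert.h>

/* Hypothetical summary of how one checkpoint pack is laid out. */
struct cp_layout {
	unsigned int orphan_blocks;	/* blocks holding orphan inode entries */
	unsigned int start_sum;		/* offset of the first summary block */
	unsigned int total_blocks;	/* whole pack, including both CP blocks */
	int compact;			/* nonzero when the compact flag is set */
};

/*
 * Mirrors the quoted code: round the orphan count up to whole blocks,
 * then count 2 checkpoint blocks + summary blocks + orphan blocks.
 */
static struct cp_layout cp_pack_layout(unsigned int n_orphans,
				       unsigned int orphans_per_block,
				       unsigned int data_sum_blocks)
{
	struct cp_layout l;

	assert(orphans_per_block > 0);
	l.orphan_blocks = (n_orphans + orphans_per_block - 1) / orphans_per_block;
	l.start_sum = 1 + l.orphan_blocks;
	l.total_blocks = 2 + data_sum_blocks + l.orphan_blocks;
	l.compact = data_sum_blocks < 3;
	return l;
}
```

With no orphans and three summary blocks this gives a five-block pack with the compact flag clear, which matches the first layout drawn in the explanation: CP 0, three summary blocks, CP 0'.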
Re: [PATCH 07/16] f2fs: add segment operations
On Sat, 13 Oct 2012 00:12:53 +0900 Jaegeuk Kim wrote:

> 2012-10-11 (Thu), 09:37 +1100, NeilBrown:
> > On Fri, 05 Oct 2012 21:00:55 +0900 김재극 wrote:
> >
> > > +/**
> > > + * Find a new segment from the free segments bitmap to right order
> > > + * This function should be returned with success, otherwise BUG
> > > + */
> > > +static void get_new_segment(struct f2fs_sb_info *sbi,
> > > +			unsigned int *newseg, bool new_sec, int dir)
> > > +{
> > > +	struct free_segmap_info *free_i = FREE_I(sbi);
> > > +	unsigned int total_secs = sbi->total_sections;
> > > +	unsigned int segno, secno, zoneno;
> > > +	unsigned int total_zones = sbi->total_sections / sbi->secs_per_zone;
> > > +	unsigned int hint = *newseg >> sbi->log_segs_per_sec;
> > > +	unsigned int old_zoneno = GET_ZONENO_FROM_SEGNO(sbi, *newseg);
> > > +	unsigned int left_start = hint;
> > > +	bool init = true;
> > > +	int go_left = 0;
> > > +	int i;
> > > +
> > > +	write_lock(&free_i->segmap_lock);
> > > +
> > > +	if (!new_sec && ((*newseg + 1) % sbi->segs_per_sec)) {
> > > +		segno = find_next_zero_bit(free_i->free_segmap,
> > > +					TOTAL_SEGS(sbi), *newseg + 1);
> > > +		if (segno < TOTAL_SEGS(sbi))
> > > +			goto got_it;
> > > +	}
> > > +find_other_zone:
> > > +	secno = find_next_zero_bit(free_i->free_secmap, total_secs, hint);
> > > +	if (secno >= total_secs) {
> > > +		if (dir == ALLOC_RIGHT) {
> > > +			secno = find_next_zero_bit(free_i->free_secmap,
> > > +						total_secs, 0);
> > > +			BUG_ON(secno >= total_secs);
> > > +		} else {
> > > +			go_left = 1;
> > > +			left_start = hint - 1;
> > > +		}
> > > +	}
> > > +	if (go_left == 0)
> > > +		goto skip_left;
> > > +
> > > +	while (test_bit(left_start, free_i->free_secmap)) {
> > > +		if (left_start > 0) {
> > > +			left_start--;
> > > +			continue;
> > > +		}
> > > +		left_start = find_next_zero_bit(free_i->free_secmap,
> > > +						total_secs, 0);
> > > +		BUG_ON(left_start >= total_secs);
> > > +		break;
> > > +	}
> > > +	secno = left_start;
> > > +skip_left:
> > > +	hint = secno;
> > > +	segno = secno << sbi->log_segs_per_sec;
> > > +	zoneno = secno / sbi->secs_per_zone;
> > > +
> > > +	if (sbi->secs_per_zone == 1)
> > > +		goto got_it;
> > > +	if (zoneno == old_zoneno)
> > > +		goto got_it;
> > > +	if (dir == ALLOC_LEFT) {
> > > +		if (!go_left && zoneno + 1 >= total_zones)
> > > +			goto got_it;
> > > +		if (go_left && zoneno == 0)
> > > +			goto got_it;
> > > +	}
> > > +
> > > +	for (i = 0; i < DEFAULT_CURSEGS; i++) {
> > > +		struct curseg_info *curseg = CURSEG_I(sbi, i);
> > > +
> > > +		if (curseg->zone != zoneno)
> > > +			continue;
> > > +		if (!init)
> > > +			continue;
> > > +
> > > +		if (go_left)
> > > +			hint = zoneno * sbi->secs_per_zone - 1;
> > > +		else if (zoneno + 1 >= total_zones)
> > > +			hint = 0;
> > > +		else
> > > +			hint = (zoneno + 1) * sbi->secs_per_zone;
> > > +		init = false;
> > > +		goto find_other_zone;
> > > +	}
> >
> > I think this code is correct, but I found it very confusing to read.
> > The point of the loop is simply to find out if any current segment is
> > using the given zone. But that isn't obvious; it seems to do more.
> > I would re-write it as:
> >
> >	for (i = 0; i < DEFAULT_CURSEGS ; i++) {
> >		struct curseg_info *curseg = CURSEG_I(sbi, i);
> >		if (curseg->zone == zoneno)
> >			break;
> >	}
> >	if (i < DEFAULT_CURSEGS && init) {
> >		/* Zone is in use, try another */
> >		if (go_left)
> >			hint =
> >		else if ()
> >			hint = 0;
> >		else
> >			hint = ..;
> >		init = false;
> >		goto find_other_zone;
> >	}
> >
> > To me, that makes it much clearer what is happening.
> >
>
> Ok.
> I think it had better change like this to avoid an unnecessary loop.
>
>	/* give up on finding another zone */
>	if (!init)
>		goto got_it;
>
>	for (i = 0; i < DEFAULT_CURSEGS; i++) {
>		if (CURSEG_I(sbi, i)->zone == zoneno)
>			break;
>	}
>
>	if (i < DEFAULT_CURSEGS) {
>		/* zone is in use, try another */
>		if (go_left)
>			hint =
>		else if ()
>			hint = 0;
>		else
>			hint = ..;
>		init = false;
>		goto find_other_zone;
>	}

Yes, that looks good.

Thanks.

>
> > > +static void f2fs_end_io_write(struct bio *bio, int err)
> > > +{
> > > +	const int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
> > > +	struct bio_vec *bvec = bio->bi_io_vec + bio->bi_vcnt - 1;
> > > +	struct
Re: [PATCH RFC] random: Account for entropy loss due to overwrites
On 10/15/2012 09:08 PM, Theodore Ts'o wrote:
> On Sat, Sep 29, 2012 at 12:47:04PM -0700, H. Peter Anvin wrote:
>>> -static struct poolinfo {
>>> +static const struct poolinfo {
>>> +	int poolshift;	/* log2(POOLBITS) */
>>>  	int poolwords;
>>>  	int tap1, tap2, tap3, tap4, tap5;
>
> Poolshift is duplicated information; it's just log2(poolwords) + 5
> (since POOLBITS is poolwords*32).
>
> Granted you don't want to recalculate it every single time you need to
> use it, but perhaps it would be better to add poolshift to struct
> entropy_store, and set it in init_std_data()?
>

Or we could compute poolwords (and poolbits, and poolbytes) from it,
since shifts generally are cheap. I don't strongly care, whatever your
preference is.

	-hpa
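The relationship under discussion (poolshift = log2(poolwords) + 5, because each word is 32 bits) means every other size can be derived from the shift alone. A minimal sketch of that second option follows; struct pool_geom and the three helpers are hypothetical names chosen for illustration and are not the code in drivers/char/random.c.

```c
#include <assert.h>

/* Hypothetical: store only the shift and derive every other size from it. */
struct pool_geom {
	int poolshift;	/* log2(pool size in bits) */
};

/* Pool size in bits: 2^poolshift. */
static inline int pool_bits(const struct pool_geom *g)
{
	return 1 << g->poolshift;
}

/* Pool size in bytes: 8 bits per byte, so shift by three less. */
static inline int pool_bytes(const struct pool_geom *g)
{
	assert(g->poolshift >= 3);
	return 1 << (g->poolshift - 3);
}

/* Pool size in 32-bit words: 32 bits per word, so shift by five less. */
static inline int pool_words(const struct pool_geom *g)
{
	assert(g->poolshift >= 5);
	return 1 << (g->poolshift - 5);
}
```

For a shift of 12 this yields 4096 bits, 512 bytes and 128 words, and each lookup costs one shift rather than a stored field.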
Re: [PATCH v9 08/12] x86, hotplug: Wake up CPU0 via NMI instead of INIT, SIPI, SIPI
From: Fenghua Yu
Subject: [PATCH v9 08/12] x86, hotplug: Wake up CPU0 via NMI instead of INIT, SIPI, SIPI
Date: Fri, 12 Oct 2012 09:09:45 -0700

> @@ -1037,6 +1101,8 @@ void __init native_smp_prepare_cpus(unsigned int max_cpus)
>  	 */
>  	setup_local_APIC();
>
> +	cpu0_logical_apicid = GET_APIC_LOGICAL_ID(apic_read(APIC_LDR));
> +

In x2apic mode, the logical APIC ID occupies the whole 32 bits of LDR, but
GET_APIC_LOGICAL_ID returns only bits 31-24, which is correct for xapic
mode only.

Thanks.
HATAYAMA, Daisuke
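The difference between the two modes can be illustrated with a small standalone sketch. The helper names below are hypothetical and this is not the fix that was eventually applied to the patchset; it only shows why extracting bits 31-24 loses information when the LDR value comes from an x2APIC.

```c
#include <assert.h>
#include <stdint.h>

/* xAPIC mode: the logical ID lives in LDR bits 31:24 only. */
static inline uint32_t xapic_logical_id(uint32_t ldr)
{
	return (ldr >> 24) & 0xFFu;
}

/* x2APIC mode: the full 32-bit LDR value is the logical ID. */
static inline uint32_t x2apic_logical_id(uint32_t ldr)
{
	return ldr;
}

/* Hypothetical mode-aware wrapper; x2apic_mode is supplied by the caller. */
static inline uint32_t logical_apicid(uint32_t ldr, int x2apic_mode)
{
	assert(x2apic_mode == 0 || x2apic_mode == 1);
	return x2apic_mode ? x2apic_logical_id(ldr) : xapic_logical_id(ldr);
}
```

For an x2APIC LDR such as 0x00010004 (cluster 1, logical bit 2), the xAPIC-style extraction returns 0, so an IPI aimed at that value would not reach the intended CPU.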
Re: [PATCH] make GFP_NOTRACK flag unconditional
On Mon, 15 Oct 2012 21:02:45 -0700 (PDT) David Rientjes wrote:

> On Tue, 2 Oct 2012, David Rientjes wrote:
>
> > > There was a general sentiment in a recent discussion (See
> > > https://lkml.org/lkml/2012/9/18/258) that the __GFP flags should be
> > > defined unconditionally. Currently, the only offender is GFP_NOTRACK,
> > > which is conditional to KMEMCHECK.
> > >
> > > This simple patch makes it unconditional.
> > >
> > > Signed-off-by: Glauber Costa
> > > CC: Christoph Lameter
> > > CC: Mel Gorman
> > > CC: Andrew Morton
> >
> > Acked-by: David Rientjes
> >
> > I think it was done this way to show that if CONFIG_KMEMCHECK=n then the
> > bit could be reused for something else but I can't think of any reason why
> > that would be useful; what would need to add a gfp bit that would also
> > happen to depend on CONFIG_KMEMCHECK=n? Nothing comes to mind to save a
> > bit.
> >
> > There are other cases of this as well, like __GFP_OTHER_NODE which is only
> > useful for thp and it's defined unconditionally. So this seems fine to
> > me.
> >
>
> Still missing from linux-next as of this morning, I think this patch
> should be merged.

It's in 3.7-rc1.

commit 3e648ebe076390018c317881d7d926f24d7bac6b
Author: Glauber Costa
Date:   Mon Oct 8 16:33:52 2012 -0700

    make GFP_NOTRACK definition unconditional
Re: [PATCH] perf probe: convert_name_to_addr() allocated the wrong size buffer for a function name
* Masami Hiramatsu [2012-10-16 13:19:57]:

> (2012/10/16 10:37), Hyeoncheol Lee wrote:
> > convert_name_to_addr() allocated sizeof(char *) * MAX_PROBE_ARGS
> > bytes for a function name
>
> Yeah, that one was from my laziness...
> Guess not your fault, but mine.
>
> >
> > Cc: Masami Hiramatsu
> > Cc: Srikar Dronamraju
> > Signed-off-by: Hyeoncheol Lee
> > ---
> >  tools/perf/util/probe-event.c |    5 +++--
> >  1 file changed, 3 insertions(+), 2 deletions(-)
> >
> > diff --git a/tools/perf/util/probe-event.c b/tools/perf/util/probe-event.c
> > index 49a256e..bb40ed4 100644
> > --- a/tools/perf/util/probe-event.c
> > +++ b/tools/perf/util/probe-event.c
> > @@ -2352,13 +2352,14 @@ static int convert_name_to_addr(struct perf_probe_event *pev, const char *exec)
> >  		free(exec_copy);
> >  	}
> >  	free(pp->function);
> > -	pp->function = zalloc(sizeof(char *) * MAX_PROBE_ARGS);
> > +	pp->function = zalloc(sizeof(char) *
> > +			      (3 + sizeof(unsigned long long) * 2));
>
> Could you add a comment that this is long enough here?

Also, can we move the arithmetic into a macro?

>
> >  	if (!pp->function) {
> >  		ret = -ENOMEM;
> >  		pr_warning("Failed to allocate memory by zalloc.\n");
> >  		goto out;
> >  	}
> > -	e_snprintf(pp->function, MAX_PROBE_ARGS, "0x%llx", vaddr);
> > +	sprintf(pp->function, "0x%llx", vaddr);
>
> And at least we should use snprintf instead of sprintf...
> (I think ret = e_snprintf(...) is better)
>

Agree.

> >  	ret = 0;
> >
> >  out:
> >
>

--
Thanks and Regards
Srikar
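Both review requests (justify the size, and use a bounded formatter) can be seen together in a short userspace sketch. ADDR_STR_LEN and addr_to_str() are hypothetical names, calloc() and snprintf() stand in for perf's zalloc() and e_snprintf(), and the size comment assumes the usual case where each byte of the value prints as two hex digits.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * "0x" (2 chars) + two hex digits per byte of the value + trailing NUL (1).
 * With an 8-byte unsigned long long this is 2 + 16 + 1 = 19 bytes.
 */
#define ADDR_STR_LEN (3 + sizeof(unsigned long long) * 2)

/* Format vaddr as "0x..." into a freshly allocated, zeroed buffer. */
static char *addr_to_str(unsigned long long vaddr)
{
	char *buf = calloc(1, ADDR_STR_LEN);
	int n;

	if (!buf)
		return NULL;
	n = snprintf(buf, ADDR_STR_LEN, "0x%llx", vaddr);
	if (n < 0 || (size_t)n >= ADDR_STR_LEN) {
		/* error or truncation: never hand back a partial string */
		free(buf);
		return NULL;
	}
	assert(strlen(buf) == (size_t)n);
	return buf;
}
```

The caller frees the returned string. Even the largest value fits exactly, so the truncation branch is a safety net rather than an expected path.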
Re: [PATCH] Fix scheduling-while-atomic problem in console_cpu_notify()
On Mon, Oct 15, 2012 at 05:31:28PM -0700, Paul E. McKenney wrote:
> The console_cpu_notify() function runs with interrupts disabled in
> the CPU_DEAD case. It therefore cannot block, for example, as will
> happen when it calls console_lock(). Therefore, remove the CPU_DEAD
> leg of the switch statement to avoid this problem.
>
> Signed-off-by: Paul E. McKenney

s/CPU_DEAD/CPU_DYING/

Apparently it is a bad idea to compose and send a patch while in a C++
standards committee meeting where people are arguing about async futures...

Fixed patch below.

							Thanx, Paul

printk: Fix scheduling-while-atomic problem in console_cpu_notify()

The console_cpu_notify() function runs with interrupts disabled in the
CPU_DYING case. It therefore cannot block, for example, as will happen
when it calls console_lock(). Therefore, remove the CPU_DYING leg of
the switch statement to avoid this problem.

Signed-off-by: Paul E. McKenney

diff --git a/kernel/printk.c b/kernel/printk.c
index 66a2ea3..2d607f4 100644
--- a/kernel/printk.c
+++ b/kernel/printk.c
@@ -1890,7 +1890,6 @@ static int __cpuinit console_cpu_notify(struct notifier_block *self,
 	switch (action) {
 	case CPU_ONLINE:
 	case CPU_DEAD:
-	case CPU_DYING:
 	case CPU_DOWN_FAILED:
 	case CPU_UP_CANCELED:
 		console_lock();
[PATCH v1 2/2] x86, apic: Disable BSP if boot cpu is AP
We disable BSP if boot cpu is AP.

The INIT-INIT-SIPI sequence, a protocol to initiate an AP, cannot be used
for the BSP since it causes the BSP to jump to BIOS init code; the typical
visible behaviour is a hang or an immediate reset, depending on the BIOS
init code.

INIT can be used to reset an AP in a fatal system error state as
described in MP spec 3.7.3 Processor-specific INIT. In contrast, there
is no processor-specific INIT for the BSP to initialize from a fatal
system error. It might be possible to do so by NMI plus any hand-crafted
reset code that is carefully designed, but at least I have no idea in
this direction now.

By the way, my motivation is to generate crash dump quickly on a
system with huge memory. I think we can assume such a system also has a
lot of cpus. If so, it would be no problem if only one cpu becomes
unavailable.

We look up the ACPI table or MP table to get BSP information because we
cannot run the rdmsr instruction on the CPU we are about to wake up.

One thing to be concerned about here is that the ACPI guidelines say
BIOS *should* list the BSP in the first MADT LAPIC entry; not *must*. In
this sense, this logic relies on BIOS following ACPI's guideline. On the
other hand, we don't need to worry about this in the MP table case
because it has an explicit BSP flag.

To avoid any undesirable behaviour caused by a broken BIOS that
doesn't conform to the guideline, it's enough to limit the number of
cpus to 1 by specifying maxcpus=1 or nr_cpus=1, as is currently done in
the default kdump configuration. (Of course, it's problematic in the
maxcpus=1 case if trying to wake up other cpus in user space later.)

Some firmware features such as hibernation and suspend need to switch
their CPU to the BSP before transferring execution to firmware, so these
features are unavailable in the BSP-disabled setting. This is no
problem because we don't need hibernation and suspend in the kdump 2nd
kernel.

SFI and devicetree don't provide BSP information, so there's no
functional change in their code; they only pass false for all the
entries, keeping the interface uniform.

Signed-off-by: HATAYAMA Daisuke
---

 arch/x86/include/asm/mpspec.h |    2 +-
 arch/x86/kernel/acpi/boot.c   |   10 +-
 arch/x86/kernel/apic/apic.c   |   21 -
 arch/x86/kernel/devicetree.c  |    2 +-
 arch/x86/kernel/mpparse.c     |   15 +--
 arch/x86/platform/sfi/sfi.c   |    2 +-
 6 files changed, 45 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/mpspec.h b/arch/x86/include/asm/mpspec.h
index d56f253..b5d8e23 100644
--- a/arch/x86/include/asm/mpspec.h
+++ b/arch/x86/include/asm/mpspec.h
@@ -97,7 +97,7 @@ static inline void early_reserve_e820_mpc_new(void) { }
 #define default_get_smp_config x86_init_uint_noop
 #endif

-void __cpuinit generic_processor_info(int apicid, int version);
+void __cpuinit generic_processor_info(int apicid, bool isbsp, int version);
 #ifdef CONFIG_ACPI
 extern void mp_register_ioapic(int id, u32 address, u32 gsi_base);
 extern void mp_override_legacy_irq(u8 bus_irq, u8 polarity, u8 trigger,
diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
index e651f7a..e873c09 100644
--- a/arch/x86/kernel/acpi/boot.c
+++ b/arch/x86/kernel/acpi/boot.c
@@ -198,6 +198,7 @@ static int __init acpi_parse_madt(struct acpi_table_header *table)
 static void __cpuinit acpi_register_lapic(int id, u8 enabled)
 {
 	unsigned int ver = 0;
+	bool isbsp = false;

 	if (id >= (MAX_LOCAL_APIC-1)) {
 		printk(KERN_INFO PREFIX "skipped apicid that is too big\n");
@@ -212,7 +213,14 @@ static void __cpuinit acpi_register_lapic(int id, u8 enabled)
 	if (boot_cpu_physical_apicid != -1U)
 		ver = apic_version[boot_cpu_physical_apicid];

-	generic_processor_info(id, ver);
+	/*
+	 * ACPI says BIOS should list BSP in the first MADT LAPIC
+	 * entry.
+	 */
+	if (!num_processors && !disabled_cpus)
+		isbsp = true;
+
+	generic_processor_info(id, isbsp, ver);
 }

 static int __init
diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index d8d69e4..4184853 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -2034,13 +2034,32 @@ void disconnect_bsp_APIC(int virt_wire_setup)
 	apic_write(APIC_LVT1, value);
 }

-void __cpuinit generic_processor_info(int apicid, int version)
+void __cpuinit generic_processor_info(int apicid, bool isbsp, int version)
 {
 	int cpu, max = nr_cpu_ids;
 	bool boot_cpu_detected = physid_isset(boot_cpu_physical_apicid,
 				phys_cpu_present_map);

 	/*
+	 * If boot cpu is AP, we now don't have any way to initialize
+	 * BSP. To save memory consumed, we disable BSP this case.
+	 *
+	 * Then, we cannot use the features specific to BSP such as
+	 * hibernation and suspend. This is no problem because AP
+	 * becomes boot cpu only on kexec triggered by crash.
+	 */
+
[PATCH v1 1/2] x86, apic: Introduce boot_cpu_is_bsp indicating whether boot cpu is BSP or not
Part of boot-up code assumes booting CPU is BSP, but kexec can enter the 2nd
kernel with AP. To be able to distinguish these throughout kernel processing,
introduce boot_cpu_is_bsp.

Signed-off-by: HATAYAMA Daisuke
---
 arch/x86/include/asm/mpspec.h |  3 +++
 arch/x86/kernel/apic/apic.c   | 13 +
 arch/x86/kernel/setup.c       |  2 ++
 3 files changed, 18 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/mpspec.h b/arch/x86/include/asm/mpspec.h
index 3e2f42a..d56f253 100644
--- a/arch/x86/include/asm/mpspec.h
+++ b/arch/x86/include/asm/mpspec.h
@@ -47,10 +47,13 @@ extern int mp_bus_id_to_type[MAX_MP_BUSSES];
 extern DECLARE_BITMAP(mp_bus_not_pci, MAX_MP_BUSSES);
 
 extern unsigned int boot_cpu_physical_apicid;
+extern bool boot_cpu_is_bsp;
 extern unsigned int max_physical_apicid;
 extern int mpc_default_type;
 extern unsigned long mp_lapic_addr;
 
+extern void boot_cpu_is_bsp_init(void);
+
 #ifdef CONFIG_X86_LOCAL_APIC
 extern int smp_found_config;
 #else
diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index b17416e..d8d69e4 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -62,6 +62,10 @@ unsigned disabled_cpus __cpuinitdata;
 /* Processor that is doing the boot up */
 unsigned int boot_cpu_physical_apicid = -1U;
 
+/* Indicates whether the processor that is doing the boot up, is BSP
+ * processor or not */
+bool boot_cpu_is_bsp;
+
 /*
  * The highest APIC ID seen during enumeration.
  */
@@ -2515,3 +2519,12 @@ static int __init lapic_insert_resource(void)
  * that is using request_resource
  */
 late_initcall(lapic_insert_resource);
+
+void boot_cpu_is_bsp_init(void)
+{
+	u32 l, h;
+
+	rdmsr(MSR_IA32_APICBASE, l, h);
+
+	boot_cpu_is_bsp = (l & MSR_IA32_APICBASE_BSP) ? true : false;
+}
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index a2bb18e..6ecb9bc 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -988,6 +988,8 @@ void __init setup_arch(char **cmdline_p)
 
 	early_quirks();
 
+	boot_cpu_is_bsp_init();
+
 	/*
 	 * Read APIC and some other early information from ACPI tables.
 	 */
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
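As a rough user-space illustration of the test that boot_cpu_is_bsp_init() performs: the BSP flag is bit 8 in the low word of IA32_APIC_BASE (the kernel names the mask MSR_IA32_APICBASE_BSP). The helper name apicbase_lo_is_bsp() and the macro below are made up for this sketch; in the kernel the value comes from rdmsr(), which user space cannot execute directly.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit 8 of IA32_APIC_BASE is set only on the bootstrap processor.
 * Local stand-in for the kernel's MSR_IA32_APICBASE_BSP. */
#define APICBASE_BSP_BIT (UINT32_C(1) << 8)

/* Hypothetical helper: 'lo' is the low 32 bits of the MSR value. */
static bool apicbase_lo_is_bsp(uint32_t lo)
{
	return (lo & APICBASE_BSP_BIT) != 0;
}
```

For example, a low word of 0xfee00900 (APIC enabled, BSP flag set) yields true, while 0xfee00800 (APIC enabled, BSP flag clear) yields false.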
[PATCH v1 0/2] x86, apic: Disable BSP if boot cpu is AP
Multiple CPUs are useful for CPU-bound processing like compression, and I
want to use compression to generate crash dumps quickly. But currently we
cannot wake up the 2nd and later cpus in the kdump 2nd kernel if the crash
happens on an AP.

If the crash happens on an AP, kexec enters the 2nd kernel on that AP, and
the BSP of the 1st kernel is expected to be halted or possibly in some fatal
system error state.

To wake up an AP, we use the method called INIT-INIT-SIPI. INIT causes the
BSP to jump into BIOS init code. A typical visible behaviour is a hang or an
immediate reset, depending on the BIOS init code.

An AP can be initialized by INIT even in a fatal state: the MP spec explains
that a processor-specific INIT can be used to recover an AP from a fatal
system error. On the other hand, there's no method for the BSP to recover;
it might be possible to do so by NMI plus carefully designed hand-coded
reset code, but at least I have no idea in this direction now.

Therefore, the idea in this patch set is simply to disable the BSP if the
boot cpu is an AP.

My motivation is to use multiple CPUs in order to quickly generate a crash
dump on a machine with a huge amount of memory. I assume such a machine
tends to also have a lot of CPUs, so disabling one CPU would be no problem.

On most BIOSs, BSP is always assigned to cpu#1; on other BIOSs, BSP could
probably be assigned to a fixed cpu number. Assuming this, it might be
possible to instead wake up all the cpus except cpu#1. But I don't choose
this in this patch set because:

- It's an ugly design to keep a switch in sysfs that can unintentionally
  cause the system to enter undefined behaviour.

- Memory space for BSP is never used if BSP is not running. The amount of
  reserved memory for the 2nd kernel is typically only 128MB to 512MB,
  which is severely limited. If BSP is unused, I want to use the space for
  another AP instead.

Note: recent upstream kernel fails to reserve memory for the kdump 2nd
kernel. To run kdump, please apply the patch below on top of this patch set:

https://lkml.org/lkml/2012/8/31/238

---

HATAYAMA Daisuke (2):
      x86, apic: Disable BSP if boot cpu is AP
      x86, apic: Introduce boot_cpu_is_bsp indicating whether boot cpu is BSP or not

 arch/x86/include/asm/mpspec.h |  5 -
 arch/x86/kernel/acpi/boot.c   | 10 +-
 arch/x86/kernel/apic/apic.c   | 34 +-
 arch/x86/kernel/devicetree.c  |  2 +-
 arch/x86/kernel/mpparse.c     | 15 +--
 arch/x86/kernel/setup.c       |  2 ++
 arch/x86/platform/sfi/sfi.c   |  2 +-
 7 files changed, 63 insertions(+), 7 deletions(-)

--
Thanks.
HATAYAMA, Daisuke
Re: [PATCH v4 1/2] x86, pci: Reset PCIe devices at boot time
(2012/10/16 3:36), Yinghai Lu wrote:
On Mon, Oct 15, 2012 at 12:00 AM, Takao Indoh wrote:
This patch resets PCIe devices at boot time by hot reset when
"reset_devices" is specified.

how about pci devices that domain_nr is not zero ?

This patch does not support multiple domains yet.

Signed-off-by: Takao Indoh
---
 arch/x86/include/asm/pci-direct.h |   1
 arch/x86/kernel/setup.c           |   3
 arch/x86/pci/early.c              | 344
 include/linux/pci.h               |   2
 init/main.c                       |   4
 5 files changed, 352 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/pci-direct.h b/arch/x86/include/asm/pci-direct.h
index b1e7a45..de30db2 100644
--- a/arch/x86/include/asm/pci-direct.h
+++ b/arch/x86/include/asm/pci-direct.h
@@ -18,4 +18,5 @@ extern int early_pci_allowed(void);
 extern unsigned int pci_early_dump_regs;
 extern void early_dump_pci_device(u8 bus, u8 slot, u8 func);
 extern void early_dump_pci_devices(void);
+extern void early_reset_pcie_devices(void);
 #endif /* _ASM_X86_PCI_DIRECT_H */
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index a2bb18e..73d3425 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -987,6 +987,9 @@ void __init setup_arch(char **cmdline_p)
 	generic_apic_probe();
 
 	early_quirks();
+#ifdef CONFIG_PCI
+	early_reset_pcie_devices();
+#endif
 
 	/*
 	 * Read APIC and some other early information from ACPI tables.
diff --git a/arch/x86/pci/early.c b/arch/x86/pci/early.c
index d1067d5..683b30f 100644
--- a/arch/x86/pci/early.c
+++ b/arch/x86/pci/early.c
@@ -1,5 +1,6 @@
 #include
 #include
+#include
 #include
 #include
 #include
@@ -109,3 +110,346 @@ void early_dump_pci_devices(void)
 		}
 	}
 }
+
+#define PCI_EXP_SAVE_REGS	7
+#define pcie_cap_has_devctl(type, flags)	1
+#define pcie_cap_has_lnkctl(type, flags)		\
+		((flags & PCI_EXP_FLAGS_VERS) > 1 ||	\
+		 (type == PCI_EXP_TYPE_ROOT_PORT ||	\
+		  type == PCI_EXP_TYPE_ENDPOINT ||	\
+		  type == PCI_EXP_TYPE_LEG_END))
+#define pcie_cap_has_sltctl(type, flags)		\
+		((flags & PCI_EXP_FLAGS_VERS) > 1 ||	\
+		 ((type == PCI_EXP_TYPE_ROOT_PORT) ||	\
+		  (type == PCI_EXP_TYPE_DOWNSTREAM &&	\
+		   (flags & PCI_EXP_FLAGS_SLOT))))
+#define pcie_cap_has_rtctl(type, flags)			\
+		((flags & PCI_EXP_FLAGS_VERS) > 1 ||	\
+		 (type == PCI_EXP_TYPE_ROOT_PORT ||	\
+		  type == PCI_EXP_TYPE_RC_EC))
+
+struct save_config {
+	u32 pci[16];
+	u16 pcie[PCI_EXP_SAVE_REGS];
+};
+
+struct pcie_dev {
+	int cap;   /* position of PCI Express capability */
+	int flags; /* PCI_EXP_FLAGS */
+	struct save_config save; /* saved configuration register */
+};
+
+struct pcie_port {
+	struct list_head dev;
+	u8 secondary;
+	struct pcie_dev child[PCI_MAX_FUNCTIONS];
+};
+
+static LIST_HEAD(device_list);
+static void __init pci_udelay(int loops)
+{
+	while (loops--) {
+		/* Approximately 1 us */
+		native_io_delay();
+	}
+}
+
+/* Derived from drivers/pci/pci.c */
+#define PCI_FIND_CAP_TTL	48
+static int __init __pci_find_next_cap_ttl(u8 bus, u8 slot, u8 func,
+					  u8 pos, int cap, int *ttl)
+{
+	u8 id;
+
+	while ((*ttl)--) {
+		pos = read_pci_config_byte(bus, slot, func, pos);
+		if (pos < 0x40)
+			break;
+		pos &= ~3;
+		id = read_pci_config_byte(bus, slot, func,
+					  pos + PCI_CAP_LIST_ID);
+		if (id == 0xff)
+			break;
+		if (id == cap)
+			return pos;
+		pos += PCI_CAP_LIST_NEXT;
+	}
+	return 0;
+}
+
+static int __init __pci_find_next_cap(u8 bus, u8 slot, u8 func, u8 pos, int cap)
+{
+	int ttl = PCI_FIND_CAP_TTL;
+
+	return __pci_find_next_cap_ttl(bus, slot, func, pos, cap, &ttl);
+}
+
+static int __init __pci_bus_find_cap_start(u8 bus, u8 slot, u8 func,
+					   u8 hdr_type)
+{
+	u16 status;
+
+	status = read_pci_config_16(bus, slot, func, PCI_STATUS);
+	if (!(status & PCI_STATUS_CAP_LIST))
+		return 0;
+
+	switch (hdr_type) {
+	case PCI_HEADER_TYPE_NORMAL:
+	case PCI_HEADER_TYPE_BRIDGE:
+		return PCI_CAPABILITY_LIST;
+	case PCI_HEADER_TYPE_CARDBUS:
+		return PCI_CB_CAPABILITY_LIST;
+	default:
+		return 0;
+	}
+
+	return 0;
+}
+
+static int __init early_pci_find_capability(u8 bus, u8 slot, u8 func, int cap)
+{
+	int pos;
+	u8 type =
Re: [PATCH] ACPI: fix the wrong #ifdef for acpi_no_s4_hw_signature
The title could be made more descriptive:

  ACPI: move acpi_no_s4_hw_signature() declaration into #ifdef CONFIG_HIBERNATION

On Tue, Oct 16, 2012 at 12:05:03PM +0800, Yuanhan Liu wrote:
> acpi_no_s4_hw_signature is defined in #ifdef CONFIG_HIBERNATION block,
> but the current code put the declare in #ifdef CONFIG_PM_SLEEP block.

And it's better to always include the original build error/warning
messages when fixing build problems.

Otherwise looks good to me.

Reviewed-by: Fengguang Wu

> Signed-off-by: Yuanhan Liu
> ---
>  include/linux/acpi.h |    5 -
>  1 files changed, 4 insertions(+), 1 deletions(-)
>
> diff --git a/include/linux/acpi.h b/include/linux/acpi.h
> index 90be989..a468429 100644
> --- a/include/linux/acpi.h
> +++ b/include/linux/acpi.h
> @@ -257,8 +257,11 @@ int acpi_check_region(resource_size_t start,
> resource_size_t n,
>
>  int acpi_resources_are_enforced(void);
>
> -#ifdef CONFIG_PM_SLEEP
> +#ifdef CONFIG_HIBERNATION
>  void __init acpi_no_s4_hw_signature(void);
> +#endif
> +
> +#ifdef CONFIG_PM_SLEEP
>  void __init acpi_old_suspend_ordering(void);
>  void __init acpi_nvs_nosave(void);
>  #endif /* CONFIG_PM_SLEEP */
> --
> 1.7.7.6
Re: linux-next: build failure after merge of the final tree
On Tue, Oct 16, 2012 at 02:50:29PM +1100, Stephen Rothwell wrote:
> Hi Al,
>
> After merging the final tree, today's linux-next build (sparc64 defconfig)
> failed like this:
>
> arch/sparc/kernel/head_64.o: In function `sys64_execve':
> (.text+0x1f58): relocation truncated to fit: R_SPARC_WDISP19 against symbol
> `sys_execve' defined in .text section in fs/built-in.o
> arch/sparc/kernel/head_64.o: In function `sys32_execve':
> (.text+0x1f64): relocation truncated to fit: R_SPARC_WDISP19 against symbol
> `compat_sys_execve' defined in .text section in fs/built-in.o
>
> Probably caused by commit 3223f8aab885 ("sparc64: convert to generic
> execve") and following from the signal tree.
>
> I have added this patch you suggested on IRC:
>
> From: Stephen Rothwell
> Date: Tue, 16 Oct 2012 14:43:51 +1100
> Subject: [PATCH] sparc: fixup for conversion to generic execve
>
> Fixes these errors:
>
> arch/sparc/kernel/head_64.o: In function `sys64_execve':
> (.text+0x1f58): relocation truncated to fit: R_SPARC_WDISP19 against symbol
> `sys_execve' defined in .text section in fs/built-in.o
> arch/sparc/kernel/head_64.o: In function `sys32_execve':
> (.text+0x1f64): relocation truncated to fit: R_SPARC_WDISP19 against symbol
> `compat_sys_execve' defined in .text section in fs/built-in.o
>
> Dictated-by: Al Viro
> Signed-off-by: Stephen Rothwell
> ---
>  arch/sparc/kernel/syscalls.S | 12
>  1 file changed, 8 insertions(+), 4 deletions(-)
>
> diff --git a/arch/sparc/kernel/syscalls.S b/arch/sparc/kernel/syscalls.S
> index 4bae096..f667cdf 100644
> --- a/arch/sparc/kernel/syscalls.S
> +++ b/arch/sparc/kernel/syscalls.S
> @@ -2,15 +2,19 @@
>   * environment settings are the same as the calling processes.
>   */
>  sys64_execve:
> -	ba,pt	%xcc,sys_execve
> -	 flushw
> +	flushw
> +	mov	%o7, %l5
> +	call	sys_execve
> +	 mov	%l5, %o7
>
>  #ifdef CONFIG_COMPAT
>  sunos_execv:
>  	mov	%g0, %o2
>  sys32_execve:
> -	ba,pt	%xcc,compat_sys_execve
> -	 flushw
> +	flushw
> +	mov	%o7, %l5
> +	call	compat_sys_execve
> +	 mov	%l5, %o7
>  #endif

BTW, that's really quick and dirty; I'm not at all sure we need that flushw
there, which could make things much simpler. Namely, kill sys64_execve
completely, making it equivalent to sys_execve(), do the same to
sys32_execve() (== compat_sys_execve()) and as for sunos_execv(), I'd simply
put it into sys_sparc32.c as

SYSCALL_DEFINE2(sunos_execv, char __user *, filename,
		const char __user *const __user *, argv)
{
	return compat_sys_execve(filename, argv, NULL);
}

We definitely want flushw in fork and friends, but I'm not sure what we need
it for in execve(2)...

Anyway, the brute-force variant works. I had been lucky to stay within the
ba,pt target limit on the config I used (very heavily modular, so not much
code in vmlinux in the first place, let alone before fs/exec.o), so I'd
missed the problem until now. I've booted that with fatter config that
would blow the previous variant at link time and it works.
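The link failure discussed above comes down to branch reach. The sketch below assumes the usual SPARC V9 encoding, in which R_SPARC_WDISP19 (used by ba,pt) holds a signed 19-bit displacement counted in 4-byte words, roughly +/-1 MiB, while call carries a 30-bit word displacement. The helper name wdisp_fits() is made up for illustration and is not part of any toolchain.

```c
#include <stdbool.h>
#include <stdint.h>

/* Does a branch from 'pc' to 'target' fit in a signed displacement of
 * 'bits' bits, counted in 4-byte words?
 * Use bits = 19 for ba,pt (R_SPARC_WDISP19) and bits = 30 for call. */
static bool wdisp_fits(uint64_t pc, uint64_t target, unsigned bits)
{
	int64_t disp = (int64_t)(target - pc);
	int64_t limit = INT64_C(1) << (bits - 1); /* limit in words */

	if (disp % 4 != 0)
		return false; /* instructions are word aligned */
	disp /= 4;
	return disp >= -limit && disp < limit;
}
```

With these assumptions, a target exactly 1 MiB ahead of the branch no longer fits in 19 bits but easily fits in the 30 bits that call provides, which matches why a heavily modular (small) vmlinux linked fine and a larger one did not.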
[PATCH V3 1/3] dmaengine: dw_dmac: Update documentation style comments for dw_dma_platform_data
Documentation style comments were missing for a few fields in struct
dw_dma_platform_data. Add these.

Signed-off-by: Viresh Kumar
Reviewed-by: Andy Shevchenko
---
 include/linux/dw_dmac.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/linux/dw_dmac.h b/include/linux/dw_dmac.h
index e1c8c9e..62a6190 100644
--- a/include/linux/dw_dmac.h
+++ b/include/linux/dw_dmac.h
@@ -19,6 +19,8 @@
  * @nr_channels: Number of channels supported by hardware (max 8)
  * @is_private: The device channels should be marked as private and not for
  *	by the general purpose DMA channel allocator.
+ * @chan_allocation_order: Allocate channels starting from 0 or 7
+ * @chan_priority: Set channel priority increasing from 0 to 7 or 7 to 0.
  * @block_size: Maximum block size supported by the controller
  * @nr_masters: Number of AHB masters supported by the controller
  * @data_width: Maximum data width supported by hardware per AHB master
--
1.7.12.rc2.18.g61b472e
Re: [PATCH] perf probe: convert_name_to_addr() allocated the wrong size buffer for a function name
(2012/10/16 10:37), Hyeoncheol Lee wrote:
> convert_name_to_addr() allocated sizeof(char *) * MAX_PROBE_ARGS
> bytes for a function name

Yeah, that one was from my laziness...

>
> Cc: Masami Hiramatsu
> Cc: Srikar Dronamraju
> Signed-off-by: Hyeoncheol Lee
> ---
>  tools/perf/util/probe-event.c |    5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/tools/perf/util/probe-event.c b/tools/perf/util/probe-event.c
> index 49a256e..bb40ed4 100644
> --- a/tools/perf/util/probe-event.c
> +++ b/tools/perf/util/probe-event.c
> @@ -2352,13 +2352,14 @@ static int convert_name_to_addr(struct
> perf_probe_event *pev, const char *exec)
>  		free(exec_copy);
>  	}
>  	free(pp->function);
> -	pp->function = zalloc(sizeof(char *) * MAX_PROBE_ARGS);
> +	pp->function = zalloc(sizeof(char) *
> +			(3 + sizeof(unsigned long long) * 2));

Could you add a comment here explaining why this is long enough?

>  	if (!pp->function) {
>  		ret = -ENOMEM;
>  		pr_warning("Failed to allocate memory by zalloc.\n");
>  		goto out;
>  	}
> -	e_snprintf(pp->function, MAX_PROBE_ARGS, "0x%llx", vaddr);
> +	sprintf(pp->function, "0x%llx", vaddr);

And at least we should use snprintf instead of sprintf...
(I think ret = e_snprintf(...) is better)

>  	ret = 0;
>
>  out:
>

Thank you,

--
Masami HIRAMATSU
IT Management Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: masami.hiramatsu...@hitachi.com
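The buffer size in the patch can be checked in isolation. A minimal sketch, assuming a 64-bit unsigned long long: "0x", plus two hex digits per byte, plus the terminating NUL gives 3 + sizeof(unsigned long long) * 2 = 19 bytes. The helper format_addr() and the macro ADDR_STR_LEN are hypothetical names, and plain malloc()/snprintf() stand in for perf's zalloc()/e_snprintf(); the bounded call with a checked return value follows the spirit of the review comment above.

```c
#include <stdio.h>
#include <stdlib.h>

/* "0x" (2 bytes) + two hex digits per byte + terminating NUL (1 byte) */
#define ADDR_STR_LEN (3 + sizeof(unsigned long long) * 2)

/* Hypothetical helper: returns a malloc'ed "0x..." string, or NULL. */
static char *format_addr(unsigned long long vaddr)
{
	char *buf = malloc(ADDR_STR_LEN);
	int n;

	if (!buf)
		return NULL;
	/* Bounded write; a result >= ADDR_STR_LEN would mean truncation. */
	n = snprintf(buf, ADDR_STR_LEN, "0x%llx", vaddr);
	if (n < 0 || (size_t)n >= ADDR_STR_LEN) {
		free(buf);
		return NULL;
	}
	return buf;
}
```

Even the largest 64-bit address uses exactly 18 characters plus the NUL, so the buffer is never truncated under that assumption.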
[PATCH V3 2/3] dmaengine: dw_dmac: Enhance device tree support
dw_dmac driver already supports device tree but it used to have its platform
data passed the non-DT way.

This patch does following changes:
- pass platform data via DT, non-DT way still takes precedence if both are
  used.
- create generic filter routine
- Earlier slave information was made available by slave specific filter
  routines in chan->private field. Now, this information would be passed from
  within dmac DT node. Slave drivers would now be required to pass bus_id (a
  string) as parameter to this generic filter(), which would be compared
  against the slave data passed from DT, by the generic filter routine.
- Update binding document

Signed-off-by: Viresh Kumar
Reviewed-by: Andy Shevchenko
---
V2->V3:
--
- Simplified an equation in filter routine
- renamed variable 'val' as 'tmp' in DT parsing routine

V1->V2:
--
- Optimized filter & DT parsing routine
- Removed unnecessary casts from changes
- renamed filter function
- Fixed function prototype and return value of DT parsing routine for
  !CONFIG_OF case
- use of_get_child_count()

 Documentation/devicetree/bindings/dma/snps-dma.txt |  44 +++
 drivers/dma/dw_dmac.c                              | 134 +
 drivers/dma/dw_dmac_regs.h                         |   4 +
 include/linux/dw_dmac.h                            |  43 ---
 4 files changed, 208 insertions(+), 17 deletions(-)

diff --git a/Documentation/devicetree/bindings/dma/snps-dma.txt b/Documentation/devicetree/bindings/dma/snps-dma.txt
index c0d85db..5bb3dfb 100644
--- a/Documentation/devicetree/bindings/dma/snps-dma.txt
+++ b/Documentation/devicetree/bindings/dma/snps-dma.txt
@@ -6,6 +6,26 @@ Required properties:
 - interrupt-parent: Should be the phandle for the interrupt controller
   that services interrupts for this device
 - interrupt: Should contain the DMAC interrupt number
+- nr_channels: Number of channels supported by hardware
+- is_private: The device channels should be marked as private and not for by the
+  general purpose DMA channel allocator. False if not passed.
+- chan_allocation_order: order of allocation of channel, 0 (default): ascending,
+  1: descending
+- chan_priority: priority of channels. 0 (default): increase from chan 0->n, 1:
+  increase from chan n->0
+- block_size: Maximum block size supported by the controller
+- nr_masters: Number of AHB masters supported by the controller
+- data_width: Maximum data width supported by hardware per AHB master
+  (0 - 8bits, 1 - 16bits, ..., 5 - 256bits)
+- slave_info:
+	- bus_id: name of this device channel, not just a device name since
+	  devices may have more than one channel e.g. "foo_tx". For using the
+	  dw_generic_filter(), slave drivers must pass exactly this string as
+	  param to filter function.
+	- cfg_hi: Platform-specific initializer for the CFG_HI register
+	- cfg_lo: Platform-specific initializer for the CFG_LO register
+	- src_master: src master for transfers on allocated channel.
+	- dst_master: dest master for transfers on allocated channel.
 
 Example:
 
@@ -14,4 +34,28 @@ Example:
 		reg = <0xfc00 0x1000>;
 		interrupt-parent = <>;
 		interrupts = <12>;
+
+		nr_channels = <8>;
+		chan_allocation_order = <1>;
+		chan_priority = <1>;
+		block_size = <0xfff>;
+		nr_masters = <2>;
+		data_width = <3 3 0 0>;
+
+		slave_info {
+			uart0-tx {
+				bus_id = "uart0-tx";
+				cfg_hi = <0x4000>;	/* 0x8 << 11 */
+				cfg_lo = <0>;
+				src_master = <0>;
+				dst_master = <1>;
+			};
+			spi0-tx {
+				bus_id = "spi0-tx";
+				cfg_hi = <0x2000>;	/* 0x4 << 11 */
+				cfg_lo = <0>;
+				src_master = <0>;
+				dst_master = <0>;
+			};
+		};
 	};
diff --git a/drivers/dma/dw_dmac.c b/drivers/dma/dw_dmac.c
index c4b0eb3..98f33a7 100644
--- a/drivers/dma/dw_dmac.c
+++ b/drivers/dma/dw_dmac.c
@@ -1179,6 +1179,50 @@ static void dwc_free_chan_resources(struct dma_chan *chan)
 	dev_vdbg(chan2dev(chan), "%s: done\n", __func__);
 }
 
+bool dw_dma_generic_filter(struct dma_chan *chan, void *param)
+{
+	struct dw_dma *dw = to_dw_dma(chan->device);
+	static struct dw_dma *last_dw;
+	static char *last_bus_id;
+	int i = -1;
+
+	/*
+	 * dmaengine framework calls this routine for all channels of all dma
+	 * controller, until true is returned. If 'param' bus_id is not
+	 * registered with a dma controller (dw), then there is no need of
+	 * running below function for all channels of
[PATCH V3 3/3] ARM: SPEAr13xx: Pass DW DMAC platform data from DT
This patch adds dw_dmac's platform data to DT node. It also creates slave
info node for SPEAr13xx, for the devices which were using dw_dmac.

Signed-off-by: Viresh Kumar
---
V1->V3:
--
- renamed filter function

 arch/arm/boot/dts/spear1340.dtsi             | 19 ++
 arch/arm/boot/dts/spear13xx.dtsi             | 38
 arch/arm/mach-spear13xx/include/mach/spear.h |  2 --
 arch/arm/mach-spear13xx/spear1310.c          |  4 +--
 arch/arm/mach-spear13xx/spear1340.c          | 27 +++---
 arch/arm/mach-spear13xx/spear13xx.c          | 54 ++--
 6 files changed, 65 insertions(+), 79 deletions(-)

diff --git a/arch/arm/boot/dts/spear1340.dtsi b/arch/arm/boot/dts/spear1340.dtsi
index d71fe2a..8ea3f66 100644
--- a/arch/arm/boot/dts/spear1340.dtsi
+++ b/arch/arm/boot/dts/spear1340.dtsi
@@ -24,6 +24,25 @@
 			status = "disabled";
 		};
 
+		dma@ea80 {
+			slave_info {
+				uart1_tx {
+					bus_id = "uart1_tx";
+					cfg_hi = <0x6000>;	/* 0xC << 11 */
+					cfg_lo = <0>;
+					src_master = <0>;
+					dst_master = <1>;
+				};
+				uart1_tx {
+					bus_id = "uart1_tx";
+					cfg_hi = <0x680>;	/* 0xD << 7 */
+					cfg_lo = <0>;
+					src_master = <1>;
+					dst_master = <0>;
+				};
+			};
+		};
+
 		spi1: spi@5d40 {
 			compatible = "arm,pl022", "arm,primecell";
 			reg = <0x5d40 0x1000>;
diff --git a/arch/arm/boot/dts/spear13xx.dtsi b/arch/arm/boot/dts/spear13xx.dtsi
index f7b84ac..f06bb50 100644
--- a/arch/arm/boot/dts/spear13xx.dtsi
+++ b/arch/arm/boot/dts/spear13xx.dtsi
@@ -91,6 +91,37 @@
 			reg = <0xea80 0x1000>;
 			interrupts = <0 19 0x4>;
 			status = "disabled";
+
+			nr_channels = <8>;
+			chan_allocation_order = <1>;
+			chan_priority = <1>;
+			block_size = <0xfff>;
+			nr_masters = <2>;
+			data_width = <3 3 0 0>;
+
+			slave_info {
+				ssp0_tx {
+					bus_id = "ssp0_tx";
+					cfg_hi = <0x2000>;	/* 0x4 << 11 */
+					cfg_lo = <0>;
+					src_master = <0>;
+					dst_master = <0>;
+				};
+				ssp0_rx {
+					bus_id = "ssp0_rx";
+					cfg_hi = <0x280>;	/* 0x5 << 7 */
+					cfg_lo = <0>;
+					src_master = <0>;
+					dst_master = <0>;
+				};
+				cf {
+					bus_id = "cf";
+					cfg_hi = <0>;
+					cfg_lo = <0>;
+					src_master = <0>;
+					dst_master = <0>;
+				};
+			};
 		};
 
 		dma@eb00 {
@@ -98,6 +129,13 @@
 			reg = <0xeb00 0x1000>;
 			interrupts = <0 59 0x4>;
 			status = "disabled";
+
+			nr_channels = <8>;
+			chan_allocation_order = <1>;
+			chan_priority = <1>;
+			block_size = <0xfff>;
+			nr_masters = <2>;
+			data_width = <3 3 0 0>;
 		};
 
 		fsmc: flash@b000 {
diff --git a/arch/arm/mach-spear13xx/include/mach/spear.h b/arch/arm/mach-spear13xx/include/mach/spear.h
index 07d90ac..71bf5b6 100644
--- a/arch/arm/mach-spear13xx/include/mach/spear.h
+++ b/arch/arm/mach-spear13xx/include/mach/spear.h
@@ -43,8 +43,6 @@
 #define VA_L2CC_BASE	IOMEM(UL(0xFB00))
 
 /* others */
-#define DMAC0_BASE	UL(0xEA80)
-#define DMAC1_BASE	UL(0xEB00)
 #define MCIF_CF_BASE	UL(0xB280)
 
 /* Devices present in SPEAr1310 */
diff --git a/arch/arm/mach-spear13xx/spear1310.c
Re: [RFC v3 11/13] vfs: add 3 new ioctl interfaces
On Tue, Oct 16, 2012 at 11:17 AM, Dave Chinner wrote:
> On Wed, Oct 10, 2012 at 06:07:33PM +0800, zwu.ker...@gmail.com wrote:
>> From: Zhi Yong Wu
>>
>> FS_IOC_GET_HEAT_INFO: return a struct containing the various
>> metrics collected in btrfs_freq_data structs, and also return a
>
> I think you mean hot_freq_data :P
Yeah, sorry.
>
>> calculated data temperature based on those metrics. Optionally, retrieve
>> the temperature from the hot data hash list instead of recalculating it.
>
> To get the heat info for a specific file you have to know what file
> you want to get that info for, right? I can see the usefulness of
Yes.
> asking for the heat data on a specific file, but how do you find the
> hot files in the first place? i.e. the big question the user
> interface needs to answer is "what files are hot?".
We only tell the user what the files' temperatures are, not which files
are hot. Their temperatures are in the output of debugfs.
>
> Once userspace knows what the hottest files are, it can open them
If the user needs to know this type of info, it is easy for us to provide
it. But I don't know through which interface users would want to get it.
> and query the data via the above ioctl, but expecting userspace to
> iterate millions of inodes in a filesystem to find hot files is very
> inefficient.
>
> FWIW, if you were to return file handles to the hottest files, then
> the application could open and query them without even needing to
> know the path name to them. This would be exceedingly useful for
> defragmentation programs, especially as that is the way xfs_fsr
> already operates on candidate files.(*)
ah.
>
> IOWs, sometimes the pathname is irrelevant to the operations that
> applications want to perform - all they care about having an
> efficient method of finding the inode they want and getting a file
> descriptor that points to the file. Given the heat map info fits
> right in to the sort of operations defrag and data mover tools
> already do, it kind of makes sense to optimise the interface towards
> those uses
>
> (*) i.e. finds them via bulkstat which returns handle information
> along with all the other inode data, then opens the file by handle
> to do the defrag work
OK.
>
>> FS_IOC_GET_HEAT_OPTS: return an integer representing the current
>> state of hot data tracking and migration:
>>
>> 0 = do nothing
>> 1 = track frequency of access
>>
>> FS_IOC_SET_HEAT_OPTS: change the state of hot data tracking and
>> migration, as described above.
>
> I can't see how this is a manageable interface. It is not
> persistent, so after every filesystem mount you'd have to set the
> flag on all your inodes again. Hence, for the moment, I'd suggest
> that dropping per-inode tracking control until all the core issues
> are sorted out
OK.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> da...@fromorbit.com

--
Regards,

Zhi Yong Wu
Initial report on F2FS filesystem performance
This is a brief summary of our initial filesystem performance study comparing
f2fs against two existing Linux filesystems, EXT4 and NILFS2.

* test platform

  i) Desktop PC : Linux 3.6.1 (f2fs patched), Intel i5-2500 @3.3GHz quad-core,
     8GB RAM, Transcend 16GB class 10 micro SD card
  ii) Galaxy-S3 : Linux 3.0.15 (f2fs ported), Android 4.0.4, DVFS turned off,
     Transcend 16GB class 10 micro SD card

* experiment 1: buffered write (sequential and random, 4KByte write)

F2FS surpasses the other two filesystems in both random and sequential writes.
On the desktop and the Galaxy S3, f2fs exhibits 2.5 and 1.6 times better
random-write performance than EXT4, respectively. EXT4 is the standard Android
filesystem.

buffered write (1GB file)
+--------+-----------------------------------+-----------------------------------+
|        |            Desktop PC             |             Galaxy-S3             |
|        +-------------------+---------------+-------------------+---------------+
|        | sequential (MB/s) | random (IOPS) | sequential (MB/s) | random (IOPS) |
+--------+-------------------+---------------+-------------------+---------------+
| EXT4   |               7.1 |          1073 |               6.7 |          1073 |
+--------+-------------------+---------------+-------------------+---------------+
| NILFS2 |               6.8 |          1462 |               4.0 |          1272 |
+--------+-------------------+---------------+-------------------+---------------+
| F2FS   |              10.6 |          2675 |               6.9 |          1682 |
+--------+-------------------+---------------+-------------------+---------------+

* experiment 2: write + fsync (sequential and random)

F2FS surpasses the other two filesystems in both random and sequential
workloads. On the desktop and the Galaxy S3, f2fs exhibits 2 and 1.5 times
better write+fsync random-write performance than EXT4, respectively.

write + fsync (100MB file)
+--------+-----------------------------------+-----------------------------------+
|        |            Desktop PC             |             Galaxy-S3             |
|        +-------------------+---------------+-------------------+---------------+
|        | sequential (KB/s) | random (IOPS) | sequential (KB/s) | random (IOPS) |
+--------+-------------------+---------------+-------------------+---------------+
| EXT4   |             511.8 |           125 |             383.4 |           119 |
+--------+-------------------+---------------+-------------------+---------------+
| NILFS2 |             545.2 |           112 |             356.7 |            72 |
+--------+-------------------+---------------+-------------------+---------------+
| F2FS   |            1057.9 |           240 |             772.3 |           184 |
+--------+-------------------+---------------+-------------------+---------------+

The write() with fsync test is meant to measure filesystem performance under
Android SQLite operation.

* experiment 3: mounting time

To measure the mount time, we used two different scenarios. First, we mounted
the file system right after formatting, without rebooting the system. Second,
we mounted the file system after rebooting, to ensure that any data cached in
memory had been flushed. Overall, EXT4 shows the fastest mount time and F2FS
the second best; however, we observed that F2FS takes the longest time to
mount right after formatting.

mounting time with Transcend 16GB micro-SD
+--------+-----------------------------------+-----------------------------------+
|        |            Desktop PC             |             Galaxy-S3             |
|        +-----------------+-----------------+-----------------+-----------------+
|        | 1st mount after | after rebooting | 1st mount after | after rebooting |
|        | format (msec)   | (msec)          | format (msec)   | (msec)          |
+--------+-----------------+-----------------+-----------------+-----------------+
| EXT4   |              11 |              20 |              20 |              40 |
+--------+-----------------+-----------------+-----------------+-----------------+
| NILFS2 |             920 |            1013 |            1680 |            1630 |
+--------+-----------------+-----------------+-----------------+-----------------+
| F2FS   |            1486 |             161 |            2280 |            1570 |
+--------+-----------------+-----------------+-----------------+-----------------+

Sooman Jeong
ESOS Lab. Hanyang University.
<77sm...@hanyang.ac.kr>
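The speed-up factors quoted for experiments 1 and 2 follow directly from the random-IOPS columns of the two write tables. A small sketch that recomputes them; the function names speedup() and print_speedups() are made up for illustration, and the numbers are copied from the tables.

```c
#include <stdio.h>

/* Ratio of F2FS random IOPS to EXT4 random IOPS. */
static double speedup(double f2fs_iops, double ext4_iops)
{
	return f2fs_iops / ext4_iops;
}

/* Print the four ratios behind the "2.5x / 1.6x" and "2x / 1.5x" claims. */
static void print_speedups(void)
{
	printf("buffered write, Desktop PC: %.2f\n", speedup(2675, 1073));
	printf("buffered write, Galaxy-S3:  %.2f\n", speedup(1682, 1073));
	printf("write + fsync, Desktop PC:  %.2f\n", speedup(240, 125));
	printf("write + fsync, Galaxy-S3:   %.2f\n", speedup(184, 119));
}
```

Calling print_speedups() gives roughly 2.49, 1.57, 1.92 and 1.55, which round to the factors stated in the report.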
Re: [PATCH v2] fat: editions to support fat_fallocate()
2012/10/15 OGAWA Hirofumi :
> Namjae Jeon writes:
>
>> Implement preallocation via the fallocate syscall on VFAT partitions.
>> This patch is based on an earlier patch of the same name which had some
>> issues detailed below and did not get accepted. Refer
>> https://lkml.org/lkml/2007/12/22/130.
>>
>> a)The preallocated space was not persistent across remounts when the
>> FALLOC_FL_KEEP_SIZE flag was set. Also, writes to the file allocated new
>> clusters instead of using the preallocated area.
>>
>> Consider the scenario:
>> mount-->preallocate space for a file --> unmount.
>> In the old patch,the preallocated space was not reflected for that
>> file (verified using the 'du' command).
>>
>> This is now fixed with modifications to fat_fill_inode().
>
When we consider other filesystems like XFS and ext4, preallocated space
stays reserved for the lifetime of the file and persists across
mount/umount. So we tried to make this behave like those existing
implementations, which keeps the meaning of fallocate with
FALLOC_FL_KEEP_SIZE the same across all filesystems.

> What is real usage pattern of persistent across remounts on FAT?
One example is a torrent file: the client reserves space in advance, and
that space should remain reserved for as long as the torrent exists, even
if the system is rebooted or the disk is unmounted and remounted.

If the torrent case does not fit, consider recording a TV series. We want
all the parts recorded into the same file (i.e., append writes), and the TV
may be shut down or the device unmounted and mounted again in between. So
the space needs to remain available in such cases.

> If once device was unmounted, we can't know the state of FS anymore, there are
> many implementations of FAT. And preallocation is not in the spec.
I agree. As you said before, we can make the FAT fallocate feature
configurable, so that it is entirely in the hands of the user.
>
> I worry to break something. And I guess the freeing preallocation on
> last close may fix the issue for usage.
Okay, that would avoid most of your concerns, except for a USB device being
unplugged suddenly. But then fallocate behavior would differ from other
filesystems. How about making FAT fallocate a configuration option, to be
enabled by users who need it?

Let me know your opinion :)

Thanks.
> --
> OGAWA Hirofumi
Re: [Bug fix] nfs-client: fix nfs_inode_attrs_need_update for async read_done comes during truncating to smaller size
On 2012-10-16 10:51, Myklebust, Trond wrote:
>>
>> 1) is it means: nfs_inode_attrs_need_update need not consider async
>> read_done situation ?
>
> I don't understand what you mean. This is mainly about the asynchronous
> write situation...

When an async read completes, the call chain is:

nfs_readpage_result -> nfs_read_done -> nfs_refresh_inode -> nfs_refresh_inode_locked -> nfs_inode_attrs_need_update -> nfs_size_need_update

So we need to consider the case where async read_done also calls nfs_size_need_update with an old, stale, larger file size.
Do you mean that async read need not be considered (considering async write alone is enough)? Is that correct?

>
> No... If I did, I would have changed this 15 years ago when I was
> writing that code. Nothing here is new... 2.6.27-rc9 has the exact same
> heuristics.

1) I have read the relevant source code of 2.6.27-rc9; it really has no nfs_size_need_update function.
2) I have tested 2.6.27-rc9; it really does pass the LTP test for udp + nfsv2.
3) I got the 2.6.27-rc9 source code this way (please check):
   A) get the source code (git clone) from git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git
   B) git archive v2.6.27-rc9 | tar -xf - -C ../2.6.27-rc9/

> It boils down to the rule that if you want to ensure that data is not
> _lost_, then you have to ensure that the cached file size is not less
> than the true file size.
>

1) Do you mean that, under some conditions, the cached file size can be bigger than the true file size? Can you give an example (one with no negative effect on correctness)?
2) What I feel:
   A) I am not very familiar with NFS (so I really need your information).
   B) I think it really is a bug, but maybe nfs_size_need_update is not the root cause (so it needs an audit by the NFS maintainers).
   C) If nfs_size_need_update really is not the root cause, I shall continue analysing it after I get enough information from the NFS maintainers.

>> B) the test tools which I use is from the LTP (Linux Test Project),
>> they use both udp and tcp to test both the nfsv2 and nfsv3.
>
> So what combinations are failing?

udp + nfsv2 is failing (I have not tested udp + nfsv3).

>
>> C) truly LTP has its limitations: "for stress test, LTP let nfs client
>> and server under the same machine, which will cause kernel stable
>> issue", but for net test, LTP use different machine (I got our issue
>> from LTP net test).
>
> Running the client and server on the same machine is likely to deadlock
> due to memory pressure issues. The client needs to be able to _increase_
> memory pressure on the server in order to reduce its own pressure. That
> doesn't work well when client == server.
>

I did get confirmation of this from Jeff Layton 1-2 months ago; thank you for confirming it too.

--
Chen Gang
Asianux Corporation
Re: [PATCH RFC] random: Account for entropy loss due to overwrites
On Sat, Sep 29, 2012 at 12:47:04PM -0700, H. Peter Anvin wrote:
> >-static struct poolinfo {
> >+static const struct poolinfo {
> >+int poolshift; /* log2(POOLBITS) */
> > int poolwords;
> > int tap1, tap2, tap3, tap4, tap5;

Poolshift is duplicated information; it's just log2(poolwords) + 5 (since POOLBITS is poolwords*32). Granted you don't want to recalculate it every single time you need to use it, but perhaps it would be better to add poolshift to struct entropy_store, and set it in init_std_data()?

- Ted
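The relationship stated above can be checked with a little arithmetic. What follows is only a userspace sketch, not code from drivers/char/random.c: the helper names ilog2_pow2() and poolshift_from_words() are made up for illustration, and it assumes poolwords is a power of two with 32 bits per pool word, as the message says.

```c
#include <stdint.h>

/* Hypothetical helper: log2 of a power of two, found by counting shifts. */
static int ilog2_pow2(uint32_t v)
{
	int shift = 0;

	while (v > 1) {
		v >>= 1;
		shift++;
	}
	return shift;
}

/*
 * POOLBITS = poolwords * 32 = poolwords << 5, therefore
 * log2(POOLBITS) = log2(poolwords) + 5.
 */
static int poolshift_from_words(int poolwords)
{
	return ilog2_pow2((uint32_t)poolwords) + 5;
}
```

For a 128-word pool this gives 12, i.e. 1 << 12 = 4096 bits, which is why the value could be derived once at init time rather than stored in every poolinfo entry.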
[PATCH] ACPI: fix the wrong #ifdef for acpi_no_s4_hw_signature
acpi_no_s4_hw_signature() is defined inside an #ifdef CONFIG_HIBERNATION block, but the current code puts its declaration inside an #ifdef CONFIG_PM_SLEEP block.

Signed-off-by: Yuanhan Liu
---
 include/linux/acpi.h |    5 ++++-
 1 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/include/linux/acpi.h b/include/linux/acpi.h
index 90be989..a468429 100644
--- a/include/linux/acpi.h
+++ b/include/linux/acpi.h
@@ -257,8 +257,11 @@ int acpi_check_region(resource_size_t start, resource_size_t n,
 
 int acpi_resources_are_enforced(void);
 
-#ifdef CONFIG_PM_SLEEP
+#ifdef CONFIG_HIBERNATION
 void __init acpi_no_s4_hw_signature(void);
+#endif
+
+#ifdef CONFIG_PM_SLEEP
 void __init acpi_old_suspend_ordering(void);
 void __init acpi_nvs_nosave(void);
 #endif /* CONFIG_PM_SLEEP */
--
1.7.7.6
Re: [PATCH] make GFP_NOTRACK flag unconditional
On Tue, 2 Oct 2012, David Rientjes wrote:
> > There was a general sentiment in a recent discussion (See
> > https://lkml.org/lkml/2012/9/18/258) that the __GFP flags should be
> > defined unconditionally. Currently, the only offender is GFP_NOTRACK,
> > which is conditional to KMEMCHECK.
> >
> > This simple patch makes it unconditional.
> >
> > Signed-off-by: Glauber Costa
> > CC: Christoph Lameter
> > CC: Mel Gorman
> > CC: Andrew Morton
>
> Acked-by: David Rientjes
>
> I think it was done this way to show that if CONFIG_KMEMCHECK=n then the
> bit could be reused for something else but I can't think of any reason why
> that would be useful; what would need to add a gfp bit that would also
> happen to depend on CONFIG_KMEMCHECK=n? Nothing comes to mind to save a
> bit.
>
> There are other cases of this as well, like __GFP_OTHER_NODE which is only
> useful for thp and it's defined unconditionally. So this seems fine to
> me.
>

This is still missing from linux-next as of this morning; I think this patch should be merged.
linux-next: Tree for Oct 16
Hi all,

The merge window has closed, feel free to add new stuff again.

Changes since 20121015:

New tree: cortex
Dropped Tree: cortex (complex merge conflict)
Removed tree: kmemleak (maintainer suggested)

The l2-mtd tree still had its build failure so I used the version from next-20121011.

The tip tree gained a conflict against Linus' tree.

The kvm-ppc tree lost its build failure.

The cortex tree gained conflicts against Linus' tree.

The signal tree gained a build failure for which I applied a suggested patch.

The akpm tree gained a conflict against the signal tree.

I have created today's linux-next tree at git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git (patches at http://www.kernel.org/pub/linux/kernel/next/ ). If you are tracking the linux-next tree using git, you should not use "git pull" to do so as that will try to merge the new linux-next release with the old one. You should use "git fetch" as mentioned in the FAQ on the wiki (see below).

You can see which trees have been included by looking in the Next/Trees file in the source. There are also quilt-import.log and merge.log files in the Next directory. Between each merge, the tree was built with a ppc64_defconfig for powerpc and an allmodconfig for x86_64. After the final fixups (if any), it is also built with powerpc allnoconfig (32 and 64 bit), ppc44x_defconfig and allyesconfig (minus CONFIG_PROFILE_ALL_BRANCHES - this fails its final link) and i386, sparc, sparc64 and arm defconfig. These builds also have CONFIG_ENABLE_WARN_DEPRECATED, CONFIG_ENABLE_MUST_CHECK and CONFIG_DEBUG_INFO disabled when necessary.

Below is a summary of the state of the merge.

We are up to 204 trees (counting Linus' and 26 trees of patches pending for Linus' tree), more are welcome (even if they are currently empty). Thanks to those who have contributed, and to those who haven't, please do.

Status of my local build tests will be at http://kisskb.ellerman.id.au/linux-next .
If maintainers want to give advice about cross compilers/configs that work, we are always open to add more builds. Thanks to Randy Dunlap for doing many randconfig builds. And to Paul Gortmaker for triage and bug fixes. There is a wiki covering stuff to do with linux-next at http://linux.f-seidel.de/linux-next/pmwiki/ . Thanks to Frank Seidel. -- Cheers, Stephen Rothwells...@canb.auug.org.au $ git checkout master $ git reset --hard stable Merging origin/master (dd8e8c4 thermal, cpufreq: Fix build when CPU_FREQ_TABLE isn't configured) Merging fixes/master (12250d8 Merge branch 'i2c-embedded/for-next' of git://git.pengutronix.de/git/wsa/linux) Merging kbuild-current/rc-fixes (b1e0d8b kbuild: Fix gcc -x syntax) Merging arm-current/fixes (3d6ee36 Merge branch 'late-for-linus' of git://git.linaro.org/people/rmk/linux-arm) Merging m68k-current/for-linus (92f79db m68k: Remove empty #ifdef/#else/#endif block) Merging powerpc-merge/merge (fd3bc66 Merge tag 'disintegrate-powerpc-20121009' into merge) Merging sparc/master (ddffeb8 Linux 3.7-rc1) Merging net/master (29bb4cc docbook: networking: fix file paths for uapi headers) Merging sound-current/for-linus (ddffeb8 Linux 3.7-rc1) Merging pci-current/for-linus (0ff9514 PCI: Don't print anything while decoding is disabled) Merging wireless/master (bf11315 net/wireless: ipw2200: Fix panic occurring in ipw_handle_promiscuous_tx()) Merging driver-core.current/driver-core-linus (ddffeb8 Linux 3.7-rc1) Merging tty.current/tty-linus (3e5bde8 serial/8250_hp300: Missing 8250 register interface conversion bits) Merging usb.current/usb-linus (8282da4 MAINTAINERS: Add maintainer entry for the USB webcam gadget) Merging staging.current/staging-linus (ddffeb8 Linux 3.7-rc1) Merging char-misc.current/char-misc-linus (ddffeb8 Linux 3.7-rc1) Merging input-current/for-linus (0cc8d6a Merge branch 'next' into for-linus) Merging md-current/for-linus (72f36d5 md: refine reporting of resync/reshape delays.) 
Merging audit-current/for-linus (c158a35 audit: no leading space in audit_log_d_path prefix) Merging crypto-current/master (c9f97a2 crypto: x86/glue_helper - fix storing of new IV in CBC encryption) Merging ide/master (9974e43 ide: fix generic_ide_suspend/resume Oops) Merging dwmw2/master (244dc4e Merge git://git.infradead.org/users/dwmw2/random-2.6) Merging sh-current/sh-fixes-for-linus (4403310 SH: Convert out[bwl] macros to inline functions) Merging irqdomain-current/irqdomain/merge (15e06bf irqdomain: Fix debugfs formatting) Merging devicetree-current/devicetree/merge (4e8383b of: release node fix for of_parse_phandle_with_args) Merging spi-current/spi/merge (d1c185b of/spi: Fix SPI module loading by using proper "spi:" modalias prefixes.) Merging gpio-current/gpio/merge (96b7064 gpio/tca6424: merge I2C transactions, remove cast) Merging asm-generic/master (c37d615 Merge branch 'disintegrate-asm-generic' of
Re: mpol_to_str revisited.
On Mon, 15 Oct 2012, KOSAKI Motohiro wrote:
> I don't think 80de7c3138ee9fd86a98696fd2cf7ad89b995d0a is right fix.

It's certainly not a complete fix, but I think it's a much better result of the race, i.e. we don't panic anymore, we simply fail the read() instead.

> we should
> close a race (or kill remain ref count leak) if we still have.

As I mentioned earlier in the thread, the read() is done here on a task while only a reference to the task_struct is taken and we do not hold task_lock() which is required for task->mempolicy. Once that is fixed, mpol_to_str() should never be called for !task->mempolicy so it will never need to return -EINVAL in such a condition.
linux-next: build failure after merge of the final tree
Hi Al,

After merging the final tree, today's linux-next build (sparc64 defconfig) failed like this:

arch/sparc/kernel/head_64.o: In function `sys64_execve':
(.text+0x1f58): relocation truncated to fit: R_SPARC_WDISP19 against symbol `sys_execve' defined in .text section in fs/built-in.o
arch/sparc/kernel/head_64.o: In function `sys32_execve':
(.text+0x1f64): relocation truncated to fit: R_SPARC_WDISP19 against symbol `compat_sys_execve' defined in .text section in fs/built-in.o

Probably caused by commit 3223f8aab885 ("sparc64: convert to generic execve") and following from the signal tree.

I have added this patch you suggested on IRC:

From: Stephen Rothwell
Date: Tue, 16 Oct 2012 14:43:51 +1100
Subject: [PATCH] sparc: fixup for conversion to generic execve

Fixes these errors:

arch/sparc/kernel/head_64.o: In function `sys64_execve':
(.text+0x1f58): relocation truncated to fit: R_SPARC_WDISP19 against symbol `sys_execve' defined in .text section in fs/built-in.o
arch/sparc/kernel/head_64.o: In function `sys32_execve':
(.text+0x1f64): relocation truncated to fit: R_SPARC_WDISP19 against symbol `compat_sys_execve' defined in .text section in fs/built-in.o

Dictated-by: Al Viro
Signed-off-by: Stephen Rothwell
---
 arch/sparc/kernel/syscalls.S | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/arch/sparc/kernel/syscalls.S b/arch/sparc/kernel/syscalls.S
index 4bae096..f667cdf 100644
--- a/arch/sparc/kernel/syscalls.S
+++ b/arch/sparc/kernel/syscalls.S
@@ -2,15 +2,19 @@
  * environment settings are the same as the calling processes.
  */
 sys64_execve:
-	ba,pt	%xcc,sys_execve
-	 flushw
+	flushw
+	mov	%o7, %l5
+	call	sys_execve
+	 mov	%l5, %o7
 
 #ifdef CONFIG_COMPAT
 sunos_execv:
 	mov	%g0, %o2
 sys32_execve:
-	ba,pt	%xcc,compat_sys_execve
-	 flushw
+	flushw
+	mov	%o7, %l5
+	call	compat_sys_execve
+	 mov	%l5, %o7
 #endif
 
 	.align	32
--
1.7.10.280.gaa39

--
Cheers,
Stephen Rothwells...@canb.auug.org.au
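Some background on the error, offered as a hedged aside rather than as part of the patch: as far as I know, R_SPARC_WDISP19 is the relocation for the 19-bit signed word displacement of a "ba,pt %xcc" style branch, which reaches only about +/-1 MiB, whereas "call" carries a 30-bit word displacement. The sketch below just models that range check in C; fits_word_disp() is a hypothetical helper, not linker or kernel code.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical model of a signed word-displacement relocation:
 * the byte offset must be 4-byte aligned and, counted in 32-bit
 * words, must fit in a signed field of the given width.
 */
static bool fits_word_disp(int64_t byte_offset, int bits)
{
	int64_t words;
	int64_t limit;

	if (byte_offset % 4 != 0)
		return false;

	words = byte_offset / 4;
	limit = (int64_t)1 << (bits - 1);

	return words >= -limit && words < limit;
}
```

Under this model an offset of exactly 1 MiB no longer fits in 19 bits but fits easily in 30, which matches the patch's switch from a branch to a call.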
Re: [RESEND] [PATCH 2/2] random: fix debug format strings
On Mon, Oct 15, 2012 at 11:43:29PM +0200, Jiri Kosina wrote:
> Fix the following warnings in formatting debug output:
>
> drivers/char/random.c: In function ‘xfer_secondary_pool’:
> drivers/char/random.c:827: warning: format ‘%d’ expects type ‘int’, but
> argument 7 has type ‘size_t’
> drivers/char/random.c: In function ‘account’:
> drivers/char/random.c:859: warning: format ‘%d’ expects type ‘int’, but
> argument 5 has type ‘size_t’
> drivers/char/random.c:881: warning: format ‘%d’ expects type ‘int’, but
> argument 5 has type ‘size_t’
> drivers/char/random.c: In function ‘random_read’:
> drivers/char/random.c:1141: warning: format ‘%d’ expects type ‘int’, but
> argument 5 has type ‘ssize_t’
> drivers/char/random.c:1145: warning: format ‘%d’ expects type ‘int’, but
> argument 5 has type ‘ssize_t’
> drivers/char/random.c:1145: warning: format ‘%d’ expects type ‘int’, but
> argument 6 has type ‘long unsigned int’
>
> by using '%zd' instead of '%d' to properly denote ssize_t/size_t conversion.
>
> Signed-off-by: Jiri Kosina

Applied to the random tree, thanks.

- Ted
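For reference, the class of warning being fixed can be reproduced outside the kernel. This is a minimal userspace sketch using snprintf(), not the random.c debug code; format_counts() is a made-up name. Note that in standard C, '%zu' is the matching conversion for size_t and '%zd' for ssize_t; the patch as described uses '%zd' for both.

```c
#include <stdio.h>
#include <sys/types.h>

/*
 * Format a size_t and an ssize_t with the 'z' length modifier so the
 * conversion matches the argument width on both 32- and 64-bit builds.
 */
static int format_counts(char *buf, size_t len, size_t nbytes, ssize_t ret)
{
	return snprintf(buf, len, "nbytes=%zu ret=%zd", nbytes, ret);
}
```

Using plain '%d' here would compile with the warning quoted above and could print garbage on targets where size_t is wider than int.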
Re: [RESEND] [PATCH 1/2] random: make it possible to enable debugging without rebuild
On Mon, Oct 15, 2012 at 11:42:55PM +0200, Jiri Kosina wrote:
> The module parameter that turns debugging mode (which basically means
> printing a few extra lines during runtime) is in '#if 0' block. Forcing
> everyone who would like to see how entropy is behaving on his system to
> rebuild seems to be a little bit too harsh.
>
> If we were concerned about speed, we could potentially turn 'debug' into a
> static key, but I don't think it's necessary.
>
> Drop the '#if 0' block to allow using the 'debug' parameter without
> rebuilding.
>
> Signed-off-by: Jiri Kosina

Applied to the random tree, thanks.

- Ted
Re: [PATCH] cpufreq:core: Fix printing of governor and driver name
On 15 October 2012 23:21, Rafael J. Wysocki wrote:
> On Wednesday 10 of October 2012 10:12:11 Viresh Kumar wrote:
>> Arrays for governor and driver name are of size CPUFREQ_NAME_LEN or 16.
>> i.e. 15 bytes for name and 1 for trailing '\0'.
>>
>> When cpufreq driver print these names (for sysfs), it includes '\n' or ' ' in
>> the fmt string and still passes length as CPUFREQ_NAME_LEN. If the driver or
>> governor names are using all 15 fields allocated to them, then the trailing '\n'
>> or ' ' will never be printed. And so commands like:
>>
>> root@linaro-developer# cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_driver
>>
>> will print something like:
>>
>> cpufreq_foodrvroot@linaro-developer#
>>
>> Fix this by increasing print length by one character.
>>
>> Signed-off-by: Viresh Kumar
>
> Thanks for the patch, I'll queue it up for v3.8.

Hi Rafael,

Thanks for accepting the patch. I thought both of my patches would go in 3.7-rc2 as they are bug fixes. Isn't that correct?

--
viresh
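The truncation described in the quoted changelog is easy to reproduce in userspace. The sketch below uses snprintf() as a stand-in for the kernel's bounded print helper and a hypothetical ends_with_newline() function; only CPUFREQ_NAME_LEN and its value of 16 come from the message itself.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define CPUFREQ_NAME_LEN 16

/*
 * Print "name\n" into a buffer with the given size limit and report
 * whether the trailing newline survived. limit must not exceed 64.
 */
static bool ends_with_newline(const char *name, size_t limit)
{
	char buf[64];
	size_t n;

	snprintf(buf, limit, "%s\n", name);
	n = strlen(buf);

	return n > 0 && buf[n - 1] == '\n';
}
```

With a 15-character name and a limit of CPUFREQ_NAME_LEN, the 16th byte goes to the terminating '\0' and the newline is dropped; with CPUFREQ_NAME_LEN + 1 it is kept, which is the one-character increase the patch describes.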
Re: [PATCH v4 02/14] memcg: Reclaim when more than one page needed.
(2012/10/08 19:06), Glauber Costa wrote:
> From: Suleiman Souhlal
>
> mem_cgroup_do_charge() was written before kmem accounting, and expects
> three cases: being called for 1 page, being called for a stock of 32
> pages, or being called for a hugepage. If we call for 2 or 3 pages (and
> both the stack and several slabs used in process creation are such, at
> least with the debug options I had), it assumed it's being called for
> stock and just retried without reclaiming.
>
> Fix that by passing down a minsize argument in addition to the csize.
>
> And what to do about that (csize == PAGE_SIZE && ret) retry? If it's
> needed at all (and presumably is since it's there, perhaps to handle
> races), then it should be extended to more than PAGE_SIZE, yet how far?
> And should there be a retry count limit, of what? For now retry up to
> COSTLY_ORDER (as page_alloc.c does) and make sure not to do it if
> __GFP_NORETRY.
>
> [v4: fixed nr pages calculation pointed out by Christoph Lameter ]
>
> Signed-off-by: Suleiman Souhlal
> Signed-off-by: Glauber Costa
> Reviewed-by: Kamezawa Hiroyuki
> Acked-by: Michal Hocko
> Acked-by: Johannes Weiner

Acked-by: KAMEZAWA Hiroyuki
Re: [RFC v3 11/13] vfs: add 3 new ioctl interfaces
On Wed, Oct 10, 2012 at 06:07:33PM +0800, zwu.ker...@gmail.com wrote:
> From: Zhi Yong Wu
>
> FS_IOC_GET_HEAT_INFO: return a struct containing the various
> metrics collected in btrfs_freq_data structs, and also return a

I think you mean hot_freq_data :P

> calculated data temperature based on those metrics. Optionally, retrieve
> the temperature from the hot data hash list instead of recalculating it.

To get the heat info for a specific file you have to know what file you want to get that info for, right? I can see the usefulness of asking for the heat data on a specific file, but how do you find the hot files in the first place?

i.e. the big question the user interface needs to answer is "what files are hot?".

Once userspace knows what the hottest files are, it can open them and query the data via the above ioctl, but expecting userspace to iterate millions of inodes in a filesystem to find hot files is very inefficient.

FWIW, if you were to return file handles to the hottest files, then the application could open and query them without even needing to know the path name to them. This would be exceedingly useful for defragmentation programs, especially as that is the way xfs_fsr already operates on candidate files.(*)

IOWs, sometimes the pathname is irrelevant to the operations that applications want to perform - all they care about is having an efficient method of finding the inode they want and getting a file descriptor that points to the file. Given the heat map info fits right in to the sort of operations defrag and data mover tools already do, it kind of makes sense to optimise the interface towards those uses.

(*) i.e. it finds them via bulkstat, which returns handle information along with all the other inode data, then opens the file by handle to do the defrag work.

> FS_IOC_GET_HEAT_OPTS: return an integer representing the current
> state of hot data tracking and migration:
>
> 0 = do nothing
> 1 = track frequency of access
>
> FS_IOC_SET_HEAT_OPTS: change the state of hot data tracking and
> migration, as described above.

I can't see how this is a manageable interface. It is not persistent, so after every filesystem mount you'd have to set the flag on all your inodes again. Hence, for the moment, I'd suggest dropping per-inode tracking control until all the core issues are sorted out.

Cheers,

Dave.
--
Dave Chinner
da...@fromorbit.com
Re: [RFC PATCH 1/5] irq_work: Move irq_work_raise() declaration/default definition to arch headers
On Tue, Oct 16, 2012 at 12:18:05AM +0200, Frederic Weisbecker wrote: > 2012/10/15 Arnd Bergmann : > > On Monday 15 October 2012, Steven Rostedt wrote: > >> On Mon, 2012-10-15 at 22:23 +0200, Frederic Weisbecker wrote: > >> > 2012/10/15 Steven Rostedt : > >> > > On Mon, 2012-10-15 at 17:11 +0100, Catalin Marinas wrote: > >> > > BTW, is there any rational reason that the include path lookup doesn't > >> > > just check for the files in include/asm-generic after looking in > >> > > arch/*/include/asm? > >> > > Really, the best way would be just to add the default asm files into > >> > > include/asm-generic and be done with it. I hate the fact that we need > >> > > to > >> > > touch every arch for every generic default file. > >> > Agreed. I'm including Arnd in the conversation. > >> As David Howells is doing user space header work, I'll include him too. > >> Maybe someone can shed some light onto this. I'll just add my vote there, I've *no* idea why asm-generic isn't in the include path by default, I could never figure out what that was for. > > A number of people have expressed the wish to do this through Makefile > > magic, but > > so far nobody has been able to come up with the right incantation. > > > > I've spent a day trying to figure it out, and I think Mark Brown tried some > > of > > the same things. It's probably not all that hard for someone who is more > > familiar > > with the Kbuild internals. I came up with stuff for it, though it needed prettyfying. > This seems to do the trick: > (It's the diff result of ln -s asm-generic include/asm) That'd work, but I assume there is some reason why we've got this system of explicitly adding each file. It's not like cpp can test for the presence of include files. If we can't figure out why we're not doing this I'd propose we start. 
Re: hung task when USB storage probe occurs during suspend
[oops, fixing LKML address]

If you plug in a USB storage device and then suspend, resume, and quickly suspend again, the system may freeze. 2 minutes later you'll get the following message. I believe this is a regression introduced in 62d3c543 ("Block: use a freezable workqueue for disk-event polling"). Reverting that patch prevents the deadlock.

<3>[ 240.107877] INFO: task kworker/u:2:64 blocked for more than 120 seconds.
<3>[ 240.107888] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
<6>[ 240.107899] kworker/u:2 D 880149f61908 064 2 0x
<5>[ 240.107914] 880149f719f0 0046 8801495e3250 8801495e3250
<5>[ 240.107923] 880149f615d8 880149f61590 880149f71fd8 880149f71fd8
<5>[ 240.107931] 00012180 880149f61590 88014fa92208 7fff
<5>[ 240.107939] Call Trace:
<5>[ 240.107955] [] schedule+0x64/0x66
<5>[ 240.107966] [] schedule_timeout+0x34/0xde
<5>[ 240.107973] [] wait_for_common+0xcd/0x14b
<5>[ 240.107983] [] ? try_to_wake_up+0x1e0/0x1e0
<5>[ 240.107990] [] wait_for_completion+0x1d/0x1f
<5>[ 240.107998] [] flush_work+0x2e/0x34
<5>[ 240.108004] [] ? do_work_for_cpu+0x27/0x27
<5>[ 240.108011] [] flush_delayed_work+0x49/0x4e
<5>[ 240.108019] [] disk_clear_events+0x97/0xfb
<5>[ 240.108029] [] check_disk_change+0x2d/0x5f
<5>[ 240.108039] [] sd_open+0xb1/0x160
<5>[ 240.108047] [] __blkdev_get+0xbf/0x3b0
<5>[ 240.108054] [] blkdev_get+0x1df/0x2d8
<5>[ 240.108064] [] ? unlock_new_inode+0x5c/0x61
<5>[ 240.108074] [] ? put_device+0x17/0x19
<5>[ 240.108083] [] ? disk_put_part+0x12/0x14
<5>[ 240.108089] [] add_disk+0x29f/0x3e6
<5>[ 240.108096] [] sd_probe_async+0x124/0x1c4
<5>[ 240.108103] [] ? async_schedule+0x17/0x17
<5>[ 240.108108] [] async_run_entry_fn+0xa2/0x153
<5>[ 240.108115] [] process_one_work+0x199/0x2b8
<5>[ 240.108123] [] worker_thread+0x13c/0x222
<5>[ 240.108130] [] ? manage_workers.isra.26+0x171/0x171
<5>[ 240.108138] [] kthread+0x8b/0x93
<5>[ 240.108147] [] kernel_thread_helper+0x4/0x10
<5>[ 240.108155] [] ? __init_kthread_worker+0x39/0x39
<5>[ 240.108163] [] ? gs_change+0xb/0xb
<0>[ 240.108169] Kernel panic - not syncing: hung_task: blocked tasks

This async SCSI probe is stuck trying to flush the system_nrt_freezable_wq workqueue, which is frozen.

<6>[ 169.464976] powerd_suspend D 880101d2e818 0 5687 5686 0x
<5>[ 169.464981] 880101f2bcd8 0082 0246 81813020
<5>[ 169.464988] 0246 880101d2e4a0 880101f2bfd8 880101f2bfd8
<5>[ 169.464996] 00012180 880101d2e4a0 880101f2bcc8 0217
<5>[ 169.465002] Call Trace:
<5>[ 169.465009] [] ? scsi_bus_resume_common+0x8d/0x8d
<5>[ 169.465015] [] schedule+0x64/0x66
<5>[ 169.465023] [] async_synchronize_cookie_domain+0xb6/0x112
<5>[ 169.465029] [] ? __init_waitqueue_head+0x32/0x32
<5>[ 169.465038] [] async_synchronize_cookie+0x15/0x17
<5>[ 169.465046] [] async_synchronize_full+0x15/0x31
<5>[ 169.465052] [] scsi_bus_prepare+0x1d/0x36
<5>[ 169.465059] [] dpm_prepare+0xdd/0x18d
<5>[ 169.465065] [] dpm_suspend_start+0x15/0x40
<5>[ 169.465073] [] suspend_devices_and_enter+0x78/0x27f
<5>[ 169.465081] [] pm_suspend+0x134/0x1a9
<5>[ 169.465088] [] state_store+0x9c/0xc5
<5>[ 169.465098] [] kobj_attr_store+0x17/0x19
<5>[ 169.465105] [] sysfs_write_file+0x104/0x140
<5>[ 169.465111] [] vfs_write+0xa8/0xcf
<5>[ 169.465117] [] sys_write+0x4a/0x71
<5>[ 169.465124] [] system_call_fastpath+0x16/0x1b

The powerd_suspend task is blocked waiting for SCSI probes to complete.

We've worked around this issue in the chromium tree by partially reverting 62d3c543 (see https://gerrit.chromium.org/gerrit/#/c/35324/).

Thanks,
Michael
Re: [PATCH] include/version.h: Update for kernel 3.7
On Mon, Oct 15, 2012 at 04:43:27PM -0500, Larry Finger wrote:
> The value for LINUX_VERSION_CODE was not updated for kernel 3.7-rc1.
>
> Signed-off-by: Larry Finger
> ---
> version.h |2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
> ---
>
> Index: linux-2.6/include/linux/version.h
> ===
> --- linux-2.6.orig/include/linux/version.h
> +++ linux-2.6/include/linux/version.h
> @@ -1,2 +1,2 @@
> -#define LINUX_VERSION_CODE 198144
> +#define LINUX_VERSION_CODE 198400
> #define KERNEL_VERSION(a,b,c) (((a) << 16) + ((b) << 8) + (c))

This isn't in the Linux git sources; it's a generated file.

- Ted
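The two numbers in the quoted diff follow directly from the KERNEL_VERSION() macro shown there. A small userspace check, using that macro verbatim plus a hypothetical wrapper function version_code() added only so the values can be tested:

```c
/* Macro copied from the quoted version.h. */
#define KERNEL_VERSION(a,b,c) (((a) << 16) + ((b) << 8) + (c))

/*
 * Hypothetical wrapper for illustration:
 *   3.6.0 -> (3 << 16) + (6 << 8) + 0 = 196608 + 1536 = 198144
 *   3.7.0 -> (3 << 16) + (7 << 8) + 0 = 196608 + 1792 = 198400
 */
static int version_code(int major, int minor, int patch)
{
	return KERNEL_VERSION(major, minor, patch);
}
```

So 198144 is the code for 3.6.0 and 198400 the code for 3.7.0, which is what a freshly generated version.h for a 3.7-rc1 tree would be expected to contain.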
linux-next: manual merge of the akpm tree with the signal tree
Hi Andrew,

Today's linux-next merge of the akpm tree got a conflict in arch/arm64/kernel/sys_compat.c between commit 0fe8f08036a2 ("arm64: Use generic sys_execve() implementation") from the signal tree and commit "compat: generic compat_sys_sched_rr_get_interval implementation" from the akpm tree.

I fixed it up (see below) and can carry the fix as necessary (no action is required).

--
Cheers,
Stephen Rothwells...@canb.auug.org.au

diff --cc arch/arm64/kernel/sys_compat.c
index d140b73,fd8ae6e..000
--- a/arch/arm64/kernel/sys_compat.c
+++ b/arch/arm64/kernel/sys_compat.c
@@@ -49,21 -49,24 +49,6 @@@ asmlinkage int compat_sys_vfork(struct
  		       regs, 0, NULL, NULL);
  }
  
- asmlinkage int compat_sys_sched_rr_get_interval(compat_pid_t pid,
- 						struct compat_timespec __user *interval)
- {
- 	struct timespec t;
- 	int ret;
- 	mm_segment_t old_fs = get_fs();
- 
- 	set_fs(KERNEL_DS);
- 	ret = sys_sched_rr_get_interval(pid, (struct timespec __user *)&t);
- 	set_fs(old_fs);
- 	if (put_compat_timespec(&t, interval))
- 		return -EFAULT;
- 	return ret;
- }
- 
 -asmlinkage int compat_sys_execve(const char __user *filenamei,
 -				 compat_uptr_t argv, compat_uptr_t envp,
 -				 struct pt_regs *regs)
 -{
 -	int error;
 -	struct filename *filename;
 -
 -	filename = getname(filenamei);
 -	error = PTR_ERR(filename);
 -	if (IS_ERR(filename))
 -		goto out;
 -	error = compat_do_execve(filename->name, compat_ptr(argv),
 -				 compat_ptr(envp), regs);
 -	putname(filename);
 -out:
 -	return error;
 -}
 -
  static inline void
  do_compat_cache_op(unsigned long start, unsigned long end, int flags)
  {
Re: [Bug fix] nfs-client: fix nfs_inode_attrs_need_update for async read_done comes during truncating to smaller size
On Tue, 2012-10-16 at 09:37 +0800, Chen Gang wrote:
> On 2012-10-15 20:32, Myklebust, Trond wrote:
> > RPC is not ordered. The fact that we get one RPC reply before another
> > does not mean that the server sent them in that order.
> >
> > This is doubly true when you use UDP as the transport protocol.
>
> 1) is it means: nfs_inode_attrs_need_update need not consider async
> read_done situation ?

I don't understand what you mean. This is mainly about the asynchronous write situation...

> 2) for correctness, I do not think "nfs_size_to_loff_t(fattr->size) >
> i_size_read(inode)" in nfs_size_need_update is enough. (at least need
> use "!=" instead of '>'), do you think so ?

No... If I did, I would have changed this 15 years ago when I was writing that code. Nothing here is new... 2.6.27-rc9 has the exact same heuristics.

It boils down to the rule that if you want to ensure that data is not _lost_, then you have to ensure that the cached file size is not less than the true file size.

> 3) another reference:
>
> A) for an old kernel version (such as 2.6.27-rc9), no such issue
> (because it did not have nfs_size_need_update).
>
> B) the test tools which I use is from the LTP (Linux Test Project),
> they use both udp and tcp to test both the nfsv2 and nfsv3.

So what combinations are failing?

> C) truly LTP has its limitations: "for stress test, LTP let nfs client
> and server under the same machine, which will cause kernel stable
> issue", but for net test, LTP use different machine (I got our issue
> from LTP net test).

Running the client and server on the same machine is likely to deadlock due to memory pressure issues. The client needs to be able to _increase_ memory pressure on the server in order to reduce its own pressure. That doesn't work well when client == server.

--
Trond Myklebust
Linux NFS client maintainer

NetApp
trond.mykleb...@netapp.com
www.netapp.com
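To make the rule concrete, here is a deliberately simplified model of the comparison quoted above, "nfs_size_to_loff_t(fattr->size) > i_size_read(inode)". It is a sketch under assumptions, not the kernel function: size_need_update() and apply_reply() are hypothetical names, and plain long long values stand in for the inode and attribute structures.

```c
#include <stdbool.h>

/*
 * Hypothetical model: accept a size reported in an RPC reply only if
 * it is larger than the size already cached.
 */
static bool size_need_update(long long cached_size, long long reported_size)
{
	return reported_size > cached_size;
}

/*
 * Apply a possibly reordered reply. With '>' the cached size can only
 * grow, so a stale smaller size never shrinks the cache and hides data.
 */
static long long apply_reply(long long cached_size, long long reported_size)
{
	if (size_need_update(cached_size, reported_size))
		return reported_size;
	return cached_size;
}
```

In this model a late reply carrying an older, larger size would still be accepted, which is the case the reporter is asking about; a '!=' test would instead let a stale smaller size shrink the cache, the outcome the stated rule is meant to prevent.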
Re: [PATCH] include/version.h: Update for kernel 3.7
On 10/15/2012 04:54 PM, Borislav Petkov wrote: On Mon, Oct 15, 2012 at 04:43:27PM -0500, Larry Finger wrote: The value for LINUX_VERSION_CODE was not updated for kernel 3.7-rc1. That's probably fallout from the whole UAPI thing. Signed-off-by: Larry Finger --- version.h |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) --- Index: linux-2.6/include/linux/version.h === --- linux-2.6.orig/include/linux/version.h +++ linux-2.6/include/linux/version.h @@ -1,2 +1,2 @@ There are two version.h files on my box: -#define LINUX_VERSION_CODE 198144 This is in +#define LINUX_VERSION_CODE 198400 This is in I'd guess that everything should include this new version.h file now since the Makefile generates this now and not the one above. But I could very well be wrong. It seems to be fixed now. Thanks, Larry -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
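For reference, the two values in the patch follow from the usual KERNEL_VERSION(a, b, c) encoding, (a << 16) + (b << 8) + c. A small sketch, with `kernel_version` as a local helper name rather than the kernel macro itself:

```c
#include <assert.h>

/* Same arithmetic as the KERNEL_VERSION(a, b, c) macro. */
static unsigned int kernel_version(unsigned int a, unsigned int b,
				   unsigned int c)
{
	assert(b <= 255 && c <= 255);	/* each field is packed into one byte */
	return (a << 16) + (b << 8) + c;
}
```

kernel_version(3, 6, 0) gives 198144 and kernel_version(3, 7, 0) gives 198400, which are the removed and added values in the hunk above.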
Re: [GIT PULL] Linux KVM tool for v3.7-rc0
Hi Linus,

On Fri, 12 Oct 2012 14:34:33 +0300 (EEST) Pekka Enberg wrote:
>
> Please consider pulling the latest LKVM tree from:
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux.git
> kvmtool/for-linus

So you have not taken this in the v3.7 merge window. Will you ever merge
this? If not, it should be removed from linux-next (where it has been
sitting since before v3.1) and turned into an independent (to the
kernel) project.

--
Cheers,
Stephen Rothwell s...@canb.auug.org.au
Re: Re: [PATCH] block: Add blk_rq_pos(rq) to sort rq when flushing plug-list.
On 2012-10-15 21:18 Shaohua Li Wrote: >2012/10/15 Shaohua Li : >> 2012/10/15 Jianpeng Ma : >>> My workload is a raid5 which had 16 disks. And used our filesystem to >>> write using direct-io mode. >>> I used the blktrace to find those message: >>> >>> 8,16 0 3570 1.083923979 2519 I W 144323176 + 24 [md127_raid5] >>> 8,16 00 1.083926214 0 m N cfq2519 insert_request >>> 8,16 0 3571 1.083926586 2519 I W 144323072 + 104 [md127_raid5] >>> 8,16 00 1.083926952 0 m N cfq2519 insert_request >>> 8,16 0 3572 1.083927180 2519 U N [md127_raid5] 2 >>> 8,16 00 1.083927870 0 m N cfq2519 Not >>> idling.st->count:1 >>> 8,16 00 1.083928320 0 m N cfq2519 dispatch_insert >>> 8,16 00 1.083928951 0 m N cfq2519 dispatched a request >>> 8,16 00 1.083929443 0 m N cfq2519 activate rq,drv=1 >>> 8,16 0 3573 1.083929530 2519 D W 144323176 + 24 [md127_raid5] >>> 8,16 00 1.083933883 0 m N cfq2519 Not >>> idling.st->count:1 >>> 8,16 00 1.083934189 0 m N cfq2519 dispatch_insert >>> 8,16 00 1.083934654 0 m N cfq2519 dispatched a request >>> 8,16 00 1.083935014 0 m N cfq2519 activate rq,drv=2 >>> 8,16 0 3574 1.083935101 2519 D W 144323072 + 104 [md127_raid5] >>> 8,16 0 3575 1.084196179 0 C W 144323176 + 24 [0] >>> 8,16 00 1.084197979 0 m N cfq2519 complete rqnoidle 0 >>> 8,16 0 3576 1.084769073 0 C W 144323072 + 104 [0] >>> .. >>> 8,16 1 3596 1.091394357 2519 I W 144322544 + 16 [md127_raid5] >>> 8,16 10 1.091396181 0 m N cfq2519 insert_request >>> 8,16 1 3597 1.091396571 2519 I W 144322520 + 24 [md127_raid5] >>> 8,16 10 1.091396934 0 m N cfq2519 insert_request >>> 8,16 1 3598 1.091397165 2519 I W 144322488 + 32 [md127_raid5] >>> 8,16 10 1.091397477 0 m N cfq2519 insert_request >>> 8,16 1 3599 1.091397708 2519 I W 144322432 + 56 [md127_raid5] >>> 8,16 10 1.091398023 0 m N cfq2519 insert_request >>> 8,16 1 3600 1.091398284 2519 U N [md127_raid5] 4 >>> 8,16 10 1.091398986 0 m N cfq2519 Not idling. 
>>> st->count:1 >>> 8,16 10 1.091399511 0 m N cfq2519 dispatch_insert >>> 8,16 10 1.091400217 0 m N cfq2519 dispatched a request >>> 8,16 10 1.091400688 0 m N cfq2519 activate rq,drv=1 >>> 8,16 1 3601 1.091400766 2519 D W 144322544 + 16 [md127_raid5] >>> 8,16 10 1.091406151 0 m N cfq2519 Not >>> idling.st->count:1 >>> 8,16 10 1.091406460 0 m N cfq2519 dispatch_insert >>> 8,16 10 1.091406931 0 m N cfq2519 dispatched a request >>> 8,16 10 1.091407291 0 m N cfq2519 activate rq,drv=2 >>> 8,16 1 3602 1.091407378 2519 D W 144322520 + 24 [md127_raid5] >>> 8,16 10 1.091414006 0 m N cfq2519 Not >>> idling.st->count:1 >>> 8,16 10 1.091414297 0 m N cfq2519 dispatch_insert >>> 8,16 10 1.091414702 0 m N cfq2519 dispatched a request >>> 8,16 10 1.091415047 0 m N cfq2519 activate rq, drv=3 >>> 8,16 1 3603 1.091415125 2519 D W 144322488 + 32 [md127_raid5] >>> 8,16 10 1.091416469 0 m N cfq2519 Not >>> idling.st->count:1 >>> 8,16 10 1.091416754 0 m N cfq2519 dispatch_insert >>> 8,16 10 1.091417186 0 m N cfq2519 dispatched a request >>> 8,16 10 1.091417535 0 m N cfq2519 activate rq,drv=4 >>> 8,16 1 3604 1.091417628 2519 D W 144322432 + 56 [md127_raid5] >>> 8,16 1 3605 1.091857225 4393 C W 144322544 + 16 [0] >>> 8,16 10 1.091858753 0 m N cfq2519 complete rqnoidle 0 >>> 8,16 1 3606 1.092068456 4393 C W 144322520 + 24 [0] >>> 8,16 10 1.092069851 0 m N cfq2519 complete rqnoidle 0 >>> 8,16 1 3607 1.092350440 4393 C W 144322488 + 32 [0] >>> 8,16 10 1.092351688 0 m N cfq2519 complete rqnoidle 0 >>> 8,16 1 3608 1.093629323 0 C W 144322432 + 56 [0] >>> 8,16 10 1.093631151 0 m N cfq2519 complete rqnoidle 0 >>> 8,16 10 1.093631574 0 m N cfq2519 will busy wait >>> 8,16 10 1.093631829 0 m N cfq schedule dispatch >>> >>> Because in func "elv_attempt_insert_merge", it only to try to >>> backmerge.So the four request can't merge in theory. >>> I trace ten minutes and count those situation, it can count 25%. >>> >>> With the patch,i tested and not found situation like above. 
>>> >>> Signed-off-by: Jianpeng Ma >>> --- >>>
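The second burst in the trace is four writes that are contiguous on disk (144322432 + 56, 144322488 + 32, 144322520 + 24, 144322544 + 16) but were queued in descending sector order, which is why a back-merge-only check cannot join them. A rough user-space model of the idea, assuming a toy `struct req` and helper names that are illustrative only and not the block layer's types or functions:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy request: starting sector and length in sectors. */
struct req {
	unsigned long long pos;
	unsigned int nr;
};

static int cmp_pos(const void *a, const void *b)
{
	const struct req *x = a, *y = b;

	return (x->pos > y->pos) - (x->pos < y->pos);
}

/* Order the plugged requests by starting sector before dispatch. */
static void sort_by_pos(struct req *v, size_t n)
{
	assert(v != NULL);
	qsort(v, n, sizeof(*v), cmp_pos);
}

/*
 * Back-merge only: a request is folded into the one kept just before it
 * when it starts exactly where that one ends. Returns how many requests
 * remain after merging.
 */
static size_t back_merge(struct req *v, size_t n)
{
	size_t out = 0, i;

	if (n == 0)
		return 0;
	for (i = 1; i < n; i++) {
		if (v[out].pos + v[out].nr == v[i].pos)
			v[out].nr += v[i].nr;
		else
			v[++out] = v[i];
	}
	return out + 1;
}
```

Fed the four requests in the order they appear in the trace, back_merge() leaves all four; after sort_by_pos() the same requests collapse into a single request of 128 sectors starting at 144322432. The real elevator looks merge candidates up differently, so this only models the ordering effect.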
RE: [PATCH 11/16] f2fs: add inode operations for special inodes
> On Monday 15 October 2012, Changman Lee wrote: > > On Monday, 15 October 2012, Arnd Bergmann wrote: > > > It is only a performance hint though, so it is not a correctness issue the > > > file system gets it wrong. In order to do efficient garbage collection, a > > > log > > > structured file system should take all the information it can get about > > > the > > > expected life of data it writes. I agree that the list, even in the form > > > of > > > mkfs time settings, is not a clean abstraction, but in the place of an > > > Android > > > phone manufacturer I would still enable it if it promises a significant > > > performance advantage over not using it. I guess it would be nice if this > > > could be overridden in some form, e.g. using an ioctl on the file as ext4 > > > does. > > > > > Right. This is related with HOT/COLD separation policy of f2fs. If we know > > that data is COLD, we can manage gc effectively. > > I think that ext lists are placed in sb is better like your advice because > > it's difficult to fix user app. Although it's nasty way. > > Ok. I think you should adapt the terminology though. Right now, the > optimization > is to mark the data as COLD because we expect it to be written less often than > other kinds of data. However, the hot/cold terms are usually only applied to > data that we assume is going to be written soon or not based on how often > the same data has been accessed in the past. > > Anything you detect from the file name is not really a hint on hot/cold > files, but rather on the expected access pattern: These files are going > to be written once, and will be read-only after that, they are probably > multiple megabytes in size, and if you have a lot of them, they are likely > to live for the same time. > > It may well be possible that we later decide to use the hint in a different > way, e.g. to put these files into yet another separate log, aside from > other hot or cold files.
> > > > We should also take the kinds of access we have seen on a file into > > > account. > > > E.g. if someone opens a file O_RDWR and performs seek or pwrite on it, we > > > can > > > assume that it's not in the category of typical media files, and a file > > > that > > > gets written to disk linearly in multiple megabytes might belong into the > > > category even if it is named otherwise. > > > > > This is more general but it's hard to adapt now. > > I think it's important to leave the option open for a future optimization. > Right now, what we have to get agreement on is the on-disk format, because > we absolutely don't want to make incompatible changes to that once f2fs > has been merged into the kernel and is getting used on real systems. > > This is independent of how the code is implemented at the moment, and > any tuning regarding how to group different kinds of data into the six > logs is completely up to how things work out in practice. But you should > definitely ensure that those changes don't require changing the format > if we decide to use a different number of logs in the future, or to > use the logs differently. > > The split between logs for nodes on the one hand and data on the other > is something that can well be hardcoded, and it's ok to have a hard > upper bound on the number of logs in the file system, possibly higher > than 6. > Thank you for a lot of points to be addressed. :) Maybe it's time to summarize them. Please let me know what I misunderstood. [In v2] - Extension list : Mkfs supports configuring extensions by user, and that information will be stored in the superblock. In order to reduce the cleaning overhead, f2fs supports an additional interface, ioctl, likewise ext4. - The number of active logs : No change will be done in on-disk layout (i.e., max 6 logs). Instead, f2fs supports changing the number with a mount option. Currently, I think 4, 5, and 6 would be enough. 
- Section size : Mkfs supports multiples of segments for a section, not power-of-two. [Future optimization] - Data separation : file access pattern, and else? > Arnd -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
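The extension-list item in the summary above amounts to matching a file name's suffix against a table that mkfs would store in the superblock. A minimal sketch, assuming a hypothetical in-memory list and function name (`is_cold_name`); the real on-disk layout and matching rules are whatever the f2fs patches define:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Return 1 when the name ends in one of the listed extensions.
 * The list stands in for a table read from the superblock.
 */
static int is_cold_name(const char *name, const char *const *exts, size_t n)
{
	const char *dot;
	size_t i;

	assert(name != NULL);
	dot = strrchr(name, '.');
	if (!dot || dot[1] == '\0')
		return 0;	/* no extension at all */
	for (i = 0; i < n; i++)
		if (strcmp(dot + 1, exts[i]) == 0)
			return 1;
	return 0;
}
```

As discussed in the thread, a match like this is only a hint about the expected access pattern; the proposed ioctl would let an application override it per file.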
linux-next: manual merge of the cortex tree with Linus' tree
Hi Uwe,

Today's linux-next merge of the cortex tree got a conflict in
arch/arm/kernel/process.c between commit 9e14f828ee4a ("arm: split
ret_from_fork, simplify kernel_thread() [based on patch by rmk]") from
Linus' tree and commit 2f3e7d3436cb ("Cortex-M3: Add support for
exception handling") from the cortex tree.

I have no idea how to fix this up, so I have just dropped this tree for
today.

--
Cheers,
Stephen Rothwell s...@canb.auug.org.au
linux-next: manual merge of the cortex tree with Linus' tree
Hi Uwe, Today's linux-next merge of the cortex tree got a conflict in arch/arm/include/asm/ptrace.h between commit cb8db5d4578a ("UAPI: (Scripted) Disintegrate arch/arm/include/asm") from Linus' tree and commit 69bc3631744a ("Cortex-M3: Add base support for Cortex-M3") from the cortex tree. I fixed it up (see below) and can carry the fix as necessary (no action is required). I also had to add this merge fix patch: diff --git a/arch/arm/include/uapi/asm/ptrace.h b/arch/arm/include/uapi/asm/ptrace.h index 96ee092..b71a3f8 100644 --- a/arch/arm/include/uapi/asm/ptrace.h +++ b/arch/arm/include/uapi/asm/ptrace.h @@ -49,13 +49,15 @@ #define SYSTEM_MODE0x001f #define MODE32_BIT 0x0010 #define MODE_MASK 0x001f -#define PSR_T_BIT 0x0020 -#define PSR_F_BIT 0x0040 -#define PSR_I_BIT 0x0080 -#define PSR_A_BIT 0x0100 -#define PSR_E_BIT 0x0200 -#define PSR_J_BIT 0x0100 -#define PSR_Q_BIT 0x0800 +#define V4_PSR_T_BIT 0x0020 /* >= V4T, but not V7M */ +#define V7M_PSR_T_BIT 0x0100 +#define PSR_T_BIT V4_PSR_T_BIT +#define PSR_F_BIT 0x0040 /* >= V4, but not V7M */ +#define PSR_I_BIT 0x0080 /* >= V4, but not V7M */ +#define PSR_A_BIT 0x0100 /* >= V6, but not V7M */ +#define PSR_E_BIT 0x0200 /* >= V6, but not V7M */ +#define PSR_J_BIT 0x0100 /* >= V5J, but not V7M */ +#define PSR_Q_BIT 0x0800 /* >= V5E, including V7M */ #define PSR_V_BIT 0x1000 #define PSR_C_BIT 0x2000 #define PSR_Z_BIT 0x4000 @@ -125,6 +127,7 @@ struct pt_regs { #define ARM_r1 uregs[1] #define ARM_r0 uregs[0] #define ARM_ORIG_r0uregs[17] +#define ARM_EXC_RETuregs[18] /* * The size of the user-visible VFP state as seen by PTRACE_GET/SETVFPREGS -- Cheers, Stephen Rothwells...@canb.auug.org.au diff --cc arch/arm/include/asm/ptrace.h index 3d52ee1,090fea7..000 --- a/arch/arm/include/asm/ptrace.h +++ b/arch/arm/include/asm/ptrace.h @@@ -10,12 -10,156 +10,29 @@@ #ifndef __ASM_ARM_PTRACE_H #define __ASM_ARM_PTRACE_H -#include +#include -#define PTRACE_GETREGS12 -#define PTRACE_SETREGS13 -#define PTRACE_GETFPREGS 14 -#define 
PTRACE_SETFPREGS 15 -/* PTRACE_ATTACH is 16 */ -/* PTRACE_DETACH is 17 */ -#define PTRACE_GETWMMXREGS18 -#define PTRACE_SETWMMXREGS19 -/* 20 is unused */ -#define PTRACE_OLDSETOPTIONS 21 -#define PTRACE_GET_THREAD_AREA22 -#define PTRACE_SET_SYSCALL23 -/* PTRACE_SYSCALL is 24 */ -#define PTRACE_GETCRUNCHREGS 25 -#define PTRACE_SETCRUNCHREGS 26 -#define PTRACE_GETVFPREGS 27 -#define PTRACE_SETVFPREGS 28 -#define PTRACE_GETHBPREGS 29 -#define PTRACE_SETHBPREGS 30 - -/* - * PSR bits - * Note on V7M there is no mode contained in the PSR - */ -#define USR26_MODE0x -#define FIQ26_MODE0x0001 -#define IRQ26_MODE0x0002 -#define SVC26_MODE0x0003 + #if defined(__KERNEL__) && defined(CONFIG_CPU_V7M) + /* + * Use 0 here to get code right that creates a userspace + * or kernel space thread + */ ++#undef USR_MODE ++#undef SVC_MODE ++#undef PSR_T_BIT + #define USR_MODE 0x + #define SVC_MODE 0x -#else -#define USR_MODE 0x0010 -#define SVC_MODE 0x0013 -#endif -#define FIQ_MODE 0x0011 -#define IRQ_MODE 0x0012 -#define ABT_MODE 0x0017 -#define UND_MODE 0x001b -#define SYSTEM_MODE 0x001f -#define MODE32_BIT0x0010 -#define MODE_MASK 0x001f - -#define V4_PSR_T_BIT 0x0020 /* >= V4T, but not V7M */ -#define V7M_PSR_T_BIT 0x0100 -#if defined(__KERNEL__) && defined(CONFIG_CPU_V7M) + #define PSR_T_BIT V7M_PSR_T_BIT -#else -/* for compatibility */ -#define PSR_T_BIT V4_PSR_T_BIT -#endif - -#define PSR_F_BIT 0x0040 /* >= V4, but not V7M */ -#define PSR_I_BIT 0x0080 /* >= V4, but not V7M */ -#define PSR_A_BIT 0x0100 /* >= V6, but not V7M */ -#define PSR_E_BIT 0x0200 /* >= V6, but not V7M */ -#define PSR_J_BIT 0x0100 /* >= V5J, but not V7M */ -#define PSR_Q_BIT 0x0800 /* >= V5E, including V7M */ -#define PSR_V_BIT 0x1000 -#define PSR_C_BIT 0x2000 -#define PSR_Z_BIT 0x4000 -#define PSR_N_BIT 0x8000 - -/* - * Groups of PSR bits - */ -#define PSR_f 0xff00 /* Flags*/ -#define PSR_s 0x00ff /* Status */ -#define PSR_x 0xff00 /* Extension*/ -#define PSR_c 0x00ff /* Control */ - -/* - * ARMv7 groups of 
PSR bits - */ -#define APSR_MASK 0xf80f /* N, Z, C, V, Q and GE flags */ -#define PSR_ISET_MASK 0x0110 /* ISA state (J, T) mask
Re: [PATCH] [media] stk1160: Check return value of stk1160_read_reg() in stk1160_i2c_read_reg()
On Mon, Oct 15, 2012 at 9:03 PM, Jesper Juhl wrote:
> On Mon, 15 Oct 2012, Ezequiel Garcia wrote:
>
>> On Mon, Oct 15, 2012 at 7:52 PM, Jesper Juhl wrote:
>> > On Mon, 15 Oct 2012, Jesper Juhl wrote:
>> >
>> >> On Sat, 13 Oct 2012, Ezequiel Garcia wrote:
>> >>
>> > [...]
>> > Currently there are two checks for 'rc' being less than zero with no
>> > change to 'rc' between the two, so the second is just dead code.
>> > The intention seems to have been to assign the return value of
>> > 'stk1160_read_reg()' to 'rc' before the (currently dead) second check
>> > and then test /that/. This patch does that.
>> >
>>
>> This is an overly complicated explanation for such a small patch.
>> Can you try to simplify it?
>>
> How's this?
>
>
> From: Jesper Juhl
> Date: Sat, 13 Oct 2012 00:16:37 +0200
> Subject: [PATCH] [media] stk1160: Check return value of stk1160_read_reg() in stk1160_i2c_read_reg()
>
> Remember to collect the exit status from 'stk1160_read_reg()' in 'rc'
> before testing it for less than zero.
>
> Signed-off-by: Jesper Juhl
> ---
>  drivers/media/usb/stk1160/stk1160-i2c.c |    3 +--
>  1 files changed, 1 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/media/usb/stk1160/stk1160-i2c.c b/drivers/media/usb/stk1160/stk1160-i2c.c
> index 176ac93..a2370e4 100644
> --- a/drivers/media/usb/stk1160/stk1160-i2c.c
> +++ b/drivers/media/usb/stk1160/stk1160-i2c.c
> @@ -116,10 +116,9 @@ static int stk1160_i2c_read_reg(struct stk1160 *dev, u8 addr,
>         if (rc < 0)
>                 return rc;
>
> -       stk1160_read_reg(dev, STK1160_SBUSR_RD, value);
> +       rc = stk1160_read_reg(dev, STK1160_SBUSR_RD, value);
>         if (rc < 0)
>                 return rc;
> -

Sorry for the nitpick, but I'd like you to *not* remove this line.

Thanks
Ezequiel
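The bug pattern in this patch is a status check that tests a stale variable. A stripped-down sketch, assuming stub functions (`write_reg_stub`, `read_reg_stub`, `i2c_read_buggy` and `i2c_read_fixed` are made-up names, not the stk1160 driver's code) in which the second register access always fails:

```c
#include <assert.h>
#include <stddef.h>

/* Stub: the first step always succeeds. */
static int write_reg_stub(void)
{
	return 0;
}

/* Stub: the read step always fails, here with -5 standing in for -EIO. */
static int read_reg_stub(unsigned char *value)
{
	assert(value != NULL);
	return -5;
}

static int i2c_read_buggy(unsigned char *value)
{
	int rc = write_reg_stub();

	if (rc < 0)
		return rc;

	read_reg_stub(value);	/* return value dropped */
	if (rc < 0)		/* dead code: rc is still 0 here */
		return rc;
	return 0;
}

static int i2c_read_fixed(unsigned char *value)
{
	int rc = write_reg_stub();

	if (rc < 0)
		return rc;

	rc = read_reg_stub(value);	/* collect the status before testing it */
	if (rc < 0)
		return rc;
	return 0;
}
```

The buggy variant reports success even though the read failed; the fixed variant propagates the error, which is all the one-line change in the patch does.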
[PATCH] tools/include: use stdint types for user-space byteshift headers
From: Yaakov Selkowitz Commit a07f7672d7cf0ff0d6e548a9feb6e0bd016d9c6c added user-space copies of the byteshift headers to be used by hostprogs, changing e.g. u8 to __u8. However, in order to cross-compile the kernel from a non-Linux system, stdint.h types need to be used instead of linux/types.h types. Signed-off-by: Yaakov Selkowitz --- Ping 2; still hasn't been merged and no on-list response to original posted of 11 June. Also applies to linux-3.[3456].y tools/include/tools/be_byteshift.h | 34 +- tools/include/tools/le_byteshift.h | 34 +- 2 files changed, 34 insertions(+), 34 deletions(-) diff --git a/tools/include/tools/be_byteshift.h b/tools/include/tools/be_byteshift.h index f4912e2..84c17d8 100644 --- a/tools/include/tools/be_byteshift.h +++ b/tools/include/tools/be_byteshift.h @@ -1,68 +1,68 @@ #ifndef _TOOLS_BE_BYTESHIFT_H #define _TOOLS_BE_BYTESHIFT_H -#include +#include -static inline __u16 __get_unaligned_be16(const __u8 *p) +static inline uint16_t __get_unaligned_be16(const uint8_t *p) { return p[0] << 8 | p[1]; } -static inline __u32 __get_unaligned_be32(const __u8 *p) +static inline uint32_t __get_unaligned_be32(const uint8_t *p) { return p[0] << 24 | p[1] << 16 | p[2] << 8 | p[3]; } -static inline __u64 __get_unaligned_be64(const __u8 *p) +static inline uint64_t __get_unaligned_be64(const uint8_t *p) { - return (__u64)__get_unaligned_be32(p) << 32 | + return (uint64_t)__get_unaligned_be32(p) << 32 | __get_unaligned_be32(p + 4); } -static inline void __put_unaligned_be16(__u16 val, __u8 *p) +static inline void __put_unaligned_be16(uint16_t val, uint8_t *p) { *p++ = val >> 8; *p++ = val; } -static inline void __put_unaligned_be32(__u32 val, __u8 *p) +static inline void __put_unaligned_be32(uint32_t val, uint8_t *p) { __put_unaligned_be16(val >> 16, p); __put_unaligned_be16(val, p + 2); } -static inline void __put_unaligned_be64(__u64 val, __u8 *p) +static inline void __put_unaligned_be64(uint64_t val, uint8_t *p) { __put_unaligned_be32(val >> 32, p); 
__put_unaligned_be32(val, p + 4); } -static inline __u16 get_unaligned_be16(const void *p) +static inline uint16_t get_unaligned_be16(const void *p) { - return __get_unaligned_be16((const __u8 *)p); + return __get_unaligned_be16((const uint8_t *)p); } -static inline __u32 get_unaligned_be32(const void *p) +static inline uint32_t get_unaligned_be32(const void *p) { - return __get_unaligned_be32((const __u8 *)p); + return __get_unaligned_be32((const uint8_t *)p); } -static inline __u64 get_unaligned_be64(const void *p) +static inline uint64_t get_unaligned_be64(const void *p) { - return __get_unaligned_be64((const __u8 *)p); + return __get_unaligned_be64((const uint8_t *)p); } -static inline void put_unaligned_be16(__u16 val, void *p) +static inline void put_unaligned_be16(uint16_t val, void *p) { __put_unaligned_be16(val, p); } -static inline void put_unaligned_be32(__u32 val, void *p) +static inline void put_unaligned_be32(uint32_t val, void *p) { __put_unaligned_be32(val, p); } -static inline void put_unaligned_be64(__u64 val, void *p) +static inline void put_unaligned_be64(uint64_t val, void *p) { __put_unaligned_be64(val, p); } diff --git a/tools/include/tools/le_byteshift.h b/tools/include/tools/le_byteshift.h index c99d45a..8fe9f24 100644 --- a/tools/include/tools/le_byteshift.h +++ b/tools/include/tools/le_byteshift.h @@ -1,68 +1,68 @@ #ifndef _TOOLS_LE_BYTESHIFT_H #define _TOOLS_LE_BYTESHIFT_H -#include +#include -static inline __u16 __get_unaligned_le16(const __u8 *p) +static inline uint16_t __get_unaligned_le16(const uint8_t *p) { return p[0] | p[1] << 8; } -static inline __u32 __get_unaligned_le32(const __u8 *p) +static inline uint32_t __get_unaligned_le32(const uint8_t *p) { return p[0] | p[1] << 8 | p[2] << 16 | p[3] << 24; } -static inline __u64 __get_unaligned_le64(const __u8 *p) +static inline uint64_t __get_unaligned_le64(const uint8_t *p) { - return (__u64)__get_unaligned_le32(p + 4) << 32 | + return (uint64_t)__get_unaligned_le32(p + 4) << 32 | 
__get_unaligned_le32(p); } -static inline void __put_unaligned_le16(__u16 val, __u8 *p) +static inline void __put_unaligned_le16(uint16_t val, uint8_t *p) { *p++ = val; *p++ = val >> 8; } -static inline void __put_unaligned_le32(__u32 val, __u8 *p) +static inline void __put_unaligned_le32(uint32_t val, uint8_t *p) { __put_unaligned_le16(val >> 16, p + 2); __put_unaligned_le16(val, p); } -static inline void __put_unaligned_le64(__u64 val, __u8 *p) +static inline void __put_unaligned_le64(uint64_t val, uint8_t *p) { __put_unaligned_le32(val >> 32, p + 4); __put_unaligned_le32(val, p); }
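A sketch of what stdint-based helpers of this kind look like when built with nothing but the C standard headers, assuming local names (`get_be32`, `put_be32`) rather than the ones in the patched headers. The explicit uint32_t casts are an addition of this sketch, not part of the patch, so that shifting a byte with its top bit set does not go through a signed int:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read a big-endian 32-bit value from a possibly unaligned address. */
static uint32_t get_be32(const uint8_t *p)
{
	assert(p != NULL);
	return (uint32_t)p[0] << 24 | (uint32_t)p[1] << 16 |
	       (uint32_t)p[2] << 8 | (uint32_t)p[3];
}

/* Store a 32-bit value in big-endian byte order, one byte at a time. */
static void put_be32(uint32_t val, uint8_t *p)
{
	assert(p != NULL);
	p[0] = (uint8_t)(val >> 24);
	p[1] = (uint8_t)(val >> 16);
	p[2] = (uint8_t)(val >> 8);
	p[3] = (uint8_t)val;
}
```

Because only stdint.h types appear, the same source compiles on a non-Linux build host, which is the point of the change.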
RE: [PATCH 11/16] f2fs: add inode operations for special inodes
> On Sun, Oct 14, 2012 at 03:19:37PM +, Arnd Bergmann wrote: > > On Sunday 14 October 2012, Vyacheslav Dubeyko wrote: > > > On Oct 14, 2012, at 11:09 AM, Jaegeuk Kim wrote: > > > > 2012-10-14 (Sun), 02:21 +0400, Vyacheslav Dubeyko: > > > Extended attributes are more flexible way, from my point of view. The > > > xattr gives > > > possibility to make hint to filesystem at any time and without any > > > dependencies with > > > application's functional opportunities. Documented way of using such > > > extended attributes > > > gives to user flexible way of manipulation of filesystem behavior (but I > > > remember that > > > you don't believe in an user :-)). > > > > > > So, I think that fadvise() and extended attributes can be complementary > > > solutions. > > > > Right. Another option is to have ext4 style attributes, see > > http://linux.die.net/man/1/chattr > > Xattrs are much prefered to more "ext4 style" flags because xattrs > are filesystem independent. Indeed, some filesystems can't store any > new "ext4 style" flags without a change of disk format or > internally mapping them to an xattr. So really, xattrs are the best > way forward for such hints. > > > Unlike extended attributes, there is a limited number of those, > > and they can only be boolean flags, but that might be enough for > > this particular use case. > > A boolean is not sufficient for access policy hints. An extensible > xattr format is probably the best approach to take here, so that we > can easily introduce new access policy hints as functionality is > required. Indeed, an extensible xattr could start with just a > hot/cold boolean, and grow from there > > > The main reason I can see against extended attributes is that they are not > > stored > > very efficiently in f2fs, unless a lot of work is put into coming up with a > > good > > implementation. A single flags bit can trivially be added to the inode in > > comparison (if it's not there already).
> > That's a deficiency that should be corrected, then, because xattrs > are very common these days. IMO, most file systems including f2fs have some inefficiency to store and retrieve xattrs, since they have to allocate an additional block. The only distinct problem in f2fs is that there is a cleaning overhead. So, that's the why xattr is not an efficient way in f2fs. OTOH, I think xattr itself is for users, not for communicating between file system and users. Moreover, I'm not sure in the current android, but I saw ICS android did not call any xattr operations, even if mount option was enabled. > > And given that stuff like access frequency tracking is being > implemented at the VFS level, access policy hints should also be VFS > functionality. A bad filesystem implementation should not dictate > the interface for generically useful functionality > > > > Anyway, hardcoding or saving in filesystem list of file extensions is a > > > nasty way. It > > > can be not safe or hardly understandable by users the way of > > > reconfiguration filesystem > > > by means of tunefs or debugfs with the purpose of file extensions > > > addition in such > > > "black-box" as TV or smartphones, from my point of view. > > > > It is only a performance hint though, so it is not a correctness issue the > > file system gets it wrong. In order to do efficient garbage collection, a > > log > > structured file system should take all the information it can get about the > > expected life of data it writes. I agree that the list, even in the form of > > mkfs time settings, is not a clean abstraction, but in the place of an > > Android > > phone manufacturer I would still enable it if it promises a significant > > performance advantage over not using it. I guess it would be nice if this > > could be overridden in some form, e.g. using an ioctl on the file as ext4 > > does. 
> > An xattr on the root inode that holds a list like this is something > that could be set at mkfs time, but then also updated easily by new > software packages that are installed... > > > We should also take the kinds of access we have seen on a file into account. > > Yes, but it should be done at the VFS level, not in the filesystem > itself. Integrated into the current hot inode/range tracking that is > being worked on right now, I'd suggest. > > IOWs, these access policy issues are not unique to F2FS or it's use > case. Anything to do with access hints, policy, tracking, file > classification, etc that can influence data locality, reclaim, > migration, etc need to be dealt with at the VFS, independently of a > specific filesystem. Filesystems can make use of that information > how they please (whether in the kernel or via userspace tools), but > having filesystem specific interfaces and implementations of the > same functionality is extremely wasteful. Let's do it once, and do > it right the first time. ;) I agree that VFS should support something, but before then, it needs to do something by the file
Re: [RFC PATCH 1/3] mm: teach mm by current context info to not do I/O during memory allocation
On Mon, Oct 15, 2012 at 11:47 PM, Minchan Kim wrote:
> On Mon, Oct 15, 2012 at 01:14:17PM +0800, Ming Lei wrote:
>> This patch introduces PF_MEMALLOC_NOIO on process flag('flags' field of
>> 'struct task_struct'), so that the flag can be set by one task
>> to avoid doing I/O inside memory allocation in the task's context.
>>
>> The patch tries to solve one deadlock problem caused by block device,
>> and the problem can occur at least in the below situations:
>>
>> - during block device runtime resume situation, if memory allocation
>> with GFP_KERNEL is called inside runtime resume callback of any one
>> of its ancestors(or the block device itself), the deadlock may be
>> triggered inside the memory allocation since it might not complete
>> until the block device becomes active and the involved page I/O finishes.
>> The situation is pointed out first by Alan Stern. It is not a good
>> approach to convert all GFP_KERNEL in the path into GFP_NOIO because
>> several subsystems may be involved(for example, PCI, USB and SCSI may
>> be involved for usb mass storage device)
>
> Couldn't we expand pm_restrict_gfp_mask to cover resume path as well as
> suspend path?

IMO, we could, but it is not good and might trigger memory allocation
problems.

pm_restrict_gfp_mask uses the global variable gfp_allowed_mask to avoid
allocating pages with GFP_IOFS in all contexts during system sleep, when
processes have been frozen. But during runtime PM, the whole system is
running and all processes are runnable. Also, runtime PM is per device
and the whole system may have lots of devices, so taking the global
gfp_allowed_mask may keep page allocation at ~GFP_IOFS for a
considerable proportion of system running time, and alloc_page() will
then fail more easily.

The above deadlock problem may be fixed by allocating memory with
~GFP_IOFS only in the context of calling runtime_resume, and that is
the idea of the patch.

>
>>
>> - during error handling situation of usb mass storage device, USB
>> bus reset will be put on the device, so there shouldn't have any
>> memory allocation with GFP_KERNEL during USB bus reset, otherwise
>> the deadlock similar with above may be triggered. Unfortunately, any
>> usb device may include one mass storage interface in theory, so it
>> requires all usb interface drivers to handle the situation. In fact,
>> most usb drivers don't know how to handle bus reset on the device
>> and don't provide .pre_reset() and .post_reset() callback at all, so
>> USB core has to unbind and bind driver for these devices. So it
>> is still not practical to resort to GFP_NOIO for solving the problem.
>
> I hope this case could be handled by usb core like usb_restrict_gfp_mask
> rather than adding new branch on fast path.

See above, applying the global gfp_allowed_mask is not good.

Thanks,
--
Ming Lei
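The difference between a global mask and a per-task flag can be shown with plain bit masks. A toy model, assuming made-up flag values and names (`TOY_GFP_IO`, `TOY_PF_MEMALLOC_NOIO`, `effective_gfp` and so on); the real constants, and the place in the allocator where the mask is applied, are defined by the kernel and the patch, not by this sketch:

```c
#include <assert.h>
#include <stddef.h>

/* Toy allocation flags; the values are arbitrary. */
#define TOY_GFP_IO		0x1u
#define TOY_GFP_FS		0x2u
#define TOY_GFP_WAIT		0x4u
#define TOY_GFP_IOFS		(TOY_GFP_IO | TOY_GFP_FS)
#define TOY_GFP_KERNEL		(TOY_GFP_WAIT | TOY_GFP_IOFS)

/* Toy per-task flag, set only by the task doing the runtime resume. */
#define TOY_PF_MEMALLOC_NOIO	0x1u

struct toy_task {
	unsigned int flags;
};

/* Strip the I/O bits only for a task that asked for it. */
static unsigned int effective_gfp(const struct toy_task *t, unsigned int gfp)
{
	assert(t != NULL);
	if (t->flags & TOY_PF_MEMALLOC_NOIO)
		gfp &= ~TOY_GFP_IOFS;
	return gfp;
}
```

In this model a task that set the flag gets its TOY_GFP_KERNEL request reduced to TOY_GFP_WAIT, while every other task keeps the full mask. A global mask would restrict all of them for as long as any device is resuming, which is the objection raised in the reply.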
[PATCH 9/9] aoe: update driver-internal version number to 60
Signed-off-by: Ed Cashin
---
 drivers/block/aoe/aoe.h |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/block/aoe/aoe.h b/drivers/block/aoe/aoe.h
index 8e8da1c..536942b 100644
--- a/drivers/block/aoe/aoe.h
+++ b/drivers/block/aoe/aoe.h
@@ -1,5 +1,5 @@
 /* Copyright (c) 2012 Coraid, Inc. See COPYING for GPL terms. */
-#define VERSION "50"
+#define VERSION "60"
 #define AOE_MAJOR 152
 #define DEVICE_NAME "aoe"
--
1.7.1
Re: [PATCH v4 05/24] block: Use bio_sectors() more consistently
The aoe changes look OK, thanks. -- Ed Cashin ecas...@coraid.com
[PATCH 8/9] aoe: whitespace cleanup
Signed-off-by: Ed Cashin --- drivers/block/aoe/aoe.h |2 +- drivers/block/aoe/aoechr.c |2 +- drivers/block/aoe/aoecmd.c |6 +++--- drivers/block/aoe/aoemain.c |2 +- drivers/block/aoe/aoenet.c |4 ++-- 5 files changed, 8 insertions(+), 8 deletions(-) diff --git a/drivers/block/aoe/aoe.h b/drivers/block/aoe/aoe.h index 52f75c0..8e8da1c 100644 --- a/drivers/block/aoe/aoe.h +++ b/drivers/block/aoe/aoe.h @@ -151,7 +151,7 @@ struct aoedev { struct work_struct work;/* disk create work struct */ struct gendisk *gd; struct request_queue *blkq; - struct hd_geometry geo; + struct hd_geometry geo; sector_t ssize; struct timer_list timer; spinlock_t lock; diff --git a/drivers/block/aoe/aoechr.c b/drivers/block/aoe/aoechr.c index 2bf6273..42e67ad 100644 --- a/drivers/block/aoe/aoechr.c +++ b/drivers/block/aoe/aoechr.c @@ -287,7 +287,7 @@ aoechr_init(void) int n, i; n = register_chrdev(AOE_MAJOR, "aoechr", _fops); - if (n < 0) { + if (n < 0) { printk(KERN_ERR "aoe: can't register char device\n"); return n; } diff --git a/drivers/block/aoe/aoecmd.c b/drivers/block/aoe/aoecmd.c index 82e16c4..c491fba 100644 --- a/drivers/block/aoe/aoecmd.c +++ b/drivers/block/aoe/aoecmd.c @@ -978,7 +978,7 @@ ktiocomplete(struct frame *f) pr_err("aoe: ata error cmd=%2.2Xh stat=%2.2Xh from e%ld.%d\n", ahout->cmdstat, ahin->cmdstat, d->aoemajor, d->aoeminor); -noskb: if (buf) +noskb: if (buf) clear_bit(BIO_UPTODATE, >bio->bi_flags); goto badrsp; } @@ -1191,7 +1191,7 @@ aoecmd_cfg(ushort aoemajor, unsigned char aoeminor) aoecmd_cfg_pkts(aoemajor, aoeminor, ); aoenet_xmit(); } - + struct sk_buff * aoecmd_ata_id(struct aoedev *d) { @@ -1230,7 +1230,7 @@ aoecmd_ata_id(struct aoedev *d) return skb_clone(skb, GFP_ATOMIC); } - + static struct aoetgt * addtgt(struct aoedev *d, char *addr, ulong nframes) { diff --git a/drivers/block/aoe/aoemain.c b/drivers/block/aoe/aoemain.c index 04793c2..4b987c2 100644 --- a/drivers/block/aoe/aoemain.c +++ b/drivers/block/aoe/aoemain.c @@ -105,7 +105,7 @@ aoe_init(void) 
aoechr_exit(); chr_fail: aoedev_exit(); - + printk(KERN_INFO "aoe: initialisation failure.\n"); return ret; } diff --git a/drivers/block/aoe/aoenet.c b/drivers/block/aoe/aoenet.c index a1bb692..461b6c4 100644 --- a/drivers/block/aoe/aoenet.c +++ b/drivers/block/aoe/aoenet.c @@ -126,8 +126,8 @@ aoenet_xmit(struct sk_buff_head *queue) } } -/* - * (1) len doesn't include the header by default. I want this. +/* + * (1) len doesn't include the header by default. I want this. */ static int aoenet_rcv(struct sk_buff *skb, struct net_device *ifp, struct packet_type *pt, struct net_device *orig_dev) -- 1.7.1 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 7/9] aoe: cleanup: remove unused ata_scnt function
Signed-off-by: Ed Cashin
---
 drivers/block/aoe/aoecmd.c |   10 ----------
 1 files changed, 0 insertions(+), 10 deletions(-)

diff --git a/drivers/block/aoe/aoecmd.c b/drivers/block/aoe/aoecmd.c
index 2bb8c7d..82e16c4 100644
--- a/drivers/block/aoe/aoecmd.c
+++ b/drivers/block/aoe/aoecmd.c
@@ -552,16 +552,6 @@ sthtith(struct aoedev *d)
 	return 1;
 }
 
-static inline unsigned char
-ata_scnt(unsigned char *packet) {
-	struct aoe_hdr *h;
-	struct aoe_atahdr *ah;
-
-	h = (struct aoe_hdr *) packet;
-	ah = (struct aoe_atahdr *) (h+1);
-	return ah->scnt;
-}
-
 static void
 rexmit_timer(ulong vp)
 {
-- 
1.7.1
[PATCH 6/9] aoe: "payload" sysfs file exports per-AoE-command data transfer size
The userland aoetools package includes an "aoe-stat" command that can
display a "payload size" column when the aoe driver exports this
information.  Users can quickly see what amount of user data is
transferred inside each AoE command on the network, network headers
excluded.

Signed-off-by: Ed Cashin
---
 drivers/block/aoe/aoeblk.c |   10 ++++++++++
 1 files changed, 10 insertions(+), 0 deletions(-)

diff --git a/drivers/block/aoe/aoeblk.c b/drivers/block/aoe/aoeblk.c
index d5aa3b8..56736cd 100644
--- a/drivers/block/aoe/aoeblk.c
+++ b/drivers/block/aoe/aoeblk.c
@@ -98,6 +98,14 @@ static ssize_t aoedisk_show_fwver(struct device *dev,
 
 	return snprintf(page, PAGE_SIZE, "0x%04x\n", (unsigned int) d->fw_ver);
 }
+static ssize_t aoedisk_show_payload(struct device *dev,
+			struct device_attribute *attr, char *page)
+{
+	struct gendisk *disk = dev_to_disk(dev);
+	struct aoedev *d = disk->private_data;
+
+	return snprintf(page, PAGE_SIZE, "%lu\n", d->maxbcnt);
+}
 
 static DEVICE_ATTR(state, S_IRUGO, aoedisk_show_state, NULL);
 static DEVICE_ATTR(mac, S_IRUGO, aoedisk_show_mac, NULL);
@@ -106,12 +114,14 @@ static struct device_attribute dev_attr_firmware_version = {
 	.attr = { .name = "firmware-version", .mode = S_IRUGO },
 	.show = aoedisk_show_fwver,
 };
+static DEVICE_ATTR(payload, S_IRUGO, aoedisk_show_payload, NULL);
 
 static struct attribute *aoe_attrs[] = {
 	&dev_attr_state.attr,
 	&dev_attr_mac.attr,
 	&dev_attr_netif.attr,
 	&dev_attr_firmware_version.attr,
+	&dev_attr_payload.attr,
 	NULL,
 };
 
-- 
1.7.1
[PATCH 5/9] aoe: support larger I/O requests via aoe_maxsectors module param
The GPFS filesystem is an example of an aoe user that requires the aoe driver to support I/O request sizes larger than the default. Most users will not need large I/O request sizes, because they would need to be split up into multiple AoE commands anyway. Signed-off-by: Ed Cashin --- drivers/block/aoe/aoeblk.c |9 + 1 files changed, 9 insertions(+), 0 deletions(-) diff --git a/drivers/block/aoe/aoeblk.c b/drivers/block/aoe/aoeblk.c index 00dfc50..d5aa3b8 100644 --- a/drivers/block/aoe/aoeblk.c +++ b/drivers/block/aoe/aoeblk.c @@ -16,11 +16,18 @@ #include #include #include +#include #include "aoe.h" static DEFINE_MUTEX(aoeblk_mutex); static struct kmem_cache *buf_pool_cache; +/* GPFS needs a larger value than the default. */ +static int aoe_maxsectors; +module_param(aoe_maxsectors, int, 0644); +MODULE_PARM_DESC(aoe_maxsectors, + "When nonzero, set the maximum number of sectors per I/O request"); + static ssize_t aoedisk_show_state(struct device *dev, struct device_attribute *attr, char *page) { @@ -248,6 +255,8 @@ aoeblk_gdalloc(void *vp) d->blkq = gd->queue = q; q->queuedata = d; d->gd = gd; + if (aoe_maxsectors) + blk_queue_max_hw_sectors(q, aoe_maxsectors); gd->major = AOE_MAJOR; gd->first_minor = d->sysminor; gd->fops = _bdops; -- 1.7.1 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
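The remark that large requests "would need to be split up into multiple AoE commands anyway" comes down to a rounding-up division. The following is only a userspace sketch of that arithmetic: `cmds_needed` and `SECTOR_SIZE` are hypothetical names, not driver symbols, and a 512-byte sector is assumed.

```c
#define SECTOR_SIZE 512UL	/* assumed sector size in bytes */

/* Hypothetical helper: number of AoE commands needed to carry one I/O
 * request of nsectors sectors when each command carries at most
 * payload bytes of user data.  Returns 0 when payload is 0, i.e. when
 * the per-command size is not known. */
static unsigned long cmds_needed(unsigned long nsectors,
				 unsigned long payload)
{
	unsigned long bytes = nsectors * SECTOR_SIZE;

	if (payload == 0)
		return 0;
	return (bytes + payload - 1) / payload;	/* round up */
}
```

With example values, a 2048-sector (1 MiB) request and an 8192-byte payload work out to 128 commands, so raising the request size changes how much the block layer hands down at once, not how much travels in each frame.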
[PATCH 4/9] aoe: support the forgetting (flushing) of a user-specified AoE target
Users sometimes want to cause the aoe driver to forget a particular previously discovered device when it is no longer online. The aoetools provide an "aoe-flush" command that users run to perform this administrative task. The changes below provide the support needed in the driver. Signed-off-by: Ed Cashin --- drivers/block/aoe/aoedev.c | 44 ++-- 1 files changed, 38 insertions(+), 6 deletions(-) diff --git a/drivers/block/aoe/aoedev.c b/drivers/block/aoe/aoedev.c index 90e5b53..63b2660 100644 --- a/drivers/block/aoe/aoedev.c +++ b/drivers/block/aoe/aoedev.c @@ -241,6 +241,30 @@ aoedev_freedev(struct aoedev *d) kfree(d); } +/* return whether the user asked for this particular + * device to be flushed + */ +static int +user_req(char *s, size_t slen, struct aoedev *d) +{ + char *p; + size_t lim; + + if (!d->gd) + return 0; + p = strrchr(d->gd->disk_name, '/'); + if (!p) + p = d->gd->disk_name; + else + p += 1; + lim = sizeof(d->gd->disk_name); + lim -= p - d->gd->disk_name; + if (slen < lim) + lim = slen; + + return !strncmp(s, p, lim); +} + int aoedev_flush(const char __user *str, size_t cnt) { @@ -249,6 +273,7 @@ aoedev_flush(const char __user *str, size_t cnt) struct aoedev *rmd = NULL; char buf[16]; int all = 0; + int specified = 0; /* flush a specific device */ if (cnt >= 3) { if (cnt > sizeof buf) @@ -256,26 +281,33 @@ aoedev_flush(const char __user *str, size_t cnt) if (copy_from_user(buf, str, cnt)) return -EFAULT; all = !strncmp(buf, "all", 3); + if (!all) + specified = 1; } spin_lock_irqsave(_lock, flags); dd = while ((d = *dd)) { spin_lock(>lock); - if ((!all && (d->flags & DEVFL_UP)) + if (specified) { + if (!user_req(buf, cnt, d)) + goto skip; + } else if ((!all && (d->flags & DEVFL_UP)) || (d->flags & (DEVFL_GDALLOC|DEVFL_NEWSIZE)) || d->nopen - || d->ref) { - spin_unlock(>lock); - dd = >next; - continue; - } + || d->ref) + goto skip; + *dd = d->next; aoedev_downdev(d); d->flags |= DEVFL_TKILL; spin_unlock(>lock); d->next = rmd; rmd = d; + continue; 
+skip: + spin_unlock(>lock); + dd = >next; } spin_unlock_irqrestore(_lock, flags); while ((d = rmd)) { -- 1.7.1 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
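The matching rule in user_req() can be tried outside the kernel. This is a sketch under stated assumptions: `name_matches` and `NAME_LEN` are hypothetical stand-ins, the buffer size of 32 is assumed, and a disk name of the form "etherd/e1.2" is assumed for the example.

```c
#include <string.h>

#define NAME_LEN 32	/* assumed size of the disk_name buffer */

/* Hypothetical userspace stand-in for user_req(): return nonzero when
 * the first slen bytes of s match the final path component of
 * disk_name, which must be a NUL-terminated buffer of NAME_LEN bytes. */
static int name_matches(const char *s, size_t slen, const char *disk_name)
{
	const char *p = strrchr(disk_name, '/');
	size_t lim;

	/* Skip past the last '/', or use the whole name if there is none. */
	p = p ? p + 1 : disk_name;

	/* Never compare beyond the end of the name buffer ... */
	lim = NAME_LEN - (size_t)(p - disk_name);
	/* ... nor beyond the end of the user's string. */
	if (slen < lim)
		lim = slen;

	return !strncmp(s, p, lim);
}
```

Two properties of the sketch follow from bounding the comparison by the length of the user's string: a shorter string such as "e1" also matches "e1.2", and a string with a trailing newline does not match.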
[PATCH 3/9] aoe: update cap on outstanding commands based on config query response
The ATA over Ethernet config query response contains a "buffer count"
field reflecting the AoE target's capacity to buffer incoming AoE
commands.  By taking the current value of this field into account, we
increase throughput when the value has increased and avoid network
congestion when it has decreased.

Signed-off-by: Ed Cashin
---
 drivers/block/aoe/aoe.h    |    6 +++---
 drivers/block/aoe/aoecmd.c |    6 +++++-
 2 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/drivers/block/aoe/aoe.h b/drivers/block/aoe/aoe.h
index d2ed7f1..52f75c0 100644
--- a/drivers/block/aoe/aoe.h
+++ b/drivers/block/aoe/aoe.h
@@ -122,14 +122,14 @@ struct aoeif {
 
 struct aoetgt {
 	unsigned char addr[6];
-	ushort nframes;
+	ushort nframes;		/* cap on frames to use */
 	struct aoedev *d;	/* parent device I belong to */
 	struct list_head ffree;	/* list of free frames */
 	struct aoeif ifs[NAOEIFS];
 	struct aoeif *ifp;	/* current aoeif in use */
 	ushort nout;
-	ushort maxout;
-	ulong falloc;
+	ushort maxout;		/* current value for max outstanding */
+	ulong falloc;		/* number of allocated frames */
 	ulong lastwadj;		/* last window adjustment */
 	int minbcnt;
 	int wpkts, rpkts;
diff --git a/drivers/block/aoe/aoecmd.c b/drivers/block/aoe/aoecmd.c
index 3804a0a..2bb8c7d 100644
--- a/drivers/block/aoe/aoecmd.c
+++ b/drivers/block/aoe/aoecmd.c
@@ -1373,7 +1373,11 @@ aoecmd_cfg_rsp(struct sk_buff *skb)
 	spin_lock_irqsave(&d->lock, flags);
 
 	t = gettgt(d, h->src);
-	if (!t) {
+	if (t) {
+		t->nframes = n;
+		if (n < t->maxout)
+			t->maxout = n;
+	} else {
 		t = addtgt(d, h->src, n);
 		if (!t)
 			goto bail;
-- 
1.7.1
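The update rule for an already-known target can be isolated into a few lines. This is a userspace sketch, not driver code: `struct tgt_window` and `update_cap` are hypothetical names standing in for the two struct aoetgt fields and the new branch in aoecmd_cfg_rsp().

```c
/* Simplified stand-in for the two struct aoetgt fields the patch uses. */
struct tgt_window {
	unsigned short nframes;	/* cap on frames to use */
	unsigned short maxout;	/* current value for max outstanding */
};

/* Hypothetical helper: apply a config query response that advertises a
 * buffer count of n.  The cap always follows n; the current window is
 * only ever lowered here, never raised. */
static void update_cap(struct tgt_window *t, unsigned short n)
{
	t->nframes = n;
	if (n < t->maxout)
		t->maxout = n;
}
```

A smaller buffer count takes effect on maxout at once, while a larger one only raises the cap. How maxout later grows toward the new cap is outside this hunk.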
[PATCH 2/9] aoe: print warning regarding a common reason for dropped transmits
Dropped transmits are not common, but when they do occur, increasing
the transmit queue length often helps.

Signed-off-by: Ed Cashin
---
 drivers/block/aoe/aoenet.c |   11 +++++++++--
 1 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/drivers/block/aoe/aoenet.c b/drivers/block/aoe/aoenet.c
index 162c647..a1bb692 100644
--- a/drivers/block/aoe/aoenet.c
+++ b/drivers/block/aoe/aoenet.c
@@ -50,7 +50,11 @@ __setup("aoe_iflist=", aoe_iflist_setup);
 static spinlock_t txlock;
 static struct sk_buff_head skbtxq;
 
-/* enters with txlock held */
+/* enters with txlock held
+ *
+ * Use __must_hold() for sparse when upcoming patch adds it to
+ * compiler.h.
+ */
 static int
 tx(void)
 {
@@ -58,7 +62,10 @@ tx(void)
 
 	while ((skb = skb_dequeue(&skbtxq))) {
 		spin_unlock_irq(&txlock);
-		dev_queue_xmit(skb);
+		if (dev_queue_xmit(skb) == NET_XMIT_DROP && net_ratelimit())
+			pr_warn("aoe: packet could not be sent on %s. %s\n",
+				skb->dev ? skb->dev->name : "netif",
+				"consider increasing tx_queue_len");
 		spin_lock_irq(&txlock);
 	}
 	return 0;
-- 
1.7.1
[PATCH 1/9] aoe: describe the behavior of the "err" character device
Signed-off-by: Ed Cashin
---
 drivers/block/aoe/aoechr.c |    5 +++++
 1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/drivers/block/aoe/aoechr.c b/drivers/block/aoe/aoechr.c
index ed57a89..2bf6273 100644
--- a/drivers/block/aoe/aoechr.c
+++ b/drivers/block/aoe/aoechr.c
@@ -39,6 +39,11 @@ struct ErrMsg {
 };
 
 static DEFINE_MUTEX(aoechr_mutex);
+
+/* A ring buffer of error messages, to be read through
+ * "/dev/etherd/err".  When no messages are present,
+ * readers will block waiting for messages to appear.
+ */
 static struct ErrMsg emsgs[NMSG];
 static int emsgs_head_idx, emsgs_tail_idx;
 static struct completion emsgs_comp;
-- 
1.7.1
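For readers unfamiliar with the head/tail index scheme the new comment describes, here is a minimal single-threaded sketch. It is not the driver's implementation: `struct msg_ring`, `ring_put` and `ring_get` are hypothetical names, the capacity of 4 is assumed, and the blocking behaviour is reduced to returning NULL. What the driver does when the ring is full is not shown in the patch, so this sketch simply rejects the message.

```c
#include <stddef.h>

#define NMSG 4	/* assumed capacity; the driver's NMSG may differ */

struct msg_ring {
	const char *msgs[NMSG];
	int head;	/* next slot to read */
	int tail;	/* next slot to write */
	int count;	/* number of queued messages */
};

/* Queue a message; returns 0 on success, -1 when the ring is full. */
static int ring_put(struct msg_ring *r, const char *m)
{
	if (r->count == NMSG)
		return -1;
	r->msgs[r->tail] = m;
	r->tail = (r->tail + 1) % NMSG;
	r->count++;
	return 0;
}

/* Dequeue the oldest message, or return NULL when the ring is empty.
 * A blocking reader would wait here instead of returning NULL. */
static const char *ring_get(struct msg_ring *r)
{
	const char *m;

	if (r->count == 0)
		return NULL;
	m = r->msgs[r->head];
	r->head = (r->head + 1) % NMSG;
	r->count--;
	return m;
}
```

Messages come out in the order they went in, and both indices wrap around at NMSG.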
[PATCH 0/9] aoe: various enhancements and cleanup from v50 to v60
This patch series is based on linux-next/akpm from 11 Oct.

The patch that modifies aoenet.c:tx to print a warning does not affect
locking but nonetheless causes a new sparse context warning to appear.
Previously, a bug in sparse suppressed the warning.

We will soon be able to use the new __must_hold() macro that now
appears only in mm (not in linux-next/akpm), making the warning go away
by telling sparse that the tx function enters and exits with a lock
held.

Ed L. Cashin (9):
  aoe: describe the behavior of the "err" character device
  aoe: print warning regarding a common reason for dropped transmits
  aoe: update cap on outstanding commands based on config query response
  aoe: support the forgetting (flushing) of a user-specified AoE target
  aoe: support larger I/O requests via aoe_maxsectors module param
  aoe: "payload" sysfs file exports per-AoE-command data transfer size
  aoe: cleanup: remove unused ata_scnt function
  aoe: whitespace cleanup
  aoe: update driver-internal version number to 60

 drivers/block/aoe/aoe.h     |   10
 drivers/block/aoe/aoeblk.c  |   19
 drivers/block/aoe/aoechr.c  |    7
 drivers/block/aoe/aoecmd.c  |   22
 drivers/block/aoe/aoedev.c  |   44
 drivers/block/aoe/aoemain.c |    2
 drivers/block/aoe/aoenet.c  |   15
 7 files changed, 88 insertions(+), 31 deletions(-)
[PATCH] perf probe: convert_name_to_addr() allocated the wrong size buffer for a function name
convert_name_to_addr() allocated sizeof(char *) * MAX_PROBE_ARGS bytes
for a function name

Cc: Masami Hiramatsu
Cc: Srikar Dronamraju
Signed-off-by: Hyeoncheol Lee
---
 tools/perf/util/probe-event.c |    5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/tools/perf/util/probe-event.c b/tools/perf/util/probe-event.c
index 49a256e..bb40ed4 100644
--- a/tools/perf/util/probe-event.c
+++ b/tools/perf/util/probe-event.c
@@ -2352,13 +2352,14 @@ static int convert_name_to_addr(struct perf_probe_event *pev, const char *exec)
 		free(exec_copy);
 	}
 	free(pp->function);
-	pp->function = zalloc(sizeof(char *) * MAX_PROBE_ARGS);
+	pp->function = zalloc(sizeof(char) *
+			      (3 + sizeof(unsigned long long) * 2));
 	if (!pp->function) {
 		ret = -ENOMEM;
 		pr_warning("Failed to allocate memory by zalloc.\n");
 		goto out;
 	}
-	e_snprintf(pp->function, MAX_PROBE_ARGS, "0x%llx", vaddr);
+	sprintf(pp->function, "0x%llx", vaddr);
 	ret = 0;
 
 out:
-- 
1.7.10.4
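The new size expression can be checked by hand: "0x" takes two bytes, each byte of an unsigned long long prints as at most two hex digits, and one more byte holds the terminating NUL, giving 3 + sizeof(unsigned long long) * 2. The sketch below illustrates that bound in userspace. `format_vaddr` and `VADDR_STR_SIZE` are hypothetical names, not perf symbols, and the sketch uses calloc() and snprintf() where the patch uses zalloc() and sprintf().

```c
#include <stdio.h>
#include <stdlib.h>

/* Worst-case size of "0x%llx": 2 bytes for "0x", two hex digits per
 * byte of unsigned long long, and 1 byte for the terminating NUL. */
#define VADDR_STR_SIZE (3 + sizeof(unsigned long long) * 2)

/* Hypothetical helper (not a perf function): return a freshly
 * allocated string holding vaddr in hex, or NULL when allocation
 * fails.  The caller frees the result. */
static char *format_vaddr(unsigned long long vaddr)
{
	char *s = calloc(1, VADDR_STR_SIZE);

	if (!s)
		return NULL;
	/* snprintf() keeps the write bounded even if the format changes. */
	snprintf(s, VADDR_STR_SIZE, "0x%llx", vaddr);
	return s;
}
```

With a 64-bit unsigned long long the buffer is 19 bytes, which is exactly what the largest value, 0xffffffffffffffff, needs including its NUL.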
Re: [Bug fix] nfs-client: fix nfs_inode_attrs_need_update for async read_done comes during truncating to smaller size
On 2012-10-15 20:32, Myklebust, Trond wrote:
> RPC is not ordered. The fact that we get one RPC reply before another
> does not mean that the server sent them in that order.
>
> This is doubly true when you use UDP as the transport protocol.

1) Does this mean that nfs_inode_attrs_need_update need not consider the
   async read_done situation?

2) For correctness, I do not think that
   "nfs_size_to_loff_t(fattr->size) > i_size_read(inode)" in
   nfs_size_need_update is enough (at least it needs to use "!=" instead
   of ">"). Do you agree?

3) Some further reference points:
   A) An old kernel version (such as 2.6.27-rc9) has no such issue,
      because it did not have nfs_size_need_update.
   B) The test tools I use are from the LTP (Linux Test Project); they
      use both UDP and TCP to test both NFSv2 and NFSv3.
   C) LTP does have its limitations: for the stress test, LTP puts the
      NFS client and server on the same machine, which causes kernel
      stability issues. For the net test, however, LTP uses different
      machines (I hit our issue in the LTP net test).

-- 
Chen Gang
Asianux Corporation
Re: [Q] Default SLAB allocator
Hello, Eric.

2012/10/14 Eric Dumazet :
> SLUB was really bad in the common workload you describe (allocations
> done by one cpu, freeing done by other cpus), because all kfree() hit
> the slow path and cpus contend in __slab_free() in the loop guarded by
> cmpxchg_double_slab(). SLAB has a cache for this, while SLUB directly
> hit the main "struct page" to add the freed object to freelist.

Could you elaborate more on how 'netperf RR' makes the kernel do
"allocations done by one cpu, freeing done by other cpus", please?
I don't have enough background in the network subsystem, so I'm just
curious.

> I played some months ago adding a percpu associative cache to SLUB, then
> just moved on other strategy.
>
> (Idea for this per cpu cache was to build a temporary free list of
> objects to batch accesses to struct page)

Was this implemented and submitted? If so, could you send me a link to
the patches?

Thanks!
[PATCH 3/7] kdb: Rename kdb_register_repeat() to kdb_register_flags()
We're about to add more options for commands behaviour, so let's give a more generic name to the low-level kdb command registration function. There are just various renames, no functional changes. Signed-off-by: Anton Vorontsov --- include/linux/kdb.h | 10 +++--- kernel/debug/kdb/kdb_bp.c | 16 - kernel/debug/kdb/kdb_main.c | 88 ++--- kernel/trace/trace_kdb.c| 2 +- 4 files changed, 58 insertions(+), 58 deletions(-) diff --git a/include/linux/kdb.h b/include/linux/kdb.h index cbd1c28..0142cd3 100644 --- a/include/linux/kdb.h +++ b/include/linux/kdb.h @@ -145,17 +145,17 @@ static inline const char *kdb_walk_kallsyms(loff_t *pos) /* Dynamic kdb shell command registration */ extern int kdb_register(char *, kdb_func_t, char *, char *, short); -extern int kdb_register_repeat(char *, kdb_func_t, char *, char *, - short, kdb_cmdflags_t); +extern int kdb_register_flags(char *, kdb_func_t, char *, char *, + short, kdb_cmdflags_t); extern int kdb_unregister(char *); #else /* ! CONFIG_KGDB_KDB */ static inline __printf(1, 2) int kdb_printf(const char *fmt, ...) 
{ return 0; } static inline void kdb_init(int level) {} static inline int kdb_register(char *cmd, kdb_func_t func, char *usage, char *help, short minlen) { return 0; } -static inline int kdb_register_repeat(char *cmd, kdb_func_t func, char *usage, - char *help, short minlen, - kdb_repeat_t repeat) { return 0; } +static inline int kdb_register_flags(char *cmd, kdb_func_t func, char *usage, +char *help, short minlen, +kdb_repeat_t repeat) { return 0; } static inline int kdb_unregister(char *cmd) { return 0; } #endif /* CONFIG_KGDB_KDB */ enum { diff --git a/kernel/debug/kdb/kdb_bp.c b/kernel/debug/kdb/kdb_bp.c index 8418c2f..d2cb80d 100644 --- a/kernel/debug/kdb/kdb_bp.c +++ b/kernel/debug/kdb/kdb_bp.c @@ -545,23 +545,23 @@ void __init kdb_initbptab(void) for (i = 0, bp = kdb_breakpoints; i < KDB_MAXBPT; i++, bp++) bp->bp_free = 1; - kdb_register_repeat("bp", kdb_bp, "[]", + kdb_register_flags("bp", kdb_bp, "[]", "Set/Display breakpoints", 0, KDB_REPEAT_NO_ARGS); - kdb_register_repeat("bl", kdb_bp, "[]", + kdb_register_flags("bl", kdb_bp, "[]", "Display breakpoints", 0, KDB_REPEAT_NO_ARGS); if (arch_kgdb_ops.flags & KGDB_HW_BREAKPOINT) - kdb_register_repeat("bph", kdb_bp, "[]", + kdb_register_flags("bph", kdb_bp, "[]", "[datar [length]|dataw [length]] Set hw brk", 0, KDB_REPEAT_NO_ARGS); - kdb_register_repeat("bc", kdb_bc, "", + kdb_register_flags("bc", kdb_bc, "", "Clear Breakpoint", 0, KDB_REPEAT_NONE); - kdb_register_repeat("be", kdb_bc, "", + kdb_register_flags("be", kdb_bc, "", "Enable Breakpoint", 0, KDB_REPEAT_NONE); - kdb_register_repeat("bd", kdb_bc, "", + kdb_register_flags("bd", kdb_bc, "", "Disable Breakpoint", 0, KDB_REPEAT_NONE); - kdb_register_repeat("ss", kdb_ss, "", + kdb_register_flags("ss", kdb_ss, "", "Single Step", 1, KDB_REPEAT_NO_ARGS); - kdb_register_repeat("ssb", kdb_ss, "", + kdb_register_flags("ssb", kdb_ss, "", "Single step to branch/call", 0, KDB_REPEAT_NO_ARGS); /* * Architecture dependent initialization. 
diff --git a/kernel/debug/kdb/kdb_main.c b/kernel/debug/kdb/kdb_main.c index c7a1797..bae9a1d 100644 --- a/kernel/debug/kdb/kdb_main.c +++ b/kernel/debug/kdb/kdb_main.c @@ -2683,7 +2683,7 @@ static int kdb_grep_help(int argc, const char **argv) } /* - * kdb_register_repeat - This function is used to register a kernel + * kdb_register_flags - This function is used to register a kernel * debugger command. * Inputs: * cmd Command name @@ -2695,12 +2695,12 @@ static int kdb_grep_help(int argc, const char **argv) * zero for success, one if a duplicate command. */ #define kdb_command_extend 50 /* arbitrary */ -int kdb_register_repeat(char *cmd, - kdb_func_t func, - char *usage, - char *help, - short minlen, - kdb_cmdflags_t flags) +int kdb_register_flags(char *cmd, + kdb_func_t func, + char *usage, + char *help, + short minlen, + kdb_cmdflags_t flags) { int i; kdbtab_t *kp; @@ -2753,13 +2753,13 @@ int kdb_register_repeat(char *cmd, return 0; } -EXPORT_SYMBOL_GPL(kdb_register_repeat); +EXPORT_SYMBOL_GPL(kdb_register_flags); /* * kdb_register - Compatibility
[PATCH 2/7] kdb: Rename kdb_repeat_t to kdb_cmdflags_t, cmd_repeat to cmd_flags
We're about to add more options for command behaviour, so let's expand the meaning of kdb_repeat_t. So far we just do various renames, there should be no functional changes. Signed-off-by: Anton Vorontsov --- include/linux/kdb.h| 4 ++-- kernel/debug/kdb/kdb_main.c| 6 +++--- kernel/debug/kdb/kdb_private.h | 2 +- 3 files changed, 6 insertions(+), 6 deletions(-) diff --git a/include/linux/kdb.h b/include/linux/kdb.h index 7f6fe6e..cbd1c28 100644 --- a/include/linux/kdb.h +++ b/include/linux/kdb.h @@ -17,7 +17,7 @@ typedef enum { KDB_REPEAT_NONE = 0,/* Do not repeat this command */ KDB_REPEAT_NO_ARGS, /* Repeat the command without arguments */ KDB_REPEAT_WITH_ARGS, /* Repeat the command including its arguments */ -} kdb_repeat_t; +} kdb_cmdflags_t; typedef int (*kdb_func_t)(int, const char **); @@ -146,7 +146,7 @@ static inline const char *kdb_walk_kallsyms(loff_t *pos) /* Dynamic kdb shell command registration */ extern int kdb_register(char *, kdb_func_t, char *, char *, short); extern int kdb_register_repeat(char *, kdb_func_t, char *, char *, - short, kdb_repeat_t); + short, kdb_cmdflags_t); extern int kdb_unregister(char *); #else /* ! CONFIG_KGDB_KDB */ static inline __printf(1, 2) int kdb_printf(const char *fmt, ...) 
{ return 0; } diff --git a/kernel/debug/kdb/kdb_main.c b/kernel/debug/kdb/kdb_main.c index cdaaa52..c7a1797 100644 --- a/kernel/debug/kdb/kdb_main.c +++ b/kernel/debug/kdb/kdb_main.c @@ -992,7 +992,7 @@ int kdb_parse(const char *cmdstr) if (result && ignore_errors && result > KDB_CMD_GO) result = 0; KDB_STATE_CLEAR(CMD); - switch (tp->cmd_repeat) { + switch (tp->cmd_flags) { case KDB_REPEAT_NONE: argc = 0; if (argv[0]) @@ -2700,7 +2700,7 @@ int kdb_register_repeat(char *cmd, char *usage, char *help, short minlen, - kdb_repeat_t repeat) + kdb_cmdflags_t flags) { int i; kdbtab_t *kp; @@ -2749,7 +2749,7 @@ int kdb_register_repeat(char *cmd, kp->cmd_usage = usage; kp->cmd_help = help; kp->cmd_minlen = minlen; - kp->cmd_repeat = repeat; + kp->cmd_flags = flags; return 0; } diff --git a/kernel/debug/kdb/kdb_private.h b/kernel/debug/kdb/kdb_private.h index f8245b3..9e1b8e9 100644 --- a/kernel/debug/kdb/kdb_private.h +++ b/kernel/debug/kdb/kdb_private.h @@ -177,7 +177,7 @@ typedef struct _kdbtab { char*cmd_help; /* Help message for this command */ shortcmd_minlen;/* Minimum legal # command * chars required */ - kdb_repeat_t cmd_repeat;/* Does command auto repeat on enter? */ + kdb_cmdflags_t cmd_flags; /* Command behaviour flags */ } kdbtab_t; extern int kdb_bt(int, const char **); /* KDB display back trace */ -- 1.7.12.3 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 5/7] kdb: Remove KDB_REPEAT_NONE flag
Since we now treat KDB_REPEAT_* as flags, there is no need to pass KDB_REPEAT_NONE. It's just the default behaviour when no flags are specified. Signed-off-by: Anton Vorontsov --- include/linux/kdb.h | 1 - kernel/debug/kdb/kdb_bp.c | 6 ++--- kernel/debug/kdb/kdb_main.c | 61 ++--- kernel/trace/trace_kdb.c| 2 +- 4 files changed, 34 insertions(+), 36 deletions(-) diff --git a/include/linux/kdb.h b/include/linux/kdb.h index c6f1ec3..792779c 100644 --- a/include/linux/kdb.h +++ b/include/linux/kdb.h @@ -14,7 +14,6 @@ */ typedef enum { - KDB_REPEAT_NONE = 0,/* Do not repeat this command */ KDB_REPEAT_NO_ARGS = 0x1, /* Repeat the command w/o arguments */ KDB_REPEAT_WITH_ARGS= 0x2, /* Repeat the command w/ its arguments */ } kdb_cmdflags_t; diff --git a/kernel/debug/kdb/kdb_bp.c b/kernel/debug/kdb/kdb_bp.c index d2cb80d..928e9e9 100644 --- a/kernel/debug/kdb/kdb_bp.c +++ b/kernel/debug/kdb/kdb_bp.c @@ -553,11 +553,11 @@ void __init kdb_initbptab(void) kdb_register_flags("bph", kdb_bp, "[]", "[datar [length]|dataw [length]] Set hw brk", 0, KDB_REPEAT_NO_ARGS); kdb_register_flags("bc", kdb_bc, "", - "Clear Breakpoint", 0, KDB_REPEAT_NONE); + "Clear Breakpoint", 0, 0); kdb_register_flags("be", kdb_bc, "", - "Enable Breakpoint", 0, KDB_REPEAT_NONE); + "Enable Breakpoint", 0, 0); kdb_register_flags("bd", kdb_bc, "", - "Disable Breakpoint", 0, KDB_REPEAT_NONE); + "Disable Breakpoint", 0, 0); kdb_register_flags("ss", kdb_ss, "", "Single Step", 1, KDB_REPEAT_NO_ARGS); diff --git a/kernel/debug/kdb/kdb_main.c b/kernel/debug/kdb/kdb_main.c index 7245bab..172b726 100644 --- a/kernel/debug/kdb/kdb_main.c +++ b/kernel/debug/kdb/kdb_main.c @@ -2752,7 +2752,7 @@ EXPORT_SYMBOL_GPL(kdb_register_flags); /* * kdb_register - Compatibility register function for commands that do * not need to specify a repeat state. Equivalent to - * kdb_register_flags with KDB_REPEAT_NONE. + * kdb_register_flags with flags set to 0. 
* Inputs: * cmd Command name * funcFunction to execute the command @@ -2767,8 +2767,7 @@ int kdb_register(char *cmd, char *help, short minlen) { - return kdb_register_flags(cmd, func, usage, help, minlen, - KDB_REPEAT_NONE); + return kdb_register_flags(cmd, func, usage, help, minlen, 0); } EXPORT_SYMBOL_GPL(kdb_register); @@ -2822,70 +2821,70 @@ static void __init kdb_inittab(void) kdb_register_flags("mm", kdb_mm, " ", "Modify Memory Contents", 0, KDB_REPEAT_NO_ARGS); kdb_register_flags("go", kdb_go, "[]", - "Continue Execution", 1, KDB_REPEAT_NONE); + "Continue Execution", 1, 0); kdb_register_flags("rd", kdb_rd, "", - "Display Registers", 0, KDB_REPEAT_NONE); + "Display Registers", 0, 0); kdb_register_flags("rm", kdb_rm, " ", - "Modify Registers", 0, KDB_REPEAT_NONE); + "Modify Registers", 0, 0); kdb_register_flags("ef", kdb_ef, "", - "Display exception frame", 0, KDB_REPEAT_NONE); + "Display exception frame", 0, 0); kdb_register_flags("bt", kdb_bt, "[]", - "Stack traceback", 1, KDB_REPEAT_NONE); + "Stack traceback", 1, 0); kdb_register_flags("btp", kdb_bt, "", - "Display stack for process ", 0, KDB_REPEAT_NONE); + "Display stack for process ", 0, 0); kdb_register_flags("bta", kdb_bt, "[DRSTCZEUIMA]", - "Display stack all processes", 0, KDB_REPEAT_NONE); + "Display stack all processes", 0, 0); kdb_register_flags("btc", kdb_bt, "", - "Backtrace current process on each cpu", 0, KDB_REPEAT_NONE); + "Backtrace current process on each cpu", 0, 0); kdb_register_flags("btt", kdb_bt, "", "Backtrace process given its struct task address", 0, - KDB_REPEAT_NONE); + 0); kdb_register_flags("ll", kdb_ll, " ", - "Execute cmd for each element in linked list", 0, KDB_REPEAT_NONE); + "Execute cmd for each element in linked list", 0, 0); kdb_register_flags("env", kdb_env, "", - "Show environment variables", 0, KDB_REPEAT_NONE); + "Show environment variables", 0, 0); kdb_register_flags("set", kdb_set, "", - "Set environment variables", 0, KDB_REPEAT_NONE); + "Set environment 
variables", 0, 0); kdb_register_flags("help", kdb_help, "", - "Display Help Message", 1, KDB_REPEAT_NONE); + "Display Help Message", 1, 0); kdb_register_flags("?", kdb_help, "", - "Display Help Message", 0, KDB_REPEAT_NONE); + "Display Help Message", 0, 0); kdb_register_flags("cpu",
[PATCH 7/7] kdb: Add kiosk mode
By issuing 'echo 1 > /sys/module/kdb/parameters/kiosk' or booting with kdb.kiosk=1 kernel command line option, one can still have a somewhat usable debugging facility, but not fearing that the debugger can be used to easily gain root access or dump sensitive data. Without the kiosk mode, obtaining the root rights via KDB is a matter of a few commands, and works everywhere. For example, log in as a normal user: cbou:~$ id uid=1001(cbou) gid=1001(cbou) groups=1001(cbou) Now enter KDB (for example via sysrq): Entering kdb (current=0x8800065bc740, pid 920) due to Keyboard Entry kdb> ps 23 sleeping system daemon (state M) processes suppressed, use 'ps A' to see all. Task Addr Pid Parent [*] cpu State Thread Command 0x8800065bc740 920 919 10 R 0x8800065bca20 *bash 0x88000707800010 00 S 0x8800070782e0 init [...snip...] 0x8800065be3c0 9181 00 S 0x8800065be6a0 getty 0x8800065b9c80 9191 00 S 0x8800065b9f60 login 0x8800065bc740 920 919 10 R 0x8800065bca20 *bash All we need is the offset of cred pointers. We can look up the offset in the distro's kernel source, but it is unnecessary. We can just start dumping init's task_struct, until we see the process name: kdb> md 0x880007078000 0x880007078000 0001 88000703c000 0x880007078010 00402102 .!@. [...snip...] 0x8800070782b0 8800073e0580 8800073e0580 ..>...>. 0x8800070782c0 74696e69 init ^ Here, 'init'. Creds are just above it, so the offset is 0x02b0. Now we set up init's creds for our non-privileged shell: kdb> mm 0x8800065bc740+0x02b0 0x8800073e0580 0x8800065bc9f0 = 0x8800073e0580 kdb> mm 0x8800065bc740+0x02b8 0x8800073e0580 0x8800065bc9f8 = 0x8800073e0580 And thus gaining the root: kdb> go cbou:~$ id uid=0(root) gid=0(root) groups=0(root) cbou:~$ bash root:~# p.s. No distro enables kdb by default (although, with a nice KDB-over-KMS feature availability, I would expect at least some would enable it), so it's not actually some kind of a major issue. 
Signed-off-by: Anton Vorontsov --- include/linux/kdb.h | 1 + kernel/debug/kdb/kdb_main.c | 20 +++- 2 files changed, 20 insertions(+), 1 deletion(-) diff --git a/include/linux/kdb.h b/include/linux/kdb.h index abe927c..3a2c554 100644 --- a/include/linux/kdb.h +++ b/include/linux/kdb.h @@ -63,6 +63,7 @@ extern atomic_t kdb_event; #define KDB_BADLENGTH (-19) #define KDB_NOBP (-20) #define KDB_BADADDR(-21) +#define KDB_NOPERM (-22) /* * kdb_diemsg diff --git a/kernel/debug/kdb/kdb_main.c b/kernel/debug/kdb/kdb_main.c index 83c3f60..36e4c2a 100644 --- a/kernel/debug/kdb/kdb_main.c +++ b/kernel/debug/kdb/kdb_main.c @@ -12,6 +12,7 @@ */ #include +#include #include #include #include @@ -23,6 +24,7 @@ #include #include #include +#include #include #include #include @@ -42,6 +44,12 @@ #include #include "kdb_private.h" +#undef MODULE_PARAM_PREFIX +#defineMODULE_PARAM_PREFIX "kdb." + +static bool kdb_kiosk; +module_param_named(kiosk, kdb_kiosk, bool, 0600); + #define GREP_LEN 256 char kdb_grep_string[GREP_LEN]; int kdb_grepping_flag; @@ -121,6 +129,7 @@ static kdbmsg_t kdbmsgs[] = { KDBMSG(BADLENGTH, "Invalid length field"), KDBMSG(NOBP, "No Breakpoint exists"), KDBMSG(BADADDR, "Invalid address"), + KDBMSG(NOPERM, "Permission denied"), }; #undef KDBMSG @@ -987,6 +996,14 @@ int kdb_parse(const char *cmdstr) if (i < kdb_max_commands) { int result; + + if (kdb_kiosk) { + if (!(tp->cmd_flags & (KDB_SAFE | KDB_SAFE_NO_ARGS))) + return KDB_NOPERM; + if (tp->cmd_flags & KDB_SAFE_NO_ARGS && argc > 1) + return KDB_NOPERM; + } + KDB_STATE_SET(CMD); result = (*tp->cmd_func)(argc-1, (const char **)argv); if (result && ignore_errors && result > KDB_CMD_GO) @@ -1009,7 +1026,7 @@ int kdb_parse(const char *cmdstr) * obtaining the address of a variable, or the nearest symbol * to an address contained in a register. 
*/ - { + if (!kdb_kiosk) { unsigned long value; char *name = NULL; long offset; @@ -1025,6 +1042,7 @@ int kdb_parse(const char *cmdstr) kdb_printf("\n"); return 0; } + return KDB_NOPERM; } -- 1.7.12.3 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the
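The permission check added to kdb_parse() in the patch above can be modeled outside the kernel. The following is only a sketch under stated assumptions: the flag values are the ones patch 6/7 of this series gives for KDB_SAFE and KDB_SAFE_NO_ARGS, kiosk_allows() is a hypothetical helper name that does not exist in the kernel, and argc is assumed to count the command name itself, as it does in kdb_parse().

```c
#include <stdbool.h>

/* Flag values as given in patch 6/7 of this series. */
enum {
	KDB_SAFE         = 0x4, /* safe regardless of arguments */
	KDB_SAFE_NO_ARGS = 0x8, /* safe only when run without arguments */
};

/*
 * Hypothetical helper (not a kernel function): returns true if a command
 * with the given flags may run while kiosk mode is enabled.
 * argc includes the command name, so argc > 1 means that at least one
 * argument was supplied.
 */
static bool kiosk_allows(unsigned int cmd_flags, int argc)
{
	/* Commands carrying neither flag are always refused. */
	if (!(cmd_flags & (KDB_SAFE | KDB_SAFE_NO_ARGS)))
		return false;
	/* "go" style commands are refused as soon as an argument appears. */
	if ((cmd_flags & KDB_SAFE_NO_ARGS) && argc > 1)
		return false;
	return true;
}
```

With these assumptions, a plain "go" (argc == 1) is allowed, "go <addr>" (argc == 2) is refused, and an unflagged command such as "md" is refused regardless of its arguments.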
[PATCH 6/7] kdb: Mark safe commands as KDB_SAFE and KDB_SAFE_NO_ARGS
This patch introduces two new flags: KDB_SAFE, which denotes a safe
command, and KDB_SAFE_NO_ARGS, which denotes a command that is safe when
used without arguments.

The word "safe" is used here in the sense that the commands cannot be used
to leak sensitive data from memory, and cannot be used to change program
flow in a predefined manner.

These flags will be used by the "kiosk" mode, i.e. when an ordinary user
can enter KDB (or can get access to KDB after a crash), but we do not allow
the user to dump memory [and thus read some sensitive data].

The following commands were marked as "safe":

  Display exception frame
  Stack traceback
  Display stack for process
  Display stack all processes
  Backtrace current process on each cpu
  Execute cmd for each element in linked list
  Show environment variables
  Set environment variables
  Display Help Message
  Switch to new cpu
  Display active task list
  Switch to another task
  Reboot the machine immediately
  List loaded kernel modules
  Magic SysRq key
  Display syslog buffer
  Define a set of commands, down to endefcmd
  Summarize the system
  Disable NMI entry to KDB

The following commands were marked as safe when issued with no arguments:

  Continue Execution

And the following commands are unsafe:

  Clear Breakpoint
  Enable Breakpoint
  Disable Breakpoint
  Single step
  Single step to branch/call
  Continue Execution (with address argument)
  Display Memory Contents
  Display Raw Memory
  Display Physical Memory
  Display Memory Symbolically
  Modify Memory Contents
  Display Registers
  Modify Registers
  Backtrace process given its struct task address
  Send a signal to a process
  Enter kgdb mode
  Display per_cpu variables

Note that we mark the "display registers" command as unsafe. This is
because single stepping while constantly dumping registers in string or
memory functions can be used as a way to read sensitive data (it's actually
trivial to exploit). Later we can do a bit better, i.e. not display the
general-purpose registers but still print the control registers.
Signed-off-by: Anton Vorontsov --- include/linux/kdb.h | 2 ++ kernel/debug/kdb/kdb_main.c | 44 ++-- kernel/trace/trace_kdb.c| 2 +- 3 files changed, 25 insertions(+), 23 deletions(-) diff --git a/include/linux/kdb.h b/include/linux/kdb.h index 792779c..abe927c 100644 --- a/include/linux/kdb.h +++ b/include/linux/kdb.h @@ -16,6 +16,8 @@ typedef enum { KDB_REPEAT_NO_ARGS = 0x1, /* Repeat the command w/o arguments */ KDB_REPEAT_WITH_ARGS= 0x2, /* Repeat the command w/ its arguments */ + KDB_SAFE= 0x4, /* Security-wise safe command */ + KDB_SAFE_NO_ARGS= 0x8, /* Only safe if run w/o arguments */ } kdb_cmdflags_t; typedef int (*kdb_func_t)(int, const char **); diff --git a/kernel/debug/kdb/kdb_main.c b/kernel/debug/kdb/kdb_main.c index 172b726..83c3f60 100644 --- a/kernel/debug/kdb/kdb_main.c +++ b/kernel/debug/kdb/kdb_main.c @@ -2821,70 +2821,70 @@ static void __init kdb_inittab(void) kdb_register_flags("mm", kdb_mm, " ", "Modify Memory Contents", 0, KDB_REPEAT_NO_ARGS); kdb_register_flags("go", kdb_go, "[]", - "Continue Execution", 1, 0); + "Continue Execution", 1, KDB_SAFE_NO_ARGS); kdb_register_flags("rd", kdb_rd, "", "Display Registers", 0, 0); kdb_register_flags("rm", kdb_rm, " ", "Modify Registers", 0, 0); kdb_register_flags("ef", kdb_ef, "", - "Display exception frame", 0, 0); + "Display exception frame", 0, KDB_SAFE); kdb_register_flags("bt", kdb_bt, "[]", - "Stack traceback", 1, 0); + "Stack traceback", 1, KDB_SAFE); kdb_register_flags("btp", kdb_bt, "", - "Display stack for process ", 0, 0); + "Display stack for process ", 0, KDB_SAFE); kdb_register_flags("bta", kdb_bt, "[DRSTCZEUIMA]", - "Display stack all processes", 0, 0); + "Display stack all processes", 0, KDB_SAFE); kdb_register_flags("btc", kdb_bt, "", - "Backtrace current process on each cpu", 0, 0); + "Backtrace current process on each cpu", 0, KDB_SAFE); kdb_register_flags("btt", kdb_bt, "", "Backtrace process given its struct task address", 0, 0); kdb_register_flags("ll", kdb_ll, " ", - "Execute cmd 
for each element in linked list", 0, 0); + "Execute cmd for each element in linked list", 0, KDB_SAFE); kdb_register_flags("env", kdb_env, "", - "Show environment variables", 0, 0); + "Show environment variables", 0, KDB_SAFE); kdb_register_flags("set", kdb_set, "", -
[PATCH 4/7] kdb: Use KDB_REPEAT_* values as flags
The actual values of KDB_REPEAT_* enum values and overall logic stayed the
same, but we now treat the values as flags.

This makes it possible to add other flags and combine them, plus makes the
code a lot simpler and shorter. But functionality-wise, there should be no
changes.

Signed-off-by: Anton Vorontsov
---
 include/linux/kdb.h         |  4 ++--
 kernel/debug/kdb/kdb_main.c | 21 +++--
 2 files changed, 9 insertions(+), 16 deletions(-)

diff --git a/include/linux/kdb.h b/include/linux/kdb.h
index 0142cd3..c6f1ec3 100644
--- a/include/linux/kdb.h
+++ b/include/linux/kdb.h
@@ -15,8 +15,8 @@
 
 typedef enum {
 	KDB_REPEAT_NONE = 0,	/* Do not repeat this command */
-	KDB_REPEAT_NO_ARGS,	/* Repeat the command without arguments */
-	KDB_REPEAT_WITH_ARGS,	/* Repeat the command including its arguments */
+	KDB_REPEAT_NO_ARGS	= 0x1,	/* Repeat the command w/o arguments */
+	KDB_REPEAT_WITH_ARGS	= 0x2,	/* Repeat the command w/ its arguments */
 } kdb_cmdflags_t;
 
 typedef int (*kdb_func_t)(int, const char **);
diff --git a/kernel/debug/kdb/kdb_main.c b/kernel/debug/kdb/kdb_main.c
index bae9a1d..7245bab 100644
--- a/kernel/debug/kdb/kdb_main.c
+++ b/kernel/debug/kdb/kdb_main.c
@@ -992,20 +992,13 @@ int kdb_parse(const char *cmdstr)
 		if (result && ignore_errors && result > KDB_CMD_GO)
 			result = 0;
 		KDB_STATE_CLEAR(CMD);
-		switch (tp->cmd_flags) {
-		case KDB_REPEAT_NONE:
-			argc = 0;
-			if (argv[0])
-				*(argv[0]) = '\0';
-			break;
-		case KDB_REPEAT_NO_ARGS:
-			argc = 1;
-			if (argv[1])
-				*(argv[1]) = '\0';
-			break;
-		case KDB_REPEAT_WITH_ARGS:
-			break;
-		}
+
+		if (tp->cmd_flags & KDB_REPEAT_WITH_ARGS)
+			return result;
+
+		argc = tp->cmd_flags & KDB_REPEAT_NO_ARGS ? 1 : 0;
+		if (argv[argc])
+			*(argv[argc]) = '\0';
 		return result;
 	}
 
--
1.7.12.3

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
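To see why treating the KDB_REPEAT_* values as flags matters once further bits are added, the two variants of the repeat handling from the hunk above can be compared in user space. This is a sketch, not the kernel code: in kdb_parse() argc and argv are static locals, whereas here they are passed in, and the function names repeat_argc_switch() and repeat_argc_flags() are made up for illustration.

```c
#include <stddef.h>

enum {
	KDB_REPEAT_NONE      = 0,   /* do not repeat this command */
	KDB_REPEAT_NO_ARGS   = 0x1, /* repeat the command w/o arguments */
	KDB_REPEAT_WITH_ARGS = 0x2, /* repeat the command w/ its arguments */
};

/* Old logic: compares the whole value, so any extra bit defeats it. */
static int repeat_argc_switch(int flags, int argc, char **argv)
{
	switch (flags) {
	case KDB_REPEAT_NONE:
		argc = 0;
		if (argv[0])
			*(argv[0]) = '\0';
		break;
	case KDB_REPEAT_NO_ARGS:
		argc = 1;
		if (argv[1])
			*(argv[1]) = '\0';
		break;
	case KDB_REPEAT_WITH_ARGS:
		break;
	}
	return argc;
}

/* New logic: tests individual bits, so unrelated bits are ignored. */
static int repeat_argc_flags(int flags, int argc, char **argv)
{
	if (flags & KDB_REPEAT_WITH_ARGS)
		return argc;

	argc = flags & KDB_REPEAT_NO_ARGS ? 1 : 0;
	if (argv[argc])
		*(argv[argc]) = '\0';
	return argc;
}
```

For the three original values both functions return the same argc. If an additional bit such as 0x4 is ORed into the flags, the switch version matches no case and leaves argc untouched, while the bit-test version still truncates the argument list as intended.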
[PATCH 1/7] kdb: Remove currently unused kdbtab_t->cmd_flags
The struct member is never used in the code, so we can remove it.

We will introduce real flags soon by renaming cmd_repeat to cmd_flags.

Signed-off-by: Anton Vorontsov
---
 kernel/debug/kdb/kdb_main.c    | 1 -
 kernel/debug/kdb/kdb_private.h | 1 -
 2 files changed, 2 deletions(-)

diff --git a/kernel/debug/kdb/kdb_main.c b/kernel/debug/kdb/kdb_main.c
index 4d5f8d5..cdaaa52 100644
--- a/kernel/debug/kdb/kdb_main.c
+++ b/kernel/debug/kdb/kdb_main.c
@@ -2748,7 +2748,6 @@ int kdb_register_repeat(char *cmd,
 	kp->cmd_func = func;
 	kp->cmd_usage = usage;
 	kp->cmd_help = help;
-	kp->cmd_flags = 0;
 	kp->cmd_minlen = minlen;
 	kp->cmd_repeat = repeat;
 
diff --git a/kernel/debug/kdb/kdb_private.h b/kernel/debug/kdb/kdb_private.h
index 392ec6a..f8245b3 100644
--- a/kernel/debug/kdb/kdb_private.h
+++ b/kernel/debug/kdb/kdb_private.h
@@ -175,7 +175,6 @@ typedef struct _kdbtab {
 	kdb_func_t cmd_func;	/* Function to execute command */
 	char *cmd_usage;	/* Usage String for this command */
 	char *cmd_help;		/* Help message for this command */
-	short cmd_flags;	/* Parsing flags */
 	short cmd_minlen;	/* Minimum legal # command
 				 * chars required */
 	kdb_repeat_t cmd_repeat;	/* Does command auto repeat on enter? */
--
1.7.12.3
[PATCH 0/7] KDB: Kiosk (reduced capabilities) mode
Hello Jason,

Just as promised, I'm resending the series after the merge window.

This patchset implements a "kiosk" mode for the KDB debugger. The mode
reduces kdb features, so that an ordinary user can no longer leak sensitive
data via the debugger or change program flow in a predefined manner. Root
can control the capability.

There are a few patches: some are just cleanups, some are churn-ish but
unavoidable cleanups, and the rest implement the mode. After all the
preparations, everything is pretty straightforward.

Thanks!

Anton.

--
 include/linux/kdb.h            |  20 ++--
 kernel/debug/kdb/kdb_bp.c      |  24 ++---
 kernel/debug/kdb/kdb_main.c    | 189 ++
 kernel/debug/kdb/kdb_private.h |   3 +-
 kernel/trace/trace_kdb.c       |   4 +-
 5 files changed, 125 insertions(+), 115 deletions(-)
usbutils for Mac OS X and Cygwin
Hi Greg,

usbutils git now builds almost out of the box under Mac OS X and Cygwin
(using libusbx). I am wondering whether you can accept the minor fix for
Mac OS X and suggest a way to fix the Cygwin build.

For Cygwin, there is a conflict with Cygwin's w32api package: DATADIR
conflicts with the definition in MinGW's and Cygwin's w32api package.
http://caca.zoy.org/changeset/3404

typedef enum tagDATADIR {
	DATADIR_GET=1,
	DATADIR_SET
} DATADIR;

I do not know the proper fix, so I just temporarily changed objidl.h to

typedef enum tagDATADIR {
	DATADIR_GET=1,
	DATADIR_SET
} DATADIR1;

After that I can build usbutils.

I only need one fix for Mac OS X, as Apple's gcc compiler does not like
--as-needed.

mymacmini:usbutils xiaofanc$ git diff
diff --git a/Makefile.am b/Makefile.am
index 4e53e45..e8cb002 100644
--- a/Makefile.am
+++ b/Makefile.am
@@ -1,8 +1,7 @@
 SUBDIRS = \
 	usbhid-dump
 
-AM_LDFLAGS = \
-	-Wl,--as-needed
+AM_LDFLAGS =
 
 data_DATA =

--
Xiaofan
linux-next: manual merge of the tip tree with Linus' tree
Hi all, Today's linux-next merge of the tip tree got a conflict in mm/huge_memory.c between commit 325adeb55e32 ("mm: huge_memory: Fix build error") from Linus' tree and commit 39d6cb39a817 ("mm/mpol: Use special PROT_NONE to migrate pages") from the tip tree. I fixed it up (see below) and can carry the fix as necessary (no action is required). diff --cc mm/huge_memory.c index 40f17c3,d14c8b2..000 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@@ -17,7 -17,7 +17,8 @@@ #include #include #include +#include + #include #include #include #include "internal.h" @@@ -1347,59 -1428,55 +1418,54 @@@ static int __split_huge_page_map(struc spin_lock(>page_table_lock); pmd = page_check_address_pmd(page, mm, address, PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG); - if (pmd) { - pgtable = pgtable_trans_huge_withdraw(mm); - pmd_populate(mm, &_pmd, pgtable); - - haddr = address; - for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) { - pte_t *pte, entry; - BUG_ON(PageCompound(page+i)); - entry = mk_pte(page + i, vma->vm_page_prot); - entry = maybe_mkwrite(pte_mkdirty(entry), vma); - if (!pmd_write(*pmd)) - entry = pte_wrprotect(entry); - else - BUG_ON(page_mapcount(page) != 1); - if (!pmd_young(*pmd)) - entry = pte_mkold(entry); - pte = pte_offset_map(&_pmd, haddr); - BUG_ON(!pte_none(*pte)); - set_pte_at(mm, haddr, pte, entry); - pte_unmap(pte); - } + if (!pmd) + goto unlock; - smp_wmb(); /* make pte visible before pmd */ - /* -* Up to this point the pmd is present and huge and -* userland has the whole access to the hugepage -* during the split (which happens in place). If we -* overwrite the pmd with the not-huge version -* pointing to the pte here (which of course we could -* if all CPUs were bug free), userland could trigger -* a small page size TLB miss on the small sized TLB -* while the hugepage TLB entry is still established -* in the huge TLB. Some CPU doesn't like that. See -* http://support.amd.com/us/Processor_TechDocs/41322.pdf, -* Erratum 383 on page 93. 
Intel should be safe but is -* also warns that it's only safe if the permission -* and cache attributes of the two entries loaded in -* the two TLB is identical (which should be the case -* here). But it is generally safer to never allow -* small and huge TLB entries for the same virtual -* address to be loaded simultaneously. So instead of -* doing "pmd_populate(); flush_tlb_range();" we first -* mark the current pmd notpresent (atomically because -* here the pmd_trans_huge and pmd_trans_splitting -* must remain set at all times on the pmd until the -* split is complete for this pmd), then we flush the -* SMP TLB and finally we write the non-huge version -* of the pmd entry with pmd_populate. -*/ - pmdp_invalidate(vma, address, pmd); - pmd_populate(mm, pmd, pgtable); - ret = 1; + prot = pmd_pgprot(*pmd); - pgtable = get_pmd_huge_pte(mm); ++ pgtable = pgtable_trans_huge_withdraw(mm); + pmd_populate(mm, &_pmd, pgtable); + + for (i = 0, haddr = address; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) { + pte_t *pte, entry; + + BUG_ON(PageCompound(page+i)); + entry = mk_pte(page + i, prot); + entry = pte_mkdirty(entry); + if (!pmd_young(*pmd)) + entry = pte_mkold(entry); + pte = pte_offset_map(&_pmd, haddr); + BUG_ON(!pte_none(*pte)); + set_pte_at(mm, haddr, pte, entry); + pte_unmap(pte); } + + smp_wmb(); /* make ptes visible before pmd, see __pte_alloc */ + /* +* Up to this point the pmd is present and huge. +* +* If we overwrite the pmd with the not-huge version, we could trigger +* a small page size TLB miss on the small sized TLB while the hugepage +* TLB entry is still established in the huge TLB. +* +* Some CPUs don't like that. See +* http://support.amd.com/us/Processor_TechDocs/41322.pdf, Erratum 383 +* on page 93. +* +
Re: [PATCH] doc: describe memcg swappiness more precisely memory.swappiness==0
On Tue, 16 Oct 2012, Michal Hocko wrote:

> And a follow up for memcg.swappiness documentation which is more
> specific about swappiness==0 meaning.
> ---
> From 1bc3a94fea728107ed108edd42df464b908cd067 Mon Sep 17 00:00:00 2001
> From: Michal Hocko
> Date: Mon, 15 Oct 2012 11:43:56 +0200
> Subject: [PATCH] doc: describe memcg swappiness more precisely
>
> since fe35004f (mm: avoid swapping out with swappiness==0) memcg reclaim
> stopped swapping out anon pages completely when 0 value is used.
> Although this is somehow expected it hasn't been done for a really long
> time this way and so it is probably better to be explicit about the
> effect. Moreover global reclaim swaps out even when swappiness is 0
> to prevent from OOM killer.
>
> Signed-off-by: Michal Hocko

Acked-by: David Rientjes
Re: [PATCH 1/2] SLUB: remove hard coded magic numbers from resiliency_test
On Mon, 15 Oct 2012, Christoph Lameter wrote:

> > Use the always inlined function kmalloc_index to translate
> > sizes to indexes, so that we don't have to have the slab indexes
> > hard coded in two places.
>
> Acked-by: Christoph Lameter
>

Shouldn't this be using get_slab() instead?
Re: [PATCH] doc: describe memcg swappiness more precisely memory.swappiness==0
(2012/10/16 7:07), Michal Hocko wrote:
> And a follow up for memcg.swappiness documentation which is more
> specific about swappiness==0 meaning.
> ---
> From 1bc3a94fea728107ed108edd42df464b908cd067 Mon Sep 17 00:00:00 2001
> From: Michal Hocko
> Date: Mon, 15 Oct 2012 11:43:56 +0200
> Subject: [PATCH] doc: describe memcg swappiness more precisely
>
> since fe35004f (mm: avoid swapping out with swappiness==0) memcg reclaim
> stopped swapping out anon pages completely when 0 value is used.
> Although this is somehow expected it hasn't been done for a really long
> time this way and so it is probably better to be explicit about the
> effect. Moreover global reclaim swaps out even when swappiness is 0
> to prevent from OOM killer.
>
> Signed-off-by: Michal Hocko

Nice :)

Acked-by: KAMEZAWA Hiroyuki

> ---
>  Documentation/cgroups/memory.txt |    4
>  1 file changed, 4 insertions(+)
>
> diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
> index c07f7b4..71c4da4 100644
> --- a/Documentation/cgroups/memory.txt
> +++ b/Documentation/cgroups/memory.txt
> @@ -466,6 +466,10 @@ Note:
>  5.3 swappiness
>
>  Similar to /proc/sys/vm/swappiness, but affecting a hierarchy of groups only.
> +Please note that unlike the global swappiness, memcg knob set to 0
> +really prevents from any swapping even if there is a swap storage
> +available. This might lead to memcg OOM killer if there are no file
> +pages to reclaim.
>
>  Following cgroups' swappiness can't be changed.
>  - root cgroup (uses /proc/sys/vm/swappiness).
Re: [PATCH RFC] mm/swap: automatic tuning for swapin readahead
On Mon, Oct 08, 2012 at 03:09:58PM -0700, Hugh Dickins wrote:
> On Thu, 4 Oct 2012, Konstantin Khlebnikov wrote:
>
> > Here results of my test. Workload isn't very realistic, but at least it
> > threaded: compiling linux-3.6 with defconfig in 16 threads on tmpfs,
> > 512mb ram, dualcore cpu, ordinary hard disk. (test script in attachment)
> >
> > average results for ten runs:
> >
> >             RA=3     RA=0     RA=1     RA=2     RA=4     Hugh     Shaohua
> > real time   500      542      528      519      500      523      522
> > user time   738      737      735      737      739      737      739
> > sys time    93       93       91       92       96       92       93
> > pgmajfault  62918    110533   92454    78221    54342    86601    77229
> > pgpgin      2070372  795228   1034046  1471010  3177192  1154532  1599388
> > pgpgout     2597278  2022037  2110020  2350380  2802670  2286671  2526570
> > pswpin      462747   138873   202148   310969   739431   232710   341320
> > pswpout     646363   502599   524613   584731   697797   568784   628677
> >
> > So, last two columns shows mostly equal results: +4.6% and +4.4% in
> > comparison to vanilla kernel with RA=3, but your version shows more stable
> > results (std-error 2.7% against 4.8%) (all this numbers in huge table in
> > attachment)
>
> Thanks for doing this, Konstantin, but I'm stuck for anything much to say!
> Shaohua and I are both about 4.5% bad for this particular test, but I'm
> more consistently bad - hurrah!
>
> I suspect (not a convincing argument) that if the test were just slightly
> different (a little more or a little less memory, SSD instead of hard
> disk, diskcache instead of tmpfs), then it would come out differently.
>
> Did you draw any conclusions from the numbers you found?
>
> I haven't done any more on this in the last few days, except to verify
> that once an anon_vma is judged random with Shaohua's, then it appears
> to be condemned to no-readahead ever after.
>
> That's probably something that a hack like I had in mine would fix,
> but that addition might change its balance further (and increase vma
> or anon_vma size) - not tried yet.
>
> All I want to do right now, is suggest to Andrew that he hold Shaohua's
> patch back from 3.7 for the moment: I'll send a response to Sep 7th's
> mm-commits mail to suggest that - but no great disaster if he ignores me.

Ok, I tested Hugh's patch. My test is a multithread random write workload.

With Hugh's patch, 49:28.06elapsed
With mine, 43:23.39elapsed

There is 12% more time used with Hugh's patch.

In the stable state of this workload, the SI:SO ratio should be roughly
1:1. With Hugh's patch it's around 1.6:1, so there is still unnecessary
swapin.

I also tried a workload with sequential/random writes mixed; Hugh's patch
is 10% worse there too.

Thanks,
Shaohua
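The "+4.6%" and "+4.4%" figures in Konstantin's quoted mail follow from the "real time" row of his table (523 and 522 against the RA=3 baseline of 500). A minimal sketch of that arithmetic, assuming the usual relative-change formula; percent_change() is a name made up for this illustration and does not come from the thread:

```c
/*
 * Relative change of `value` against `baseline`, in percent.
 * Positive results mean `value` is larger (here: slower) than the baseline.
 */
static double percent_change(double baseline, double value)
{
	return (value - baseline) / baseline * 100.0;
}
```

Calling percent_change(500.0, 523.0) gives roughly 4.6 and percent_change(500.0, 522.0) roughly 4.4, matching the quoted numbers. The same helper can be applied to any other row of the table.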
Re: [RFC 0/2] DMA-mapping & IOMMU - physically contiguous allocations
2012/10/15 Marek Szyprowski :
> Hello,
>
> Some devices, which have IOMMU, for some use cases might require to
> allocate a buffers for DMA which is contiguous in physical memory. Such
> use cases appears for example in DRM subsystem when one wants to improve
> performance or use secure buffer protection.
>
> I would like to ask if adding a new attribute, as proposed in this RFC
> is a good idea? I feel that it might be an attribute just for a single
> driver, but I would like to know your opinion. Should we look for other
> solution?
>

In addition, we have implemented dma-mapping-based IOMMU support for the
Exynos DRM driver on top of this patch set, so the patch set has been
tested with the IOMMU-enabled Exynos DRM driver and works fine. This
feature is needed for secure mode such as TrustZone: on Exynos SoCs the
memory region for secure mode must be physically contiguous (possibly on
OMAP as well), but the dma-mapping framework currently does not guarantee
physically contiguous allocation, so this patch set would make that
possible.

Tested-by: Inki Dae
Reviewed-by: Inki Dae

Thanks,
Inki Dae

> Best regards
> --
> Marek Szyprowski
> Samsung Poland R&D Center
>
>
> Marek Szyprowski (2):
>   common: DMA-mapping: add DMA_ATTR_FORCE_CONTIGUOUS attribute
>   ARM: dma-mapping: add support for DMA_ATTR_FORCE_CONTIGUOUS attribute
>
>  Documentation/DMA-attributes.txt |    9 +
>  arch/arm/mm/dma-mapping.c        |   41 ++
>  include/linux/dma-attrs.h        |    1 +
>  3 files changed, 43 insertions(+), 8 deletions(-)
>
> --
> 1.7.9.5
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majord...@kvack.org. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
Re: mpol_to_str revisited.
On Mon, 8 Oct 2012, Dave Jones wrote:

> > > diff -durpN '--exclude-from=/home/davej/.exclude' src/git-trees/kernel/linux/fs/proc/task_mmu.c linux-dj/fs/proc/task_mmu.c
> > > --- src/git-trees/kernel/linux/fs/proc/task_mmu.c	2012-05-31 22:32:46.778150675 -0400
> > > +++ linux-dj/fs/proc/task_mmu.c	2012-10-04 19:31:41.269988984 -0400
> > > @@ -1162,6 +1162,7 @@ static int show_numa_map(struct seq_file
> > >  	struct mm_walk walk = {};
> > >  	struct mempolicy *pol;
> > >  	int n;
> > > +	int ret;
> > >  	char buffer[50];
> > >
> > >  	if (!mm)
> > > @@ -1178,7 +1179,11 @@ static int show_numa_map(struct seq_file
> > >  	walk.mm = mm;
> > >
> > >  	pol = get_vma_policy(proc_priv->task, vma, vma->vm_start);
> > > -	mpol_to_str(buffer, sizeof(buffer), pol, 0);
> > > +	memset(buffer, 0, sizeof(buffer));
> > > +	ret = mpol_to_str(buffer, sizeof(buffer), pol, 0);
> > > +	if (ret < 0)
> > > +		return 0;
> >
> > We should need the mpol_cond_put(pol) here before returning.
>
> good catch. I'll respin the patch later with this changed.
>

Did you get a chance to fix this issue?
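The review comment quoted above is about an early return that skips dropping a reference. The pattern can be shown in a small user-space model. Everything in this sketch is hypothetical: struct policy, policy_get(), policy_put(), policy_to_str() and show_policy() merely stand in for the mempolicy helpers and are not the kernel API, and unlike the quoted patch the sketch propagates the error code instead of returning 0, only to keep the example easy to check.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a reference-counted policy object. */
struct policy {
	int refcount;
	const char *name;
};

static struct policy *policy_get(struct policy *pol)
{
	pol->refcount++;
	return pol;
}

static void policy_put(struct policy *pol)
{
	pol->refcount--;
}

/* Returns 0 on success, -1 if the buffer is too small. */
static int policy_to_str(char *buf, size_t len, const struct policy *pol)
{
	size_t need = strlen(pol->name) + 1;

	if (need > len)
		return -1;
	memcpy(buf, pol->name, need);
	return 0;
}

static int show_policy(struct policy *shared, char *out, size_t outlen,
		       int *shown)
{
	struct policy *pol = policy_get(shared);
	int ret;

	memset(out, 0, outlen);
	ret = policy_to_str(out, outlen, pol);
	if (ret < 0)
		goto out_put;	/* a bare "return" here would leak the reference */

	(*shown)++;		/* stands in for "emit the formatted line" */
out_put:
	policy_put(pol);	/* reached on both the success and error path */
	return ret;
}
```

The single exit label keeps the get/put pair balanced regardless of which path is taken, which is the property the reviewer asked for.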
Re: [PATCH v4] slab: Ignore internal flags in cache creation
On Mon, 8 Oct 2012, David Rientjes wrote:

> > diff --git a/mm/slab.h b/mm/slab.h
> > index 7deeb44..4c35c17 100644
> > --- a/mm/slab.h
> > +++ b/mm/slab.h
> > @@ -45,6 +45,31 @@ static inline struct kmem_cache *__kmem_cache_alias(const char *name, size_t siz
> >  #endif
> >
> >
> > +/* Legal flag mask for kmem_cache_create(), for various configurations */
> > +#define SLAB_CORE_FLAGS (SLAB_HWCACHE_ALIGN | SLAB_CACHE_DMA | SLAB_PANIC | \
> > +			 SLAB_DESTROY_BY_RCU | SLAB_DEBUG_OBJECTS )
> > +
> > +#if defined(CONFIG_DEBUG_SLAB)
> > +#define SLAB_DEBUG_FLAGS (SLAB_RED_ZONE | SLAB_POISON | SLAB_STORE_USER)
> > +#elif defined(CONFIG_SLUB_DEBUG)
> > +#define SLAB_DEBUG_FLAGS (SLAB_RED_ZONE | SLAB_POISON | SLAB_STORE_USER | \
> > +			  SLAB_TRACE | SLAB_DEBUG_FREE)
> > +#else
> > +#define SLAB_DEBUG_FLAGS (0)
> > +#endif
> > +
> > +#if defined(CONFIG_SLAB)
> > +#define SLAB_CACHE_FLAGS (SLAB_MEMSPREAD | SLAB_NOLEAKTRACE | \
>
> s/SLAB_MEMSPREAD/SLAB_MEM_SPREAD/
>

Did you have a v5 of this patch with the above fix?
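The quoted hunk builds per-configuration masks of flags that callers may legally pass at cache creation. How such masks could be used to drop anything else is sketched below. This is an assumption-laden illustration, not the patch itself: all DEMO_* names and bit values are invented here, and the quoted text does not show how the real masks are combined or applied.

```c
/* Hypothetical bit values; the kernel assigns its own in its slab headers. */
#define DEMO_HWCACHE_ALIGN	0x01u
#define DEMO_PANIC		0x02u
#define DEMO_RED_ZONE		0x04u
#define DEMO_INTERNAL		0x80u	/* allocator-private, never valid from callers */

#define DEMO_CORE_FLAGS		(DEMO_HWCACHE_ALIGN | DEMO_PANIC)

/* Debug flags are only legal when the (hypothetical) debug option is set. */
#ifdef DEMO_DEBUG
#define DEMO_DEBUG_FLAGS	(DEMO_RED_ZONE)
#else
#define DEMO_DEBUG_FLAGS	(0u)
#endif

#define DEMO_CREATE_MASK	(DEMO_CORE_FLAGS | DEMO_DEBUG_FLAGS)

/* Keep only the flags that are legal for this configuration. */
static unsigned int demo_sanitize_flags(unsigned int flags)
{
	return flags & DEMO_CREATE_MASK;
}
```

Built without DEMO_DEBUG, demo_sanitize_flags(DEMO_PANIC | DEMO_INTERNAL) yields DEMO_PANIC alone, and a lone DEMO_RED_ZONE is masked to zero. A typo in one of the mask members, like the one pointed out in the review above, would silently make a legal flag get dropped (or fail to compile), which is why the exact names matter.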