Re: [PATCH] memory cgroup: update root memory cgroup when node is onlined

2012-10-15 Thread Wen Congyang
At 09/14/2012 09:36 AM, Hugh Dickins Wrote:
> On Thu, 13 Sep 2012, Johannes Weiner wrote:
>> On Thu, Sep 13, 2012 at 03:14:28PM +0800, Wen Congyang wrote:
>>> root_mem_cgroup->info.nodeinfo is initialized when the system boots.
>>> But NODE_DATA(nid) is null if the node is not onlined, so
>>> root_mem_cgroup->info.nodeinfo[nid]->zoneinfo[zone].lruvec.zone contains
>>> an invalid pointer. If we use numactl to bind a program to the node
>>> after onlining the node and its memory, it will cause a kernel
>>> panic:
>>
>> Is there any chance we could get rid of the zone backpointer in lruvec
>> again instead?
> 
> It could be done, but it would make me sad :(
> 
>> Adding new nodes is a rare event and so updating every
>> single memcg in the system might be just borderline crazy.
> 
> Not horribly crazy, but rather ugly, yes.
> 
>> But can't
>> we just go back to passing the zone along with the lruvec down
>> vmscan.c paths?  I agree it's ugly to pass both, given their
>> relationship.  But I don't think the backpointer is any cleaner, and it
>> is less robust in addition.
> 
> It's like how we use vma->mm: we could change everywhere to pass mm with
> vma, but it looks cleaner and cuts down on long arglists to have mm in vma.
> From past experience, one of the things I worried about was adding extra
> args to the reclaim stack.
> 
>>
>> That being said, the crashing code in particular makes me wonder:
>>
>> static __always_inline void add_page_to_lru_list(struct page *page,
>>  struct lruvec *lruvec, enum lru_list lru)
>> {
>>  int nr_pages = hpage_nr_pages(page);
>>  mem_cgroup_update_lru_size(lruvec, lru, nr_pages);
>>  list_add(&page->lru, &lruvec->lists[lru]);
>>  __mod_zone_page_state(lruvec_zone(lruvec), NR_LRU_BASE + lru, nr_pages);
>> }
>>
>> Why did we ever pass zone in here and then felt the need to replace it
>> with lruvec->zone in fa9add6 "mm/memcg: apply add/del_page to lruvec"?
>> A page does not roam between zones, its zone is a static property that
>> can be retrieved with page_zone().
> 
> Just as in vmscan.c, we have the lruvec to hand, and that's what we
> mainly want to operate upon, but there is also some need for zone.
> 
> (Both Konstantin and I were looking towards the day when we move the
> lru_lock into the lruvec, removing more dependence on "zone".  Pretty
> much the only reason that hasn't happened yet, is that we have not found
> time to make a performance case convincingly - but that's another topic.)
> 
> Yes, page_zone(page) is a static property of the page, but it's not
> necessarily cheap to evaluate: depends on how complex the memory model
> and the spare page flags space, doesn't it?  We both preferred to
> derive zone from lruvec where convenient.
> 
> How do you feel about this patch, and does it work for you guys?
> 
> You'd be right if you guessed that I started out without the
> mem_cgroup_zone_lruvec part of it, but oops in get_scan_count
> told me that's needed too.
> 
> Description to be filled in later: would it be needed for -stable,
> or is onlining already broken in other ways that you're now fixing up?
> 
> Reported-by: Tang Chen 
> Signed-off-by: Hugh Dickins 

Hi, all:

What is the status of this patch?

Thanks
Wen Congyang

> ---
> 
>  include/linux/mmzone.h |2 -
>  mm/memcontrol.c|   40 ---
>  mm/mmzone.c|6 -
>  mm/page_alloc.c|2 -
>  4 files changed, 36 insertions(+), 14 deletions(-)
> 
> --- 3.6-rc5/include/linux/mmzone.h2012-08-03 08:31:26.892842267 -0700
> +++ linux/include/linux/mmzone.h  2012-09-13 17:07:51.893772372 -0700
> @@ -744,7 +744,7 @@ extern int init_currently_empty_zone(str
>unsigned long size,
>enum memmap_context context);
>  
> -extern void lruvec_init(struct lruvec *lruvec, struct zone *zone);
> +extern void lruvec_init(struct lruvec *lruvec);
>  
>  static inline struct zone *lruvec_zone(struct lruvec *lruvec)
>  {
> --- 3.6-rc5/mm/memcontrol.c   2012-08-03 08:31:27.060842270 -0700
> +++ linux/mm/memcontrol.c 2012-09-13 17:46:36.870804625 -0700
> @@ -1061,12 +1061,25 @@ struct lruvec *mem_cgroup_zone_lruvec(st
> struct mem_cgroup *memcg)
>  {
>   struct mem_cgroup_per_zone *mz;
> + struct lruvec *lruvec;
>  
> - if (mem_cgroup_disabled())
> - return &zone->lruvec;
> + if (mem_cgroup_disabled()) {
> + lruvec = &zone->lruvec;
> + goto out;
> + }
>  
>   mz = mem_cgroup_zoneinfo(memcg, zone_to_nid(zone), zone_idx(zone));
> - return &mz->lruvec;
> + lruvec = &mz->lruvec;
> +out:
> + /*
> +  * Since a node can be onlined after the mem_cgroup was created,
> +  * we have to be prepared to initialize lruvec->zone here.
> +  */
> + if (unlikely(lruvec->zone != zone)) {
> + VM_BUG_ON(lruvec->zone);
> + lruvec->zone = zone;
> +  

Re: [PATCH v2 09/13] ARM: davinci - update the dm644x soc code to use common clk drivers

2012-10-15 Thread Sekhar Nori
Hi Murali,

On 10/15/2012 9:21 PM, Karicheri, Muralidharan wrote:
> --Cut
> 
>>> Subject: Re: [PATCH v2 09/13] ARM: davinci - update the dm644x soc code to 
>>> use
>>> common clk drivers
>>>
>> You have chosen to keep all clock related data in platform files
>> while using the common clock framework to provide just the
>> infrastructure. If you look at how mxs and spear have been migrated, 
>> they have
>>> migrated the soc specific clock data to drivers/clk as well.
>> See "drivers/clk/spear/spear3xx_clock.c" or
>> "drivers/clk/mxs/clk-imx23.c

 I have to disagree on this one. I had investigated these code already
 and came up with a way that we can re-use code across all of the
 davinci platforms as well as other architectures that re-uses the clk 
 hardware IPs.
>>>
>>> Which code you are talking about here? Even if you introduce clk-dm644x.c, 
>>> clk-
>>> keystone.c etc in drivers/clk/davinci/ you can reuse the code you introduce 
>>> in patches 1-
>>> 3. I cant see how that will be prevented.
> 
> I was talking about re-use of davinci_common_clk_init in 
> drivers/clk/davinci/davinci-clock.c.
> This is meant to be re-used across all of the DaVinci devices.
> 
>>>
 spear3xx_clock.c has initialization code for each of the platforms and
 so is the case with imx23.c.
>>>
>>> By each of the platforms, you mean they all cater to a family of devices? 
>>> This depends on
>>> how close together the family of devices are.
>>> Otherwise, there would be a file per soc. DM644x also represents a family 
>>> for that matter.
>>>
 By using platform_data approach, we are able to define clks for each of 
 the SoC and
>>> then use davinci_common_clk_init() to do initialize the clk drivers based 
>>> on platform
>>> data.
>>>
>>> You need to define and register the clocks present on each SoC either which 
>>> way. I don't
>>> see why just the platform_data approach allows this.
>>> And looking closely, you have defined platform data, but don't actually 
>>> have a platform
>>> device, making things more confusing.
>>>
> 
> Ok. There are multiple ways to implement this software. We had discussed this
> internally and picked the platform_data approach. The clk drivers are not
> written following the platform driver model. But I don't see why we can't use
> platform data to configure these drivers. Down below, you have made two
> interesting points. One is ARM code reduction. This patch already does this by
> moving the API that initializes the clk drivers (davinci_common_clk_init())
> out of ARM to drivers/clk/davinci. So with this plus the removal of the
> existing clk driver under arm/mach-davinci/clock.[ch], we have achieved this
> goal. The second point is the moving of SoC specific clk data out of SoC code
> to the driver. Are you 100% sure this is the right thing to do for these
> drivers? If so, I can start working on this change right away. As I am working
> on this as a background activity, I want to be double or triple sure before
> doing the rework of these patches :).
> So please confirm.

Yes, this is the right way to go. And I don't see it as something
breaking new ground since there are already multiple SoCs in mainline
which are following this same approach. Maybe to start with, just
convert one SoC and send it for review.

Thanks for taking this up and helping clean-up mach-davinci.

Regards,
Sekhar
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH 1/3] mm: teach mm by current context info to not do I/O during memory allocation

2012-10-15 Thread Minchan Kim
On Tue, Oct 16, 2012 at 09:56:48AM +0800, Ming Lei wrote:
> On Mon, Oct 15, 2012 at 11:47 PM, Minchan Kim  wrote:
> > On Mon, Oct 15, 2012 at 01:14:17PM +0800, Ming Lei wrote:
> >> This patch introduces PF_MEMALLOC_NOIO on process flag('flags' field of
> >> 'struct task_struct'), so that the flag can be set by one task
> >> to avoid doing I/O inside memory allocation in the task's context.
> >>
> >> The patch tries to solve one deadlock problem caused by block device,
> >> and the problem can occur at least in the below situations:
> >>
> >> - during block device runtime resume situation, if memory allocation
> >> with GFP_KERNEL is called inside runtime resume callback of any one
> >> of its ancestors(or the block device itself), the deadlock may be
> >> triggered inside the memory allocation since it might not complete
> >> until the block device becomes active and the involved page I/O finishes.
> >> The situation is pointed out first by Alan Stern. It is not a good
> >> approach to convert all GFP_KERNEL in the path into GFP_NOIO because
> >> several subsystems may be involved(for example, PCI, USB and SCSI may
> >> be involved for usb mass storage device)
> >
> > Couldn't we expand pm_restrict_gfp_mask to cover resume path as well as
> > suspend path?
> 
> IMO, we could, but it is not good and might trigger memory allocation problems.
> 
> pm_restrict_gfp_mask uses the global variable of gfp_allowed_mask to
> avoid allocating page with GFP_IOFS in all contexts during system sleep,
> when processes have been frozen.
> 
> But during runtime PM, the whole system is running and all processes are
> runnable. Also runtime PM is per device and the whole system may have
> lots of devices, so taking the global gfp_allowed_mask may keep page
> allocation with ~GFP_IOFS for a considerable proportion of system
> running time, and then alloc_page() will fail more easily.
> 
> The above deadlock problem may be fixed by allocating memory with
> ~GFP_IOFS only in the context of calling runtime_resume, and that is
> the idea of the patch.

Fair enough, but it wouldn't be a good idea to add a new unlikely branch
to the allocator's fast path. Please move the check into the slow path,
which could be in __alloc_pages_slowpath.
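
A minimal userspace sketch of that suggestion, assuming nothing about the
final patch: the per-task flag is only consulted once the fast path has
already failed, so the common case pays no extra branch. All sk_ names and
flag values below are hypothetical stand-ins, not the kernel's definitions.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for kernel flags; the values are illustrative only. */
#define SK_GFP_IO            0x1u
#define SK_GFP_FS            0x2u
#define SK_GFP_WAIT          0x4u
#define SK_GFP_KERNEL        (SK_GFP_WAIT | SK_GFP_IO | SK_GFP_FS)
#define SK_PF_MEMALLOC_NOIO  0x1u

/* Models the 'flags' field of a task. */
struct sk_task {
	unsigned int flags;
};

/* Slow path only: strip the I/O-capable bits if the task asked for NOIO. */
static unsigned int sk_slowpath_gfp(const struct sk_task *t, unsigned int gfp)
{
	if (t->flags & SK_PF_MEMALLOC_NOIO)
		gfp &= ~(SK_GFP_IO | SK_GFP_FS);
	return gfp;
}

/*
 * Returns the mask that the allocation attempt ends up using.
 * fast_path_ok models whether the first, cheap attempt succeeded.
 */
static unsigned int sk_alloc_mask(const struct sk_task *t, unsigned int gfp,
				  bool fast_path_ok)
{
	if (fast_path_ok)
		return gfp;	/* no per-task check on the fast path */
	return sk_slowpath_gfp(t, gfp);
}
```

With a NOIO task, a GFP_KERNEL-style request keeps its full mask when the
fast path succeeds and drops the IO/FS bits only when it has to fall back.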

> 
> >
> >>
> >> - during error handling situation of usb mass storage device, USB
> >> bus reset will be put on the device, so there shouldn't be any
> >> memory allocation with GFP_KERNEL during USB bus reset, otherwise
> >> a deadlock similar to the above may be triggered. Unfortunately, any
> >> usb device may include one mass storage interface in theory, so it
> >> requires all usb interface drivers to handle the situation. In fact,
> >> most usb drivers don't know how to handle bus reset on the device
> >> and don't provide .pre_reset() and .post_reset() callback at all, so
> >> USB core has to unbind and bind driver for these devices. So it
> >> is still not practical to resort to GFP_NOIO for solving the problem.
> >
> > I hope this case could be handled by usb core like usb_restrict_gfp_mask
> > rather than adding new branch on fast path.
> 
> See above, applying the global gfp_allowed_mask is not good.
> 
> 
> Thanks,
> --
> Ming Lei

-- 
Kind Regards,
Minchan Kim


Re: [PATCH v9 05/12] x86, hotplug, suspend: Online CPU0 for suspend or hibernate

2012-10-15 Thread Srivatsa S. Bhat
On 10/16/2012 02:20 AM, Rafael J. Wysocki wrote:
> On Friday 12 of October 2012 09:09:42 Fenghua Yu wrote:
>> From: Fenghua Yu 
>>
>> Because x86 BIOS requires CPU0 to resume from sleep, suspend or hibernate
>> can't be executed if CPU0 is detected offline. To make suspend or hibernate
>> and further resume succeed, CPU0 must be online.
>>
>> Signed-off-by: Fenghua Yu 
>> ---
>>  arch/x86/power/cpu.c |   44 
>>  1 files changed, 44 insertions(+), 0 deletions(-)
>>
>> diff --git a/arch/x86/power/cpu.c b/arch/x86/power/cpu.c
>> index 218cdb1..adde775 100644
>> --- a/arch/x86/power/cpu.c
>> +++ b/arch/x86/power/cpu.c
>> @@ -237,3 +237,47 @@ void restore_processor_state(void)
>>  #ifdef CONFIG_X86_32
>>  EXPORT_SYMBOL(restore_processor_state);
>>  #endif
>> +
>> +/*
>> + * When bsp_check() is called in hibernate and suspend, cpu hotplug
>> + * is disabled already. So it's unnessary to handle race condition between
>> + * cpumask query and cpu hotplug.
>> + */
>> +static int bsp_check(void)
>> +{
>> +if (cpumask_first(cpu_online_mask) != 0) {
>> +pr_warn("CPU0 is offline.\n");
>> +return -ENODEV;
>> +}
>> +
>> +return 0;
>> +}
>> +
>> +static int bsp_pm_callback(struct notifier_block *nb, unsigned long action,
>> +   void *ptr)
>> +{
>> +int ret = 0;
>> +
>> +switch (action) {
>> +case PM_SUSPEND_PREPARE:
>> +case PM_HIBERNATION_PREPARE:
>> +ret = bsp_check();
>> +break;
>> +default:
>> +break;
>> +}
>> +return notifier_from_errno(ret);
>> +}
>> +
> 
> I wonder if there's anything preventing CPU0 from becoming offline after
> you've done this check and before user space is frozen?
> 

Hi Rafael,

bsp_pm_callback runs as a low priority notifier callback, specifically with
lower priority than the cpu_hotplug_pm_callback (as mentioned in the comment
below). And cpu_hotplug_pm_callback disables regular CPU hotplug (till the
suspend/resume sequence is complete). So there is no chance for CPU0 to
become offline after that.
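
The ordering argument can be sketched with a small userspace model of a
priority-sorted notifier chain. This is only an illustration: the sk_ names
are hypothetical, the callbacks take no arguments, and the priority of 0 for
the hotplug callback is an assumption made for the example.

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical, simplified notifier entry. */
struct sk_notifier {
	int (*call)(void);
	int priority;
	struct sk_notifier *next;
};

/* Insert so that the chain stays sorted by descending priority. */
static void sk_register(struct sk_notifier **head, struct sk_notifier *n)
{
	while (*head && (*head)->priority >= n->priority)
		head = &(*head)->next;
	n->next = *head;
	*head = n;
}

/* Call every entry in chain order; stop at the first error. */
static int sk_call_chain(struct sk_notifier *head)
{
	int ret = 0;

	for (; head; head = head->next) {
		ret = head->call();
		if (ret)
			break;
	}
	return ret;
}

static int sk_hotplug_disabled;
static int sk_cpu0_online = 1;
static int sk_check_saw_hotplug_disabled;

/* Higher priority: runs first and blocks further CPU hotplug. */
static int sk_cpu_hotplug_pm_callback(void)
{
	sk_hotplug_disabled = 1;
	return 0;
}

/* Lowest priority: by the time this runs, hotplug is already disabled. */
static int sk_bsp_pm_callback(void)
{
	sk_check_saw_hotplug_disabled = sk_hotplug_disabled;
	return sk_cpu0_online ? 0 : -1;
}
```

Registering sk_bsp_pm_callback with -INT_MAX and sk_cpu_hotplug_pm_callback
with 0 puts the hotplug callback at the head of the chain regardless of
registration order, so the CPU0 check always observes hotplug as disabled.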

Or, are you thinking of some other scenario where CPU0 can go offline?

Regards,
Srivatsa S. Bhat

> 
> 
>> +static int __init bsp_pm_check_init(void)
>> +{
>> +/*
>> + * Set this bsp_pm_callback as lower priority than
>> + * cpu_hotplug_pm_callback. So cpu_hotplug_pm_callback will be called
>> + * earlier to disable cpu hotplug before bsp online check.
>> + */
>> +pm_notifier(bsp_pm_callback, -INT_MAX);
>> +return 0;
>> +}
>> +
>> +core_initcall(bsp_pm_check_init);
>>



Re: [PATCH 04/10] ASoC: imx: Don't use {en,dis}able_fiq() calls

2012-10-15 Thread Mark Brown
On Mon, Oct 15, 2012 at 02:51:28PM -0700, Anton Vorontsov wrote:
> The driver uses platform-specific mxc_set_irq_fiq() with the VIRQ cookie
> passed to it, so it's pretty clear that the driver is absolutely sure
> that the FIQ is routed via platform-specific IC, and that the cookie can
> be used to mask/unmask FIQs. So, let's switch to the genirq routines,
> since we're about to remove FIQ-specific variants.

Acked-by: Mark Brown 


Re: [PATCH 1/2] regulator: core: Check before enabling regulator while setting constraints.

2012-10-15 Thread Mark Brown
On Tue, Oct 16, 2012 at 10:54:19AM +0530, Yadwinder Singh Brar wrote:
> This patch adds a check for whether the regulator is already enabled before
> enabling it while setting machine constraints. Some PMICs use the same
> register bits for setting the opmode and for enabling/disabling the
> regulator, so enabling the regulator without checking its previous status
> overwrites the settings (if any) made by the set_mode/set_suspend_mode
> callbacks.

This sounds like a bug in the driver, these ops are supposed to be
repeatable at will.  The driver needs to remember the mode setting when
doing enable or disable, and setting the mode should not change the
enable status.
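
A minimal sketch of that driver-side behaviour, assuming a single register
field shared between "enable" and "opmode". The sk_ names and field values
are hypothetical and do not describe the MAX77686 register layout; the point
is only that the mode is cached, so enable is repeatable and set_mode never
flips the enable state.

```c
/* Hypothetical shared field: 0 means off, non-zero selects an "on" mode. */
#define SK_OPMODE_MASK    0x3u
#define SK_OPMODE_OFF     0x0u
#define SK_OPMODE_LPM     0x2u
#define SK_OPMODE_NORMAL  0x3u

struct sk_reg {
	unsigned int hw;	/* models the hardware register */
	unsigned int opmode;	/* cached mode to apply whenever enabled */
};

static void sk_init(struct sk_reg *r)
{
	r->hw = SK_OPMODE_OFF;
	r->opmode = SK_OPMODE_NORMAL;
}

static int sk_is_enabled(const struct sk_reg *r)
{
	return (r->hw & SK_OPMODE_MASK) != SK_OPMODE_OFF;
}

/* Enable always writes the remembered mode, so repeating it is harmless. */
static void sk_enable(struct sk_reg *r)
{
	r->hw = (r->hw & ~SK_OPMODE_MASK) | r->opmode;
}

static void sk_disable(struct sk_reg *r)
{
	r->hw = (r->hw & ~SK_OPMODE_MASK) | SK_OPMODE_OFF;
}

/* Setting the mode updates the cache and touches hardware only if enabled. */
static int sk_set_mode(struct sk_reg *r, unsigned int opmode)
{
	if (opmode != SK_OPMODE_LPM && opmode != SK_OPMODE_NORMAL)
		return -1;
	r->opmode = opmode;
	if (sk_is_enabled(r))
		sk_enable(r);
	return 0;
}
```

With this shape the core could call enable at any time, including on an
already enabled regulator, without losing a previously selected mode.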


Re: [PATCH] Fix scheduling-while-atomic problem in console_cpu_notify()

2012-10-15 Thread Srivatsa S. Bhat
On 10/16/2012 10:05 AM, Paul E. McKenney wrote:
> On Mon, Oct 15, 2012 at 05:31:28PM -0700, Paul E. McKenney wrote:
>> The console_cpu_notify() function runs with interrupts disabled in
>> the CPU_DEAD case.  It therefore cannot block, for example, as will
>> happen when it calls console_lock().  Therefore, remove the CPU_DEAD
>> leg of the switch statement to avoid this problem.
>>
>> Signed-off-by: Paul E. McKenney 
> 
> s/CPU_DEAD/CPU_DYING/
> 
> Apparently it is a bad idea to compose and send a patch while in a
> C++ standards committee meeting where people are arguing about async
> futures...  Fixed patch below.
> 
>   Thanx, Paul
> 
> 
> 
> printk: Fix scheduling-while-atomic problem in console_cpu_notify()
> 
> The console_cpu_notify() function runs with interrupts disabled in
> the CPU_DYING case.  It therefore cannot block, for example, as will
> happen when it calls console_lock().  Therefore, remove the CPU_DYING
> leg of the switch statement to avoid this problem.
> 
> Signed-off-by: Paul E. McKenney 
> 

Reviewed-by: Srivatsa S. Bhat 

Regards,
Srivatsa S. Bhat

> diff --git a/kernel/printk.c b/kernel/printk.c
> index 66a2ea3..2d607f4 100644
> --- a/kernel/printk.c
> +++ b/kernel/printk.c
> @@ -1890,7 +1890,6 @@ static int __cpuinit console_cpu_notify(struct notifier_block *self,
>   switch (action) {
>   case CPU_ONLINE:
>   case CPU_DEAD:
> - case CPU_DYING:
>   case CPU_DOWN_FAILED:
>   case CPU_UP_CANCELED:
>   console_lock();
> 




[PATCH 2/2] regulator: max77686: Add set_suspend_disable/set_suspend_mode callbacks.

2012-10-15 Thread Yadwinder Singh Brar
This patch implements the set_suspend_disable callback for BUCKs which
support only switch ON/OFF modes during system suspend state, and
set_suspend_mode callbacks for LDOs which also support Low power mode and
switch ON/OFF modes.

Signed-off-by: Yadwinder Singh Brar 
---
 drivers/regulator/max77686.c |  142 +++--
 1 files changed, 135 insertions(+), 7 deletions(-)

diff --git a/drivers/regulator/max77686.c b/drivers/regulator/max77686.c
index 2a67d08..e83db38 100644
--- a/drivers/regulator/max77686.c
+++ b/drivers/regulator/max77686.c
@@ -69,6 +69,76 @@ struct max77686_data {
struct regulator_dev *rdev[MAX77686_REGULATORS];
 };
 
+/* Some BUCKS supports Normal[ON/OFF] mode during suspend */
+static int max77686_buck_set_suspend_disable(struct regulator_dev *rdev)
+{
+   unsigned int val;
+
+   if (rdev->desc->id == MAX77686_BUCK1)
+   val = 0x1;
+   else
+   val = 0x1 << MAX77686_OPMODE_BUCK234_SHIFT;
+
+   return regmap_update_bits(rdev->regmap, rdev->desc->enable_reg,
+ rdev->desc->enable_mask,
+ val);
+}
+
+/* Some LDOs supports [LPM/Normal]ON mode during suspend state */
+static int max77686_set_suspend_mode(struct regulator_dev *rdev,
+unsigned int mode)
+{
+   unsigned int val;
+
+   /* BUCK[5-9] doesn't support this feature */
+   if (rdev->desc->id >= MAX77686_BUCK5)
+   return 0;
+
+   switch (mode) {
+   case REGULATOR_MODE_IDLE:   /* ON in LP Mode */
+   val = 0x2 << MAX77686_OPMODE_SHIFT;
+   break;
+   case REGULATOR_MODE_NORMAL: /* ON in Normal Mode */
+   val = 0x3 << MAX77686_OPMODE_SHIFT;
+   break;
+   default:
+   pr_warn("%s: regulator_suspend_mode : 0x%x not supported\n",
+   rdev->desc->name, mode);
+   return -EINVAL;
+   }
+
+   return regmap_update_bits(rdev->regmap, rdev->desc->enable_reg,
+ rdev->desc->enable_mask,
+ val);
+}
+
+/* Some LDOs supports LPM-ON/OFF/Normal-ON mode during suspend state */
+static int max77686_ldo_set_suspend_mode(struct regulator_dev *rdev,
+unsigned int mode)
+{
+   unsigned int val;
+
+   switch (mode) {
+   case REGULATOR_MODE_STANDBY:/* switch off */
+   val = 0x1 << MAX77686_OPMODE_SHIFT;
+   break;
+   case REGULATOR_MODE_IDLE:   /* ON in LP Mode */
+   val = 0x2 << MAX77686_OPMODE_SHIFT;
+   break;
+   case REGULATOR_MODE_NORMAL: /* ON in Normal Mode */
+   val = 0x3 << MAX77686_OPMODE_SHIFT;
+   break;
+   default:
+   pr_warn("%s: regulator_suspend_mode : 0x%x not supported\n",
+   rdev->desc->name, mode);
+   return -EINVAL;
+   }
+
+   return regmap_update_bits(rdev->regmap, rdev->desc->enable_reg,
+ rdev->desc->enable_mask,
+ val);
+}
+
 static int max77686_set_ramp_delay(struct regulator_dev *rdev, int ramp_delay)
 {
unsigned int ramp_value = RAMP_RATE_NO_CTRL;
@@ -103,6 +173,31 @@ static struct regulator_ops max77686_ops = {
.get_voltage_sel= regulator_get_voltage_sel_regmap,
.set_voltage_sel= regulator_set_voltage_sel_regmap,
.set_voltage_time_sel   = regulator_set_voltage_time_sel,
+   .set_suspend_mode   = max77686_set_suspend_mode,
+};
+
+static struct regulator_ops max77686_ldo_ops = {
+   .list_voltage   = regulator_list_voltage_linear,
+   .map_voltage= regulator_map_voltage_linear,
+   .is_enabled = regulator_is_enabled_regmap,
+   .enable = regulator_enable_regmap,
+   .disable= regulator_disable_regmap,
+   .get_voltage_sel= regulator_get_voltage_sel_regmap,
+   .set_voltage_sel= regulator_set_voltage_sel_regmap,
+   .set_voltage_time_sel   = regulator_set_voltage_time_sel,
+   .set_suspend_mode   = max77686_ldo_set_suspend_mode,
+};
+
+static struct regulator_ops max77686_buck1_ops = {
+   .list_voltage   = regulator_list_voltage_linear,
+   .map_voltage= regulator_map_voltage_linear,
+   .is_enabled = regulator_is_enabled_regmap,
+   .enable = regulator_enable_regmap,
+   .disable= regulator_disable_regmap,
+   .get_voltage_sel= regulator_get_voltage_sel_regmap,
+   .set_voltage_sel= regulator_set_voltage_sel_regmap,
+   .set_voltage_time_sel   = regulator_set_voltage_time_sel,
+   .set_suspend_disable= 

Re: [RFC][PATCH] perf: Add a few generic stalled-cycles events

2012-10-15 Thread Anshuman Khandual
On 10/15/2012 10:53 PM, Arun Sharma wrote:
> On 10/15/12 8:55 AM, Robert Richter wrote:
> 
> [..]
>> Perf tool works then out-of-the-box with:
>>
>>   $ perf record -e cpu/stalled-cycles-fixed-point/ ...
>>
>> The event string can easily be reused by other architectures as a
>> quasi standard.
> 
> I like Robert's proposal better. It's hard to model all the stall events
> (eg: instruction decoder related stalls on x86) in a hardware
> independent way.
> 
> Another area to think about: software engineers are generally busy and
> have a limited amount of time to devote to hardware event based
> optimizations. The most common question I hear is: what is the expected
> perf gain if I fix this? It's hard to answer that with just the stall
> events.
> 

Hardware event based optimization is a very important aspect of real world
application tuning. CPI stack analysis is a good reason why perf should have
stall events as generic ones. But I am not clear on the situations where we
should consider adding these new generic events into linux/perf_event.h and
the situations where we should go with the sysfs interface.
Could you please elaborate on this?

Regards
Anshuman



[PATCH 1/2] regulator: core: Check before enabling regulator while setting constraints.

2012-10-15 Thread Yadwinder Singh Brar
This patch adds a check for whether the regulator is already enabled before
enabling it while setting machine constraints. Some PMICs use the same
register bits for setting the opmode and for enabling/disabling the
regulator, so enabling the regulator without checking its previous status
overwrites the settings (if any) made by the set_mode/set_suspend_mode
callbacks.

Signed-off-by: Yadwinder Singh Brar 
---
 drivers/regulator/core.c |4 
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/drivers/regulator/core.c b/drivers/regulator/core.c
index f7c74db..9e3a0c7 100644
--- a/drivers/regulator/core.c
+++ b/drivers/regulator/core.c
@@ -958,6 +958,9 @@ static int set_machine_constraints(struct regulator_dev *rdev,
 */
if ((rdev->constraints->always_on || rdev->constraints->boot_on) &&
ops->enable) {
+   if (ops->is_enabled && ops->is_enabled(rdev))
+   goto enabled;
+
ret = ops->enable(rdev);
if (ret < 0) {
rdev_err(rdev, "failed to enable\n");
@@ -965,6 +968,7 @@ static int set_machine_constraints(struct regulator_dev *rdev,
}
}
 
+enabled:
if (rdev->constraints->ramp_delay && ops->set_ramp_delay) {
ret = ops->set_ramp_delay(rdev, rdev->constraints->ramp_delay);
if (ret < 0) {
-- 
1.7.0.4



[PATCH v2] ACPI: move acpi_no_s4_hw_signature() declaration into #ifdef CONFIG_HIBERNATION

2012-10-15 Thread Yuanhan Liu
acpi_no_s4_hw_signature is defined in an #ifdef CONFIG_HIBERNATION block,
but the current code puts the declaration in an #ifdef CONFIG_PM_SLEEP block.

I hit this issue when I turned off the PM_SLEEP config manually:
arch/x86/kernel/acpi/sleep.c:100:4: error: implicit declaration of function ‘acpi_no_s4_hw_signature’ [-Werror=implicit-function-declaration]

v2: use a better title and add the build error message, as suggested by Fengguang

Signed-off-by: Yuanhan Liu 
Reviewed-by: Fengguang Wu 
---
 include/linux/acpi.h |5 -
 1 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/include/linux/acpi.h b/include/linux/acpi.h
index 90be989..a468429 100644
--- a/include/linux/acpi.h
+++ b/include/linux/acpi.h
@@ -257,8 +257,11 @@ int acpi_check_region(resource_size_t start, resource_size_t n,
 
 int acpi_resources_are_enforced(void);
 
-#ifdef CONFIG_PM_SLEEP
+#ifdef CONFIG_HIBERNATION
 void __init acpi_no_s4_hw_signature(void);
+#endif
+
+#ifdef CONFIG_PM_SLEEP
 void __init acpi_old_suspend_ordering(void);
 void __init acpi_nvs_nosave(void);
 #endif /* CONFIG_PM_SLEEP */
-- 
1.7.7.6
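
The rule the patch applies can be shown in a small standalone sketch: the
declaration, the definition and the caller all sit behind the same config
symbol, so no configuration can see a call without a prototype. The SK_ and
sk_ names are hypothetical; in the kernel the symbols come from Kconfig.

```c
/* Hypothetical config switch; SK_CONFIG_PM_SLEEP is deliberately not defined. */
#define SK_CONFIG_HIBERNATION 1

/* Header part: the declaration uses the same guard as the definition. */
#ifdef SK_CONFIG_HIBERNATION
void sk_no_s4_hw_signature(void);
#endif

static int sk_nosigcheck;

/* Definition part, built only with hibernation support. */
#ifdef SK_CONFIG_HIBERNATION
void sk_no_s4_hw_signature(void)
{
	sk_nosigcheck = 1;
}
#endif

/* Caller part: compiles whether or not the sleep option is set. */
static int sk_sleep_setup(void)
{
#ifdef SK_CONFIG_HIBERNATION
	sk_no_s4_hw_signature();
#endif
	return sk_nosigcheck;
}
```

Had the declaration been placed under the undefined SK_CONFIG_PM_SLEEP
instead, the call in sk_sleep_setup() would be an implicit declaration, which
is the class of error quoted in the changelog.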



Re: [PATCH] ACPI: fix the wrong #ifdef for acpi_no_s4_hw_signature

2012-10-15 Thread Yuanhan Liu
On Tue, Oct 16, 2012 at 12:27:13PM +0800, Fengguang Wu wrote:
> The title could be made more descriptive:
> 
> ACPI: move acpi_no_s4_hw_signature() declaration into #ifdef CONFIG_HIBERNATION

Yes, much better.
> 
> On Tue, Oct 16, 2012 at 12:05:03PM +0800, Yuanhan Liu wrote:
> > acpi_no_s4_hw_signature is defined in #ifdef CONFIG_HIBERNATION block,
> > but the current code put the declare in #ifdef CONFIG_PM_SLEEP block.
>  
> And it's better to always include the original build error/warning
> messages when fixing build problems.

Got it. Will send out v2 soon.

Thanks,
Yuanhan Liu

> 
> Otherwise looks good to me.
> 
> Reviewed-by: Fengguang Wu 
> 
> > Signed-off-by: Yuanhan Liu 
> > ---
> >  include/linux/acpi.h |5 -
> >  1 files changed, 4 insertions(+), 1 deletions(-)
> > 
> > diff --git a/include/linux/acpi.h b/include/linux/acpi.h
> > index 90be989..a468429 100644
> > --- a/include/linux/acpi.h
> > +++ b/include/linux/acpi.h
> > @@ -257,8 +257,11 @@ int acpi_check_region(resource_size_t start, resource_size_t n,
> >  
> >  int acpi_resources_are_enforced(void);
> >  
> > -#ifdef CONFIG_PM_SLEEP
> > +#ifdef CONFIG_HIBERNATION
> >  void __init acpi_no_s4_hw_signature(void);
> > +#endif
> > +
> > +#ifdef CONFIG_PM_SLEEP
> >  void __init acpi_old_suspend_ordering(void);
> >  void __init acpi_nvs_nosave(void);
> >  #endif /* CONFIG_PM_SLEEP */
> > -- 
> > 1.7.7.6


Re: [PATCH v1 0/2] x86, apic: Disable BSP if boot cpu is AP

2012-10-15 Thread HATAYAMA Daisuke
From: HATAYAMA Daisuke 
Subject: Re: [PATCH v1 0/2] x86, apic: Disable BSP if boot cpu is AP
Date: Tue, 16 Oct 2012 14:03:13 +0900

> From: "Yu, Fenghua" 
> Subject: RE: [PATCH v1 0/2] x86, apic: Disable BSP if boot cpu is AP
> Date: Tue, 16 Oct 2012 04:51:36 +
> 
>>> -Original Message-
>>> From: HATAYAMA Daisuke [mailto:d.hatay...@jp.fujitsu.com]
>>> Sent: Monday, October 15, 2012 9:35 PM
>>> To: linux-kernel@vger.kernel.org; ke...@lists.infradead.org;
>>> x...@kernel.org
>>> Cc: mi...@elte.hu; t...@linutronix.de; h...@zytor.com; Brown, Len; Yu,
>>> Fenghua; vgo...@redhat.com; ebied...@xmission.com;
>>> grant.lik...@secretlab.ca; rob.herr...@calxeda.com;
>>> d.hatay...@jp.fujitsu.com
>>> Subject: [PATCH v1 0/2] x86, apic: Disable BSP if boot cpu is AP
>>> 
>>> Multiple CPUs are useful for CPU-bound processing like compression and
>>> I do want to use compression to generate crash dump quickly. But now
>>> we cannot wakeup the 2nd and later cpus in the kdump 2nd kernel if
>>> crash happens on AP. If crash happens on AP, kexec enters the 2nd
>>> kernel with the AP, and the BSP is expected to be halting in the
>>> 1st kernel or possibly in some fatal system error state.
>>> 
>>> To wake up AP, we use the method called INIT-INIT-SIPI. INIT causes
>>> BSP to jump into BIOS init code. A typical visible behaviour is hang
>>> or immediate reset, depending on the BIOS init code.
>>> 
>>> AP can be initiated by INIT even in a fatal state: MP spec explains
>>> that processor-specific INIT can be used to recover AP from a fatal
>>> system error. On the other hand, there's no method for BSP to recover;
>>> it might be possible to do so by NMI plus any hand-coded reset code
>>> that is carefully designed, but at least I have no idea in this
>>> direction now.
>> 
>> In my BSP hotplug patchset, BSP is woken up by NMI. The patchset is
>> not in tip tree yet.
>> 
>> BSP hotplug patchset can be found at https://lkml.org/lkml/2012/10/12/336
>> 
>>> 
>>> Therefore, the idea in this patch set is simply to disable BSP if
>>> boot cpu is AP.
>>> 
>> 
>> The BSP hotplug patchset will be useful for your goal. With the BSP hotplug
>> patchset, you can wake up BSP and don't need to disable it.
>> 
>>> My motivation is to use multiple CPUs in order to quickly generate
>>> crash dump on the machine with huge amount of memory. I assume such
>>> machine tends to also have a lot of CPUs. So disabling one CPU would
>>> be no problem.
>> 
>> Luckily you don't need to disable any CPU to achieve your goal with
>> the BSP hotplug patchset :)
>> 
>> On a dual core/single thread machine, this means you get 100% performance
>> boost with BSP's help.
>> 
>> Plus crash dump kernel code is better structured by not treating BSP
>> specially.
>> 
> 
> Hello Fenghua.
> 
> I've of course noticed your patch set and locally tested, but I saw
> NMI to BSP failed in the 2nd kernel. I'll send a log to you later.
> 
> BTW, I tested with your previous v8 patch set. Did you change
> something during v8 to v9 relevant to this issue?
> 

I forgot to mention one thing: your patch distinguishes the BSP by
CPU#0. CPU#0 is assigned to the boot cpu, which can be an AP in the kdump
2nd kernel, so distinguishing the BSP by CPU#0 is not enough here.

I have a local patch set based on your v8 patches that handles this, but
NMI to BSP failed. I guess this comes from the difference in BSP state:
halting in play dead in your NMI method, versus halting in the 1st kernel
on crash, or possibly sitting in a fatal system error state, in the actual
situation.
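
For illustration only (this is not taken from either patch set): on x86
the hardware marks the BSP with the BSP flag, bit 8 of the IA32_APIC_BASE
MSR (address 0x1B), independently of the logical CPU number. A minimal
sketch, assuming the MSR value has already been read on the CPU in
question; the helper name is hypothetical:

```c
#include <stdbool.h>
#include <stdint.h>

/* IA32_APIC_BASE MSR (address 0x1B): bit 8 is the BSP flag. */
#define APIC_BASE_MSR_BSP_BIT (1ULL << 8)

/*
 * Hypothetical helper: decide whether a CPU is the BSP from the value of
 * its own IA32_APIC_BASE MSR. The value must have been read on that CPU.
 */
static bool apic_base_says_bsp(uint64_t apic_base_msr)
{
	return (apic_base_msr & APIC_BASE_MSR_BSP_BIT) != 0;
}
```

Note that this only helps on a CPU that is already running; it does not
say anything about a CPU that has not been woken up yet.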

Thanks.
HATAYAMA, Daisuke

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [GIT PULL] Disintegrate UAPI for xtensa [ver #2]

2012-10-15 Thread Max Filippov
On Tue, Oct 9, 2012 at 1:16 PM, David Howells  wrote:
> Can you merge the following branch into the xtensa tree please.
>
> This is to complete part of the UAPI disintegration for which the preparatory
> patches were pulled recently.
>
> Now that the fixups and the asm-generic chunk have been merged, I've
> regenerated the patches to get rid of those dependencies and to take account of
> any changes made so far in the merge window.  If you have already pulled the
> older version of the branch aimed at you, then please feel free to ignore this
> request.
>
> The following changes since commit 9e2d8656f5e8aa214e66b462680cf86b210b74a8:
>
>   Merge branch 'akpm' (Andrew's patch-bomb) (2012-10-09 16:23:15 +0900)
>
> are available in the git repository at:
>
>
>   git://git.infradead.org/users/dhowells/linux-headers.git tags/disintegrate-xtensa-20121009
>
> for you to fetch changes up to 91a0696e40414e9f1554cd91060f6b404d545cb3:
>
>   UAPI: (Scripted) Disintegrate arch/xtensa/include/asm (2012-10-09 09:47:57 +0100)
>
> 
> UAPI Disintegration 2012-10-09
>
> 
> David Howells (1):
>   UAPI: (Scripted) Disintegrate arch/xtensa/include/asm

Thanks, applied to the xtensa_next tree.

-- 
-- Max


Re: [PATCH 1/2] ASoC: Ux500: Dispose of device nodes correctly

2012-10-15 Thread Mark Brown
On Mon, Oct 15, 2012 at 02:13:25PM +0100, Lee Jones wrote:
> When of_parse_phandle() is used to find a device node, its
> reference count is incremented by the helper. Once we're
> finished with them, it's our responsibility to ensure they
> are freed in the correct manner.

Applied both, thanks.


RE: [PATCH v1 0/2] x86, apic: Disable BSP if boot cpu is AP

2012-10-15 Thread Yu, Fenghua
> >> My motivation is to use multiple CPUs in order to quickly generate
> >> crash dump on the machine with huge amount of memory. I assume such
> >> machine tends to also have a lot of CPUs. So disabling one CPU would
> >> be no problem.
> >
> > Luckily you don't need to disable any CPU to achieve your goal with
> > the BSP hotplug patchset :)
> >
> > On a dual core/single thread machine, this means you get 100% performance
> > boost with BSP's help.
> >
> > Plus crash dump kernel code is better structured by not treating BSP
> > specially.
> >
> 
> Hello Fenghua.
> 
> I've of course noticed your patch set and locally tested, but I saw
> NMI to BSP failed in the 2nd kernel. I'll send a log to you later.
> 
> BTW, I tested with your previous v8 patch set. Did you change
> something during v8 to v9 relevant to this issue?

In patch 0/12 of v9, I describe what changed in v9 on top of v8:

v9: Add Intel vendor check to support the feature on Intel platforms only.

Did you see the BSP wake up failure on the latest tip tree?

There is an RCU regression which prevents the BSP from waking up in 3.6.0.
The issue was fixed on 10/12. The commit is a4fbe35a. Please make sure
your tip tree has this commit.

Thanks.

-Fenghua




Re: [PATCH 1/2] regulator: gpio-regulator: Allow use of GPIO controlled regulators though DT

2012-10-15 Thread Mark Brown
On Mon, Oct 15, 2012 at 02:16:59PM +0100, Lee Jones wrote:
> Here we provide the GPIO Regulator driver with Device Tree capability, so
> that when a platform is booting with DT instead of platform data we can
> still make full use of it.

Not looked at the patch yet but patch 2 doesn't seem to have appeared?


Re: [PATCH 25/25] xtensa: Use Kbuild infrastructure to handle asm-generic headers

2012-10-15 Thread Max Filippov
On Sat, Oct 13, 2012 at 6:26 AM, Steven Rostedt  wrote:
> From: Steven Rostedt 
>
> Use Kbuild infrastructure to handle the asm-generic headers
> and remove the wrapper headers that call them.
>
> This only affects headers that do nothing but include the generic
> equivalent. It does not touch any header that does a little more.
>
> Cc: linux-kbu...@vger.kernel.org
> Cc: linux-xte...@linux-xtensa.org
> Cc: Chris Zankel 
> Cc: Max Filippov 
> Signed-off-by: Steven Rostedt 

Thanks, rebased on top of UAPI changes and applied to
the xtensa_next tree.

-- 
-- Max


Re: mpol_to_str revisited.

2012-10-15 Thread KOSAKI Motohiro
On Mon, Oct 15, 2012 at 11:58 PM, David Rientjes  wrote:
> On Mon, 15 Oct 2012, KOSAKI Motohiro wrote:
>
>> I don't think 80de7c3138ee9fd86a98696fd2cf7ad89b995d0a is right fix.
>
> It's certainly not a complete fix, but I think it's a much better result
> of the race, i.e. we don't panic anymore, we simply fail the read()
> instead.

Even though 80de7c3138ee9fd86a98696fd2cf7ad89b995d0a itself is simple, it pushes
complexity onto the callers. That's not good and not worth it.

>> we should
>> close a race (or kill remain ref count leak) if we still have.
>
> As I mentioned earlier in the thread, the read() is done here on a task
> while only a reference to the task_struct is taken and we do not hold
> task_lock() which is required for task->mempolicy.  Once that is fixed,
> mpol_to_str() should never be called for !task->mempolicy so it will never
> need to return -EINVAL in such a condition.

I agree that's obviously a bug and we should fix it.


Re: CPU utilization between physical CPU and virtual CPU in KVM

2012-10-15 Thread Dennis Chen
Can anybody help with this or offer a few clues?  Thanks!

On Mon, Oct 8, 2012 at 3:01 PM, Dennis Chen  wrote:
> Hi All,
>
> I am confused by the following observed scenario:
>
> In my 4-CPU (KVM supported, 2 core with 2 thread for each) host
> machine box, I create only one VM with 3-vCPU through virsh/libvirt
> tools and also I pin this VM process to the physical processor 3. I
> guess the CPU utilization for the processor 3 will not exceed 100%,
> then I create 3 process (dead loop-- while(1);) and bind each of them
> to vCPU[0-2] respectively, through the "top -c" command in VM
> environment, I can see the CPU utilization for each of the vCPU is
> about 100%, but interestingly, I found that the CPU utilization of
> processor 3 in the host machine is about 300% with the "top -c" command.
> Why can a single process bound to one CPU get ~300% CPU bandwidth
> in this case? Does the kernel scheduler dispatch the idle cycle
> capacity of the other CPUs to the virtual CPUs of the VM; in other words,
> does the scheduler know the vCPU info in the VM process?
>
> For the same case, if I create another 4 new dead-loop processes and
> bind them to the physical CPU[0-3] equally, then I find the vCPU0/1 in
> VM will not be 100%, e.g. 32%. I think the scheduler in the guest OS
> doesn't know it's running in a virtual environment, so the utilization
> of the vCPU should not change to adapt to the physical processor
> utilization. But it did, why?
>
> -org-gnu


Re: [PATCH v1 0/2] x86, apic: Disable BSP if boot cpu is AP

2012-10-15 Thread HATAYAMA Daisuke
From: "Yu, Fenghua" 
Subject: RE: [PATCH v1 0/2] x86, apic: Disable BSP if boot cpu is AP
Date: Tue, 16 Oct 2012 04:51:36 +

>> -Original Message-
>> From: HATAYAMA Daisuke [mailto:d.hatay...@jp.fujitsu.com]
>> Sent: Monday, October 15, 2012 9:35 PM
>> To: linux-kernel@vger.kernel.org; ke...@lists.infradead.org;
>> x...@kernel.org
>> Cc: mi...@elte.hu; t...@linutronix.de; h...@zytor.com; Brown, Len; Yu,
>> Fenghua; vgo...@redhat.com; ebied...@xmission.com;
>> grant.lik...@secretlab.ca; rob.herr...@calxeda.com;
>> d.hatay...@jp.fujitsu.com
>> Subject: [PATCH v1 0/2] x86, apic: Disable BSP if boot cpu is AP
>> 
>> Multiple CPUs are useful for CPU-bound processing like compression and
>> I do want to use compression to generate crash dump quickly. But now
>> we cannot wakeup the 2nd and later cpus in the kdump 2nd kernel if
>> crash happens on AP. If crash happens on AP, kexec enters the 2nd
>> kernel with the AP, and the BSP is expected to be halting in the
>> 1st kernel or possibly sitting in some fatal system error state.
>> 
>> To wake up AP, we use the method called INIT-INIT-SIPI. INIT causes
>> BSP to jump into BIOS init code. A typical visible behaviour is hang
>> or immediate reset, depending on the BIOS init code.
>> 
>> AP can be initiated by INIT even in a fatal state: MP spec explains
>> that processor-specific INIT can be used to recover AP from a fatal
>> system error. On the other hand, there's no method for BSP to recover;
>> it might be possible to do so by NMI plus any hand-coded reset code
>> that is carefully designed, but at least I have no idea in this
>> direction now.
> 
> In my BSP hotplug patchset, the BSP is woken up by NMI. The patchset is
> not in tip tree yet.
> 
> BSP hotplug patchset can be found at https://lkml.org/lkml/2012/10/12/336
> 
>> 
>> Therefore, the idea in this patch set is simply to disable BSP if
>> boot cpu is AP.
>> 
> 
> The BSP hotplug patchset will be useful for your goal. With the BSP hotplug
> patchset, you can wake up BSP and don't need to disable it.
> 
>> My motivation is to use multiple CPUs in order to quickly generate
>> crash dump on the machine with huge amount of memory. I assume such
>> machine tends to also have a lot of CPUs. So disabling one CPU would
>> be no problem.
> 
> Luckily you don't need to disable any CPU to achieve your goal with
> the BSP hotplug patchset :)
> 
> On a dual core/single thread machine, this means you get 100% performance
> boost with BSP's help.
> 
> Plus crash dump kernel code is better structured by not treating BSP
> specially.
> 

Hello Fenghua.

I've of course noticed your patch set and locally tested, but I saw
NMI to BSP failed in the 2nd kernel. I'll send a log to you later.

BTW, I tested with your previous v8 patch set. Did you change
something during v8 to v9 relevant to this issue?

Thanks.
HATAYAMA, Daisuke



RE: [PATCH v1 0/2] x86, apic: Disable BSP if boot cpu is AP

2012-10-15 Thread Yu, Fenghua
> -Original Message-
> From: HATAYAMA Daisuke [mailto:d.hatay...@jp.fujitsu.com]
> Sent: Monday, October 15, 2012 9:35 PM
> To: linux-kernel@vger.kernel.org; ke...@lists.infradead.org;
> x...@kernel.org
> Cc: mi...@elte.hu; t...@linutronix.de; h...@zytor.com; Brown, Len; Yu,
> Fenghua; vgo...@redhat.com; ebied...@xmission.com;
> grant.lik...@secretlab.ca; rob.herr...@calxeda.com;
> d.hatay...@jp.fujitsu.com
> Subject: [PATCH v1 0/2] x86, apic: Disable BSP if boot cpu is AP
> 
> Multiple CPUs are useful for CPU-bound processing like compression and
> I do want to use compression to generate crash dump quickly. But now
> we cannot wakeup the 2nd and later cpus in the kdump 2nd kernel if
> crash happens on AP. If crash happens on AP, kexec enters the 2nd
> kernel with the AP, and the BSP is expected to be halting in the
> 1st kernel or possibly sitting in some fatal system error state.
> 
> To wake up AP, we use the method called INIT-INIT-SIPI. INIT causes
> BSP to jump into BIOS init code. A typical visible behaviour is hang
> or immediate reset, depending on the BIOS init code.
> 
> AP can be initiated by INIT even in a fatal state: MP spec explains
> that processor-specific INIT can be used to recover AP from a fatal
> system error. On the other hand, there's no method for BSP to recover;
> it might be possible to do so by NMI plus any hand-coded reset code
> that is carefully designed, but at least I have no idea in this
> direction now.

In my BSP hotplug patchset, the BSP is woken up by NMI. The patchset is
not in tip tree yet.

BSP hotplug patchset can be found at https://lkml.org/lkml/2012/10/12/336

> 
> Therefore, the idea in this patch set is simply to disable BSP if
> boot cpu is AP.
> 

The BSP hotplug patchset will be useful for your goal. With the BSP hotplug
patchset, you can wake up BSP and don't need to disable it.

> My motivation is to use multiple CPUs in order to quickly generate
> crash dump on the machine with huge amount of memory. I assume such
> machine tends to also have a lot of CPUs. So disabling one CPU would
> be no problem.

Luckily you don't need to disable any CPU to achieve your goal with
the BSP hotplug patchset :)

On a dual core/single thread machine, this means you get 100% performance
boost with BSP's help.

Plus crash dump kernel code is better structured by not treating BSP
specially.

Thanks.

-Fenghua




Re: [PATCH 05/16] f2fs: add checkpoint operations

2012-10-15 Thread NeilBrown
On Sat, 13 Oct 2012 00:49:06 +0900 Jaegeuk Kim  wrote:

> 2012-10-11 (목), 09:24 +1100, NeilBrown:
> > On Fri, 05 Oct 2012 20:59:29 +0900 김재극  wrote:
> > 
> > > +static void do_checkpoint(struct f2fs_sb_info *sbi, bool is_umount)
> > > +{
> > > + struct f2fs_checkpoint *ckpt = F2FS_CKPT(sbi);
> > > + nid_t last_nid = 0;
> > > + int nat_upd_blkoff[3];
> > > + block_t start_blk;
> > > + struct page *cp_page;
> > > + unsigned int data_sum_blocks, orphan_blocks;
> > > + void *kaddr;
> > > + __u32 crc32 = 0;
> > > + int i;
> > > +
> > > + /* Flush all the NAT/SIT pages */
> > > + while (get_pages(sbi, F2FS_DIRTY_META))
> > > + sync_meta_pages(sbi, META, LONG_MAX);
> > > +
> > > + next_free_nid(sbi, &last_nid);
> > > +
> > > + /*
> > > +  * modify checkpoint
> > > +  * version number is already updated
> > > +  */
> > > + ckpt->elapsed_time = cpu_to_le64(get_mtime(sbi));
> > > + ckpt->valid_block_count = cpu_to_le64(valid_user_blocks(sbi));
> > > + ckpt->free_segment_count = cpu_to_le32(free_segments(sbi));
> > > + for (i = 0; i < 3; i++) {
> > > + ckpt->cur_node_segno[i] =
> > > + cpu_to_le32(curseg_segno(sbi, i + CURSEG_HOT_NODE));
> > > + ckpt->cur_node_blkoff[i] =
> > > + cpu_to_le16(curseg_blkoff(sbi, i + CURSEG_HOT_NODE));
> > > + nat_upd_blkoff[i] = NM_I(sbi)->nat_upd_blkoff[i];
> > > + ckpt->nat_upd_blkoff[i] = cpu_to_le16(nat_upd_blkoff[i]);
> > > + ckpt->alloc_type[i + CURSEG_HOT_NODE] =
> > > + curseg_alloc_type(sbi, i + CURSEG_HOT_NODE);
> > > + }
> > > + for (i = 0; i < 3; i++) {
> > > + ckpt->cur_data_segno[i] =
> > > + cpu_to_le32(curseg_segno(sbi, i + CURSEG_HOT_DATA));
> > > + ckpt->cur_data_blkoff[i] =
> > > + cpu_to_le16(curseg_blkoff(sbi, i + CURSEG_HOT_DATA));
> > > + ckpt->alloc_type[i + CURSEG_HOT_DATA] =
> > > + curseg_alloc_type(sbi, i + CURSEG_HOT_DATA);
> > > + }
> > > +
> > > + ckpt->valid_node_count = cpu_to_le32(valid_node_count(sbi));
> > > + ckpt->valid_inode_count = cpu_to_le32(valid_inode_count(sbi));
> > > + ckpt->next_free_nid = cpu_to_le32(last_nid);
> > > +
> > > + /* 2 cp  + n data seg summary + orphan inode blocks */
> > > + data_sum_blocks = npages_for_summary_flush(sbi);
> > > + if (data_sum_blocks < 3)
> > > + ckpt->ckpt_flags |= CP_COMPACT_SUM_FLAG;
> > > + else
> > > + ckpt->ckpt_flags &= (~CP_COMPACT_SUM_FLAG);
> > > +
> > > + orphan_blocks = (sbi->n_orphans + F2FS_ORPHANS_PER_BLOCK - 1)
> > > + / F2FS_ORPHANS_PER_BLOCK;
> > > + ckpt->cp_pack_start_sum = 1 + orphan_blocks;
> > > + ckpt->cp_pack_total_block_count = 2 + data_sum_blocks + orphan_blocks;
> > 
> > This looks a bit weird to me, though I might be misunderstanding something.
> > 
> > data_sum_blocks is either 1, 2, or 3.
> > "3" actually means "at least 3".
> > 
> > If it is 3, you choose not to set CP_COMPACT_SUM_FLAG.  In that case the NAT
> > and SIT journal entries go into SSA blocks, not into the checkpoint at all.
> > > So in that case, zero blocks of the checkpoint are used for journalling.  Yet
> > you still add data_sum_blocks (==3) to the cp_pack_total_block_count (and
> > later to the start block).
> > Is that really what you want to do?  Leave 3 empty blocks?
> > 
> > I would suggest changing npages_for_summary_flush to return 0 if the number
> > of blocks needed would be more than three, and set CP_COMPACT_SUM_FLAG only
> > when data_sum_blocks > 0.
> > 
> > > I don't know if you would need to make a corresponding change to the recovery
> > code, I haven't fully examined that yet.
> 
> Ok, let me explain about CP_COMPACT_SUM_FLAG.
> Let's assume that there are some journal entries and data summaries.
> Note that this scenario is not from the umount procedure.
> 
> Basically f2fs writes three data summary blocks for current active logs
> inside the checkpoint pack.
> And NAT and SIT journal entries are stored in hot and cold data summary
> blocks.
> So, if the CP_COMPACT_SUM_FLAG is not set, f2fs writes the checkpoint
> pack like this.
> 
> [CP 0]
> [Orphan blocks]
> [Hot sum block w/ NAT journal]
> [Warm sum block]
> [Cold sum block w/ SIT journal]
> [CP 0']
> 
> But, if the CP_COMPACT_SUM_FLAG is set, the checkpoint pack consists of
> 1 or 2 summary blocks as follows.
> 
> [CP 0]
> [Orphan blocks]
> [summary entries w/ NAT and SIT journal]
> [CP 0']
> 
> or,
> 
> [CP 0]
> [Orphan blocks]
> [summary entries]
> [summary entries w/ NAT and SIT journal]
> [CP 0']
> 
> So, I think it needs no change.
> Any idea?
> Thanks,

I see.  I missed the fact that the current data summary blocks are always
written to the checkpoint area - I assumed they were being written back to
the SSA.

So it makes sense now and you are right - no change needed.
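
For readers following along, the block accounting quoted above can be
written out as a small stand-alone sketch. This is only an illustration
of the arithmetic in the quoted do_checkpoint(); the struct, the function
name, and the orphans_per_block parameter (standing in for
F2FS_ORPHANS_PER_BLOCK, whose value is not given in this thread) are
assumptions:

```c
#include <stdbool.h>

/* Hypothetical summary of the checkpoint pack layout. */
struct cp_pack_layout {
	unsigned int orphan_blocks;
	unsigned int start_sum;     /* cp_pack_start_sum */
	unsigned int total_blocks;  /* cp_pack_total_block_count */
	bool compact;               /* would CP_COMPACT_SUM_FLAG be set? */
};

static struct cp_pack_layout compute_cp_pack_layout(unsigned int n_orphans,
						    unsigned int orphans_per_block,
						    unsigned int data_sum_blocks)
{
	struct cp_pack_layout l;

	/* Round the orphan inode count up to whole blocks. */
	l.orphan_blocks = (n_orphans + orphans_per_block - 1) / orphans_per_block;
	/* Summaries start right after [CP 0] and the orphan blocks. */
	l.start_sum = 1 + l.orphan_blocks;
	/* Two checkpoint blocks plus summaries plus orphan blocks. */
	l.total_blocks = 2 + data_sum_blocks + l.orphan_blocks;
	/* One or two summary blocks means the compact format is used. */
	l.compact = data_sum_blocks < 3;
	return l;
}
```

With no orphans and three summary blocks this gives a five-block pack
([CP 0], hot, warm, cold, [CP 0']), which matches the first layout in the
explanation above.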

Thanks,
NeilBrown


> 
> > 
> > Regards,
> > NeilBrown
> 





Re: [PATCH 07/16] f2fs: add segment operations

2012-10-15 Thread NeilBrown
On Sat, 13 Oct 2012 00:12:53 +0900 Jaegeuk Kim  wrote:

> 2012-10-11 (목), 09:37 +1100, NeilBrown:
> > On Fri, 05 Oct 2012 21:00:55 +0900 김재극  wrote:
> > 
> > > +/**
> > > + * Find a new segment from the free segments bitmap to right order
> > > + * This function should be returned with success, otherwise BUG
> > > + */
> > > +static void get_new_segment(struct f2fs_sb_info *sbi,
> > > + unsigned int *newseg, bool new_sec, int dir)
> > > +{
> > > + struct free_segmap_info *free_i = FREE_I(sbi);
> > > + unsigned int total_secs = sbi->total_sections;
> > > + unsigned int segno, secno, zoneno;
> > > + unsigned int total_zones = sbi->total_sections / sbi->secs_per_zone;
> > > + unsigned int hint = *newseg >> sbi->log_segs_per_sec;
> > > + unsigned int old_zoneno = GET_ZONENO_FROM_SEGNO(sbi, *newseg);
> > > + unsigned int left_start = hint;
> > > + bool init = true;
> > > + int go_left = 0;
> > > + int i;
> > > +
> > > + write_lock(&free_i->segmap_lock);
> > > +
> > > + if (!new_sec && ((*newseg + 1) % sbi->segs_per_sec)) {
> > > + segno = find_next_zero_bit(free_i->free_segmap,
> > > + TOTAL_SEGS(sbi), *newseg + 1);
> > > + if (segno < TOTAL_SEGS(sbi))
> > > + goto got_it;
> > > + }
> > > +find_other_zone:
> > > + secno = find_next_zero_bit(free_i->free_secmap, total_secs, hint);
> > > + if (secno >= total_secs) {
> > > + if (dir == ALLOC_RIGHT) {
> > > + secno = find_next_zero_bit(free_i->free_secmap,
> > > + total_secs, 0);
> > > + BUG_ON(secno >= total_secs);
> > > + } else {
> > > + go_left = 1;
> > > + left_start = hint - 1;
> > > + }
> > > + }
> > > + if (go_left == 0)
> > > + goto skip_left;
> > > +
> > > + while (test_bit(left_start, free_i->free_secmap)) {
> > > + if (left_start > 0) {
> > > + left_start--;
> > > + continue;
> > > + }
> > > + left_start = find_next_zero_bit(free_i->free_secmap,
> > > + total_secs, 0);
> > > + BUG_ON(left_start >= total_secs);
> > > + break;
> > > + }
> > > + secno = left_start;
> > > +skip_left:
> > > + hint = secno;
> > > + segno = secno << sbi->log_segs_per_sec;
> > > + zoneno = secno / sbi->secs_per_zone;
> > > +
> > > + if (sbi->secs_per_zone == 1)
> > > + goto got_it;
> > > + if (zoneno == old_zoneno)
> > > + goto got_it;
> > > + if (dir == ALLOC_LEFT) {
> > > + if (!go_left && zoneno + 1 >= total_zones)
> > > + goto got_it;
> > > + if (go_left && zoneno == 0)
> > > + goto got_it;
> > > + }
> > > +
> > > + for (i = 0; i < DEFAULT_CURSEGS; i++) {
> > > + struct curseg_info *curseg = CURSEG_I(sbi, i);
> > > +
> > > + if (curseg->zone != zoneno)
> > > + continue;
> > > + if (!init)
> > > + continue;
> > > +
> > > + if (go_left)
> > > + hint = zoneno * sbi->secs_per_zone - 1;
> > > + else if (zoneno + 1 >= total_zones)
> > > + hint = 0;
> > > + else
> > > + hint = (zoneno + 1) * sbi->secs_per_zone;
> > > + init = false;
> > > + goto find_other_zone;
> > > + }
> > 
> > I think this code is correct, but I found it very confusing to read.
> > The point of the loop is simply to find out whether any current segment is
> > using the given zone.  But that isn't obvious; it seems to do more.
> > I would re-write it as:
> > 
> >   for (i = 0; i < DEFAULT_CURSEGS ; i++) {
> >struct curseg_info *curseg = CURSEG_I(sbi, i);
> >if (curseg->zone == zoneno)
> >break;
> >   }
> >   if (i < DEFAULT_CURSEGS && init) {
> > /* Zone is in use,try another */
> > if (go_left)
> > hint = 
> > else if ()
> > hint = 0;
> > else
> > hint = ..;
> > init = false;
> > goto find_other_zone;
> >   }
> > 
> > To me, that makes it much clearer what is happening.
> > 
> 
> Ok. 
> I think it would be better to change it like this to avoid an unnecessary loop.
> 
> /* give up on finding another zone */
> if (!init)
>   goto got_it;
> 
> for (i = 0; i < DEFAULT_CURSEGS; i++) {
>   if (CURSEG_I(sbi, i)->zone == zoneno)
>   break;
> }
> 
> if (i < DEFAULT_CURSEGS) {
>   /* zone is in use, try another */
> if (go_left)
>   hint = 
>   else if ()
>   hint = 0;
>   else
>   hint = ..;
>   init = false;
>   goto find_other_zone;
> }

Yes, that looks good.  Thanks.
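
The hint selection that is elided in both rewrites above can be read off
the original loop body in the patch. As a stand-alone sketch (the
function name is hypothetical, and it assumes zoneno > 0 whenever go_left
is set, which the earlier "go_left && zoneno == 0" check in the quoted
function appears to ensure):

```c
#include <stdbool.h>

/*
 * Pick the section number at which to resume the search once the zone
 * `zoneno` turned out to be in use by a current segment.
 */
static unsigned int next_zone_hint(bool go_left, unsigned int zoneno,
				   unsigned int secs_per_zone,
				   unsigned int total_zones)
{
	if (go_left)
		/* Last section of the zone to the left (needs zoneno > 0). */
		return zoneno * secs_per_zone - 1;
	if (zoneno + 1 >= total_zones)
		/* Already in the last zone: wrap around to the start. */
		return 0;
	/* First section of the zone to the right. */
	return (zoneno + 1) * secs_per_zone;
}
```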


> 
> > > +static void f2fs_end_io_write(struct bio *bio, int err)
> > > +{
> > > + const int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
> > > + struct bio_vec *bvec = bio->bi_io_vec + bio->bi_vcnt - 1;
> > > + struct 

Re: [PATCH RFC] random: Account for entropy loss due to overwrites

2012-10-15 Thread H. Peter Anvin
On 10/15/2012 09:08 PM, Theodore Ts'o wrote:
> On Sat, Sep 29, 2012 at 12:47:04PM -0700, H. Peter Anvin wrote:
>>> -static struct poolinfo {
>>> +static const struct poolinfo {
>>> +   int poolshift;  /* log2(POOLBITS) */
>>> int poolwords;
>>> int tap1, tap2, tap3, tap4, tap5;
> 
> Poolshift is duplicated information; it's just log2(poolwords) + 5
> (since POOLBITS is poolwords*32).
> 
> Granted you don't want to recalculate it every single time you need to
> use it, but perhaps it would be better to add poolshift to struct
> entropy_store, and set it in init_std_data()?
> 

Or we could compute poolwords (and poolbits, and poolbytes) from it,
since shifts generally are cheap.  I don't strongly care, whatever your
preference is.
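
A rough sketch of that alternative, assuming 32-bit pool words so that
poolshift = log2(poolwords) + 5 as noted above; the struct and function
names are made up for illustration and are not from the patch:

```c
/* Hypothetical container for the derived pool sizes. */
struct pool_geometry {
	unsigned int words;
	unsigned int bits;
	unsigned int bytes;
};

/* Derive every size from poolshift = log2(pool size in bits). */
static struct pool_geometry pool_geometry_from_shift(unsigned int poolshift)
{
	struct pool_geometry g;

	g.bits = 1u << poolshift;         /* POOLBITS */
	g.bytes = 1u << (poolshift - 3);  /* 8 bits per byte */
	g.words = 1u << (poolshift - 5);  /* 32 bits per word */
	return g;
}
```

For example, poolshift = 12 corresponds to a 128-word (4096-bit,
512-byte) pool.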

-hpa





Re: [PATCH v9 08/12] x86, hotplug: Wake up CPU0 via NMI instead of INIT, SIPI, SIPI

2012-10-15 Thread HATAYAMA Daisuke
From: Fenghua Yu 
Subject: [PATCH v9 08/12] x86, hotplug: Wake up CPU0 via NMI instead of INIT, SIPI, SIPI
Date: Fri, 12 Oct 2012 09:09:45 -0700



> @@ -1037,6 +1101,8 @@ void __init native_smp_prepare_cpus(unsigned int max_cpus)
>*/
>   setup_local_APIC();
>  
> + cpu0_logical_apicid = GET_APIC_LOGICAL_ID(apic_read(APIC_LDR));
> +

In x2apic mode, the logical apicid occupies the whole 32 bits of LDR,
but GET_APIC_LOGICAL_ID returns only bits 31-24, which is correct only
for xapic mode.
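
To make the difference concrete, here is a small sketch. The two helper
names are hypothetical; the xapic one mirrors the "bits 31-24" extraction
described above, and the x2apic one keeps the full register value:

```c
#include <stdint.h>

/* xapic mode: the logical ID lives in bits 31-24 of LDR. */
static uint32_t xapic_logical_id(uint32_t ldr)
{
	return (ldr >> 24) & 0xFFu;
}

/* x2apic mode: the whole 32-bit LDR value is the logical ID. */
static uint32_t x2apic_logical_id(uint32_t ldr)
{
	return ldr;
}
```

With an x2apic LDR value such as 0x00010004, the xapic-style extraction
yields 0 and the cluster and bit information is lost.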

Thanks.
HATAYAMA, Daisuke



Re: [PATCH] make GFP_NOTRACK flag unconditional

2012-10-15 Thread Andrew Morton
On Mon, 15 Oct 2012 21:02:45 -0700 (PDT) David Rientjes wrote:

> On Tue, 2 Oct 2012, David Rientjes wrote:
> 
> > > There was a general sentiment in a recent discussion (See
> > > https://lkml.org/lkml/2012/9/18/258) that the __GFP flags should be
> > > defined unconditionally. Currently, the only offender is GFP_NOTRACK,
> > > which is conditional to KMEMCHECK.
> > > 
> > > This simple patch makes it unconditional.
> > > 
> > > Signed-off-by: Glauber Costa 
> > > CC: Christoph Lameter 
> > > CC: Mel Gorman 
> > > CC: Andrew Morton 
> > 
> > Acked-by: David Rientjes 
> > 
> > I think it was done this way to show that if CONFIG_KMEMCHECK=n then the 
> > bit could be reused for something else but I can't think of any reason why 
> > that would be useful; what would need to add a gfp bit that would also 
> > happen to depend on CONFIG_KMEMCHECK=n?  Nothing comes to mind to save a 
> > bit.
> > 
> > There are other cases of this as well, like __GFP_OTHER_NODE which is only 
> > useful for thp and it's defined unconditionally.  So this seems fine to 
> > me.
> > 
> 
> Still missing from linux-next as of this morning, I think this patch 
> should be merged.

It's in 3.7-rc1.

commit 3e648ebe076390018c317881d7d926f24d7bac6b
Author: Glauber Costa 
Date:   Mon Oct 8 16:33:52 2012 -0700

make GFP_NOTRACK definition unconditional



Re: [PATCH] perf probe: convert_name_to_addr() allocated the wrong size buffer for a function name

2012-10-15 Thread Srikar Dronamraju
* Masami Hiramatsu  [2012-10-16 13:19:57]:

> (2012/10/16 10:37), Hyeoncheol Lee wrote:
> > convert_name_to_addr() allocated sizeof(char *) * MAX_PROBE_ARGS
> > bytes for a function name
> 
> Yeah, that one was from my laziness...
> 

Guess not your fault, but mine.

> > 
> > Cc: Masami Hiramatsu 
> > Cc: Srikar Dronamraju 
> > Signed-off-by: Hyeoncheol Lee 
> > ---
> >  tools/perf/util/probe-event.c |5 +++--
> >  1 file changed, 3 insertions(+), 2 deletions(-)
> > 
> > diff --git a/tools/perf/util/probe-event.c b/tools/perf/util/probe-event.c
> > index 49a256e..bb40ed4 100644
> > --- a/tools/perf/util/probe-event.c
> > +++ b/tools/perf/util/probe-event.c
> > @@ -2352,13 +2352,14 @@ static int convert_name_to_addr(struct perf_probe_event *pev, const char *exec)
> > free(exec_copy);
> > }
> > free(pp->function);
> > -   pp->function = zalloc(sizeof(char *) * MAX_PROBE_ARGS);
> > +   pp->function = zalloc(sizeof(char) *
> > + (3 + sizeof(unsigned long long) * 2));
> 
> Could you comment that this is enough long here?

Also can we move the arith into a macro?

> 
> > if (!pp->function) {
> > ret = -ENOMEM;
> > pr_warning("Failed to allocate memory by zalloc.\n");
> > goto out;
> > }
> > -   e_snprintf(pp->function, MAX_PROBE_ARGS, "0x%llx", vaddr);
> > +   sprintf(pp->function, "0x%llx", vaddr);
> 
> And at least we should use snprintf instead of sprintf...
> (I think ret = e_snprintf(...) is better)
> 

Agree.
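
Something along these lines, perhaps. This is only a sketch of the two
suggestions combined (a named macro for the size plus a bounded print);
the macro and function names are made up, and plain snprintf() stands in
for perf's e_snprintf() so the example is self-contained:

```c
#include <stdio.h>

/* "0x" prefix + two hex digits per byte + terminating NUL. */
#define VADDR_STR_LEN (2 + sizeof(unsigned long long) * 2 + 1)

/*
 * Format vaddr as "0x..." into buf. Returns the number of characters
 * written, or -1 if the output was truncated or formatting failed.
 */
static int format_vaddr(char *buf, size_t len, unsigned long long vaddr)
{
	int n = snprintf(buf, len, "0x%llx", vaddr);

	if (n < 0 || (size_t)n >= len)
		return -1;
	return n;
}
```

A buffer of VADDR_STR_LEN bytes is enough even for the largest value,
since every byte of an unsigned long long prints as at most two hex
digits.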

> > ret = 0;
> >  
> >  out:
> > 
> 

-- 
Thanks and Regards
Srikar



Re: [PATCH] Fix scheduling-while-atomic problem in console_cpu_notify()

2012-10-15 Thread Paul E. McKenney
On Mon, Oct 15, 2012 at 05:31:28PM -0700, Paul E. McKenney wrote:
> The console_cpu_notify() function runs with interrupts disabled in
> the CPU_DEAD case.  It therefore cannot block, for example, as will
> happen when it calls console_lock().  Therefore, remove the CPU_DEAD
> leg of the switch statement to avoid this problem.
> 
> Signed-off-by: Paul E. McKenney 

s/CPU_DEAD/CPU_DYING/

Apparently it is a bad idea to compose and send a patch while in a
C++ standards committee meeting where people are arguing about async
futures...  Fixed patch below.

Thanx, Paul



printk: Fix scheduling-while-atomic problem in console_cpu_notify()

The console_cpu_notify() function runs with interrupts disabled in
the CPU_DYING case.  It therefore cannot block, for example, as will
happen when it calls console_lock().  Therefore, remove the CPU_DYING
leg of the switch statement to avoid this problem.

Signed-off-by: Paul E. McKenney 

diff --git a/kernel/printk.c b/kernel/printk.c
index 66a2ea3..2d607f4 100644
--- a/kernel/printk.c
+++ b/kernel/printk.c
@@ -1890,7 +1890,6 @@ static int __cpuinit console_cpu_notify(struct notifier_block *self,
switch (action) {
case CPU_ONLINE:
case CPU_DEAD:
-   case CPU_DYING:
case CPU_DOWN_FAILED:
case CPU_UP_CANCELED:
console_lock();



[PATCH v1 2/2] x86, apic: Disable BSP if boot cpu is AP

2012-10-15 Thread HATAYAMA Daisuke
We disable BSP if boot cpu is AP.

INIT-INIT-SIPI sequence, a protocol to initiate AP, cannot be used for
BSP since it causes BSP to jump to BIOS init code; typical visible
behaviour is hang or immediate reset, depending on the BIOS init code.

INIT can be used to reset AP in a fatal system error state as
described in MP spec 3.7.3 Processor-specific INIT. In contrast, there
is no processor-specific INIT for BSP to initialize from a fatal system
error. It might be possible to do so by NMI plus any hand-crafted
reset code that is carefully designed, but at least I have no idea in
this direction now.

By the way, my motivation is to generate crash dump quickly on the
system with huge memory. I think we can assume such system also has a
lot of cpus. If so, it would be no problem if only one cpu gets
unavailable.

We lookup ACPI table or MP table to get BSP information because we
cannot run rdmsr instruction on the CPU we are about to wake up just
now.

One thing to be concerned about here is that ACPI guidlines BIOS
*should* list the BSP in the first MADT LAPIC entry; not *must*. In
this sense, this logic relis on BIOS following ACPI's guideline. On
the other hand, we don't need to worry about this in MP table case
because it has explit BSP flag.
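
As a rough illustration of the ACPI path only (a user-space sketch, not the kernel code): only the very first LAPIC entry seen may be flagged as the BSP. The struct and function names are hypothetical; the two counters mirror num_processors and disabled_cpus from the patch, and the sketch assumes disabled entries are counted and skipped before the BSP test is reached.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of one MADT LAPIC entry. */
struct lapic_entry {
	int apicid;
	bool enabled;
};

/*
 * Fill isbsp[i] for each entry.  An entry is flagged only if it is
 * enabled and no entry, enabled or disabled, was seen before it.
 */
static void mark_bsp(const struct lapic_entry *entries, size_t n, bool *isbsp)
{
	unsigned int num_processors = 0, disabled_cpus = 0;

	for (size_t i = 0; i < n; i++) {
		isbsp[i] = false;
		if (!entries[i].enabled) {
			disabled_cpus++;
			continue;
		}
		if (!num_processors && !disabled_cpus)
			isbsp[i] = true;
		num_processors++;
	}
}
```

Under this rule a table whose first entry is disabled yields no BSP at all, which is one way a BIOS that ignores the guideline can defeat the heuristic.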

To avoid any undesirable behaviour caused by a broken BIOS that
doesn't conform to the guideline, it's enough to limit the number of
cpus to 1 by specifying maxcpus=1 or nr_cpus=1, as is currently done in
the default kdump configuration. (Of course, it's problematic in the
maxcpus=1 case if user space tries to wake up other cpus later.)

Some firmware features such as hibernation and suspend need to switch
to the BSP before handing execution over to firmware, so these
features are unavailable in the BSP-disabled setting. This is no
problem because we don't need hibernation and suspend in the kdump 2nd
kernel.

SFI and devicetree don't provide BSP information, so there's no
functional change in their code; they only pass false for all the
entries to keep the interface uniform.

Signed-off-by: HATAYAMA Daisuke 
---

 arch/x86/include/asm/mpspec.h |2 +-
 arch/x86/kernel/acpi/boot.c   |   10 +-
 arch/x86/kernel/apic/apic.c   |   21 -
 arch/x86/kernel/devicetree.c  |2 +-
 arch/x86/kernel/mpparse.c |   15 +--
 arch/x86/platform/sfi/sfi.c   |2 +-
 6 files changed, 45 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/mpspec.h b/arch/x86/include/asm/mpspec.h
index d56f253..b5d8e23 100644
--- a/arch/x86/include/asm/mpspec.h
+++ b/arch/x86/include/asm/mpspec.h
@@ -97,7 +97,7 @@ static inline void early_reserve_e820_mpc_new(void) { }
 #define default_get_smp_config x86_init_uint_noop
 #endif
 
-void __cpuinit generic_processor_info(int apicid, int version);
+void __cpuinit generic_processor_info(int apicid, bool isbsp, int version);
 #ifdef CONFIG_ACPI
 extern void mp_register_ioapic(int id, u32 address, u32 gsi_base);
 extern void mp_override_legacy_irq(u8 bus_irq, u8 polarity, u8 trigger,
diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
index e651f7a..e873c09 100644
--- a/arch/x86/kernel/acpi/boot.c
+++ b/arch/x86/kernel/acpi/boot.c
@@ -198,6 +198,7 @@ static int __init acpi_parse_madt(struct acpi_table_header *table)
 static void __cpuinit acpi_register_lapic(int id, u8 enabled)
 {
unsigned int ver = 0;
+   bool isbsp = false;
 
if (id >= (MAX_LOCAL_APIC-1)) {
printk(KERN_INFO PREFIX "skipped apicid that is too big\n");
@@ -212,7 +213,14 @@ static void __cpuinit acpi_register_lapic(int id, u8 enabled)
if (boot_cpu_physical_apicid != -1U)
ver = apic_version[boot_cpu_physical_apicid];
 
-   generic_processor_info(id, ver);
+   /*
+* ACPI says BIOS should list BSP in the first MADT LAPIC
+* entry.
+*/
+   if (!num_processors && !disabled_cpus)
+   isbsp = true;
+
+   generic_processor_info(id, isbsp, ver);
 }
 
 static int __init
diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index d8d69e4..4184853 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -2034,13 +2034,32 @@ void disconnect_bsp_APIC(int virt_wire_setup)
apic_write(APIC_LVT1, value);
 }
 
-void __cpuinit generic_processor_info(int apicid, int version)
+void __cpuinit generic_processor_info(int apicid, bool isbsp, int version)
 {
int cpu, max = nr_cpu_ids;
bool boot_cpu_detected = physid_isset(boot_cpu_physical_apicid,
phys_cpu_present_map);
 
/*
+* If boot cpu is AP, we now don't have any way to initialize
+* BSP. To save memory consumed, we disable BSP this case.
+*
+* Then, we cannot use the features specific to BSP such as
+* hibernation and suspend. This is no problem because AP
+* becomes boot cpu only on kexec triggered by crash.
+*/
+   

[PATCH v1 1/2] x86, apic: Introduce boot_cpu_is_bsp indicating whether boot cpu is BSP or not

2012-10-15 Thread HATAYAMA Daisuke
Part of boot-up code assumes booting CPU is BSP, but kexec can enter
the 2nd kernel with AP. To be able to distinguish these throughout
kernel processing, introduce boot_cpu_is_bsp.

Signed-off-by: HATAYAMA Daisuke 
---

 arch/x86/include/asm/mpspec.h |3 +++
 arch/x86/kernel/apic/apic.c   |   13 +
 arch/x86/kernel/setup.c   |2 ++
 3 files changed, 18 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/mpspec.h b/arch/x86/include/asm/mpspec.h
index 3e2f42a..d56f253 100644
--- a/arch/x86/include/asm/mpspec.h
+++ b/arch/x86/include/asm/mpspec.h
@@ -47,10 +47,13 @@ extern int mp_bus_id_to_type[MAX_MP_BUSSES];
 extern DECLARE_BITMAP(mp_bus_not_pci, MAX_MP_BUSSES);
 
 extern unsigned int boot_cpu_physical_apicid;
+extern bool boot_cpu_is_bsp;
 extern unsigned int max_physical_apicid;
 extern int mpc_default_type;
 extern unsigned long mp_lapic_addr;
 
+extern void boot_cpu_is_bsp_init(void);
+
 #ifdef CONFIG_X86_LOCAL_APIC
 extern int smp_found_config;
 #else
diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index b17416e..d8d69e4 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -62,6 +62,10 @@ unsigned disabled_cpus __cpuinitdata;
 /* Processor that is doing the boot up */
 unsigned int boot_cpu_physical_apicid = -1U;
 
+/* Indicates whether the processor that is doing the boot up, is BSP
+ * processor or not */
+bool boot_cpu_is_bsp;
+
 /*
  * The highest APIC ID seen during enumeration.
  */
@@ -2515,3 +2519,12 @@ static int __init lapic_insert_resource(void)
  * that is using request_resource
  */
 late_initcall(lapic_insert_resource);
+
+void boot_cpu_is_bsp_init(void)
+{
+   u32 l, h;
+
+   rdmsr(MSR_IA32_APICBASE, l, h);
+
+   boot_cpu_is_bsp = (l & MSR_IA32_APICBASE_BSP) ? true : false;
+}
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index a2bb18e..6ecb9bc 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -988,6 +988,8 @@ void __init setup_arch(char **cmdline_p)
 
early_quirks();
 
+   boot_cpu_is_bsp_init();
+
/*
 * Read APIC and some other early information from ACPI tables.
 */
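
For readers without the SDM at hand, the test in boot_cpu_is_bsp_init() boils down to checking bit 8 of the low half of the IA32_APIC_BASE MSR. The following user-space sketch shows just that bit test; the macro and function names are hypothetical, and the sample values in the usage note are made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit 8 of IA32_APIC_BASE is the BSP flag. */
#define APICBASE_BSP_BIT (UINT32_C(1) << 8)

/* 'low' is the low 32 bits of the MSR, as rdmsr() would return in 'l'. */
static bool apicbase_low_is_bsp(uint32_t low)
{
	/* The comparison already yields a bool; no "? true : false" needed. */
	return (low & APICBASE_BSP_BIT) != 0;
}
```

For example, a low word of 0xfee00900 (APIC enabled, BSP flag set) gives true, while 0xfee00800 (APIC enabled, BSP flag clear) gives false.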



[PATCH v1 0/2] x86, apic: Disable BSP if boot cpu is AP

2012-10-15 Thread HATAYAMA Daisuke
Multiple CPUs are useful for CPU-bound processing like compression, and
I do want to use compression to generate crash dumps quickly. But
currently we cannot wake up the 2nd and later cpus in the kdump 2nd
kernel if the crash happens on an AP. In that case kexec enters the 2nd
kernel on the AP, and the BSP of the 1st kernel is expected to be
halted or possibly in some fatal system error state.

To wake up an AP, we use the method called INIT-INIT-SIPI. When sent to
the BSP, however, INIT causes it to jump into the BIOS init code. The
typical visible behaviour is a hang or an immediate reset, depending on
the BIOS init code.

An AP can be initialized by INIT even in a fatal state: the MP spec
explains that processor-specific INIT can be used to recover an AP from a
fatal system error. On the other hand, there's no method for the BSP to
recover; it might be possible to do so by NMI plus some carefully
designed hand-coded reset code, but at least I have no idea in this
direction now.

Therefore, the idea in this patch set is simply to disable the BSP if
the boot cpu is an AP.

My motivation is to use multiple CPUs in order to quickly generate
a crash dump on a machine with a huge amount of memory. I assume such
machines also tend to have a lot of CPUs, so disabling one CPU would
be no problem.

On most BIOSes, the BSP is always assigned to cpu#1; on other BIOSes, the
BSP could probably be assigned to a fixed cpu number. Assuming this,
an alternative would be to wake up every cpu except cpu#1. But I don't
choose this in this patch set because:

- It's an ugly design to keep a switch in sysfs that can unintentionally
  cause the system to enter undefined behaviour.

- Memory space for BSP is never used if BSP is not running. Amount of
  reserved memory for 2nd kernel is typically from 128MB to 512MB
  only, severely limited. If BSP is unused, I want to use the space
  for another AP instead.

Note: recent upstream kernel fails reserving memory for kdump 2nd
kernel. To run kdump, please apply the patch below on top of this
patch set:
https://lkml.org/lkml/2012/8/31/238

---

HATAYAMA Daisuke (2):
  x86, apic: Disable BSP if boot cpu is AP
  x86, apic: Introduce boot_cpu_is_bsp indicating whether boot cpu is BSP or not


 arch/x86/include/asm/mpspec.h |5 -
 arch/x86/kernel/acpi/boot.c   |   10 +-
 arch/x86/kernel/apic/apic.c   |   34 +-
 arch/x86/kernel/devicetree.c  |2 +-
 arch/x86/kernel/mpparse.c |   15 +--
 arch/x86/kernel/setup.c   |2 ++
 arch/x86/platform/sfi/sfi.c   |2 +-
 7 files changed, 63 insertions(+), 7 deletions(-)

-- 
Thanks.
HATAYAMA, Daisuke


Re: [PATCH v4 1/2] x86, pci: Reset PCIe devices at boot time

2012-10-15 Thread Takao Indoh

(2012/10/16 3:36), Yinghai Lu wrote:
> On Mon, Oct 15, 2012 at 12:00 AM, Takao Indoh wrote:
>> This patch resets PCIe devices at boot time by hot reset when
>> "reset_devices" is specified.
>
> how about pci devices that domain_nr is not zero ?

This patch does not support multiple domains yet.



Signed-off-by: Takao Indoh 
---
  arch/x86/include/asm/pci-direct.h |1
  arch/x86/kernel/setup.c   |3
  arch/x86/pci/early.c  |  344 
  include/linux/pci.h   |2
  init/main.c   |4
  5 files changed, 352 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/pci-direct.h b/arch/x86/include/asm/pci-direct.h
index b1e7a45..de30db2 100644
--- a/arch/x86/include/asm/pci-direct.h
+++ b/arch/x86/include/asm/pci-direct.h
@@ -18,4 +18,5 @@ extern int early_pci_allowed(void);
  extern unsigned int pci_early_dump_regs;
  extern void early_dump_pci_device(u8 bus, u8 slot, u8 func);
  extern void early_dump_pci_devices(void);
+extern void early_reset_pcie_devices(void);
  #endif /* _ASM_X86_PCI_DIRECT_H */
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index a2bb18e..73d3425 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -987,6 +987,9 @@ void __init setup_arch(char **cmdline_p)
 generic_apic_probe();

 early_quirks();
+#ifdef CONFIG_PCI
+   early_reset_pcie_devices();
+#endif

 /*
  * Read APIC and some other early information from ACPI tables.
diff --git a/arch/x86/pci/early.c b/arch/x86/pci/early.c
index d1067d5..683b30f 100644
--- a/arch/x86/pci/early.c
+++ b/arch/x86/pci/early.c
@@ -1,5 +1,6 @@
  #include 
  #include 
+#include 
  #include 
  #include 
  #include 
@@ -109,3 +110,346 @@ void early_dump_pci_devices(void)
 }
 }
  }
+
+#define PCI_EXP_SAVE_REGS  7
+#define pcie_cap_has_devctl(type, flags)   1
+#define pcie_cap_has_lnkctl(type, flags)   \
+   ((flags & PCI_EXP_FLAGS_VERS) > 1 ||\
+(type == PCI_EXP_TYPE_ROOT_PORT || \
+ type == PCI_EXP_TYPE_ENDPOINT ||  \
+ type == PCI_EXP_TYPE_LEG_END))
+#define pcie_cap_has_sltctl(type, flags)   \
+   ((flags & PCI_EXP_FLAGS_VERS) > 1 ||\
+((type == PCI_EXP_TYPE_ROOT_PORT) ||   \
+ (type == PCI_EXP_TYPE_DOWNSTREAM &&   \
+  (flags & PCI_EXP_FLAGS_SLOT))))
+#define pcie_cap_has_rtctl(type, flags)\
+   ((flags & PCI_EXP_FLAGS_VERS) > 1 ||\
+(type == PCI_EXP_TYPE_ROOT_PORT || \
+ type == PCI_EXP_TYPE_RC_EC))
+
+struct save_config {
+   u32 pci[16];
+   u16 pcie[PCI_EXP_SAVE_REGS];
+};
+
+struct pcie_dev {
+   int cap;   /* position of PCI Express capability */
+   int flags; /* PCI_EXP_FLAGS */
+   struct save_config save; /* saved configuration registers */
+};
+
+struct pcie_port {
+   struct list_head dev;
+   u8 secondary;
+   struct pcie_dev child[PCI_MAX_FUNCTIONS];
+};
+
+static LIST_HEAD(device_list);
+static void __init pci_udelay(int loops)
+{
+   while (loops--) {
+   /* Approximately 1 us */
+   native_io_delay();
+   }
+}
+
+/* Derived from drivers/pci/pci.c */
+#define PCI_FIND_CAP_TTL   48
+static int __init __pci_find_next_cap_ttl(u8 bus, u8 slot, u8 func,
+ u8 pos, int cap, int *ttl)
+{
+   u8 id;
+
+   while ((*ttl)--) {
+   pos = read_pci_config_byte(bus, slot, func, pos);
+   if (pos < 0x40)
+   break;
+   pos &= ~3;
+   id = read_pci_config_byte(bus, slot, func,
+   pos + PCI_CAP_LIST_ID);
+   if (id == 0xff)
+   break;
+   if (id == cap)
+   return pos;
+   pos += PCI_CAP_LIST_NEXT;
+   }
+   return 0;
+}
+
+static int __init __pci_find_next_cap(u8 bus, u8 slot, u8 func, u8 pos, int cap)
+{
+   int ttl = PCI_FIND_CAP_TTL;
+
+   return __pci_find_next_cap_ttl(bus, slot, func, pos, cap, &ttl);
+}
+
+static int __init __pci_bus_find_cap_start(u8 bus, u8 slot, u8 func,
+  u8 hdr_type)
+{
+   u16 status;
+
+   status = read_pci_config_16(bus, slot, func, PCI_STATUS);
+   if (!(status & PCI_STATUS_CAP_LIST))
+   return 0;
+
+   switch (hdr_type) {
+   case PCI_HEADER_TYPE_NORMAL:
+   case PCI_HEADER_TYPE_BRIDGE:
+   return PCI_CAPABILITY_LIST;
+   case PCI_HEADER_TYPE_CARDBUS:
+   return PCI_CB_CAPABILITY_LIST;
+   default:
+   return 0;
+   }
+
+   return 0;
+}
+
+static int __init early_pci_find_capability(u8 bus, u8 slot, u8 func, int cap)
+{
+   int pos;
+   u8 type = 
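
The capability-list walk in the quoted patch can be tried outside the kernel. The sketch below runs the same loop over a 256-byte array standing in for one function's config space. The find_cap() name and the array-based access are assumptions made for illustration; the patch itself reads through read_pci_config_byte().

```c
#include <stdint.h>

#define CAP_LIST_ID	0	/* offset of the capability ID in an entry */
#define CAP_LIST_NEXT	1	/* offset of the next pointer in an entry */
#define FIND_CAP_TTL	48	/* same bound as PCI_FIND_CAP_TTL in the patch */

/*
 * Walk the capability list in a config-space image.  'pos' is the
 * offset of the byte holding the pointer to the first entry (0x34 for
 * a normal header).  Returns the offset of the capability whose ID is
 * 'cap', or 0 if it is not found.
 */
static int find_cap(const uint8_t cfg[256], uint8_t pos, int cap)
{
	int ttl = FIND_CAP_TTL;

	while (ttl--) {
		pos = cfg[pos];
		if (pos < 0x40)		/* pointers below 0x40 end the list */
			break;
		pos &= ~3;		/* entries are dword aligned */
		uint8_t id = cfg[pos + CAP_LIST_ID];
		if (id == 0xff)		/* device not responding */
			break;
		if (id == cap)
			return pos;
		pos += CAP_LIST_NEXT;
	}
	return 0;
}
```

The ttl counter is what keeps the walk finite on a malformed list whose next pointers form a loop: after 48 steps the function gives up and returns 0.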

Re: [PATCH] ACPI: fix the wrong #ifdef for acpi_no_s4_hw_signature

2012-10-15 Thread Fengguang Wu
The title could be made more descriptive:

ACPI: move acpi_no_s4_hw_signature() declaration into #ifdef CONFIG_HIBERNATION

On Tue, Oct 16, 2012 at 12:05:03PM +0800, Yuanhan Liu wrote:
> acpi_no_s4_hw_signature is defined in #ifdef CONFIG_HIBERNATION block,
> but the current code put the declare in #ifdef CONFIG_PM_SLEEP block.
 
And it's better to always include the original build error/warning
messages when fixing build problems.

Otherwise looks good to me.

Reviewed-by: Fengguang Wu 

> Signed-off-by: Yuanhan Liu 
> ---
>  include/linux/acpi.h |5 -
>  1 files changed, 4 insertions(+), 1 deletions(-)
> 
> diff --git a/include/linux/acpi.h b/include/linux/acpi.h
> index 90be989..a468429 100644
> --- a/include/linux/acpi.h
> +++ b/include/linux/acpi.h
> @@ -257,8 +257,11 @@ int acpi_check_region(resource_size_t start, resource_size_t n,
>  
>  int acpi_resources_are_enforced(void);
>  
> -#ifdef CONFIG_PM_SLEEP
> +#ifdef CONFIG_HIBERNATION
>  void __init acpi_no_s4_hw_signature(void);
> +#endif
> +
> +#ifdef CONFIG_PM_SLEEP
>  void __init acpi_old_suspend_ordering(void);
>  void __init acpi_nvs_nosave(void);
>  #endif /* CONFIG_PM_SLEEP */
> -- 
> 1.7.7.6


Re: linux-next: build failure after merge of the final tree

2012-10-15 Thread Al Viro
On Tue, Oct 16, 2012 at 02:50:29PM +1100, Stephen Rothwell wrote:
> Hi Al,
> 
> After merging the final tree, today's linux-next build (sparc64 defconfig)
> failed like this:
> 
> arch/sparc/kernel/head_64.o: In function `sys64_execve':
> (.text+0x1f58): relocation truncated to fit: R_SPARC_WDISP19 against symbol `sys_execve' defined in .text section in fs/built-in.o
> arch/sparc/kernel/head_64.o: In function `sys32_execve':
> (.text+0x1f64): relocation truncated to fit: R_SPARC_WDISP19 against symbol `compat_sys_execve' defined in .text section in fs/built-in.o
> 
> Probably caused by commit 3223f8aab885 ("sparc64: convert to generic
> execve") and following from the signal tree.
> 
> I have added this patch you suggested on IRC:
> 
> From: Stephen Rothwell 
> Date: Tue, 16 Oct 2012 14:43:51 +1100
> Subject: [PATCH] sparc: fixup for conversion to generic execve
> 
> Fixes these errors:
> 
> arch/sparc/kernel/head_64.o: In function `sys64_execve':
> (.text+0x1f58): relocation truncated to fit: R_SPARC_WDISP19 against symbol `sys_execve' defined in .text section in fs/built-in.o
> arch/sparc/kernel/head_64.o: In function `sys32_execve':
> (.text+0x1f64): relocation truncated to fit: R_SPARC_WDISP19 against symbol `compat_sys_execve' defined in .text section in fs/built-in.o
> 
> Dictated-by: Al Viro 
> Signed-off-by: Stephen Rothwell 
> ---
>  arch/sparc/kernel/syscalls.S |   12 
>  1 file changed, 8 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/sparc/kernel/syscalls.S b/arch/sparc/kernel/syscalls.S
> index 4bae096..f667cdf 100644
> --- a/arch/sparc/kernel/syscalls.S
> +++ b/arch/sparc/kernel/syscalls.S
> @@ -2,15 +2,19 @@
>* environment settings are the same as the calling processes.
>*/
>  sys64_execve:
> - ba,pt   %xcc,sys_execve
> -  flushw
> + flushw
> + mov %o7, %l5
> + call sys_execve
> +  mov %l5, %o7
>  
>  #ifdef CONFIG_COMPAT
>  sunos_execv:
>   mov %g0, %o2
>  sys32_execve:
> - ba,pt   %xcc,compat_sys_execve
> -  flushw
> + flushw
> + mov %o7, %l5
> + call compat_sys_execve
> +  mov %l5, %o7
>  #endif

BTW, that's really quick and dirty; I'm not at all sure we need that
flushw there, which could make things much simpler.  Namely, kill
sys64_execve completely, making it equivalent to sys_execve(), do the
same to sys32_execve() (== compat_sys_execve()) and as for sunos_execv(),
I'd simply put it into sys_sparc32.c as
SYSCALL_DEFINE2(sunos_execv,
char __user *, filename,
const char __user *const __user *, argv)
{
return compat_sys_execve(filename, argv, NULL);
}
We definitely want flushw in fork and friends, but I'm not sure what we
need it for in execve(2)...

Anyway, the brute-force variant works.  I had been lucky to stay within the
ba,pt target limit on the config I used (very heavily modular, so not much
code in vmlinux in the first place, let alone before fs/exec.o), so I'd missed
the problem until now.  I've booted that with fatter config that would blow the
previous variant at link time and it works.
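
For reference, the link failure comes down to reach: R_SPARC_WDISP19 is a signed 19-bit word displacement, so a ba,pt can only reach about 1MB either way, while call carries a 30-bit word displacement. The helper below is a hypothetical user-space sketch of that range check, assuming word-aligned addresses; the addresses in the usage note are made up.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * True if a branch at 'from' can reach 'to' with a signed word
 * displacement of 'bits' bits (19 for ba,pt; 30 for call).
 * Both addresses are assumed to be 4-byte aligned.
 */
static bool disp_fits(uint64_t from, uint64_t to, int bits)
{
	int64_t words = (int64_t)(to - from) / 4;
	int64_t limit = (int64_t)1 << (bits - 1);

	return words >= -limit && words < limit;
}
```

With a branch at 0x1f58, a target 0x100000 bytes further on fails the 19-bit check but passes the 30-bit one. Since call writes its own address into %o7, the brute-force variant has to save and restore %o7 around it, which is what the two extra mov instructions do.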


[PATCH V3 1/3] dmaengine: dw_dmac: Update documentation style comments for dw_dma_platform_data

2012-10-15 Thread Viresh Kumar
Documentation style comments were missing for a few fields in struct
dw_dma_platform_data. Add these.

Signed-off-by: Viresh Kumar 
Reviewed-by: Andy Shevchenko 
---
 include/linux/dw_dmac.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/linux/dw_dmac.h b/include/linux/dw_dmac.h
index e1c8c9e..62a6190 100644
--- a/include/linux/dw_dmac.h
+++ b/include/linux/dw_dmac.h
@@ -19,6 +19,8 @@
  * @nr_channels: Number of channels supported by hardware (max 8)
  * @is_private: The device channels should be marked as private and not for
  * by the general purpose DMA channel allocator.
+ * @chan_allocation_order: Allocate channels starting from 0 or 7
+ * @chan_priority: Set channel priority increasing from 0 to 7 or 7 to 0.
  * @block_size: Maximum block size supported by the controller
  * @nr_masters: Number of AHB masters supported by the controller
  * @data_width: Maximum data width supported by hardware per AHB master
-- 
1.7.12.rc2.18.g61b472e




Re: [PATCH] perf probe: convert_name_to_addr() allocated the wrong size buffer for a function name

2012-10-15 Thread Masami Hiramatsu
(2012/10/16 10:37), Hyeoncheol Lee wrote:
> convert_name_to_addr() allocated sizeof(char *) * MAX_PROBE_ARGS
> bytes for a function name

Yeah, that one was from my laziness...

> 
> Cc: Masami Hiramatsu 
> Cc: Srikar Dronamraju 
> Signed-off-by: Hyeoncheol Lee 
> ---
>  tools/perf/util/probe-event.c |5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/tools/perf/util/probe-event.c b/tools/perf/util/probe-event.c
> index 49a256e..bb40ed4 100644
> --- a/tools/perf/util/probe-event.c
> +++ b/tools/perf/util/probe-event.c
> @@ -2352,13 +2352,14 @@ static int convert_name_to_addr(struct perf_probe_event *pev, const char *exec)
>   free(exec_copy);
>   }
>   free(pp->function);
> - pp->function = zalloc(sizeof(char *) * MAX_PROBE_ARGS);
> + pp->function = zalloc(sizeof(char) *
> +   (3 + sizeof(unsigned long long) * 2));

Could you add a comment here explaining why this is long enough?

>   if (!pp->function) {
>   ret = -ENOMEM;
>   pr_warning("Failed to allocate memory by zalloc.\n");
>   goto out;
>   }
> - e_snprintf(pp->function, MAX_PROBE_ARGS, "0x%llx", vaddr);
> + sprintf(pp->function, "0x%llx", vaddr);

And at least we should use snprintf instead of sprintf...
(I think ret = e_snprintf(...) is better)

>   ret = 0;
>  
>  out:
> 
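
To spell out the size arithmetic behind both review comments: "0x%llx" needs at most 2 bytes for the "0x" prefix, two hex digits per byte of the value, and 1 byte for the terminating NUL, which is where 3 + sizeof(unsigned long long) * 2 comes from. The sketch below uses plain snprintf() with a truncation check; format_hex_addr() and HEX_ADDR_BUFSZ are hypothetical names, not perf's e_snprintf().

```c
#include <stddef.h>
#include <stdio.h>

/* "0x" prefix + two hex digits per byte + terminating NUL */
#define HEX_ADDR_BUFSZ (3 + sizeof(unsigned long long) * 2)

/* Returns 0 on success, -1 if the formatted address would not fit. */
static int format_hex_addr(char *buf, size_t size, unsigned long long vaddr)
{
	int n = snprintf(buf, size, "0x%llx", vaddr);

	return (n < 0 || (size_t)n >= size) ? -1 : 0;
}
```

With a buffer of HEX_ADDR_BUFSZ bytes even the all-ones value fits, and a bounded call reports an error rather than overrunning if the buffer is ever shrunk.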

Thank you,

-- 
Masami HIRAMATSU
IT Management Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: masami.hiramatsu...@hitachi.com




[PATCH V3 2/3] dmaengine: dw_dmac: Enhance device tree support

2012-10-15 Thread Viresh Kumar
The dw_dmac driver already supports device tree, but its platform data was
still passed the non-DT way.

This patch makes the following changes:
- Pass platform data via DT; the non-DT way still takes precedence if both are
  used.
- Create a generic filter routine.
- Earlier, slave information was made available by slave-specific filter
  routines in the chan->private field. Now this information is passed from
  within the dmac DT node. Slave drivers are now required to pass bus_id (a
  string) as the parameter to this generic filter, which compares it against
  the slave data passed from DT.
- Update the binding document.

Signed-off-by: Viresh Kumar 
Reviewed-by: Andy Shevchenko 
---
V2->V3:
--
- Simplified an equation in filter routine
- renamed variable 'val' as 'tmp' in DT parsing routine

V1->V2:
--
- Optimized filter & DT parsing routine
- Removed unnecessary casts from changes
- renamed filter function
- Fixed function prototype and return value of DT parsing routine for !CONFIG_OF
  case
- use of_get_child_count()

 Documentation/devicetree/bindings/dma/snps-dma.txt |  44 +++
 drivers/dma/dw_dmac.c  | 134 +
 drivers/dma/dw_dmac_regs.h |   4 +
 include/linux/dw_dmac.h|  43 ---
 4 files changed, 208 insertions(+), 17 deletions(-)

diff --git a/Documentation/devicetree/bindings/dma/snps-dma.txt b/Documentation/devicetree/bindings/dma/snps-dma.txt
index c0d85db..5bb3dfb 100644
--- a/Documentation/devicetree/bindings/dma/snps-dma.txt
+++ b/Documentation/devicetree/bindings/dma/snps-dma.txt
@@ -6,6 +6,26 @@ Required properties:
 - interrupt-parent: Should be the phandle for the interrupt controller
   that services interrupts for this device
 - interrupt: Should contain the DMAC interrupt number
+- nr_channels: Number of channels supported by hardware
+- is_private: The device channels should be marked as private and not for use by the
+  general purpose DMA channel allocator. False if not passed.
+- chan_allocation_order: order of allocation of channel, 0 (default): ascending,
+  1: descending
+- chan_priority: priority of channels. 0 (default): increase from chan 0->n, 1:
+  increase from chan n->0
+- block_size: Maximum block size supported by the controller
+- nr_masters: Number of AHB masters supported by the controller
+- data_width: Maximum data width supported by hardware per AHB master
+  (0 - 8bits, 1 - 16bits, ..., 5 - 256bits)
+- slave_info:
+   - bus_id: name of this device channel, not just a device name since
+ devices may have more than one channel e.g. "foo_tx". For using the
+ dw_generic_filter(), slave drivers must pass exactly this string as
+ param to filter function.
+   - cfg_hi: Platform-specific initializer for the CFG_HI register
+   - cfg_lo: Platform-specific initializer for the CFG_LO register
+   - src_master: src master for transfers on allocated channel.
+   - dst_master: dest master for transfers on allocated channel.
 
 Example:
 
@@ -14,4 +34,28 @@ Example:
reg = <0xfc00 0x1000>;
interrupt-parent = <>;
interrupts = <12>;
+
+   nr_channels = <8>;
+   chan_allocation_order = <1>;
+   chan_priority = <1>;
+   block_size = <0xfff>;
+   nr_masters = <2>;
+   data_width = <3 3 0 0>;
+
+   slave_info {
+   uart0-tx {
+   bus_id = "uart0-tx";
+   cfg_hi = <0x4000>;  /* 0x8 << 11 */
+   cfg_lo = <0>;
+   src_master = <0>;
+   dst_master = <1>;
+   };
+   spi0-tx {
+   bus_id = "spi0-tx";
+   cfg_hi = <0x2000>;  /* 0x4 << 11 */
+   cfg_lo = <0>;
+   src_master = <0>;
+   dst_master = <0>;
+   };
+   };
};
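
The cfg_hi values in the binding example are easier to read as shifted request-line numbers, as their inline comments suggest. The sketch below reproduces that arithmetic. Treating the shift-by-11 field as the destination handshake and a shift-by-7 field as the source handshake is an assumption based on those comments and on the tx naming, and cfg_hi_encode() is a hypothetical helper.

```c
#include <stdint.h>

/*
 * Build a CFG_HI initializer from request-line numbers.
 * Assumed layout: source handshake at bit 7, destination at bit 11.
 */
static uint32_t cfg_hi_encode(uint32_t src_req, uint32_t dst_req)
{
	return (src_req << 7) | (dst_req << 11);
}
```

With this layout, cfg_hi_encode(0, 0x8) gives 0x4000 and cfg_hi_encode(0, 0x4) gives 0x2000, matching the two tx entries in the example.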
diff --git a/drivers/dma/dw_dmac.c b/drivers/dma/dw_dmac.c
index c4b0eb3..98f33a7 100644
--- a/drivers/dma/dw_dmac.c
+++ b/drivers/dma/dw_dmac.c
@@ -1179,6 +1179,50 @@ static void dwc_free_chan_resources(struct dma_chan *chan)
dev_vdbg(chan2dev(chan), "%s: done\n", __func__);
 }
 
+bool dw_dma_generic_filter(struct dma_chan *chan, void *param)
+{
+   struct dw_dma *dw = to_dw_dma(chan->device);
+   static struct dw_dma *last_dw;
+   static char *last_bus_id;
+   int i = -1;
+
+   /*
+* dmaengine framework calls this routine for all channels of all dma
+* controller, until true is returned. If 'param' bus_id is not
+* registered with a dma controller (dw), then there is no need of
+* running below function for all channels of 

[PATCH V3 3/3] ARM: SPEAr13xx: Pass DW DMAC platform data from DT

2012-10-15 Thread Viresh Kumar
This patch adds dw_dmac's platform data to DT node. It also creates slave info
node for SPEAr13xx, for the devices which were using dw_dmac.

Signed-off-by: Viresh Kumar 
---
V1->V3:
--
- renamed filter function

 arch/arm/boot/dts/spear1340.dtsi | 19 ++
 arch/arm/boot/dts/spear13xx.dtsi | 38 
 arch/arm/mach-spear13xx/include/mach/spear.h |  2 --
 arch/arm/mach-spear13xx/spear1310.c  |  4 +--
 arch/arm/mach-spear13xx/spear1340.c  | 27 +++---
 arch/arm/mach-spear13xx/spear13xx.c  | 54 ++--
 6 files changed, 65 insertions(+), 79 deletions(-)

diff --git a/arch/arm/boot/dts/spear1340.dtsi b/arch/arm/boot/dts/spear1340.dtsi
index d71fe2a..8ea3f66 100644
--- a/arch/arm/boot/dts/spear1340.dtsi
+++ b/arch/arm/boot/dts/spear1340.dtsi
@@ -24,6 +24,25 @@
status = "disabled";
};
 
+   dma@ea80 {
+   slave_info {
+   uart1_tx {
+   bus_id = "uart1_tx";
+   cfg_hi = <0x6000>;  /* 0xC << 11 */
+   cfg_lo = <0>;
+   src_master = <0>;
+   dst_master = <1>;
+   };
+   uart1_rx {
+   bus_id = "uart1_rx";
+   cfg_hi = <0x680>;   /* 0xD << 7 */
+   cfg_lo = <0>;
+   src_master = <1>;
+   dst_master = <0>;
+   };
+   };
+   };
+
spi1: spi@5d40 {
compatible = "arm,pl022", "arm,primecell";
reg = <0x5d40 0x1000>;
diff --git a/arch/arm/boot/dts/spear13xx.dtsi b/arch/arm/boot/dts/spear13xx.dtsi
index f7b84ac..f06bb50 100644
--- a/arch/arm/boot/dts/spear13xx.dtsi
+++ b/arch/arm/boot/dts/spear13xx.dtsi
@@ -91,6 +91,37 @@
reg = <0xea80 0x1000>;
interrupts = <0 19 0x4>;
status = "disabled";
+
+   nr_channels = <8>;
+   chan_allocation_order = <1>;
+   chan_priority = <1>;
+   block_size = <0xfff>;
+   nr_masters = <2>;
+   data_width = <3 3 0 0>;
+
+   slave_info {
+   ssp0_tx {
+   bus_id = "ssp0_tx";
+   cfg_hi = <0x2000>;  /* 0x4 << 11 */
+   cfg_lo = <0>;
+   src_master = <0>;
+   dst_master = <0>;
+   };
+   ssp0_rx {
+   bus_id = "ssp0_rx";
+   cfg_hi = <0x280>;   /* 0x5 << 7 */
+   cfg_lo = <0>;
+   src_master = <0>;
+   dst_master = <0>;
+   };
+   cf {
+   bus_id = "cf";
+   cfg_hi = <0>;
+   cfg_lo = <0>;
+   src_master = <0>;
+   dst_master = <0>;
+   };
+   };
};
 
dma@eb00 {
@@ -98,6 +129,13 @@
reg = <0xeb00 0x1000>;
interrupts = <0 59 0x4>;
status = "disabled";
+
+   nr_channels = <8>;
+   chan_allocation_order = <1>;
+   chan_priority = <1>;
+   block_size = <0xfff>;
+   nr_masters = <2>;
+   data_width = <3 3 0 0>;
};
 
fsmc: flash@b000 {
diff --git a/arch/arm/mach-spear13xx/include/mach/spear.h b/arch/arm/mach-spear13xx/include/mach/spear.h
index 07d90ac..71bf5b6 100644
--- a/arch/arm/mach-spear13xx/include/mach/spear.h
+++ b/arch/arm/mach-spear13xx/include/mach/spear.h
@@ -43,8 +43,6 @@
 #define VA_L2CC_BASE   IOMEM(UL(0xFB00))
 
 /* others */
-#define DMAC0_BASE UL(0xEA80)
-#define DMAC1_BASE UL(0xEB00)
 #define MCIF_CF_BASE   UL(0xB280)
 
 /* Devices present in SPEAr1310 */
diff --git a/arch/arm/mach-spear13xx/spear1310.c 

Re: [RFC v3 11/13] vfs: add 3 new ioctl interfaces

2012-10-15 Thread Zhi Yong Wu
On Tue, Oct 16, 2012 at 11:17 AM, Dave Chinner  wrote:
> On Wed, Oct 10, 2012 at 06:07:33PM +0800, zwu.ker...@gmail.com wrote:
>> From: Zhi Yong Wu 
>>
>>   FS_IOC_GET_HEAT_INFO: return a struct containing the various
>> metrics collected in btrfs_freq_data structs, and also return a
>
> I think you mean hot_freq_data :P
Yeah, sorry.
>
>> calculated data temperature based on those metrics. Optionally, retrieve
>> the temperature from the hot data hash list instead of recalculating it.
>
> To get the heat info for a specific file you have to know what file
> you want to get that info for, right?  I can see the usefulness of
Yes.
> asking for the heat data on a specific file, but how do you find the
> hot files in the first place? i.e. the big question the user
> interface needs to answer is "what files are hot?".
We only tell the user what the files' temperatures are, not what files are hot.
Their temperatures are in the output of debugfs.
>
> Once userspace knows what the hottest files are, it can open them
If users need this kind of information, it is easy for us to provide it,
but I don't know through what interface they would want to get it.
> and query the data via the above ioctl, but expecting userspace to
> iterate millions of inodes in a filesystem to find hot files is very
> inefficient.
>
> FWIW, if you were to return file handles to the hottest files, then
> the application could open and query them without even needing to
> know the path name to them. This would be exceedingly useful for
> defragmentation programs, especially as that is the way xfs_fsr
> already operates on candidate files.(*)
ah.
>
> IOWs, sometimes the pathname is irrelevant to the operations that
> applications want to perform - all they care about having an
> efficient method of finding the inode they want and getting a file
> descriptor that points to the file. Given the heat map info fits
> right in to the sort of operations defrag and data mover tools
> already do, it kind of makes sense to optimise the interface towards
> those uses
>
> (*) i.e. finds them via bulkstat which returns handle information
> along with all the other inode data, then opens the file by handle
> to do the defrag work
OK.
>
>>   FS_IOC_GET_HEAT_OPTS: return an integer representing the current
>> state of hot data tracking and migration:
>>
>> 0 = do nothing
>> 1 = track frequency of access
>>
>>   FS_IOC_SET_HEAT_OPTS: change the state of hot data tracking and
>> migration, as described above.
>
> I can't see how this is a manageable interface. It is not
> persistent, so after every filesystem mount you'd have to set the
> flag on all your inodes again. Hence, for the moment, I'd suggest
> that dropping per-inode tracking control until all the core issues
> are sorted out
OK.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> da...@fromorbit.com
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Regards,

Zhi Yong Wu
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Initial report on F2FS filesystem performance

2012-10-15 Thread Sooman Jeong

This is a brief summary of our initial filesystem performance study of f2fs 
against two existing Linux filesystems: EXT4 and NILFS2.


* test platform
 i) Desktop PC : Linux 3.6.1 (f2fs patched), Intel i5-2500 @3.3GHz quad-core, 
8GB RAM, Transcend 16GB class 10 micro SD card
 ii) Galaxy-S3 : Linux 3.0.15 (f2fs ported), Android 4.0.4, DVFS turned off, 
Transcend 16GB class 10 micro SD card


* experiment 1: buffered write (sequential and random, 4KByte write)
====================================================================

F2FS surpasses the other two filesystems in both random and sequential writes.
On the desktop and the Galaxy S3, f2fs exhibits 2.5 and 1.6 times better random
write performance than EXT4, respectively. EXT4 is the standard Android
filesystem.

buffered write (1GB file)
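+--------+-----------------------------------+-----------------------------------+
|        |            Desktop PC             |             Galaxy-S3             |
|        +-------------------+---------------+-------------------+---------------+
|        | sequential (MB/s) | random (IOPS) | sequential (MB/s) | random (IOPS) |
+--------+-------------------+---------------+-------------------+---------------+
| EXT4   |               7.1 |          1073 |               6.7 |          1073 |
+--------+-------------------+---------------+-------------------+---------------+
| NILFS2 |               6.8 |          1462 |               4.0 |          1272 |
+--------+-------------------+---------------+-------------------+---------------+
| F2FS   |              10.6 |          2675 |               6.9 |          1682 |
+--------+-------------------+---------------+-------------------+---------------+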
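
The "2.5 and 1.6 times" figures quoted for random writes can be checked
directly against the IOPS columns above. A minimal sketch of that arithmetic
follows; speedup() is a hypothetical helper introduced only for this
illustration and is not part of any benchmark tool used in the report.

```c
#include <assert.h>

/*
 * Ratio of a candidate result to a baseline result, both in the same unit
 * (here: IOPS). Hypothetical helper for illustration only.
 */
double speedup(double candidate, double baseline)
{
    assert(baseline > 0.0);
    return candidate / baseline;
}
```

With the table values, speedup(2675, 1073) is roughly 2.49 for the desktop
and speedup(1682, 1073) is roughly 1.57 for the Galaxy-S3, which matches the
rounded figures in the text.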


* experiment 2: write + fsync (sequential and random)
=====================================================

F2FS surpasses the other two filesystems in both random and sequential workloads.
On the desktop and the Galaxy S3, f2fs exhibits 2 and 1.5 times better
write+fsync random write performance than EXT4, respectively.

write + fsync (100MB file)
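+--------+-----------------------------------+-----------------------------------+
|        |            Desktop PC             |             Galaxy-S3             |
|        +-------------------+---------------+-------------------+---------------+
|        | sequential (KB/s) | random (IOPS) | sequential (KB/s) | random (IOPS) |
+--------+-------------------+---------------+-------------------+---------------+
| EXT4   |             511.8 |           125 |             383.4 |           119 |
+--------+-------------------+---------------+-------------------+---------------+
| NILFS2 |             545.2 |           112 |             356.7 |            72 |
+--------+-------------------+---------------+-------------------+---------------+
| F2FS   |            1057.9 |           240 |             772.3 |           184 |
+--------+-------------------+---------------+-------------------+---------------+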

The write() + fsync workload tests filesystem performance under Android SQLite 
operation.


* experiment 3: mounting time
=============================

To measure the mount time, we used two different scenarios. First, we mounted 
the file system right after formatting, without rebooting the system. Second, we 
mounted the file system after rebooting, to ensure that any data cached in 
memory was flushed. Overall, EXT4 shows the fastest mount time, and F2FS is 
second fastest after a reboot; however, we observed that F2FS takes the longest 
time to mount right after formatting.

mounting time with Transcend 16GB micro-SD
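+--------+-----------------------------------+-----------------------------------+
|        |            Desktop PC             |             Galaxy-S3             |
|        +-----------------+-----------------+-----------------+-----------------+
|        | 1st mount after | after rebooting | 1st mount after | after rebooting |
|        | format (msec)   | (msec)          | format (msec)   | (msec)          |
+--------+-----------------+-----------------+-----------------+-----------------+
| EXT4   |              11 |              20 |              20 |              40 |
+--------+-----------------+-----------------+-----------------+-----------------+
| NILFS2 |             920 |            1013 |            1680 |            1630 |
+--------+-----------------+-----------------+-----------------+-----------------+
| F2FS   |            1486 |             161 |            2280 |            1570 |
+--------+-----------------+-----------------+-----------------+-----------------+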


Sooman Jeong  ESOS Lab. Hanyang University.
<77sm...@hanyang.ac.kr>

Re: [PATCH v2] fat: editions to support fat_fallocate()

2012-10-15 Thread Namjae Jeon
2012/10/15 OGAWA Hirofumi :
> Namjae Jeon  writes:
>
>> Implement preallocation via the fallocate syscall on VFAT partitions.
>> This patch is based on an earlier patch of the same name which had some
>> issues detailed below and did not get accepted. Refer
>> https://lkml.org/lkml/2007/12/22/130.
>>
>> a)The preallocated space was not persistent across remounts when the
>> FALLOC_FL_KEEP_SIZE flag was set. Also, writes to the file allocated new
>> clusters instead of using the preallocated area.
>>
>> Consider the scenario:
>> mount-->preallocate space for a file --> unmount.
>> In the old patch,the preallocated space was not reflected for that
>> file (verified using the 'du' command).
>>
>> This is now fixed with modifications to fat_fill_inode().
>
When we consider other filesystems like XFS and ext4, the preallocated
space is reserved for the lifetime of the file and persists across
mount/umount.
So we tried to match the existing behaviour, as that keeps the meaning
of fallocate with FALLOC_FL_KEEP_SIZE the same across all filesystems.

> What is real usage pattern of persistent across remounts on FAT?
Yes, for example a torrent file: the client reserves space in advance,
and even if the system is rebooted or the disk is unmounted and remounted,
the space remains reserved as long as the torrent exists.
If the torrent case does not fit, consider a TV series being recorded.
We want all the parts recorded into the same file (i.e., append writes),
and in such cases the TV may be shut down, or the device unmounted and
mounted again. So we need the space to remain available in such cases.

> If once device was unmounted, we can't know the state of FS anymore, there are
> many implementations of FAT. And preallocation is not in the spec.
I agree. As you said before, we can make the FAT fallocate feature
configurable, so this is entirely in the hands of the user.
>
> I worry to break something. And I guess the freeing preallocation on
> last close may fix the issue for usage.
Okay, that would avoid most of your concerns, except for a USB device
being suddenly unplugged.
But then fallocate behaviour will differ from other filesystems.

How about making FAT fallocate a configuration option, to be enabled by
users who need it?

Let me know your opinion :)

Thanks.
> Thanks.
> --
> OGAWA Hirofumi 


Re: [Bug fix] nfs-client: fix nfs_inode_attrs_need_update for async read_done comes during truncating to smaller size

2012-10-15 Thread Chen Gang
On 2012-10-16 10:51, Myklebust, Trond wrote:

>>
>> 1) does it mean that nfs_inode_attrs_need_update need not consider the
>> async read_done situation?
> 
> I don't understand what you mean. This is mainly about the asynchronous
> write situation...

For an async read completion, the call chain is nfs_readpage_result ->
nfs_read_done -> nfs_refresh_inode -> nfs_refresh_inode_locked ->
nfs_inode_attrs_need_update -> nfs_size_need_update.

We need to consider the situation where an async read_done also calls
nfs_size_need_update with an old, stale, larger file size.

Do you mean that async reads need not be considered (considering only
async writes is enough)? Is that correct?
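
To make the scenario concrete, here is a minimal userspace sketch of a
"grow only" size check, assuming the heuristic amounts to "accept the
reported size if it is larger than the cached one". size_needs_update() is
a hypothetical stand-in written for this illustration; it is not the
kernel's nfs_size_need_update() and leaves out everything else the NFS
client takes into account.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for illustration: update the cached size only when
 * the size reported in a reply is larger than what is currently cached.
 */
bool size_needs_update(uint64_t cached_size, uint64_t reported_size)
{
    return reported_size > cached_size;
}
```

Under that assumption, a file truncated locally from 8192 to 1024 bytes
that then receives a late reply still carrying 8192 gives
size_needs_update(1024, 8192) == true, which is the case being asked about.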

> 
> No... If I did, I would have changed this 15 years ago when I was
> writing that code. Nothing here is new... 2.6.27-rc9 has the exact same
> heuristics.

1) I have read the relevant source code of 2.6.27-rc9; indeed it has no
nfs_size_need_update function.

2) I have tested 2.6.27-rc9; it does pass the LTP test of udp+nfsv2.

3) I got the 2.6.27-rc9 source code this way (please check):
   A) get source code from (git clone)
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git
   B) git archive v2.6.27-rc9 | tar -xf - -C ../2.6.27-rc9/


> It boils down to the rule that if you want to ensure that data is not
> _lost_, then you have to ensure that the cached file size is not less
> than the true file size.
> 

1) Do you mean that under some conditions the cached file size can be bigger
than the true file size? Can you give an example (one with no negative
effect on correctness)?

2) What I feel:
   A) I am not very familiar with nfs (so I really need your information);
   B) I think it really is a bug, but maybe nfs_size_need_update is not
the root cause (so it needs the nfs maintainers' audit);
   C) if nfs_size_need_update is not the root cause, I shall continue
analysing it after I get enough information from the nfs maintainers.


>>   B) the test tools which I use is from the LTP (Linux Test Project),
>> they use both udp and tcp to test both the nfsv2 and nfsv3.
> 
> So what combinations are failing?

udp + nfsv2 is failing (I have not tested udp + nfsv3).

> 
>>   C) truly LTP has its limitations: "for stress test, LTP let nfs client
>> and server under the same machine, which will cause kernel stable
>> issue", but for net test, LTP use different machine (I got our issue
>> from LTP net test).
> 
> Running the client and server on the same machine is likely to deadlock
> due to memory pressure issues. The client needs to be able to _increase_
> memory pressure on the server in order to reduce its own pressure. That
> doesn't work well when client == server.
> 

I got confirmation of this from Jeff Layton 1-2 months ago;
thank you for confirming it as well.

-- 
Chen Gang

Asianux Corporation


Re: [PATCH RFC] random: Account for entropy loss due to overwrites

2012-10-15 Thread Theodore Ts'o
On Sat, Sep 29, 2012 at 12:47:04PM -0700, H. Peter Anvin wrote:
> >-static struct poolinfo {
> >+static const struct poolinfo {
> >+int poolshift;  /* log2(POOLBITS) */
> > int poolwords;
> > int tap1, tap2, tap3, tap4, tap5;

Poolshift is duplicated information; it's just log2(poolwords) + 5
(since POOLBITS is poolwords*32).

Granted you don't want to recalculate it every single time you need to
use it, but perhaps it would be better to add poolshift to struct
entropy_store, and set it in init_std_data()?
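
As a rough sketch of the relationship above, assuming poolwords is a power
of two and each pool word holds 32 bits, the shift could be derived once
like this. poolshift_from_words() is a hypothetical helper used only for
illustration; it is not an existing function in drivers/char/random.c.

```c
#include <assert.h>

/*
 * Hypothetical helper: log2(POOLBITS) computed from poolwords, where
 * POOLBITS == poolwords * 32. Assumes poolwords is a power of two.
 */
int poolshift_from_words(unsigned int poolwords)
{
    int shift = 0;

    assert(poolwords != 0 && (poolwords & (poolwords - 1)) == 0);
    while (poolwords > 1) {     /* integer log2(poolwords) */
        poolwords >>= 1;
        shift++;
    }
    return shift + 5;           /* 32 bits per word, log2(32) == 5 */
}
```

For example, a 128-word pool gives 7 + 5 = 12 (4096 bits) and a 32-word
pool gives 5 + 5 = 10 (1024 bits).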

- Ted


[PATCH] ACPI: fix the wrong #ifdef for acpi_no_s4_hw_signature

2012-10-15 Thread Yuanhan Liu
acpi_no_s4_hw_signature is defined in an #ifdef CONFIG_HIBERNATION block,
but the current code puts its declaration in an #ifdef CONFIG_PM_SLEEP block.

Signed-off-by: Yuanhan Liu 
---
 include/linux/acpi.h |    5 ++++-
 1 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/include/linux/acpi.h b/include/linux/acpi.h
index 90be989..a468429 100644
--- a/include/linux/acpi.h
+++ b/include/linux/acpi.h
@@ -257,8 +257,11 @@ int acpi_check_region(resource_size_t start, resource_size_t n,
 
 int acpi_resources_are_enforced(void);
 
-#ifdef CONFIG_PM_SLEEP
+#ifdef CONFIG_HIBERNATION
 void __init acpi_no_s4_hw_signature(void);
+#endif
+
+#ifdef CONFIG_PM_SLEEP
 void __init acpi_old_suspend_ordering(void);
 void __init acpi_nvs_nosave(void);
 #endif /* CONFIG_PM_SLEEP */
-- 
1.7.7.6



Re: [PATCH] make GFP_NOTRACK flag unconditional

2012-10-15 Thread David Rientjes
On Tue, 2 Oct 2012, David Rientjes wrote:

> > There was a general sentiment in a recent discussion (See
> > https://lkml.org/lkml/2012/9/18/258) that the __GFP flags should be
> > defined unconditionally. Currently, the only offender is GFP_NOTRACK,
> > which is conditional to KMEMCHECK.
> > 
> > This simple patch makes it unconditional.
> > 
> > Signed-off-by: Glauber Costa 
> > CC: Christoph Lameter 
> > CC: Mel Gorman 
> > CC: Andrew Morton 
> 
> Acked-by: David Rientjes 
> 
> I think it was done this way to show that if CONFIG_KMEMCHECK=n then the 
> bit could be reused for something else but I can't think of any reason why 
> that would be useful; what would need to add a gfp bit that would also 
> happen to depend on CONFIG_KMEMCHECK=n?  Nothing comes to mind to save a 
> bit.
> 
> There are other cases of this as well, like __GFP_OTHER_NODE which is only 
> useful for thp and it's defined unconditionally.  So this seems fine to 
> me.
> 

Still missing from linux-next as of this morning, I think this patch 
should be merged.


linux-next: Tree for Oct 16

2012-10-15 Thread Stephen Rothwell
Hi all,

The merge window has closed, feel free to add new stuff again.

Changes since 20121015:

New tree: cortex
Dropped Tree: cortex (complex merge conflict)
Removed tree: kmemleak (maintainer suggested)

The l2-mtd tree still had its build failure so I used the version from
next-20121011.

The tip tree gained a conflict against Linus' tree.

The kvm-ppc tree lost its build failure.

The cortex tree gained conflicts against Linus' tree.

The signal tree gained a build failure for which I applied a suggested
patch.

The akpm tree gained a conflict against the signal tree.



I have created today's linux-next tree at
git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
(patches at http://www.kernel.org/pub/linux/kernel/next/ ).  If you
are tracking the linux-next tree using git, you should not use "git pull"
to do so as that will try to merge the new linux-next release with the
old one.  You should use "git fetch" as mentioned in the FAQ on the wiki
(see below).

You can see which trees have been included by looking in the Next/Trees
file in the source.  There are also quilt-import.log and merge.log files
in the Next directory.  Between each merge, the tree was built with
a ppc64_defconfig for powerpc and an allmodconfig for x86_64. After the
final fixups (if any), it is also built with powerpc allnoconfig (32 and
64 bit), ppc44x_defconfig and allyesconfig (minus
CONFIG_PROFILE_ALL_BRANCHES - this fails its final link) and i386, sparc,
sparc64 and arm defconfig. These builds also have
CONFIG_ENABLE_WARN_DEPRECATED, CONFIG_ENABLE_MUST_CHECK and
CONFIG_DEBUG_INFO disabled when necessary.

Below is a summary of the state of the merge.

We are up to 204 trees (counting Linus' and 26 trees of patches pending
for Linus' tree), more are welcome (even if they are currently empty).
Thanks to those who have contributed, and to those who haven't, please do.

Status of my local build tests will be at
http://kisskb.ellerman.id.au/linux-next .  If maintainers want to give
advice about cross compilers/configs that work, we are always open to add
more builds.

Thanks to Randy Dunlap for doing many randconfig builds.  And to Paul
Gortmaker for triage and bug fixes.

There is a wiki covering stuff to do with linux-next at
http://linux.f-seidel.de/linux-next/pmwiki/ .  Thanks to Frank Seidel.

-- 
Cheers,
Stephen Rothwell                    s...@canb.auug.org.au

$ git checkout master
$ git reset --hard stable
Merging origin/master (dd8e8c4 thermal, cpufreq: Fix build when CPU_FREQ_TABLE isn't configured)
Merging fixes/master (12250d8 Merge branch 'i2c-embedded/for-next' of git://git.pengutronix.de/git/wsa/linux)
Merging kbuild-current/rc-fixes (b1e0d8b kbuild: Fix gcc -x syntax)
Merging arm-current/fixes (3d6ee36 Merge branch 'late-for-linus' of git://git.linaro.org/people/rmk/linux-arm)
Merging m68k-current/for-linus (92f79db m68k: Remove empty #ifdef/#else/#endif block)
Merging powerpc-merge/merge (fd3bc66 Merge tag 'disintegrate-powerpc-20121009' into merge)
Merging sparc/master (ddffeb8 Linux 3.7-rc1)
Merging net/master (29bb4cc docbook: networking: fix file paths for uapi headers)
Merging sound-current/for-linus (ddffeb8 Linux 3.7-rc1)
Merging pci-current/for-linus (0ff9514 PCI: Don't print anything while decoding is disabled)
Merging wireless/master (bf11315 net/wireless: ipw2200: Fix panic occurring in ipw_handle_promiscuous_tx())
Merging driver-core.current/driver-core-linus (ddffeb8 Linux 3.7-rc1)
Merging tty.current/tty-linus (3e5bde8 serial/8250_hp300: Missing 8250 register interface conversion bits)
Merging usb.current/usb-linus (8282da4 MAINTAINERS: Add maintainer entry for the USB webcam gadget)
Merging staging.current/staging-linus (ddffeb8 Linux 3.7-rc1)
Merging char-misc.current/char-misc-linus (ddffeb8 Linux 3.7-rc1)
Merging input-current/for-linus (0cc8d6a Merge branch 'next' into for-linus)
Merging md-current/for-linus (72f36d5 md: refine reporting of resync/reshape delays.)
Merging audit-current/for-linus (c158a35 audit: no leading space in audit_log_d_path prefix)
Merging crypto-current/master (c9f97a2 crypto: x86/glue_helper - fix storing of new IV in CBC encryption)
Merging ide/master (9974e43 ide: fix generic_ide_suspend/resume Oops)
Merging dwmw2/master (244dc4e Merge git://git.infradead.org/users/dwmw2/random-2.6)
Merging sh-current/sh-fixes-for-linus (4403310 SH: Convert out[bwl] macros to inline functions)
Merging irqdomain-current/irqdomain/merge (15e06bf irqdomain: Fix debugfs formatting)
Merging devicetree-current/devicetree/merge (4e8383b of: release node fix for of_parse_phandle_with_args)
Merging spi-current/spi/merge (d1c185b of/spi: Fix SPI module loading by using proper "spi:" modalias prefixes.)
Merging gpio-current/gpio/merge (96b7064 gpio/tca6424: merge I2C transactions, remove cast)
Merging asm-generic/master (c37d615 Merge branch 'disintegrate-asm-generic' of 

Re: mpol_to_str revisited.

2012-10-15 Thread David Rientjes
On Mon, 15 Oct 2012, KOSAKI Motohiro wrote:

> I don't think 80de7c3138ee9fd86a98696fd2cf7ad89b995d0a is right fix.

It's certainly not a complete fix, but I think it's a much better result 
of the race, i.e. we don't panic anymore, we simply fail the read() 
instead.

> we should
> close a race (or kill remain ref count leak) if we still have.

As I mentioned earlier in the thread, the read() is done here on a task 
while only a reference to the task_struct is taken and we do not hold 
task_lock() which is required for task->mempolicy.  Once that is fixed, 
mpol_to_str() should never be called for !task->mempolicy so it will never 
need to return -EINVAL in such a condition.


linux-next: build failure after merge of the final tree

2012-10-15 Thread Stephen Rothwell
Hi Al,

After merging the final tree, today's linux-next build (sparc64 defconfig)
failed like this:

arch/sparc/kernel/head_64.o: In function `sys64_execve':
(.text+0x1f58): relocation truncated to fit: R_SPARC_WDISP19 against symbol `sys_execve' defined in .text section in fs/built-in.o
arch/sparc/kernel/head_64.o: In function `sys32_execve':
(.text+0x1f64): relocation truncated to fit: R_SPARC_WDISP19 against symbol `compat_sys_execve' defined in .text section in fs/built-in.o

Probably caused by commit 3223f8aab885 ("sparc64: convert to generic
execve") and following from the signal tree.

I have added this patch you suggested on IRC:

From: Stephen Rothwell 
Date: Tue, 16 Oct 2012 14:43:51 +1100
Subject: [PATCH] sparc: fixup for conversion to generic execve

Fixes these errors:

arch/sparc/kernel/head_64.o: In function `sys64_execve':
(.text+0x1f58): relocation truncated to fit: R_SPARC_WDISP19 against symbol `sys_execve' defined in .text section in fs/built-in.o
arch/sparc/kernel/head_64.o: In function `sys32_execve':
(.text+0x1f64): relocation truncated to fit: R_SPARC_WDISP19 against symbol `compat_sys_execve' defined in .text section in fs/built-in.o

Dictated-by: Al Viro 
Signed-off-by: Stephen Rothwell 
---
 arch/sparc/kernel/syscalls.S |   12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/arch/sparc/kernel/syscalls.S b/arch/sparc/kernel/syscalls.S
index 4bae096..f667cdf 100644
--- a/arch/sparc/kernel/syscalls.S
+++ b/arch/sparc/kernel/syscalls.S
@@ -2,15 +2,19 @@
  * environment settings are the same as the calling processes.
  */
 sys64_execve:
-   ba,pt   %xcc,sys_execve
-    flushw
+   flushw
+   mov     %o7, %l5
+   call    sys_execve
+    mov    %l5, %o7
 
 #ifdef CONFIG_COMPAT
 sunos_execv:
    mov     %g0, %o2
 sys32_execve:
-   ba,pt   %xcc,compat_sys_execve
-    flushw
+   flushw
+   mov     %o7, %l5
+   call    compat_sys_execve
+    mov    %l5, %o7
 #endif
 
    .align  32
-- 
1.7.10.280.gaa39

-- 
Cheers,
Stephen Rothwell                    s...@canb.auug.org.au




Re: [RESEND] [PATCH 2/2] random: fix debug format strings

2012-10-15 Thread Theodore Ts'o
On Mon, Oct 15, 2012 at 11:43:29PM +0200, Jiri Kosina wrote:
> Fix the following warnings in formatting debug output:
> 
> drivers/char/random.c: In function ‘xfer_secondary_pool’:
> drivers/char/random.c:827: warning: format ‘%d’ expects type ‘int’, but argument 7 has type ‘size_t’
> drivers/char/random.c: In function ‘account’:
> drivers/char/random.c:859: warning: format ‘%d’ expects type ‘int’, but argument 5 has type ‘size_t’
> drivers/char/random.c:881: warning: format ‘%d’ expects type ‘int’, but argument 5 has type ‘size_t’
> drivers/char/random.c: In function ‘random_read’:
> drivers/char/random.c:1141: warning: format ‘%d’ expects type ‘int’, but argument 5 has type ‘ssize_t’
> drivers/char/random.c:1145: warning: format ‘%d’ expects type ‘int’, but argument 5 has type ‘ssize_t’
> drivers/char/random.c:1145: warning: format ‘%d’ expects type ‘int’, but argument 6 has type ‘long unsigned int’
> 
> by using '%zd' instead of '%d' to properly denote ssize_t/size_t conversion.
> 
> Signed-off-by: Jiri Kosina 

Applied to the random tree, thanks.
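
For reference, a small userspace sketch of the 'z' length modifier that the
quoted patch switches to. In standard C, '%zu' is the conversion for size_t
and '%zd' is the one for the corresponding signed type (ssize_t on POSIX
systems). format_counts() and its message text are made up for this
illustration and do not come from random.c.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>   /* ssize_t */

/*
 * Hypothetical helper: format a size_t and an ssize_t with the 'z'
 * length modifier so the conversions match the argument types.
 */
int format_counts(char *buf, size_t buflen, size_t requested, ssize_t returned)
{
    return snprintf(buf, buflen, "requested %zu bytes, got %zd",
                    requested, returned);
}
```

Using plain '%d' here would trigger the same kind of warning as in the
quoted build log on platforms where size_t is wider than int.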

- Ted


Re: [RESEND] [PATCH 1/2] random: make it possible to enable debugging without rebuild

2012-10-15 Thread Theodore Ts'o
On Mon, Oct 15, 2012 at 11:42:55PM +0200, Jiri Kosina wrote:
> The module parameter that turns debugging mode (which basically means 
> printing a few extra lines during runtime) is in '#if 0' block. Forcing 
> everyone who would like to see how entropy is behaving on his system to 
> rebuild seems to be a little bit too harsh.
> 
> If we were concerned about speed, we could potentially turn 'debug' into a 
> static key, but I don't think it's necessary.
> 
> Drop the '#if 0' block to allow using the 'debug' parameter without 
> rebuilding.
> 
> Signed-off-by: Jiri Kosina 

Applied to the random tree, thanks.

- Ted


Re: [PATCH] cpufreq:core: Fix printing of governor and driver name

2012-10-15 Thread Viresh Kumar
On 15 October 2012 23:21, Rafael J. Wysocki  wrote:
> On Wednesday 10 of October 2012 10:12:11 Viresh Kumar wrote:
>> Arrays for governor and driver name are of size CPUFREQ_NAME_LEN or 16.
>> i.e. 15 bytes for name and 1 for trailing '\0'.
>>
>> When cpufreq driver print these names (for sysfs), it includes '\n' or ' ' in
>> the fmt string and still passes length as CPUFREQ_NAME_LEN. If the driver or
>> governor names are using all 15 fields allocated to them, then the trailing
>> '\n' or ' ' will never be printed. And so commands like:
>>
>> root@linaro-developer# cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_driver
>>
>> will print something like:
>>
>> cpufreq_foodrvroot@linaro-developer#
>>
>> Fix this by increasing print length by one character.
>>
>> Signed-off-by: Viresh Kumar 
>
> Thanks for the patch, I'll queue it up for v3.8.

Hi Rafael,

Thanks for accepting the patch.
I thought both of my patches would go in 3.7-rc2 as they are bug
fixes. Isn't that correct?
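
For anyone following along, the truncation described in the quoted patch
description can be reproduced in userspace. This is only a sketch: it uses
snprintf() as a stand-in for the kernel's bounded print helper, NAME_LEN
mirrors the value 16 given for CPUFREQ_NAME_LEN above, and show_name() and
the 15-character driver name used with it are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define NAME_LEN 16   /* mirrors CPUFREQ_NAME_LEN as quoted above */

/*
 * Hypothetical helper: print "name\n" into buf using at most `limit`
 * bytes (including the terminating NUL) and return the resulting length.
 */
size_t show_name(char *buf, size_t limit, const char *name)
{
    snprintf(buf, limit, "%s\n", name);
    return strlen(buf);
}
```

With a 15-character name, a limit of NAME_LEN leaves room for the name and
the NUL only, so the newline is dropped; a limit of NAME_LEN + 1 keeps it,
which is the effect of increasing the print length by one character.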

--
viresh


Re: [PATCH v4 02/14] memcg: Reclaim when more than one page needed.

2012-10-15 Thread Kamezawa Hiroyuki
(2012/10/08 19:06), Glauber Costa wrote:
> From: Suleiman Souhlal 
> 
> mem_cgroup_do_charge() was written before kmem accounting, and expects
> three cases: being called for 1 page, being called for a stock of 32
> pages, or being called for a hugepage.  If we call for 2 or 3 pages (and
> both the stack and several slabs used in process creation are such, at
> least with the debug options I had), it assumed it's being called for
> stock and just retried without reclaiming.
> 
> Fix that by passing down a minsize argument in addition to the csize.
> 
> And what to do about that (csize == PAGE_SIZE && ret) retry?  If it's
> needed at all (and presumably is since it's there, perhaps to handle
> races), then it should be extended to more than PAGE_SIZE, yet how far?
> And should there be a retry count limit, of what?  For now retry up to
> COSTLY_ORDER (as page_alloc.c does) and make sure not to do it if
> __GFP_NORETRY.
> 
> [v4: fixed nr pages calculation pointed out by Christoph Lameter ]
> 
> Signed-off-by: Suleiman Souhlal 
> Signed-off-by: Glauber Costa 
> Reviewed-by: Kamezawa Hiroyuki 
> Acked-by: Michal Hocko 
> Acked-by: Johannes Weiner 

Acked-by: KAMEZAWA Hiroyuki 




Re: [RFC v3 11/13] vfs: add 3 new ioctl interfaces

2012-10-15 Thread Dave Chinner
On Wed, Oct 10, 2012 at 06:07:33PM +0800, zwu.ker...@gmail.com wrote:
> From: Zhi Yong Wu 
> 
>   FS_IOC_GET_HEAT_INFO: return a struct containing the various
> metrics collected in btrfs_freq_data structs, and also return a

I think you mean hot_freq_data :P

> calculated data temperature based on those metrics. Optionally, retrieve
> the temperature from the hot data hash list instead of recalculating it.

To get the heat info for a specific file you have to know what file
you want to get that info for, right?  I can see the usefulness of
asking for the heat data on a specific file, but how do you find the
hot files in the first place? i.e. the big question the user
interface needs to answer is "what files are hot?".

Once userspace knows what the hottest files are, it can open them
and query the data via the above ioctl, but expecting userspace to
iterate millions of inodes in a filesystem to find hot files is very
inefficient.

FWIW, if you were to return file handles to the hottest files, then
the application could open and query them without even needing to
know the path name to them. This would be exceedingly useful for
defragmentation programs, especially as that is the way xfs_fsr
already operates on candidate files.(*)

IOWs, sometimes the pathname is irrelevant to the operations that
applications want to perform - all they care about is having an
efficient method of finding the inode they want and getting a file
descriptor that points to the file. Given the heat map info fits
right in to the sort of operations defrag and data mover tools
already do, it kind of makes sense to optimise the interface towards
those uses

(*) i.e. finds them via bulkstat which returns handle information
along with all the other inode data, then opens the file by handle
to do the defrag work

>   FS_IOC_GET_HEAT_OPTS: return an integer representing the current
> state of hot data tracking and migration:
> 
> 0 = do nothing
> 1 = track frequency of access
> 
>   FS_IOC_SET_HEAT_OPTS: change the state of hot data tracking and
> migration, as described above.

I can't see how this is a manageable interface. It is not
persistent, so after every filesystem mount you'd have to set the
flag on all your inodes again. Hence, for the moment, I'd suggest
dropping per-inode tracking control until all the core issues
are sorted out.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC PATCH 1/5] irq_work: Move irq_work_raise() declaration/default definition to arch headers

2012-10-15 Thread Mark Brown
On Tue, Oct 16, 2012 at 12:18:05AM +0200, Frederic Weisbecker wrote:
> 2012/10/15 Arnd Bergmann :
> > On Monday 15 October 2012, Steven Rostedt wrote:
> >> On Mon, 2012-10-15 at 22:23 +0200, Frederic Weisbecker wrote:
> >> > 2012/10/15 Steven Rostedt :
> >> > > On Mon, 2012-10-15 at 17:11 +0100, Catalin Marinas wrote:

> >> > > BTW, is there any rational reason that the include path lookup doesn't
> >> > > just check for the files in include/asm-generic after looking in
> >> > > arch/*/include/asm?

> >> > > Really, the best way would be just to add the default asm files into
> >> > > include/asm-generic and be done with it. I hate the fact that we need
> >> > > to touch every arch for every generic default file.

> >> > Agreed. I'm including Arnd in the conversation.

> >> As David Howells is doing user space header work, I'll include him too.
> >> Maybe someone can shed some light onto this.

I'll just add my vote there, I've *no* idea why asm-generic isn't in the
include path by default, I could never figure out what that was for.

> > A number of people have expressed the wish to do this through Makefile
> > magic, but so far nobody has been able to come up with the right
> > incantation.
> >
> > I've spent a day trying to figure it out, and I think Mark Brown tried
> > some of the same things. It's probably not all that hard for someone who
> > is more familiar with the Kbuild internals.

I came up with stuff for it, though it needed prettifying.

> This seems to do the trick:

> (It's the diff result of ln -s asm-generic include/asm)

That'd work, but I assume there is some reason why we've got this system
of explicitly adding each file.  It's not like cpp can test for the
presence of include files.  If we can't figure out why we're not doing
this I'd propose we start.


Re: hung task when USB storage probe occurs during suspend

2012-10-15 Thread Michael Spang
[oops, fixing LKML address]

If you plug in a USB storage device and then suspend, resume, and
quickly suspend again, the system may freeze. 2 minutes later you'll
get the following message.

I believe this is a regression introduced in 62d3c543 ("Block: use a
freezable workqueue for disk-event polling"). Reverting that patch
prevents the deadlock.

<3>[  240.107877] INFO: task kworker/u:2:64 blocked for more than 120 seconds.
<3>[  240.107888] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
<6>[  240.107899] kworker/u:2 D 880149f61908 064
2 0x
<5>[  240.107914]  880149f719f0 0046 8801495e3250
8801495e3250
<5>[  240.107923]  880149f615d8 880149f61590 880149f71fd8
880149f71fd8
<5>[  240.107931]  00012180 880149f61590 88014fa92208
7fff
<5>[  240.107939] Call Trace:
<5>[  240.107955]  [] schedule+0x64/0x66
<5>[  240.107966]  [] schedule_timeout+0x34/0xde
<5>[  240.107973]  [] wait_for_common+0xcd/0x14b
<5>[  240.107983]  [] ? try_to_wake_up+0x1e0/0x1e0
<5>[  240.107990]  [] wait_for_completion+0x1d/0x1f
<5>[  240.107998]  [] flush_work+0x2e/0x34
<5>[  240.108004]  [] ? do_work_for_cpu+0x27/0x27
<5>[  240.108011]  [] flush_delayed_work+0x49/0x4e
<5>[  240.108019]  [] disk_clear_events+0x97/0xfb
<5>[  240.108029]  [] check_disk_change+0x2d/0x5f
<5>[  240.108039]  [] sd_open+0xb1/0x160
<5>[  240.108047]  [] __blkdev_get+0xbf/0x3b0
<5>[  240.108054]  [] blkdev_get+0x1df/0x2d8
<5>[  240.108064]  [] ? unlock_new_inode+0x5c/0x61
<5>[  240.108074]  [] ? put_device+0x17/0x19
<5>[  240.108083]  [] ? disk_put_part+0x12/0x14
<5>[  240.108089]  [] add_disk+0x29f/0x3e6
<5>[  240.108096]  [] sd_probe_async+0x124/0x1c4
<5>[  240.108103]  [] ? async_schedule+0x17/0x17
<5>[  240.108108]  [] async_run_entry_fn+0xa2/0x153
<5>[  240.108115]  [] process_one_work+0x199/0x2b8
<5>[  240.108123]  [] worker_thread+0x13c/0x222
<5>[  240.108130]  [] ? manage_workers.isra.26+0x171/0x171
<5>[  240.108138]  [] kthread+0x8b/0x93
<5>[  240.108147]  [] kernel_thread_helper+0x4/0x10
<5>[  240.108155]  [] ? __init_kthread_worker+0x39/0x39
<5>[  240.108163]  [] ? gs_change+0xb/0xb
<0>[  240.108169] Kernel panic - not syncing: hung_task: blocked tasks

This async SCSI probe is stuck trying to flush the
system_nrt_freezable_wq workqueue, which is frozen.

<6>[  169.464976] powerd_suspend  D 880101d2e818 0  5687 5686 0x
<5>[  169.464981]  880101f2bcd8 0082 0246 81813020
<5>[  169.464988]  0246 880101d2e4a0 880101f2bfd8 880101f2bfd8
<5>[  169.464996]  00012180 880101d2e4a0 880101f2bcc8 0217
<5>[  169.465002] Call Trace:
<5>[  169.465009]  [] ? scsi_bus_resume_common+0x8d/0x8d
<5>[  169.465015]  [] schedule+0x64/0x66
<5>[  169.465023]  [] async_synchronize_cookie_domain+0xb6/0x112
<5>[  169.465029]  [] ? __init_waitqueue_head+0x32/0x32
<5>[  169.465038]  [] async_synchronize_cookie+0x15/0x17
<5>[  169.465046]  [] async_synchronize_full+0x15/0x31
<5>[  169.465052]  [] scsi_bus_prepare+0x1d/0x36
<5>[  169.465059]  [] dpm_prepare+0xdd/0x18d
<5>[  169.465065]  [] dpm_suspend_start+0x15/0x40
<5>[  169.465073]  [] suspend_devices_and_enter+0x78/0x27f
<5>[  169.465081]  [] pm_suspend+0x134/0x1a9
<5>[  169.465088]  [] state_store+0x9c/0xc5
<5>[  169.465098]  [] kobj_attr_store+0x17/0x19
<5>[  169.465105]  [] sysfs_write_file+0x104/0x140
<5>[  169.465111]  [] vfs_write+0xa8/0xcf
<5>[  169.465117]  [] sys_write+0x4a/0x71
<5>[  169.465124]  [] system_call_fastpath+0x16/0x1b

The powerd_suspend task is blocked waiting for SCSI probes to complete.

We've worked around this issue in the chromium tree by partially
reverting 62d3c543 (see
https://gerrit.chromium.org/gerrit/#/c/35324/).

Thanks,
Michael


Re: [PATCH] include/version.h: Update for kernel 3.7

2012-10-15 Thread Theodore Ts'o
On Mon, Oct 15, 2012 at 04:43:27PM -0500, Larry Finger wrote:
> The value for LINUX_VERSION_CODE was not updated for kernel 3.7-rc1.
> 
> Signed-off-by: Larry Finger 
> ---
>  version.h |2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> ---
> 
> Index: linux-2.6/include/linux/version.h
> ===
> --- linux-2.6.orig/include/linux/version.h
> +++ linux-2.6/include/linux/version.h
> @@ -1,2 +1,2 @@
> -#define LINUX_VERSION_CODE 198144
> +#define LINUX_VERSION_CODE 198400
>  #define KERNEL_VERSION(a,b,c) (((a) << 16) + ((b) << 8) + (c))

This isn't in the Linux git sources; it's a generated file.

- Ted
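
For reference, the two values in the quoted patch follow directly from the KERNEL_VERSION macro on its last line: 198144 encodes 3.6.0 and 198400 encodes 3.7.0. A minimal sketch of that arithmetic follows; the version_major/minor/patch helpers are hypothetical names used only for illustration and are not part of any kernel header.

```c
#include <assert.h>

/* Same encoding as the KERNEL_VERSION macro quoted in the patch above. */
#define KERNEL_VERSION(a, b, c) (((a) << 16) + ((b) << 8) + (c))

/* Hypothetical helpers (not kernel API): split a version code into parts. */
static inline int version_major(int code)
{
	assert(code >= 0);
	return code >> 16;
}

static inline int version_minor(int code)
{
	assert(code >= 0);
	return (code >> 8) & 0xff;
}

static inline int version_patch(int code)
{
	assert(code >= 0);
	return code & 0xff;
}
```

With these, KERNEL_VERSION(3, 7, 0) evaluates to 198400, and decoding 198400 gives major 3, minor 7, patch 0.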


linux-next: manual merge of the akpm tree with the signal tree

2012-10-15 Thread Stephen Rothwell
Hi Andrew,

Today's linux-next merge of the akpm tree got a conflict in
arch/arm64/kernel/sys_compat.c between commit 0fe8f08036a2 ("arm64: Use
generic sys_execve() implementation") from the signal tree and commit
"compat: generic compat_sys_sched_rr_get_interval implementation" from
the akpm tree.

I fixed it up (see below) and can carry the fix as necessary (no action
is required).

-- 
Cheers,
Stephen Rothwell    s...@canb.auug.org.au

diff --cc arch/arm64/kernel/sys_compat.c
index d140b73,fd8ae6e..000
--- a/arch/arm64/kernel/sys_compat.c
+++ b/arch/arm64/kernel/sys_compat.c
@@@ -49,21 -49,24 +49,6 @@@ asmlinkage int compat_sys_vfork(struct 
   regs, 0, NULL, NULL);
  }
  
- asmlinkage int compat_sys_sched_rr_get_interval(compat_pid_t pid,
-   struct compat_timespec __user *interval)
- {
-   struct timespec t;
-   int ret;
-   mm_segment_t old_fs = get_fs();
- 
-   set_fs(KERNEL_DS);
-   ret = sys_sched_rr_get_interval(pid, (struct timespec __user *)&t);
-   set_fs(old_fs);
-   if (put_compat_timespec(&t, interval))
-   return -EFAULT;
-   return ret;
- }
- 
 -asmlinkage int compat_sys_execve(const char __user *filenamei,
 -   compat_uptr_t argv, compat_uptr_t envp,
 -   struct pt_regs *regs)
 -{
 -  int error;
 -  struct filename *filename;
 -
 -  filename = getname(filenamei);
 -  error = PTR_ERR(filename);
 -  if (IS_ERR(filename))
 -  goto out;
 -  error = compat_do_execve(filename->name, compat_ptr(argv),
 -  compat_ptr(envp), regs);
 -  putname(filename);
 -out:
 -  return error;
 -}
 -
  static inline void
  do_compat_cache_op(unsigned long start, unsigned long end, int flags)
  {




Re: [Bug fix] nfs-client: fix nfs_inode_attrs_need_update for async read_done comes during truncating to smaller size

2012-10-15 Thread Myklebust, Trond
On Tue, 2012-10-16 at 09:37 +0800, Chen Gang wrote:
> On 2012-10-15 20:32, Myklebust, Trond wrote:
> > RPC is not ordered. The fact that we get one RPC reply before another
> > does not mean that the server sent them in that order.
> > 
> > This is doubly true when you use UDP as the transport protocol.
> 
> 1) Does this mean nfs_inode_attrs_need_update need not consider the async
> read_done situation?

I don't understand what you mean. This is mainly about the asynchronous
write situation...

> 2) For correctness, I do not think "nfs_size_to_loff_t(fattr->size) >
> i_size_read(inode)" in nfs_size_need_update is enough (it should at least
> use "!=" instead of '>'). Do you think so?

No... If I did, I would have changed this 15 years ago when I was
writing that code. Nothing here is new... 2.6.27-rc9 has the exact same
heuristics.
It boils down to the rule that if you want to ensure that data is not
_lost_, then you have to ensure that the cached file size is not less
than the true file size.

> 3) another reference:
> 
>   A) An old kernel version (such as 2.6.27-rc9) has no such issue
> (because it did not have nfs_size_need_update).
> 
>   B) The test tools I use are from the LTP (Linux Test Project);
> they use both UDP and TCP to test both NFSv2 and NFSv3.

So what combinations are failing?

>   C) Admittedly, LTP has its limitations: for the stress test, LTP runs the
> NFS client and server on the same machine, which causes kernel stability
> issues. For the net test, however, LTP uses different machines (I got our
> issue from the LTP net test).

Running the client and server on the same machine is likely to deadlock
due to memory pressure issues. The client needs to be able to _increase_
memory pressure on the server in order to reduce its own pressure. That
doesn't work well when client == server.

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
trond.mykleb...@netapp.com
www.netapp.com
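
The rule stated in the reply above (never let the cached file size fall below the true file size, because RPC replies may arrive out of order) can be illustrated with plain integers. This is only a sketch under that assumption: size_needs_update() and apply_reply() are hypothetical names modeled on the quoted "nfs_size_to_loff_t(fattr->size) > i_size_read(inode)" comparison, not the kernel's actual code.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for the comparison quoted in the thread:
 * accept a size from an RPC reply only if it would grow the cached size.
 * Replies can arrive out of order, so a smaller value may simply be stale.
 */
static inline int size_needs_update(int64_t cached_size, int64_t reply_size)
{
	assert(cached_size >= 0 && reply_size >= 0);
	return reply_size > cached_size;
}

/* Apply a reply to the cached size and return the resulting cached size. */
static inline int64_t apply_reply(int64_t cached_size, int64_t reply_size)
{
	return size_needs_update(cached_size, reply_size) ? reply_size
							  : cached_size;
}
```

With '>' a late reply carrying an older, smaller size leaves the cache untouched; with "!=" the same reply would shrink the cached size and could hide data that was written in the meantime.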


Re: [PATCH] include/version.h: Update for kernel 3.7

2012-10-15 Thread Larry Finger

On 10/15/2012 04:54 PM, Borislav Petkov wrote:
> On Mon, Oct 15, 2012 at 04:43:27PM -0500, Larry Finger wrote:
>> The value for LINUX_VERSION_CODE was not updated for kernel 3.7-rc1.
>
> That's probably fallout from the whole UAPI thing.
>
>> Signed-off-by: Larry Finger 
>> ---
>>   version.h |2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>> ---
>>
>> Index: linux-2.6/include/linux/version.h
>> ===
>> --- linux-2.6.orig/include/linux/version.h
>> +++ linux-2.6/include/linux/version.h
>> @@ -1,2 +1,2 @@
>
> There are two version.h files on my box:
>
>> -#define LINUX_VERSION_CODE 198144
>
> This is in 
>
>> +#define LINUX_VERSION_CODE 198400
>
> This is in 
>
> I'd guess that everything should include this new version.h file now
> since the Makefile generates this now and not the one above.
>
> But I could very well be wrong.


It seems to be fixed now.

Thanks,

Larry




Re: [GIT PULL] Linux KVM tool for v3.7-rc0

2012-10-15 Thread Stephen Rothwell
Hi Linus,

On Fri, 12 Oct 2012 14:34:33 +0300 (EEST) Pekka Enberg wrote:
>
> Please consider pulling the latest LKVM tree from:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux.git kvmtool/for-linus

So you have not taken this in the v3.7 merge window.

Will you ever merge this?   If not, it should be removed from linux-next
(where it has been sitting since before v3.1) and turned into an
independent (to the kernel) project.

-- 
Cheers,
Stephen Rothwell    s...@canb.auug.org.au




Re: Re: [PATCH] block: Add blk_rq_pos(rq) to sort rq when flushing plug-list.

2012-10-15 Thread Jianpeng Ma
On 2012-10-15 21:18 Shaohua Li  Wrote:
>2012/10/15 Shaohua Li :
>> 2012/10/15 Jianpeng Ma :
>>> My workload is a RAID5 array with 16 disks, written to through our
>>> filesystem in direct-I/O mode.
>>> Using blktrace, I found the following messages:
>>>
>>> 8,16   0 3570 1.083923979  2519  I   W 144323176 + 24 [md127_raid5]
>>> 8,16   00 1.083926214 0  m   N cfq2519 insert_request
>>> 8,16   0 3571 1.083926586  2519  I   W 144323072 + 104 [md127_raid5]
>>> 8,16   00 1.083926952 0  m   N cfq2519 insert_request
>>> 8,16   0 3572 1.083927180  2519  U   N [md127_raid5] 2
>>> 8,16   00 1.083927870 0  m   N cfq2519 Not idling.st->count:1
>>> 8,16   00 1.083928320 0  m   N cfq2519 dispatch_insert
>>> 8,16   00 1.083928951 0  m   N cfq2519 dispatched a request
>>> 8,16   00 1.083929443 0  m   N cfq2519 activate rq,drv=1
>>> 8,16   0 3573 1.083929530  2519  D   W 144323176 + 24 [md127_raid5]
>>> 8,16   00 1.083933883 0  m   N cfq2519 Not idling.st->count:1
>>> 8,16   00 1.083934189 0  m   N cfq2519 dispatch_insert
>>> 8,16   00 1.083934654 0  m   N cfq2519 dispatched a request
>>> 8,16   00 1.083935014 0  m   N cfq2519 activate rq,drv=2
>>> 8,16   0 3574 1.083935101  2519  D   W 144323072 + 104 [md127_raid5]
>>> 8,16   0 3575 1.084196179 0  C   W 144323176 + 24 [0]
>>> 8,16   00 1.084197979 0  m   N cfq2519 complete rqnoidle 0
>>> 8,16   0 3576 1.084769073 0  C   W 144323072 + 104 [0]
>>>   ..
>>> 8,16   1 3596 1.091394357  2519  I   W 144322544 + 16 [md127_raid5]
>>> 8,16   10 1.091396181 0  m   N cfq2519 insert_request
>>> 8,16   1 3597 1.091396571  2519  I   W 144322520 + 24 [md127_raid5]
>>> 8,16   10 1.091396934 0  m   N cfq2519 insert_request
>>> 8,16   1 3598 1.091397165  2519  I   W 144322488 + 32 [md127_raid5]
>>> 8,16   10 1.091397477 0  m   N cfq2519 insert_request
>>> 8,16   1 3599 1.091397708  2519  I   W 144322432 + 56 [md127_raid5]
>>> 8,16   10 1.091398023 0  m   N cfq2519 insert_request
>>> 8,16   1 3600 1.091398284  2519  U   N [md127_raid5] 4
>>> 8,16   10 1.091398986 0  m   N cfq2519 Not idling. st->count:1
>>> 8,16   10 1.091399511 0  m   N cfq2519 dispatch_insert
>>> 8,16   10 1.091400217 0  m   N cfq2519 dispatched a request
>>> 8,16   10 1.091400688 0  m   N cfq2519 activate rq,drv=1
>>> 8,16   1 3601 1.091400766  2519  D   W 144322544 + 16 [md127_raid5]
>>> 8,16   10 1.091406151 0  m   N cfq2519 Not idling.st->count:1
>>> 8,16   10 1.091406460 0  m   N cfq2519 dispatch_insert
>>> 8,16   10 1.091406931 0  m   N cfq2519 dispatched a request
>>> 8,16   10 1.091407291 0  m   N cfq2519 activate rq,drv=2
>>> 8,16   1 3602 1.091407378  2519  D   W 144322520 + 24 [md127_raid5]
>>> 8,16   10 1.091414006 0  m   N cfq2519 Not idling.st->count:1
>>> 8,16   10 1.091414297 0  m   N cfq2519 dispatch_insert
>>> 8,16   10 1.091414702 0  m   N cfq2519 dispatched a request
>>> 8,16   10 1.091415047 0  m   N cfq2519 activate rq, drv=3
>>> 8,16   1 3603 1.091415125  2519  D   W 144322488 + 32 [md127_raid5]
>>> 8,16   10 1.091416469 0  m   N cfq2519 Not idling.st->count:1
>>> 8,16   10 1.091416754 0  m   N cfq2519 dispatch_insert
>>> 8,16   10 1.091417186 0  m   N cfq2519 dispatched a request
>>> 8,16   10 1.091417535 0  m   N cfq2519 activate rq,drv=4
>>> 8,16   1 3604 1.091417628  2519  D   W 144322432 + 56 [md127_raid5]
>>> 8,16   1 3605 1.091857225  4393  C   W 144322544 + 16 [0]
>>> 8,16   10 1.091858753 0  m   N cfq2519 complete rqnoidle 0
>>> 8,16   1 3606 1.092068456  4393  C   W 144322520 + 24 [0]
>>> 8,16   10 1.092069851 0  m   N cfq2519 complete rqnoidle 0
>>> 8,16   1 3607 1.092350440  4393  C   W 144322488 + 32 [0]
>>> 8,16   10 1.092351688 0  m   N cfq2519 complete rqnoidle 0
>>> 8,16   1 3608 1.093629323 0  C   W 144322432 + 56 [0]
>>> 8,16   10 1.093631151 0  m   N cfq2519 complete rqnoidle 0
>>> 8,16   10 1.093631574 0  m   N cfq2519 will busy wait
>>> 8,16   10 1.093631829 0  m   N cfq schedule dispatch
>>>
>>> Because elv_attempt_insert_merge() only tries a back merge, the four
>>> requests above cannot be merged.
>>> I traced for ten minutes and counted these cases; they account for 25%.
>>>
>>> With the patch applied, I tested again and did not see cases like the above.
>>>
>>> Signed-off-by: Jianpeng Ma 
>>> ---
>>>  
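
To illustrate the trace above: the four writes are inserted in descending sector order (144322544 + 16, 144322520 + 24, 144322488 + 32, 144322432 + 56), and each one ends exactly where the previously inserted one starts. A back merge only appends a request that starts where an existing one ends, so in that order nothing merges; once sorted by start sector, the four collapse into one. The following is a simplified user-space model of that idea, not block-layer code, and struct req and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, simplified request: start sector and length in sectors. */
struct req {
	uint64_t pos;
	uint32_t nr;
};

/* qsort comparator: ascending start sector. */
static int cmp_pos(const void *a, const void *b)
{
	const struct req *x = a, *y = b;

	return (x->pos > y->pos) - (x->pos < y->pos);
}

/*
 * Walk the array in order and back-merge each request into the previous
 * one when it starts exactly where the previous one ends.
 * Returns the number of requests left after merging.
 */
static size_t back_merge(struct req *r, size_t n)
{
	size_t out = 0;

	assert(r != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (out > 0 && r[out - 1].pos + r[out - 1].nr == r[i].pos)
			r[out - 1].nr += r[i].nr;	/* back merge */
		else
			r[out++] = r[i];		/* start a new request */
	}
	return out;
}

/* Sort by start sector first, then back-merge. */
static size_t sort_and_merge(struct req *r, size_t n)
{
	qsort(r, n, sizeof(*r), cmp_pos);
	return back_merge(r, n);
}
```

Fed the four requests in trace order, back_merge() alone leaves all four in place, while sort_and_merge() reduces them to a single request at sector 144322432 covering 128 sectors.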

RE: [PATCH 11/16] f2fs: add inode operations for special inodes

2012-10-15 Thread Jaegeuk Kim
> On Monday 15 October 2012, Changman Lee wrote:
> > On Monday, October 15, 2012, Arnd Bergmann wrote:
> > > It is only a performance hint though, so it is not a correctness issue if the
> > > file system gets it wrong. In order to do efficient garbage collection, a log
> > > structured file system should take all the information it can get about the
> > > expected life of data it writes. I agree that the list, even in the form of
> > > mkfs time settings, is not a clean abstraction, but in the place of an Android
> > > phone manufacturer I would still enable it if it promises a significant
> > > performance advantage over not using it. I guess it would be nice if this
> > > could be overridden in some form, e.g. using an ioctl on the file as ext4 does.
> > >
> > Right. This is related to the HOT/COLD separation policy of f2fs. If we know
> > that data is COLD, we can manage gc effectively.
> > I think placing the extension list in the superblock, as you advise, is better
> > because it's difficult to fix user apps, although it's a nasty way.
> 
> Ok. I think you should adapt the terminology though. Right now, the optimization
> is to mark the data as COLD because we expect it to be written less often than
> other kinds of data. However, the hot/cold terms are usually only applied to
> data that we assume is going to be written soon or not based on how often
> the same data has been accessed in the past.
> 
> Anything you detect from the file name is not really a hint on hot/cold
> files, but rather on the expected access pattern: These files are going
> to be written once, and will be read-only after that, they are probably
> multiple megabytes in size, and if you have a lot of them, they are likely
> to live for the same time.
> 
> It may well be possible that we later decide to use the hint in a different
> way, e.g. to put these files into yet another separate log, aside from
> other hot or cold files.
> 
> > > We should also take the kinds of access we have seen on a file into account.
> > > E.g. if someone opens a file O_RDWR and performs seek or pwrite on it, we can
> > > assume that it's not in the category of typical media files, and a file that
> > > gets written to disk linearly in multiple megabytes might belong into the
> > > category even if it is named otherwise.
> > >
> > This is more general but it's hard to adapt now.
> 
> I think it's important to leave the option open for a future optimization.
> Right now, what we have to get agreement on is the on-disk format, because
> we absolutely don't want to make incompatible changes to that once f2fs
> has been merged into the kernel and is getting used on real systems.
> 
> This is independent of how the code is implemented at the moment, and
> any tuning regarding how to group different kinds of data into the six
> logs is completely up to how things work out in practice. But you should
> definitely ensure that those changes don't require changing the format
> if we decide to use a different number of logs in the future, or to
> use the logs differently.
> 
> The split between logs for nodes on the one hand and data on the other
> is something that can well be hardcoded, and it's ok to have a hard
> upper bound on the number of logs in the file system, possibly higher
> than 6.
> 

Thank you for a lot of points to be addressed. :)
Maybe it's time to summarize them.
Please let me know what I misunderstood.

[In v2]
- Extension list
  : Mkfs lets the user configure the extension list, and that information
is stored in the superblock. In order to reduce the cleaning overhead,
f2fs also supports an additional interface, an ioctl, as ext4 does.

- The number of active logs
  : No change will be done in on-disk layout (i.e., max 6 logs).
Instead, f2fs supports changing the number with a mount option.
Currently, I think 4, 5, and 6 would be enough.

- Section size
  : Mkfs supports multiples of segments for a section, not power-of-two.

[Future optimization]
- Data separation
  : file access pattern, and what else?
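
As an illustration of the "Extension list" item above, matching a file name against a list of extensions could look roughly like the sketch below. It assumes a case-sensitive comparison against the part after the last dot; struct ext_list, name_is_cold() and the example extensions are hypothetical and do not describe the f2fs on-disk format or code.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical in-memory copy of an extension list kept in the superblock. */
struct ext_list {
	const char *const *exts;	/* extensions without the leading dot */
	size_t count;
};

/* Return 1 if the file name ends in ".<ext>" for some listed extension. */
static int name_is_cold(const struct ext_list *list, const char *name)
{
	const char *dot;

	assert(list != NULL && name != NULL);
	dot = strrchr(name, '.');
	if (dot == NULL || dot == name || dot[1] == '\0')
		return 0;	/* no extension, dotfile, or trailing dot */
	for (size_t i = 0; i < list->count; i++)
		if (strcmp(dot + 1, list->exts[i]) == 0)
			return 1;
	return 0;
}
```

A name-based match like this is only a hint, which is why the summary also mentions an ioctl for overriding it per file.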

>   Arnd



linux-next: manual merge of the cortex tree with Linus' tree

2012-10-15 Thread Stephen Rothwell
Hi Uwe,

Today's linux-next merge of the cortex tree got a conflict in
arch/arm/kernel/process.c between commit 9e14f828ee4a ("arm: split
ret_from_fork, simplify kernel_thread() [based on patch by rmk]") from
Linus' tree and commit 2f3e7d3436cb ("Cortex-M3: Add support for
exception handling") from the cortex tree.

I have no idea how to fix this up, so I have just dropped this tree for
today.

-- 
Cheers,
Stephen Rothwell    s...@canb.auug.org.au




linux-next: manual merge of the cortex tree with Linus' tree

2012-10-15 Thread Stephen Rothwell
Hi Uwe,

Today's linux-next merge of the cortex tree got a conflict in
arch/arm/include/asm/ptrace.h between commit cb8db5d4578a ("UAPI:
(Scripted) Disintegrate arch/arm/include/asm") from Linus' tree and
commit 69bc3631744a ("Cortex-M3: Add base support for Cortex-M3") from
the cortex tree.

I fixed it up (see below) and can carry the fix as necessary (no action
is required).

I also had to add this merge fix patch:

diff --git a/arch/arm/include/uapi/asm/ptrace.h b/arch/arm/include/uapi/asm/ptrace.h
index 96ee092..b71a3f8 100644
--- a/arch/arm/include/uapi/asm/ptrace.h
+++ b/arch/arm/include/uapi/asm/ptrace.h
@@ -49,13 +49,15 @@
 #define SYSTEM_MODE    0x001f
 #define MODE32_BIT 0x0010
 #define MODE_MASK  0x001f
-#define PSR_T_BIT  0x0020
-#define PSR_F_BIT  0x0040
-#define PSR_I_BIT  0x0080
-#define PSR_A_BIT  0x0100
-#define PSR_E_BIT  0x0200
-#define PSR_J_BIT  0x0100
-#define PSR_Q_BIT  0x0800
+#define V4_PSR_T_BIT   0x0020  /* >= V4T, but not V7M */
+#define V7M_PSR_T_BIT  0x0100
+#define PSR_T_BIT  V4_PSR_T_BIT
+#define PSR_F_BIT  0x0040  /* >= V4, but not V7M */
+#define PSR_I_BIT  0x0080  /* >= V4, but not V7M */
+#define PSR_A_BIT  0x0100  /* >= V6, but not V7M */
+#define PSR_E_BIT  0x0200  /* >= V6, but not V7M */
+#define PSR_J_BIT  0x0100  /* >= V5J, but not V7M */
+#define PSR_Q_BIT  0x0800  /* >= V5E, including V7M */
 #define PSR_V_BIT  0x1000
 #define PSR_C_BIT  0x2000
 #define PSR_Z_BIT  0x4000
@@ -125,6 +127,7 @@ struct pt_regs {
 #define ARM_r1 uregs[1]
 #define ARM_r0 uregs[0]
 #define ARM_ORIG_r0    uregs[17]
+#define ARM_EXC_RET    uregs[18]
 
 /*
  * The size of the user-visible VFP state as seen by PTRACE_GET/SETVFPREGS

-- 
Cheers,
Stephen Rothwell    s...@canb.auug.org.au

diff --cc arch/arm/include/asm/ptrace.h
index 3d52ee1,090fea7..000
--- a/arch/arm/include/asm/ptrace.h
+++ b/arch/arm/include/asm/ptrace.h
@@@ -10,12 -10,156 +10,29 @@@
  #ifndef __ASM_ARM_PTRACE_H
  #define __ASM_ARM_PTRACE_H
  
 -#include 
 +#include 
  
 -#define PTRACE_GETREGS        12
 -#define PTRACE_SETREGS        13
 -#define PTRACE_GETFPREGS      14
 -#define PTRACE_SETFPREGS      15
 -/* PTRACE_ATTACH is 16 */
 -/* PTRACE_DETACH is 17 */
 -#define PTRACE_GETWMMXREGS    18
 -#define PTRACE_SETWMMXREGS    19
 -/* 20 is unused */
 -#define PTRACE_OLDSETOPTIONS  21
 -#define PTRACE_GET_THREAD_AREA 22
 -#define PTRACE_SET_SYSCALL    23
 -/* PTRACE_SYSCALL is 24 */
 -#define PTRACE_GETCRUNCHREGS  25
 -#define PTRACE_SETCRUNCHREGS  26
 -#define PTRACE_GETVFPREGS     27
 -#define PTRACE_SETVFPREGS     28
 -#define PTRACE_GETHBPREGS     29
 -#define PTRACE_SETHBPREGS     30
 -
 -/*
 - * PSR bits
 - * Note on V7M there is no mode contained in the PSR
 - */
 -#define USR26_MODE0x
 -#define FIQ26_MODE0x0001
 -#define IRQ26_MODE0x0002
 -#define SVC26_MODE0x0003
+ #if defined(__KERNEL__) && defined(CONFIG_CPU_V7M)
+ /*
+  * Use 0 here to get code right that creates a userspace
+  * or kernel space thread
+  */
++#undef USR_MODE
++#undef SVC_MODE
++#undef PSR_T_BIT
+ #define USR_MODE  0x
+ #define SVC_MODE  0x
 -#else
 -#define USR_MODE  0x0010
 -#define SVC_MODE  0x0013
 -#endif
 -#define FIQ_MODE  0x0011
 -#define IRQ_MODE  0x0012
 -#define ABT_MODE  0x0017
 -#define UND_MODE  0x001b
 -#define SYSTEM_MODE   0x001f
 -#define MODE32_BIT0x0010
 -#define MODE_MASK 0x001f
 -
 -#define V4_PSR_T_BIT  0x0020  /* >= V4T, but not V7M */
 -#define V7M_PSR_T_BIT 0x0100
 -#if defined(__KERNEL__) && defined(CONFIG_CPU_V7M)
+ #define PSR_T_BIT V7M_PSR_T_BIT
 -#else
 -/* for compatibility */
 -#define PSR_T_BIT V4_PSR_T_BIT
 -#endif
 -
 -#define PSR_F_BIT 0x0040  /* >= V4, but not V7M */
 -#define PSR_I_BIT 0x0080  /* >= V4, but not V7M */
 -#define PSR_A_BIT 0x0100  /* >= V6, but not V7M */
 -#define PSR_E_BIT 0x0200  /* >= V6, but not V7M */
 -#define PSR_J_BIT 0x0100  /* >= V5J, but not V7M */
 -#define PSR_Q_BIT 0x0800  /* >= V5E, including V7M */
 -#define PSR_V_BIT 0x1000
 -#define PSR_C_BIT 0x2000
 -#define PSR_Z_BIT 0x4000
 -#define PSR_N_BIT 0x8000
 -
 -/*
 - * Groups of PSR bits
 - */
 -#define PSR_f 0xff00  /* Flags*/
 -#define PSR_s 0x00ff  /* Status   */
 -#define PSR_x 0xff00  /* Extension*/
 -#define PSR_c 0x00ff  /* Control  */
 -
 -/*
 - * ARMv7 groups of PSR bits
 - */
 -#define APSR_MASK 0xf80f  /* N, Z, C, V, Q and GE flags */
 -#define PSR_ISET_MASK 0x0110  /* ISA state (J, T) mask 

Re: [PATCH] [media] stk1160: Check return value of stk1160_read_reg() in stk1160_i2c_read_reg()

2012-10-15 Thread Ezequiel Garcia
On Mon, Oct 15, 2012 at 9:03 PM, Jesper Juhl  wrote:
> On Mon, 15 Oct 2012, Ezequiel Garcia wrote:
>
>> On Mon, Oct 15, 2012 at 7:52 PM, Jesper Juhl  wrote:
>> > On Mon, 15 Oct 2012, Jesper Juhl wrote:
>> >
>> >> On Sat, 13 Oct 2012, Ezequiel Garcia wrote:
>> >>
> [...]
>> > Currently there are two checks for 'rc' being less than zero with no
>> > change to 'rc' between the two, so the second is just dead code.
>> > The intention seems to have been to assign the return value of
>> > 'stk1160_read_reg()' to 'rc' before the (currently dead) second check
>> > and then test /that/. This patch does that.
>> >
>>
>> This is an overly complicated explanation for such a small patch.
>> Can you try to simplify it?
>>
> How's this?
>
>
> From: Jesper Juhl 
> Date: Sat, 13 Oct 2012 00:16:37 +0200
> Subject: [PATCH] [media] stk1160: Check return value of stk1160_read_reg() in stk1160_i2c_read_reg()
>
> Remember to collect the exit status from 'stk1160_read_reg()' in 'rc'
> before testing it for less than zero.
>
> Signed-off-by: Jesper Juhl 
> ---
>  drivers/media/usb/stk1160/stk1160-i2c.c |3 +--
>  1 files changed, 1 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/media/usb/stk1160/stk1160-i2c.c b/drivers/media/usb/stk1160/stk1160-i2c.c
> index 176ac93..a2370e4 100644
> --- a/drivers/media/usb/stk1160/stk1160-i2c.c
> +++ b/drivers/media/usb/stk1160/stk1160-i2c.c
> @@ -116,10 +116,9 @@ static int stk1160_i2c_read_reg(struct stk1160 *dev, u8 addr,
> if (rc < 0)
> return rc;
>
> -   stk1160_read_reg(dev, STK1160_SBUSR_RD, value);
> +   rc = stk1160_read_reg(dev, STK1160_SBUSR_RD, value);
> if (rc < 0)
> return rc;
> -

Sorry for the nitpick, but I'd like you to *not* remove this line.

Thanks

Ezequiel
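
The pattern the quoted patch fixes can be shown in isolation: when the result of the second call is not stored, the second "rc < 0" check re-tests the first call's result and can never fire. The sketch below uses stub functions in place of the driver's register accessors; first_op(), second_op() and the read_reg_* names are hypothetical, and -5 merely stands in for an error such as -EIO.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for two register accesses; the second one fails. */
static int first_op(void)
{
	return 0;
}

static int second_op(int *value)
{
	(void)value;
	return -5;	/* stands in for an error code such as -EIO */
}

/* Before the fix: the second call's result is dropped, so the check is dead. */
static int read_reg_buggy(int *value)
{
	int rc;

	assert(value != NULL);
	rc = first_op();
	if (rc < 0)
		return rc;

	second_op(value);
	if (rc < 0)	/* still tests the result of first_op() */
		return rc;

	return 0;
}

/* After the fix: collect the result in rc before testing it. */
static int read_reg_fixed(int *value)
{
	int rc;

	assert(value != NULL);
	rc = first_op();
	if (rc < 0)
		return rc;

	rc = second_op(value);
	if (rc < 0)
		return rc;

	return 0;
}
```

Here read_reg_buggy() reports success even though the second access failed, while read_reg_fixed() propagates the error.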


[PATCH] tools/include: use stdint types for user-space byteshift headers

2012-10-15 Thread Yaakov (Cygwin/X)
From: Yaakov Selkowitz 

Commit a07f7672d7cf0ff0d6e548a9feb6e0bd016d9c6c added user-space copies
of the byteshift headers to be used by hostprogs, changing e.g. u8 to __u8.
However, in order to cross-compile the kernel from a non-Linux system,
stdint.h types need to be used instead of linux/types.h types.

Signed-off-by: Yaakov Selkowitz 
---
Ping 2; still hasn't been merged, and there has been no on-list response to
the original posting of 11 June.

Also applies to linux-3.[3456].y

 tools/include/tools/be_byteshift.h |   34 +-
 tools/include/tools/le_byteshift.h |   34 +-
 2 files changed, 34 insertions(+), 34 deletions(-)

diff --git a/tools/include/tools/be_byteshift.h b/tools/include/tools/be_byteshift.h
index f4912e2..84c17d8 100644
--- a/tools/include/tools/be_byteshift.h
+++ b/tools/include/tools/be_byteshift.h
@@ -1,68 +1,68 @@
 #ifndef _TOOLS_BE_BYTESHIFT_H
 #define _TOOLS_BE_BYTESHIFT_H
 
-#include <linux/types.h>
+#include <stdint.h>
 
-static inline __u16 __get_unaligned_be16(const __u8 *p)
+static inline uint16_t __get_unaligned_be16(const uint8_t *p)
 {
return p[0] << 8 | p[1];
 }
 
-static inline __u32 __get_unaligned_be32(const __u8 *p)
+static inline uint32_t __get_unaligned_be32(const uint8_t *p)
 {
return p[0] << 24 | p[1] << 16 | p[2] << 8 | p[3];
 }
 
-static inline __u64 __get_unaligned_be64(const __u8 *p)
+static inline uint64_t __get_unaligned_be64(const uint8_t *p)
 {
-   return (__u64)__get_unaligned_be32(p) << 32 |
+   return (uint64_t)__get_unaligned_be32(p) << 32 |
   __get_unaligned_be32(p + 4);
 }
 
-static inline void __put_unaligned_be16(__u16 val, __u8 *p)
+static inline void __put_unaligned_be16(uint16_t val, uint8_t *p)
 {
*p++ = val >> 8;
*p++ = val;
 }
 
-static inline void __put_unaligned_be32(__u32 val, __u8 *p)
+static inline void __put_unaligned_be32(uint32_t val, uint8_t *p)
 {
__put_unaligned_be16(val >> 16, p);
__put_unaligned_be16(val, p + 2);
 }
 
-static inline void __put_unaligned_be64(__u64 val, __u8 *p)
+static inline void __put_unaligned_be64(uint64_t val, uint8_t *p)
 {
__put_unaligned_be32(val >> 32, p);
__put_unaligned_be32(val, p + 4);
 }
 
-static inline __u16 get_unaligned_be16(const void *p)
+static inline uint16_t get_unaligned_be16(const void *p)
 {
-   return __get_unaligned_be16((const __u8 *)p);
+   return __get_unaligned_be16((const uint8_t *)p);
 }
 
-static inline __u32 get_unaligned_be32(const void *p)
+static inline uint32_t get_unaligned_be32(const void *p)
 {
-   return __get_unaligned_be32((const __u8 *)p);
+   return __get_unaligned_be32((const uint8_t *)p);
 }
 
-static inline __u64 get_unaligned_be64(const void *p)
+static inline uint64_t get_unaligned_be64(const void *p)
 {
-   return __get_unaligned_be64((const __u8 *)p);
+   return __get_unaligned_be64((const uint8_t *)p);
 }
 
-static inline void put_unaligned_be16(__u16 val, void *p)
+static inline void put_unaligned_be16(uint16_t val, void *p)
 {
__put_unaligned_be16(val, p);
 }
 
-static inline void put_unaligned_be32(__u32 val, void *p)
+static inline void put_unaligned_be32(uint32_t val, void *p)
 {
__put_unaligned_be32(val, p);
 }
 
-static inline void put_unaligned_be64(__u64 val, void *p)
+static inline void put_unaligned_be64(uint64_t val, void *p)
 {
__put_unaligned_be64(val, p);
 }
diff --git a/tools/include/tools/le_byteshift.h b/tools/include/tools/le_byteshift.h
index c99d45a..8fe9f24 100644
--- a/tools/include/tools/le_byteshift.h
+++ b/tools/include/tools/le_byteshift.h
@@ -1,68 +1,68 @@
 #ifndef _TOOLS_LE_BYTESHIFT_H
 #define _TOOLS_LE_BYTESHIFT_H
 
-#include <linux/types.h>
+#include <stdint.h>
 
-static inline __u16 __get_unaligned_le16(const __u8 *p)
+static inline uint16_t __get_unaligned_le16(const uint8_t *p)
 {
return p[0] | p[1] << 8;
 }
 
-static inline __u32 __get_unaligned_le32(const __u8 *p)
+static inline uint32_t __get_unaligned_le32(const uint8_t *p)
 {
return p[0] | p[1] << 8 | p[2] << 16 | p[3] << 24;
 }
 
-static inline __u64 __get_unaligned_le64(const __u8 *p)
+static inline uint64_t __get_unaligned_le64(const uint8_t *p)
 {
-   return (__u64)__get_unaligned_le32(p + 4) << 32 |
+   return (uint64_t)__get_unaligned_le32(p + 4) << 32 |
   __get_unaligned_le32(p);
 }
 
-static inline void __put_unaligned_le16(__u16 val, __u8 *p)
+static inline void __put_unaligned_le16(uint16_t val, uint8_t *p)
 {
*p++ = val;
*p++ = val >> 8;
 }
 
-static inline void __put_unaligned_le32(__u32 val, __u8 *p)
+static inline void __put_unaligned_le32(uint32_t val, uint8_t *p)
 {
__put_unaligned_le16(val >> 16, p + 2);
__put_unaligned_le16(val, p);
 }
 
-static inline void __put_unaligned_le64(__u64 val, __u8 *p)
+static inline void __put_unaligned_le64(uint64_t val, uint8_t *p)
 {
__put_unaligned_le32(val >> 32, p + 4);
__put_unaligned_le32(val, p);
 }
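
As a standalone illustration of what these helpers do once they depend only on stdint.h, here is a minimal sketch of 32-bit big-endian and little-endian accessors. The names get_be32(), get_le32() and put_be32() are shortened, hypothetical variants of the functions in the patch, and unlike the patch the sketch casts each byte to uint32_t before shifting so that the shifts never happen on a signed int.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read a 32-bit big-endian value from a possibly unaligned buffer. */
static inline uint32_t get_be32(const uint8_t *p)
{
	assert(p != NULL);
	return (uint32_t)p[0] << 24 | (uint32_t)p[1] << 16 |
	       (uint32_t)p[2] << 8 | (uint32_t)p[3];
}

/* Read a 32-bit little-endian value from a possibly unaligned buffer. */
static inline uint32_t get_le32(const uint8_t *p)
{
	assert(p != NULL);
	return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
	       (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

/* Store a 32-bit value in big-endian byte order, one byte at a time. */
static inline void put_be32(uint32_t val, uint8_t *p)
{
	assert(p != NULL);
	p[0] = (uint8_t)(val >> 24);
	p[1] = (uint8_t)(val >> 16);
	p[2] = (uint8_t)(val >> 8);
	p[3] = (uint8_t)val;
}
```

Because only fixed-width stdint.h types appear, the same code builds on hosts that do not ship linux/types.h, which is the point of the patch.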
 

RE: [PATCH 11/16] f2fs: add inode operations for special inodes

2012-10-15 Thread Jaegeuk Kim
> On Sun, Oct 14, 2012 at 03:19:37PM +, Arnd Bergmann wrote:
> > On Sunday 14 October 2012, Vyacheslav Dubeyko wrote:
> > > On Oct 14, 2012, at 11:09 AM, Jaegeuk Kim wrote:
> > > > 2012-10-14 (Sun), 02:21 +0400, Vyacheslav Dubeyko:
> > > Extended attributes are a more flexible way, from my point of view. An
> > > xattr makes it possible to give the filesystem a hint at any time,
> > > independently of what the application itself supports. A documented way
> > > of using such extended attributes gives the user a flexible way to
> > > control filesystem behavior (but I remember that you don't believe in
> > > the user :-)).
> > >
> > > So, I think that fadvise() and extended attributes can be complementary
> > > solutions.
> >
> > Right. Another option is to have ext4 style attributes, see
> > http://linux.die.net/man/1/chattr
> 
> Xattrs are much preferred to more "ext4 style" flags because xattrs
> are filesystem independent. Indeed, some filesystems can't store any
> new "ext4 style" flags without a change of disk format or
> internally mapping them to an xattr. So really, xattrs are the best
> way forward for such hints.
> 
> > Unlike extended attributes, there is a limited number of those,
> > and they can only be boolean flags, but that might be enough for
> > this particular use case.
> 
> A boolean is not sufficient for access policy hints. An extensible
> xattr format is probably the best approach to take here, so that we
> can easily introduce new access policy hints as functionality is
> required. Indeed, an extensible xattr could start with just a
> hot/cold boolean, and grow from there.
> 
> > The main reason I can see against extended attributes is that they are not stored
> > very efficiently in f2fs, unless a lot of work is put into coming up with a good
> > implementation. A single flags bit can trivially be added to the inode in
> > comparison (if it's not there already).
> 
> That's a deficiency that should be corrected, then, because xattrs
> are very common these days.

IMO, most file systems, including f2fs, are somewhat inefficient at storing
and retrieving xattrs, since they have to allocate an additional block.
The only problem specific to f2fs is the cleaning overhead.
So, that's why xattrs are not an efficient approach in f2fs.

OTOH, I think xattrs themselves are for users, not for communication
between the file system and users.
Moreover, I'm not sure about current Android, but I saw that ICS Android
did not call any xattr operations, even when the mount option was enabled.

> 
> And given that stuff like access frequency tracking is being
> implemented at the VFS level, access policy hints should also be VFS
> functionality. A bad filesystem implementation should not dictate
> the interface for generically useful functionality
> 
> > > Anyway, hardcoding a list of file extensions or saving it in the
> > > filesystem is a nasty way. From my point of view, reconfiguring
> > > the filesystem by means of tunefs or debugfs in order to add file
> > > extensions can be unsafe or hard for users to understand in such
> > > "black boxes" as TVs or smartphones.
> >
> > It is only a performance hint though, so it is not a correctness
> > issue if the file system gets it wrong. In order to do efficient
> > garbage collection, a log structured file system should take all
> > the information it can get about the expected life of data it
> > writes. I agree that the list, even in the form of mkfs time
> > settings, is not a clean abstraction, but in the place of an
> > Android phone manufacturer I would still enable it if it promises a
> > significant performance advantage over not using it. I guess it
> > would be nice if this could be overridden in some form, e.g. using
> > an ioctl on the file as ext4 does.
> 
> An xattr on the root inode that holds a list like this is something
> that could be set at mkfs time, but then also updated easily by new
> software packages that are installed...
> 
> > We should also take the kinds of access we have seen on a file into account.
> 
> Yes, but it should be done at the VFS level, not in the filesystem
> itself. Integrated into the current hot inode/range tracking that is
> being worked on right now, I'd suggest.
> 
> IOWs, these access policy issues are not unique to F2FS or its use
> case. Anything to do with access hints, policy, tracking, file
> classification, etc that can influence data locality, reclaim,
> migration, etc need to be dealt with at the VFS, independently of a
> specific filesystem. Filesystems can make use of that information
> how they please (whether in the kernel or via userspace tools), but
> having filesystem specific interfaces and implementations of the
> same functionality is extremely wasteful. Let's do it once, and do
> it right the first time. ;)

I agree that VFS should support something, but before then, it needs
to do something by the file 

Re: [RFC PATCH 1/3] mm: teach mm by current context info to not do I/O during memory allocation

2012-10-15 Thread Ming Lei
On Mon, Oct 15, 2012 at 11:47 PM, Minchan Kim  wrote:
> On Mon, Oct 15, 2012 at 01:14:17PM +0800, Ming Lei wrote:
>> This patch introduces PF_MEMALLOC_NOIO as a process flag (in the 'flags'
>> field of 'struct task_struct'), so that the flag can be set by one task
>> to avoid doing I/O inside memory allocation in the task's context.
>>
>> The patch tries to solve a deadlock problem caused by block devices,
>> and the problem can occur at least in the situations below:
>>
>> - during block device runtime resume, if memory allocation
>> with GFP_KERNEL is called inside the runtime resume callback of any one
>> of its ancestors (or the block device itself), the deadlock may be
>> triggered inside the memory allocation since it might not complete
>> until the block device becomes active and the involved page I/O finishes.
>> The situation was first pointed out by Alan Stern. It is not a good
>> approach to convert all GFP_KERNEL in the path into GFP_NOIO because
>> several subsystems may be involved (for example, PCI, USB and SCSI may
>> be involved for a USB mass storage device).
>
> Couldn't we expand pm_restrict_gfp_mask to cover resume path as well as
> suspend path?

IMO, we could, but it is not a good idea and might cause memory allocation
problems.

pm_restrict_gfp_mask() uses the global variable gfp_allowed_mask to
avoid allocating pages with GFP_IOFS in all contexts during system sleep,
when processes have been frozen.

But during runtime PM, the whole system is running and all processes are
runnable. Also, runtime PM is per device and the whole system may have
lots of devices, so using the global gfp_allowed_mask may keep page
allocation restricted to ~GFP_IOFS for a considerable proportion of system
running time, and alloc_page() will then fail more easily.

The above deadlock problem may be fixed by allocating memory with
~GFP_IOFS only in the context that calls runtime_resume, and that is
the idea of the patch.
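
The difference between the global mask and a per-task flag can be sketched
in userspace C. This is only a model under stated assumptions: the flag
values are made up, and MODEL_GFP_*, MODEL_PF_MEMALLOC_NOIO,
struct model_task and model_effective_gfp() are hypothetical names used
for illustration, not the kernel's definitions or the code in the patch.

```c
#include <assert.h>

/* Hypothetical flag values for illustration; not the kernel's. */
#define MODEL_GFP_WAIT   0x1u
#define MODEL_GFP_IO     0x2u
#define MODEL_GFP_FS     0x4u
#define MODEL_GFP_KERNEL (MODEL_GFP_WAIT | MODEL_GFP_IO | MODEL_GFP_FS)
#define MODEL_GFP_IOFS   (MODEL_GFP_IO | MODEL_GFP_FS)

#define MODEL_PF_MEMALLOC_NOIO 0x1u

struct model_task {
	unsigned int flags;	/* per-task flags, like task_struct->flags */
};

/* Global mask, as used for system sleep: it restricts every task. */
static unsigned int model_gfp_allowed_mask = ~0u;

/* Compute the flags an allocation would effectively use in this model. */
static unsigned int model_effective_gfp(const struct model_task *t,
					unsigned int gfp)
{
	assert(t);
	gfp &= model_gfp_allowed_mask;		/* global restriction */
	if (t->flags & MODEL_PF_MEMALLOC_NOIO)	/* per-task restriction */
		gfp &= ~MODEL_GFP_IOFS;
	return gfp;
}
```

In this model only the task that set the flag loses the I/O and FS bits;
every other task still allocates with its full mask, which is the property
the global gfp_allowed_mask cannot provide while the system is running.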

>
>>
>> - during error handling of a USB mass storage device, a USB
>> bus reset will be performed on the device, so there shouldn't be any
>> memory allocation with GFP_KERNEL during the USB bus reset, otherwise
>> a deadlock similar to the above may be triggered. Unfortunately, any
>> USB device may include a mass storage interface in theory, so it
>> requires all USB interface drivers to handle the situation. In fact,
>> most USB drivers don't know how to handle a bus reset on the device
>> and don't provide .pre_reset() and .post_reset() callbacks at all, so
>> the USB core has to unbind and rebind the driver for these devices. So it
>> is still not practical to resort to GFP_NOIO for solving the problem.
>
> I hope this case could be handled by usb core like usb_restrict_gfp_mask
> rather than adding new branch on fast path.

See above, applying the global gfp_allowed_mask is not good.


Thanks,
--
Ming Lei
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 9/9] aoe: update driver-internal version number to 60

2012-10-15 Thread Ed Cashin
Signed-off-by: Ed Cashin 
---
 drivers/block/aoe/aoe.h |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/block/aoe/aoe.h b/drivers/block/aoe/aoe.h
index 8e8da1c..536942b 100644
--- a/drivers/block/aoe/aoe.h
+++ b/drivers/block/aoe/aoe.h
@@ -1,5 +1,5 @@
 /* Copyright (c) 2012 Coraid, Inc.  See COPYING for GPL terms. */
-#define VERSION "50"
+#define VERSION "60"
 #define AOE_MAJOR 152
 #define DEVICE_NAME "aoe"
 
-- 
1.7.1



Re: [PATCH v4 05/24] block: Use bio_sectors() more consistently

2012-10-15 Thread Ed Cashin
The aoe changes look OK, thanks.

-- 
  Ed Cashin
  ecas...@coraid.com




[PATCH 8/9] aoe: whitespace cleanup

2012-10-15 Thread Ed Cashin
Signed-off-by: Ed Cashin 
---
 drivers/block/aoe/aoe.h |2 +-
 drivers/block/aoe/aoechr.c  |2 +-
 drivers/block/aoe/aoecmd.c  |6 +++---
 drivers/block/aoe/aoemain.c |2 +-
 drivers/block/aoe/aoenet.c  |4 ++--
 5 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/drivers/block/aoe/aoe.h b/drivers/block/aoe/aoe.h
index 52f75c0..8e8da1c 100644
--- a/drivers/block/aoe/aoe.h
+++ b/drivers/block/aoe/aoe.h
@@ -151,7 +151,7 @@ struct aoedev {
struct work_struct work;/* disk create work struct */
struct gendisk *gd;
struct request_queue *blkq;
-   struct hd_geometry geo; 
+   struct hd_geometry geo;
sector_t ssize;
struct timer_list timer;
spinlock_t lock;
diff --git a/drivers/block/aoe/aoechr.c b/drivers/block/aoe/aoechr.c
index 2bf6273..42e67ad 100644
--- a/drivers/block/aoe/aoechr.c
+++ b/drivers/block/aoe/aoechr.c
@@ -287,7 +287,7 @@ aoechr_init(void)
int n, i;
 
n = register_chrdev(AOE_MAJOR, "aoechr", &aoe_fops);
-   if (n < 0) { 
+   if (n < 0) {
printk(KERN_ERR "aoe: can't register char device\n");
return n;
}
diff --git a/drivers/block/aoe/aoecmd.c b/drivers/block/aoe/aoecmd.c
index 82e16c4..c491fba 100644
--- a/drivers/block/aoe/aoecmd.c
+++ b/drivers/block/aoe/aoecmd.c
@@ -978,7 +978,7 @@ ktiocomplete(struct frame *f)
pr_err("aoe: ata error cmd=%2.2Xh stat=%2.2Xh from e%ld.%d\n",
ahout->cmdstat, ahin->cmdstat,
d->aoemajor, d->aoeminor);
-noskb: if (buf)
+noskb: if (buf)
clear_bit(BIO_UPTODATE, &buf->bio->bi_flags);
goto badrsp;
}
@@ -1191,7 +1191,7 @@ aoecmd_cfg(ushort aoemajor, unsigned char aoeminor)
aoecmd_cfg_pkts(aoemajor, aoeminor, &queue);
aoenet_xmit(&queue);
 }
- 
+
 struct sk_buff *
 aoecmd_ata_id(struct aoedev *d)
 {
@@ -1230,7 +1230,7 @@ aoecmd_ata_id(struct aoedev *d)
 
return skb_clone(skb, GFP_ATOMIC);
 }
- 
+
 static struct aoetgt *
 addtgt(struct aoedev *d, char *addr, ulong nframes)
 {
diff --git a/drivers/block/aoe/aoemain.c b/drivers/block/aoe/aoemain.c
index 04793c2..4b987c2 100644
--- a/drivers/block/aoe/aoemain.c
+++ b/drivers/block/aoe/aoemain.c
@@ -105,7 +105,7 @@ aoe_init(void)
aoechr_exit();
  chr_fail:
aoedev_exit();
-   
+
printk(KERN_INFO "aoe: initialisation failure.\n");
return ret;
 }
diff --git a/drivers/block/aoe/aoenet.c b/drivers/block/aoe/aoenet.c
index a1bb692..461b6c4 100644
--- a/drivers/block/aoe/aoenet.c
+++ b/drivers/block/aoe/aoenet.c
@@ -126,8 +126,8 @@ aoenet_xmit(struct sk_buff_head *queue)
}
 }
 
-/* 
- * (1) len doesn't include the header by default.  I want this. 
+/*
+ * (1) len doesn't include the header by default.  I want this.
  */
 static int
 aoenet_rcv(struct sk_buff *skb, struct net_device *ifp, struct packet_type *pt, struct net_device *orig_dev)
-- 
1.7.1



[PATCH 7/9] aoe: cleanup: remove unused ata_scnt function

2012-10-15 Thread Ed Cashin
Signed-off-by: Ed Cashin 
---
 drivers/block/aoe/aoecmd.c |   10 --
 1 files changed, 0 insertions(+), 10 deletions(-)

diff --git a/drivers/block/aoe/aoecmd.c b/drivers/block/aoe/aoecmd.c
index 2bb8c7d..82e16c4 100644
--- a/drivers/block/aoe/aoecmd.c
+++ b/drivers/block/aoe/aoecmd.c
@@ -552,16 +552,6 @@ sthtith(struct aoedev *d)
return 1;
 }
 
-static inline unsigned char
-ata_scnt(unsigned char *packet) {
-   struct aoe_hdr *h;
-   struct aoe_atahdr *ah;
-
-   h = (struct aoe_hdr *) packet;
-   ah = (struct aoe_atahdr *) (h+1);
-   return ah->scnt;
-}
-
 static void
 rexmit_timer(ulong vp)
 {
-- 
1.7.1



[PATCH 6/9] aoe: "payload" sysfs file exports per-AoE-command data transfer size

2012-10-15 Thread Ed Cashin
The userland aoetools package includes an "aoe-stat" command that
can display a "payload size" column when the aoe driver exports
this information.  Users can quickly see what amount of user data
is transferred inside each AoE command on the network, network
headers excluded.

Signed-off-by: Ed Cashin 
---
 drivers/block/aoe/aoeblk.c |   10 ++
 1 files changed, 10 insertions(+), 0 deletions(-)

diff --git a/drivers/block/aoe/aoeblk.c b/drivers/block/aoe/aoeblk.c
index d5aa3b8..56736cd 100644
--- a/drivers/block/aoe/aoeblk.c
+++ b/drivers/block/aoe/aoeblk.c
@@ -98,6 +98,14 @@ static ssize_t aoedisk_show_fwver(struct device *dev,
 
return snprintf(page, PAGE_SIZE, "0x%04x\n", (unsigned int) d->fw_ver);
 }
+static ssize_t aoedisk_show_payload(struct device *dev,
+   struct device_attribute *attr, char *page)
+{
+   struct gendisk *disk = dev_to_disk(dev);
+   struct aoedev *d = disk->private_data;
+
+   return snprintf(page, PAGE_SIZE, "%lu\n", d->maxbcnt);
+}
 
 static DEVICE_ATTR(state, S_IRUGO, aoedisk_show_state, NULL);
 static DEVICE_ATTR(mac, S_IRUGO, aoedisk_show_mac, NULL);
@@ -106,12 +114,14 @@ static struct device_attribute dev_attr_firmware_version = {
.attr = { .name = "firmware-version", .mode = S_IRUGO },
.show = aoedisk_show_fwver,
 };
+static DEVICE_ATTR(payload, S_IRUGO, aoedisk_show_payload, NULL);
 
 static struct attribute *aoe_attrs[] = {
&dev_attr_state.attr,
&dev_attr_mac.attr,
&dev_attr_netif.attr,
&dev_attr_firmware_version.attr,
+   &dev_attr_payload.attr,
NULL,
 };
 
-- 
1.7.1



[PATCH 5/9] aoe: support larger I/O requests via aoe_maxsectors module param

2012-10-15 Thread Ed Cashin
The GPFS filesystem is an example of an aoe user that requires the
aoe driver to support I/O request sizes larger than the default.
Most users will not need large I/O request sizes, because they would
need to be split up into multiple AoE commands anyway.

Signed-off-by: Ed Cashin 
---
 drivers/block/aoe/aoeblk.c |9 +
 1 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/drivers/block/aoe/aoeblk.c b/drivers/block/aoe/aoeblk.c
index 00dfc50..d5aa3b8 100644
--- a/drivers/block/aoe/aoeblk.c
+++ b/drivers/block/aoe/aoeblk.c
@@ -16,11 +16,18 @@
 #include 
 #include 
 #include 
+#include 
 #include "aoe.h"
 
 static DEFINE_MUTEX(aoeblk_mutex);
 static struct kmem_cache *buf_pool_cache;
 
+/* GPFS needs a larger value than the default. */
+static int aoe_maxsectors;
+module_param(aoe_maxsectors, int, 0644);
+MODULE_PARM_DESC(aoe_maxsectors,
+   "When nonzero, set the maximum number of sectors per I/O request");
+
 static ssize_t aoedisk_show_state(struct device *dev,
  struct device_attribute *attr, char *page)
 {
@@ -248,6 +255,8 @@ aoeblk_gdalloc(void *vp)
d->blkq = gd->queue = q;
q->queuedata = d;
d->gd = gd;
+   if (aoe_maxsectors)
+   blk_queue_max_hw_sectors(q, aoe_maxsectors);
gd->major = AOE_MAJOR;
gd->first_minor = d->sysminor;
gd->fops = &aoe_bdops;
-- 
1.7.1



[PATCH 4/9] aoe: support the forgetting (flushing) of a user-specified AoE target

2012-10-15 Thread Ed Cashin
Users sometimes want to cause the aoe driver to forget a
particular previously discovered device when it is no longer
online.  The aoetools provide an "aoe-flush" command that users
run to perform this administrative task.  The changes below
provide the support needed in the driver.

Signed-off-by: Ed Cashin 
---
 drivers/block/aoe/aoedev.c |   44 ++--
 1 files changed, 38 insertions(+), 6 deletions(-)

diff --git a/drivers/block/aoe/aoedev.c b/drivers/block/aoe/aoedev.c
index 90e5b53..63b2660 100644
--- a/drivers/block/aoe/aoedev.c
+++ b/drivers/block/aoe/aoedev.c
@@ -241,6 +241,30 @@ aoedev_freedev(struct aoedev *d)
kfree(d);
 }
 
+/* return whether the user asked for this particular
+ * device to be flushed
+ */
+static int
+user_req(char *s, size_t slen, struct aoedev *d)
+{
+   char *p;
+   size_t lim;
+
+   if (!d->gd)
+   return 0;
+   p = strrchr(d->gd->disk_name, '/');
+   if (!p)
+   p = d->gd->disk_name;
+   else
+   p += 1;
+   lim = sizeof(d->gd->disk_name);
+   lim -= p - d->gd->disk_name;
+   if (slen < lim)
+   lim = slen;
+
+   return !strncmp(s, p, lim);
+}
+
 int
 aoedev_flush(const char __user *str, size_t cnt)
 {
@@ -249,6 +273,7 @@ aoedev_flush(const char __user *str, size_t cnt)
struct aoedev *rmd = NULL;
char buf[16];
int all = 0;
+   int specified = 0;  /* flush a specific device */
 
if (cnt >= 3) {
if (cnt > sizeof buf)
@@ -256,26 +281,33 @@ aoedev_flush(const char __user *str, size_t cnt)
if (copy_from_user(buf, str, cnt))
return -EFAULT;
all = !strncmp(buf, "all", 3);
+   if (!all)
+   specified = 1;
}
 
spin_lock_irqsave(&devlist_lock, flags);
dd = &devlist;
while ((d = *dd)) {
spin_lock(&d->lock);
-   if ((!all && (d->flags & DEVFL_UP))
+   if (specified) {
+   if (!user_req(buf, cnt, d))
+   goto skip;
+   } else if ((!all && (d->flags & DEVFL_UP))
|| (d->flags & (DEVFL_GDALLOC|DEVFL_NEWSIZE))
|| d->nopen
-   || d->ref) {
-   spin_unlock(&d->lock);
-   dd = &d->next;
-   continue;
-   }
+   || d->ref)
+   goto skip;
+
*dd = d->next;
aoedev_downdev(d);
d->flags |= DEVFL_TKILL;
spin_unlock(&d->lock);
d->next = rmd;
rmd = d;
+   continue;
+skip:
+   spin_unlock(&d->lock);
+   dd = &d->next;
}
spin_unlock_irqrestore(&devlist_lock, flags);
while ((d = rmd)) {
-- 
1.7.1
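
The name comparison done by user_req() in the patch above can be modelled
in userspace as follows. This is a sketch, not driver code:
model_user_req() and MODEL_DISK_NAME_LEN are hypothetical names, the
32-byte name buffer is an assumption, and the gendisk lookup is replaced
by a plain string argument.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MODEL_DISK_NAME_LEN 32	/* assumed size of the name buffer */

/*
 * Compare the user's string against the part of the disk name that
 * follows the last '/'.  Returns nonzero on a match.
 */
static int model_user_req(const char *s, size_t slen, const char *disk_name)
{
	const char *p;
	size_t lim;

	assert(s && disk_name);
	p = strrchr(disk_name, '/');
	p = p ? p + 1 : disk_name;

	/* Never compare past the end of the name buffer ... */
	lim = MODEL_DISK_NAME_LEN - (size_t)(p - disk_name);
	/* ... nor past the end of what the user wrote. */
	if (slen < lim)
		lim = slen;

	return !strncmp(s, p, lim);
}
```

With a disk name of "etherd/e1.2", the input "e1.2" with length 4 matches.
Because the comparison length is capped at the input length, the same
input would also match "etherd/e1.21" in this model, and an input that
carries a trailing newline ("e1.2\n", length 5) would not match. Whether
callers ever pass such input is not stated in the patch.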



[PATCH 3/9] aoe: update cap on outstanding commands based on config query response

2012-10-15 Thread Ed Cashin
The ATA over Ethernet config query response contains a "buffer count"
field reflecting the AoE target's capacity to buffer incoming AoE
commands.

By taking the current value of this field into account, we increase
performance throughput or avoid network congestion, when the value
has increased or decreased, respectively.

Signed-off-by: Ed Cashin 
---
 drivers/block/aoe/aoe.h|6 +++---
 drivers/block/aoe/aoecmd.c |6 +-
 2 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/drivers/block/aoe/aoe.h b/drivers/block/aoe/aoe.h
index d2ed7f1..52f75c0 100644
--- a/drivers/block/aoe/aoe.h
+++ b/drivers/block/aoe/aoe.h
@@ -122,14 +122,14 @@ struct aoeif {
 
 struct aoetgt {
unsigned char addr[6];
-   ushort nframes;
+   ushort nframes; /* cap on frames to use */
struct aoedev *d;   /* parent device I belong to */
struct list_head ffree; /* list of free frames */
struct aoeif ifs[NAOEIFS];
struct aoeif *ifp;  /* current aoeif in use */
ushort nout;
-   ushort maxout;
-   ulong falloc;
+   ushort maxout;  /* current value for max outstanding */
+   ulong falloc;   /* number of allocated frames */
ulong lastwadj; /* last window adjustment */
int minbcnt;
int wpkts, rpkts;
diff --git a/drivers/block/aoe/aoecmd.c b/drivers/block/aoe/aoecmd.c
index 3804a0a..2bb8c7d 100644
--- a/drivers/block/aoe/aoecmd.c
+++ b/drivers/block/aoe/aoecmd.c
@@ -1373,7 +1373,11 @@ aoecmd_cfg_rsp(struct sk_buff *skb)
spin_lock_irqsave(&d->lock, flags);
 
t = gettgt(d, h->src);
-   if (!t) {
+   if (t) {
+   t->nframes = n;
+   if (n < t->maxout)
+   t->maxout = n;
+   } else {
t = addtgt(d, h->src, n);
if (!t)
goto bail;
-- 
1.7.1
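
The effect of the aoecmd_cfg_rsp() hunk above on an already known target
can be modelled in a few lines of userspace C. This is a sketch under
assumptions: struct model_tgt and model_update_cap() are hypothetical
names, and only the two fields touched by the hunk are modelled.

```c
#include <assert.h>

struct model_tgt {
	unsigned short nframes;	/* cap on frames to use */
	unsigned short maxout;	/* current value for max outstanding */
};

/* Apply a buffer count n reported in a config query response. */
static void model_update_cap(struct model_tgt *t, unsigned short n)
{
	assert(t);
	t->nframes = n;
	if (n < t->maxout)	/* the target shrank: lower the window now */
		t->maxout = n;
	/* A larger n only raises the cap; maxout is not raised here. */
}
```

A smaller buffer count takes effect at once, while a larger one only
lifts the ceiling. How maxout later grows toward the new nframes is not
shown in this patch, so the model leaves it unchanged.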



[PATCH 2/9] aoe: print warning regarding a common reason for dropped transmits

2012-10-15 Thread Ed Cashin
Dropped transmits are not common, but when they do occur, increasing
the transmit queue length often helps.

Signed-off-by: Ed Cashin 
---
 drivers/block/aoe/aoenet.c |   11 +--
 1 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/drivers/block/aoe/aoenet.c b/drivers/block/aoe/aoenet.c
index 162c647..a1bb692 100644
--- a/drivers/block/aoe/aoenet.c
+++ b/drivers/block/aoe/aoenet.c
@@ -50,7 +50,11 @@ __setup("aoe_iflist=", aoe_iflist_setup);
 static spinlock_t txlock;
 static struct sk_buff_head skbtxq;
 
-/* enters with txlock held */
+/* enters with txlock held
+ *
+ * Use __must_hold() for sparse when upcoming patch adds it to
+ * compiler.h.
+ */
 static int
 tx(void)
 {
@@ -58,7 +62,10 @@ tx(void)
 
while ((skb = skb_dequeue(&skbtxq))) {
spin_unlock_irq(&txlock);
-   dev_queue_xmit(skb);
+   if (dev_queue_xmit(skb) == NET_XMIT_DROP && net_ratelimit())
+   pr_warn("aoe: packet could not be sent on %s.  %s\n",
+   skb->dev ? skb->dev->name : "netif",
+   "consider increasing tx_queue_len");
spin_lock_irq(&txlock);
}
return 0;
-- 
1.7.1



[PATCH 1/9] aoe: describe the behavior of the "err" character device

2012-10-15 Thread Ed Cashin
Signed-off-by: Ed Cashin 
---
 drivers/block/aoe/aoechr.c |5 +
 1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/drivers/block/aoe/aoechr.c b/drivers/block/aoe/aoechr.c
index ed57a89..2bf6273 100644
--- a/drivers/block/aoe/aoechr.c
+++ b/drivers/block/aoe/aoechr.c
@@ -39,6 +39,11 @@ struct ErrMsg {
 };
 
 static DEFINE_MUTEX(aoechr_mutex);
+
+/* A ring buffer of error messages, to be read through
+ * "/dev/etherd/err".  When no messages are present,
+ * readers will block waiting for messages to appear.
+ */
 static struct ErrMsg emsgs[NMSG];
 static int emsgs_head_idx, emsgs_tail_idx;
 static struct completion emsgs_comp;
-- 
1.7.1



[PATCH 0/9] aoe: various enhancements and cleanup from v50 to v60

2012-10-15 Thread Ed Cashin
This patch series is based on linux-next/akpm from 11 Oct.

The patch that modifies aoenet.c:tx to print a warning does not affect
locking but nonetheless causes a new sparse context warning to appear.
Previously, a bug in sparse suppressed the warning.  We will soon be able
to use the new __must_hold() macro, which currently appears only in mm
(not in linux-next/akpm), making the warning go away by telling sparse
that the tx function enters and exits with a lock held.

Ed L. Cashin (9):
  aoe: describe the behavior of the "err" character device
  aoe: print warning regarding a common reason for dropped transmits
  aoe: update cap on outstanding commands based on config query
response
  aoe: support the forgetting (flushing) of a user-specified AoE target
  aoe: support larger I/O requests via aoe_maxsectors module param
  aoe: "payload" sysfs file exports per-AoE-command data transfer size
  aoe: cleanup: remove unused ata_scnt function
  aoe: whitespace cleanup
  aoe: update driver-internal version number to 60

 drivers/block/aoe/aoe.h |   10 
 drivers/block/aoe/aoeblk.c  |   19 ++
 drivers/block/aoe/aoechr.c  |7 +-
 drivers/block/aoe/aoecmd.c  |   22 +++-
 drivers/block/aoe/aoedev.c  |   44 +-
 drivers/block/aoe/aoemain.c |2 +-
 drivers/block/aoe/aoenet.c  |   15 ++---
 7 files changed, 88 insertions(+), 31 deletions(-)



[PATCH] perf probe: convert_name_to_addr() allocated the wrong size buffer for a function name

2012-10-15 Thread Hyeoncheol Lee
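convert_name_to_addr() allocated sizeof(char *) * MAX_PROBE_ARGS
bytes for the function name, which is unrelated to the length of the
"0x%llx" string stored there.  Allocate just enough space for the
"0x" prefix, the hex digits of an unsigned long long, and the
terminating NUL instead.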

Cc: Masami Hiramatsu 
Cc: Srikar Dronamraju 
Signed-off-by: Hyeoncheol Lee 
---
 tools/perf/util/probe-event.c |5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/tools/perf/util/probe-event.c b/tools/perf/util/probe-event.c
index 49a256e..bb40ed4 100644
--- a/tools/perf/util/probe-event.c
+++ b/tools/perf/util/probe-event.c
@@ -2352,13 +2352,14 @@ static int convert_name_to_addr(struct perf_probe_event *pev, const char *exec)
free(exec_copy);
}
free(pp->function);
-   pp->function = zalloc(sizeof(char *) * MAX_PROBE_ARGS);
+   pp->function = zalloc(sizeof(char) *
+ (3 + sizeof(unsigned long long) * 2));
if (!pp->function) {
ret = -ENOMEM;
pr_warning("Failed to allocate memory by zalloc.\n");
goto out;
}
-   e_snprintf(pp->function, MAX_PROBE_ARGS, "0x%llx", vaddr);
+   sprintf(pp->function, "0x%llx", vaddr);
ret = 0;
 
 out:
-- 
1.7.10.4
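
The size expression in the patch above can be checked with a small
standalone program. This is a sketch, not the perf code: format_vaddr()
and HEX_ADDR_BUFSZ are hypothetical names, and calloc()/snprintf() are
swapped in for the zalloc()/sprintf() calls used in the patch so that the
example builds outside the perf tree and keeps the write bounded.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* "0x" prefix (2) + two hex digits per byte + terminating NUL (1). */
#define HEX_ADDR_BUFSZ (3 + sizeof(unsigned long long) * 2)

/* Return a newly allocated "0x..." string for vaddr, or NULL on failure. */
static char *format_vaddr(unsigned long long vaddr)
{
	char *buf = calloc(1, HEX_ADDR_BUFSZ);
	int n;

	if (!buf)
		return NULL;
	/* snprintf() is used here so the write is bounded by the buffer. */
	n = snprintf(buf, HEX_ADDR_BUFSZ, "0x%llx", vaddr);
	assert(n > 0 && (size_t)n < HEX_ADDR_BUFSZ);
	return buf;
}
```

Even the largest value, ~0ULL, prints as "0x" followed by two hex digits
per byte, so it fills the buffer exactly up to the terminating NUL.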



Re: [Bug fix] nfs-client: fix nfs_inode_attrs_need_update for async read_done comes during truncating to smaller size

2012-10-15 Thread Chen Gang
On 2012-10-15 20:32, Myklebust, Trond wrote:
> RPC is not ordered. The fact that we get one RPC reply before another
> does not mean that the server sent them in that order.
> 
> This is doubly true when you use UDP as the transport protocol.

1) Does this mean that nfs_inode_attrs_need_update() need not consider the
async read_done situation?

2) For correctness, I do not think "nfs_size_to_loff_t(fattr->size) >
i_size_read(inode)" in nfs_size_need_update() is enough (it should at least
use "!=" instead of ">"). Do you agree?


3) Some further background:

  A) An old kernel version (such as 2.6.27-rc9) has no such issue
(because it did not have nfs_size_need_update()).

  B) The test tools I use are from LTP (Linux Test Project);
they use both UDP and TCP to test both NFSv2 and NFSv3.

  C) LTP truly has its limitations: "for the stress test, LTP puts the NFS
client and server on the same machine, which will cause kernel stability
issues". But for the net test, LTP uses different machines (I got our issue
from the LTP net test).


-- 
Chen Gang

Asianux Corporation


Re: [Q] Default SLAB allocator

2012-10-15 Thread JoonSoo Kim
Hello, Eric.

2012/10/14 Eric Dumazet :
> SLUB was really bad in the common workload you describe (allocations
> done by one cpu, freeing done by other cpus), because all kfree() hit
> the slow path and cpus contend in __slab_free() in the loop guarded by
> cmpxchg_double_slab(). SLAB has a cache for this, while SLUB directly
> hit the main "struct page" to add the freed object to freelist.

Could you elaborate more on how 'netperf RR' makes kernel "allocations
done by one cpu, freeing done by other cpus", please?
I don't have enough background in the network subsystem, so I'm just curious.

> Some months ago I played with adding a percpu associative cache to SLUB,
> then just moved on to another strategy.
>
> (Idea for this per cpu cache was to build a temporary free list of
> objects to batch accesses to struct page)

Is this implemented and submitted?
If it is, could you tell me the link for the patches?

Thanks!


[PATCH 3/7] kdb: Rename kdb_register_repeat() to kdb_register_flags()

2012-10-15 Thread Anton Vorontsov
We're about to add more options for command behaviour, so let's give
a more generic name to the low-level kdb command registration function.

These are just renames; there are no functional changes.

Signed-off-by: Anton Vorontsov 
---
 include/linux/kdb.h | 10 +++---
 kernel/debug/kdb/kdb_bp.c   | 16 -
 kernel/debug/kdb/kdb_main.c | 88 ++---
 kernel/trace/trace_kdb.c|  2 +-
 4 files changed, 58 insertions(+), 58 deletions(-)

diff --git a/include/linux/kdb.h b/include/linux/kdb.h
index cbd1c28..0142cd3 100644
--- a/include/linux/kdb.h
+++ b/include/linux/kdb.h
@@ -145,17 +145,17 @@ static inline const char *kdb_walk_kallsyms(loff_t *pos)
 
 /* Dynamic kdb shell command registration */
 extern int kdb_register(char *, kdb_func_t, char *, char *, short);
-extern int kdb_register_repeat(char *, kdb_func_t, char *, char *,
-  short, kdb_cmdflags_t);
+extern int kdb_register_flags(char *, kdb_func_t, char *, char *,
+ short, kdb_cmdflags_t);
 extern int kdb_unregister(char *);
 #else /* ! CONFIG_KGDB_KDB */
 static inline __printf(1, 2) int kdb_printf(const char *fmt, ...) { return 0; }
 static inline void kdb_init(int level) {}
 static inline int kdb_register(char *cmd, kdb_func_t func, char *usage,
   char *help, short minlen) { return 0; }
-static inline int kdb_register_repeat(char *cmd, kdb_func_t func, char *usage,
- char *help, short minlen,
- kdb_repeat_t repeat) { return 0; }
+static inline int kdb_register_flags(char *cmd, kdb_func_t func, char *usage,
+char *help, short minlen,
+kdb_cmdflags_t flags) { return 0; }
 static inline int kdb_unregister(char *cmd) { return 0; }
 #endif /* CONFIG_KGDB_KDB */
 enum {
diff --git a/kernel/debug/kdb/kdb_bp.c b/kernel/debug/kdb/kdb_bp.c
index 8418c2f..d2cb80d 100644
--- a/kernel/debug/kdb/kdb_bp.c
+++ b/kernel/debug/kdb/kdb_bp.c
@@ -545,23 +545,23 @@ void __init kdb_initbptab(void)
for (i = 0, bp = kdb_breakpoints; i < KDB_MAXBPT; i++, bp++)
bp->bp_free = 1;
 
-   kdb_register_repeat("bp", kdb_bp, "[]",
+   kdb_register_flags("bp", kdb_bp, "[]",
"Set/Display breakpoints", 0, KDB_REPEAT_NO_ARGS);
-   kdb_register_repeat("bl", kdb_bp, "[]",
+   kdb_register_flags("bl", kdb_bp, "[]",
"Display breakpoints", 0, KDB_REPEAT_NO_ARGS);
if (arch_kgdb_ops.flags & KGDB_HW_BREAKPOINT)
-   kdb_register_repeat("bph", kdb_bp, "[]",
+   kdb_register_flags("bph", kdb_bp, "[]",
"[datar [length]|dataw [length]]   Set hw brk", 0, KDB_REPEAT_NO_ARGS);
-   kdb_register_repeat("bc", kdb_bc, "",
+   kdb_register_flags("bc", kdb_bc, "",
"Clear Breakpoint", 0, KDB_REPEAT_NONE);
-   kdb_register_repeat("be", kdb_bc, "",
+   kdb_register_flags("be", kdb_bc, "",
"Enable Breakpoint", 0, KDB_REPEAT_NONE);
-   kdb_register_repeat("bd", kdb_bc, "",
+   kdb_register_flags("bd", kdb_bc, "",
"Disable Breakpoint", 0, KDB_REPEAT_NONE);
 
-   kdb_register_repeat("ss", kdb_ss, "",
+   kdb_register_flags("ss", kdb_ss, "",
"Single Step", 1, KDB_REPEAT_NO_ARGS);
-   kdb_register_repeat("ssb", kdb_ss, "",
+   kdb_register_flags("ssb", kdb_ss, "",
"Single step to branch/call", 0, KDB_REPEAT_NO_ARGS);
/*
 * Architecture dependent initialization.
diff --git a/kernel/debug/kdb/kdb_main.c b/kernel/debug/kdb/kdb_main.c
index c7a1797..bae9a1d 100644
--- a/kernel/debug/kdb/kdb_main.c
+++ b/kernel/debug/kdb/kdb_main.c
@@ -2683,7 +2683,7 @@ static int kdb_grep_help(int argc, const char **argv)
 }
 
 /*
- * kdb_register_repeat - This function is used to register a kernel
+ * kdb_register_flags - This function is used to register a kernel
  * debugger command.
  * Inputs:
  * cmd Command name
@@ -2695,12 +2695,12 @@ static int kdb_grep_help(int argc, const char **argv)
  * zero for success, one if a duplicate command.
  */
 #define kdb_command_extend 50  /* arbitrary */
-int kdb_register_repeat(char *cmd,
-   kdb_func_t func,
-   char *usage,
-   char *help,
-   short minlen,
-   kdb_cmdflags_t flags)
+int kdb_register_flags(char *cmd,
+  kdb_func_t func,
+  char *usage,
+  char *help,
+  short minlen,
+  kdb_cmdflags_t flags)
 {
int i;
kdbtab_t *kp;
@@ -2753,13 +2753,13 @@ int kdb_register_repeat(char *cmd,
 
return 0;
 }
-EXPORT_SYMBOL_GPL(kdb_register_repeat);
+EXPORT_SYMBOL_GPL(kdb_register_flags);
 
 
 /*
  * kdb_register - Compatibility 

[PATCH 2/7] kdb: Rename kdb_repeat_t to kdb_cmdflags_t, cmd_repeat to cmd_flags

2012-10-15 Thread Anton Vorontsov
We're about to add more options for command behaviour, so let's expand
the meaning of kdb_repeat_t.

So far we just do various renames, there should be no functional changes.

Signed-off-by: Anton Vorontsov 
---
 include/linux/kdb.h            | 4 ++--
 kernel/debug/kdb/kdb_main.c    | 6 +++---
 kernel/debug/kdb/kdb_private.h | 2 +-
 3 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/include/linux/kdb.h b/include/linux/kdb.h
index 7f6fe6e..cbd1c28 100644
--- a/include/linux/kdb.h
+++ b/include/linux/kdb.h
@@ -17,7 +17,7 @@ typedef enum {
KDB_REPEAT_NONE = 0, /* Do not repeat this command */
KDB_REPEAT_NO_ARGS, /* Repeat the command without arguments */
KDB_REPEAT_WITH_ARGS,   /* Repeat the command including its arguments */
-} kdb_repeat_t;
+} kdb_cmdflags_t;
 
 typedef int (*kdb_func_t)(int, const char **);
 
@@ -146,7 +146,7 @@ static inline const char *kdb_walk_kallsyms(loff_t *pos)
 /* Dynamic kdb shell command registration */
 extern int kdb_register(char *, kdb_func_t, char *, char *, short);
 extern int kdb_register_repeat(char *, kdb_func_t, char *, char *,
-  short, kdb_repeat_t);
+  short, kdb_cmdflags_t);
 extern int kdb_unregister(char *);
 #else /* ! CONFIG_KGDB_KDB */
 static inline __printf(1, 2) int kdb_printf(const char *fmt, ...) { return 0; }
diff --git a/kernel/debug/kdb/kdb_main.c b/kernel/debug/kdb/kdb_main.c
index cdaaa52..c7a1797 100644
--- a/kernel/debug/kdb/kdb_main.c
+++ b/kernel/debug/kdb/kdb_main.c
@@ -992,7 +992,7 @@ int kdb_parse(const char *cmdstr)
if (result && ignore_errors && result > KDB_CMD_GO)
result = 0;
KDB_STATE_CLEAR(CMD);
-   switch (tp->cmd_repeat) {
+   switch (tp->cmd_flags) {
case KDB_REPEAT_NONE:
argc = 0;
if (argv[0])
@@ -2700,7 +2700,7 @@ int kdb_register_repeat(char *cmd,
char *usage,
char *help,
short minlen,
-   kdb_repeat_t repeat)
+   kdb_cmdflags_t flags)
 {
int i;
kdbtab_t *kp;
@@ -2749,7 +2749,7 @@ int kdb_register_repeat(char *cmd,
kp->cmd_usage  = usage;
kp->cmd_help   = help;
kp->cmd_minlen = minlen;
-   kp->cmd_repeat = repeat;
+   kp->cmd_flags  = flags;
 
return 0;
 }
diff --git a/kernel/debug/kdb/kdb_private.h b/kernel/debug/kdb/kdb_private.h
index f8245b3..9e1b8e9 100644
--- a/kernel/debug/kdb/kdb_private.h
+++ b/kernel/debug/kdb/kdb_private.h
@@ -177,7 +177,7 @@ typedef struct _kdbtab {
char *cmd_help;  /* Help message for this command */
short cmd_minlen; /* Minimum legal # command
 * chars required */
-   kdb_repeat_t cmd_repeat; /* Does command auto repeat on enter? */
+   kdb_cmdflags_t cmd_flags;   /* Command behaviour flags */
 } kdbtab_t;
 
 extern int kdb_bt(int, const char **); /* KDB display back trace */
-- 
1.7.12.3

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 5/7] kdb: Remove KDB_REPEAT_NONE flag

2012-10-15 Thread Anton Vorontsov
Since we now treat KDB_REPEAT_* as flags, there is no need to
pass KDB_REPEAT_NONE. It's just the default behaviour when no
flags are specified.

Signed-off-by: Anton Vorontsov 
---
 include/linux/kdb.h |  1 -
 kernel/debug/kdb/kdb_bp.c   |  6 ++---
 kernel/debug/kdb/kdb_main.c | 61 ++---
 kernel/trace/trace_kdb.c    |  2 +-
 4 files changed, 34 insertions(+), 36 deletions(-)

diff --git a/include/linux/kdb.h b/include/linux/kdb.h
index c6f1ec3..792779c 100644
--- a/include/linux/kdb.h
+++ b/include/linux/kdb.h
@@ -14,7 +14,6 @@
  */
 
 typedef enum {
-   KDB_REPEAT_NONE = 0, /* Do not repeat this command */
KDB_REPEAT_NO_ARGS   = 0x1, /* Repeat the command w/o arguments */
KDB_REPEAT_WITH_ARGS = 0x2, /* Repeat the command w/ its arguments */
 } kdb_cmdflags_t;
diff --git a/kernel/debug/kdb/kdb_bp.c b/kernel/debug/kdb/kdb_bp.c
index d2cb80d..928e9e9 100644
--- a/kernel/debug/kdb/kdb_bp.c
+++ b/kernel/debug/kdb/kdb_bp.c
@@ -553,11 +553,11 @@ void __init kdb_initbptab(void)
kdb_register_flags("bph", kdb_bp, "[]",
"[datar [length]|dataw [length]]   Set hw brk", 0, 
KDB_REPEAT_NO_ARGS);
kdb_register_flags("bc", kdb_bc, "",
-   "Clear Breakpoint", 0, KDB_REPEAT_NONE);
+   "Clear Breakpoint", 0, 0);
kdb_register_flags("be", kdb_bc, "",
-   "Enable Breakpoint", 0, KDB_REPEAT_NONE);
+   "Enable Breakpoint", 0, 0);
kdb_register_flags("bd", kdb_bc, "",
-   "Disable Breakpoint", 0, KDB_REPEAT_NONE);
+   "Disable Breakpoint", 0, 0);
 
kdb_register_flags("ss", kdb_ss, "",
"Single Step", 1, KDB_REPEAT_NO_ARGS);
diff --git a/kernel/debug/kdb/kdb_main.c b/kernel/debug/kdb/kdb_main.c
index 7245bab..172b726 100644
--- a/kernel/debug/kdb/kdb_main.c
+++ b/kernel/debug/kdb/kdb_main.c
@@ -2752,7 +2752,7 @@ EXPORT_SYMBOL_GPL(kdb_register_flags);
 /*
  * kdb_register - Compatibility register function for commands that do
  * not need to specify a repeat state.  Equivalent to
- * kdb_register_flags with KDB_REPEAT_NONE.
+ * kdb_register_flags with flags set to 0.
  * Inputs:
  * cmd Command name
  * funcFunction to execute the command
@@ -2767,8 +2767,7 @@ int kdb_register(char *cmd,
 char *help,
 short minlen)
 {
-   return kdb_register_flags(cmd, func, usage, help, minlen,
- KDB_REPEAT_NONE);
+   return kdb_register_flags(cmd, func, usage, help, minlen, 0);
 }
 EXPORT_SYMBOL_GPL(kdb_register);
 
@@ -2822,70 +2821,70 @@ static void __init kdb_inittab(void)
kdb_register_flags("mm", kdb_mm, " ",
  "Modify Memory Contents", 0, KDB_REPEAT_NO_ARGS);
kdb_register_flags("go", kdb_go, "[]",
- "Continue Execution", 1, KDB_REPEAT_NONE);
+ "Continue Execution", 1, 0);
kdb_register_flags("rd", kdb_rd, "",
- "Display Registers", 0, KDB_REPEAT_NONE);
+ "Display Registers", 0, 0);
kdb_register_flags("rm", kdb_rm, " ",
- "Modify Registers", 0, KDB_REPEAT_NONE);
+ "Modify Registers", 0, 0);
kdb_register_flags("ef", kdb_ef, "",
- "Display exception frame", 0, KDB_REPEAT_NONE);
+ "Display exception frame", 0, 0);
kdb_register_flags("bt", kdb_bt, "[]",
- "Stack traceback", 1, KDB_REPEAT_NONE);
+ "Stack traceback", 1, 0);
kdb_register_flags("btp", kdb_bt, "",
- "Display stack for process ", 0, KDB_REPEAT_NONE);
+ "Display stack for process ", 0, 0);
kdb_register_flags("bta", kdb_bt, "[DRSTCZEUIMA]",
- "Display stack all processes", 0, KDB_REPEAT_NONE);
+ "Display stack all processes", 0, 0);
kdb_register_flags("btc", kdb_bt, "",
- "Backtrace current process on each cpu", 0, KDB_REPEAT_NONE);
+ "Backtrace current process on each cpu", 0, 0);
kdb_register_flags("btt", kdb_bt, "",
  "Backtrace process given its struct task address", 0,
-   KDB_REPEAT_NONE);
+   0);
kdb_register_flags("ll", kdb_ll, "  ",
- "Execute cmd for each element in linked list", 0, KDB_REPEAT_NONE);
+ "Execute cmd for each element in linked list", 0, 0);
kdb_register_flags("env", kdb_env, "",
- "Show environment variables", 0, KDB_REPEAT_NONE);
+ "Show environment variables", 0, 0);
kdb_register_flags("set", kdb_set, "",
- "Set environment variables", 0, KDB_REPEAT_NONE);
+ "Set environment variables", 0, 0);
kdb_register_flags("help", kdb_help, "",
- "Display Help Message", 1, KDB_REPEAT_NONE);
+ "Display Help Message", 1, 0);
kdb_register_flags("?", kdb_help, "",
- "Display Help Message", 0, KDB_REPEAT_NONE);
+ "Display Help Message", 0, 0);
kdb_register_flags("cpu", 

[PATCH 7/7] kdb: Add kiosk mode

2012-10-15 Thread Anton Vorontsov
By issuing 'echo 1 > /sys/module/kdb/parameters/kiosk' or
booting with the kdb.kiosk=1 kernel command line option, one can still have
a somewhat usable debugging facility without fearing that the
debugger can be used to easily gain root access or dump sensitive data.
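
The permission check this patch adds to kdb_parse() can be sketched as a
standalone function. This is a minimal userspace sketch, not the kernel
code: the flag values follow the enum from patch 6/7 of this series, while
kiosk_permits() is a hypothetical helper name used only for illustration.
As in kdb_parse(), argc counts the command name itself.

```c
#include <stdbool.h>

/* Flag values as defined in include/linux/kdb.h by patch 6/7. */
enum {
	KDB_REPEAT_NO_ARGS   = 0x1,
	KDB_REPEAT_WITH_ARGS = 0x2,
	KDB_SAFE             = 0x4,
	KDB_SAFE_NO_ARGS     = 0x8,
};

/*
 * Hypothetical helper: returns true if a command with the given flags may
 * run. argc includes the command name, so argc > 1 means arguments exist.
 */
static bool kiosk_permits(bool kiosk, unsigned int flags, int argc)
{
	if (!kiosk)
		return true;	/* kiosk mode off: no restrictions */
	if (!(flags & (KDB_SAFE | KDB_SAFE_NO_ARGS)))
		return false;	/* command is not marked safe at all */
	if ((flags & KDB_SAFE_NO_ARGS) && argc > 1)
		return false;	/* only safe when run without arguments */
	return true;
}
```

Under these assumptions, with kiosk mode on, "go" (KDB_SAFE_NO_ARGS) is
permitted on its own but refused when given an address, and "mm" (which
only carries KDB_REPEAT_NO_ARGS) is always refused.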

Without kiosk mode, obtaining root rights via KDB is a matter of
a few commands, and works everywhere. For example, log in as a normal
user:

cbou:~$ id
uid=1001(cbou) gid=1001(cbou) groups=1001(cbou)

Now enter KDB (for example via sysrq):

Entering kdb (current=0x8800065bc740, pid 920) due to Keyboard Entry
kdb> ps
23 sleeping system daemon (state M) processes suppressed,
use 'ps A' to see all.
Task Addr   Pid   Parent [*] cpu State Thread Command
0x8800065bc740  920  919  10   R  0x8800065bca20 *bash

0x88000707800010  00   S  0x8800070782e0  init
[...snip...]
0x8800065be3c0  9181  00   S  0x8800065be6a0  getty
0x8800065b9c80  9191  00   S  0x8800065b9f60  login
0x8800065bc740  920  919  10   R  0x8800065bca20 *bash

All we need is the offset of cred pointers. We can look up the offset in
the distro's kernel source, but it is unnecessary. We can just start
dumping init's task_struct, until we see the process name:

kdb> md 0x880007078000
0x880007078000 0001 88000703c000   
0x880007078010 00402102    .!@.
[...snip...]
0x8800070782b0 8800073e0580 8800073e0580   ..>...>.
0x8800070782c0 74696e69    init

^ Here, 'init'. Creds are just above it, so the offset is 0x02b0.

Now we set up init's creds for our non-privileged shell:

kdb> mm 0x8800065bc740+0x02b0 0x8800073e0580
0x8800065bc9f0 = 0x8800073e0580
kdb> mm 0x8800065bc740+0x02b8 0x8800073e0580
0x8800065bc9f8 = 0x8800073e0580

And thus we gain root:

kdb> go
cbou:~$ id
uid=0(root) gid=0(root) groups=0(root)
cbou:~$ bash
root:~#

p.s. No distro enables kdb by default (although, given the nice KDB-over-KMS
feature, I would expect at least some to enable it), so
this is not actually a major issue.

Signed-off-by: Anton Vorontsov 
---
 include/linux/kdb.h |  1 +
 kernel/debug/kdb/kdb_main.c | 20 +++-
 2 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/include/linux/kdb.h b/include/linux/kdb.h
index abe927c..3a2c554 100644
--- a/include/linux/kdb.h
+++ b/include/linux/kdb.h
@@ -63,6 +63,7 @@ extern atomic_t kdb_event;
 #define KDB_BADLENGTH  (-19)
 #define KDB_NOBP   (-20)
 #define KDB_BADADDR(-21)
+#define KDB_NOPERM (-22)
 
 /*
  * kdb_diemsg
diff --git a/kernel/debug/kdb/kdb_main.c b/kernel/debug/kdb/kdb_main.c
index 83c3f60..36e4c2a 100644
--- a/kernel/debug/kdb/kdb_main.c
+++ b/kernel/debug/kdb/kdb_main.c
@@ -12,6 +12,7 @@
  */
 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -23,6 +24,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -42,6 +44,12 @@
 #include 
 #include "kdb_private.h"
 
+#undef MODULE_PARAM_PREFIX
+#define MODULE_PARAM_PREFIX "kdb."
+
+static bool kdb_kiosk;
+module_param_named(kiosk, kdb_kiosk, bool, 0600);
+
 #define GREP_LEN 256
 char kdb_grep_string[GREP_LEN];
 int kdb_grepping_flag;
@@ -121,6 +129,7 @@ static kdbmsg_t kdbmsgs[] = {
KDBMSG(BADLENGTH, "Invalid length field"),
KDBMSG(NOBP, "No Breakpoint exists"),
KDBMSG(BADADDR, "Invalid address"),
+   KDBMSG(NOPERM, "Permission denied"),
 };
 #undef KDBMSG
 
@@ -987,6 +996,14 @@ int kdb_parse(const char *cmdstr)
 
if (i < kdb_max_commands) {
int result;
+
+   if (kdb_kiosk) {
+   if (!(tp->cmd_flags & (KDB_SAFE | KDB_SAFE_NO_ARGS)))
+   return KDB_NOPERM;
+   if (tp->cmd_flags & KDB_SAFE_NO_ARGS && argc > 1)
+   return KDB_NOPERM;
+   }
+
KDB_STATE_SET(CMD);
result = (*tp->cmd_func)(argc-1, (const char **)argv);
if (result && ignore_errors && result > KDB_CMD_GO)
@@ -1009,7 +1026,7 @@ int kdb_parse(const char *cmdstr)
 * obtaining the address of a variable, or the nearest symbol
 * to an address contained in a register.
 */
-   {
+   if (!kdb_kiosk) {
unsigned long value;
char *name = NULL;
long offset;
@@ -1025,6 +1042,7 @@ int kdb_parse(const char *cmdstr)
kdb_printf("\n");
return 0;
}
+   return KDB_NOPERM;
 }
 
 
-- 
1.7.12.3


[PATCH 6/7] kdb: Mark safe commands as KDB_SAFE and KDB_SAFE_NO_ARGS

2012-10-15 Thread Anton Vorontsov
This patch introduces two new flags: KDB_SAFE, denotes a safe command,
and KDB_SAFE_NO_ARGS, denotes a safe command when used without arguments.

The word "safe" is used here in the sense that the commands cannot be
used to leak sensitive data from memory, and cannot be used
to change program flow in a predefined manner.

These flags will be used by the "kiosk" mode, i.e. when it is possible
for an ordinary user to enter KDB (or the user can get access to
KDB after a crash), but we do not allow the user to dump
memory [and thus read some sensitive data].
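
To illustrate how the two new bits classify commands, here is a small
userspace sketch. It is not the kernel code: struct cmd_entry and
classify() are hypothetical names, and only four commands are listed,
with the flags this patch gives them in kdb_inittab().

```c
#include <stddef.h>
#include <string.h>

enum {
	KDB_REPEAT_NO_ARGS   = 0x1,
	KDB_REPEAT_WITH_ARGS = 0x2,
	KDB_SAFE             = 0x4,
	KDB_SAFE_NO_ARGS     = 0x8,
};

/* Illustrative stand-in for kdbtab_t; only the fields needed here. */
struct cmd_entry {
	const char *name;
	unsigned int flags;
};

/* A few entries with the flags assigned in this patch. */
static const struct cmd_entry cmds[] = {
	{ "mm", KDB_REPEAT_NO_ARGS },	/* Modify Memory Contents */
	{ "go", KDB_SAFE_NO_ARGS },	/* Continue Execution */
	{ "rd", 0 },			/* Display Registers */
	{ "bt", KDB_SAFE },		/* Stack traceback */
};

/* Hypothetical helper: describe how a command is classified. */
static const char *classify(const char *name)
{
	for (size_t i = 0; i < sizeof(cmds) / sizeof(cmds[0]); i++) {
		if (strcmp(cmds[i].name, name) != 0)
			continue;
		if (cmds[i].flags & KDB_SAFE)
			return "safe";
		if (cmds[i].flags & KDB_SAFE_NO_ARGS)
			return "safe without arguments";
		return "unsafe";
	}
	return "unknown";
}
```

Because the repeat and safety bits are independent, a command could in
principle carry both kinds of flag at once (for example KDB_SAFE combined
with KDB_REPEAT_NO_ARGS) without the two meanings interfering.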

The following commands were marked as "safe":

Display exception frame
Stack traceback
Display stack for process
Display stack all processes
Backtrace current process on each cpu
Execute cmd for each element in linked list
Show environment variables
Set environment variables
Display Help Message
Switch to new cpu
Display active task list
Switch to another task
Reboot the machine immediately
List loaded kernel modules
Magic SysRq key
Display syslog buffer
Define a set of commands, down to endefcmd
Summarize the system
Disable NMI entry to KDB

The following commands were marked as safe when issued with no arguments:

Continue Execution

And the following commands are unsafe:

Clear Breakpoint
Enable Breakpoint
Disable Breakpoint
Single step
Single step to branch/call
Continue Execution (with address argument)
Display Memory Contents
Display Raw Memory
Display Physical Memory
Display Memory Symbolically
Modify Memory Contents
Display Registers
Modify Registers
Backtrace process given its struct task address
Send a signal to a process
Enter kgdb mode
Display per_cpu variables

Note that we mark the "display registers" command unsafe. This is because
single stepping while constantly dumping registers in string or memory
functions can be used as a way to read sensitive data (it's actually
trivial to exploit). Later we can do a bit better, i.e. not display
general-purpose registers, but still print control registers.

Signed-off-by: Anton Vorontsov 
---
 include/linux/kdb.h |  2 ++
 kernel/debug/kdb/kdb_main.c | 44 ++--
 kernel/trace/trace_kdb.c    |  2 +-
 3 files changed, 25 insertions(+), 23 deletions(-)

diff --git a/include/linux/kdb.h b/include/linux/kdb.h
index 792779c..abe927c 100644
--- a/include/linux/kdb.h
+++ b/include/linux/kdb.h
@@ -16,6 +16,8 @@
 typedef enum {
KDB_REPEAT_NO_ARGS   = 0x1, /* Repeat the command w/o arguments */
KDB_REPEAT_WITH_ARGS = 0x2, /* Repeat the command w/ its arguments */
+   KDB_SAFE             = 0x4, /* Security-wise safe command */
+   KDB_SAFE_NO_ARGS     = 0x8, /* Only safe if run w/o arguments */
 } kdb_cmdflags_t;
 
 typedef int (*kdb_func_t)(int, const char **);
diff --git a/kernel/debug/kdb/kdb_main.c b/kernel/debug/kdb/kdb_main.c
index 172b726..83c3f60 100644
--- a/kernel/debug/kdb/kdb_main.c
+++ b/kernel/debug/kdb/kdb_main.c
@@ -2821,70 +2821,70 @@ static void __init kdb_inittab(void)
kdb_register_flags("mm", kdb_mm, " ",
  "Modify Memory Contents", 0, KDB_REPEAT_NO_ARGS);
kdb_register_flags("go", kdb_go, "[]",
- "Continue Execution", 1, 0);
+ "Continue Execution", 1, KDB_SAFE_NO_ARGS);
kdb_register_flags("rd", kdb_rd, "",
  "Display Registers", 0, 0);
kdb_register_flags("rm", kdb_rm, " ",
  "Modify Registers", 0, 0);
kdb_register_flags("ef", kdb_ef, "",
- "Display exception frame", 0, 0);
+ "Display exception frame", 0, KDB_SAFE);
kdb_register_flags("bt", kdb_bt, "[]",
- "Stack traceback", 1, 0);
+ "Stack traceback", 1, KDB_SAFE);
kdb_register_flags("btp", kdb_bt, "",
- "Display stack for process ", 0, 0);
+ "Display stack for process ", 0, KDB_SAFE);
kdb_register_flags("bta", kdb_bt, "[DRSTCZEUIMA]",
- "Display stack all processes", 0, 0);
+ "Display stack all processes", 0, KDB_SAFE);
kdb_register_flags("btc", kdb_bt, "",
- "Backtrace current process on each cpu", 0, 0);
+ "Backtrace current process on each cpu", 0, KDB_SAFE);
kdb_register_flags("btt", kdb_bt, "",
  "Backtrace process given its struct task address", 0,
0);
kdb_register_flags("ll", kdb_ll, "  ",
- "Execute cmd for each element in linked list", 0, 0);
+ "Execute cmd for each element in linked list", 0, KDB_SAFE);
kdb_register_flags("env", kdb_env, "",
- "Show environment variables", 0, 0);
+ "Show environment variables", 0, KDB_SAFE);
kdb_register_flags("set", kdb_set, "",
- 

[PATCH 4/7] kdb: Use KDB_REPEAT_* values as flags

2012-10-15 Thread Anton Vorontsov
The actual values of KDB_REPEAT_* enum values and overall logic stayed
the same, but we now treat the values as flags.

This makes it possible to add other flags and combine them, plus makes
the code a lot simpler and shorter. But functionality-wise, there should
be no changes.
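
The flag-based repeat handling that replaces the switch statement can be
sketched in isolation as follows. This is a userspace illustration only;
repeat_argc() is a hypothetical helper, whereas the kernel code updates
argc and truncates argv[] in place inside kdb_parse().

```c
enum {
	KDB_REPEAT_NONE      = 0,
	KDB_REPEAT_NO_ARGS   = 0x1,
	KDB_REPEAT_WITH_ARGS = 0x2,
};

/*
 * Hypothetical helper mirroring the new logic: given the command's flags
 * and the argc of the command just run, return the argc that is kept for
 * an auto-repeat on an empty line.
 *   0    -> nothing is repeated
 *   1    -> command name only
 *   argc -> command and all of its arguments
 */
static int repeat_argc(unsigned int flags, int argc)
{
	if (flags & KDB_REPEAT_WITH_ARGS)
		return argc;
	return (flags & KDB_REPEAT_NO_ARGS) ? 1 : 0;
}
```

Testing individual bits, rather than switching on the whole value, means
the result stays the same when unrelated bits are set in the same word,
which is what allows further flags to be added later.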

Signed-off-by: Anton Vorontsov 
---
 include/linux/kdb.h |  4 ++--
 kernel/debug/kdb/kdb_main.c | 21 +++--
 2 files changed, 9 insertions(+), 16 deletions(-)

diff --git a/include/linux/kdb.h b/include/linux/kdb.h
index 0142cd3..c6f1ec3 100644
--- a/include/linux/kdb.h
+++ b/include/linux/kdb.h
@@ -15,8 +15,8 @@
 
 typedef enum {
KDB_REPEAT_NONE = 0, /* Do not repeat this command */
-   KDB_REPEAT_NO_ARGS, /* Repeat the command without arguments */
-   KDB_REPEAT_WITH_ARGS,   /* Repeat the command including its arguments */
+   KDB_REPEAT_NO_ARGS   = 0x1, /* Repeat the command w/o arguments */
+   KDB_REPEAT_WITH_ARGS = 0x2, /* Repeat the command w/ its arguments */
 } kdb_cmdflags_t;
 
 typedef int (*kdb_func_t)(int, const char **);
diff --git a/kernel/debug/kdb/kdb_main.c b/kernel/debug/kdb/kdb_main.c
index bae9a1d..7245bab 100644
--- a/kernel/debug/kdb/kdb_main.c
+++ b/kernel/debug/kdb/kdb_main.c
@@ -992,20 +992,13 @@ int kdb_parse(const char *cmdstr)
if (result && ignore_errors && result > KDB_CMD_GO)
result = 0;
KDB_STATE_CLEAR(CMD);
-   switch (tp->cmd_flags) {
-   case KDB_REPEAT_NONE:
-   argc = 0;
-   if (argv[0])
-   *(argv[0]) = '\0';
-   break;
-   case KDB_REPEAT_NO_ARGS:
-   argc = 1;
-   if (argv[1])
-   *(argv[1]) = '\0';
-   break;
-   case KDB_REPEAT_WITH_ARGS:
-   break;
-   }
+
+   if (tp->cmd_flags & KDB_REPEAT_WITH_ARGS)
+   return result;
+
+   argc = tp->cmd_flags & KDB_REPEAT_NO_ARGS ? 1 : 0;
+   if (argv[argc])
+   *(argv[argc]) = '\0';
return result;
}
 
-- 
1.7.12.3



[PATCH 1/7] kdb: Remove currently unused kdbtab_t->cmd_flags

2012-10-15 Thread Anton Vorontsov
The struct member is never used in the code, so we can remove it.

We will introduce real flags soon by renaming cmd_repeat to cmd_flags.

Signed-off-by: Anton Vorontsov 
---
 kernel/debug/kdb/kdb_main.c    | 1 -
 kernel/debug/kdb/kdb_private.h | 1 -
 2 files changed, 2 deletions(-)

diff --git a/kernel/debug/kdb/kdb_main.c b/kernel/debug/kdb/kdb_main.c
index 4d5f8d5..cdaaa52 100644
--- a/kernel/debug/kdb/kdb_main.c
+++ b/kernel/debug/kdb/kdb_main.c
@@ -2748,7 +2748,6 @@ int kdb_register_repeat(char *cmd,
kp->cmd_func   = func;
kp->cmd_usage  = usage;
kp->cmd_help   = help;
-   kp->cmd_flags  = 0;
kp->cmd_minlen = minlen;
kp->cmd_repeat = repeat;
 
diff --git a/kernel/debug/kdb/kdb_private.h b/kernel/debug/kdb/kdb_private.h
index 392ec6a..f8245b3 100644
--- a/kernel/debug/kdb/kdb_private.h
+++ b/kernel/debug/kdb/kdb_private.h
@@ -175,7 +175,6 @@ typedef struct _kdbtab {
kdb_func_t cmd_func; /* Function to execute command */
char *cmd_usage; /* Usage String for this command */
char *cmd_help;  /* Help message for this command */
-   short cmd_flags; /* Parsing flags */
short cmd_minlen; /* Minimum legal # command
 * chars required */
kdb_repeat_t cmd_repeat; /* Does command auto repeat on enter? */
-- 
1.7.12.3



[PATCH 0/7] KDB: Kiosk (reduced capabilities) mode

2012-10-15 Thread Anton Vorontsov
Hello Jason,

Just as promised, I'm resending the series after the merge window.

This patchset implements "kiosk" mode for KDB debugger. The mode reduces
kdb features, so that it is no longer possible to leak sensitive data via
the debugger, and not possible to change program flow in a predefined
manner by an ordinary user. Root can control the capability.

There are a few patches, some are just cleanups, some are churn-ish
cleanups, but inevitable. And the rest implements the mode -- after all
the preparations, everything is pretty straightforward.

Thanks!
Anton.

--
 include/linux/kdb.h            |  20 ++--
 kernel/debug/kdb/kdb_bp.c      |  24 ++---
 kernel/debug/kdb/kdb_main.c    | 189 ++
 kernel/debug/kdb/kdb_private.h |   3 +-
 kernel/trace/trace_kdb.c       |   4 +-
 5 files changed, 125 insertions(+), 115 deletions(-)


usbutils for Mac OS X and Cygwin

2012-10-15 Thread Xiaofan Chen
Hi Greg,

Now usbutils git almost builds successfully out of the box under Mac OS X
and Cygwin (using libusbx). Just wondering if you can accept the minor
fix for Mac OS X and suggest a way to fix cygwin build.

For Cygwin, there is a conflict with Cygwin's w32api package.

DATADIR conflicts with MinGW's and Cygwin's objidl.h in their w32api
package.
http://caca.zoy.org/changeset/3404

typedef enum tag DATADIR{
DATADIR_GET=1,
DATADIR_SET
} DATADIR;

I do not know the proper fix, so I just temporarily change
objidl.h to

typedef enum tag DATADIR{
DATADIR_GET=1,
DATADIR_SET
} DATADIR1;

After that I can build usbutils.

I only need one fix for Mac OS X as Apple's gcc compiler does not like
--as-needed.

mymacmini:usbutils xiaofanc$ git diff
diff --git a/Makefile.am b/Makefile.am
index 4e53e45..e8cb002 100644
--- a/Makefile.am
+++ b/Makefile.am
@@ -1,8 +1,7 @@
 SUBDIRS = \
usbhid-dump

-AM_LDFLAGS = \
-   -Wl,--as-needed
+AM_LDFLAGS =

 data_DATA =

-- 
Xiaofan


linux-next: manual merge of the tip tree with Linus' tree

2012-10-15 Thread Stephen Rothwell
Hi all,

Today's linux-next merge of the tip tree got a conflict in
mm/huge_memory.c between commit 325adeb55e32 ("mm: huge_memory: Fix build
error") from Linus' tree and commit 39d6cb39a817 ("mm/mpol: Use special
PROT_NONE to migrate pages") from the tip tree.

I fixed it up (see below) and can carry the fix as necessary (no action
is required).

diff --cc mm/huge_memory.c
index 40f17c3,d14c8b2..000
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@@ -17,7 -17,7 +17,8 @@@
  #include 
  #include 
  #include 
 +#include 
+ #include 
  #include 
  #include 
  #include "internal.h"
@@@ -1347,59 -1428,55 +1418,54 @@@ static int __split_huge_page_map(struc
spin_lock(&mm->page_table_lock);
pmd = page_check_address_pmd(page, mm, address,
 PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG);
-   if (pmd) {
-   pgtable = pgtable_trans_huge_withdraw(mm);
-   pmd_populate(mm, &_pmd, pgtable);
- 
-   haddr = address;
-   for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
-   pte_t *pte, entry;
-   BUG_ON(PageCompound(page+i));
-   entry = mk_pte(page + i, vma->vm_page_prot);
-   entry = maybe_mkwrite(pte_mkdirty(entry), vma);
-   if (!pmd_write(*pmd))
-   entry = pte_wrprotect(entry);
-   else
-   BUG_ON(page_mapcount(page) != 1);
-   if (!pmd_young(*pmd))
-   entry = pte_mkold(entry);
-   pte = pte_offset_map(&_pmd, haddr);
-   BUG_ON(!pte_none(*pte));
-   set_pte_at(mm, haddr, pte, entry);
-   pte_unmap(pte);
-   }
+   if (!pmd)
+   goto unlock;
  
-   smp_wmb(); /* make pte visible before pmd */
-   /*
-* Up to this point the pmd is present and huge and
-* userland has the whole access to the hugepage
-* during the split (which happens in place). If we
-* overwrite the pmd with the not-huge version
-* pointing to the pte here (which of course we could
-* if all CPUs were bug free), userland could trigger
-* a small page size TLB miss on the small sized TLB
-* while the hugepage TLB entry is still established
-* in the huge TLB. Some CPU doesn't like that. See
-* http://support.amd.com/us/Processor_TechDocs/41322.pdf,
-* Erratum 383 on page 93. Intel should be safe but is
-* also warns that it's only safe if the permission
-* and cache attributes of the two entries loaded in
-* the two TLB is identical (which should be the case
-* here). But it is generally safer to never allow
-* small and huge TLB entries for the same virtual
-* address to be loaded simultaneously. So instead of
-* doing "pmd_populate(); flush_tlb_range();" we first
-* mark the current pmd notpresent (atomically because
-* here the pmd_trans_huge and pmd_trans_splitting
-* must remain set at all times on the pmd until the
-* split is complete for this pmd), then we flush the
-* SMP TLB and finally we write the non-huge version
-* of the pmd entry with pmd_populate.
-*/
-   pmdp_invalidate(vma, address, pmd);
-   pmd_populate(mm, pmd, pgtable);
-   ret = 1;
+   prot = pmd_pgprot(*pmd);
 -  pgtable = get_pmd_huge_pte(mm);
++  pgtable = pgtable_trans_huge_withdraw(mm);
+   pmd_populate(mm, &_pmd, pgtable);
+ 
+   for (i = 0, haddr = address; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
+   pte_t *pte, entry;
+ 
+   BUG_ON(PageCompound(page+i));
+   entry = mk_pte(page + i, prot);
+   entry = pte_mkdirty(entry);
+   if (!pmd_young(*pmd))
+   entry = pte_mkold(entry);
+   pte = pte_offset_map(&_pmd, haddr);
+   BUG_ON(!pte_none(*pte));
+   set_pte_at(mm, haddr, pte, entry);
+   pte_unmap(pte);
}
+ 
+   smp_wmb(); /* make ptes visible before pmd, see __pte_alloc */
+   /*
+* Up to this point the pmd is present and huge.
+*
+* If we overwrite the pmd with the not-huge version, we could trigger
+* a small page size TLB miss on the small sized TLB while the hugepage
+* TLB entry is still established in the huge TLB.
+*
+* Some CPUs don't like that. See
+* http://support.amd.com/us/Processor_TechDocs/41322.pdf, Erratum 383
+* on page 93.
+*
+ 

Re: [PATCH] doc: describe memcg swappiness more precisely memory.swappiness==0

2012-10-15 Thread David Rientjes
On Tue, 16 Oct 2012, Michal Hocko wrote:

> And a follow-up for the memcg.swappiness documentation which is more
> specific about the meaning of swappiness==0.
> ---
> From 1bc3a94fea728107ed108edd42df464b908cd067 Mon Sep 17 00:00:00 2001
> From: Michal Hocko 
> Date: Mon, 15 Oct 2012 11:43:56 +0200
> Subject: [PATCH] doc: describe memcg swappiness more precisely
> 
> Since fe35004f (mm: avoid swapping out with swappiness==0), memcg reclaim
> has stopped swapping out anon pages completely when the value 0 is used.
> Although this is somewhat expected, it was not done this way for a really
> long time, so it is probably better to be explicit about the effect.
> Moreover, global reclaim swaps out even when swappiness is 0, to avoid
> invoking the OOM killer.
> 
> Signed-off-by: Michal Hocko 

Acked-by: David Rientjes 


Re: [PATCH 1/2] SLUB: remove hard coded magic numbers from resiliency_test

2012-10-15 Thread David Rientjes
On Mon, 15 Oct 2012, Christoph Lameter wrote:

> > Use the always inlined function kmalloc_index to translate
> > sizes to indexes, so that we don't have to have the slab indexes
> > hard coded in two places.
> 
> Acked-by: Christoph Lameter 
> 

Shouldn't this be using get_slab() instead?
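
For context, the size-to-index translation mentioned in the quoted patch
description amounts to rounding the size up to a power of two and taking
its log2. The sketch below is a simplified userspace model, not the
kernel's kmalloc_index(): size_to_index() is a hypothetical name, and it
ignores both the non-power-of-two caches (96 and 192 bytes) and the
minimum cache size.

```c
#include <stddef.h>

/*
 * Hypothetical, simplified model: return the smallest index such that
 * (1 << index) >= size, or -1 for a zero-sized request.
 */
static int size_to_index(size_t size)
{
	const int max_index = (int)(sizeof(size_t) * 8) - 1;
	int index = 0;

	if (size == 0)
		return -1;
	/* Stop at max_index so the shift below never overflows. */
	while (index < max_index && ((size_t)1 << index) < size)
		index++;
	return index;
}
```

In this model a 16-byte request maps to index 4 and a 17-byte request to
index 5, so a test can derive the index from the size instead of hard
coding it in two places.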


Re: [PATCH] doc: describe memcg swappiness more precisely memory.swappiness==0

2012-10-15 Thread Kamezawa Hiroyuki

(2012/10/16 7:07), Michal Hocko wrote:

And a follow-up for the memcg.swappiness documentation which is more
specific about the meaning of swappiness==0.
---
 From 1bc3a94fea728107ed108edd42df464b908cd067 Mon Sep 17 00:00:00 2001
From: Michal Hocko 
Date: Mon, 15 Oct 2012 11:43:56 +0200
Subject: [PATCH] doc: describe memcg swappiness more precisely

Since fe35004f (mm: avoid swapping out with swappiness==0), memcg reclaim
has stopped swapping out anon pages completely when the value 0 is used.
Although this is somewhat expected, it was not done this way for a really
long time, so it is probably better to be explicit about the effect.
Moreover, global reclaim swaps out even when swappiness is 0, to avoid
invoking the OOM killer.

Signed-off-by: Michal Hocko 


Nice :)
Acked-by: KAMEZAWA Hiroyuki 


---
  Documentation/cgroups/memory.txt |4 
  1 file changed, 4 insertions(+)

diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index c07f7b4..71c4da4 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -466,6 +466,10 @@ Note:
  5.3 swappiness

  Similar to /proc/sys/vm/swappiness, but affecting a hierarchy of groups only.
+Please note that unlike the global swappiness, memcg knob set to 0
+really prevents from any swapping even if there is a swap storage
+available. This might lead to memcg OOM killer if there are no file
+pages to reclaim.

  Following cgroups' swappiness can't be changed.
  - root cgroup (uses /proc/sys/vm/swappiness).
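
The difference the new documentation text describes can be summarised in
a small decision function. This is a deliberately simplified userspace
model with hypothetical names, not the kernel's reclaim logic; in
particular, file_pages_left merely stands in for "there are still file
pages to reclaim", and swap availability is not modelled.

```c
#include <stdbool.h>

enum reclaim_kind { GLOBAL_RECLAIM, MEMCG_RECLAIM };

/* Returns true if, in this model, reclaim may swap out anonymous pages. */
static bool may_swap_anon(enum reclaim_kind kind, int swappiness,
			  bool file_pages_left)
{
	if (swappiness > 0)
		return true;
	if (kind == MEMCG_RECLAIM)
		return false;	/* memcg: 0 means no swapping at all */
	/* global: swap as a last resort rather than invoke the OOM killer */
	return !file_pages_left;
}
```

In this model a memcg with swappiness 0 and no file pages left has
nothing to reclaim, which is the situation in which the documentation
warns that the memcg OOM killer may be invoked.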






Re: [PATCH RFC] mm/swap: automatic tuning for swapin readahead

2012-10-15 Thread Shaohua Li
On Mon, Oct 08, 2012 at 03:09:58PM -0700, Hugh Dickins wrote:
> On Thu, 4 Oct 2012, Konstantin Khlebnikov wrote:
> 
> > Here are the results of my test. The workload isn't very realistic, but at
> > least it's threaded: compiling linux-3.6 with defconfig in 16 threads on
> > tmpfs, 512 MB RAM, dual-core CPU, ordinary hard disk. (test script in attachment)
> > 
> > average results for ten runs:
> > 
> >             RA=3     RA=0     RA=1     RA=2     RA=4     Hugh     Shaohua
> > real time   500      542      528      519      500      523      522
> > user time   738      737      735      737      739      737      739
> > sys time    93       93       91       92       96       92       93
> > pgmajfault  62918    110533   92454    78221    54342    86601    77229
> > pgpgin      2070372  795228   1034046  1471010  3177192  1154532  1599388
> > pgpgout     2597278  2022037  2110020  2350380  2802670  2286671  2526570
> > pswpin      462747   138873   202148   310969   739431   232710   341320
> > pswpout     646363   502599   524613   584731   697797   568784   628677
> > 
> > So, the last two columns show mostly equal results: +4.6% and +4.4% in
> > comparison to the vanilla kernel with RA=3, but your version shows more
> > stable results (std-error 2.7% against 4.8%) (all these numbers are in the
> > huge table in the attachment)
> 
> Thanks for doing this, Konstantin, but I'm stuck for anything much to say!
> Shaohua and I are both about 4.5% bad for this particular test, but I'm
> more consistently bad - hurrah!
> 
> I suspect (not a convincing argument) that if the test were just slightly
> different (a little more or a little less memory, SSD instead of hard
> disk, diskcache instead of tmpfs), then it would come out differently.
> 
> Did you draw any conclusions from the numbers you found?
> 
> I haven't done any more on this in the last few days, except to verify
> that once an anon_vma is judged random with Shaohua's, then it appears
> to be condemned to no-readahead ever after.
> 
> That's probably something that a hack like I had in mine would fix,
> but that addition might change its balance further (and increase vma
> or anon_vma size) - not tried yet.
> 
> All I want to do right now, is suggest to Andrew that he hold Shaohua's
> patch back from 3.7 for the moment: I'll send a response to Sep 7th's
> mm-commits mail to suggest that - but no great disaster if he ignores me.

OK, I tested Hugh's patch. My test is a multithreaded random write workload.
With Hugh's patch, 49:28.06 elapsed
With mine, 43:23.39 elapsed
There is 12% more time used with Hugh's patch.

In the steady state of this workload, the SI:SO (swap-in to swap-out) ratio
should be roughly 1:1. With Hugh's patch it is around 1.6:1, so there is
still unnecessary swapin.

I also tried a workload with sequential/random writes mixed; Hugh's patch is
10% worse there too.
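
As a cross-check of the two elapsed times quoted above, here is a minimal sketch of the arithmetic. The helper names are made up for illustration. The gap comes out at roughly 14% when measured against the faster run and roughly 12% when measured against the slower run, so the 12% figure appears to correspond to the second reading.

```c
#include <assert.h>
#include <stdio.h>

/* Convert a "minutes:seconds" elapsed time, as printed by time(1), to seconds. */
static double elapsed_seconds(int minutes, double seconds)
{
	assert(minutes >= 0 && seconds >= 0.0 && seconds < 60.0);
	return minutes * 60.0 + seconds;
}

/* Percentage by which `slow` exceeds `fast`, relative to `fast`. */
static double percent_more(double slow, double fast)
{
	return (slow - fast) / fast * 100.0;
}

/* Percentage of `slow` that is saved by finishing in `fast` instead. */
static double percent_saved(double slow, double fast)
{
	return (slow - fast) / slow * 100.0;
}

/* Print both views of the two runs quoted in this mail. */
static void report_runs(void)
{
	double slower = elapsed_seconds(49, 28.06);	/* run with Hugh's patch */
	double faster = elapsed_seconds(43, 23.39);	/* run with Shaohua's patch */

	printf("extra time relative to the faster run: %.1f%%\n",
	       percent_more(slower, faster));
	printf("time saved relative to the slower run: %.1f%%\n",
	       percent_saved(slower, faster));
}
```

Calling report_runs() from a main() prints both percentages; which one to quote depends on which run is taken as the baseline.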

Thanks,
Shaohua
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 0/2] DMA-mapping & IOMMU - physically contiguous allocations

2012-10-15 Thread Inki Dae
2012/10/15 Marek Szyprowski :
> Hello,
>
> Some devices that have an IOMMU might, for some use cases, need to
> allocate DMA buffers that are contiguous in physical memory. Such
> use cases appear, for example, in the DRM subsystem when one wants to
> improve performance or use secure buffer protection.
>
> I would like to ask whether adding a new attribute, as proposed in this
> RFC, is a good idea. I feel that it might be an attribute just for a single
> driver, but I would like to know your opinion. Should we look for another
> solution?
>

In addition, we have been working on dma-mapping-based IOMMU support
for the Exynos DRM driver on top of this patch set, so the patch set has
been tested with the IOMMU-enabled Exynos DRM driver and worked fine.
This feature is actually needed for secure modes such as TrustZone. On
Exynos SoCs (and possibly OMAP as well), the memory region for secure
mode must be physically contiguous, but the dma-mapping framework
currently does not guarantee physically contiguous allocations, so this
patch set would make that possible.

Tested-by: Inki Dae 
Reviewed-by: Inki Dae 

Thanks,
Inki Dae

> Best regards
> --
> Marek Szyprowski
> Samsung Poland R&D Center
>
>
> Marek Szyprowski (2):
>   common: DMA-mapping: add DMA_ATTR_FORCE_CONTIGUOUS attribute
>   ARM: dma-mapping: add support for DMA_ATTR_FORCE_CONTIGUOUS attribute
>
>  Documentation/DMA-attributes.txt |    9 +
>  arch/arm/mm/dma-mapping.c        |   41 ++
>  include/linux/dma-attrs.h        |    1 +
>  3 files changed, 43 insertions(+), 8 deletions(-)
>
> --
> 1.7.9.5
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majord...@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .


Re: mpol_to_str revisited.

2012-10-15 Thread David Rientjes
On Mon, 8 Oct 2012, Dave Jones wrote:

>  > > diff -durpN '--exclude-from=/home/davej/.exclude' src/git-trees/kernel/linux/fs/proc/task_mmu.c linux-dj/fs/proc/task_mmu.c
>  > > --- src/git-trees/kernel/linux/fs/proc/task_mmu.c	2012-05-31 22:32:46.778150675 -0400
>  > > +++ linux-dj/fs/proc/task_mmu.c	2012-10-04 19:31:41.269988984 -0400
>  > > @@ -1162,6 +1162,7 @@ static int show_numa_map(struct seq_file
>  > >  	struct mm_walk walk = {};
>  > >  	struct mempolicy *pol;
>  > >  	int n;
>  > > +	int ret;
>  > >  	char buffer[50];
>  > >  
>  > >  	if (!mm)
>  > > @@ -1178,7 +1179,11 @@ static int show_numa_map(struct seq_file
>  > >  	walk.mm = mm;
>  > >  
>  > >  	pol = get_vma_policy(proc_priv->task, vma, vma->vm_start);
>  > > -	mpol_to_str(buffer, sizeof(buffer), pol, 0);
>  > > +	memset(buffer, 0, sizeof(buffer));
>  > > +	ret = mpol_to_str(buffer, sizeof(buffer), pol, 0);
>  > > +	if (ret < 0)
>  > > +		return 0;
>  > 
>  > We need an mpol_cond_put(pol) here before returning.
> 
> good catch. I'll respin the patch later with this changed.
> 

Did you get a chance to fix this issue?
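
For reference, the fix being asked for is to drop the policy reference on the new early-return path. Below is a minimal user-space sketch of that pattern. Everything named fake_* or *_sketch is a stand-in invented for illustration and is not the kernel API; in particular, the real mpol_cond_put() only drops a reference for shared policies, whereas this mock always counts.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for struct mempolicy: only the reference count matters here. */
struct fake_policy {
	int refcnt;
};

/* Stand-in for get_vma_policy(): hands out a counted reference. */
static struct fake_policy *fake_get_policy(struct fake_policy *pol)
{
	pol->refcnt++;
	return pol;
}

/* Stand-in for mpol_cond_put(): drops the reference taken above. */
static void fake_put_policy(struct fake_policy *pol)
{
	assert(pol->refcnt > 0);
	pol->refcnt--;
}

/* Stand-in for mpol_to_str(): fails when the buffer is too small. */
static int fake_policy_to_str(char *buf, size_t len, const struct fake_policy *pol)
{
	(void)pol;
	if (len < 8)
		return -1;
	buf[0] = '\0';
	return 0;
}

/* Shape of the fixed error path: release the reference before bailing out. */
static int show_map_sketch(struct fake_policy *shared, char *buf, size_t len)
{
	struct fake_policy *pol = fake_get_policy(shared);
	int ret = fake_policy_to_str(buf, len, pol);

	if (ret < 0) {
		fake_put_policy(pol);	/* the call the review asks for */
		return 0;
	}

	/* ... the rest of the output would be produced here ... */
	fake_put_policy(pol);
	return 0;
}
```

With the put on the error path, the reference count is the same before and after the call whether or not the string conversion fails; without it, every failed conversion would leak one reference.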


Re: [PATCH v4] slab: Ignore internal flags in cache creation

2012-10-15 Thread David Rientjes
On Mon, 8 Oct 2012, David Rientjes wrote:

> > diff --git a/mm/slab.h b/mm/slab.h
> > index 7deeb44..4c35c17 100644
> > --- a/mm/slab.h
> > +++ b/mm/slab.h
> > @@ -45,6 +45,31 @@ static inline struct kmem_cache *__kmem_cache_alias(const char *name, size_t siz
> >  #endif
> >  
> >  
> > +/* Legal flag mask for kmem_cache_create(), for various configurations */
> > +#define SLAB_CORE_FLAGS (SLAB_HWCACHE_ALIGN | SLAB_CACHE_DMA | SLAB_PANIC | \
> > +			 SLAB_DESTROY_BY_RCU | SLAB_DEBUG_OBJECTS )
> > +
> > +#if defined(CONFIG_DEBUG_SLAB)
> > +#define SLAB_DEBUG_FLAGS (SLAB_RED_ZONE | SLAB_POISON | SLAB_STORE_USER)
> > +#elif defined(CONFIG_SLUB_DEBUG)
> > +#define SLAB_DEBUG_FLAGS (SLAB_RED_ZONE | SLAB_POISON | SLAB_STORE_USER | \
> > +			  SLAB_TRACE | SLAB_DEBUG_FREE)
> > +#else
> > +#define SLAB_DEBUG_FLAGS (0)
> > +#endif
> > +
> > +#if defined(CONFIG_SLAB)
> > +#define SLAB_CACHE_FLAGS (SLAB_MEMSPREAD | SLAB_NOLEAKTRACE | \
> 
> s/SLAB_MEMSPREAD/SLAB_MEM_SPREAD/
> 

Did you have a v5 of this patch with the above fix?
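
The subject line describes ignoring internal flags at cache creation. One way to read that is masking the caller's flags against a legal mask like the ones quoted above. Here is a minimal user-space sketch of that idea, assuming disallowed bits are silently dropped rather than rejected. The DEMO_* names and bit values are made up for illustration and are not the kernel's SLAB_* constants.

```c
#include <assert.h>

/* Hypothetical bit values; the real SLAB_* constants live in kernel headers. */
#define DEMO_HWCACHE_ALIGN	0x01u	/* caller-visible flag */
#define DEMO_CACHE_DMA		0x02u	/* caller-visible flag */
#define DEMO_PANIC		0x04u	/* caller-visible flag */
#define DEMO_INTERNAL		0x80u	/* allocator-internal flag */

/* Mask of flags a caller may pass, built the same way as SLAB_CORE_FLAGS. */
#define DEMO_ALLOWED_FLAGS	(DEMO_HWCACHE_ALIGN | DEMO_CACHE_DMA | DEMO_PANIC)

/* Drop every bit outside the allowed mask instead of failing the request. */
static unsigned int demo_sanitize_flags(unsigned int flags)
{
	unsigned int clean = flags & DEMO_ALLOWED_FLAGS;

	/* Nothing outside the allowed mask may survive the masking. */
	assert((clean & ~DEMO_ALLOWED_FLAGS) == 0);
	return clean;
}
```

A misspelled constant in such a mask (the SLAB_MEMSPREAD case above) is caught at compile time, since the undefined name does not resolve.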

