Re: [PATCH 3/3] smp/ipi: Remove check around csd lock in handler for smp_call_function variants

2013-07-05 Thread Wang YanQing
On Fri, Jul 05, 2013 at 09:57:21PM +0530, Preeti U Murthy wrote:
> call_single_data is always locked by all callers of
> arch_send_call_function_single_ipi() or
> arch_send_call_function_ipi_mask() which results in execution of
> generic_call_function_interrupt() handler.
> 
> Hence remove the check for lock on csd in generic_call_function_interrupt()
> handler, before unlocking it.

I can't find where generic_call_function_interrupt is :)

> Signed-off-by: Preeti U Murthy 
> Cc: Peter Zijlstra 
> Cc: Ingo Molnar 
> Cc: Xiao Guangrong 
> Cc: srivatsa.b...@linux.vnet.ibm.com
> Cc: Paul E. McKenney 
> Cc: Steven Rostedt 
> Cc: Rusty Russell 
> ---
> 
>  kernel/smp.c |   14 +-
>  1 file changed, 1 insertion(+), 13 deletions(-)
> 
> diff --git a/kernel/smp.c b/kernel/smp.c
> index b6981ae..d37581a 100644
> --- a/kernel/smp.c
> +++ b/kernel/smp.c
> @@ -181,25 +181,13 @@ void generic_smp_call_function_single_interrupt(void)
>  
>   while (!list_empty(&list)) {
>   struct call_single_data *csd;
> - unsigned int csd_flags;
>  
>   csd = list_entry(list.next, struct call_single_data, list);
>   list_del(&csd->list);
>  
> - /*
> -  * 'csd' can be invalid after this call if flags == 0
> -  * (when called through generic_exec_single()),
> -  * so save them away before making the call:
> -  */
> - csd_flags = csd->flags;
> -

You haven't mentioned this change in the ChangeLog, so don't make it.
I can't see any harm in removing csd_flags, but I hope others
check it again.

>   csd->func(csd->info);
>  
> - /*
> -  * Unlocked CSDs are valid through generic_exec_single():
> -  */
> - if (csd_flags & CSD_FLAG_LOCK)
> - csd_unlock(csd);
> + csd_unlock(csd);

I don't like this change. I think checking CSD_FLAG_LOCK
to make sure we really need csd_unlock() is good.

Since you can't know who will use the API or how,
some robust checking code is good.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] i915: Don't provide ACPI backlight interface if firmware expects Windows 8

2013-07-05 Thread Aaron Lu
On 07/06/2013 06:23 AM, Rafael J. Wysocki wrote:
> On Friday, July 05, 2013 11:40:02 PM Rafael J. Wysocki wrote:
>> On Friday, July 05, 2013 10:00:55 PM Rafael J. Wysocki wrote:
>>> On Friday, July 05, 2013 02:20:14 PM Rafael J. Wysocki wrote:
 On Sunday, June 09, 2013 07:01:39 PM Matthew Garrett wrote:
> Windows 8 leaves backlight control up to individual graphics drivers 
> rather
> than making ACPI calls itself. There's plenty of evidence to suggest that
> the Intel driver for Windows doesn't use the ACPI interface, including the
> fact that it's broken on a bunch of machines when the OS claims to support
> Windows 8. The simplest thing to do appears to be to disable the ACPI
> backlight interface on these systems.
>
> Signed-off-by: Matthew Garrett 
> ---
>  drivers/gpu/drm/i915/i915_dma.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/drivers/gpu/drm/i915/i915_dma.c 
> b/drivers/gpu/drm/i915/i915_dma.c
> index 3b315ba..23b6292 100644
> --- a/drivers/gpu/drm/i915/i915_dma.c
> +++ b/drivers/gpu/drm/i915/i915_dma.c
> @@ -1661,6 +1661,9 @@ int i915_driver_load(struct drm_device *dev, 
> unsigned long flags)
>   /* Must be done after probing outputs */
>   intel_opregion_init(dev);
>   acpi_video_register();
> + /* Don't use ACPI backlight functions on Windows 8 platforms */
> + if (acpi_osi_version() >= ACPI_OSI_WIN_8)
> + acpi_video_backlight_unregister();
>   }
>  
>   if (IS_GEN5(dev))
>

 Well, this causes build failures to happen when the ACPI video driver is
 modular and the graphics driver is not.

 I'm not sure how to resolve that, so suggestions are welcome.
>>>
>>> Actually, that happened with the radeon patch.
>>>
>>> That said, ACPI_OSI_WIN_8 doesn't make much sense for !CONFIG_ACPI, for
>>> example.
>>>
>>> What about making acpi_video_register() do the quirk instead?  We could add 
>>> an
>>> argument to it indicating whether or not quirks should be applied.
>>
>> Actually, I wonder about the appended patch (on top of Aaron's
>> https://patchwork.kernel.org/patch/2812951/) instead of [1-3/3] from this
>> series.
> 
> Or even something as simple as this one.
> 
> ---
>  drivers/acpi/video_detect.c |3 +++
>  1 file changed, 3 insertions(+)
> 
> Index: linux-pm/drivers/acpi/video_detect.c
> ===================================================================
> --- linux-pm.orig/drivers/acpi/video_detect.c
> +++ linux-pm/drivers/acpi/video_detect.c
> @@ -203,6 +203,9 @@ long acpi_video_get_capabilities(acpi_ha
>*/
>  
>   dmi_check_system(video_detect_dmi_table);
> +
> + if (acpi_gbl_osi_data >= ACPI_OSI_WIN_8)
> + acpi_video_support |= ACPI_VIDEO_BACKLIGHT_FORCE_VENDOR;

Then the vendor driver (thinkpad_acpi) will step in and create a backlight
interface for the system, which, unfortunately, is also broken on those
win8 thinkpads.

So we will also need to do something in thinkpad_acpi so that it does not
create a backlight interface on these systems.

This actually doesn't feel bad to me, since the modules are blacklisting
their own interfaces. The downside, of course, is that two pieces of quirk
code exist.

BTW, unregistering ACPI video's backlight interface in the GPU driver doesn't
have this problem, since it makes the platform driver think the ACPI video
driver will control the backlight, and then uses the newly added API to
remove the backlight interface that ACPI video created.

Thanks,
Aaron

>   } else {
>   status = acpi_bus_get_device(graphics_handle, _dev);
>   if (ACPI_FAILURE(status)) {
> 



[PATCH] ARM: pxa: remove IRQF_DISABLED

2013-07-05 Thread Michael Opdenacker
This flag is a NOOP since 2.6.35 and can be removed.

Signed-off-by: Michael Opdenacker 
---
 arch/arm/mach-pxa/am200epd.c | 3 +--
 arch/arm/mach-pxa/am300epd.c | 3 +--
 arch/arm/mach-pxa/em-x270.c  | 3 +--
 arch/arm/mach-pxa/magician.c | 2 +-
 arch/arm/mach-pxa/mainstone.c| 2 +-
 arch/arm/mach-pxa/pcm990-baseboard.c | 2 +-
 arch/arm/mach-pxa/sharpsl_pm.c   | 8 
 arch/arm/mach-pxa/time.c | 2 +-
 arch/arm/mach-pxa/trizeps4.c | 3 +--
 arch/arm/plat-pxa/dma.c  | 2 +-
 10 files changed, 13 insertions(+), 17 deletions(-)

diff --git a/arch/arm/mach-pxa/am200epd.c b/arch/arm/mach-pxa/am200epd.c
index ffa6d81..12fb0f4 100644
--- a/arch/arm/mach-pxa/am200epd.c
+++ b/arch/arm/mach-pxa/am200epd.c
@@ -293,8 +293,7 @@ static int am200_setup_irq(struct fb_info *info)
int ret;
 
ret = request_irq(PXA_GPIO_TO_IRQ(RDY_GPIO_PIN), am200_handle_irq,
-   IRQF_DISABLED|IRQF_TRIGGER_FALLING,
-   "AM200", info->par);
+   IRQF_TRIGGER_FALLING, "AM200", info->par);
if (ret)
dev_err(&am200_device->dev, "request_irq failed: %d\n", ret);
 
diff --git a/arch/arm/mach-pxa/am300epd.c b/arch/arm/mach-pxa/am300epd.c
index 3dfec1e..c9f309a 100644
--- a/arch/arm/mach-pxa/am300epd.c
+++ b/arch/arm/mach-pxa/am300epd.c
@@ -241,8 +241,7 @@ static int am300_setup_irq(struct fb_info *info)
struct broadsheetfb_par *par = info->par;
 
ret = request_irq(PXA_GPIO_TO_IRQ(RDY_GPIO_PIN), am300_handle_irq,
-   IRQF_DISABLED|IRQF_TRIGGER_RISING,
-   "AM300", par);
+   IRQF_TRIGGER_RISING, "AM300", par);
if (ret)
dev_err(&am300_device->dev, "request_irq failed: %d\n", ret);
 
diff --git a/arch/arm/mach-pxa/em-x270.c b/arch/arm/mach-pxa/em-x270.c
index f6726bb..86936d9 100644
--- a/arch/arm/mach-pxa/em-x270.c
+++ b/arch/arm/mach-pxa/em-x270.c
@@ -556,8 +556,7 @@ static int em_x270_mci_init(struct device *dev,
}
 
err = request_irq(gpio_to_irq(mmc_cd), em_x270_detect_int,
- IRQF_DISABLED | IRQF_TRIGGER_RISING |
- IRQF_TRIGGER_FALLING,
+ IRQF_TRIGGER_RISING | IRQF_TRIGGER_FALLING,
  "MMC card detect", data);
if (err) {
dev_err(dev, "can't request MMC card detect IRQ: %d\n", err);
diff --git a/arch/arm/mach-pxa/magician.c b/arch/arm/mach-pxa/magician.c
index f44532f..38f544d 100644
--- a/arch/arm/mach-pxa/magician.c
+++ b/arch/arm/mach-pxa/magician.c
@@ -633,7 +633,7 @@ static struct platform_device bq24022 = {
 static int magician_mci_init(struct device *dev,
irq_handler_t detect_irq, void *data)
 {
-   return request_irq(IRQ_MAGICIAN_SD, detect_irq, IRQF_DISABLED,
+   return request_irq(IRQ_MAGICIAN_SD, detect_irq, 0,
   "mmc card detect", data);
 }
 
diff --git a/arch/arm/mach-pxa/mainstone.c b/arch/arm/mach-pxa/mainstone.c
index d2c6523..1184efa 100644
--- a/arch/arm/mach-pxa/mainstone.c
+++ b/arch/arm/mach-pxa/mainstone.c
@@ -400,7 +400,7 @@ static int mainstone_mci_init(struct device *dev, 
irq_handler_t mstone_detect_in
 */
MST_MSCWR1 &= ~MST_MSCWR1_MS_SEL;
 
-   err = request_irq(MAINSTONE_MMC_IRQ, mstone_detect_int, IRQF_DISABLED,
+   err = request_irq(MAINSTONE_MMC_IRQ, mstone_detect_int, 0,
 "MMC card detect", data);
if (err)
printk(KERN_ERR "mainstone_mci_init: MMC/SD: can't request MMC 
card detect IRQ\n");
diff --git a/arch/arm/mach-pxa/pcm990-baseboard.c 
b/arch/arm/mach-pxa/pcm990-baseboard.c
index fb7f1d1..33f058f 100644
--- a/arch/arm/mach-pxa/pcm990-baseboard.c
+++ b/arch/arm/mach-pxa/pcm990-baseboard.c
@@ -326,7 +326,7 @@ static int pcm990_mci_init(struct device *dev, 
irq_handler_t mci_detect_int,
 {
int err;
 
-   err = request_irq(PCM027_MMCDET_IRQ, mci_detect_int, IRQF_DISABLED,
+   err = request_irq(PCM027_MMCDET_IRQ, mci_detect_int, 0,
 "MMC card detect", data);
if (err)
printk(KERN_ERR "pcm990_mci_init: MMC/SD: can't request MMC "
diff --git a/arch/arm/mach-pxa/sharpsl_pm.c b/arch/arm/mach-pxa/sharpsl_pm.c
index 0a36d35..051a655 100644
--- a/arch/arm/mach-pxa/sharpsl_pm.c
+++ b/arch/arm/mach-pxa/sharpsl_pm.c
@@ -860,18 +860,18 @@ static int sharpsl_pm_probe(struct platform_device *pdev)
 
/* Register interrupt handlers */
irq = gpio_to_irq(sharpsl_pm.machinfo->gpio_acin);
-   if (request_irq(irq, sharpsl_ac_isr, IRQF_DISABLED | IRQF_TRIGGER_RISING | IRQF_TRIGGER_FALLING, "AC Input Detect", sharpsl_ac_isr)) {
+   if (request_irq(irq, sharpsl_ac_isr, IRQF_TRIGGER_RISING | IRQF_TRIGGER_FALLING, "AC Input Detect", sharpsl_ac_isr)) {
  

Re: [PATCH 1/3] smp/ipi: Remove redundant cfd->cpumask_ipi mask

2013-07-05 Thread Preeti U Murthy
Hi Wang,

On 07/06/2013 08:43 AM, Wang YanQing wrote:
> On Fri, Jul 05, 2013 at 09:57:01PM +0530, Preeti U Murthy wrote:
>> cfd->cpumask_ipi is used only in smp_call_function_many(). The existing
>> comment around it says that this additional mask is used because
>> cfd->cpumask can get overwritten.
>>
>> There is no reason why the cfd->cpumask can be overwritten, since this
>> is a per_cpu mask; nobody can change it but us and we are
>> called with preemption disabled.
> 
> The ChangeLog for f44310b98ddb7f0d06550d73ed67df5865e3eda5,
> which introduced cfd->cpumask_ipi, said the reason why we
> need it:
> 
> "As explained by Linus as well:
> 
>  |
>  | Once we've done the "list_add_rcu()" to add it to the
>  | queue, we can have (another) IPI to the target CPU that can
>  | now see it and clear the mask.
>  |
>  | So by the time we get to actually send the IPI, the mask might
>  | have been cleared by another IPI.

I am unable to understand where the cfd->cpumask of the source cpu is
getting cleared. Surely not by itself, since it runs with preemption
disabled. Also, why should it get cleared?

The idea behind clearing a source CPU's cfd->cpumask AFAICS, could be
that the source cpu should not send an IPI to the target if the target
has already received an IPI from another CPU. The reason being that the
target would execute the already queued csds, hence would not need
another IPI to see its queue.

If the above is the intention of clearing the source cpu's cfd->cpumask,
why is the mechanism not consistent with what happens in
generic_exec_single(), wherein an IPI is sent only if there are no
previously queued csds on the target?

Also why is it that in the wait condition under
smp_call_function_many(), cfd->cpumask continues to be used and not
cfd->cpumask_ipi ?

>  |
> 
> This patch also fixes a system hang problem, if the data->cpumask
> gets cleared after passing this point:
> 
> if (WARN_ONCE(!mask, "empty IPI mask"))
> return;
> 
> then the problem in commit 83d349f35e1a ("x86: don't send an IPI to
> the empty set of CPU's") will happen again.
> "
> So this patch is wrong.
> 
> And you should cc linus and Jan Beulich who give acked-by tag to
> the commit.
> 
> Thanks.
> 

Thank you

Regards
Preeti U Murthy



[PATCH] ARM: at91: remove IRQF_DISABLED

2013-07-05 Thread Michael Opdenacker
This flag is a NOOP since 2.6.35 and can be removed.

Signed-off-by: Michael Opdenacker 
---
 arch/arm/mach-at91/at91rm9200_time.c  | 2 +-
 arch/arm/mach-at91/at91sam926x_time.c | 2 +-
 arch/arm/mach-at91/at91x40_time.c | 2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/arm/mach-at91/at91rm9200_time.c 
b/arch/arm/mach-at91/at91rm9200_time.c
index 180b302..f607deb 100644
--- a/arch/arm/mach-at91/at91rm9200_time.c
+++ b/arch/arm/mach-at91/at91rm9200_time.c
@@ -93,7 +93,7 @@ static irqreturn_t at91rm9200_timer_interrupt(int irq, void 
*dev_id)
 
 static struct irqaction at91rm9200_timer_irq = {
.name   = "at91_tick",
-   .flags  = IRQF_SHARED | IRQF_DISABLED | IRQF_TIMER | IRQF_IRQPOLL,
+   .flags  = IRQF_SHARED | IRQF_TIMER | IRQF_IRQPOLL,
.handler= at91rm9200_timer_interrupt,
.irq= NR_IRQS_LEGACY + AT91_ID_SYS,
 };
diff --git a/arch/arm/mach-at91/at91sam926x_time.c 
b/arch/arm/mach-at91/at91sam926x_time.c
index 3a4bc2e..bb39232 100644
--- a/arch/arm/mach-at91/at91sam926x_time.c
+++ b/arch/arm/mach-at91/at91sam926x_time.c
@@ -171,7 +171,7 @@ static irqreturn_t at91sam926x_pit_interrupt(int irq, void 
*dev_id)
 
 static struct irqaction at91sam926x_pit_irq = {
.name   = "at91_tick",
-   .flags  = IRQF_SHARED | IRQF_DISABLED | IRQF_TIMER | IRQF_IRQPOLL,
+   .flags  = IRQF_SHARED | IRQF_TIMER | IRQF_IRQPOLL,
.handler= at91sam926x_pit_interrupt,
.irq= NR_IRQS_LEGACY + AT91_ID_SYS,
 };
diff --git a/arch/arm/mach-at91/at91x40_time.c 
b/arch/arm/mach-at91/at91x40_time.c
index 2919eba..c0e637a 100644
--- a/arch/arm/mach-at91/at91x40_time.c
+++ b/arch/arm/mach-at91/at91x40_time.c
@@ -57,7 +57,7 @@ static irqreturn_t at91x40_timer_interrupt(int irq, void 
*dev_id)
 
 static struct irqaction at91x40_timer_irq = {
.name   = "at91_tick",
-   .flags  = IRQF_DISABLED | IRQF_TIMER,
+   .flags  = IRQF_TIMER,
.handler= at91x40_timer_interrupt
 };
 
-- 
1.8.1.2



[PATCH 03/10] idr: Rewrite ida

2013-07-05 Thread Kent Overstreet
From: Kent Overstreet 

This is a new, from scratch implementation of ida that should be
simpler, faster and more space efficient.

Two primary reasons for the rewrite:
 * A future patch will reimplement idr on top of this ida implementation +
   radix trees. Once that's done, the end result will be ~1k fewer lines
   of code, much simpler and easier to understand and it should be quite
   a bit faster.

 * The performance improvements and addition of ganged allocation should
   make ida more suitable for use by a percpu id/tag allocator, which
   would then act as a frontend to this allocator.

The old ida implementation was done with the idr data structures - this
was IMO backwards. I'll soon be reimplementing idr on top of this new
ida implementation and radix trees - using a separate dedicated data
structure for the free ID bitmap should actually make idr faster, and
the end result is _significantly_ less code.

This implementation conceptually isn't that different from the old one -
it's a tree of bitmaps, where one bit in a given node indicates whether
or not there are free bits in a child node.

The main difference (and advantage) over the old version is that the
tree isn't implemented with pointers - it's implemented in an array,
like how heaps are implemented, which both better space efficiency and
it'll be faster since there's no pointer chasing. (It's not one giant
contiguous array, it's an array of arrays but the algorithm treats it as
one big array)

Time to allocate 1 << 24 ids:   0m0.663s
Time to allocate 1 << 24 ids, old code: 0m28.604s

Time to allocate INT_MAX ids:   1m41.371s
Time to allocate INT_MAX ids, old code: Got bored of waiting for it to finish.

Signed-off-by: Kent Overstreet 
Cc: Andrew Morton 
Cc: Tejun Heo 
Cc: Stephen Rothwell 
Cc: Fengguang Wu 
Signed-off-by: Kent Overstreet 
---
 include/linux/idr.h | 122 ---
 lib/idr.c   | 894 +++-
 2 files changed, 687 insertions(+), 329 deletions(-)

diff --git a/include/linux/idr.h b/include/linux/idr.h
index c0e0c54..a310bb0 100644
--- a/include/linux/idr.h
+++ b/include/linux/idr.h
@@ -16,6 +16,92 @@
 #include 
 #include 
 #include 
+#include 
+#include 
+
+/* IDA */
+
+struct ida {
+   spinlock_t  lock;
+
+   /*
+* cur_id and allocated_ids are for ida_alloc_cyclic. For cyclic
+* allocations we search for new ids to allocate starting from the last
+* id allocated - cur_id is the next id to try allocating.
+*
+* But we also don't want the allocated ids to be arbitrarily sparse -
+* the memory usage for the bitmap could be arbitrarily bad, and if
+* they're used as keys in a radix tree the memory overhead of the radix
+* tree could be quite bad as well. So we use allocated_ids to decide
+* when to restart cur_id from 0, and bound how sparse the bitmap can
+* be.
+*/
+   unsignedcur_id;
+   unsignedallocated_ids;
+
+   /* size of ida->tree */
+   unsignednodes;
+
+   /*
+* Index of first leaf node in ida->tree; equal to the number of non
+* leaf nodes, ida->nodes - ida->first_leaf == number of leaf nodes
+*/
+   unsignedfirst_leaf;
+   unsignedsections;
+
+   unsigned long   **tree;
+   unsigned long   *inline_section;
+   unsigned long   inline_node;
+};
+
+#define IDA_INIT(name) \
+{  \
+   .lock   = __SPIN_LOCK_UNLOCKED(name.lock),  \
+   .nodes  = 1,\
+   .first_leaf = 0,\
+   .sections   = 1,\
+   .tree   = &name.inline_section, \
+   .inline_section = &name.inline_node,\
+}
+#define DEFINE_IDA(name)   struct ida name = IDA_INIT(name)
+
+void ida_remove(struct ida *ida, unsigned id);
+int ida_alloc_range(struct ida *ida, unsigned int start,
+ unsigned int end, gfp_t gfp);
+int ida_alloc_cyclic(struct ida *ida, unsigned start, unsigned end, gfp_t gfp);
+void ida_destroy(struct ida *ida);
+int ida_init_prealloc(struct ida *ida, unsigned prealloc);
+
+/**
+ * ida_alloc - allocate a new id.
+ * @ida: the (initialized) ida.
+ * @gfp_mask: memory allocation flags
+ *
+ * Allocates an id in the range [0, INT_MAX]. Returns -ENOSPC if no ids are
+ * available, or -ENOMEM on memory allocation failure.
+ *
+ * Returns the smallest available id
+ *
+ * Use ida_remove() to get rid of an id.
+ */
+static inline int ida_alloc(struct ida *ida, gfp_t gfp_mask)
+{
+   return ida_alloc_range(ida, 0, 0, gfp_mask);
+}
+
+/**
+ * ida_init - initialize ida handle
+ * @ida:   ida handle
+ *
+ * This 

[PATCH 05/10] idr: Kill old deprecated idr interfaces

2013-07-05 Thread Kent Overstreet
From: Kent Overstreet 

The deprecated idr interfaces don't have any in kernel users, so let's
delete them as prep work for the idr rewrite.

Signed-off-by: Kent Overstreet 
Cc: Andrew Morton 
Cc: Tejun Heo 
Signed-off-by: Kent Overstreet 
---
 include/linux/idr.h | 63 -
 lib/idr.c   | 36 +++---
 2 files changed, 3 insertions(+), 96 deletions(-)

diff --git a/include/linux/idr.h b/include/linux/idr.h
index f5b889b..b26f8b1 100644
--- a/include/linux/idr.h
+++ b/include/linux/idr.h
@@ -264,69 +264,6 @@ static inline void *idr_find(struct idr *idr, int id)
 #define idr_for_each_entry(idp, entry, id) \
for (id = 0; ((entry) = idr_get_next(idp, &(id))) != NULL; ++id)
 
-/*
- * Don't use the following functions.  These exist only to suppress
- * deprecated warnings on EXPORT_SYMBOL()s.
- */
-int __idr_pre_get(struct idr *idp, gfp_t gfp_mask);
-int __idr_get_new_above(struct idr *idp, void *ptr, int starting_id, int *id);
-void __idr_remove_all(struct idr *idp);
-
-/**
- * idr_pre_get - reserve resources for idr allocation
- * @idp:   idr handle
- * @gfp_mask:  memory allocation flags
- *
- * Part of old alloc interface.  This is going away.  Use
- * idr_preload[_end]() and idr_alloc() instead.
- */
-static inline int __deprecated idr_pre_get(struct idr *idp, gfp_t gfp_mask)
-{
-   return __idr_pre_get(idp, gfp_mask);
-}
-
-/**
- * idr_get_new_above - allocate new idr entry above or equal to a start id
- * @idp: idr handle
- * @ptr: pointer you want associated with the id
- * @starting_id: id to start search at
- * @id: pointer to the allocated handle
- *
- * Part of old alloc interface.  This is going away.  Use
- * idr_preload[_end]() and idr_alloc() instead.
- */
-static inline int __deprecated idr_get_new_above(struct idr *idp, void *ptr,
-int starting_id, int *id)
-{
-   return __idr_get_new_above(idp, ptr, starting_id, id);
-}
-
-/**
- * idr_get_new - allocate new idr entry
- * @idp: idr handle
- * @ptr: pointer you want associated with the id
- * @id: pointer to the allocated handle
- *
- * Part of old alloc interface.  This is going away.  Use
- * idr_preload[_end]() and idr_alloc() instead.
- */
-static inline int __deprecated idr_get_new(struct idr *idp, void *ptr, int *id)
-{
-   return __idr_get_new_above(idp, ptr, 0, id);
-}
-
-/**
- * idr_remove_all - remove all ids from the given idr tree
- * @idp: idr handle
- *
- * If you're trying to destroy @idp, calling idr_destroy() is enough.
- * This is going away.  Don't use.
- */
-static inline void __deprecated idr_remove_all(struct idr *idp)
-{
-   __idr_remove_all(idp);
-}
-
 void __init idr_init_cache(void);
 
 #endif /* __IDR_H__ */
diff --git a/lib/idr.c b/lib/idr.c
index 0278e79..f7ba96b 100644
--- a/lib/idr.c
+++ b/lib/idr.c
@@ -1070,19 +1070,6 @@ static void idr_mark_full(struct idr_layer **pa, int id)
}
 }
 
-int __idr_pre_get(struct idr *idp, gfp_t gfp_mask)
-{
-   while (idp->id_free_cnt < MAX_IDR_FREE) {
-   struct idr_layer *new;
-   new = kmem_cache_zalloc(idr_layer_cache, gfp_mask);
-   if (new == NULL)
-   return (0);
-   move_to_free_list(idp, new);
-   }
-   return 1;
-}
-EXPORT_SYMBOL(__idr_pre_get);
-
 /**
  * sub_alloc - try to allocate an id without growing the tree depth
  * @idp: idr handle
@@ -1248,21 +1235,6 @@ static void idr_fill_slot(struct idr *idr, void *ptr, 
int id,
idr_mark_full(pa, id);
 }
 
-int __idr_get_new_above(struct idr *idp, void *ptr, int starting_id, int *id)
-{
-   struct idr_layer *pa[MAX_IDR_LEVEL + 1];
-   int rv;
-
-   rv = idr_get_empty_slot(idp, starting_id, pa, 0, idp);
-   if (rv < 0)
-   return rv == -ENOMEM ? -EAGAIN : rv;
-
-   idr_fill_slot(idp, ptr, rv, pa);
-   *id = rv;
-   return 0;
-}
-EXPORT_SYMBOL(__idr_get_new_above);
-
 /**
  * idr_preload - preload for idr_alloc()
  * @gfp_mask: allocation mask to use for preloading
@@ -1483,7 +1455,7 @@ void idr_remove(struct idr *idp, int id)
 }
 EXPORT_SYMBOL(idr_remove);
 
-void __idr_remove_all(struct idr *idp)
+static void __idr_remove_all(struct idr *idp)
 {
int n, id, max;
int bt_mask;
@@ -1516,7 +1488,6 @@ void __idr_remove_all(struct idr *idp)
}
idp->layers = 0;
 }
-EXPORT_SYMBOL(__idr_remove_all);
 
 /**
  * idr_destroy - release all cached layers within an idr tree
@@ -1578,13 +1549,12 @@ EXPORT_SYMBOL(idr_find_slowpath);
  * callback function will be called for each pointer currently
  * registered, passing the id, the pointer and the data pointer passed
  * to this function.  It is not safe to modify the idr tree while in
- * the callback, so functions such as idr_get_new and idr_remove are
- * not allowed.
+ * the callback, so functions such as idr_remove are not allowed.
  *
  * We check the 

[PATCH v3] lib/idr.c rewrite, percpu ida/tag allocator

2013-07-05 Thread Kent Overstreet
Previous posting: http://thread.gmane.org/gmane.linux.kernel/1511216

The only real change since the last version is that I've reworked the
new ida implementation to not use one giant allocation - it's still
logically one big array, but it's implemented as an array of arrays.

With that, it scales up to INT_MAX allocated ids just fine. Benchmarks
are included in that patch.

Patch series is available in my git repo:
git://evilpiepirate.org/~kent/linux-bcache.git idr

Andrew, want to pick this up for 3.12?



[PATCH 04/10] idr: Percpu ida

2013-07-05 Thread Kent Overstreet
From: Kent Overstreet 

Percpu frontend for allocating ids. With percpu allocation (that works),
it's impossible to guarantee it will always be possible to allocate all
nr_tags - typically, some will be stuck on a remote percpu freelist
where the current job can't get to them.

We do guarantee that it will always be possible to allocate at least
(nr_tags / 2) tags - this is done by keeping track of which and how many
cpus have tags on their percpu freelists. On allocation failure if
enough cpus have tags that there could potentially be (nr_tags / 2) tags
stuck on remote percpu freelists, we then pick a remote cpu at random to
steal from.

Note that there's no cpu hotplug notifier - we don't care, because
steal_tags() will eventually get the down cpu's tags. We _could_ satisfy
more allocations if we had a notifier - but we'll still meet our
guarantees and it's absolutely not a correctness issue, so I don't think
it's worth the extra code.

Signed-off-by: Kent Overstreet 
Cc: Tejun Heo 
Cc: Oleg Nesterov 
Cc: Christoph Lameter 
Cc: Ingo Molnar 
Cc: Andi Kleen 
Cc: Jens Axboe 
Cc: "Nicholas A. Bellinger" 
Signed-off-by: Kent Overstreet 
---
 include/linux/idr.h |  46 +
 lib/idr.c   | 282 
 2 files changed, 328 insertions(+)

diff --git a/include/linux/idr.h b/include/linux/idr.h
index a310bb0..f5b889b 100644
--- a/include/linux/idr.h
+++ b/include/linux/idr.h
@@ -101,6 +101,52 @@ static inline void ida_init(struct ida *ida)
ida_init_prealloc(ida, 0);
 }
 
+/* Percpu IDA/tag allocator */
+
+struct percpu_ida_cpu;
+
+struct percpu_ida {
+   /*
+* number of tags available to be allocated, as passed to
+* percpu_ida_init()
+*/
+   unsignednr_tags;
+
+   struct percpu_ida_cpu __percpu  *tag_cpu;
+
+   /*
+* Bitmap of cpus that (may) have tags on their percpu freelists:
+* steal_tags() uses this to decide when to steal tags, and which cpus
+* to try stealing from.
+*
+* It's ok for a freelist to be empty when its bit is set - steal_tags()
+* will just keep looking - but the bitmap _must_ be set whenever a
+* percpu freelist does have tags.
+*/
+   unsigned long   *cpus_have_tags;
+
+   struct {
+   /*
+* When we go to steal tags from another cpu (see steal_tags()),
+* we want to pick a cpu at random. Cycling through them every
+* time we steal is a bit easier and more or less equivalent:
+*/
+   unsignedcpu_last_stolen;
+
+   /* For sleeping on allocation failure */
+   wait_queue_head_t   wait;
+
+   /* Global freelist */
+   struct ida  ida;
+   } ____cacheline_aligned_in_smp;
+};
+
+int percpu_ida_alloc(struct percpu_ida *pool, gfp_t gfp);
+void percpu_ida_free(struct percpu_ida *pool, unsigned tag);
+
+void percpu_ida_destroy(struct percpu_ida *pool);
+int percpu_ida_init(struct percpu_ida *pool, unsigned long nr_tags);
+
 /* IDR */
 
 /*
diff --git a/lib/idr.c b/lib/idr.c
index 4666350..0278e79 100644
--- a/lib/idr.c
+++ b/lib/idr.c
@@ -629,6 +629,288 @@ err:
 }
 EXPORT_SYMBOL(ida_init_prealloc);
 
+/* Percpu IDA */
+
+/*
+ * Number of tags we move between the percpu freelist and the global
+ * freelist at a time
+ */
+#define IDA_PCPU_BATCH_MOVE	32U
+
+/* Max size of percpu freelist, */
+#define IDA_PCPU_SIZE  ((IDA_PCPU_BATCH_MOVE * 3) / 2)
+
+struct percpu_ida_cpu {
+   spinlock_t  lock;
+   unsignednr_free;
+   unsignedfreelist[];
+};
+
+/*
+ * Try to steal tags from a remote cpu's percpu freelist.
+ *
+ * We first check how many percpu freelists have tags - we don't steal tags
+ * unless enough percpu freelists have tags on them that it's possible more
+ * than half the total tags could be stuck on remote percpu freelists.
+ *
+ * Then we iterate through the cpus until we find some tags - we don't attempt
+ * to find the "best" cpu to steal from, to keep cacheline bouncing to a
+ * minimum.
+ */
+static inline void steal_tags(struct percpu_ida *pool,
+ struct percpu_ida_cpu *tags)
+{
+   unsigned cpus_have_tags, cpu = pool->cpu_last_stolen;
+   struct percpu_ida_cpu *remote;
+
+   for (cpus_have_tags = bitmap_weight(pool->cpus_have_tags, nr_cpu_ids);
+cpus_have_tags * IDA_PCPU_SIZE > pool->nr_tags / 2;
+cpus_have_tags--) {
+   cpu = find_next_bit(pool->cpus_have_tags, nr_cpu_ids, cpu);
+
+   if (cpu == nr_cpu_ids)
+   cpu = find_first_bit(pool->cpus_have_tags, nr_cpu_ids);
+
+   if (cpu == nr_cpu_ids)
+   BUG();
+
+   pool->cpu_last_stolen = cpu;
+   remote = 

[PATCH 08/10] idr: Reimplement idr on top of ida/radix trees

2013-07-05 Thread Kent Overstreet
The old idr code was really a second radix tree implementation - we
already have one in lib/radix-tree.c.

This patch reimplements idr on top of our existing radix trees, using
our shiny new ida implementation for allocating/freeing the ids. The old
idr code was noticeably slower than lib/radix-tree.c in at least some
benchmarks, so in addition to being ~500 lines less code this patch
should improve performance too.

There's one thing left unfinished in this patch - the existing
idr_preload() interface won't work for ida. Another patch on top of this
will fix idr_preload() and update existing users to the new interface.

Signed-off-by: Kent Overstreet 
Cc: Andrew Morton 
Cc: Tejun Heo 
Signed-off-by: Kent Overstreet 
---
 include/linux/idr.h | 157 -
 init/main.c |   1 -
 lib/idr.c   | 896 ++--
 3 files changed, 249 insertions(+), 805 deletions(-)

diff --git a/include/linux/idr.h b/include/linux/idr.h
index 6fc0225..85355d7 100644
--- a/include/linux/idr.h
+++ b/include/linux/idr.h
@@ -1,6 +1,6 @@
 /*
  * include/linux/idr.h
- * 
+ *
  * 2002-10-18  written by Jim Houston jim.hous...@ccur.com
  * Copyright (C) 2002 by Concurrent Computer Corporation
  * Distributed under the GNU GPL license version 2.
@@ -12,10 +12,8 @@
 #ifndef __IDR_H__
 #define __IDR_H__
 
-#include 
-#include 
-#include 
-#include 
+#include 
+#include 
 #include 
 #include 
 
@@ -149,74 +147,42 @@ int percpu_ida_init(struct percpu_ida *pool, unsigned 
long nr_tags);
 
 /* IDR */
 
-/*
- * We want shallower trees and thus more bits covered at each layer.  8
- * bits gives us large enough first layer for most use cases and maximum
- * tree depth of 4.  Each idr_layer is slightly larger than 2k on 64bit and
- * 1k on 32bit.
+/**
+ * DOC: idr sync
+ * idr synchronization (stolen from radix-tree.h)
+ *
+ * idr_alloc() and idr_remove() do their own locking internally - the user need
+ * not be concerned with synchronization unless there's other operations that
+ * need to be done atomically.
+ *
+ * idr_find() does no locking - it can be called locklessly using RCU, if the
+ * caller ensures calls to this function are made within rcu_read_lock()
+ * regions and does all the other appropriate RCU stuff.
  */
-#define IDR_BITS 8
-#define IDR_SIZE (1 << IDR_BITS)
-#define IDR_MASK ((1 << IDR_BITS)-1)
-
-struct idr_layer {
-   int prefix; /* the ID prefix of this idr_layer */
-   DECLARE_BITMAP(bitmap, IDR_SIZE); /* A zero bit means "space here" */
-   struct idr_layer __rcu  *ary[1<<IDR_BITS];
[...]
-   if (hint && (id & ~IDR_MASK) == hint->prefix)
-   return rcu_dereference_raw(hint->ary[id & IDR_MASK]);
-
-   return idr_find_slowpath(idr, id);
+   return __radix_idr_ptr(radix_tree_lookup(&idr->ptrs, id));
 }
 
 /**
  * idr_for_each_entry - iterate over an idr's elements of a given type
- * @idp: idr handle
+ * @idr: idr handle
  * @entry:   the type * to use as cursor
  * @id:  id entry's key
  *
@@ -266,9 +273,7 @@ static inline void *idr_find(struct idr *idr, int id)
  * after normal terminatinon @entry is left with the value NULL.  This
  * is convenient for a "not found" value.
  */
-#define idr_for_each_entry(idp, entry, id) \
-   for (id = 0; ((entry) = idr_find_next(idp, &(id))) != NULL; ++id)
-
-void __init idr_init_cache(void);
+#define idr_for_each_entry(idr, entry, id) \
+   for (id = 0; ((entry) = idr_find_next(idr, &(id))) != NULL; ++id)
 
 #endif /* __IDR_H__ */
diff --git a/init/main.c b/init/main.c
index 9484f4b..87b5a0f 100644
--- a/init/main.c
+++ b/init/main.c
@@ -541,7 +541,6 @@ asmlinkage void __init start_kernel(void)
preempt_disable();
if (WARN(!irqs_disabled(), "Interrupts were enabled *very* early, fixing it\n"))
local_irq_disable();
-   idr_init_cache();
perf_event_init();
rcu_init();
tick_nohz_init();
diff --git a/lib/idr.c b/lib/idr.c
index a3977aa..fc7cb1a 100644
--- a/lib/idr.c
+++ b/lib/idr.c
@@ -8,24 +8,10 @@
  *
  * Modified by Nadia Derbey to make it RCU safe.
  *
- * IDA completely rewritten by Kent Overstreet 
+ * Completely rewritten by Kent Overstreet .
  *
- * Small id to pointer translation service.
- *
- * It uses a radix tree like structure as a sparse array indexed
- * by the id to obtain the pointer.  The bitmap makes allocating
- * a new id quick.
- *
- * You call it to allocate an id (an int) an associate with that id a
- * pointer or what ever, we treat it as a (void *).  You can pass this
- * id to a user for him to pass back at a later time.  You then pass
- * that id to this code and it returns your pointer.
-
- * You can release ids at any time. When all ids are released, most of
- * the memory is returned (we keep MAX_IDR_FREE) in a local pool so we
- * don't need to go to the memory "store" during an id allocate, just
- * so you don't need to be too concerned about locking and conflicts
- * with the slab allocator.

[PATCH 06/10] idr: Rename idr_get_next() -> idr_find_next()

2013-07-05 Thread Kent Overstreet
From: Kent Overstreet 

get() implies taking a ref or sometimes an allocation, which this
function definitely does not do - rename it to something more sensible.

Signed-off-by: Kent Overstreet 
Cc: Andrew Morton 
Cc: Tejun Heo 
Signed-off-by: Kent Overstreet 
---
 drivers/block/drbd/drbd_main.c | 2 +-
 drivers/block/drbd/drbd_nl.c   | 2 +-
 drivers/mtd/mtdcore.c  | 2 +-
 include/linux/idr.h| 4 ++--
 lib/idr.c  | 6 +++---
 5 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c
index a5dca6a..b84e4b2 100644
--- a/drivers/block/drbd/drbd_main.c
+++ b/drivers/block/drbd/drbd_main.c
@@ -500,7 +500,7 @@ int conn_lowest_minor(struct drbd_tconn *tconn)
int vnr = 0, m;
 
rcu_read_lock();
-   mdev = idr_get_next(&tconn->volumes, &vnr);
+   mdev = idr_find_next(&tconn->volumes, &vnr);
m = mdev ? mdev_to_minor(mdev) : -1;
rcu_read_unlock();
 
diff --git a/drivers/block/drbd/drbd_nl.c b/drivers/block/drbd/drbd_nl.c
index 9e3f441..8fa7eb1 100644
--- a/drivers/block/drbd/drbd_nl.c
+++ b/drivers/block/drbd/drbd_nl.c
@@ -2837,7 +2837,7 @@ int get_one_status(struct sk_buff *skb, struct netlink_callback *cb)
}
if (tconn) {
 next_tconn:
-   mdev = idr_get_next(&tconn->volumes, &volume);
+   mdev = idr_find_next(&tconn->volumes, &volume);
if (!mdev) {
/* No more volumes to dump on this tconn.
 * Advance tconn iterator. */
diff --git a/drivers/mtd/mtdcore.c b/drivers/mtd/mtdcore.c
index c400c57..eaa1fcc 100644
--- a/drivers/mtd/mtdcore.c
+++ b/drivers/mtd/mtdcore.c
@@ -91,7 +91,7 @@ EXPORT_SYMBOL_GPL(mtd_table_mutex);
 
 struct mtd_info *__mtd_next_device(int i)
 {
-   return idr_get_next(&mtd_idr, &i);
+   return idr_find_next(&mtd_idr, &i);
 }
 EXPORT_SYMBOL_GPL(__mtd_next_device);
 
diff --git a/include/linux/idr.h b/include/linux/idr.h
index b26f8b1..6395da1 100644
--- a/include/linux/idr.h
+++ b/include/linux/idr.h
@@ -211,7 +211,7 @@ int idr_alloc(struct idr *idp, void *ptr, int start, int end, gfp_t gfp_mask);
 int idr_alloc_cyclic(struct idr *idr, void *ptr, int start, int end, gfp_t gfp_mask);
 int idr_for_each(struct idr *idp,
 int (*fn)(int id, void *p, void *data), void *data);
-void *idr_get_next(struct idr *idp, int *nextid);
+void *idr_find_next(struct idr *idp, int *nextid);
 void *idr_replace(struct idr *idp, void *ptr, int id);
 void idr_remove(struct idr *idp, int id);
 void idr_free(struct idr *idp, int id);
@@ -262,7 +262,7 @@ static inline void *idr_find(struct idr *idr, int id)
  * is convenient for a "not found" value.
  */
 #define idr_for_each_entry(idp, entry, id) \
-   for (id = 0; ((entry) = idr_get_next(idp, &(id))) != NULL; ++id)
+   for (id = 0; ((entry) = idr_find_next(idp, &(id))) != NULL; ++id)
 
 void __init idr_init_cache(void);
 
diff --git a/lib/idr.c b/lib/idr.c
index f7ba96b..254e0dc 100644
--- a/lib/idr.c
+++ b/lib/idr.c
@@ -1594,7 +1594,7 @@ int idr_for_each(struct idr *idp,
 EXPORT_SYMBOL(idr_for_each);
 
 /**
- * idr_get_next - lookup next object of id to given id.
+ * idr_find_next - lookup next object of id to given id.
  * @idp: idr handle
  * @nextidp:  pointer to lookup key
  *
@@ -1605,7 +1605,7 @@ EXPORT_SYMBOL(idr_for_each);
  * This function can be called under rcu_read_lock(), given that the leaf
  * pointers lifetimes are correctly managed.
  */
-void *idr_get_next(struct idr *idp, int *nextidp)
+void *idr_find_next(struct idr *idp, int *nextidp)
 {
struct idr_layer *p, *pa[MAX_IDR_LEVEL + 1];
struct idr_layer **paa = &pa[0];
@@ -1646,7 +1646,7 @@ void *idr_get_next(struct idr *idp, int *nextidp)
}
return NULL;
 }
-EXPORT_SYMBOL(idr_get_next);
+EXPORT_SYMBOL(idr_find_next);
 
 
 /**
-- 
1.8.3.2

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Why no USB id list in the kernel sources?

2013-07-05 Thread Michael Opdenacker
Hi Greg,

On 07/05/2013 11:43 PM, Greg KH wrote:
> On Fri, Jul 05, 2013 at 11:34:05PM +0200, Michael Opdenacker wrote:
>> Hi,
>>
>> I'm wondering why there is no include/linux/usb_ids.h (or
>> include/linux/usb/ids.h) file in the same way there is a
>> include/linux/pci_ids.h for PCI.
> Because that way lies madness, we have learned from our mistakes and do
> not want to repeat them again :)
>
> It turns out that the pci_ids file isn't a good idea, it's a merge mess,
> and only really works when you have ids that are shared across different
> drivers.  In the end, that is a very small number, and it's just not
> worth the time and effort to do this in a centralized way.
>
> Hope this helps explain things, if you want more details, dig into the
> linux usb mailing list about 10-15 years ago when this decision was
> made.

I understand better now, thanks. It's true that the added value would
have been relatively small anyway.

Thanks for your time!

Cheers,

Michael.
>
> thanks,
>
> greg k-h


-- 
Michael Opdenacker, Free Electrons
Embedded Linux, Kernel and Android engineering
http://free-electrons.com
+33 484 258 098



[PATCH] hashtable: add hash_for_each_possible_rcu_notrace()

2013-07-05 Thread Alexey Kardashevskiy
This adds hash_for_each_possible_rcu_notrace() which is basically
a notrace clone of hash_for_each_possible_rcu() which cannot be
used in real mode due to its tracing/debugging capability.

Signed-off-by: Alexey Kardashevskiy 
---
 include/linux/hashtable.h | 15 +++
 1 file changed, 15 insertions(+)

diff --git a/include/linux/hashtable.h b/include/linux/hashtable.h
index a9df51f..af8b169 100644
--- a/include/linux/hashtable.h
+++ b/include/linux/hashtable.h
@@ -174,6 +174,21 @@ static inline void hash_del_rcu(struct hlist_node *node)
member)
 
 /**
+ * hash_for_each_possible_rcu_notrace - iterate over all possible objects hashing
+ * to the same bucket in an rcu enabled hashtable
+ * @name: hashtable to iterate
+ * @obj: the type * to use as a loop cursor for each entry
+ * @member: the name of the hlist_node within the struct
+ * @key: the key of the objects to iterate over
+ *
+ * This is the same as hash_for_each_possible_rcu() except that it does
+ * not do any RCU debugging or tracing.
+ */
+#define hash_for_each_possible_rcu_notrace(name, obj, member, key) \
+   hlist_for_each_entry_rcu_notrace(obj, &name[hash_min(key, HASH_BITS(name))],\
+   member)
+
+/**
  * hash_for_each_possible_safe - iterate over all possible objects hashing to the
  * same bucket safe against removals
  * @name: hashtable to iterate
-- 
1.8.3.2



Re: [GIT PULL for v3.11-rc1] media patches for v3.11

2013-07-05 Thread Mauro Carvalho Chehab

On Fri, 5 Jul 2013, Bjørn Mork wrote:


Mauro Carvalho Chehab  writes:


 mode change 100755 => 100644 lib/build_OID_registry
 mode change 100755 => 100644 scripts/Lindent
 mode change 100755 => 100644 scripts/bloat-o-meter
 mode change 100755 => 100644 scripts/checkincludes.pl
 mode change 100755 => 100644 scripts/checkkconfigsymbols.sh
 mode change 100755 => 100644 scripts/checkpatch.pl
 mode change 100755 => 100644 scripts/checkstack.pl
 mode change 100755 => 100644 scripts/checksyscalls.sh
 mode change 100755 => 100644 scripts/checkversion.pl
 mode change 100755 => 100644 scripts/cleanfile
 mode change 100755 => 100644 scripts/cleanpatch
 mode change 100755 => 100644 scripts/coccicheck
 mode change 100755 => 100644 scripts/config
 mode change 100755 => 100644 scripts/decodecode
 mode change 100755 => 100644 scripts/depmod.sh
 mode change 100755 => 100644 scripts/diffconfig
 mode change 100755 => 100644 scripts/extract-ikconfig
 mode change 100755 => 100644 scripts/extract-vmlinux
 mode change 100755 => 100644 scripts/get_maintainer.pl
 mode change 100755 => 100644 scripts/gfp-translate
 mode change 100755 => 100644 scripts/headerdep.pl
 mode change 100755 => 100644 scripts/headers.sh
 mode change 100755 => 100644 scripts/kconfig/check.sh
 mode change 100755 => 100644 scripts/kconfig/merge_config.sh
 mode change 100755 => 100644 scripts/kernel-doc
 mode change 100755 => 100644 scripts/makelst
 mode change 100755 => 100644 scripts/mkcompile_h
 mode change 100755 => 100644 scripts/mkuboot.sh
 mode change 100755 => 100644 scripts/namespace.pl
 mode change 100755 => 100644 scripts/package/mkspec
 mode change 100755 => 100644 scripts/patch-kernel
 mode change 100755 => 100644 scripts/recordmcount.pl
 mode change 100755 => 100644 scripts/setlocalversion
 mode change 100755 => 100644 scripts/show_delta
 mode change 100755 => 100644 scripts/sign-file
 mode change 100755 => 100644 scripts/tags.sh
 mode change 100755 => 100644 scripts/ver_linux
 mode change 100755 => 100644 tools/hv/hv_get_dhcp_info.sh
 mode change 100755 => 100644 tools/hv/hv_get_dns_info.sh
 mode change 100755 => 100644 tools/hv/hv_set_ifconfig.sh
 mode change 100755 => 100644 tools/nfsd/inject_fault.sh
 mode change 100755 => 100644 tools/perf/python/twatch.py
 mode change 100755 => 100644 tools/perf/scripts/python/Perf-Trace-Util/lib/Perf/Trace/EventClass.py
 mode change 100755 => 100644 tools/perf/scripts/python/bin/net_dropmonitor-record
 mode change 100755 => 100644 tools/perf/scripts/python/bin/net_dropmonitor-report
 mode change 100755 => 100644 tools/perf/scripts/python/net_dropmonitor.py
 mode change 100755 => 100644 tools/perf/util/PERF-VERSION-GEN
 mode change 100755 => 100644 tools/perf/util/generate-cmdlist.sh
 mode change 100755 => 100644 tools/power/cpupower/utils/version-gen.sh
 mode change 100755 => 100644 tools/testing/ktest/compare-ktest-sample.pl
 mode change 100755 => 100644 tools/testing/ktest/ktest.pl



You didn't really mean to do that, did you?


No, those changes were incidental. They were caused by a bad script
that was meant to just replace my e-mail.

I'll fix it.

Thanks for pointing it out!

Regards,
Mauro

Re: [PATCH 1/3] smp/ipi: Remove redundant cfd->cpumask_ipi mask

2013-07-05 Thread Wang YanQing
On Fri, Jul 05, 2013 at 09:57:01PM +0530, Preeti U Murthy wrote:
> cfd->cpumask_ipi is used only in smp_call_function_many(). The existing
> comment around it says that this additional mask is used because
> cfd->cpumask can get overwritten.
> 
> There is no reason why the cfd->cpumask can be overwritten, since this
> is a per_cpu mask; nobody can change it but us and we are
> called with preemption disabled.

The changelog of commit f44310b98ddb7f0d06550d73ed67df5865e3eda5,
which introduced cfd->cpumask_ipi, explains why we need it:

"As explained by Linus as well:

 |
 | Once we've done the "list_add_rcu()" to add it to the
 | queue, we can have (another) IPI to the target CPU that can
 | now see it and clear the mask.
 |
 | So by the time we get to actually send the IPI, the mask might
 | have been cleared by another IPI.
 |

This patch also fixes a system hang problem, if the data->cpumask
gets cleared after passing this point:

if (WARN_ONCE(!mask, "empty IPI mask"))
return;

then the problem in commit 83d349f35e1a ("x86: don't send an IPI to
the empty set of CPU's") will happen again.
"
So this patch is wrong.

And you should Cc Linus and Jan Beulich, who gave their Acked-by tags
on that commit.

Thanks.


[3.10] Oopses in kmem_cache_allocate() via prepare_creds()

2013-07-05 Thread Simon Kirby
We saw two Oopses overnight on two separate boxes that seem possibly
related, but both are weird. These boxes typically run btrfs for rsync
snapshot backups (and usually Oops in btrfs ;), but not this time!
backup02 was running 3.10-rc6 plus btrfs-next at the time, and backup03
was running 3.10 release plus btrfs-next from yesterday. Full kern.log
and .config at http://0x.ca/sim/ref/3.10/

backup02's first Oops:

BUG: unable to handle kernel paging request at 0001
IP: [] kmem_cache_alloc+0x4b/0x110
PGD 1f54f7067 PUD 0
Oops:  [#1] SMP
Modules linked in: ipmi_devintf ipmi_si ipmi_msghandler aoe microcode serio_raw bnx2 evdev
CPU: 0 PID: 23112 Comm: ionice Not tainted 3.10.0-rc6-hw+ #46
Hardware name: Dell Inc. PowerEdge 2950/0NH278, BIOS 2.7.0 10/30/2010
task: 8802c3f08000 ti: 8801b4876000 task.ti: 8801b4876000
RIP: 0010:[]  [] kmem_cache_alloc+0x4b/0x110
RSP: 0018:8801b4877e88  EFLAGS: 00010206
RAX:  RBX: 8802c3f08000 RCX: 017f040e
RDX: 017f040d RSI: 00d0 RDI: 8107a503
RBP: 8801b4877ec8 R08: 00016a80 R09: 
R10: 7fff025fe120 R11: 0246 R12: 00d0
R13: 88042d8019c0 R14: 0001 R15: 7fc3588ee97f
FS:  () GS:88043fc0() knlGS:
CS:  0010 DS:  ES:  CR0: 8005003b
CR2: 0001 CR3: 000409d68000 CR4: 07f0
DR0:  DR1:  DR2: 
DR3:  DR6: 0ff0 DR7: 0400
Stack:
 8801b4877ed8 8112a1bc 8800985acd20 8802c3f08000
 0001 7fc3588ee334 7fc358af5758 7fc3588ee97f
 8801b4877ee8 8107a503 8801b4877ee8 ffea
Call Trace:
 [] ? __fput+0x12c/0x240
 [] prepare_creds+0x23/0x150
 [] SyS_faccessat+0x34/0x1f0
 [] SyS_access+0x13/0x20
 [] system_call_fastpath+0x16/0x1b
Code: 75 f0 4c 89 7d f8 49 8b 4d 00 65 48 03 0c 25 68 da 00 00 48 8b 51 08 4c 
8b 31 4d 85 f6 74 5f 49 63 45 20 4d 8b 45 00 48 8d 4a 01 <49> 8b 1c 06 4c 89 f0 
65 49 0f c7 08 0f 94 c0 84 c0 74 c8 49 63
RIP  [] kmem_cache_alloc+0x4b/0x110
 RSP 
CR2: 0001
---[ end trace 744477356cd98306 ]---

backup03's first Oops:

BUG: unable to handle kernel paging request at 880502efc240
IP: [] kmem_cache_alloc+0x4b/0x110
PGD 1d3a067 PUD 0
Oops:  [#1] SMP
Modules linked in: aoe ipmi_devintf ipmi_msghandler bnx2 microcode serio_raw evdev
CPU: 6 PID: 14066 Comm: perl Not tainted 3.10.0-hw+ #2
Hardware name: Dell Inc. PowerEdge R510/0DPRKF, BIOS 1.11.0 07/23/2012
task: 88040111c3b0 ti: 8803c23ae000 task.ti: 8803c23ae000
RIP: 0010:[]  [] kmem_cache_alloc+0x4b/0x110
RSP: 0018:8803c23afd90  EFLAGS: 00010282
RAX:  RBX: 88040111c3b0 RCX: 0002a76e
RDX: 0002a76d RSI: 00d0 RDI: 8107a4e3
RBP: 8803c23afdd0 R08: 00016a80 R09: 
R10: fffe R11: ffd0 R12: 00d0
R13: 88041d403980 R14: 880502efc240 R15: 88010e375a40
FS:  7f2cae496700() GS:88041f2c() knlGS:
CS:  0010 DS:  ES:  CR0: 8005003b
CR2: 880502efc240 CR3: 0001e0ced000 CR4: 07e0
DR0:  DR1:  DR2: 
DR3:  DR6: 0ff0 DR7: 0400
Stack:
 8803c23afe98 8803c23afdb8 81133811 88040111c3b0
 88010e375a40 01200011 7f2cae4969d0 88010e375a40
 8803c23afdf0 8107a4e3 81b49b80 01200011
Call Trace:
 [] ? final_putname+0x21/0x50
 [] prepare_creds+0x23/0x150
 [] copy_creds+0x31/0x160
 [] ? unlazy_fpu+0x9b/0xb0
 [] copy_process.part.49+0x239/0x1390
 [] ? __alloc_fd+0x42/0x100
 [] do_fork+0xa4/0x320
 [] ? __do_pipe_flags+0x77/0xb0
 [] ? __fd_install+0x26/0x60
 [] SyS_clone+0x11/0x20
 [] stub_clone+0x69/0x90
 [] ? system_call_fastpath+0x16/0x1b
Code: 75 f0 4c 89 7d f8 49 8b 4d 00 65 48 03 0c 25 68 da 00 00 48 8b 51 08 4c 
8b 31 4d 85 f6 74 5f 49 63 45 20 4d 8b 45 00 48 8d 4a 01 <49> 8b 1c 06 4c 89 f0 
65 49 0f c7 08 0f 94 c0 84 c0 74 c8 49 63
RIP  [] kmem_cache_alloc+0x4b/0x110
 RSP 
CR2: 880502efc240
---[ end trace 956d153150ecc57f ]---

Simon-


Re: [PATCH] pci: Avoid unnecessary calls to work_on_cpu

2013-07-05 Thread Benjamin Herrenschmidt
On Fri, 2013-07-05 at 17:36 -0600, Bjorn Helgaas wrote:
> It seems a little strange to me that this "run the driver probe method
> on the correct node" code is in PCI.  I would think this behavior
> would be desirable for *all* bus types, not just PCI, so maybe it
> would make sense to do this up in device_attach() or somewhere
> similar.
> 
> But Rusty added this (in 873392ca51), and he knows way more about this
> stuff than I do.

I tend to agree... I can see this being useful on some of our non-PCI
devices on power as well in fact.

Cheers,
Ben.




Re: [PATCH 9/9] clocksource: dw_apb_timer: special variant for rockchip rk3188 timers

2013-07-05 Thread Thomas Gleixner
On Sat, 6 Jul 2013, Heiko Stübner wrote:
> + if (of_device_is_compatible(np, "rockchip,rk3188-dw-apb-timer-osc"))
> + *quirks |= APBTMR_QUIRK_64BIT_COUNTER | APBTMR_QUIRK_NO_EOI |
> +APBTMR_QUIRK_INVERSE_INTMASK |
> +APBTMR_QUIRK_INVERSE_PERIODIC;

Brilliant. Next time we add

> + if (of_device_is_compatible(np, "rockchip,rk3188-dw-apb-timer-osc2"))
> + *quirks |= APBTMR_QUIRK_64BIT_COUNTER | APBTMR_QUIRK_NO_EOI |
> +APBTMR_QUIRK_INVERSE_INTMASK |
> +APBTMR_QUIRK_INVERSE_PERIODIC | MORE_NONSENSE;

Plus the extra conditionals all over the place

and a week later

> + if (of_device_is_compatible(np, "rockchip,rk3188-dw-apb-timer-osc3"))
> + *quirks |= ALLNONSENSE;


This has nothing to do with QUIRKS at all. QUIRKS are our last resort
when we cannot deal with the problem in a sane way.

In this case this is simply the wrong approach.

We can deal with it in a sane way as I pointed out before and handle
it as simple properties of the IP block. We have all mechanisms in
place to handle such properties, device tree, platform data, static
init structures. It's all there and we really do not need to plaster
the code with random QUIRK conditionals.

Did you ever consider the runtime penalty of this? Probably not,
otherwise you would have spent time on speeding up that code by
caching frequently accessed registers instead of reading them back
over and over for no reason.

Thanks,

tglx


Re: [PATCH 6/9] clocksource: dw_apb_timer: quirk for inverted int mask

2013-07-05 Thread Thomas Gleixner
On Sat, 6 Jul 2013, Heiko Stübner wrote:

> Some timer variants use an inverted setting to mask the timer interrupt.
> Therefore add a quirk to handle these variants.

And by that add even more pointless conditionals into critical code
paths.



Re: [PATCH 5/9] clocksource: dw_apb_timer: quirk for variants without EOI register

2013-07-05 Thread Thomas Gleixner
On Sat, 6 Jul 2013, Heiko Stübner wrote:
> - dw_ced->eoi = apbt_eoi;
> + if (quirks & APBTMR_QUIRK_NO_EOI)
> + dw_ced->eoi = apbt_eoi_int_status;
> + else
> + dw_ced->eoi = apbt_eoi;

No again. This has nothing to do with quirks. We use quirks for
workarounds and not for refactoring of code.

Thanks,

tglx

Re: [PATCH 3/9] clocksource: dw_apb_timer: quirk for variants with 64bit counter

2013-07-05 Thread Thomas Gleixner
On Sat, 6 Jul 2013, Heiko Stübner wrote:

> This adds a quirk for IP variants containing two load_count and value
> registers that are used to provide 64bit accuracy on 32bit systems.
> 
> The added accuracy is currently not used, the driver is only adapted to
> handle the different register layout and make it work on affected devices.
> 
> Signed-off-by: Heiko Stuebner 
> ---
>  drivers/clocksource/dw_apb_timer.c |   27 +++
>  include/linux/dw_apb_timer.h   |6 ++
>  2 files changed, 33 insertions(+)
> 
> diff --git a/drivers/clocksource/dw_apb_timer.c b/drivers/clocksource/dw_apb_timer.c
> index f5e7be8..bd45351 100644
> --- a/drivers/clocksource/dw_apb_timer.c
> +++ b/drivers/clocksource/dw_apb_timer.c
> @@ -56,6 +56,17 @@ static void apbt_init_regs(struct dw_apb_timer *timer, int quirks)
>   timer->reg_control = APBTMR_N_CONTROL;
>   timer->reg_eoi = APBTMR_N_EOI;
>   timer->reg_int_status = APBTMR_N_INT_STATUS;
> +
> + /*
> +  * On variants with 64bit counters some registers are
> +  * moved further down.
> +  */
> + if (quirks & APBTMR_QUIRK_64BIT_COUNTER) {
> + timer->reg_current_value += 0x4;
> + timer->reg_control += 0x8;
> + timer->reg_eoi += 0x8;
> + timer->reg_int_status += 0x8;
> + }
>  }

Oh, no. this is not how we handle these things.

1) We want proper constants for this 64bit IP block 

2) This is not a quirk, it's a property of that particular IP block

You already made the register offsets a part of the timer structure,
so why don't you supply a proper structure filled with that values to
the init function?

That's what we do all over the place. Either we instantiate those
structs at compile time or runtime fed by device tree or any other
configuration mechanism.
  
>  static unsigned long apbt_readl(struct dw_apb_timer *timer, unsigned long offs)
> @@ -145,6 +156,10 @@ static void apbt_set_mode(enum clock_event_mode mode,
>   udelay(1);
>   pr_debug("Setting clock period %lu for HZ %d\n", period, HZ);
>   apbt_writel(timer, period, timer->reg_load_count);
> +
> + if (timer->quirks & APBTMR_QUIRK_64BIT_COUNTER)
> + apbt_writel(timer, 0, timer->reg_load_count + 0x4);
> +

No. We are not adding such conditional constructs when we can deal
with them just by providing a proper set of function pointers. And
definitely not with hardcoded magic 0x4 constants involved.

 timer->load_count(timer, value);

Provide a 32 bit and a 64 bit version of that function and be done
with it.

>   ctrl |= APBTMR_CONTROL_ENABLE;
>   apbt_writel(timer, ctrl, timer->reg_control);
>   break;
> @@ -168,6 +183,10 @@ static void apbt_set_mode(enum clock_event_mode mode,
>* running mode.
>*/
>   apbt_writel(timer, ~0, timer->reg_load_count);
> +
> + if (timer->quirks & APBTMR_QUIRK_64BIT_COUNTER)
> + apbt_writel(timer, 0, timer->reg_load_count + 0x4);
> +

Makes this copy go away.

>   ctrl &= ~APBTMR_CONTROL_INT;
>   ctrl |= APBTMR_CONTROL_ENABLE;
>   apbt_writel(timer, ctrl, timer->reg_control);
> @@ -199,6 +218,10 @@ static int apbt_next_event(unsigned long delta,
>   apbt_writel(timer, ctrl, timer->reg_control);
>   /* write new count */
>   apbt_writel(timer, delta, timer->reg_load_count);
> +
> + if (timer->quirks & APBTMR_QUIRK_64BIT_COUNTER)
> + apbt_writel(timer, 0, timer->reg_load_count + 0x4);
> +

And this one as well.

>   ctrl |= APBTMR_CONTROL_ENABLE;
>   apbt_writel(timer, ctrl, timer->reg_control);
>  
> @@ -325,6 +348,10 @@ void dw_apb_clocksource_start(struct dw_apb_clocksource *dw_cs)
>   ctrl &= ~APBTMR_CONTROL_ENABLE;
>   apbt_writel(timer, ctrl, timer->reg_control);
>   apbt_writel(timer, ~0, timer->reg_load_count);
> +
> + if (timer->quirks & APBTMR_QUIRK_64BIT_COUNTER)
> + apbt_writel(timer, 0, timer->reg_load_count + 0x4);
> +

Copy and paste is a convenient thing, right? It should just have a pop-up
window that asks, at the second instance of copying the same thing,
whether you really thought about it.

Thanks,

tglx

[PATCH] irqchip: gic: Don't complain in gic_get_cpumask() if UP system

2013-07-05 Thread Stephen Boyd
In a uniprocessor implementation the interrupt processor targets
registers are read-as-zero/write-ignored (RAZ/WI). Unfortunately
gic_get_cpumask() will print a critical message saying

 GIC CPU mask not found - kernel will fail to boot.

if these registers all read as zero, but there won't actually be
a problem on uniprocessor systems and the kernel will boot just
fine. Skip this check if we're running a UP kernel or if we
detect that the hardware only supports a single processor.

Cc: Nicolas Pitre 
Cc: Russell King 
Signed-off-by: Stephen Boyd 
---

Maybe we should just drop the check entirely? It looks like it may
just be debug code that won't ever trigger in practice, even on the
11MPCore that caused this code to be introduced.

 drivers/irqchip/irq-gic.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/irqchip/irq-gic.c b/drivers/irqchip/irq-gic.c
index 19ceaa6..589c760 100644
--- a/drivers/irqchip/irq-gic.c
+++ b/drivers/irqchip/irq-gic.c
@@ -368,7 +368,7 @@ static u8 gic_get_cpumask(struct gic_chip_data *gic)
break;
}
 
-   if (!mask)
+   if (!mask && num_possible_cpus() > 1)
pr_crit("GIC CPU mask not found - kernel will fail to boot.\n");
 
return mask;
-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
hosted by The Linux Foundation



Re: [PATCH] pci: Avoid unnecessary calls to work_on_cpu

2013-07-05 Thread Bjorn Helgaas
[+cc Rusty]

On Mon, Jun 24, 2013 at 2:05 PM, Alexander Duyck
 wrote:
> This patch is meant to address the fact that we are making unnecessary calls
> to work_on_cpu.  To resolve this I have added a check to see if the current
> node is the correct node for the device before we decide to assign the probe
> task to another CPU.
>
> The advantages to this approach is that we can avoid reentrant calls to
> work_on_cpu.  In addition we should not make any calls to setup the work
> remotely in the case of a single node system that has NUMA enabled.

The description above makes it sound like this is just a minor
performance enhancement, but I think the real reason you want this is
to resolve the lockdep warning mentioned at [1].  That thread is long
and confusing, so I'd like to see a bugzilla that distills out the
useful details, and a synopsis in this changelog.

[1] https://lkml.kernel.org/r/1368498506-25857-7-git-send-email-ying...@kernel.org

> Signed-off-by: Alexander Duyck 
> ---
>
> This patch is based off of work I submitted in an earlier patch that I never
> heard back on.  The change was originally submitted in:
>   pci: Avoid reentrant calls to work_on_cpu
>
> I'm not sure what ever happened with that patch, however after reviewing it
> some myself I decided I could do without the change to the comments since they
> were unneeded.  As such I am resubmitting this as a much simpler patch that
> only adds the line of code needed to avoid calling work_on_cpu for every call
> to probe on an NUMA node specific device.
>
>  drivers/pci/pci-driver.c |2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
> index 79277fb..7d81713 100644
> --- a/drivers/pci/pci-driver.c
> +++ b/drivers/pci/pci-driver.c
> @@ -282,7 +282,7 @@ static int pci_call_probe(struct pci_driver *drv, struct pci_dev *dev,
>its local memory on the right node without any need to
>change it. */
> node = dev_to_node(>dev);
> -   if (node >= 0) {
> +   if ((node >= 0) && (node != numa_node_id())) {
> int cpu;
>
> get_online_cpus();

I think it's theoretically unsafe to use numa_node_id() while
preemption is enabled.

It seems a little strange to me that this "run the driver probe method
on the correct node" code is in PCI.  I would think this behavior
would be desirable for *all* bus types, not just PCI, so maybe it
would make sense to do this up in device_attach() or somewhere
similar.

But Rusty added this (in 873392ca51), and he knows way more about this
stuff than I do.

Bjorn


Re: scripts/kallsyms: Avoid ARM veneer symbols

2013-07-05 Thread Arnd Bergmann
On Friday 05 July 2013, Dave P Martin wrote:
> On Fri, Jul 05, 2013 at 05:42:44PM +0100, Arnd Bergmann wrote:
> > On Friday 05 July 2013, Dave P Martin wrote:
> > > On Wed, Jul 03, 2013 at 06:03:04PM +0200, Arnd Bergmann wrote:
> 
> I think there are a small number of patterns to check for.
> 
> __*_veneer, __*_from_arm and __*_from_thumb should cover most cases.

Ok.

> > * There are actually symbols without a name on ARM, which screws up the
> >   kallsyms.c parser. These also seem to be veneers, but attached to some
> >   random function:
> 
> Hmmm, I don't know what those are.  By default, we should probably ignore those
> too.  Maybe they have something to do with link-time relocation processing.

Definitely link-time. It only shows up after the final link, and only
with ld.bfd not with ld.gold as I found out now.

> > $ nm obj-tmp/.tmp_vmlinux1 | head
> > c09e8db1 t 
> > c09e8db5 t 
> > c09e8db9 t# <==
> > c09e8dbd t 
> > c0abfc29 t 
> > c0008000 t $a
> > c0f7b640 t $a
> > 
> > $ objdump -Dr obj-tmp/.tmp_vmlinux1 | grep -C 30 c09e8db.
> > c0851fcc :
> > c0851fcc:   b538push{r3, r4, r5, lr}
> > c0851fce:   b500push{lr}
> > c0851fd0:   f7bb d8dc   bl  c000d18c <__gnu_mcount_nc>
> > c0851fd4:   f240 456b   movwr5, #1131   ; 0x46b
> > c0851fd8:   4604mov r4, r0
> > c0851fda:   f880 14d5   strb.w  r1, [r0, #1237] ; 0x4d5
> > c0851fde:   462amov r2, r5
> > c0851fe0:   f44f 710b   mov.w   r1, #556; 0x22c
> > c0851fe4:   f7ff fe6d   bl  c0851cc2 
> > c0851fe8:   4620mov r0, r4
> > c0851fea:   462amov r2, r5
> > c0851fec:   f240 212d   movwr1, #557; 0x22d
> > c0851ff0:   f7ff fe67   bl  c0851cc2 
> > c0851ff4:   4620mov r0, r4
> > c0851ff6:   f240 212e   movwr1, #558; 0x22e
> > c0851ffa:   f44f 7270   mov.w   r2, #960; 0x3c0
> > c0851ffe:   f196 fedb   bl  c09e8db8   # 
> > <===
> > c0852002:   4620mov r0, r4
> > c0852004:   f240 212f   movwr1, #559; 0x22f
> > c0852008:   f44f 7270   mov.w   r2, #960; 0x3c0
> > c085200c:   e8bd 4038   ldmia.w sp!, {r3, r4, r5, lr}
> > c0852010:   f7ff be57   b.w c0851cc2 
> > 
> > 
> > ... # in tpci200_free_irq:
> > c09e8d9e:   e003b.n c09e8da8 
> > c09e8da0:   f06f 0415   mvn.w   r4, #21
> > c09e8da4:   e000b.n c09e8da8 
> > c09e8da6:   4c01ldr r4, [pc, #4]; (c09e8dac )
> > c09e8da8:   4620mov r0, r4
> > c09e8daa:   bdf8pop {r3, r4, r5, r6, r7, pc}
> > c09e8dac:   fe00;  instruction: 0xfe00
> > c09e8db0:   f4cf b814   b.w c06b7ddc 
> > 
> > c09e8db4:   f53e bed8   b.w c0727b68 
> > c09e8db8:   f668 bf83   b.w c0851cc2# <==
> > c09e8dbc:   d101bne.n   c09e8dc2 
> > c09e8dbe:   f435 b920   b.w c061e002 
> > 
> > It makes no sense to me at all that a function in one driver can just call
> > write_phy_reg a couple of times, but need a veneer in the middle, and put
> > that veneer in a totally unrelated function in another driver!
> 
> I think that if ld inserts a veneer for a function anywhere, branches
> from any object in the link to that target symbol can reuse the same
> veneer as a trampoline, effectively appearing to branch through an
> unrelated location to reach the destination.

That part makes sense, but it doesn't explain why ld would do that just
for the third out of four identical function calls in the example above.

> ld inserts veneers between individual input sections, but I don't
> think they have to go next to the same section the branch originates
> from.  In the above code, it looks like that series of unconditional
> branches after the end of tpci200_free_irq might be a common veneer pool
> for many different destinations.

Yes, exactly. In this build I had six of these nameless symbols, and five
of them were in this one function.

> LTO may also make the expected compilation unit boundaries disappear
> completely.  Anything could end up almost anywhere in that case.
> Files could get intermingled, inlined and generally spread all over the
> place.

I'm not sure we actually want to enable that in the kernel ;-)

In particular in combination with kallsyms, it would make the kallsyms
information rather useless when we can no longer infer a function name
from an address.

> Even so, veneers shouldn't be needed in the common case where we're not
> jumping across .rodata.
> 
> > 
> > If this is a binutils bug or gcc bug, we should probably just fix it, but it
> > might be easier to work around it by changing kallsyms.c some more.
> 
> I haven't found 

Re: [patch v2] rapidio: use after free in unregister function

2013-07-05 Thread Ryan Mallon
On 06/07/13 06:39, Dan Carpenter wrote:

> We're freeing the list iterator so we can't move to the next entry.
> Since there is only one matching mport_id, we can just break after
> finding it.
> 
> Signed-off-by: Dan Carpenter 
> ---
> v2: cleaner fix than v1
> 
> diff --git a/drivers/rapidio/rio.c b/drivers/rapidio/rio.c
> index f4f30af..2e8a20c 100644
> --- a/drivers/rapidio/rio.c
> +++ b/drivers/rapidio/rio.c
> @@ -1715,11 +1715,13 @@ int rio_unregister_scan(int mport_id, struct rio_scan *scan_ops)
>   (mport_id == RIO_MPORT_ANY && port->nscan == scan_ops))
>   port->nscan = NULL;
>  
> - list_for_each_entry(scan, &scan_list, node)
> + list_for_each_entry(scan, &scan_list, node) {
>   if (scan->mport_id == mport_id) {
>   list_del(&scan->node);
>   kfree(scan);
> + break;
>   }
> + }
>  
>   mutex_unlock(&rio_mport_list_lock);
>  


Reviewed-by: Ryan Mallon 



Re: [PATCHv4 08/10] clocksource: sun4i: Remove TIMER_SCAL variable

2013-07-05 Thread Thomas Gleixner
Maxime,

On Sat, 6 Jul 2013, Maxime Ripard wrote:
> @@ -168,8 +166,7 @@ static void __init sun4i_timer_init(struct device_node *node)
>   clocksource_mmio_init(timer_base + TIMER_CNTVAL_REG(1), node->name,
>rate, 300, 32, clocksource_mmio_readl_down);
>  
> - writel(rate / (TIMER_SCAL * HZ),
> -timer_base + TIMER_INTVAL_REG(0));
> + writel(rate / HZ, timer_base + TIMER_INTVAL_REG(0));
>  
>   /* set clock source to HOSC, 16 pre-division */
>   val = readl(timer_base + TIMER_CTL_REG(0));
> @@ -192,8 +189,8 @@ static void __init sun4i_timer_init(struct device_node *node)
>  
>   sun4i_clockevent.cpumask = cpumask_of(0);
>  
> - clockevents_config_and_register(&sun4i_clockevent, rate / TIMER_SCAL,
> - 0x1, 0xff);
> + clockevents_config_and_register(&sun4i_clockevent, rate, 0x1,
> + 0xffffffff);

I really recommend that you go out for lots of beer/wine NOW and
resume reading this mail when you recovered from that.

I definitely appreciate your responsiveness to feedback, but please go
back and read my reply to the previous version of this patch
carefully. You might eventually find out that I pointed you to another
redundant clk_get_rate() call in that code.

After you did this, please go through the other patches in that series
and check how many new instances of clk_get_rate() calls you add down
the road. I did not even bother to look whether you cleaned it up
between v3 and v4, but I'm quite sure you did not. If I'm wrong, I owe
you a beer at the next conference.

Please take your time to address all concerns and look over the whole
thing carefully before resending. This is not a speed coding contest!

Taking time and reconsidering whether a comment for patch N/M might
apply to other parts of the code or other parts of the patch series is
not optional. Review comments are mostly hints. So it's up to you to
check whether such a comment might apply to more than the particular
patch line which was commented.

Taking time and being careful actually spares time on both ends, and
aside from that it spares a lot of pointless wasted electrons sent
through the intertubes.

Have a good weekend!

 tglx


[PATCH 03/15] sched: Select a preferred node with the most numa hinting faults

2013-07-05 Thread Mel Gorman
This patch selects a preferred node for a task to run on based on the
NUMA hinting faults. This information is later used to migrate tasks
towards the node during balancing.

Signed-off-by: Mel Gorman 
---
 include/linux/sched.h |  1 +
 kernel/sched/core.c   |  1 +
 kernel/sched/fair.c   | 17 +++--
 3 files changed, 17 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 72861b4..ba46a64 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1507,6 +1507,7 @@ struct task_struct {
struct callback_head numa_work;
 
unsigned long *numa_faults;
+   int numa_preferred_nid;
 #endif /* CONFIG_NUMA_BALANCING */
 
struct rcu_head rcu;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f332ec0..ed4e785 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1593,6 +1593,7 @@ static void __sched_fork(struct task_struct *p)
p->numa_scan_seq = p->mm ? p->mm->numa_scan_seq : 0;
p->numa_migrate_seq = p->mm ? p->mm->numa_scan_seq - 1 : 0;
p->numa_scan_period = sysctl_numa_balancing_scan_delay;
+   p->numa_preferred_nid = -1;
p->numa_work.next = &p->numa_work;
p->numa_faults = NULL;
 #endif /* CONFIG_NUMA_BALANCING */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 904fd6f..c0bee41 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -793,7 +793,8 @@ unsigned int sysctl_numa_balancing_scan_delay = 1000;
 
 static void task_numa_placement(struct task_struct *p)
 {
-   int seq;
+   int seq, nid, max_nid = 0;
+   unsigned long max_faults = 0;
 
if (!p->mm) /* for example, ksmd faulting in a user's mm */
return;
@@ -802,7 +803,19 @@ static void task_numa_placement(struct task_struct *p)
return;
p->numa_scan_seq = seq;
 
-   /* FIXME: Scheduling placement policy hints go here */
+   /* Find the node with the highest number of faults */
+   for (nid = 0; nid < nr_node_ids; nid++) {
+   unsigned long faults = p->numa_faults[nid];
+   p->numa_faults[nid] >>= 1;
+   if (faults > max_faults) {
+   max_faults = faults;
+   max_nid = nid;
+   }
+   }
+
+   /* Update the tasks preferred node if necessary */
+   if (max_faults && max_nid != p->numa_preferred_nid)
+   p->numa_preferred_nid = max_nid;
 }
 
 /*
-- 
1.8.1.4



[PATCH 01/15] mm: numa: Document automatic NUMA balancing sysctls

2013-07-05 Thread Mel Gorman
Signed-off-by: Mel Gorman 
---
 Documentation/sysctl/kernel.txt | 66 +
 1 file changed, 66 insertions(+)

diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index ccd4258..0fe678c 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -354,6 +354,72 @@ utilize.
 
 ==
 
+numa_balancing
+
+Enables/disables automatic page fault based NUMA memory
+balancing. Memory is moved automatically to nodes
+that access it often.
+
+Enables/disables automatic NUMA memory balancing. On NUMA machines, there
+is a performance penalty if remote memory is accessed by a CPU. When this
+feature is enabled the kernel samples what task thread is accessing memory
+by periodically unmapping pages and later trapping a page fault. At the
+time of the page fault, it is determined if the data being accessed should
+be migrated to a local memory node.
+
+The unmapping of pages and trapping faults incur additional overhead that
+ideally is offset by improved memory locality but there is no universal
+guarantee. If the target workload is already bound to NUMA nodes then this
+feature should be disabled. Otherwise, if the system overhead from the
+feature is too high then the rate the kernel samples for NUMA hinting
+faults may be controlled by the numa_balancing_scan_period_min_ms,
+numa_balancing_scan_delay_ms, numa_balancing_scan_period_reset,
+numa_balancing_scan_period_max_ms and numa_balancing_scan_size_mb sysctls.
+
+==
+
+numa_balancing_scan_period_min_ms, numa_balancing_scan_delay_ms,
+numa_balancing_scan_period_max_ms, numa_balancing_scan_period_reset,
+numa_balancing_scan_size_mb
+
+Automatic NUMA balancing scans a task's address space and unmaps pages to
+detect if pages are properly placed or if the data should be migrated to a
+memory node local to where the task is running.  Every "scan delay" the task
+scans the next "scan size" number of pages in its address space. When the
+end of the address space is reached the scanner restarts from the beginning.
+
+In combination, the "scan delay" and "scan size" determine the scan rate.
+When "scan delay" decreases, the scan rate increases.  The scan delay and
+hence the scan rate of every task is adaptive and depends on historical
+behaviour. If pages are properly placed then the scan delay increases,
+otherwise the scan delay decreases.  The "scan size" is not adaptive but
+the higher the "scan size", the higher the scan rate.
+
+Higher scan rates incur higher system overhead as page faults must be
+trapped and potentially data must be migrated. However, the higher the scan
+rate, the more quickly a task's memory is migrated to a local node if the
+workload pattern changes and minimises performance impact due to remote
+memory accesses. These sysctls control the thresholds for scan delays and
+the number of pages scanned.
+
+numa_balancing_scan_period_min_ms is the minimum delay in milliseconds
+between scans. It effectively controls the maximum scanning rate for
+each task.
+
+numa_balancing_scan_delay_ms is the starting "scan delay" used for a task
+when it initially forks.
+
+numa_balancing_scan_period_max_ms is the maximum delay between scans. It
+effectively controls the minimum scanning rate for each task.
+
+numa_balancing_scan_size_mb is how many megabytes worth of pages are
+scanned for a given scan.
+
+numa_balancing_scan_period_reset is a blunt instrument that controls how
+often a task's scan delay is reset to detect sudden changes in task behaviour.
+
+==
+
 osrelease, ostype & version:
 
 # cat osrelease
-- 
1.8.1.4



[PATCH 0/15] Basic scheduler support for automatic NUMA balancing V3

2013-07-05 Thread Mel Gorman
This continues to build on the previous feedback. The results are a mix of
gains and losses but when looking at the losses I think it's also important
to consider the reduced overhead when the patches are applied. I still
have not had the chance to closely review Peter's or Srikar's approach to
scheduling but the tests are queued to do a comparison.

Changelog since V2
o Reshuffle to match Peter's implied preference for layout
o Reshuffle to move private/shared split towards end of series to make it
  easier to evaluate the impact
o Use PID information to identify private accesses
o Set the floor for PTE scanning based on virtual address space scan rates
  instead of time
o Some locking improvements
o Do not preempt pinned tasks unless they are kernel threads

Changelog since V1
o Scan pages with elevated map count (shared pages)
o Scale scan rates based on the vsz of the process so the sampling of the
  task is independent of its size
o Favour moving towards nodes with more faults even if it's not the
  preferred node
o Laughably basic accounting of a compute overloaded node when selecting
  the preferred node.
o Applied review comments

This series integrates basic scheduler support for automatic NUMA balancing.
It borrows very heavily from Peter Ziljstra's work in "sched, numa, mm:
Add adaptive NUMA affinity support" but deviates too much to preserve
Signed-off-bys. As before, if the relevant authors are ok with it I'll
add Signed-off-bys (or add them yourselves if you pick the patches up).

This is still far from complete and there are known performance gaps
between this series and manual binding (when that is possible). As before,
the intention is not to complete the work but to incrementally improve
mainline and preserve bisectability for any bug reports that crop up. In
some cases performance may be worse unfortunately and when that happens
it will have to be judged if the system overhead is lower and if so,
is it still an acceptable direction as a stepping stone to something better.

Patch 1 adds sysctl documentation

Patch 2 tracks NUMA hinting faults per-task and per-node

Patches 3-5 select a preferred node at the end of a PTE scan based on which
node incurred the highest number of NUMA faults. When the balancer
is comparing two CPUs it will prefer to locate tasks on their
preferred node.

Patch 6 reschedules a task when a preferred node is selected if it is not
running on that node already. This avoids waiting for the scheduler
to move the task slowly.

Patch 7 adds infrastructure to allow separate tracking of shared/private
pages but treats all faults as if they are private accesses. Laying
it out this way reduces churn later in the series when private
fault detection is introduced

Patch 8 replaces the PTE scanning reset hammer and instead increases the
scanning rate when an otherwise settled task changes its
preferred node.

Patch 9 avoids some unnecessary allocation

Patch 10 sets the scan rate proportional to the size of the task being scanned.

Patches 11-12 kick away some training wheels and scan shared pages and small
VMAs.

Patch 13 introduces private fault detection based on the PID of the faulting
process and accounts for shared/private accesses differently

Patch 14 accounts for how many "preferred placed" tasks are running on a node
and attempts to avoid overloading them. This patch is the primary
candidate for replacing with proper load tracking of nodes. This patch
is crude but acts as a basis for comparison

Patch 15 favours moving tasks towards nodes where more faults were incurred
even if it is not the preferred node.

Testing on this is only partial as full tests take a long time to run. A
full specjbb for both single and multi takes over 4 hours. NPB D class
also takes a few hours. With all the kernels in question, it'll take a
weekend to churn through them so here is the shorter tests.

I tested 9 kernels using 3.9.0 as a baseline

o 3.9.0-vanilla             vanilla kernel with automatic numa balancing enabled
o 3.9.0-favorpref-v3        Patches 1-9
o 3.9.0-scalescan-v3        Patches 1-10
o 3.9.0-scanshared-v3       Patches 1-12
o 3.9.0-splitprivate-v3     Patches 1-13
o 3.9.0-accountpreferred-v3 Patches 1-14
o 3.9.0-peterz-v3           Patches 1-14 + Peter's scheduling patch
o 3.9.0-srikar-v3           vanilla kernel + Srikar's scheduling patch
o 3.9.0-favorfaults-v3      Patches 1-15

Note that Peter's patch has been rebased by me and acts as a replacement
for the crude per-node accounting. Srikar's patch was standalone and I
made no attempt to pick it apart and rebase it on top of the series.

This is SpecJBB running on a 4-socket machine with THP enabled and one JVM
running for the whole system. Only a limited number of clients are executed
to save on time.

specjbb
3.9.0 

[PATCH 04/15] sched: Update NUMA hinting faults once per scan

2013-07-05 Thread Mel Gorman
NUMA hinting faults counts and placement decisions are both recorded in the
same array which distorts the samples in an unpredictable fashion. The values
linearly accumulate during the scan and then decay creating a sawtooth-like
pattern in the per-node counts. It also means that placement decisions are
time sensitive. At best it means that it is very difficult to state that
the buffer holds a decaying average of past faulting behaviour. At worst,
it can confuse the load balancer if it sees one node with an artificially high
count due to very recent faulting activity and may create a bouncing effect.

This patch adds a second array. numa_faults stores the historical data
which is used for placement decisions. numa_faults_buffer holds the
fault activity during the current scan window. When the scan completes,
numa_faults decays and the values from numa_faults_buffer are copied
across.

Signed-off-by: Mel Gorman 
---
 include/linux/sched.h | 13 +
 kernel/sched/core.c   |  1 +
 kernel/sched/fair.c   | 16 +---
 3 files changed, 27 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index ba46a64..42f9818 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1506,7 +1506,20 @@ struct task_struct {
u64 node_stamp; /* migration stamp  */
struct callback_head numa_work;
 
+   /*
+* Exponential decaying average of faults on a per-node basis.
+* Scheduling placement decisions are made based on these counts.
+* The values remain static for the duration of a PTE scan
+*/
unsigned long *numa_faults;
+
+   /*
+* numa_faults_buffer records faults per node during the current
+* scan window. When the scan completes, the counts in numa_faults
+* decay and these values are copied.
+*/
+   unsigned long *numa_faults_buffer;
+
int numa_preferred_nid;
 #endif /* CONFIG_NUMA_BALANCING */
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ed4e785..0bd541c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1596,6 +1596,7 @@ static void __sched_fork(struct task_struct *p)
p->numa_preferred_nid = -1;
p->numa_work.next = &p->numa_work;
p->numa_faults = NULL;
+   p->numa_faults_buffer = NULL;
 #endif /* CONFIG_NUMA_BALANCING */
 }
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c0bee41..8dc9ff9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -805,8 +805,14 @@ static void task_numa_placement(struct task_struct *p)
 
/* Find the node with the highest number of faults */
for (nid = 0; nid < nr_node_ids; nid++) {
-   unsigned long faults = p->numa_faults[nid];
+   unsigned long faults;
+
+   /* Decay existing window and copy faults since last scan */
p->numa_faults[nid] >>= 1;
+   p->numa_faults[nid] += p->numa_faults_buffer[nid];
+   p->numa_faults_buffer[nid] = 0;
+
+   faults = p->numa_faults[nid];
if (faults > max_faults) {
max_faults = faults;
max_nid = nid;
@@ -832,9 +838,13 @@ void task_numa_fault(int node, int pages, bool migrated)
if (unlikely(!p->numa_faults)) {
int size = sizeof(*p->numa_faults) * nr_node_ids;
 
-   p->numa_faults = kzalloc(size, GFP_KERNEL);
+   /* numa_faults and numa_faults_buffer share the allocation */
+   p->numa_faults = kzalloc(size * 2, GFP_KERNEL);
if (!p->numa_faults)
return;
+
+   BUG_ON(p->numa_faults_buffer);
+   p->numa_faults_buffer = p->numa_faults + nr_node_ids;
}
 
/*
@@ -848,7 +858,7 @@ void task_numa_fault(int node, int pages, bool migrated)
task_numa_placement(p);
 
/* Record the fault, double the weight if pages were migrated */
-   p->numa_faults[node] += pages << migrated;
+   p->numa_faults_buffer[node] += pages << migrated;
 }
 
 static void reset_ptenuma_scan(struct task_struct *p)
-- 
1.8.1.4



[PATCH 05/15] sched: Favour moving tasks towards the preferred node

2013-07-05 Thread Mel Gorman
This patch favours moving tasks towards the preferred NUMA node when it
has just been selected. Ideally this is self-reinforcing as the longer
the task runs on that node, the more faults it should incur causing
task_numa_placement to keep the task running on that node. In reality a
big weakness is that the nodes CPUs can be overloaded and it would be more
efficient to queue tasks on an idle node and migrate to the new node. This
would require additional smarts in the balancer so for now the balancer
will simply prefer to place the task on the preferred node for a PTE scans
which is controlled by the numa_balancing_settle_count sysctl. Once the
settle_count number of scans has complete the schedule is free to place
the task on an alternative node if the load is imbalanced.

[sri...@linux.vnet.ibm.com: Fixed statistics]
Signed-off-by: Mel Gorman 
---
 Documentation/sysctl/kernel.txt |  8 +-
 include/linux/sched.h   |  1 +
 kernel/sched/core.c |  3 ++-
 kernel/sched/fair.c | 60 ++---
 kernel/sysctl.c |  7 +
 5 files changed, 73 insertions(+), 6 deletions(-)

diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index 0fe678c..246b128 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -374,7 +374,8 @@ feature should be disabled. Otherwise, if the system 
overhead from the
 feature is too high then the rate the kernel samples for NUMA hinting
 faults may be controlled by the numa_balancing_scan_period_min_ms,
 numa_balancing_scan_delay_ms, numa_balancing_scan_period_reset,
-numa_balancing_scan_period_max_ms and numa_balancing_scan_size_mb sysctls.
+numa_balancing_scan_period_max_ms, numa_balancing_scan_size_mb and
+numa_balancing_settle_count sysctls.
 
 ==
 
@@ -418,6 +419,11 @@ scanned for a given scan.
 numa_balancing_scan_period_reset is a blunt instrument that controls how
 often a task's scan delay is reset to detect sudden changes in task behaviour.
 
+numa_balancing_settle_count is how many scan periods must complete before
+the schedule balancer stops pushing the task towards a preferred node. This
+gives the scheduler a chance to place the task on an alternative node if the
+preferred node is overloaded.
+
 ==
 
 osrelease, ostype & version:
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 42f9818..82a6136 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -815,6 +815,7 @@ enum cpu_idle_type {
 #define SD_ASYM_PACKING	0x0800	/* Place busy groups earlier in the domain */
 #define SD_PREFER_SIBLING	0x1000	/* Prefer to place tasks in a sibling domain */
 #define SD_OVERLAP		0x2000	/* sched_domains of this level overlap */
+#define SD_NUMA		0x4000	/* cross-node balancing */
 
 extern int __weak arch_sd_sibiling_asym_packing(void);
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0bd541c..5e02507 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1591,7 +1591,7 @@ static void __sched_fork(struct task_struct *p)
 
p->node_stamp = 0ULL;
p->numa_scan_seq = p->mm ? p->mm->numa_scan_seq : 0;
-   p->numa_migrate_seq = p->mm ? p->mm->numa_scan_seq - 1 : 0;
+   p->numa_migrate_seq = 0;
p->numa_scan_period = sysctl_numa_balancing_scan_delay;
p->numa_preferred_nid = -1;
p->numa_work.next = &p->numa_work;
@@ -6141,6 +6141,7 @@ sd_numa_init(struct sched_domain_topology_level *tl, int cpu)
| 0*SD_SHARE_PKG_RESOURCES
| 1*SD_SERIALIZE
| 0*SD_PREFER_SIBLING
+   | 1*SD_NUMA
| sd_local_flags(level)
,
.last_balance   = jiffies,
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8dc9ff9..5055bf9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -791,6 +791,15 @@ unsigned int sysctl_numa_balancing_scan_size = 256;
 /* Scan @scan_size MB every @scan_period after an initial @scan_delay in ms */
 unsigned int sysctl_numa_balancing_scan_delay = 1000;
 
+/*
+ * Once a preferred node is selected the scheduler balancer will prefer moving
+ * a task to that node for sysctl_numa_balancing_settle_count number of PTE
+ * scans. This will give the process the chance to accumulate more faults on
+ * the preferred node but still allow the scheduler to move the task again if
+ * the nodes CPUs are overloaded.
+ */
+unsigned int sysctl_numa_balancing_settle_count __read_mostly = 3;
+
 static void task_numa_placement(struct task_struct *p)
 {
int seq, nid, max_nid = 0;
@@ -802,6 +811,7 @@ static void task_numa_placement(struct 

[PATCH 02/15] sched: Track NUMA hinting faults on per-node basis

2013-07-05 Thread Mel Gorman
This patch tracks which nodes NUMA hinting faults were incurred on.  Greater
weight is given if the pages were to be migrated, on the understanding
that such faults cost significantly more. If a task has paid the cost of
migrating data to that node then in the future it would be preferred if the
task did not migrate the data again unnecessarily. This information is later
used to schedule a task on the node incurring the most NUMA hinting faults.

Signed-off-by: Mel Gorman 
---
 include/linux/sched.h |  2 ++
 kernel/sched/core.c   |  3 +++
 kernel/sched/fair.c   | 12 +++-
 kernel/sched/sched.h  | 11 +++
 4 files changed, 27 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index e692a02..72861b4 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1505,6 +1505,8 @@ struct task_struct {
unsigned int numa_scan_period;
u64 node_stamp; /* migration stamp  */
struct callback_head numa_work;
+
+   unsigned long *numa_faults;
 #endif /* CONFIG_NUMA_BALANCING */
 
struct rcu_head rcu;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 67d0465..f332ec0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1594,6 +1594,7 @@ static void __sched_fork(struct task_struct *p)
p->numa_migrate_seq = p->mm ? p->mm->numa_scan_seq - 1 : 0;
p->numa_scan_period = sysctl_numa_balancing_scan_delay;
p->numa_work.next = &p->numa_work;
+   p->numa_faults = NULL;
 #endif /* CONFIG_NUMA_BALANCING */
 }
 
@@ -1853,6 +1854,8 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
if (mm)
mmdrop(mm);
if (unlikely(prev_state == TASK_DEAD)) {
+   task_numa_free(prev);
+
/*
 * Remove function-return probe instances associated with this
 * task and put them back on the free list.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7a33e59..904fd6f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -815,7 +815,14 @@ void task_numa_fault(int node, int pages, bool migrated)
if (!sched_feat_numa(NUMA))
return;
 
-   /* FIXME: Allocate task-specific structure for placement policy here */
+   /* Allocate buffer to track faults on a per-node basis */
+   if (unlikely(!p->numa_faults)) {
+   int size = sizeof(*p->numa_faults) * nr_node_ids;
+
+   p->numa_faults = kzalloc(size, GFP_KERNEL);
+   if (!p->numa_faults)
+   return;
+   }
 
/*
 * If pages are properly placed (did not migrate) then scan slower.
@@ -826,6 +833,9 @@ void task_numa_fault(int node, int pages, bool migrated)
p->numa_scan_period + jiffies_to_msecs(10));
 
task_numa_placement(p);
+
+   /* Record the fault, double the weight if pages were migrated */
+   p->numa_faults[node] += pages << migrated;
 }
 
 static void reset_ptenuma_scan(struct task_struct *p)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index cc03cfd..c5f773d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -503,6 +503,17 @@ DECLARE_PER_CPU(struct rq, runqueues);
 #define cpu_curr(cpu)  (cpu_rq(cpu)->curr)
 #define raw_rq()   (&__raw_get_cpu_var(runqueues))
 
+#ifdef CONFIG_NUMA_BALANCING
+static inline void task_numa_free(struct task_struct *p)
+{
+   kfree(p->numa_faults);
+}
+#else /* CONFIG_NUMA_BALANCING */
+static inline void task_numa_free(struct task_struct *p)
+{
+}
+#endif /* CONFIG_NUMA_BALANCING */
+
 #ifdef CONFIG_SMP
 
 #define rcu_dereference_check_sched_domain(p) \
-- 
1.8.1.4



[PATCH 09/15] sched: Check current->mm before allocating NUMA faults

2013-07-05 Thread Mel Gorman
task_numa_placement checks current->mm, but only after buffers for faults
have already been uselessly allocated. Move the check earlier.

[pet...@infradead.org: Identified the problem]
Signed-off-by: Mel Gorman 
---
 kernel/sched/fair.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3c69b599..aee3e0b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -843,8 +843,6 @@ static void task_numa_placement(struct task_struct *p)
int seq, nid, max_nid = 0;
unsigned long max_faults = 0;
 
-   if (!p->mm) /* for example, ksmd faulting in a user's mm */
-   return;
seq = ACCESS_ONCE(p->mm->numa_scan_seq);
if (p->numa_scan_seq == seq)
return;
@@ -921,6 +919,10 @@ void task_numa_fault(int last_nid, int node, int pages, bool migrated)
if (!sched_feat_numa(NUMA))
return;
 
+   /* for example, ksmd faulting in a user's mm */
+   if (!p->mm)
+   return;
+
/* For now, do not attempt to detect private/shared accesses */
priv = 1;
 
-- 
1.8.1.4



[PATCH 06/15] sched: Reschedule task on preferred NUMA node once selected

2013-07-05 Thread Mel Gorman
A preferred node is selected based on the node on which the most NUMA
hinting faults were incurred. There is no guarantee that the task is
running on that node at the time, so this patch reschedules the task to
run on the idlest CPU of the selected node. This avoids waiting for the
balancer to make a decision.

Signed-off-by: Mel Gorman 
---
 kernel/sched/core.c  | 17 
 kernel/sched/fair.c  | 55 +++-
 kernel/sched/sched.h |  1 +
 3 files changed, 72 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5e02507..e4c1832 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -992,6 +992,23 @@ struct migration_arg {
 
 static int migration_cpu_stop(void *data);
 
+#ifdef CONFIG_NUMA_BALANCING
+/* Migrate current task p to target_cpu */
+int migrate_task_to(struct task_struct *p, int target_cpu)
+{
+   struct migration_arg arg = { p, target_cpu };
+   int curr_cpu = task_cpu(p);
+
+   if (curr_cpu == target_cpu)
+   return 0;
+
+   if (!cpumask_test_cpu(target_cpu, tsk_cpus_allowed(p)))
+   return -EINVAL;
+
+   return stop_one_cpu(curr_cpu, migration_cpu_stop, &arg);
+}
+#endif
+
 /*
  * wait_task_inactive - wait for a thread to unschedule.
  *
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5055bf9..5a01dcb 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -800,6 +800,40 @@ unsigned int sysctl_numa_balancing_scan_delay = 1000;
  */
 unsigned int sysctl_numa_balancing_settle_count __read_mostly = 3;
 
+static unsigned long weighted_cpuload(const int cpu);
+
+
+static int
+find_idlest_cpu_node(int this_cpu, int nid)
+{
+   unsigned long load, min_load = ULONG_MAX;
+   int i, idlest_cpu = this_cpu;
+
+   BUG_ON(cpu_to_node(this_cpu) == nid);
+
+   rcu_read_lock();
+   for_each_cpu(i, cpumask_of_node(nid)) {
+   load = weighted_cpuload(i);
+
+   if (load < min_load) {
+   /*
+* Kernel threads can be preempted. For others, do
+* not preempt if running on their preferred node
+* or pinned.
+*/
+   struct task_struct *p = cpu_rq(i)->curr;
+   if ((p->flags & PF_KTHREAD) ||
+   (p->numa_preferred_nid != nid && p->nr_cpus_allowed > 1)) {
+   min_load = load;
+   idlest_cpu = i;
+   }
+   }
+   }
+   rcu_read_unlock();
+
+   return idlest_cpu;
+}
+
 static void task_numa_placement(struct task_struct *p)
 {
int seq, nid, max_nid = 0;
@@ -829,10 +863,29 @@ static void task_numa_placement(struct task_struct *p)
}
}
 
-   /* Update the tasks preferred node if necessary */
+   /*
+* Record the preferred node as the node with the most faults,
+* requeue the task to be running on the idlest CPU on the
+* preferred node and reset the scanning rate to recheck
+* the working set placement.
+*/
if (max_faults && max_nid != p->numa_preferred_nid) {
+   int preferred_cpu;
+
+   /*
+* If the task is not on the preferred node then find the most
+* idle CPU to migrate to.
+*/
+   preferred_cpu = task_cpu(p);
+   if (cpu_to_node(preferred_cpu) != max_nid) {
+   preferred_cpu = find_idlest_cpu_node(preferred_cpu,
+max_nid);
+   }
+
+   /* Update the preferred nid and migrate task if possible */
p->numa_preferred_nid = max_nid;
p->numa_migrate_seq = 0;
+   migrate_task_to(p, preferred_cpu);
}
 }
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c5f773d..795346d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -504,6 +504,7 @@ DECLARE_PER_CPU(struct rq, runqueues);
 #define raw_rq()   (&__raw_get_cpu_var(runqueues))
 
 #ifdef CONFIG_NUMA_BALANCING
+extern int migrate_task_to(struct task_struct *p, int cpu);
 static inline void task_numa_free(struct task_struct *p)
 {
kfree(p->numa_faults);
-- 
1.8.1.4



[PATCH 08/15] sched: Increase NUMA PTE scanning when a new preferred node is selected

2013-07-05 Thread Mel Gorman
The NUMA PTE scan is reset every sysctl_numa_balancing_scan_period_reset
in case of phase changes. This is crude and it is clearly visible in graphs
when the PTE scanner resets even if the workload is already balanced. This
patch increases the scan rate if the preferred node is updated and the
task is currently running on the node to recheck if the placement
decision is correct. In the optimistic expectation that the placement
decisions will be correct, the maximum period between scans is also
increased to reduce overhead due to automatic NUMA balancing.

Signed-off-by: Mel Gorman 
---
 Documentation/sysctl/kernel.txt | 11 +++
 include/linux/mm_types.h|  3 ---
 include/linux/sched/sysctl.h|  1 -
 kernel/sched/core.c |  1 -
 kernel/sched/fair.c | 27 ---
 kernel/sysctl.c |  7 ---
 6 files changed, 15 insertions(+), 35 deletions(-)

diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index 246b128..a275042 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -373,15 +373,13 @@ guarantee. If the target workload is already bound to NUMA nodes then this
 feature should be disabled. Otherwise, if the system overhead from the
 feature is too high then the rate the kernel samples for NUMA hinting
 faults may be controlled by the numa_balancing_scan_period_min_ms,
-numa_balancing_scan_delay_ms, numa_balancing_scan_period_reset,
-numa_balancing_scan_period_max_ms, numa_balancing_scan_size_mb and
-numa_balancing_settle_count sysctls.
+numa_balancing_scan_delay_ms, numa_balancing_scan_period_max_ms,
+numa_balancing_scan_size_mb and numa_balancing_settle_count sysctls.
 
 ==
 
 numa_balancing_scan_period_min_ms, numa_balancing_scan_delay_ms,
-numa_balancing_scan_period_max_ms, numa_balancing_scan_period_reset,
-numa_balancing_scan_size_mb
+numa_balancing_scan_period_max_ms, numa_balancing_scan_size_mb
 
 Automatic NUMA balancing scans tasks address space and unmaps pages to
 detect if pages are properly placed or if the data should be migrated to a
@@ -416,9 +414,6 @@ effectively controls the minimum scanning rate for each task.
 numa_balancing_scan_size_mb is how many megabytes worth of pages are
 scanned for a given scan.
 
-numa_balancing_scan_period_reset is a blunt instrument that controls how
-often a tasks scan delay is reset to detect sudden changes in task behaviour.
-
 numa_balancing_settle_count is how many scan periods must complete before
 the schedule balancer stops pushing the task towards a preferred node. This
 gives the scheduler a chance to place the task on an alternative node if the
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index ace9a5f..de70964 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -421,9 +421,6 @@ struct mm_struct {
 */
unsigned long numa_next_scan;
 
-   /* numa_next_reset is when the PTE scanner period will be reset */
-   unsigned long numa_next_reset;
-
/* Restart point for scanning and setting pte_numa */
unsigned long numa_scan_offset;
 
diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index bf8086b..10d16c4f 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -47,7 +47,6 @@ extern enum sched_tunable_scaling sysctl_sched_tunable_scaling;
 extern unsigned int sysctl_numa_balancing_scan_delay;
 extern unsigned int sysctl_numa_balancing_scan_period_min;
 extern unsigned int sysctl_numa_balancing_scan_period_max;
-extern unsigned int sysctl_numa_balancing_scan_period_reset;
 extern unsigned int sysctl_numa_balancing_scan_size;
 extern unsigned int sysctl_numa_balancing_settle_count;
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e4c1832..02db92a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1602,7 +1602,6 @@ static void __sched_fork(struct task_struct *p)
 #ifdef CONFIG_NUMA_BALANCING
if (p->mm && atomic_read(&p->mm->mm_users) == 1) {
p->mm->numa_next_scan = jiffies;
-   p->mm->numa_next_reset = jiffies;
p->mm->numa_scan_seq = 0;
}
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0f3f01c..3c69b599 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -782,8 +782,7 @@ update_stats_curr_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
  * numa task sample period in ms
  */
 unsigned int sysctl_numa_balancing_scan_period_min = 100;
-unsigned int sysctl_numa_balancing_scan_period_max = 100*50;
-unsigned int sysctl_numa_balancing_scan_period_reset = 100*600;
+unsigned int sysctl_numa_balancing_scan_period_max = 100*600;
 
 /* Portion of address space to scan in MB */
 unsigned int sysctl_numa_balancing_scan_size = 256;
@@ -882,6 +881,7 @@ static void task_numa_placement(struct task_struct *p)
 */
if (max_faults && 

[PATCH 12/15] sched: Remove check that skips small VMAs

2013-07-05 Thread Mel Gorman
task_numa_work skips small VMAs. At the time, the intent was to reduce the
scanning overhead, which was considerable. It is a dubious hack at best: it
would make much more sense to cache where faults have been observed and only
rescan those regions during subsequent PTE scans. Remove this hack as
motivation to do it properly in the future.

Signed-off-by: Mel Gorman 
---
 kernel/sched/fair.c | 4 
 1 file changed, 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 66306c7..47276a3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1089,10 +1089,6 @@ void task_numa_work(struct callback_head *work)
if (!vma_migratable(vma))
continue;
 
-   /* Skip small VMAs. They are not likely to be of relevance */
-   if (vma->vm_end - vma->vm_start < HPAGE_SIZE)
-   continue;
-
do {
start = max(start, vma->vm_start);
end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);
-- 
1.8.1.4



[PATCH 11/15] mm: numa: Scan pages with elevated page_mapcount

2013-07-05 Thread Mel Gorman
Currently automatic NUMA balancing is unable to distinguish between falsely
shared and private pages except by ignoring pages with an elevated
page_mapcount entirely. This avoids shared pages bouncing between the nodes
of the tasks using them, but it also ignores quite a lot of data.

This patch kicks away the training wheels in preparation for the
shared/private page identification added later in the series. The ordering is
so that the impact of the shared/private detection can be easily measured.
Note that the patch does not migrate shared, file-backed pages within VMAs
marked VM_EXEC, as these are generally shared library pages. Migrating such
pages is not beneficial: there is an expectation that they are read-shared
between caches, and iTLB and iCache pressure is generally low.

Signed-off-by: Mel Gorman 
---
 include/linux/migrate.h |  7 ---
 mm/memory.c |  4 ++--
 mm/migrate.c| 17 ++---
 mm/mprotect.c   |  4 +---
 4 files changed, 13 insertions(+), 19 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index a405d3dc..e7e26af 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -92,11 +92,12 @@ static inline int migrate_huge_page_move_mapping(struct address_space *mapping,
 #endif /* CONFIG_MIGRATION */
 
 #ifdef CONFIG_NUMA_BALANCING
-extern int migrate_misplaced_page(struct page *page, int node);
-extern int migrate_misplaced_page(struct page *page, int node);
+extern int migrate_misplaced_page(struct page *page,
+ struct vm_area_struct *vma, int node);
 extern bool migrate_ratelimited(int node);
 #else
-static inline int migrate_misplaced_page(struct page *page, int node)
+static inline int migrate_misplaced_page(struct page *page,
+struct vm_area_struct *vma, int node)
 {
return -EAGAIN; /* can't migrate now */
 }
diff --git a/mm/memory.c b/mm/memory.c
index c28bf52..b06022a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3581,7 +3581,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
}
 
/* Migrate to the requested node */
-   migrated = migrate_misplaced_page(page, target_nid);
+   migrated = migrate_misplaced_page(page, vma, target_nid);
if (migrated)
current_nid = target_nid;
 
@@ -3666,7 +3666,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 
/* Migrate to the requested node */
pte_unmap_unlock(pte, ptl);
-   migrated = migrate_misplaced_page(page, target_nid);
+   migrated = migrate_misplaced_page(page, vma, target_nid);
if (migrated)
curr_nid = target_nid;
task_numa_fault(last_nid, curr_nid, 1, migrated);
diff --git a/mm/migrate.c b/mm/migrate.c
index 3bbaf5d..23f8122 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1579,7 +1579,8 @@ int numamigrate_isolate_page(pg_data_t *pgdat, struct page *page)
  * node. Caller is expected to have an elevated reference count on
  * the page that will be dropped by this function before returning.
  */
-int migrate_misplaced_page(struct page *page, int node)
+int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma,
+  int node)
 {
pg_data_t *pgdat = NODE_DATA(node);
int isolated;
@@ -1587,10 +1588,11 @@ int migrate_misplaced_page(struct page *page, int node)
LIST_HEAD(migratepages);
 
/*
-* Don't migrate pages that are mapped in multiple processes.
-* TODO: Handle false sharing detection instead of this hammer
+* Don't migrate file pages that are mapped in multiple processes
+* with execute permissions as they are probably shared libraries.
 */
-   if (page_mapcount(page) != 1)
+   if (page_mapcount(page) != 1 && page_is_file_cache(page) &&
+   (vma->vm_flags & VM_EXEC))
goto out;
 
/*
@@ -1641,13 +1643,6 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
int page_lru = page_is_file_cache(page);
 
/*
-* Don't migrate pages that are mapped in multiple processes.
-* TODO: Handle false sharing detection instead of this hammer
-*/
-   if (page_mapcount(page) != 1)
-   goto out_dropref;
-
-   /*
 * Rate-limit the amount of data that is being migrated to a node.
 * Optimal placement is no good if the memory bus is saturated and
 * all the time is being spent migrating!
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 94722a4..cacc64a 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -69,9 +69,7 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
if (last_nid != this_nid)
all_same_node = false;
 
- 

[PATCH 07/15] sched: Add infrastructure for split shared/private accounting of NUMA hinting faults

2013-07-05 Thread Mel Gorman
Ideally it would be possible to distinguish between NUMA hinting faults
that are private to a task and those that are shared.  This patch prepares
infrastructure for separately accounting shared and private faults by
allocating the necessary buffers and passing in relevant information. For
now, all faults are treated as private and detection will be introduced
later.

Signed-off-by: Mel Gorman 
---
 include/linux/sched.h |  5 +++--
 kernel/sched/fair.c   | 33 -
 mm/huge_memory.c  |  7 ---
 mm/memory.c   |  9 ++---
 4 files changed, 37 insertions(+), 17 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 82a6136..b81195e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1600,10 +1600,11 @@ struct task_struct {
 #define tsk_cpus_allowed(tsk) (&(tsk)->cpus_allowed)
 
 #ifdef CONFIG_NUMA_BALANCING
-extern void task_numa_fault(int node, int pages, bool migrated);
+extern void task_numa_fault(int last_node, int node, int pages, bool migrated);
 extern void set_numabalancing_state(bool enabled);
 #else
-static inline void task_numa_fault(int node, int pages, bool migrated)
+static inline void task_numa_fault(int last_node, int node, int pages,
+  bool migrated)
 {
 }
 static inline void set_numabalancing_state(bool enabled)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5a01dcb..0f3f01c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -834,6 +834,11 @@ find_idlest_cpu_node(int this_cpu, int nid)
return idlest_cpu;
 }
 
+static inline int task_faults_idx(int nid, int priv)
+{
+   return 2 * nid + priv;
+}
+
 static void task_numa_placement(struct task_struct *p)
 {
int seq, nid, max_nid = 0;
@@ -850,13 +855,19 @@ static void task_numa_placement(struct task_struct *p)
/* Find the node with the highest number of faults */
for (nid = 0; nid < nr_node_ids; nid++) {
unsigned long faults;
+   int priv, i;
 
-   /* Decay existing window and copy faults since last scan */
-   p->numa_faults[nid] >>= 1;
-   p->numa_faults[nid] += p->numa_faults_buffer[nid];
-   p->numa_faults_buffer[nid] = 0;
+   for (priv = 0; priv < 2; priv++) {
+   i = task_faults_idx(nid, priv);
 
-   faults = p->numa_faults[nid];
+   /* Decay existing window, copy faults since last scan */
+   p->numa_faults[i] >>= 1;
+   p->numa_faults[i] += p->numa_faults_buffer[i];
+   p->numa_faults_buffer[i] = 0;
+   }
+
+   /* Find maximum private faults */
+   faults = p->numa_faults[task_faults_idx(nid, 1)];
if (faults > max_faults) {
max_faults = faults;
max_nid = nid;
@@ -892,16 +903,20 @@ static void task_numa_placement(struct task_struct *p)
 /*
  * Got a PROT_NONE fault for a page on @node.
  */
-void task_numa_fault(int node, int pages, bool migrated)
+void task_numa_fault(int last_nid, int node, int pages, bool migrated)
 {
struct task_struct *p = current;
+   int priv;
 
if (!sched_feat_numa(NUMA))
return;
 
+   /* For now, do not attempt to detect private/shared accesses */
+   priv = 1;
+
/* Allocate buffer to track faults on a per-node basis */
if (unlikely(!p->numa_faults)) {
-   int size = sizeof(*p->numa_faults) * nr_node_ids;
+   int size = sizeof(*p->numa_faults) * 2 * nr_node_ids;
 
/* numa_faults and numa_faults_buffer share the allocation */
p->numa_faults = kzalloc(size * 2, GFP_KERNEL);
@@ -909,7 +924,7 @@ void task_numa_fault(int node, int pages, bool migrated)
return;
 
BUG_ON(p->numa_faults_buffer);
-   p->numa_faults_buffer = p->numa_faults + nr_node_ids;
+   p->numa_faults_buffer = p->numa_faults + (2 * nr_node_ids);
}
 
/*
@@ -923,7 +938,7 @@ void task_numa_fault(int node, int pages, bool migrated)
task_numa_placement(p);
 
/* Record the fault, double the weight if pages were migrated */
-   p->numa_faults_buffer[node] += pages << migrated;
+   p->numa_faults_buffer[task_faults_idx(node, priv)] += pages << migrated;
 }
 
 static void reset_ptenuma_scan(struct task_struct *p)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e2f7f5aa..7cd7114 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1292,7 +1292,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 {
struct page *page;
unsigned long haddr = addr & HPAGE_PMD_MASK;
-   int target_nid;
+   int target_nid, last_nid;
int current_nid = -1;
bool migrated;
 
@@ -1307,6 +1307,7 @@ int 

[PATCH 14/15] sched: Account for the number of preferred tasks running on a node when selecting a preferred node

2013-07-05 Thread Mel Gorman
It is preferred that tasks always run local to their memory, but that is not
optimal if the node is compute overloaded and the task is failing to get
access to a CPU. That situation has the load balancer trying to move tasks
off the node competing with NUMA balancing moving them back.

Ultimately, it will be necessary to calculate the compute load of each node
and minimise that as well as the number of remote accesses until the optimal
balance point is reached. Begin this process by simply accounting for the
number of tasks that are running on their preferred node. When deciding what
node to place a task on, do not place a task on a node that has more
preferred-placement tasks than there are CPUs.

Signed-off-by: Mel Gorman 
---
 kernel/sched/core.c  | 34 ++
 kernel/sched/fair.c  | 49 +++--
 kernel/sched/sched.h |  5 +
 3 files changed, 82 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 02db92a..13b9068 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6112,6 +6112,40 @@ static struct sched_domain_topology_level default_topology[] = {
 
static struct sched_domain_topology_level *sched_domain_topology = default_topology;
 
+#ifdef CONFIG_NUMA_BALANCING
+void sched_setnuma(struct task_struct *p, int nid, int idlest_cpu)
+{
+   struct rq *rq;
+   unsigned long flags;
+   bool on_rq, running;
+
+   /*
+* Dequeue task before updating preferred_nid so
+* rq->nr_preferred_running is accurate
+*/
+   rq = task_rq_lock(p, &flags);
+   on_rq = p->on_rq;
+   running = task_current(rq, p);
+
+   if (on_rq)
+   dequeue_task(rq, p, 0);
+   if (running)
+   p->sched_class->put_prev_task(rq, p);
+
+   /* Update the preferred nid and migrate task if possible */
+   p->numa_preferred_nid = nid;
+   p->numa_migrate_seq = 0;
+
+   /* Requeue task if necessary */
+   if (running)
+   p->sched_class->set_curr_task(rq);
+   if (on_rq)
+   enqueue_task(rq, p, 0);
+   task_rq_unlock(rq, p, &flags);
+}
+
+#endif
+
 #ifdef CONFIG_NUMA
 
 static int sched_domains_numa_levels;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5933e24..c303ba6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -777,6 +777,18 @@ update_stats_curr_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
  * Scheduling class queueing methods:
  */
 
+static void account_numa_enqueue(struct rq *rq, struct task_struct *p)
+{
+   rq->nr_preferred_running +=
+   (cpu_to_node(task_cpu(p)) == p->numa_preferred_nid);
+}
+
+static void account_numa_dequeue(struct rq *rq, struct task_struct *p)
+{
+   rq->nr_preferred_running -=
+   (cpu_to_node(task_cpu(p)) == p->numa_preferred_nid);
+}
+
 #ifdef CONFIG_NUMA_BALANCING
 /*
  * Approximate time to scan a full NUMA task in ms. The task scan period is
@@ -880,6 +892,21 @@ static inline int task_faults_idx(int nid, int priv)
return 2 * nid + priv;
 }
 
+/* Returns true if the given node is compute overloaded */
+static bool sched_numa_overloaded(int nid)
+{
+   int nr_cpus = 0;
+   int nr_preferred = 0;
+   int i;
+
+   for_each_cpu(i, cpumask_of_node(nid)) {
+   nr_cpus++;
+   nr_preferred += cpu_rq(i)->nr_preferred_running;
+   }
+
+   return nr_preferred >= nr_cpus << 1;
+}
+
 static void task_numa_placement(struct task_struct *p)
 {
int seq, nid, max_nid = 0;
@@ -908,7 +935,7 @@ static void task_numa_placement(struct task_struct *p)
 
/* Find maximum private faults */
faults = p->numa_faults[task_faults_idx(nid, 1)];
-   if (faults > max_faults) {
+   if (faults > max_faults && !sched_numa_overloaded(nid)) {
max_faults = faults;
max_nid = nid;
}
@@ -934,9 +961,7 @@ static void task_numa_placement(struct task_struct *p)
 max_nid);
}
 
-   /* Update the preferred nid and migrate task if possible */
-   p->numa_preferred_nid = max_nid;
-   p->numa_migrate_seq = 0;
+   sched_setnuma(p, max_nid, preferred_cpu);
migrate_task_to(p, preferred_cpu);
 
/*
@@ -1165,6 +1190,14 @@ void task_tick_numa(struct rq *rq, struct task_struct *curr)
 static void task_tick_numa(struct rq *rq, struct task_struct *curr)
 {
 }
+
+static inline void account_numa_enqueue(struct rq *rq, struct task_struct *p)
+{
+}
+
+static inline void account_numa_dequeue(struct rq *rq, struct task_struct *p)
+{
+}
 #endif /* CONFIG_NUMA_BALANCING */
 
 static void
@@ -1174,8 +1207,10 @@ account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
if (!parent_entity(se))

[PATCH 15/15] sched: Favour moving tasks towards nodes that incurred more faults

2013-07-05 Thread Mel Gorman
The scheduler already favours moving tasks towards their preferred node but
does nothing special for any other destination node. This patch favours
moving tasks towards a destination node if more NUMA hinting faults were
recorded on it. Similarly, if migrating to a destination node would degrade
locality based on NUMA hinting faults, then the migration is resisted.

Signed-off-by: Peter Zijlstra 
Signed-off-by: Mel Gorman 
---
 kernel/sched/fair.c | 63 -
 1 file changed, 57 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c303ba6..1a4af96 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4069,24 +4069,65 @@ task_hot(struct task_struct *p, u64 now, struct sched_domain *sd)
 }
 
 #ifdef CONFIG_NUMA_BALANCING
-/* Returns true if the destination node has incurred more faults */
-static bool migrate_improves_locality(struct task_struct *p, struct lb_env *env)
+
+static bool migrate_locality_prepare(struct task_struct *p, struct lb_env *env,
+   int *src_nid, int *dst_nid,
+   unsigned long *src_faults, unsigned long *dst_faults)
 {
-   int src_nid, dst_nid;
+   int priv;
 
if (!p->numa_faults || !(env->sd->flags & SD_NUMA))
return false;
 
-   src_nid = cpu_to_node(env->src_cpu);
-   dst_nid = cpu_to_node(env->dst_cpu);
+   *src_nid = cpu_to_node(env->src_cpu);
+   *dst_nid = cpu_to_node(env->dst_cpu);
 
-   if (src_nid == dst_nid ||
+   if (*src_nid == *dst_nid ||
p->numa_migrate_seq >= sysctl_numa_balancing_settle_count)
return false;
 
+   /* Calculate private/shared faults on the two nodes */
+   *src_faults = 0;
+   *dst_faults = 0;
+   for (priv = 0; priv < 2; priv++) {
+   *src_faults += p->numa_faults[task_faults_idx(*src_nid, priv)];
+   *dst_faults += p->numa_faults[task_faults_idx(*dst_nid, priv)];
+   }
+
+   return true;
+}
+
+/* Returns true if the destination node has incurred more faults */
+static bool migrate_improves_locality(struct task_struct *p, struct lb_env *env)
+{
+   int src_nid, dst_nid;
+   unsigned long src, dst;
+
+   if (!migrate_locality_prepare(p, env, &src_nid, &dst_nid, &src, &dst))
+   return false;
+
+   /* Move towards node if it is the preferred node */
if (p->numa_preferred_nid == dst_nid)
return true;
 
+   /* Move towards node if there were more NUMA hinting faults recorded */
+   if (dst > src)
+   return true;
+
+   return false;
+}
+
+static bool migrate_degrades_locality(struct task_struct *p, struct lb_env *env)
+{
+   int src_nid, dst_nid;
+   unsigned long src, dst;
+
+   if (!migrate_locality_prepare(p, env, &src_nid, &dst_nid, &src, &dst))
+   return false;
+
+   if (src > dst)
+   return true;
+
return false;
 }
 #else
@@ -4095,6 +4136,14 @@ static inline bool migrate_improves_locality(struct task_struct *p,
 {
return false;
 }
+
+
+static inline bool migrate_degrades_locality(struct task_struct *p,
+struct lb_env *env)
+{
+   return false;
+}
+
 #endif
 
 /*
@@ -4150,6 +4199,8 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 * 3) too many balance attempts have failed.
 */
tsk_cache_hot = task_hot(p, env->src_rq->clock_task, env->sd);
+   if (!tsk_cache_hot)
+   tsk_cache_hot = migrate_degrades_locality(p, env);
 
if (migrate_improves_locality(p, env)) {
 #ifdef CONFIG_SCHEDSTATS
-- 
1.8.1.4



[PATCH 10/15] sched: Set the scan rate proportional to the size of the task being scanned

2013-07-05 Thread Mel Gorman
The NUMA PTE scan rate is controlled with a combination of the
numa_balancing_scan_period_min, numa_balancing_scan_period_max and
numa_balancing_scan_size. This scan rate is independent of the size
of the task and as an aside it is further complicated by the fact that
numa_balancing_scan_size controls how many pages are marked pte_numa and
not how much virtual memory is scanned.

In combination, it is almost impossible to meaningfully tune the min and max
scan periods, and reasoning about performance is complex when the time to
complete a full scan is partially a function of the task's memory size. This
patch alters the semantics of the min and max tunables so that they tune the
length of time it takes to complete a scan of a task's virtual address space.
Conceptually this is a lot easier to understand. There is a "sanity" check to
ensure the scan rate is never extremely fast, based on the amount of virtual
memory that should be scanned in a second. The default of 2.5G seems
arbitrary but it is chosen so that the maximum scan rate after the patch
roughly matches the maximum scan rate before the patch was applied.

Signed-off-by: Mel Gorman 
---
 Documentation/sysctl/kernel.txt | 11 ---
 include/linux/sched.h   |  1 +
 kernel/sched/fair.c | 72 +++--
 3 files changed, 70 insertions(+), 14 deletions(-)

diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index a275042..f38d4f4 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -401,15 +401,16 @@ workload pattern changes and minimises performance impact due to remote
 memory accesses. These sysctls control the thresholds for scan delays and
 the number of pages scanned.
 
-numa_balancing_scan_period_min_ms is the minimum delay in milliseconds
-between scans. It effectively controls the maximum scanning rate for
-each task.
+numa_balancing_scan_period_min_ms is the minimum time in milliseconds to
+scan a tasks virtual memory. It effectively controls the maximum scanning
+rate for each task.
 
 numa_balancing_scan_delay_ms is the starting "scan delay" used for a task
 when it initially forks.
 
-numa_balancing_scan_period_max_ms is the maximum delay between scans. It
-effectively controls the minimum scanning rate for each task.
+numa_balancing_scan_period_max_ms is the maximum time in milliseconds to
+scan a tasks virtual memory. It effectively controls the minimum scanning
+rate for each task.
 
 numa_balancing_scan_size_mb is how many megabytes worth of pages are
 scanned for a given scan.
diff --git a/include/linux/sched.h b/include/linux/sched.h
index b81195e..d44fbc6 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1504,6 +1504,7 @@ struct task_struct {
int numa_scan_seq;
int numa_migrate_seq;
unsigned int numa_scan_period;
+   unsigned int numa_scan_period_max;
u64 node_stamp; /* migration stamp  */
struct callback_head numa_work;
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index aee3e0b..66306c7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -779,10 +779,12 @@ update_stats_curr_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
 
 #ifdef CONFIG_NUMA_BALANCING
 /*
- * numa task sample period in ms
+ * Approximate time to scan a full NUMA task in ms. The task scan period is
+ * calculated based on the tasks virtual memory size and
+ * numa_balancing_scan_size.
  */
-unsigned int sysctl_numa_balancing_scan_period_min = 100;
-unsigned int sysctl_numa_balancing_scan_period_max = 100*600;
+unsigned int sysctl_numa_balancing_scan_period_min = 1000;
+unsigned int sysctl_numa_balancing_scan_period_max = 60;
 
 /* Portion of address space to scan in MB */
 unsigned int sysctl_numa_balancing_scan_size = 256;
@@ -790,6 +792,46 @@ unsigned int sysctl_numa_balancing_scan_size = 256;
 /* Scan @scan_size MB every @scan_period after an initial @scan_delay in ms */
 unsigned int sysctl_numa_balancing_scan_delay = 1000;
 
+static unsigned int task_nr_scan_windows(struct task_struct *p)
+{
+   unsigned long nr_vm_pages = 0;
+   unsigned long nr_scan_pages;
+
+   nr_scan_pages = sysctl_numa_balancing_scan_size << (20 - PAGE_SHIFT);
+   nr_vm_pages = p->mm->total_vm;
+   if (!nr_vm_pages)
+   nr_vm_pages = nr_scan_pages;
+
+   nr_vm_pages = round_up(nr_vm_pages, nr_scan_pages);
+   return nr_vm_pages / nr_scan_pages;
+}
+
+/* For sanitys sake, never scan more PTEs than MAX_SCAN_WINDOW MB/sec. */
+#define MAX_SCAN_WINDOW 2560
+
+static unsigned int task_scan_min(struct task_struct *p)
+{
+   unsigned int scan, floor;
+   unsigned int windows = 1;
+
+   if (sysctl_numa_balancing_scan_size < MAX_SCAN_WINDOW)
+   windows = MAX_SCAN_WINDOW / sysctl_numa_balancing_scan_size;
+   floor = 1000 / windows;
+
+   scan = sysctl_numa_balancing_scan_period_min / task_nr_scan_windows(p);
+   

[PATCH 13/15] sched: Set preferred NUMA node based on number of private faults

2013-07-05 Thread Mel Gorman
Ideally it would be possible to distinguish between NUMA hinting faults that
are private to a task and those that are shared. If treated identically
there is a risk that shared pages bounce between nodes depending on
the order they are referenced by tasks. Ultimately what is desirable is
that task private pages remain local to the task while shared pages are
interleaved between sharing tasks running on different nodes to give good
average performance. This patch assumes that multi-threaded or multi-process
applications partition their data and that, in general, the private accesses
are more important for cpu->memory locality. Also, no new infrastructure is
required to treat private pages properly, but interleaving for shared pages
requires additional infrastructure.

To detect private accesses the pid of the last accessing task is required,
but the storage requirements are high. This patch borrows heavily from
Ingo Molnar's patch "numa, mm, sched: Implement last-CPU+PID hash tracking"
to encode some bits from the last accessing task in the page flags as
well as the node information. Collisions will occur but it is better than
just depending on the node information. Node information is then used to
determine if a page needs to migrate. The PID information is used to detect
private/shared accesses. The preferred NUMA node is selected based on where
the maximum number of approximately private faults was measured. Shared
faults are not taken into consideration for a few reasons.

First, if there are many tasks sharing the page then they'll all move
towards the same node. The node will be compute overloaded and then
scheduled away later only to bounce back again. Alternatively the shared
tasks would just bounce around nodes because the fault information is
effectively noise. Either way accounting for shared faults the same as
private faults can result in lower performance overall.

The second reason is based on a hypothetical workload that has a small
number of very important, heavily accessed private pages but a large shared
array. The shared array would dominate the number of faults and be selected
as a preferred node even though it's the wrong decision.

The third reason is that multiple threads in a process will race each
other to fault the shared page making the fault information unreliable.

Signed-off-by: Mel Gorman 
---
 include/linux/mm.h| 59 ---
 include/linux/mm_types.h  |  4 +--
 include/linux/page-flags-layout.h | 28 +++
 kernel/sched/fair.c   | 12 +---
 mm/huge_memory.c  | 10 +++
 mm/memory.c   | 16 +--
 mm/mempolicy.c| 10 +--
 mm/migrate.c  |  4 +--
 mm/mm_init.c  | 18 ++--
 mm/mmzone.c   | 12 
 mm/page_alloc.c   |  4 +--
 11 files changed, 103 insertions(+), 74 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index e2091b8..569beec 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -582,11 +582,11 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct 
vm_area_struct *vma)
  * sets it, so none of the operations on it need to be atomic.
  */
 
-/* Page flags: | [SECTION] | [NODE] | ZONE | [LAST_NID] | ... | FLAGS | */
+/* Page flags: | [SECTION] | [NODE] | ZONE | [LAST_NIDPID] | ... | FLAGS | */
 #define SECTIONS_PGOFF ((sizeof(unsigned long)*8) - SECTIONS_WIDTH)
 #define NODES_PGOFF(SECTIONS_PGOFF - NODES_WIDTH)
 #define ZONES_PGOFF(NODES_PGOFF - ZONES_WIDTH)
-#define LAST_NID_PGOFF (ZONES_PGOFF - LAST_NID_WIDTH)
+#define LAST_NIDPID_PGOFF  (ZONES_PGOFF - LAST_NIDPID_WIDTH)
 
 /*
  * Define the bit shifts to access each section.  For non-existent
@@ -596,7 +596,7 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct 
vm_area_struct *vma)
 #define SECTIONS_PGSHIFT   (SECTIONS_PGOFF * (SECTIONS_WIDTH != 0))
 #define NODES_PGSHIFT  (NODES_PGOFF * (NODES_WIDTH != 0))
 #define ZONES_PGSHIFT  (ZONES_PGOFF * (ZONES_WIDTH != 0))
-#define LAST_NID_PGSHIFT   (LAST_NID_PGOFF * (LAST_NID_WIDTH != 0))
+#define LAST_NIDPID_PGSHIFT(LAST_NIDPID_PGOFF * (LAST_NIDPID_WIDTH != 0))
 
 /* NODE:ZONE or SECTION:ZONE is used to ID a zone for the buddy allocator */
 #ifdef NODE_NOT_IN_PAGE_FLAGS
@@ -618,7 +618,7 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct 
vm_area_struct *vma)
 #define ZONES_MASK ((1UL << ZONES_WIDTH) - 1)
 #define NODES_MASK ((1UL << NODES_WIDTH) - 1)
 #define SECTIONS_MASK  ((1UL << SECTIONS_WIDTH) - 1)
-#define LAST_NID_MASK  ((1UL << LAST_NID_WIDTH) - 1)
+#define LAST_NIDPID_MASK   ((1UL << LAST_NIDPID_WIDTH) - 1)
 #define ZONEID_MASK((1UL << ZONEID_SHIFT) - 1)
 
 static inline enum zone_type page_zonenum(const struct page *page)
@@ -662,48 +662,63 @@ static inline 

Re: [PATCH 5/9] clocksource: dw_apb_timer: quirk for variants without EOI register

2013-07-05 Thread Heiko Stübner
This patch should have had a

From: Ulrich Prinz 

tag; sorry for the mistake.

Am Samstag, 6. Juli 2013, 00:54:07 schrieb Heiko Stübner:
> Some variants of the dw_apb_timer don't have an eoi register but instead
> expect a one to be written to the int_status register at eoi time.
> 
> Signed-off-by: Ulrich Prinz 
> ---
>  drivers/clocksource/dw_apb_timer.c |   10 +-
>  include/linux/dw_apb_timer.h   |5 +
>  2 files changed, 14 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/clocksource/dw_apb_timer.c
> b/drivers/clocksource/dw_apb_timer.c index 5f80a30..23cd7c6 100644
> --- a/drivers/clocksource/dw_apb_timer.c
> +++ b/drivers/clocksource/dw_apb_timer.c
> @@ -104,6 +104,11 @@ static void apbt_eoi(struct dw_apb_timer *timer)
>   apbt_readl(timer, timer->reg_eoi);
>  }
> 
> +static void apbt_eoi_int_status(struct dw_apb_timer *timer)
> +{
> + apbt_writel(timer, 1, timer->reg_int_status);
> +}
> +
>  static irqreturn_t dw_apb_clockevent_irq(int irq, void *data)
>  {
>   struct clock_event_device *evt = data;
> @@ -286,7 +291,10 @@ dw_apb_clockevent_init(int cpu, const char *name,
> unsigned rating, IRQF_NOBALANCING |
> IRQF_DISABLED;
> 
> - dw_ced->eoi = apbt_eoi;
> + if (quirks & APBTMR_QUIRK_NO_EOI)
> + dw_ced->eoi = apbt_eoi_int_status;
> + else
> + dw_ced->eoi = apbt_eoi;
> 	err = setup_irq(irq, &dw_ced->irqaction);
>   if (err) {
>   pr_err("failed to request timer irq\n");
> diff --git a/include/linux/dw_apb_timer.h b/include/linux/dw_apb_timer.h
> index 80f6686..fbe4c6b 100644
> --- a/include/linux/dw_apb_timer.h
> +++ b/include/linux/dw_apb_timer.h
> @@ -25,6 +25,11 @@
>   */
>  #define APBTMR_QUIRK_64BIT_COUNTER   BIT(0)
> 
> +/* The IP does not provide an end-of-interrupt register to clear pending
> + * interrupts, but requires writing a 1 to the interrupt-status register.
> + */
> +#define APBTMR_QUIRK_NO_EOI  BIT(1)
> +
>  struct dw_apb_timer {
>   void __iomem*base;
>   unsigned long   freq;

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 4/9] clocksource: dw_apb_timer: use the eoi callback to clear pending interrupts

2013-07-05 Thread Heiko Stübner
This patch should have had a

From: Ulrich Prinz 

tag; sorry for the mistake.

Am Samstag, 6. Juli 2013, 00:53:36 schrieb Heiko Stübner:
> Some timer variants have different mechanisms to clear a pending timer
> interrupt. Therefore don't hardcode the reading of the eoi register to
> clear them, but instead use the already existing eoi callback for this.
> 
> Signed-off-by: Ulrich Prinz 
> ---
>  drivers/clocksource/dw_apb_timer.c |   11 +++
>  1 file changed, 7 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/clocksource/dw_apb_timer.c
> b/drivers/clocksource/dw_apb_timer.c index bd45351..5f80a30 100644
> --- a/drivers/clocksource/dw_apb_timer.c
> +++ b/drivers/clocksource/dw_apb_timer.c
> @@ -121,11 +121,14 @@ static irqreturn_t dw_apb_clockevent_irq(int irq,
> void *data) return IRQ_HANDLED;
>  }
> 
> -static void apbt_enable_int(struct dw_apb_timer *timer)
> +static void apbt_enable_int(struct dw_apb_clock_event_device *dw_ced)
>  {
> +	struct dw_apb_timer *timer = &dw_ced->timer;
>   unsigned long ctrl = apbt_readl(timer, timer->reg_control);
> +
>   /* clear pending intr */
> - apbt_readl(timer, timer->reg_eoi);
> + if (dw_ced->eoi)
> + dw_ced->eoi(timer);
>   ctrl &= ~APBTMR_CONTROL_INT;
>   apbt_writel(timer, ctrl, timer->reg_control);
>  }
> @@ -200,7 +203,7 @@ static void apbt_set_mode(enum clock_event_mode mode,
>   break;
> 
>   case CLOCK_EVT_MODE_RESUME:
> - apbt_enable_int(timer);
> + apbt_enable_int(dw_ced);
>   break;
>   }
>  }
> @@ -325,7 +328,7 @@ void dw_apb_clockevent_register(struct
> dw_apb_clock_event_device *dw_ced)
> 
>   apbt_writel(timer, 0, timer->reg_control);
> 	clockevents_register_device(&dw_ced->ced);
> - apbt_enable_int(timer);
> + apbt_enable_int(dw_ced);
>  }
> 
>  /**



Re: [PATCH 6/9] clocksource: dw_apb_timer: quirk for inverted int mask

2013-07-05 Thread Heiko Stübner
This patch should have had a

From: Ulrich Prinz 

tag; sorry for the mistake.

Am Samstag, 6. Juli 2013, 00:54:35 schrieb Heiko Stübner:
> Some timer variants use an inverted setting to mask the timer interrupt.
> Therefore add a quirk to handle these variants.
> 
> Signed-off-by: Ulrich Prinz 
> ---
>  drivers/clocksource/dw_apb_timer.c |   23 ++-
>  include/linux/dw_apb_timer.h   |6 ++
>  2 files changed, 24 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/clocksource/dw_apb_timer.c
> b/drivers/clocksource/dw_apb_timer.c index 23cd7c6..7705d13 100644
> --- a/drivers/clocksource/dw_apb_timer.c
> +++ b/drivers/clocksource/dw_apb_timer.c
> @@ -84,7 +84,10 @@ static void apbt_disable_int(struct dw_apb_timer *timer)
>  {
>   unsigned long ctrl = apbt_readl(timer, timer->reg_control);
> 
> - ctrl |= APBTMR_CONTROL_INT;
> + if (timer->quirks & APBTMR_QUIRK_INVERSE_INTMASK)
> + ctrl &= ~APBTMR_CONTROL_INT;
> + else
> + ctrl |= APBTMR_CONTROL_INT;
>   apbt_writel(timer, ctrl, timer->reg_control);
>  }
> 
> @@ -134,7 +137,10 @@ static void apbt_enable_int(struct
> dw_apb_clock_event_device *dw_ced) /* clear pending intr */
>   if (dw_ced->eoi)
>   dw_ced->eoi(timer);
> - ctrl &= ~APBTMR_CONTROL_INT;
> + if (timer->quirks & APBTMR_QUIRK_INVERSE_INTMASK)
> + ctrl |= APBTMR_CONTROL_INT;
> + else
> + ctrl &= ~APBTMR_CONTROL_INT;
>   apbt_writel(timer, ctrl, timer->reg_control);
>  }
> 
> @@ -195,7 +201,10 @@ static void apbt_set_mode(enum clock_event_mode mode,
>   if (timer->quirks & APBTMR_QUIRK_64BIT_COUNTER)
>   apbt_writel(timer, 0, timer->reg_load_count + 0x4);
> 
> - ctrl &= ~APBTMR_CONTROL_INT;
> + if (timer->quirks & APBTMR_QUIRK_INVERSE_INTMASK)
> + ctrl |= APBTMR_CONTROL_INT;
> + else
> + ctrl &= ~APBTMR_CONTROL_INT;
>   ctrl |= APBTMR_CONTROL_ENABLE;
>   apbt_writel(timer, ctrl, timer->reg_control);
>   break;
> @@ -363,9 +372,13 @@ void dw_apb_clocksource_start(struct
> dw_apb_clocksource *dw_cs) if (timer->quirks & APBTMR_QUIRK_64BIT_COUNTER)
>   apbt_writel(timer, 0, timer->reg_load_count + 0x4);
> 
> - /* enable, mask interrupt */
> + /* set periodic, mask interrupt, enable timer */
>   ctrl &= ~APBTMR_CONTROL_MODE_PERIODIC;
> - ctrl |= (APBTMR_CONTROL_ENABLE | APBTMR_CONTROL_INT);
> + if (timer->quirks & APBTMR_QUIRK_INVERSE_INTMASK)
> + ctrl &= ~APBTMR_CONTROL_INT;
> + else
> + ctrl |= APBTMR_CONTROL_INT;
> + ctrl |= APBTMR_CONTROL_ENABLE;
>   apbt_writel(timer, ctrl, timer->reg_control);
>   /* read it once to get cached counter value initialized */
>   dw_apb_clocksource_read(dw_cs);
> diff --git a/include/linux/dw_apb_timer.h b/include/linux/dw_apb_timer.h
> index fbe4c6b..7d36d91 100644
> --- a/include/linux/dw_apb_timer.h
> +++ b/include/linux/dw_apb_timer.h
> @@ -30,6 +30,12 @@
>   */
>  #define APBTMR_QUIRK_NO_EOI  BIT(1)
> 
> +/* The IP uses an inverted interrupt-mask bit.
> + * Instead of activating interrupts by clearing a mask bit, it needs an
> + * enable bit to be set to 1.
> + */
> +#define APBTMR_QUIRK_INVERSE_INTMASK BIT(2)
> +
>  struct dw_apb_timer {
>   void __iomem*base;
>   unsigned long   freq;



[PATCH] arm: Convert sa1111 platform and bus legacy pm_ops to dev_pm_ops

2013-07-05 Thread Shuah Khan
Convert arch/arm/common/sa1111.c platform and bus legacy pm_ops to dev_pm_ops.
This change also updates the use of CONFIG_PM to CONFIG_PM_SLEEP as this
platform and bus code implements PM_SLEEP ops and not the PM_RUNTIME ops.
Compile tested.

Signed-off-by: Shuah Khan 
---
 arch/arm/common/sa1111.c |   39 ++-
 1 file changed, 22 insertions(+), 17 deletions(-)

diff --git a/arch/arm/common/sa1111.c b/arch/arm/common/sa1111.c
index e57d7e5..95594f0 100644
--- a/arch/arm/common/sa1111.c
+++ b/arch/arm/common/sa1111.c
@@ -107,7 +107,7 @@ struct sa1111 {
	spinlock_t	lock;
	void __iomem	*base;
	struct sa1111_platform_data *pdata;
-#ifdef CONFIG_PM
+#ifdef CONFIG_PM_SLEEP
	void		*saved_state;
 #endif
 };
@@ -870,11 +870,11 @@ struct sa1111_save_data {
	unsigned int	wakeen1;
 };
 
-#ifdef CONFIG_PM
+#ifdef CONFIG_PM_SLEEP
 
-static int sa1111_suspend(struct platform_device *dev, pm_message_t state)
+static int sa1111_suspend(struct device *dev)
 {
-	struct sa1111 *sachip = platform_get_drvdata(dev);
+	struct sa1111 *sachip = platform_get_drvdata(to_platform_device(dev));
	struct sa1111_save_data *save;
	unsigned long flags;
	unsigned int val;
@@ -937,9 +937,10 @@ static int sa1111_suspend(struct platform_device *dev, pm_message_t state)
  * restored by their respective drivers, and must be called
  * via LDM after this function.
  */
-static int sa1111_resume(struct platform_device *dev)
+static int sa1111_resume(struct device *dev)
 {
-	struct sa1111 *sachip = platform_get_drvdata(dev);
+	struct platform_device *pdev = to_platform_device(dev);
+	struct sa1111 *sachip = platform_get_drvdata(pdev);
	struct sa1111_save_data *save;
	unsigned long flags, id;
	void __iomem *base;
@@ -955,7 +956,7 @@ static int sa1111_resume(struct platform_device *dev)
	id = sa1111_readl(sachip->base + SA1111_SKID);
	if ((id & SKID_ID_MASK) != SKID_SA1111_ID) {
		__sa1111_remove(sachip);
-		platform_set_drvdata(dev, NULL);
+		platform_set_drvdata(pdev, NULL);
		kfree(save);
		return 0;
	}
@@ -1005,9 +1006,7 @@ static int sa1111_resume(struct platform_device *dev)
	return 0;
 }
 
-#else
-#define sa1111_suspend NULL
-#define sa1111_resume  NULL
+static SIMPLE_DEV_PM_OPS(sa1111_dev_pm_ops, sa1111_suspend, sa1111_resume);
 #endif
 
 static int sa1111_probe(struct platform_device *pdev)
@@ -1030,7 +1029,7 @@ static int sa1111_remove(struct platform_device *pdev)
	struct sa1111 *sachip = platform_get_drvdata(pdev);
 
	if (sachip) {
-#ifdef CONFIG_PM
+#ifdef CONFIG_PM_SLEEP
		kfree(sachip->saved_state);
		sachip->saved_state = NULL;
 #endif
@@ -1053,11 +1052,12 @@ static int sa1111_remove(struct platform_device *pdev)
 static struct platform_driver sa1111_device_driver = {
	.probe		= sa1111_probe,
	.remove		= sa1111_remove,
-	.suspend	= sa1111_suspend,
-	.resume		= sa1111_resume,
	.driver		= {
		.name	= "sa1111",
		.owner	= THIS_MODULE,
+#ifdef CONFIG_PM_SLEEP
+		.pm	= &sa1111_dev_pm_ops,
+#endif
	},
 };
 
@@ -1297,14 +1297,15 @@ static int sa1111_match(struct device *_dev, struct device_driver *_drv)
	return dev->devid & drv->devid;
 }
 
-static int sa1111_bus_suspend(struct device *dev, pm_message_t state)
+#ifdef CONFIG_PM_SLEEP
+static int sa1111_bus_suspend(struct device *dev)
 {
	struct sa1111_dev *sadev = SA1111_DEV(dev);
	struct sa1111_driver *drv = SA1111_DRV(dev->driver);
	int ret = 0;
 
	if (drv && drv->suspend)
-		ret = drv->suspend(sadev, state);
+		ret = drv->suspend(sadev, PMSG_SUSPEND);
	return ret;
 }
 
@@ -1318,6 +1319,9 @@ static int sa1111_bus_resume(struct device *dev)
		ret = drv->resume(sadev);
	return ret;
 }
+static SIMPLE_DEV_PM_OPS(sa1111_bus_dev_pm_ops, sa1111_bus_suspend,
+			 sa1111_bus_resume);
+#endif
 
 static void sa1111_bus_shutdown(struct device *dev)
 {
@@ -1354,8 +1358,9 @@ struct bus_type sa1111_bus_type = {
	.match		= sa1111_match,
	.probe		= sa1111_bus_probe,
	.remove		= sa1111_bus_remove,
-	.suspend	= sa1111_bus_suspend,
-	.resume		= sa1111_bus_resume,
+#ifdef CONFIG_PM_SLEEP
+	.pm		= &sa1111_bus_dev_pm_ops,
+#endif
	.shutdown	= sa1111_bus_shutdown,
 };
 EXPORT_SYMBOL(sa1111_bus_type);
-- 
1.7.10.4



[PATCH 9/9] clocksource: dw_apb_timer: special variant for rockchip rk3188 timers

2013-07-05 Thread Heiko Stübner
The rk3188 uses a variant of the timer containing two registers for load_count
and current_values.

Signed-off-by: Heiko Stuebner 
---
 .../bindings/arm/rockchip/rk3188-timer.txt |   20 
 drivers/clocksource/dw_apb_timer_of.c  |6 ++
 2 files changed, 26 insertions(+)
 create mode 100644 
Documentation/devicetree/bindings/arm/rockchip/rk3188-timer.txt

diff --git a/Documentation/devicetree/bindings/arm/rockchip/rk3188-timer.txt 
b/Documentation/devicetree/bindings/arm/rockchip/rk3188-timer.txt
new file mode 100644
index 0000000..ccbb389
--- /dev/null
+++ b/Documentation/devicetree/bindings/arm/rockchip/rk3188-timer.txt
@@ -0,0 +1,20 @@
+Rockchip rk3188 timer:
+----------------------
+
+The rk3188 SoCs contain a slightly modified dw-apb-timer.
+
+Required node properties:
+- compatible value : = "rockchip,rk3188-dw-apb-timer-osc";
+
+For the other properties see the generic documentation in
+../../rtc/dw-apb.txt
+
+Example:
+
+   timer3: timer@ffe0 {
+   compatible = "rockchip,rk3188-dw-apb-timer-osc";
+   interrupts = <0 170 4>;
+   reg = <0xffe0 0x1000>;
+   clocks = <_clk>, <_pclk>;
+   clock-names = "timer", "pclk";
+   };
diff --git a/drivers/clocksource/dw_apb_timer_of.c 
b/drivers/clocksource/dw_apb_timer_of.c
index 4bcc1c1..7824796 100644
--- a/drivers/clocksource/dw_apb_timer_of.c
+++ b/drivers/clocksource/dw_apb_timer_of.c
@@ -38,6 +38,11 @@ static void timer_get_base_and_rate(struct device_node *np,
 
*quirks = 0;
 
+   if (of_device_is_compatible(np, "rockchip,rk3188-dw-apb-timer-osc"))
+   *quirks |= APBTMR_QUIRK_64BIT_COUNTER | APBTMR_QUIRK_NO_EOI |
+  APBTMR_QUIRK_INVERSE_INTMASK |
+  APBTMR_QUIRK_INVERSE_PERIODIC;
+
/*
 * Not all implementations use a periphal clock, so don't panic
 * if it's not present
@@ -165,3 +170,4 @@ static void __init dw_apb_timer_init(struct device_node 
*timer)
 }
 CLOCKSOURCE_OF_DECLARE(pc3x2_timer, "picochip,pc3x2-timer", dw_apb_timer_init);
 CLOCKSOURCE_OF_DECLARE(apb_timer, "snps,dw-apb-timer-osc", dw_apb_timer_init);
+CLOCKSOURCE_OF_DECLARE(rk3188_timer, "rockchip,rk3188-dw-apb-timer-osc", 
dw_apb_timer_init);
-- 
1.7.10.4



[PATCH 8/9] clocksource: dw_apb_timer_of: add quirk handling

2013-07-05 Thread Heiko Stübner
timer_get_base_and_rate can now also extract information about hardware
quirks present in the devicetree node and pass it on to the clocksource /
clockevent init functions.

Signed-off-by: Heiko Stuebner 
---
 drivers/clocksource/dw_apb_timer_of.c |   21 +++--
 1 file changed, 15 insertions(+), 6 deletions(-)

diff --git a/drivers/clocksource/dw_apb_timer_of.c 
b/drivers/clocksource/dw_apb_timer_of.c
index b5412af..4bcc1c1 100644
--- a/drivers/clocksource/dw_apb_timer_of.c
+++ b/drivers/clocksource/dw_apb_timer_of.c
@@ -26,7 +26,7 @@
 #include 
 
 static void timer_get_base_and_rate(struct device_node *np,
-   void __iomem **base, u32 *rate)
+   void __iomem **base, u32 *rate, int *quirks)
 {
struct clk *timer_clk;
struct clk *pclk;
@@ -36,6 +36,8 @@ static void timer_get_base_and_rate(struct device_node *np,
if (!*base)
panic("Unable to map regs for %s", np->name);
 
+   *quirks = 0;
+
/*
 * Not all implementations use a periphal clock, so don't panic
 * if it's not present
@@ -66,15 +68,16 @@ static void add_clockevent(struct device_node *event_timer)
void __iomem *iobase;
struct dw_apb_clock_event_device *ced;
u32 irq, rate;
+   int quirks;
 
irq = irq_of_parse_and_map(event_timer, 0);
if (irq == NO_IRQ)
panic("No IRQ for clock event timer");
 
-	timer_get_base_and_rate(event_timer, &iobase, &rate);
+	timer_get_base_and_rate(event_timer, &iobase, &rate, &quirks);
 
ced = dw_apb_clockevent_init(0, event_timer->name, 300, iobase, irq,
-rate, 0);
+rate, quirks);
if (!ced)
panic("Unable to initialise clockevent device");
 
@@ -89,10 +92,12 @@ static void add_clocksource(struct device_node 
*source_timer)
void __iomem *iobase;
struct dw_apb_clocksource *cs;
u32 rate;
+   int quirks;
 
-	timer_get_base_and_rate(source_timer, &iobase, &rate);
+	timer_get_base_and_rate(source_timer, &iobase, &rate, &quirks);
 
 
-   cs = dw_apb_clocksource_init(300, source_timer->name, iobase, rate, 0);
+   cs = dw_apb_clocksource_init(300, source_timer->name, iobase, rate,
+quirks);
if (!cs)
panic("Unable to initialise clocksource device");
 
@@ -106,6 +111,9 @@ static void add_clocksource(struct device_node 
*source_timer)
 */
sched_io_base = iobase + 0x04;
sched_rate = rate;
+
+   if (quirks & APBTMR_QUIRK_64BIT_COUNTER)
+   sched_io_base += 0x04;
 }
 
 static u32 read_sched_clock(void)
@@ -122,11 +130,12 @@ static const struct of_device_id sptimer_ids[] 
__initconst = {
 static void init_sched_clock(void)
 {
struct device_node *sched_timer;
+   int quirks;
 
sched_timer = of_find_matching_node(NULL, sptimer_ids);
if (sched_timer) {
	timer_get_base_and_rate(sched_timer, &sched_io_base,
-				&sched_rate);
+				&sched_rate, &quirks);
of_node_put(sched_timer);
}
 
-- 
1.7.10.4



[PATCH 7/9] clocksource: dw_apb_timer: quirk for inverted timer mode setting

2013-07-05 Thread Heiko Stübner
From: Ulrich Prinz 

Some variants of SoCs using the dw_apb_timer have inverted logic for the
bit that selects one-shot / periodic mode or a free-running timer. This
commit adds the new APBTMR_QUIRK_INVERSE_PERIODIC quirk to handle them.

Signed-off-by: Ulrich Prinz 
---
 drivers/clocksource/dw_apb_timer.c |   11 +--
 include/linux/dw_apb_timer.h   |6 ++
 2 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/drivers/clocksource/dw_apb_timer.c 
b/drivers/clocksource/dw_apb_timer.c
index 7705d13..a2e8306 100644
--- a/drivers/clocksource/dw_apb_timer.c
+++ b/drivers/clocksource/dw_apb_timer.c
@@ -159,7 +159,11 @@ static void apbt_set_mode(enum clock_event_mode mode,
case CLOCK_EVT_MODE_PERIODIC:
period = DIV_ROUND_UP(timer->freq, HZ);
ctrl = apbt_readl(timer, timer->reg_control);
-   ctrl |= APBTMR_CONTROL_MODE_PERIODIC;
+
+   if (timer->quirks & APBTMR_QUIRK_INVERSE_PERIODIC)
+   ctrl &= ~APBTMR_CONTROL_MODE_PERIODIC;
+   else
+   ctrl |= APBTMR_CONTROL_MODE_PERIODIC;
apbt_writel(timer, ctrl, timer->reg_control);
/*
 * DW APB p. 46, have to disable timer before load counter,
@@ -186,7 +190,10 @@ static void apbt_set_mode(enum clock_event_mode mode,
 * the next event, therefore emulate the one-shot mode.
 */
ctrl &= ~APBTMR_CONTROL_ENABLE;
-   ctrl &= ~APBTMR_CONTROL_MODE_PERIODIC;
+   if (timer->quirks & APBTMR_QUIRK_INVERSE_PERIODIC)
+   ctrl |= APBTMR_CONTROL_MODE_PERIODIC;
+   else
+   ctrl &= ~APBTMR_CONTROL_MODE_PERIODIC;
 
apbt_writel(timer, ctrl, timer->reg_control);
/* write again to set free running mode */
diff --git a/include/linux/dw_apb_timer.h b/include/linux/dw_apb_timer.h
index 7d36d91..5d9210cc 100644
--- a/include/linux/dw_apb_timer.h
+++ b/include/linux/dw_apb_timer.h
@@ -36,6 +36,12 @@
  */
 #define APBTMR_QUIRK_INVERSE_INTMASK   BIT(2)
 
+/* The IP uses inverted logic for the bit setting periodic mode.
+ * Periodic means it times out after the period is over and is set to
+ * 1 in the original IP. This IP uses 1 for free running mode.
+ */
+#define APBTMR_QUIRK_INVERSE_PERIODIC  BIT(3)
+
 struct dw_apb_timer {
void __iomem*base;
unsigned long   freq;
-- 
1.7.10.4



[PATCH 6/9] clocksource: dw_apb_timer: quirk for inverted int mask

2013-07-05 Thread Heiko Stübner
Some timer variants use an inverted setting to mask the timer interrupt.
Therefore add a quirk to handle these variants.

Signed-off-by: Ulrich Prinz 
---
 drivers/clocksource/dw_apb_timer.c |   23 ++-
 include/linux/dw_apb_timer.h   |6 ++
 2 files changed, 24 insertions(+), 5 deletions(-)

diff --git a/drivers/clocksource/dw_apb_timer.c 
b/drivers/clocksource/dw_apb_timer.c
index 23cd7c6..7705d13 100644
--- a/drivers/clocksource/dw_apb_timer.c
+++ b/drivers/clocksource/dw_apb_timer.c
@@ -84,7 +84,10 @@ static void apbt_disable_int(struct dw_apb_timer *timer)
 {
unsigned long ctrl = apbt_readl(timer, timer->reg_control);
 
-   ctrl |= APBTMR_CONTROL_INT;
+   if (timer->quirks & APBTMR_QUIRK_INVERSE_INTMASK)
+   ctrl &= ~APBTMR_CONTROL_INT;
+   else
+   ctrl |= APBTMR_CONTROL_INT;
apbt_writel(timer, ctrl, timer->reg_control);
 }
 
@@ -134,7 +137,10 @@ static void apbt_enable_int(struct 
dw_apb_clock_event_device *dw_ced)
/* clear pending intr */
if (dw_ced->eoi)
dw_ced->eoi(timer);
-   ctrl &= ~APBTMR_CONTROL_INT;
+   if (timer->quirks & APBTMR_QUIRK_INVERSE_INTMASK)
+   ctrl |= APBTMR_CONTROL_INT;
+   else
+   ctrl &= ~APBTMR_CONTROL_INT;
apbt_writel(timer, ctrl, timer->reg_control);
 }
 
@@ -195,7 +201,10 @@ static void apbt_set_mode(enum clock_event_mode mode,
if (timer->quirks & APBTMR_QUIRK_64BIT_COUNTER)
apbt_writel(timer, 0, timer->reg_load_count + 0x4);
 
-   ctrl &= ~APBTMR_CONTROL_INT;
+   if (timer->quirks & APBTMR_QUIRK_INVERSE_INTMASK)
+   ctrl |= APBTMR_CONTROL_INT;
+   else
+   ctrl &= ~APBTMR_CONTROL_INT;
ctrl |= APBTMR_CONTROL_ENABLE;
apbt_writel(timer, ctrl, timer->reg_control);
break;
@@ -363,9 +372,13 @@ void dw_apb_clocksource_start(struct dw_apb_clocksource 
*dw_cs)
if (timer->quirks & APBTMR_QUIRK_64BIT_COUNTER)
apbt_writel(timer, 0, timer->reg_load_count + 0x4);
 
-   /* enable, mask interrupt */
+   /* set periodic, mask interrupt, enable timer */
ctrl &= ~APBTMR_CONTROL_MODE_PERIODIC;
-   ctrl |= (APBTMR_CONTROL_ENABLE | APBTMR_CONTROL_INT);
+   if (timer->quirks & APBTMR_QUIRK_INVERSE_INTMASK)
+   ctrl &= ~APBTMR_CONTROL_INT;
+   else
+   ctrl |= APBTMR_CONTROL_INT;
+   ctrl |= APBTMR_CONTROL_ENABLE;
apbt_writel(timer, ctrl, timer->reg_control);
/* read it once to get cached counter value initialized */
dw_apb_clocksource_read(dw_cs);
diff --git a/include/linux/dw_apb_timer.h b/include/linux/dw_apb_timer.h
index fbe4c6b..7d36d91 100644
--- a/include/linux/dw_apb_timer.h
+++ b/include/linux/dw_apb_timer.h
@@ -30,6 +30,12 @@
  */
 #define APBTMR_QUIRK_NO_EOIBIT(1)
 
+/* The IP uses an inverted interrupt-mask bit.
+ * Instead of activating interrupts by clearing a mask bit, it needs an
+ * enable bit to be set to 1.
+ */
+#define APBTMR_QUIRK_INVERSE_INTMASK   BIT(2)
+
 struct dw_apb_timer {
void __iomem*base;
unsigned long   freq;
-- 
1.7.10.4



[PATCH 5/9] clocksource: dw_apb_timer: quirk for variants without EOI register

2013-07-05 Thread Heiko Stübner
Some variants of the dw_apb_timer don't have an eoi register but instead
expect a one to be written to the int_status register at eoi time.

Signed-off-by: Ulrich Prinz 
---
 drivers/clocksource/dw_apb_timer.c |   10 +-
 include/linux/dw_apb_timer.h   |5 +
 2 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/drivers/clocksource/dw_apb_timer.c 
b/drivers/clocksource/dw_apb_timer.c
index 5f80a30..23cd7c6 100644
--- a/drivers/clocksource/dw_apb_timer.c
+++ b/drivers/clocksource/dw_apb_timer.c
@@ -104,6 +104,11 @@ static void apbt_eoi(struct dw_apb_timer *timer)
apbt_readl(timer, timer->reg_eoi);
 }
 
+static void apbt_eoi_int_status(struct dw_apb_timer *timer)
+{
+   apbt_writel(timer, 1, timer->reg_int_status);
+}
+
 static irqreturn_t dw_apb_clockevent_irq(int irq, void *data)
 {
struct clock_event_device *evt = data;
@@ -286,7 +291,10 @@ dw_apb_clockevent_init(int cpu, const char *name, unsigned 
rating,
  IRQF_NOBALANCING |
  IRQF_DISABLED;
 
-   dw_ced->eoi = apbt_eoi;
+   if (quirks & APBTMR_QUIRK_NO_EOI)
+   dw_ced->eoi = apbt_eoi_int_status;
+   else
+   dw_ced->eoi = apbt_eoi;
	err = setup_irq(irq, &dw_ced->irqaction);
if (err) {
pr_err("failed to request timer irq\n");
diff --git a/include/linux/dw_apb_timer.h b/include/linux/dw_apb_timer.h
index 80f6686..fbe4c6b 100644
--- a/include/linux/dw_apb_timer.h
+++ b/include/linux/dw_apb_timer.h
@@ -25,6 +25,11 @@
  */
 #define APBTMR_QUIRK_64BIT_COUNTER BIT(0)
 
+/* The IP does not provide an end-of-interrupt register to clear pending
+ * interrupts, but requires writing a 1 to the interrupt-status register.
+ */
+#define APBTMR_QUIRK_NO_EOIBIT(1)
+
 struct dw_apb_timer {
void __iomem*base;
unsigned long   freq;
-- 
1.7.10.4



[PATCH 4/9] clocksource: dw_apb_timer: use the eoi callback to clear pending interrupts

2013-07-05 Thread Heiko Stübner
Some timer variants have different mechanisms to clear a pending timer
interrupt. Therefore don't hardcode the reading of the eoi register to
clear them, but instead use the already existing eoi callback for this.

Signed-off-by: Ulrich Prinz 
---
 drivers/clocksource/dw_apb_timer.c |   11 +++
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/drivers/clocksource/dw_apb_timer.c 
b/drivers/clocksource/dw_apb_timer.c
index bd45351..5f80a30 100644
--- a/drivers/clocksource/dw_apb_timer.c
+++ b/drivers/clocksource/dw_apb_timer.c
@@ -121,11 +121,14 @@ static irqreturn_t dw_apb_clockevent_irq(int irq, void 
*data)
return IRQ_HANDLED;
 }
 
-static void apbt_enable_int(struct dw_apb_timer *timer)
+static void apbt_enable_int(struct dw_apb_clock_event_device *dw_ced)
 {
+	struct dw_apb_timer *timer = &dw_ced->timer;
unsigned long ctrl = apbt_readl(timer, timer->reg_control);
+
/* clear pending intr */
-   apbt_readl(timer, timer->reg_eoi);
+   if (dw_ced->eoi)
+   dw_ced->eoi(timer);
ctrl &= ~APBTMR_CONTROL_INT;
apbt_writel(timer, ctrl, timer->reg_control);
 }
@@ -200,7 +203,7 @@ static void apbt_set_mode(enum clock_event_mode mode,
break;
 
case CLOCK_EVT_MODE_RESUME:
-   apbt_enable_int(timer);
+   apbt_enable_int(dw_ced);
break;
}
 }
@@ -325,7 +328,7 @@ void dw_apb_clockevent_register(struct 
dw_apb_clock_event_device *dw_ced)
 
apbt_writel(timer, 0, timer->reg_control);
	clockevents_register_device(&dw_ced->ced);
-   apbt_enable_int(timer);
+   apbt_enable_int(dw_ced);
 }
 
 /**
-- 
1.7.10.4



[PATCH 3/9] clocksource: dw_apb_timer: quirk for variants with 64bit counter

2013-07-05 Thread Heiko Stübner
This adds a quirk for IP variants containing two load_count and value
registers that are used to provide 64bit accuracy on 32bit systems.

The added accuracy is currently not used, the driver is only adapted to
handle the different register layout and make it work on affected devices.

Signed-off-by: Heiko Stuebner 
---
 drivers/clocksource/dw_apb_timer.c |   27 +++
 include/linux/dw_apb_timer.h   |6 ++
 2 files changed, 33 insertions(+)

diff --git a/drivers/clocksource/dw_apb_timer.c 
b/drivers/clocksource/dw_apb_timer.c
index f5e7be8..bd45351 100644
--- a/drivers/clocksource/dw_apb_timer.c
+++ b/drivers/clocksource/dw_apb_timer.c
@@ -56,6 +56,17 @@ static void apbt_init_regs(struct dw_apb_timer *timer, int 
quirks)
timer->reg_control = APBTMR_N_CONTROL;
timer->reg_eoi = APBTMR_N_EOI;
timer->reg_int_status = APBTMR_N_INT_STATUS;
+
+   /*
+* On variants with 64bit counters some registers are
+* moved further down.
+*/
+   if (quirks & APBTMR_QUIRK_64BIT_COUNTER) {
+   timer->reg_current_value += 0x4;
+   timer->reg_control += 0x8;
+   timer->reg_eoi += 0x8;
+   timer->reg_int_status += 0x8;
+   }
 }
 
 static unsigned long apbt_readl(struct dw_apb_timer *timer, unsigned long offs)
@@ -145,6 +156,10 @@ static void apbt_set_mode(enum clock_event_mode mode,
udelay(1);
pr_debug("Setting clock period %lu for HZ %d\n", period, HZ);
apbt_writel(timer, period, timer->reg_load_count);
+
+   if (timer->quirks & APBTMR_QUIRK_64BIT_COUNTER)
+   apbt_writel(timer, 0, timer->reg_load_count + 0x4);
+
ctrl |= APBTMR_CONTROL_ENABLE;
apbt_writel(timer, ctrl, timer->reg_control);
break;
@@ -168,6 +183,10 @@ static void apbt_set_mode(enum clock_event_mode mode,
 * running mode.
 */
apbt_writel(timer, ~0, timer->reg_load_count);
+
+   if (timer->quirks & APBTMR_QUIRK_64BIT_COUNTER)
+   apbt_writel(timer, 0, timer->reg_load_count + 0x4);
+
ctrl &= ~APBTMR_CONTROL_INT;
ctrl |= APBTMR_CONTROL_ENABLE;
apbt_writel(timer, ctrl, timer->reg_control);
@@ -199,6 +218,10 @@ static int apbt_next_event(unsigned long delta,
apbt_writel(timer, ctrl, timer->reg_control);
/* write new count */
apbt_writel(timer, delta, timer->reg_load_count);
+
+   if (timer->quirks & APBTMR_QUIRK_64BIT_COUNTER)
+   apbt_writel(timer, 0, timer->reg_load_count + 0x4);
+
ctrl |= APBTMR_CONTROL_ENABLE;
apbt_writel(timer, ctrl, timer->reg_control);
 
@@ -325,6 +348,10 @@ void dw_apb_clocksource_start(struct dw_apb_clocksource 
*dw_cs)
ctrl &= ~APBTMR_CONTROL_ENABLE;
apbt_writel(timer, ctrl, timer->reg_control);
apbt_writel(timer, ~0, timer->reg_load_count);
+
+   if (timer->quirks & APBTMR_QUIRK_64BIT_COUNTER)
+   apbt_writel(timer, 0, timer->reg_load_count + 0x4);
+
/* enable, mask interrupt */
ctrl &= ~APBTMR_CONTROL_MODE_PERIODIC;
ctrl |= (APBTMR_CONTROL_ENABLE | APBTMR_CONTROL_INT);
diff --git a/include/linux/dw_apb_timer.h b/include/linux/dw_apb_timer.h
index 7dc7166..80f6686 100644
--- a/include/linux/dw_apb_timer.h
+++ b/include/linux/dw_apb_timer.h
@@ -19,6 +19,12 @@
 
 #define APBTMRS_REG_SIZE   0x14
 
+/* The IP uses two registers for count and values, to provide 64bit accuracy
+ * on 32bit platforms. The additional registers move the following registers
+ * down by 0x8 bytes, as both the count and value registers are duplicated.
+ */
+#define APBTMR_QUIRK_64BIT_COUNTER BIT(0)
+
 struct dw_apb_timer {
void __iomem*base;
unsigned long   freq;
-- 
1.7.10.4



[PATCH 1/9] clocksource: dw_apb_timer: infrastructure to handle quirks

2013-07-05 Thread Heiko Stübner
There exist variants of the timer IP with some modified properties.

Therefore add infrastructure to handle hardware-quirks in the driver.

Signed-off-by: Heiko Stuebner 
---
 arch/x86/kernel/apb_timer.c   |4 ++--
 drivers/clocksource/dw_apb_timer.c|7 +--
 drivers/clocksource/dw_apb_timer_of.c |4 ++--
 include/linux/dw_apb_timer.h  |6 --
 4 files changed, 13 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kernel/apb_timer.c b/arch/x86/kernel/apb_timer.c
index c9876ef..e7fe0f6 100644
--- a/arch/x86/kernel/apb_timer.c
+++ b/arch/x86/kernel/apb_timer.c
@@ -121,7 +121,7 @@ static inline void apbt_set_mapping(void)
 
clocksource_apbt = dw_apb_clocksource_init(APBT_CLOCKSOURCE_RATING,
"apbt0", apbt_virt_address + phy_cs_timer_id *
-   APBTMRS_REG_SIZE, apbt_freq);
+   APBTMRS_REG_SIZE, apbt_freq, 0);
return;
 
 panic_noapbt:
@@ -159,7 +159,7 @@ static int __init apbt_clockevent_register(void)
adev->timer = dw_apb_clockevent_init(smp_processor_id(), "apbt0",
mrst_timer_options == MRST_TIMER_LAPIC_APBT ?
APBT_CLOCKEVENT_RATING - 100 : APBT_CLOCKEVENT_RATING,
-   adev_virt_addr(adev), 0, apbt_freq);
+   adev_virt_addr(adev), 0, apbt_freq, 0);
/* Firmware does EOI handling for us. */
adev->timer->eoi = NULL;
 
diff --git a/drivers/clocksource/dw_apb_timer.c 
b/drivers/clocksource/dw_apb_timer.c
index 8c2a35f..01bdac0 100644
--- a/drivers/clocksource/dw_apb_timer.c
+++ b/drivers/clocksource/dw_apb_timer.c
@@ -213,7 +213,8 @@ static int apbt_next_event(unsigned long delta,
  */
 struct dw_apb_clock_event_device *
 dw_apb_clockevent_init(int cpu, const char *name, unsigned rating,
-  void __iomem *base, int irq, unsigned long freq)
+  void __iomem *base, int irq, unsigned long freq,
+  int quirks)
 {
struct dw_apb_clock_event_device *dw_ced =
kzalloc(sizeof(*dw_ced), GFP_KERNEL);
@@ -225,6 +226,7 @@ dw_apb_clockevent_init(int cpu, const char *name, unsigned 
rating,
dw_ced->timer.base = base;
dw_ced->timer.irq = irq;
dw_ced->timer.freq = freq;
+   dw_ced->timer.quirks = quirks;
 
clockevents_calc_mult_shift(&dw_ced->ced, freq, APBT_MIN_PERIOD);
dw_ced->ced.max_delta_ns = clockevent_delta2ns(0x7fffffff, &dw_ced->ced);
@@ -349,7 +351,7 @@ static void apbt_restart_clocksource(struct clocksource *cs)
  */
 struct dw_apb_clocksource *
 dw_apb_clocksource_init(unsigned rating, const char *name, void __iomem *base,
-   unsigned long freq)
+   unsigned long freq, int quirks)
 {
struct dw_apb_clocksource *dw_cs = kzalloc(sizeof(*dw_cs), GFP_KERNEL);
 
@@ -358,6 +360,7 @@ dw_apb_clocksource_init(unsigned rating, const char *name, 
void __iomem *base,
 
dw_cs->timer.base = base;
dw_cs->timer.freq = freq;
+   dw_cs->timer.quirks = quirks;
dw_cs->cs.name = name;
dw_cs->cs.rating = rating;
dw_cs->cs.read = __apbt_read_clocksource;
diff --git a/drivers/clocksource/dw_apb_timer_of.c 
b/drivers/clocksource/dw_apb_timer_of.c
index cef5544..b5412af 100644
--- a/drivers/clocksource/dw_apb_timer_of.c
+++ b/drivers/clocksource/dw_apb_timer_of.c
@@ -74,7 +74,7 @@ static void add_clockevent(struct device_node *event_timer)
timer_get_base_and_rate(event_timer, &iobase, &rate);
 
ced = dw_apb_clockevent_init(0, event_timer->name, 300, iobase, irq,
-rate);
+rate, 0);
if (!ced)
panic("Unable to initialise clockevent device");
 
@@ -92,7 +92,7 @@ static void add_clocksource(struct device_node *source_timer)
 
timer_get_base_and_rate(source_timer, &iobase, &rate);
 
-   cs = dw_apb_clocksource_init(300, source_timer->name, iobase, rate);
+   cs = dw_apb_clocksource_init(300, source_timer->name, iobase, rate, 0);
if (!cs)
panic("Unable to initialise clocksource device");
 
diff --git a/include/linux/dw_apb_timer.h b/include/linux/dw_apb_timer.h
index 07261d5..67d09c7 100644
--- a/include/linux/dw_apb_timer.h
+++ b/include/linux/dw_apb_timer.h
@@ -23,6 +23,7 @@ struct dw_apb_timer {
void __iomem*base;
unsigned long   freq;
int irq;
+   int quirks;
 };
 
 struct dw_apb_clock_event_device {
@@ -44,10 +45,11 @@ void dw_apb_clockevent_stop(struct 
dw_apb_clock_event_device *dw_ced);
 
 struct dw_apb_clock_event_device *
 dw_apb_clockevent_init(int cpu, const char *name, unsigned rating,
-  void __iomem *base, int irq, unsigned long freq);
+  void __iomem *base, int irq, unsigned long freq,
+  int quirks);
 struct dw_apb_clocksource *
 

[PATCH 2/9] clocksource: dw_apb_timer: flexible register addresses

2013-07-05 Thread Heiko Stübner
There exist variants of the apb-timer that use slightly different
register positions. To accommodate this, add elements to the timer struct
to hold the actual register offsets.
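[Editor's note: the gain from this change is that one accessor call site serves every layout once the offset lives in the struct. A rough model, with the MMIO window simulated as an array and illustrative names not taken from the driver:]

```c
#include <assert.h>

/* Simplified model: reg_control is a per-variant byte offset held in
 * the timer struct, mirroring timer->reg_control in the patch. */
struct timer_model {
	unsigned int regs[8];      /* stand-in for the void __iomem *base window */
	unsigned long reg_control; /* byte offset of the control register */
};

static unsigned int model_readl(const struct timer_model *t, unsigned long offs)
{
	return t->regs[offs / 4];  /* readl(timer->base + offs) in the driver */
}
```

The same `model_readl(t, t->reg_control)` expression then reads the right word whether the variant keeps control at 0x08 or shifts it to 0x10.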

Signed-off-by: Heiko Stuebner 
---
 drivers/clocksource/dw_apb_timer.c |   83 ++--
 include/linux/dw_apb_timer.h   |5 +++
 2 files changed, 56 insertions(+), 32 deletions(-)

diff --git a/drivers/clocksource/dw_apb_timer.c 
b/drivers/clocksource/dw_apb_timer.c
index 01bdac0..f5e7be8 100644
--- a/drivers/clocksource/dw_apb_timer.c
+++ b/drivers/clocksource/dw_apb_timer.c
@@ -49,6 +49,15 @@ clocksource_to_dw_apb_clocksource(struct clocksource *cs)
return container_of(cs, struct dw_apb_clocksource, cs);
 }
 
+static void apbt_init_regs(struct dw_apb_timer *timer, int quirks)
+{
+   timer->reg_load_count = APBTMR_N_LOAD_COUNT;
+   timer->reg_current_value = APBTMR_N_CURRENT_VALUE;
+   timer->reg_control = APBTMR_N_CONTROL;
+   timer->reg_eoi = APBTMR_N_EOI;
+   timer->reg_int_status = APBTMR_N_INT_STATUS;
+}
+
 static unsigned long apbt_readl(struct dw_apb_timer *timer, unsigned long offs)
 {
return readl(timer->base + offs);
@@ -62,10 +71,10 @@ static void apbt_writel(struct dw_apb_timer *timer, 
unsigned long val,
 
 static void apbt_disable_int(struct dw_apb_timer *timer)
 {
-   unsigned long ctrl = apbt_readl(timer, APBTMR_N_CONTROL);
+   unsigned long ctrl = apbt_readl(timer, timer->reg_control);
 
ctrl |= APBTMR_CONTROL_INT;
-   apbt_writel(timer, ctrl, APBTMR_N_CONTROL);
+   apbt_writel(timer, ctrl, timer->reg_control);
 }
 
 /**
@@ -81,7 +90,7 @@ void dw_apb_clockevent_pause(struct dw_apb_clock_event_device 
*dw_ced)
 
 static void apbt_eoi(struct dw_apb_timer *timer)
 {
-   apbt_readl(timer, APBTMR_N_EOI);
+   apbt_readl(timer, timer->reg_eoi);
 }
 
 static irqreturn_t dw_apb_clockevent_irq(int irq, void *data)
@@ -103,11 +112,11 @@ static irqreturn_t dw_apb_clockevent_irq(int irq, void 
*data)
 
 static void apbt_enable_int(struct dw_apb_timer *timer)
 {
-   unsigned long ctrl = apbt_readl(timer, APBTMR_N_CONTROL);
+   unsigned long ctrl = apbt_readl(timer, timer->reg_control);
/* clear pending intr */
-   apbt_readl(timer, APBTMR_N_EOI);
+   apbt_readl(timer, timer->reg_eoi);
ctrl &= ~APBTMR_CONTROL_INT;
-   apbt_writel(timer, ctrl, APBTMR_N_CONTROL);
+   apbt_writel(timer, ctrl, timer->reg_control);
 }
 
 static void apbt_set_mode(enum clock_event_mode mode,
@@ -116,31 +125,32 @@ static void apbt_set_mode(enum clock_event_mode mode,
unsigned long ctrl;
unsigned long period;
struct dw_apb_clock_event_device *dw_ced = ced_to_dw_apb_ced(evt);
+   struct dw_apb_timer *timer = &dw_ced->timer;
 
pr_debug("%s CPU %d mode=%d\n", __func__, first_cpu(*evt->cpumask),
 mode);
 
switch (mode) {
case CLOCK_EVT_MODE_PERIODIC:
-   period = DIV_ROUND_UP(dw_ced->timer.freq, HZ);
-   ctrl = apbt_readl(&dw_ced->timer, APBTMR_N_CONTROL);
+   period = DIV_ROUND_UP(timer->freq, HZ);
+   ctrl = apbt_readl(timer, timer->reg_control);
ctrl |= APBTMR_CONTROL_MODE_PERIODIC;
-   apbt_writel(&dw_ced->timer, ctrl, APBTMR_N_CONTROL);
+   apbt_writel(timer, ctrl, timer->reg_control);
/*
 * DW APB p. 46, have to disable timer before load counter,
 * may cause sync problem.
 */
ctrl &= ~APBTMR_CONTROL_ENABLE;
-   apbt_writel(&dw_ced->timer, ctrl, APBTMR_N_CONTROL);
+   apbt_writel(timer, ctrl, timer->reg_control);
udelay(1);
pr_debug("Setting clock period %lu for HZ %d\n", period, HZ);
-   apbt_writel(&dw_ced->timer, period, APBTMR_N_LOAD_COUNT);
+   apbt_writel(timer, period, timer->reg_load_count);
ctrl |= APBTMR_CONTROL_ENABLE;
-   apbt_writel(&dw_ced->timer, ctrl, APBTMR_N_CONTROL);
+   apbt_writel(timer, ctrl, timer->reg_control);
break;
 
case CLOCK_EVT_MODE_ONESHOT:
-   ctrl = apbt_readl(&dw_ced->timer, APBTMR_N_CONTROL);
+   ctrl = apbt_readl(timer, timer->reg_control);
/*
 * set free running mode, this mode will let timer reload max
 * timeout which will give time (3min on 25MHz clock) to rearm
@@ -149,29 +159,29 @@ static void apbt_set_mode(enum clock_event_mode mode,
ctrl &= ~APBTMR_CONTROL_ENABLE;
ctrl &= ~APBTMR_CONTROL_MODE_PERIODIC;
 
-   apbt_writel(&dw_ced->timer, ctrl, APBTMR_N_CONTROL);
+   apbt_writel(timer, ctrl, timer->reg_control);
/* write again to set free running mode */
-   apbt_writel(&dw_ced->timer, ctrl, APBTMR_N_CONTROL);
+   apbt_writel(timer, ctrl, timer->reg_control);

[PATCH 0/9] clocksource: dw_apb_timer: support for timer variant used in rk3188 SoCs

2013-07-05 Thread Heiko Stübner
The Rockchip rk3188 SoCs use a variant of the timer with some slight
modifications. This series implements them as quirks for the dw_apb_timer.

Tested on a rk3188 for the quirk handling and on a rk3066a to check that
nothing broke.

Heiko Stuebner (5):
  clocksource: dw_apb_timer: infrastructure to handle quirks
  clocksource: dw_apb_timer: flexible register addresses
  clocksource: dw_apb_timer: quirk for variants with 64bit counter
  clocksource: dw_apb_timer_of: add quirk handling
  clocksource: dw_apb_timer: special variant for rockchip rk3188 timers

Ulrich Prinz (4):
  clocksource: dw_apb_timer: use the eoi callback to clear pending interrupts
  clocksource: dw_apb_timer: quirk for variants without EOI register
  clocksource: dw_apb_timer: quirk for inverted int mask
  clocksource: dw_apb_timer: quirk for inverted timer mode setting

 .../bindings/arm/rockchip/rk3188-timer.txt |   20 +++
 arch/x86/kernel/apb_timer.c|4 +-
 drivers/clocksource/dw_apb_timer.c |  166 +++-
 drivers/clocksource/dw_apb_timer_of.c  |   27 +++-
 include/linux/dw_apb_timer.h   |   34 +++-
 5 files changed, 198 insertions(+), 53 deletions(-)
 create mode 100644 
Documentation/devicetree/bindings/arm/rockchip/rk3188-timer.txt

-- 
1.7.10.4



Re: [PATCH] arm: Convert sa1111 platform and bus legacy pm_ops to dev_pm_ops

2013-07-05 Thread Shuah Khan
On 07/05/2013 04:45 PM, Shuah Khan wrote:
> Convert arch/arm/common/sa1111 platform and bus legacy pm_ops to dev_pm_ops.
> This change also updates the use of CONFIG_PM to CONFIG_PM_SLEEP as this
> platform and bus code implements PM_SLEEP ops and not the PM_RUNTIME ops.
> Compile tested.
>
> Signed-off-by: Shuah Khan 
> ---
>   arch/arm/common/sa1111.c |2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/arm/common/sa1111.c b/arch/arm/common/sa1111.c
> index 2a64c12..95594f0 100644
> --- a/arch/arm/common/sa1111.c
> +++ b/arch/arm/common/sa1111.c
> @@ -1321,7 +1321,7 @@ static int sa1111_bus_resume(struct device *dev)
>   }
>   static SIMPLE_DEV_PM_OPS(sa1111_bus_dev_pm_ops, sa1111_bus_suspend,
> sa1111_bus_resume);
> -#endif
> +#endif
>
>   static void sa1111_bus_shutdown(struct device *dev)
>   {
>

Please ignore this patch - it is not correct.

-- Shuah

-- 
Shuah Khan, Linux Kernel Developer - Open Source Group Samsung Research 
America (Silicon Valley) shuah...@samsung.com | (970) 672-0658


Re: [PATCH 0/8] perf: add ability to sample physical data addresses

2013-07-05 Thread Stephane Eranian
Peter,

On Fri, Jun 28, 2013 at 11:58 AM, Peter Zijlstra  wrote:
> On Wed, Jun 26, 2013 at 09:10:50PM +0200, Stephane Eranian wrote:
>> After more investigation with the author of the false sharing
>> detection tool, I think
>> that if the mapping changes, it is okay. The tool can detect this and
>> drop the analysis
>> at that address. So as long as we can flag the mapping change, we are
>> okay. Hopefully,
>> it does not occur frequently. If so, then I think there are bigger
>> issues to fix on the system
>> than false sharing.
>
> But if you index everything using dev:inode:offset it doesn't matter if the
> mapping changes; you're invariant to map placement.
>
> And the thing is; we already compute most of that anyway in order to find the
> code in DSOs, except there we use filename instead of dev:inode. However if
> there were means to open files using dev:inode that might actually be more
> reliable than using the filename.

So, I tried an example using shmat(). I map the same shared segment twice
in the same process. Then I fork(), and I see this in /proc/PID/maps:

7f80fce28000-7f80fce29000 rw-s  00:04 1376262
  /SYSV (deleted)
7f80fce29000-7f80fce2a000 rw-s  00:04 1343491
  /SYSV (deleted)
7f80fce2a000-7f80fce2b000 rw-s  00:04 1343491
  /SYSV (deleted)

The segment at 1343491 is the one mapped twice. So that number (shmid)
can be used to identify identical mappings. It appears the same way in both
processes. The other 1376262 mapping is just to verify that each segment
gets a different number.

So it looks possible to use this approach across processes to identify identical
physical mappings. However, this is not very practical.
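[Editor's note: the observation above can be reproduced with a small userspace program. This sketch assumes a Linux system with SysV shm and a mounted /proc; the helper names are invented here, not part of any existing API.]

```c
#include <stdio.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/* Count "SYSV" lines in our own maps file; SysV shm attachments show
 * up there as "/SYSV... (deleted)" entries, with the shmid as inode. */
static int count_sysv_mappings(void)
{
	FILE *f = fopen("/proc/self/maps", "r");
	char line[512];
	int n = 0;

	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f))
		if (strstr(line, "SYSV"))
			n++;
	fclose(f);
	return n;
}

/* Attach the segment identified by 'id' twice, as in the mail above,
 * and report how many SYSV mappings the process then has. */
static int sysv_lines_after_double_attach(int id)
{
	void *a = shmat(id, NULL, 0);
	void *b = shmat(id, NULL, 0); /* same segment, second mapping */
	int n = count_sysv_mappings();

	shmdt(a);
	shmdt(b);
	return n;
}
```

Both lines carry the same inode number (the shmid), which is what makes the cross-process identification described above possible.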

The first reason is that perf_event does not capture shmat() mappings in MMAP
records.

The second is that if you rely on /proc/PID/maps, you will have to
have the tool
constantly poll that file for new shared mappings. This is not how
perf works today,
not even in system-wide mode. /proc/PID/maps is swept only once when perf
record -a is started.

Ingo is proposing an ioctl() to flush the mappings. But then, when is a
good time to do this from the tool?

So my approach with PERF_SAMPLE_PHYS_ADDR looks easier on the tools which
if I recall is the philosophy behind perf_events.

Any more comments?


Re: [PATCH] arm: Convert sa1111 platform and bus legacy pm_ops to dev_pm_ops

2013-07-05 Thread Sergei Shtylyov

Hello.

On 07/06/2013 02:44 AM, Shuah Khan wrote:


Convert arch/arm/common/sa1111 platform and bus legacy pm_ops to dev_pm_ops.
This change also updates the use of CONFIG_PM to CONFIG_PM_SLEEP as this
platform and bus code implements PM_SLEEP ops and not the PM_RUNTIME ops.
Compile tested.


   It may be compile tested but the patch description doesn't match the 
patch (which is a simple trailing space fix).



Signed-off-by: Shuah Khan 
---
  arch/arm/common/sa1111.c |2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/arm/common/sa1111.c b/arch/arm/common/sa1111.c
index 2a64c12..95594f0 100644
--- a/arch/arm/common/sa1111.c
+++ b/arch/arm/common/sa1111.c
@@ -1321,7 +1321,7 @@ static int sa1111_bus_resume(struct device *dev)
  }
  static SIMPLE_DEV_PM_OPS(sa1111_bus_dev_pm_ops, sa1111_bus_suspend,
 sa1111_bus_resume);
-#endif
+#endif


WBR, Sergei




Re: [PATCH] arm: Convert sa1111 platform and bus legacy pm_ops to dev_pm_ops

2013-07-05 Thread Russell King - ARM Linux
On Fri, Jul 05, 2013 at 04:44:57PM -0600, Shuah Khan wrote:
> Convert arch/arm/common/sa1111 platform and bus legacy pm_ops to dev_pm_ops.
> This change also updates the use of CONFIG_PM to CONFIG_PM_SLEEP as this
> platform and bus code implements PM_SLEEP ops and not the PM_RUNTIME ops.
> Compile tested.

Err...

> diff --git a/arch/arm/common/sa1111.c b/arch/arm/common/sa1111.c
> index 2a64c12..95594f0 100644
> --- a/arch/arm/common/sa1111.c
> +++ b/arch/arm/common/sa1111.c
> @@ -1321,7 +1321,7 @@ static int sa1111_bus_resume(struct device *dev)
>  }
>  static SIMPLE_DEV_PM_OPS(sa1111_bus_dev_pm_ops, sa1111_bus_suspend,
> sa1111_bus_resume);
> -#endif 
> +#endif

Patch doesn't match description.


[PATCH] arm: Convert sa1111 platform and bus legacy pm_ops to dev_pm_ops

2013-07-05 Thread Shuah Khan
Convert arch/arm/common/sa1111 platform and bus legacy pm_ops to dev_pm_ops.
This change also updates the use of CONFIG_PM to CONFIG_PM_SLEEP as this
platform and bus code implements PM_SLEEP ops and not the PM_RUNTIME ops.
Compile tested.

Signed-off-by: Shuah Khan 
---
 arch/arm/common/sa1111.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/arm/common/sa1111.c b/arch/arm/common/sa1111.c
index 2a64c12..95594f0 100644
--- a/arch/arm/common/sa1111.c
+++ b/arch/arm/common/sa1111.c
@@ -1321,7 +1321,7 @@ static int sa1111_bus_resume(struct device *dev)
 }
 static SIMPLE_DEV_PM_OPS(sa1111_bus_dev_pm_ops, sa1111_bus_suspend,
 sa1111_bus_resume);
-#endif 
+#endif
 
 static void sa1111_bus_shutdown(struct device *dev)
 {
-- 
1.7.10.4



[PATCH] arm: Convert scoop platform and bus legacy pm_ops to dev_pm_ops

2013-07-05 Thread Shuah Khan
Convert arch/arm/common/scoop platform and bus legacy pm_ops to dev_pm_ops.
This change also updates the use of CONFIG_PM to CONFIG_PM_SLEEP as this
platform and bus code implements PM_SLEEP ops and not the PM_RUNTIME ops.
Compile tested.

Signed-off-by: Shuah Khan 
---
 arch/arm/common/scoop.c |   19 +--
 1 file changed, 9 insertions(+), 10 deletions(-)

diff --git a/arch/arm/common/scoop.c b/arch/arm/common/scoop.c
index a5c3dc3..8d64fcb 100644
--- a/arch/arm/common/scoop.c
+++ b/arch/arm/common/scoop.c
@@ -141,7 +141,7 @@ EXPORT_SYMBOL(reset_scoop);
 EXPORT_SYMBOL(read_scoop_reg);
 EXPORT_SYMBOL(write_scoop_reg);
 
-#ifdef CONFIG_PM
+#ifdef CONFIG_PM_SLEEP
 static void check_scoop_reg(struct scoop_dev *sdev)
 {
unsigned short mcr;
@@ -151,9 +151,9 @@ static void check_scoop_reg(struct scoop_dev *sdev)
iowrite16(0x0101, sdev->base + SCOOP_MCR);
 }
 
-static int scoop_suspend(struct platform_device *dev, pm_message_t state)
+static int scoop_suspend(struct device *dev)
 {
-   struct scoop_dev *sdev = platform_get_drvdata(dev);
+   struct scoop_dev *sdev = platform_get_drvdata(to_platform_device(dev));
 
check_scoop_reg(sdev);
sdev->scoop_gpwr = ioread16(sdev->base + SCOOP_GPWR);
@@ -162,18 +162,16 @@ static int scoop_suspend(struct platform_device *dev, 
pm_message_t state)
return 0;
 }
 
-static int scoop_resume(struct platform_device *dev)
+static int scoop_resume(struct device *dev)
 {
-   struct scoop_dev *sdev = platform_get_drvdata(dev);
+   struct scoop_dev *sdev = platform_get_drvdata(to_platform_device(dev));
 
check_scoop_reg(sdev);
iowrite16(sdev->scoop_gpwr, sdev->base + SCOOP_GPWR);
 
return 0;
 }
-#else
-#define scoop_suspend  NULL
-#define scoop_resume   NULL
+static SIMPLE_DEV_PM_OPS(scoop_dev_pm_ops, scoop_suspend, scoop_resume);
 #endif
 
 static int scoop_probe(struct platform_device *pdev)
@@ -269,10 +267,11 @@ static int scoop_remove(struct platform_device *pdev)
 static struct platform_driver scoop_driver = {
.probe  = scoop_probe,
.remove = scoop_remove,
-   .suspend= scoop_suspend,
-   .resume = scoop_resume,
.driver = {
.name   = "sharp-scoop",
+#ifdef CONFIG_PM_SLEEP
+   .pm = &scoop_dev_pm_ops,
+#endif
},
 };
 
-- 
1.7.10.4



[PATCH] arm: Convert locomo platform and bus legacy pm_ops to dev_pm_ops

2013-07-05 Thread Shuah Khan
Convert arch/arm/common/locomo platform and bus legacy pm_ops to dev_pm_ops.
This change also updates the use of CONFIG_PM to CONFIG_PM_SLEEP as this
platform and bus code implements PM_SLEEP ops and not the PM_RUNTIME ops.
Compile tested.

Signed-off-by: Shuah Khan 
---
 arch/arm/common/locomo.c |   33 -
 1 file changed, 20 insertions(+), 13 deletions(-)

diff --git a/arch/arm/common/locomo.c b/arch/arm/common/locomo.c
index b55c362..4028e41 100644
--- a/arch/arm/common/locomo.c
+++ b/arch/arm/common/locomo.c
@@ -262,7 +262,7 @@ locomo_init_one_child(struct locomo *lchip, struct 
locomo_dev_info *info)
return ret;
 }
 
-#ifdef CONFIG_PM
+#ifdef CONFIG_PM_SLEEP
 
 struct locomo_save_data {
u16 LCM_GPO;
@@ -272,9 +272,9 @@ struct locomo_save_data {
u16 LCM_SPIMD;
 };
 
-static int locomo_suspend(struct platform_device *dev, pm_message_t state)
+static int locomo_suspend(struct device *dev)
 {
-   struct locomo *lchip = platform_get_drvdata(dev);
+   struct locomo *lchip = platform_get_drvdata(to_platform_device(dev));
struct locomo_save_data *save;
unsigned long flags;
 
@@ -316,9 +316,9 @@ static int locomo_suspend(struct platform_device *dev, 
pm_message_t state)
return 0;
 }
 
-static int locomo_resume(struct platform_device *dev)
+static int locomo_resume(struct device *dev)
 {
-   struct locomo *lchip = platform_get_drvdata(dev);
+   struct locomo *lchip = platform_get_drvdata(to_platform_device(dev));
struct locomo_save_data *save;
unsigned long r;
unsigned long flags;
@@ -351,6 +351,8 @@ static int locomo_resume(struct platform_device *dev)
 
return 0;
 }
+
+static SIMPLE_DEV_PM_OPS(locomo_dev_pm_ops, locomo_suspend, locomo_resume);
 #endif
 
 
@@ -519,12 +521,11 @@ static int locomo_remove(struct platform_device *dev)
 static struct platform_driver locomo_device_driver = {
.probe  = locomo_probe,
.remove = locomo_remove,
-#ifdef CONFIG_PM
-   .suspend= locomo_suspend,
-   .resume = locomo_resume,
-#endif
.driver = {
.name   = "locomo",
+#ifdef CONFIG_PM_SLEEP
+   .pm = &locomo_dev_pm_ops,
+#endif
},
 };
 
@@ -826,14 +827,15 @@ static int locomo_match(struct device *_dev, struct 
device_driver *_drv)
return dev->devid == drv->devid;
 }
 
-static int locomo_bus_suspend(struct device *dev, pm_message_t state)
+#ifdef CONFIG_PM_SLEEP
+static int locomo_bus_suspend(struct device *dev)
 {
struct locomo_dev *ldev = LOCOMO_DEV(dev);
struct locomo_driver *drv = LOCOMO_DRV(dev->driver);
int ret = 0;
 
if (drv && drv->suspend)
-   ret = drv->suspend(ldev, state);
+   ret = drv->suspend(ldev, PMSG_SUSPEND);
return ret;
 }
 
@@ -848,6 +850,10 @@ static int locomo_bus_resume(struct device *dev)
return ret;
 }
 
+static SIMPLE_DEV_PM_OPS(locomo_bus_dev_pm_ops, locomo_bus_suspend,
+locomo_bus_resume);
+#endif
+
 static int locomo_bus_probe(struct device *dev)
 {
struct locomo_dev *ldev = LOCOMO_DEV(dev);
@@ -875,8 +881,9 @@ struct bus_type locomo_bus_type = {
.match  = locomo_match,
.probe  = locomo_bus_probe,
.remove = locomo_bus_remove,
-   .suspend= locomo_bus_suspend,
-   .resume = locomo_bus_resume,
+#ifdef CONFIG_PM_SLEEP
+   .pm = &locomo_bus_dev_pm_ops,
+#endif
 };
 
 int locomo_driver_register(struct locomo_driver *driver)
-- 
1.7.10.4



Re: [PATCH 1/2] list: add list_del_each_entry

2013-07-05 Thread Filipe David Manana
On Fri, Jul 5, 2013 at 9:41 PM, Jörn Engel  wrote:
> I have seen a lot of boilerplate code that either follows the pattern of
> while (!list_empty(head)) {
> pos = list_entry(head->next, struct foo, list);
> list_del(&pos->list);
> ...
> }
> or some variant thereof.
>
> With this patch in, people can use
> list_del_each_entry(pos, head, list) {
> ...
> }
>
> The patch also adds a list_del_each variant, even though I have
> only found a single user for that one so far.
>
> Signed-off-by: Joern Engel 
> ---
>  include/linux/list.h |   18 ++
>  1 file changed, 18 insertions(+)
>
> diff --git a/include/linux/list.h b/include/linux/list.h
> index 6a1f8df..ab39c7d 100644
> --- a/include/linux/list.h
> +++ b/include/linux/list.h
> @@ -557,6 +557,24 @@ static inline void list_splice_tail_init(struct 
> list_head *list,
>  #define list_safe_reset_next(pos, n, member)   \
> n = list_entry(pos->member.next, typeof(*pos), member)
>
> +/**
> + * list_del_each - removes an entry from the list until it is empty
> + * @pos:   the  list_head to use as a loop cursor.
> + * @head:  the head of your list.
> + */
> +#define list_del_each(pos, head) \
> +   while (list_empty(head) ? 0 : (pos = (head)->next, list_del(pos), 1))
> +
> +/**
> + * list_del_each_entry - removes an entry from the list until it is empty
> + * @pos:   the type * to use as loop cursor.
> + * @head:  the head of your list.
> + * @member:the name of the list_struct within the struct
> + */
> +#define list_del_each_entry(pos, head, member) \
> +   while (list_empty(head) && (pos = list_first_entry((head), \
> +   typeof(*pos), member), list_del((head)->next), 1))
> +

Shouldn't it be while (!list_empty(head) ... ?
(not operator addition)

thanks

>  /*
>   * Double linked lists with a single pointer list head.
>   * Mostly useful for hash tables where the two pointer list head is
> --
> 1.7.10.4
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



--
Filipe David Manana,

"Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men."


Re: MTD EEPROM support and driver integration

2013-07-05 Thread Arnd Bergmann
On Saturday 06 July 2013, Maxime Ripard wrote:
> > My first thought is that it should be more generic than that and not
> > have the mac address hardcoded as the purpose. We could possibly use
> > regmap as the in-kernel interface, and come up with a more generic
> > way of referring to registers in another device node.
> 
> Hmm, I maybe wasn't as clear as I wanted. Here mac-storage was just an
> example. It should indeed be completely generic, and a device could have
> several "storage source" defined, each driver knowing what property it
> would need, pretty much like what's done currently for the regulators
> for example.
> 
> We will have such a use case anyway for the Allwinner stuff, since the
> fuses can be used for several thing, including storing the SoC ID,
> serial numbers, and so on.

Ah, I see. In general, we have two ways of expressing the same thing
here:

a) like interrupts, regs, dmas, clocks, pinctrl, reset, pwm: fixed property 
names

regmap = < 0xstart 0xlen>;
regmap-names = "mac-address";

b) like gpio, regulator: variable property names

mac-storage = < 0xstart 0xlen>;

It's unfortunate that we already have examples of both. They are largely
equivalent, but the tendency is towards the first.

Arnd


Re: MTD EEPROM support and driver integration

2013-07-05 Thread Maxime Ripard
Hi Arnd,

On Fri, Jul 05, 2013 at 11:02:40PM +0200, Arnd Bergmann wrote:
> On Friday 05 July 2013, Maxime Ripard wrote:
> > Hi everyone,
> > 
> > In the last weeks, we've had two drivers come up, both of them very
> > simple drivers that expose a few bytes of memory-mapped IO to the
> > userspace. Both will probably live under drivers/misc/eeprom, where there's
> > support already for other kind of what could be assimilated as eeproms
> > (AT24, AT25, etc.).
> > 
> > Now, besides the code duplication every driver has to make to register
> > the sysfs attributes, it wouldn't cause much of a problem.
> > 
> > Except that these EEPROMs could store values useful for other drivers in
> > Linux. For example:
> >   - imx28 OCOTP (the latest of the two EEPROM drivers recently submitted)
> > uses it most of the time to store its MAC addresses
> >   - Allwinner SoCs' SID (the other latest driver submitted) uses it
> > sometimes to store a MAC address, and also the SoC ID, which could be
> > useful to implement a SoC bus driver.
> >   - Some Allwinner boards (and presumably others) use an AT24 to store
> > the MAC address in it.
> > 
> > For now, everyone comes up with a different solution:
> >   - imx28 has a hook in mach-mxs to patch the device tree at runtime and
> > add the values retrieved from the OCOTP to it.
> >   - AT24 seem to have some function exported (at24_macc_(read|write)) so
> > that other part of the kernel can use them to retrieve values from
> > such an EEPROM.
> >   - Allwinner SoCs have, well, basically nothing for now, which is why I
> > send this email.
> > 
> > The current way of working has several flaws imho:
> >   - The code is heavily duplicated: the sysfs registering is common to
> > every driver, and at the end of the day, every driver should only
> > give a read/write callback, and that's it.
> >   - The "consumer" drivers should not have to worry at all about the
> > EEPROM type it should retrieve the needed value from, let alone
> > dealing with the number of instances of such an EEPROM.
> > 
> > To solve these issues, I think I have a solution. Would merging these
> > drivers into MTD make some sense? It seems there are already some
> > EEPROM drivers in drivers/mtd (such as an AT25 one, which also has a
> > driver under drivers/misc/eeprom), so I guess it does, but I'd like to
> > have your opinion on this.
> 
> I always felt that we should eventually move the eeprom drivers out of
> drivers/misc into their own subsystem. Moving them under drivers/mtd
> also seems reasonable.
> 
> Having a common API sounds like a good idea, and we should probably
> spend some time comparing the options.

Great :)

> > If so, would some kind of in-kernel MTD API to retrieve values from MTD
> > devices be acceptable to you? I mostly have DT in mind, so I'm
> > thinking of having DT bindings to that API such as
> > 
> >   mac-storage = <&eeprom 0xoffset 0xsize>
> > 
> > to describe which device to get a value from, and where in that device.
> > 
> > That would allow consumer drivers to only have to call a function like
> > of_mtd_get_value and let the MTD subsystem do the hard work.
> > 
> > What's your feeling on this?
> 
> My first thought is that it should be more generic than that and not
> have the mac address hardcoded as the purpose. We could possibly use
> regmap as the in-kernel interface, and come up with a more generic
> way of referring to registers in another device node.

Hmm, maybe I wasn't as clear as I wanted to be. Here mac-storage was just
an example. It should indeed be completely generic, and a device could have
several "storage sources" defined, each driver knowing which property it
needs, pretty much like what's currently done for regulators, for
example.

We will have such a use case anyway for the Allwinner stuff, since the
fuses can be used for several things, including storing the SoC ID,
serial numbers, and so on.
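As a concrete illustration of the kind of binding being discussed, a
provider/consumer pair could look like the sketch below. All node names,
offsets, and sizes here are hypothetical, since no binding exists yet:

```dts
/* Hypothetical sketch: an AT24 EEPROM acting as the storage provider,
 * and an Ethernet controller consuming a 6-byte MAC address stored at
 * offset 0x40 in that EEPROM. */
eeprom: eeprom@50 {
	compatible = "atmel,24c32";
	reg = <0x50>;
};

ethernet@01c0b000 {
	compatible = "allwinner,sun4i-a10-emac";
	/* <provider phandle, offset, size> */
	mac-storage = <&eeprom 0x40 0x6>;
};
```

A consumer driver would then call something like the proposed
of_mtd_get_value() with its own node and the property name, and would
never need to know which kind of EEPROM backs the storage.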

Maxime

-- 
Maxime Ripard, Free Electrons
Embedded Linux, Kernel and Android engineering
http://free-electrons.com




RE: Yet more softlockups.

2013-07-05 Thread Thomas Gleixner
On Fri, 5 Jul 2013, Seiji Aguchi wrote:
> > -Original Message-
> > Hmmm... this makes me wonder if the interrupt tracepoint stuff is at
> > fault here, as it changes the IDT handling for NMI context.
> 
> This softlockup happens while the interrupt tracepoints are disabled,
> because if they were enabled, "smp_trace_apic_timer_interrupt" would be
> displayed instead of "smp_apic_timer_interrupt" in the call trace below.
> 
> But I can't say how this issue is related to the tracepoint stuff.

I doubt it is related. I rather suspect that trinity is able to fuzz
perf into a DoS facility.

Dave, can you trace the perf sys calls and dump that data over serial
when the softlockup hits?

Thanks,

tglx


[PATCHv4 00/10] clocksource: sunxi: Timer fixes and cleanup

2013-07-05 Thread Maxime Ripard
Hi everyone,

The first timer code we merged when adding support for the A13 some
time back was mostly a cleanup of the source drop we had, without any
documentation. It happened to work, but the merged code turned out to
be far from perfect and had several flaws.

This patchset hopefully fixes these flaws, and cleans up most of the
driver as well, ending up in an almost complete rewrite (even though
it's not that long).

It also finally adds a clocksource driver using the second timer as
our monotonic clock source.

These flaws were all spotted while trying to add A31 support, work that
is still ongoing but will hopefully benefit from this patchset as well.

Thanks,
Maxime

Changes from v3:
  - Reintroduce the rate variable to cache the parent clock rate
  - Remove the interval programming at probe time that was
reintroduced in the v3 due to a poor rebase.

Changes from v2:
  - Use the clocksource timer to get the amount of time we have to
wait for when disabling and enabling back a timer
  - Added patch to add parenthesis around the macros arguments
  - Renamed the AUTORELOAD register define to the more meaningful
RELOAD name

Changes from v1:
  - Rebased on top of linux-next to benefit from the move to all
architectures of the sched_clock functions
  - Moved the clock source to the second timer instead of the 64 bits
free-running counter like suggested by Thomas.

Maxime Ripard (10):
  clocksource: sun4i: Use the BIT macros where possible
  clocksource: sun4i: Wrap macros arguments in parenthesis
  clocksource: sun4i: rename AUTORELOAD define to RELOAD
  clocksource: sun4i: Add clocksource and sched clock drivers
  clocksource: sun4i: Don't forget to enable the clock we use
  clocksource: sun4i: Fix the next event code
  clocksource: sun4i: Factor out some timer code
  clocksource: sun4i: Remove TIMER_SCAL variable
  clocksource: sun4i: Cleanup parent clock setup
  clocksource: sun4i: Fix bug when switching from periodic to oneshot
modes

 drivers/clocksource/sun4i_timer.c | 110 +++---
 1 file changed, 78 insertions(+), 32 deletions(-)

-- 
1.8.3.2



[PATCHv4 01/10] clocksource: sun4i: Use the BIT macros where possible

2013-07-05 Thread Maxime Ripard
Signed-off-by: Maxime Ripard 
---
 drivers/clocksource/sun4i_timer.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/clocksource/sun4i_timer.c 
b/drivers/clocksource/sun4i_timer.c
index d4674e7..bdf34d9 100644
--- a/drivers/clocksource/sun4i_timer.c
+++ b/drivers/clocksource/sun4i_timer.c
@@ -24,12 +24,12 @@
 #include 
 
 #define TIMER_IRQ_EN_REG   0x00
-#define TIMER_IRQ_EN(val)  (1 << val)
+#define TIMER_IRQ_EN(val)  BIT(val)
 #define TIMER_IRQ_ST_REG   0x04
 #define TIMER_CTL_REG(val) (0x10 * val + 0x10)
-#define TIMER_CTL_ENABLE   (1 << 0)
-#define TIMER_CTL_AUTORELOAD   (1 << 1)
-#define TIMER_CTL_ONESHOT  (1 << 7)
+#define TIMER_CTL_ENABLE   BIT(0)
+#define TIMER_CTL_AUTORELOAD   BIT(1)
+#define TIMER_CTL_ONESHOT  BIT(7)
 #define TIMER_INTVAL_REG(val)  (0x10 * val + 0x14)
 #define TIMER_CNTVAL_REG(val)  (0x10 * val + 0x18)
 
-- 
1.8.3.2



[PATCHv4 03/10] clocksource: sun4i: rename AUTORELOAD define to RELOAD

2013-07-05 Thread Maxime Ripard
The name AUTORELOAD was actually pretty bad, since the bit doesn't make
the timer reload the previous interval when it expires; rather, setting
it pushes the newly programmed interval into the internal timer counter.
Rename it to RELOAD instead.

Signed-off-by: Maxime Ripard 
---
 drivers/clocksource/sun4i_timer.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/clocksource/sun4i_timer.c 
b/drivers/clocksource/sun4i_timer.c
index 34ab658..f5e227b 100644
--- a/drivers/clocksource/sun4i_timer.c
+++ b/drivers/clocksource/sun4i_timer.c
@@ -28,7 +28,7 @@
 #define TIMER_IRQ_ST_REG   0x04
 #define TIMER_CTL_REG(val) (0x10 * val + 0x10)
 #define TIMER_CTL_ENABLE   BIT(0)
-#define TIMER_CTL_AUTORELOAD   BIT(1)
+#define TIMER_CTL_RELOAD   BIT(1)
 #define TIMER_CTL_ONESHOT  BIT(7)
 #define TIMER_INTVAL_REG(val)  (0x10 * (val) + 0x14)
 #define TIMER_CNTVAL_REG(val)  (0x10 * (val) + 0x18)
@@ -129,7 +129,7 @@ static void __init sun4i_timer_init(struct device_node 
*node)
 
/* set mode to auto reload */
val = readl(timer_base + TIMER_CTL_REG(0));
-   writel(val | TIMER_CTL_AUTORELOAD, timer_base + TIMER_CTL_REG(0));
+   writel(val | TIMER_CTL_RELOAD, timer_base + TIMER_CTL_REG(0));
 
	ret = setup_irq(irq, &sun4i_timer_irq);
if (ret)
-- 
1.8.3.2



[PATCHv4 04/10] clocksource: sun4i: Add clocksource and sched clock drivers

2013-07-05 Thread Maxime Ripard
Use the second timer found on the Allwinner SoCs as a clocksource and
sched clock, both of which were not used yet on these platforms.

Signed-off-by: Maxime Ripard 
---
 drivers/clocksource/sun4i_timer.c | 15 +++
 1 file changed, 15 insertions(+)

diff --git a/drivers/clocksource/sun4i_timer.c 
b/drivers/clocksource/sun4i_timer.c
index f5e227b..b581c93 100644
--- a/drivers/clocksource/sun4i_timer.c
+++ b/drivers/clocksource/sun4i_timer.c
@@ -19,6 +19,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -96,6 +97,11 @@ static struct irqaction sun4i_timer_irq = {
	.dev_id = &sun4i_clockevent,
 };
 
+static u32 sun4i_timer_sched_read(void)
+{
+   return ~readl(timer_base + TIMER_CNTVAL_REG(1));
+}
+
 static void __init sun4i_timer_init(struct device_node *node)
 {
unsigned long rate = 0;
@@ -117,6 +123,15 @@ static void __init sun4i_timer_init(struct device_node 
*node)
 
rate = clk_get_rate(clk);
 
+   writel(~0, timer_base + TIMER_INTVAL_REG(1));
+   writel(TIMER_CTL_ENABLE | TIMER_CTL_RELOAD |
+  TIMER_CTL_CLK_SRC(TIMER_CTL_CLK_SRC_OSC24M),
+  timer_base + TIMER_CTL_REG(1));
+
+   setup_sched_clock(sun4i_timer_sched_read, 32, rate);
+   clocksource_mmio_init(timer_base + TIMER_CNTVAL_REG(1), node->name,
+ rate, 300, 32, clocksource_mmio_readl_down);
+
writel(rate / (TIMER_SCAL * HZ),
   timer_base + TIMER_INTVAL_REG(0));
 
-- 
1.8.3.2



[PATCHv4 05/10] clocksource: sun4i: Don't forget to enable the clock we use

2013-07-05 Thread Maxime Ripard
Even though in our case this clock is non-gatable and used as a parent
clock for several IPs, it is still a good idea to enable it.

Signed-off-by: Maxime Ripard 
---
 drivers/clocksource/sun4i_timer.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/clocksource/sun4i_timer.c 
b/drivers/clocksource/sun4i_timer.c
index b581c93..8e9c651 100644
--- a/drivers/clocksource/sun4i_timer.c
+++ b/drivers/clocksource/sun4i_timer.c
@@ -120,6 +120,7 @@ static void __init sun4i_timer_init(struct device_node 
*node)
clk = of_clk_get(node, 0);
if (IS_ERR(clk))
panic("Can't get timer clock");
+   clk_prepare_enable(clk);
 
rate = clk_get_rate(clk);
 
-- 
1.8.3.2



[PATCHv4 08/10] clocksource: sun4i: Remove TIMER_SCAL variable

2013-07-05 Thread Maxime Ripard
The prescaler is only used when using the internal low frequency
oscillator (at 32kHz). Since we're using the higher frequency oscillator
at 24MHz, we can just remove it.

Signed-off-by: Maxime Ripard 
---
 drivers/clocksource/sun4i_timer.c | 9 +++--
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/drivers/clocksource/sun4i_timer.c 
b/drivers/clocksource/sun4i_timer.c
index dd78b63..3217adc 100644
--- a/drivers/clocksource/sun4i_timer.c
+++ b/drivers/clocksource/sun4i_timer.c
@@ -34,8 +34,6 @@
 #define TIMER_INTVAL_REG(val)  (0x10 * (val) + 0x14)
 #define TIMER_CNTVAL_REG(val)  (0x10 * (val) + 0x18)
 
-#define TIMER_SCAL 16
-
 static void __iomem *timer_base;
 
 /*
@@ -168,8 +166,7 @@ static void __init sun4i_timer_init(struct device_node 
*node)
clocksource_mmio_init(timer_base + TIMER_CNTVAL_REG(1), node->name,
  rate, 300, 32, clocksource_mmio_readl_down);
 
-   writel(rate / (TIMER_SCAL * HZ),
-  timer_base + TIMER_INTVAL_REG(0));
+   writel(rate / HZ, timer_base + TIMER_INTVAL_REG(0));
 
/* set clock source to HOSC, 16 pre-division */
val = readl(timer_base + TIMER_CTL_REG(0));
@@ -192,8 +189,8 @@ static void __init sun4i_timer_init(struct device_node 
*node)
 
sun4i_clockevent.cpumask = cpumask_of(0);
 
-   clockevents_config_and_register(&sun4i_clockevent, rate / TIMER_SCAL,
-   0x1, 0xff);
+   clockevents_config_and_register(&sun4i_clockevent, rate, 0x1,
+   0xffffffff);
 }
 CLOCKSOURCE_OF_DECLARE(sun4i, "allwinner,sun4i-timer",
   sun4i_timer_init);
-- 
1.8.3.2



[PATCHv4 07/10] clocksource: sun4i: Factor out some timer code

2013-07-05 Thread Maxime Ripard
The set_next_event and set_mode callbacks share a lot of common code we
can easily factor to avoid duplication and mistakes.

Signed-off-by: Maxime Ripard 
---
 drivers/clocksource/sun4i_timer.c | 48 ++-
 1 file changed, 32 insertions(+), 16 deletions(-)

diff --git a/drivers/clocksource/sun4i_timer.c 
b/drivers/clocksource/sun4i_timer.c
index 7123f65..dd78b63 100644
--- a/drivers/clocksource/sun4i_timer.c
+++ b/drivers/clocksource/sun4i_timer.c
@@ -52,24 +52,46 @@ static void sun4i_clkevt_sync(void)
cpu_relax();
 }
 
+static void sun4i_clkevt_time_stop(u8 timer)
+{
+   u32 val = readl(timer_base + TIMER_CTL_REG(timer));
+   writel(val & ~TIMER_CTL_ENABLE, timer_base + TIMER_CTL_REG(timer));
+   sun4i_clkevt_sync();
+}
+
+static void sun4i_clkevt_time_setup(u8 timer, unsigned long delay)
+{
+   writel(delay, timer_base + TIMER_INTVAL_REG(timer));
+}
+
+static void sun4i_clkevt_time_start(u8 timer, bool periodic)
+{
+   u32 val = readl(timer_base + TIMER_CTL_REG(timer));
+
+   if (periodic)
+   val &= ~TIMER_CTL_ONESHOT;
+   else
+   val |= TIMER_CTL_ONESHOT;
+
+   writel(val | TIMER_CTL_ENABLE, timer_base + TIMER_CTL_REG(timer));
+}
+
 static void sun4i_clkevt_mode(enum clock_event_mode mode,
  struct clock_event_device *clk)
 {
-   u32 u = readl(timer_base + TIMER_CTL_REG(0));
-
switch (mode) {
case CLOCK_EVT_MODE_PERIODIC:
-   u &= ~(TIMER_CTL_ONESHOT);
-   writel(u | TIMER_CTL_ENABLE, timer_base + TIMER_CTL_REG(0));
+   sun4i_clkevt_time_stop(0);
+   sun4i_clkevt_time_start(0, true);
break;
-
case CLOCK_EVT_MODE_ONESHOT:
-   writel(u | TIMER_CTL_ONESHOT, timer_base + TIMER_CTL_REG(0));
+   sun4i_clkevt_time_stop(0);
+   sun4i_clkevt_time_start(0, false);
break;
case CLOCK_EVT_MODE_UNUSED:
case CLOCK_EVT_MODE_SHUTDOWN:
default:
-   writel(u & ~(TIMER_CTL_ENABLE), timer_base + TIMER_CTL_REG(0));
+   sun4i_clkevt_time_stop(0);
break;
}
 }
@@ -77,15 +99,9 @@ static void sun4i_clkevt_mode(enum clock_event_mode mode,
 static int sun4i_clkevt_next_event(unsigned long evt,
   struct clock_event_device *unused)
 {
-   u32 val = readl(timer_base + TIMER_CTL_REG(0));
-   writel(val & ~TIMER_CTL_ENABLE, timer_base + TIMER_CTL_REG(0));
-   sun4i_clkevt_sync();
-
-   writel(evt, timer_base + TIMER_INTVAL_REG(0));
-
-   val = readl(timer_base + TIMER_CTL_REG(0));
-   writel(val | TIMER_CTL_ENABLE | TIMER_CTL_AUTORELOAD,
-  timer_base + TIMER_CTL_REG(0));
+   sun4i_clkevt_time_stop(0);
+   sun4i_clkevt_time_setup(0, evt);
+   sun4i_clkevt_time_start(0, false);
 
return 0;
 }
-- 
1.8.3.2



[PATCHv4 10/10] clocksource: sun4i: Fix bug when switching from periodic to oneshot modes

2013-07-05 Thread Maxime Ripard
The interval the timer was firing at was set up at probe time, only
changed in set_next_event, and never changed back, which is not really
what is expected.

When enabling the periodic mode, now set up an interval to tick every
jiffy.

Signed-off-by: Maxime Ripard 
---
 drivers/clocksource/sun4i_timer.c | 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/drivers/clocksource/sun4i_timer.c 
b/drivers/clocksource/sun4i_timer.c
index 2fadb3b..d00d50a 100644
--- a/drivers/clocksource/sun4i_timer.c
+++ b/drivers/clocksource/sun4i_timer.c
@@ -38,6 +38,7 @@
 #define TIMER_CNTVAL_REG(val)  (0x10 * (val) + 0x18)
 
 static void __iomem *timer_base;
+static u32 ticks_per_jiffy;
 
 /*
  * When we disable a timer, we need to wait at least for 2 cycles of
@@ -74,7 +75,8 @@ static void sun4i_clkevt_time_start(u8 timer, bool periodic)
else
val |= TIMER_CTL_ONESHOT;
 
-   writel(val | TIMER_CTL_ENABLE, timer_base + TIMER_CTL_REG(timer));
+   writel(val | TIMER_CTL_ENABLE | TIMER_CTL_RELOAD,
+  timer_base + TIMER_CTL_REG(timer));
 }
 
 static void sun4i_clkevt_mode(enum clock_event_mode mode,
@@ -83,6 +85,7 @@ static void sun4i_clkevt_mode(enum clock_event_mode mode,
switch (mode) {
case CLOCK_EVT_MODE_PERIODIC:
sun4i_clkevt_time_stop(0);
+   sun4i_clkevt_time_setup(0, ticks_per_jiffy);
sun4i_clkevt_time_start(0, true);
break;
case CLOCK_EVT_MODE_ONESHOT:
@@ -169,9 +172,9 @@ static void __init sun4i_timer_init(struct device_node 
*node)
clocksource_mmio_init(timer_base + TIMER_CNTVAL_REG(1), node->name,
  rate, 300, 32, clocksource_mmio_readl_down);
 
-   writel(rate / HZ, timer_base + TIMER_INTVAL_REG(0));
+   ticks_per_jiffy = DIV_ROUND_UP(clk_get_rate(clk), HZ);
 
-   writel(TIMER_CTL_CLK_SRC(TIMER_CTL_CLK_SRC_OSC24M) | TIMER_CTL_RELOAD,
+   writel(TIMER_CTL_CLK_SRC(TIMER_CTL_CLK_SRC_OSC24M),
   timer_base + TIMER_CTL_REG(0));
 
	ret = setup_irq(irq, &sun4i_timer_irq);
-- 
1.8.3.2



[PATCHv4 09/10] clocksource: sun4i: Cleanup parent clock setup

2013-07-05 Thread Maxime Ripard
The current bring-up code for the timer was overly complicated. The only
thing we actually need to choose is which clock to use as the source,
and that's pretty much it. Let's keep it that way.

Signed-off-by: Maxime Ripard 
---
 drivers/clocksource/sun4i_timer.c | 15 +--
 1 file changed, 5 insertions(+), 10 deletions(-)

diff --git a/drivers/clocksource/sun4i_timer.c 
b/drivers/clocksource/sun4i_timer.c
index 3217adc..2fadb3b 100644
--- a/drivers/clocksource/sun4i_timer.c
+++ b/drivers/clocksource/sun4i_timer.c
@@ -30,6 +30,9 @@
 #define TIMER_CTL_REG(val) (0x10 * val + 0x10)
 #define TIMER_CTL_ENABLE   BIT(0)
 #define TIMER_CTL_RELOAD   BIT(1)
+#define TIMER_CTL_CLK_SRC(val) (((val) & 0x3) << 2)
+#define TIMER_CTL_CLK_SRC_OSC24M   (1)
+#define TIMER_CTL_CLK_PRES(val)(((val) & 0x7) << 4)
 #define TIMER_CTL_ONESHOT  BIT(7)
 #define TIMER_INTVAL_REG(val)  (0x10 * (val) + 0x14)
 #define TIMER_CNTVAL_REG(val)  (0x10 * (val) + 0x18)
@@ -168,16 +171,8 @@ static void __init sun4i_timer_init(struct device_node 
*node)
 
writel(rate / HZ, timer_base + TIMER_INTVAL_REG(0));
 
-   /* set clock source to HOSC, 16 pre-division */
-   val = readl(timer_base + TIMER_CTL_REG(0));
-   val &= ~(0x07 << 4);
-   val &= ~(0x03 << 2);
-   val |= (4 << 4) | (1 << 2);
-   writel(val, timer_base + TIMER_CTL_REG(0));
-
-   /* set mode to auto reload */
-   val = readl(timer_base + TIMER_CTL_REG(0));
-   writel(val | TIMER_CTL_RELOAD, timer_base + TIMER_CTL_REG(0));
+   writel(TIMER_CTL_CLK_SRC(TIMER_CTL_CLK_SRC_OSC24M) | TIMER_CTL_RELOAD,
+  timer_base + TIMER_CTL_REG(0));
 
	ret = setup_irq(irq, &sun4i_timer_irq);
if (ret)
-- 
1.8.3.2



Re: [URGENT rfc patch 0/3] tsc clocksource bug fix

2013-07-05 Thread Thomas Gleixner
On Fri, 5 Jul 2013, Borislav Petkov wrote:
> On Fri, Jul 05, 2013 at 11:50:05PM +0200, Thomas Gleixner wrote:
> > Yeah, but our well justified paranoia still prevents us from trusting
> > these CPU flags. Maybe some day BIOS is going to be replaced by
> > something useful. You know: Hope springs eternal
> 
> Not in the next 10 yrs at least if one took a look at the
> overengineered, obese at birth and braindead crap by the name of UEFI.

Good news! 10 years is way less than eternity and just before
retirement :)


[PATCHv4 06/10] clocksource: sun4i: Fix the next event code

2013-07-05 Thread Maxime Ripard
The next_event logic was writing the next interval to fire into the
current timer value register instead of the interval value register,
which is obviously wrong.

Plus, the logic to set the actual value was wrong as well: the interval
register can only be modified while the timer is disabled, and the timer
then has to be enabled again, otherwise the write has no effect. Fix
this logic as well, since that code couldn't possibly work.

Signed-off-by: Maxime Ripard 
---
 drivers/clocksource/sun4i_timer.c | 25 ++---
 1 file changed, 22 insertions(+), 3 deletions(-)

diff --git a/drivers/clocksource/sun4i_timer.c 
b/drivers/clocksource/sun4i_timer.c
index 8e9c651..7123f65 100644
--- a/drivers/clocksource/sun4i_timer.c
+++ b/drivers/clocksource/sun4i_timer.c
@@ -38,6 +38,20 @@
 
 static void __iomem *timer_base;
 
+/*
+ * When we disable a timer, we need to wait at least for 2 cycles of
+ * the timer source clock. We will use for that the clocksource timer
+ * that is already set up, runs at the same frequency as the other
+ * timers, and will never be disabled.
+ */
+static void sun4i_clkevt_sync(void)
+{
+   u32 old = readl(timer_base + TIMER_CNTVAL_REG(1));
+
+   while ((old - readl(timer_base + TIMER_CNTVAL_REG(1))) < 3)
+   cpu_relax();
+}
+
 static void sun4i_clkevt_mode(enum clock_event_mode mode,
  struct clock_event_device *clk)
 {
@@ -63,9 +77,14 @@ static void sun4i_clkevt_mode(enum clock_event_mode mode,
 static int sun4i_clkevt_next_event(unsigned long evt,
   struct clock_event_device *unused)
 {
-   u32 u = readl(timer_base + TIMER_CTL_REG(0));
-   writel(evt, timer_base + TIMER_CNTVAL_REG(0));
-   writel(u | TIMER_CTL_ENABLE | TIMER_CTL_AUTORELOAD,
+   u32 val = readl(timer_base + TIMER_CTL_REG(0));
+   writel(val & ~TIMER_CTL_ENABLE, timer_base + TIMER_CTL_REG(0));
+   sun4i_clkevt_sync();
+
+   writel(evt, timer_base + TIMER_INTVAL_REG(0));
+
+   val = readl(timer_base + TIMER_CTL_REG(0));
+   writel(val | TIMER_CTL_ENABLE | TIMER_CTL_AUTORELOAD,
   timer_base + TIMER_CTL_REG(0));
 
return 0;
-- 
1.8.3.2



[PATCHv4 02/10] clocksource: sun4i: Wrap macros arguments in parenthesis

2013-07-05 Thread Maxime Ripard
The macros were not wrapping their arguments in parentheses. This is
pretty unsafe, so add those parentheses.

Signed-off-by: Maxime Ripard 
---
 drivers/clocksource/sun4i_timer.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/clocksource/sun4i_timer.c 
b/drivers/clocksource/sun4i_timer.c
index bdf34d9..34ab658 100644
--- a/drivers/clocksource/sun4i_timer.c
+++ b/drivers/clocksource/sun4i_timer.c
@@ -30,8 +30,8 @@
 #define TIMER_CTL_ENABLE   BIT(0)
 #define TIMER_CTL_AUTORELOAD   BIT(1)
 #define TIMER_CTL_ONESHOT  BIT(7)
-#define TIMER_INTVAL_REG(val)  (0x10 * val + 0x14)
-#define TIMER_CNTVAL_REG(val)  (0x10 * val + 0x18)
+#define TIMER_INTVAL_REG(val)  (0x10 * (val) + 0x14)
+#define TIMER_CNTVAL_REG(val)  (0x10 * (val) + 0x18)
 
 #define TIMER_SCAL 16
 
-- 
1.8.3.2



Re: [PATCHv3 08/10] clocksource: sun4i: Remove TIMER_SCAL variable

2013-07-05 Thread Maxime Ripard
On Fri, Jul 05, 2013 at 10:48:45PM +0200, Thomas Gleixner wrote:
> 
> 
> On Fri, 5 Jul 2013, Maxime Ripard wrote:
> 
> > The prescaler is only used when using the internal low frequency
> > oscillator (at 32kHz). Since we're using the higher frequency oscillator
> > at 24MHz, we can just remove it.
> > 
> > Signed-off-by: Maxime Ripard 
> > ---
> >  drivers/clocksource/sun4i_timer.c | 11 +++
> >  1 file changed, 3 insertions(+), 8 deletions(-)
> > 
> > diff --git a/drivers/clocksource/sun4i_timer.c 
> > b/drivers/clocksource/sun4i_timer.c
> > index 00e17d9..2f84075 100644
> > --- a/drivers/clocksource/sun4i_timer.c
> > +++ b/drivers/clocksource/sun4i_timer.c
> > @@ -34,8 +34,6 @@
> >  #define TIMER_INTVAL_REG(val)  (0x10 * (val) + 0x14)
> >  #define TIMER_CNTVAL_REG(val)  (0x10 * (val) + 0x18)
> >  
> > -#define TIMER_SCAL 16
> > -
> 
> I can understand this one.
> 
> >  static void __iomem *timer_base;
> >  
> >  /*
> > @@ -139,7 +137,6 @@ static u32 sun4i_timer_sched_read(void)
> >  
> >  static void __init sun4i_timer_init(struct device_node *node)
> >  {
> > -   unsigned long rate = 0;
> > struct clk *clk;
> > int ret, irq;
> > u32 val;
> > @@ -157,8 +154,6 @@ static void __init sun4i_timer_init(struct device_node 
> > *node)
> > panic("Can't get timer clock");
> > clk_prepare_enable(clk);
> >  
> > -   rate = clk_get_rate(clk);
> > -
> 
> But this one is bogus. Why do you want to read the clock rate five
> times in a row instead of using the single cached value?
> 
> That does not make any sense.

Right, I'll send a new version.

Thanks!
Maxime

-- 
Maxime Ripard, Free Electrons
Embedded Linux, Kernel and Android engineering
http://free-electrons.com




Re: [PATCH 3/3] i915: Don't provide ACPI backlight interface if firmware expects Windows 8

2013-07-05 Thread Rafael J. Wysocki
On Friday, July 05, 2013 11:40:02 PM Rafael J. Wysocki wrote:
> On Friday, July 05, 2013 10:00:55 PM Rafael J. Wysocki wrote:
> > On Friday, July 05, 2013 02:20:14 PM Rafael J. Wysocki wrote:
> > > On Sunday, June 09, 2013 07:01:39 PM Matthew Garrett wrote:
> > > > Windows 8 leaves backlight control up to individual graphics drivers 
> > > > rather
> > > > than making ACPI calls itself. There's plenty of evidence to suggest 
> > > > that
> > > > the Intel driver for Windows doesn't use the ACPI interface, including 
> > > > the
> > > > fact that it's broken on a bunch of machines when the OS claims to 
> > > > support
> > > > Windows 8. The simplest thing to do appears to be to disable the ACPI
> > > > backlight interface on these systems.
> > > > 
> > > > Signed-off-by: Matthew Garrett 
> > > > ---
> > > >  drivers/gpu/drm/i915/i915_dma.c | 3 +++
> > > >  1 file changed, 3 insertions(+)
> > > > 
> > > > diff --git a/drivers/gpu/drm/i915/i915_dma.c 
> > > > b/drivers/gpu/drm/i915/i915_dma.c
> > > > index 3b315ba..23b6292 100644
> > > > --- a/drivers/gpu/drm/i915/i915_dma.c
> > > > +++ b/drivers/gpu/drm/i915/i915_dma.c
> > > > @@ -1661,6 +1661,9 @@ int i915_driver_load(struct drm_device *dev, 
> > > > unsigned long flags)
> > > > /* Must be done after probing outputs */
> > > > intel_opregion_init(dev);
> > > > acpi_video_register();
> > > > +   /* Don't use ACPI backlight functions on Windows 8 
> > > > platforms */
> > > > +   if (acpi_osi_version() >= ACPI_OSI_WIN_8)
> > > > +   acpi_video_backlight_unregister();
> > > > }
> > > >  
> > > > if (IS_GEN5(dev))
> > > > 
> > > 
> > > Well, this causes build failures to happen when the ACPI video driver is
> > > modular and the graphics driver is not.
> > > 
> > > I'm not sure how to resolve that, so suggestions are welcome.
> > 
> > Actually, that happened with the radeon patch.
> > 
> > That said, ACPI_OSI_WIN_8 doesn't make much sense for !CONFIG_ACPI, for
> > example.
> > 
> > What about making acpi_video_register() do the quirk instead?  We could add 
> > an
> > argument to it indicating whether or not quirks should be applied.
> 
> Actually, I wonder what about the appended patch (on top of the Aaron's
> https://patchwork.kernel.org/patch/2812951/) instead of [1-3/3] from this 
> series.

Or even something as simple as this one.

---
 drivers/acpi/video_detect.c |3 +++
 1 file changed, 3 insertions(+)

Index: linux-pm/drivers/acpi/video_detect.c
===
--- linux-pm.orig/drivers/acpi/video_detect.c
+++ linux-pm/drivers/acpi/video_detect.c
@@ -203,6 +203,9 @@ long acpi_video_get_capabilities(acpi_ha
 */
 
dmi_check_system(video_detect_dmi_table);
+
+   if (acpi_gbl_osi_data >= ACPI_OSI_WIN_8)
+   acpi_video_support |= ACPI_VIDEO_BACKLIGHT_FORCE_VENDOR;
} else {
status = acpi_bus_get_device(graphics_handle, _dev);
if (ACPI_FAILURE(status)) {



[PATCH 2/2] btrfs: use list_del_each_entry

2013-07-05 Thread Jörn Engel
Signed-off-by: Joern Engel 
---
 fs/btrfs/backref.c  |   15 +++
 fs/btrfs/compression.c  |4 +---
 fs/btrfs/disk-io.c  |6 +-
 fs/btrfs/extent-tree.c  |   17 +++--
 fs/btrfs/extent_io.c|8 ++--
 fs/btrfs/inode.c|   16 +++-
 fs/btrfs/ordered-data.c |7 +--
 fs/btrfs/qgroup.c   |   22 --
 fs/btrfs/relocation.c   |6 +-
 fs/btrfs/scrub.c|9 +++--
 fs/btrfs/transaction.c  |5 +
 fs/btrfs/volumes.c  |   11 ++-
 12 files changed, 25 insertions(+), 101 deletions(-)

diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
index bd605c8..3a45e75 100644
--- a/fs/btrfs/backref.c
+++ b/fs/btrfs/backref.c
@@ -893,9 +893,7 @@ again:
if (ret)
goto out;
 
-   while (!list_empty(&prefs)) {
-   ref = list_first_entry(&prefs, struct __prelim_ref, list);
-   list_del(&ref->list);
+   list_del_each_entry(ref, &prefs, list) {
WARN_ON(ref->count < 0);
if (ref->count && ref->root_id && ref->parent == 0) {
/* no parent == root of tree */
@@ -937,17 +935,10 @@ again:
 
 out:
btrfs_free_path(path);
-   while (!list_empty(&prefs)) {
-   ref = list_first_entry(&prefs, struct __prelim_ref, list);
-   list_del(&ref->list);
+   list_del_each_entry(ref, &prefs, list)
kfree(ref);
-   }
-   while (!list_empty(&prefs_delayed)) {
-   ref = list_first_entry(&prefs_delayed, struct __prelim_ref,
-  list);
-   list_del(&ref->list);
+   list_del_each_entry(ref, &prefs_delayed, list)
kfree(ref);
-   }
 
return ret;
 }
diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 15b9408..e5a7475 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -841,9 +841,7 @@ static void free_workspaces(void)
int i;
 
for (i = 0; i < BTRFS_COMPRESS_TYPES; i++) {
-   while (!list_empty(&comp_idle_workspace[i])) {
-   workspace = comp_idle_workspace[i].next;
-   list_del(workspace);
+   list_del_each(workspace, &comp_idle_workspace[i]) {
btrfs_compress_op[i]->free_workspace(workspace);
atomic_dec(&comp_alloc_workspace[i]);
}
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 6d19a0a..2767b18 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3289,11 +3289,7 @@ static void del_fs_roots(struct btrfs_fs_info *fs_info)
struct btrfs_root *gang[8];
int i;
 
-   while (!list_empty(&fs_info->dead_roots)) {
-   gang[0] = list_entry(fs_info->dead_roots.next,
-struct btrfs_root, root_list);
-   list_del(&gang[0]->root_list);
-
+   list_del_each_entry(gang[0], &fs_info->dead_roots, root_list) {
if (gang[0]->in_radix) {
btrfs_free_fs_root(fs_info, gang[0]);
} else {
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 3d55123..f7afb9e 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2435,10 +2435,7 @@ int btrfs_delayed_refs_qgroup_accounting(struct btrfs_trans_handle *trans,
if (!trans->delayed_ref_elem.seq)
return 0;
 
-   while (!list_empty(&trans->qgroup_ref_list)) {
-   qgroup_update = list_first_entry(&trans->qgroup_ref_list,
-struct qgroup_update, list);
-   list_del(&qgroup_update->list);
+   list_del_each_entry(qgroup_update, &trans->qgroup_ref_list, list) {
if (!ret)
ret = btrfs_qgroup_account_ref(
trans, fs_info, qgroup_update->node,
@@ -7821,12 +7818,8 @@ int btrfs_free_block_groups(struct btrfs_fs_info *info)
struct rb_node *n;
 
down_write(&info->extent_commit_sem);
-   while (!list_empty(&info->caching_block_groups)) {
-   caching_ctl = list_entry(info->caching_block_groups.next,
-struct btrfs_caching_control, list);
-   list_del(&caching_ctl->list);
+   list_del_each_entry(caching_ctl, &info->caching_block_groups, list)
put_caching_control(caching_ctl);
-   }
up_write(&info->extent_commit_sem);
 
spin_lock(&info->block_group_cache_lock);
@@ -7868,10 +7861,7 @@ int btrfs_free_block_groups(struct btrfs_fs_info *info)
 
release_global_block_rsv(info);
 
-   while(!list_empty(&info->space_info)) {
-   space_info = list_entry(info->space_info.next,
-   struct btrfs_space_info,
-   list);
+   list_del_each_entry(space_info, &info->space_info, list) {
if (btrfs_test_opt(info->tree_root, ENOSPC_DEBUG)) {
if (space_info->bytes_pinned > 0 ||
  

Re: [PATCH 0/2] introduce list_for_each_entry_del

2013-07-05 Thread Jörn Engel
On Mon, 3 June 2013 13:28:03 -0400, Joern Engel wrote:
> 
> A purely janitorial patchset.  A fairly common pattern is to take a
> list, remove every object from it and do something with this object -
> usually kfree() some variant.  A stupid grep identified roughly 300
> instances, with many more hidden behind more complicated patterns to
> achieve the same end results.

Next version of the same patchset.  Object size is shrinking now, at
least for the one compiler I tested.  And a few kernel hackers met on
a frozen lake in hell with pigs flying overhead and could actually
agree on a name.  While I am sure almost every reader will still
disagree and have one or two better suggestions, I would like to use
this historical moment.

list_del_each and list_del_each_entry it shall be!

Jörn

--
It's just what we asked for, but not what we want!
-- anonymous
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 1/2] list: add list_del_each_entry

2013-07-05 Thread Jörn Engel
I have seen a lot of boilerplate code that either follows the pattern of
while (!list_empty(head)) {
pos = list_entry(head->next, struct foo, list);
list_del(&pos->list);
...
}
or some variant thereof.

With this patch in, people can use
list_del_each_entry(pos, head, list) {
...
}

The patch also adds a list_del_each variant, even though I have
only found a single user for that one so far.

Signed-off-by: Joern Engel 
---
 include/linux/list.h |   18 ++
 1 file changed, 18 insertions(+)

diff --git a/include/linux/list.h b/include/linux/list.h
index 6a1f8df..ab39c7d 100644
--- a/include/linux/list.h
+++ b/include/linux/list.h
@@ -557,6 +557,24 @@ static inline void list_splice_tail_init(struct list_head 
*list,
 #define list_safe_reset_next(pos, n, member)   \
n = list_entry(pos->member.next, typeof(*pos), member)
 
+/**
+ * list_del_each - removes an entry from the list until it is empty
+ * @pos:   the &struct list_head to use as a loop cursor.
+ * @head:  the head of your list.
+ */
+#define list_del_each(pos, head) \
+   while (list_empty(head) ? 0 : (pos = (head)->next, list_del(pos), 1))
+
+/**
+ * list_del_each_entry - removes an entry from the list until it is empty
+ * @pos:   the type * to use as loop cursor.
+ * @head:  the head of your list.
+ * @member:   the name of the list_struct within the struct
+ */
+#define list_del_each_entry(pos, head, member) \
+   while (list_empty(head) ? 0 : (pos = list_first_entry((head), \
+   typeof(*pos), member), list_del((head)->next), 1))
+
 /*
  * Double linked lists with a single pointer list head.
  * Mostly useful for hash tables where the two pointer list head is
-- 
1.7.10.4



Re: [PATCH] ASoC: Allow imx-pcm-{dma,fiq}.c to be modules

2013-07-05 Thread Arnd Bergmann
On Friday 05 July 2013, Mark Brown wrote:
> On Fri, Jul 05, 2013 at 10:55:10PM +0200, Arnd Bergmann wrote:
> > On Friday 05 July 2013, Mark Brown wrote:
> 
> > > Is this actually OK with the FIQ APIs?
> 
> > I don't know. Why wouldn't it?
> 
> It was the only reason I could think of why that'd have been done.

I looked in the log and found this part has been patched a couple
of times already, going back and forth between "bool" and "tristate",
always to fix build errors.

Please hold back for now, I'll try to reproduce the bug on
the current torvalds tree first. I know it was broken in linux-next
as of a few weeks ago, but something else may have changed in the
meantime.

Arnd


Re: [URGENT rfc patch 0/3] tsc clocksource bug fix

2013-07-05 Thread Borislav Petkov
On Fri, Jul 05, 2013 at 11:50:05PM +0200, Thomas Gleixner wrote:
> Yeah, but our well justified paranoia still prevents us from trusting
> these CPU flags. Maybe some day BIOS is going to be replaced by
> something useful. You know: Hope springs eternal

Not in the next 10 yrs at least if one took a look at the
overengineered, obese at birth and braindead crap by the name of UEFI.

-- 
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.


Re: [URGENT rfc patch 0/3] tsc clocksource bug fix

2013-07-05 Thread Thomas Gleixner
On Fri, 5 Jul 2013, Peter Zijlstra wrote:
> On Fri, Jul 05, 2013 at 05:24:09PM +0200, Thomas Gleixner wrote:
> > See arch/x86/kernel/tsc.c
> > 
> > We disable the watchdog for the TSC when tsc_clocksource_reliable is
> > set.
> > 
> > tsc_clocksource_reliable is set when:
> > 
> >  - you add tsc=reliable to the kernel command line
> 
> Ah, I didn't know about that one, useful.
> 
> >  - boot_cpu_has(X86_FEATURE_TSC_RELIABLE)
> >  
> >X86_FEATURE_TSC_RELIABLE is a software flag, set by vmware and
> >moorsetown. So all other machines keep the watchdog enabled.
> 
> Right.. I knew it was enabled on my machines even though they normally
> have usable TSC.

Yeah, but our well justified paranoia still prevents us from trusting
these CPU flags. Maybe some day BIOS is going to be replaced by
something useful. You know: Hope springs eternal






Re: Why no USB id list in the kernel sources?

2013-07-05 Thread Greg KH
On Fri, Jul 05, 2013 at 11:34:05PM +0200, Michael Opdenacker wrote:
> Hi,
> 
> I'm wondering why there is no include/linux/usb_ids.h (or
> include/linux/usb/ids.h) file in the same way there is a
> include/linux/pci_ids.h for PCI.

Because that way lies madness, we have learned from our mistakes and do
not want to repeat them again :)

It turns out that the pci_ids file isn't a good idea, it's a merge mess,
and only really works when you have ids that are shared across different
drivers.  In the end, that is a very small number, and it's just not
worth the time and effort to do this in a centralized way.

Hope this helps explain things, if you want more details, dig into the
linux usb mailing list about 10-15 years ago when this decision was
made.

thanks,

greg k-h


Re: [PATCH V3] ARM: add missing linker section markup to head-common.S

2013-07-05 Thread Russell King - ARM Linux
On Fri, Jul 05, 2013 at 12:10:55PM -0600, Stephen Warren wrote:
> From: Stephen Warren 
> 
> Macro __INIT is used to place various code in head-common.S into the init
> section. This should be matched by a closing __FINIT. Also, add an
> explicit ".text" to ensure subsequent code is placed into the correct
> section; __FINIT is simply a closing marker to match __INIT and doesn't
> guarantee to revert to .text.
> 
> This historically caused no problem, because macro __CPUINIT was used at
> the exact location where __FINIT was missing, which then placed following
> code into the cpuinit section. However, with commit 22f0a2736 "init.h:
> remove __cpuinit sections from the kernel" applied, __CPUINIT becomes a
> no-op, thus leaving all this code in the init section, rather than the
> regular text section. This caused issues such as secondary CPU boot
> failures or crashes.
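The bracketing Stephen describes can be sketched roughly like this
(illustrative only, not the literal hunk; the labels follow head-common.S
but the surrounding code is elided):

```
	__INIT			@ code from here is emitted into .init.text
lookup_processor_type:
	...
	__FINIT			@ closes __INIT, but selects no new section
	.text			@ explicitly return to the regular text section
__mmap_switched:
	...			@ stays resident after init memory is freed
```

Without the explicit `.text`, whatever section directive happens to follow
`__FINIT` wins, which is exactly how `__CPUINIT` used to paper over the
missing marker.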
> 
> Signed-off-by: Stephen Warren 
> Acked-by: Paul Gortmaker 
> ---
> v3: Added .text after __FINIT to force the correct section.
> v2: Moved __FINIT after lookup_processor_type, to correctly match the
> location of __CPUINIT.

Much better, thanks.  Please put it in the patch system and I'll send it
along, thanks.


Why no USB id list in the kernel sources?

2013-07-05 Thread Michael Opdenacker
Hi,

I'm wondering why there is no include/linux/usb_ids.h (or
include/linux/usb/ids.h) file in the same way there is a
include/linux/pci_ids.h for PCI.

I don't expect all product ids to be listed (the
http://www.linux-usb.org/usb.ids list is pretty big), but if we could
have at least vendor ids, it would make device tables cleaner and easier
to read, as we have in most PCI drivers. Here would be an example:

diff --git a/drivers/media/usb/gspca/pac207.c
b/drivers/media/usb/gspca/pac207.c
index 83519be..ce8c975 100644
--- a/drivers/media/usb/gspca/pac207.c
+++ b/drivers/media/usb/gspca/pac207.c
@@ -449,19 +449,19 @@ static const struct sd_desc sd_desc = {
 
 /* -- module initialisation -- */
 static const struct usb_device_id device_table[] = {
-   {USB_DEVICE(0x041e, 0x4028)},
-   {USB_DEVICE(0x093a, 0x2460)},
-   {USB_DEVICE(0x093a, 0x2461)},
-   {USB_DEVICE(0x093a, 0x2463)},
-   {USB_DEVICE(0x093a, 0x2464)},
-   {USB_DEVICE(0x093a, 0x2468)},
-   {USB_DEVICE(0x093a, 0x2470)},
-   {USB_DEVICE(0x093a, 0x2471)},
-   {USB_DEVICE(0x093a, 0x2472)},
-   {USB_DEVICE(0x093a, 0x2474)},
-   {USB_DEVICE(0x093a, 0x2476)},
-   {USB_DEVICE(0x145f, 0x013a)},
-   {USB_DEVICE(0x2001, 0xf115)},
+   {USB_DEVICE(USB_VENDOR_ID_CREATIVE, 0x4028)},
+   {USB_DEVICE(USB_VENDOR_ID_PIXART, 0x2460)},
+   {USB_DEVICE(USB_VENDOR_ID_PIXART, 0x2461)},
+   {USB_DEVICE(USB_VENDOR_ID_PIXART, 0x2463)},
+   {USB_DEVICE(USB_VENDOR_ID_PIXART, 0x2464)},
+   {USB_DEVICE(USB_VENDOR_ID_PIXART, 0x2468)},
+   {USB_DEVICE(USB_VENDOR_ID_PIXART, 0x2470)},
+   {USB_DEVICE(USB_VENDOR_ID_PIXART, 0x2471)},
+   {USB_DEVICE(USB_VENDOR_ID_PIXART, 0x2472)},
+   {USB_DEVICE(USB_VENDOR_ID_PIXART, 0x2474)},
+   {USB_DEVICE(USB_VENDOR_ID_PIXART, 0x2476)},
+   {USB_DEVICE(USB_VENDOR_ID_TRUST, 0x013a)},
+   {USB_DEVICE(USB_VENDOR_ID_DLINK, 0xf115)},
{}
 };
 MODULE_DEVICE_TABLE(usb, device_table);

Does this happen because nobody had the time to create and maintain such
a usb_ids.h file, or would there be any other reasons?

I'm asking because I don't like raw values in situations when a symbolic
name would be more self-explanatory ;)

Thanks,

Cheers,

Michael.

-- 
Michael Opdenacker, Free Electrons
Embedded Linux, Kernel and Android engineering
http://free-electrons.com
+33 484 258 098



Re: [PATCH 3/3] i915: Don't provide ACPI backlight interface if firmware expects Windows 8

2013-07-05 Thread Rafael J. Wysocki
On Friday, July 05, 2013 10:00:55 PM Rafael J. Wysocki wrote:
> On Friday, July 05, 2013 02:20:14 PM Rafael J. Wysocki wrote:
> > On Sunday, June 09, 2013 07:01:39 PM Matthew Garrett wrote:
> > > Windows 8 leaves backlight control up to individual graphics drivers 
> > > rather
> > > than making ACPI calls itself. There's plenty of evidence to suggest that
> > > the Intel driver for Windows doesn't use the ACPI interface, including the
> > > fact that it's broken on a bunch of machines when the OS claims to support
> > > Windows 8. The simplest thing to do appears to be to disable the ACPI
> > > backlight interface on these systems.
> > > 
> > > Signed-off-by: Matthew Garrett 
> > > ---
> > >  drivers/gpu/drm/i915/i915_dma.c | 3 +++
> > >  1 file changed, 3 insertions(+)
> > > 
> > > diff --git a/drivers/gpu/drm/i915/i915_dma.c b/drivers/gpu/drm/i915/i915_dma.c
> > > index 3b315ba..23b6292 100644
> > > --- a/drivers/gpu/drm/i915/i915_dma.c
> > > +++ b/drivers/gpu/drm/i915/i915_dma.c
> > > @@ -1661,6 +1661,9 @@ int i915_driver_load(struct drm_device *dev, unsigned long flags)
> > >   /* Must be done after probing outputs */
> > >   intel_opregion_init(dev);
> > >   acpi_video_register();
> > > + /* Don't use ACPI backlight functions on Windows 8 platforms */
> > > + if (acpi_osi_version() >= ACPI_OSI_WIN_8)
> > > + acpi_video_backlight_unregister();
> > >   }
> > >  
> > >   if (IS_GEN5(dev))
> > > 
> > 
> > Well, this causes build failures to happen when the ACPI video driver is
> > modular and the graphics driver is not.
> > 
> > I'm not sure how to resolve that, so suggestions are welcome.
> 
> Actually, that happened with the radeon patch.
> 
> That said, ACPI_OSI_WIN_8 doesn't make much sense for !CONFIG_ACPI, for
> example.
> 
> What about making acpi_video_register() do the quirk instead?  We could add an
> argument to it indicating whether or not quirks should be applied.

Actually, I wonder what about the appended patch (on top of Aaron's
https://patchwork.kernel.org/patch/2812951/) instead of [1-3/3] from this 
series.

Thanks,
Rafael


---
 drivers/acpi/video_detect.c |   11 +--
 include/linux/acpi.h|1 +
 2 files changed, 10 insertions(+), 2 deletions(-)

Index: linux-pm/drivers/acpi/video_detect.c
===
--- linux-pm.orig/drivers/acpi/video_detect.c
+++ linux-pm/drivers/acpi/video_detect.c
@@ -203,6 +203,9 @@ long acpi_video_get_capabilities(acpi_handle graphics_handle)
 */
 
dmi_check_system(video_detect_dmi_table);
+
+   if (acpi_gbl_osi_data >= ACPI_OSI_WIN_8)
+   acpi_video_support |= ACPI_VIDEO_FORCE_NO_BACKLIGHT;
} else {
status = acpi_bus_get_device(graphics_handle, &tmp_dev);
if (ACPI_FAILURE(status)) {
@@ -258,13 +261,17 @@ int acpi_video_backlight_support(void)
 {
acpi_video_caps_check();
 
-   /* First check for boot param -> highest prio */
+   /* First, check if no backlight support has been forced upon us. */
+   if (acpi_video_support & ACPI_VIDEO_FORCE_NO_BACKLIGHT)
+   return 0;
+
+   /* Next check for boot param -> second highest prio */
if (acpi_video_support & ACPI_VIDEO_BACKLIGHT_FORCE_VENDOR)
return 0;
else if (acpi_video_support & ACPI_VIDEO_BACKLIGHT_FORCE_VIDEO)
return 1;
 
-   /* Then check for DMI blacklist -> second highest prio */
+   /* Then check for DMI blacklist -> third highest prio */
if (acpi_video_support & ACPI_VIDEO_BACKLIGHT_DMI_VENDOR)
return 0;
else if (acpi_video_support & ACPI_VIDEO_BACKLIGHT_DMI_VIDEO)
Index: linux-pm/include/linux/acpi.h
===
--- linux-pm.orig/include/linux/acpi.h
+++ linux-pm/include/linux/acpi.h
@@ -191,6 +191,7 @@ extern bool wmi_has_guid(const char *guid);
 #define ACPI_VIDEO_BACKLIGHT_DMI_VIDEO 0x0200
 #define ACPI_VIDEO_OUTPUT_SWITCHING_DMI_VENDOR 0x0400
 #define ACPI_VIDEO_OUTPUT_SWITCHING_DMI_VIDEO  0x0800
+#define ACPI_VIDEO_FORCE_NO_BACKLIGHT  0x1000
 
 #if defined(CONFIG_ACPI_VIDEO) || defined(CONFIG_ACPI_VIDEO_MODULE)
 




Re: [URGENT rfc patch 0/3] tsc clocksource bug fix

2013-07-05 Thread Peter Zijlstra
On Fri, Jul 05, 2013 at 05:24:09PM +0200, Thomas Gleixner wrote:
> See arch/x86/kernel/tsc.c
> 
> We disable the watchdog for the TSC when tsc_clocksource_reliable is
> set.
> 
> tsc_clocksource_reliable is set when:
> 
>  - you add tsc=reliable to the kernel command line

Ah, I didn't know about that one, useful.

>  - boot_cpu_has(X86_FEATURE_TSC_RELIABLE)
>  
>X86_FEATURE_TSC_RELIABLE is a software flag, set by vmware and
>moorsetown. So all other machines keep the watchdog enabled.

Right.. I knew it was enabled on my machines even though they normally
have usable TSC.


Re: [PATCH] ASoC: Allow imx-pcm-{dma,fiq}.c to be modules

2013-07-05 Thread Mark Brown
On Fri, Jul 05, 2013 at 10:55:10PM +0200, Arnd Bergmann wrote:
> On Friday 05 July 2013, Mark Brown wrote:

> > Is this actually OK with the FIQ APIs?

> I don't know. Why wouldn't it?

It was the only reason I could think of why that'd have been done.

> Other users of the same interfaces (mx1_camera, spi-s3c24xx) can also be
> modules, so I wouldn't expect a fundamental issue.

OK.


signature.asc
Description: Digital signature


Re: [PATCH] clocksource/cadence_ttc: Reuse clocksource as sched_clock

2013-07-05 Thread Sören Brinkmann
On Fri, Jul 05, 2013 at 10:59:53PM +0200, Thomas Gleixner wrote:
> On Fri, 5 Jul 2013, Sören Brinkmann wrote:
> > On Fri, Jul 05, 2013 at 10:42:03PM +0200, Thomas Gleixner wrote:
> > > We have a mechanism for that in place, if stuff goes cross trees. One
> > > of the trees provides a set of commit for the other tree to pull, so
> > > we do not end up with merge dependencies and conflicts.
> > > 
> > > Why is clocksource stuff going through armsoc unless it contains
> > > related modifications in arch/arm/
> > We overhauled Zynq's common clock code and migrated all Zynq drivers to
> > it - the TTC being one. I'm pretty sure you were on the CC list for that
> > change.
> > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=30e1e28598c2674c133148d8aec6d431d7acd314
> > 
> 
> That's fine as it has no dependencies on any of the clocksource core
> changes AFAICT.
Right, shouldn't be a big deal. Just for this patch, I currently have the
choice of working with either the old TTC and the new clocksource core, or
the other way around. But since everything is going to meet in Linus' tree
soon, I'll just wait a bit until tip/timers is merged.

Sören




Re: MTD EEPROM support and driver integration

2013-07-05 Thread Arnd Bergmann
On Friday 05 July 2013, Maxime Ripard wrote:
> Hi everyone,
> 
> In the last weeks, we've had two drivers coming up, both very simple
> drivers that expose to userspace a few bytes of memory-mapped
> IO. Both will probably live under drivers/misc/eeprom, where there's
> already support for other kinds of what could be assimilated to EEPROMs
> (AT24, AT25, etc.).
> 
> Now, besides the code duplication every driver has to make to register
> the sysfs attributes, it wouldn't cause much of a problem.
> 
> Except that these EEPROMs could store values useful for other drivers in
> Linux. For example:
>   - imx28 OCOTP (the latest of the two EEPROM drivers recently submitted)
> use it most of the time to store its MAC addresses
>   - Allwinner SoCs SID (the other latest driver submitted) use it
> sometime to store a MAC address, and also the SoC ID, which could be
> useful to implement a SoC bus driver.
>   - Some Allwinner boards (and presumably others) use an AT24 to store
> the MAC address in it.
> 
> For now, everyone comes up with a different solution:
>   - imx28 has a hook in mach-mxs to patch the device tree at runtime and
> add the values retrieved from the OCOTP to it.
>   - AT24 seem to have some function exported (at24_macc_(read|write)) so
> that other part of the kernel can use them to retrieve values from
> such an EEPROM.
>   - Allwinner SoCs have, well, basically nothing for now, which is why I
> send this email.
> 
> The current way of working has several flaws imho:
>   - The code is heavily duplicated: the sysfs registering is common to
> every driver, and at the end of the day, every driver should only
> give a read/write callback, and that's it.
>   - The "consumer" drivers should not have to worry at all about the
> EEPROM type it should retrieve the needed value from, let alone
> dealing with the number of instances of such an EEPROM.
> 
> To solve this issues, I think I have some solution. Would merging this
> drivers into MTD make some sense? It seems like there is already some
> EEPROM drivers into drivers/mtd (such as an AT25 one, which also have a
> drivers under drivers/misc/eeprom), so I guess it does, but I'd like to
> have your opinion on this.

I always felt that we should eventually move the eeprom drivers out of
drivers/misc into their own subsystem. Moving them under drivers/mtd
also seems reasonable.

Having a common API sounds like a good idea, and we should probably
spend some time comparing the options.
 
> If so, would some kind of MTD in-kernel API to retrieve values from MTD
> devices would be acceptable to you? I mostly have DT in mind, so I'm
> thinking of having DT bindings to that API such as
> 
>   mac-storage = <&eeprom 0xoffset 0xsize>
> 
> to describe which device to get a value from, and where in that device.
> 
> That would allow consumer drivers to only have to call a function like
> of_mtd_get_value and let the MTD subsystem do the hard work.
> 
> What's your feeling on this?
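To make the proposal concrete, such a binding might look roughly like this
(a sketch of the hypothetical property only; node names, offset and length
are illustrative, not an existing binding):

```
eeprom: eeprom@50 {
	compatible = "atmel,24c02";
	reg = <0x50>;
};

ethernet {
	/* hypothetical: phandle of the storage device, offset, size */
	mac-storage = <&eeprom 0x90 6>;
};
```

A consumer driver would then resolve the property through the proposed
of_mtd_get_value() helper rather than open-coding the EEPROM access.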

My first thought is that it should be more generic than that and not
have the mac address hardcoded as the purpose. We could possibly use
regmap as the in-kernel interface, and come up with a more generic
way of referring to registers in another device node.

Arnd


Re: [PATCH RESEND v10 2/2] watchdog: Sysfs interface for MEN A21 watchdog

2013-07-05 Thread Wim Van Sebroeck
Hi Johannes,

> This patch adds a sysfs interface for the watchdog
> device found on MEN A21 Boards.
> 
> The newly generated files are:
> * rebootcause:
> Can be one of:
> Power on Reset,
> CPU Reset Request,
> Push Button,
> FPGA Reset Request,
> Watchdog,
> Local Power Bad,
> Invalid or
> BDI
> and shows the reason of the boards last reboot.
> 
> * active:
> Shows if the watchdog CPLD is actually running
> 
> * allow_disable:
> Shows if the watchdog is allowed to be disabled (NOWAYOUT disabled)

allow_disable should be nowayout because that is the general watchdog
parameter we will have for all watchdog drivers.

> * fastmode:
> Shows if the CPLD is running in fast mode (1s timeout), once it is in
> fastmode it can't be switched back to slow mode (30s timeout) until the
> next reboot.
> 
> Signed-off-by: Johannes Thumshirn 

I will not add this patch yet; I will put it in my waiting queue.
Reason: we should first do the sysfs stuff for the watchdog_core (since
active and nowayout are parameters that will be in the sysfs watchdog core).

Kind regards,
Wim.



[PATCH] Staging: csr: csr_wifi_router_sef.c: fixed a brace coding style issue

2013-07-05 Thread Aldo Iljazi
Fixed a coding style issue.

Signed-off-by: Aldo Iljazi 
---
 drivers/staging/csr/csr_wifi_router_sef.c |3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/staging/csr/csr_wifi_router_sef.c b/drivers/staging/csr/csr_wifi_router_sef.c
index 45a10fb..bdb7d3b 100644
--- a/drivers/staging/csr/csr_wifi_router_sef.c
+++ b/drivers/staging/csr/csr_wifi_router_sef.c
@@ -9,8 +9,7 @@
  */
 #include "csr_wifi_router_sef.h"
 
-const CsrWifiRouterStateHandlerType CsrWifiRouterDownstreamStateHandlers[CSR_WIFI_ROUTER_PRIM_DOWNSTREAM_COUNT] =
-{
+const CsrWifiRouterStateHandlerType CsrWifiRouterDownstreamStateHandlers[CSR_WIFI_ROUTER_PRIM_DOWNSTREAM_COUNT] = {
/* 0x0000 */ CsrWifiRouterMaPacketSubscribeReqHandler,
 /* 0x0001 */ CsrWifiRouterMaPacketUnsubscribeReqHandler,
 /* 0x0002 */ CsrWifiRouterMaPacketReqHandler,
-- 
1.7.9.5



Re: [PATCH RESEND v10 1/2] watchdog: New watchdog driver for MEN A21 watchdogs

2013-07-05 Thread Wim Van Sebroeck
Hi Johannes,

> This patch adds the driver for the watchdog devices found on MEN Mikro
> Elektronik A21 VMEbus CPU Carrier Boards. It has DT-support and uses the
> watchdog framework.
> 
> Signed-off-by: Johannes Thumshirn 
> Reviewed-by: Guenter Roeck 

I added this patch to linux-watchdog-next.
I am still thinking about whether or not we should add a timer to this
watchdog device driver. But that's something we can change later on.

Kind regards,
Wim.



Re: [PATCH] clocksource/cadence_ttc: Reuse clocksource as sched_clock

2013-07-05 Thread Thomas Gleixner
On Fri, 5 Jul 2013, Sören Brinkmann wrote:
> On Fri, Jul 05, 2013 at 10:42:03PM +0200, Thomas Gleixner wrote:
> > We have a mechanism for that in place, if stuff goes cross trees. One
> > of the trees provides a set of commit for the other tree to pull, so
> > we do not end up with merge dependencies and conflicts.
> > 
> > Why is clocksource stuff going through armsoc unless it contains
> > related modifications in arch/arm/
> We overhauled Zynq's common clock code and migrated all Zynq drivers to
> it - the TTC being one. I'm pretty sure you were on the CC list for that
> change.
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=30e1e28598c2674c133148d8aec6d431d7acd314
> 

That's fine as it has no dependencies on any of the clocksource core
changes AFAICT.

Thanks,

tglx

[PATCH 1/4] PF: Add FAULT_FLAG_RETRY_NOWAIT for guest fault

2013-07-05 Thread Dominik Dingel
In case of a fault retry, exit sie64() with a gmap_fault indication for the
running thread set. This makes it possible to handle async page faults
without the need for mm notifiers.

Based on a patch from Martin Schwidefsky.

Signed-off-by: Dominik Dingel 
---
 arch/s390/include/asm/pgtable.h   |  2 ++
 arch/s390/include/asm/processor.h |  1 +
 arch/s390/kvm/kvm-s390.c  | 13 +
 arch/s390/mm/fault.c  | 26 ++
 4 files changed, 38 insertions(+), 4 deletions(-)

diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 0ea4e59..4a4cc64 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -740,6 +740,7 @@ static inline void pgste_set_pte(pte_t *ptep, pte_t entry)
  * @table: pointer to the page directory
  * @asce: address space control element for gmap page table
  * @crst_list: list of all crst tables used in the guest address space
+ * @pfault_enabled: defines if pfaults are applicable for the guest
  */
 struct gmap {
struct list_head list;
@@ -748,6 +749,7 @@ struct gmap {
unsigned long asce;
void *private;
struct list_head crst_list;
+   unsigned long pfault_enabled;
 };
 
 /**
diff --git a/arch/s390/include/asm/processor.h b/arch/s390/include/asm/processor.h
index 6b49987..4fa96ca 100644
--- a/arch/s390/include/asm/processor.h
+++ b/arch/s390/include/asm/processor.h
@@ -77,6 +77,7 @@ struct thread_struct {
 unsigned long ksp;  /* kernel stack pointer */
mm_segment_t mm_segment;
unsigned long gmap_addr;/* address of last gmap fault. */
+   unsigned int gmap_pfault;   /* signal of a pending guest pfault */
struct per_regs per_user;   /* User specified PER registers */
struct per_event per_event; /* Cause of the last PER trap */
unsigned long per_flags;/* Flags to control debug behavior */
diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index ba694d2..702daca 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -682,6 +682,15 @@ static int kvm_s390_handle_requests(struct kvm_vcpu *vcpu)
return 0;
 }
 
+static void kvm_arch_fault_in_sync(struct kvm_vcpu *vcpu)
+{
+   hva_t fault = gmap_fault(current->thread.gmap_addr, vcpu->arch.gmap);
+   struct mm_struct *mm = current->mm;
+   down_read(&mm->mmap_sem);
+   get_user_pages(current, mm, fault, 1, 1, 0, NULL, NULL);
+   up_read(&mm->mmap_sem);
+}
+
 static int __vcpu_run(struct kvm_vcpu *vcpu)
 {
int rc;
@@ -715,6 +724,10 @@ static int __vcpu_run(struct kvm_vcpu *vcpu)
if (rc < 0) {
if (kvm_is_ucontrol(vcpu->kvm)) {
rc = SIE_INTERCEPT_UCONTROL;
+   } else if (current->thread.gmap_pfault) {
+   kvm_arch_fault_in_sync(vcpu);
+   current->thread.gmap_pfault = 0;
+   rc = 0;
} else {
VCPU_EVENT(vcpu, 3, "%s", "fault in sie instruction");
trace_kvm_s390_sie_fault(vcpu);
diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c
index 047c3e4..7d4c4b1 100644
--- a/arch/s390/mm/fault.c
+++ b/arch/s390/mm/fault.c
@@ -50,6 +50,7 @@
 #define VM_FAULT_BADMAP    0x02
 #define VM_FAULT_BADACCESS 0x04
 #define VM_FAULT_SIGNAL    0x08
+#define VM_FAULT_PFAULT    0x10
 
 static unsigned long store_indication __read_mostly;
 
@@ -232,6 +233,7 @@ static noinline void do_fault_error(struct pt_regs *regs, int fault)
return;
}
case VM_FAULT_BADCONTEXT:
+   case VM_FAULT_PFAULT:
do_no_context(regs);
break;
case VM_FAULT_SIGNAL:
@@ -269,6 +271,9 @@ static noinline void do_fault_error(struct pt_regs *regs, int fault)
  */
 static inline int do_exception(struct pt_regs *regs, int access)
 {
+#ifdef CONFIG_PGSTE
+   struct gmap *gmap;
+#endif
struct task_struct *tsk;
struct mm_struct *mm;
struct vm_area_struct *vma;
@@ -307,9 +312,10 @@ static inline int do_exception(struct pt_regs *regs, int access)
down_read(&mm->mmap_sem);
 
 #ifdef CONFIG_PGSTE
-   if ((current->flags & PF_VCPU) && S390_lowcore.gmap) {
-   address = __gmap_fault(address,
-(struct gmap *) S390_lowcore.gmap);
+   gmap = (struct gmap *)
+   ((current->flags & PF_VCPU) ? S390_lowcore.gmap : 0);
+   if (gmap) {
+   address = __gmap_fault(address, gmap);
if (address == -EFAULT) {
fault = VM_FAULT_BADMAP;
goto out_up;
@@ -318,6 +324,8 @@ static inline int do_exception(struct pt_regs *regs, int access)
fault = VM_FAULT_OOM;
goto out_up;
}
+
