[PATCH] smp: Fix sending func call IPI to empty cpu mask

2013-01-25 Thread Wang YanQing
I get the warning below with 3.7,
once or twice per day.

[ 2235.186027] WARNING: at 
/mnt/sda7/kernel/linux/arch/x86/kernel/apic/ipi.c:109 
default_send_IPI_mask_logical+0x2f/0xb8()
[ 2235.186030] Hardware name: Aspire 4741
[ 2235.186032] empty IPI mask
[ 2235.186034] Modules linked in: vboxpci(O) vboxnetadp(O) vboxnetflt(O) 
vboxdrv(O) nvidia(PO) wl(O)
[ 2235.186046] Pid: 5542, comm: pool Tainted: P   O 3.7.2+ #41
[ 2235.186049] Call Trace:
[ 2235.186059]  [] warn_slowpath_common+0x65/0x7a
[ 2235.186064]  [] ? default_send_IPI_mask_logical+0x2f/0xb8
[ 2235.186069]  [] warn_slowpath_fmt+0x26/0x2a
[ 2235.186074]  [] default_send_IPI_mask_logical+0x2f/0xb8
[ 2235.186079]  [] native_send_call_func_ipi+0x4f/0x57
[ 2235.186087]  [] smp_call_function_many+0x191/0x1a9
[ 2235.186092]  [] ? do_flush_tlb_all+0x3f/0x3f
[ 2235.186097]  [] native_flush_tlb_others+0x21/0x24
[ 2235.186101]  [] flush_tlb_page+0x63/0x89
[ 2235.186105]  [] ptep_set_access_flags+0x20/0x26
[ 2235.186111]  [] do_wp_page+0x234/0x502
[ 2235.186117]  [] ? T.2009+0x31/0x35
[ 2235.186121]  [] handle_pte_fault+0x50d/0x54c
[ 2235.186128]  [] ? irq_exit+0x5f/0x61
[ 2235.186133]  [] ? smp_call_function_interrupt+0x2c/0x2e
[ 2235.186143]  [] ? call_function_interrupt+0x2d/0x34
[ 2235.186148]  [] handle_mm_fault+0xd0/0xe2
[ 2235.186153]  [] __do_page_fault+0x411/0x42d
[ 2235.186158]  [] ? sys_futex+0xa9/0xee
[ 2235.186162]  [] ? __do_page_fault+0x42d/0x42d
[ 2235.186166]  [] do_page_fault+0x8/0xa
[ 2235.186170]  [] error_code+0x5a/0x60
[ 2235.186174]  [] ? __do_page_fault+0x42d/0x42d
[ 2235.186177] ---[ end trace 089b20858c3cb340 ]---

This patch fixes it.

It also fixes a potential system hang: if data->cpumask is
cleared after passing the check

if (WARN_ONCE(!mask, "empty IPI mask"))
return;

then the problem that commit 83d349f3 fixed will happen again.
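The race the patch closes can be illustrated with a small user-space simulation (names with a `_sim` suffix are stand-ins, not kernel API): once the call_function entry is visible on the list, responding CPUs clear their bits in data->cpumask, so the mask handed to arch_send_call_function_ipi_mask() can become empty; snapshotting it into cpumask_ipi first keeps the IPI mask stable.

```c
/* Minimal sketch of the race fixed above.  The "other CPU" clearing
 * its bit is simulated by a plain store; the kernel does this from
 * the smp_call_function IPI handler. */
struct call_function_data_sim {
	unsigned long cpumask;      /* cleared by responding CPUs */
	unsigned long cpumask_ipi;  /* stable snapshot used for the IPI */
};

/* Snapshot the mask BEFORE the entry becomes visible to other CPUs. */
static unsigned long prepare_ipi_mask(struct call_function_data_sim *d)
{
	d->cpumask_ipi = d->cpumask;
	return d->cpumask_ipi;
}

/* A responding CPU clears its bit in cpumask, as on the real path. */
static void other_cpu_responds(struct call_function_data_sim *d, int cpu)
{
	d->cpumask &= ~(1UL << cpu);
}
```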

Signed-off-by: Wang YanQing 
---
 kernel/smp.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/kernel/smp.c b/kernel/smp.c
index 29dd40a..7c56aba 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -33,6 +33,7 @@ struct call_function_data {
struct call_single_data csd;
atomic_t        refs;
cpumask_var_t   cpumask;
+   cpumask_var_t   cpumask_ipi;
 };
 
 static DEFINE_PER_CPU_SHARED_ALIGNED(struct call_function_data, cfd_data);
@@ -526,6 +527,13 @@ void smp_call_function_many(const struct cpumask *mask,
return;
}
 
+   /*
+* After we put the entry on the list, data->cpumask
+* may be cleared as other cpus respond to earlier
+* IPIs for func calls, so it can become empty before
+* we send the IPI.  Use a stable copy instead.
+*/
+   cpumask_copy(data->cpumask_ipi, data->cpumask);
raw_spin_lock_irqsave(&call_function.lock, flags);
/*
 * Place entry at the _HEAD_ of the list, so that any cpu still
@@ -549,7 +557,7 @@ void smp_call_function_many(const struct cpumask *mask,
smp_mb();
 
/* Send a message to all CPUs in the map */
-   arch_send_call_function_ipi_mask(data->cpumask);
+   arch_send_call_function_ipi_mask(data->cpumask_ipi);
 
/* Optionally wait for the CPUs to complete */
if (wait)
-- 
1.7.11.1.116.g8228a23
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[patch] edac: test correct variable in ->store function

2013-01-25 Thread Dan Carpenter
We're testing for ->show but calling ->store().

Signed-off-by: Dan Carpenter 

diff --git a/drivers/edac/edac_pci_sysfs.c b/drivers/edac/edac_pci_sysfs.c
index 7684426..e8658e4 100644
--- a/drivers/edac/edac_pci_sysfs.c
+++ b/drivers/edac/edac_pci_sysfs.c
@@ -256,7 +256,7 @@ static ssize_t edac_pci_dev_store(struct kobject *kobj,
struct edac_pci_dev_attribute *edac_pci_dev;
edac_pci_dev = (struct edac_pci_dev_attribute *)attr;
 
-   if (edac_pci_dev->show)
+   if (edac_pci_dev->store)
return edac_pci_dev->store(edac_pci_dev->value, buffer, count);
return -EIO;
 }
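The bug class here is a dispatcher that tests one optional callback but invokes another; with a read-only attribute (->show set, ->store NULL) the old code would call a NULL pointer. A minimal sketch of the corrected pattern (hypothetical `_sim` types mirroring edac_pci_dev_attribute):

```c
/* Optional-callback dispatch: always test the pointer you call. */
struct attr_sim {
	int (*show)(void);
	int (*store)(void);
};

static int store_ok(void)
{
	return 1;
}

/* Correct variant: checks ->store before calling ->store(). */
static int do_store(const struct attr_sim *a)
{
	if (a->store)
		return a->store();
	return -5; /* -EIO: no store method for this attribute */
}
```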


Re: Bug#695182: [PATCH] Subtract min_free_kbytes from dirtyable memory

2013-01-25 Thread Jonathan Nieder
Hi Paul,

Ben Hutchings wrote:

> If you can identify where it was fixed then your patch for older
> versions should go to stable with a reference to the upstream fix (see
> Documentation/stable_kernel_rules.txt).

How about this patch?

It was applied in mainline during the 3.3 merge window, so kernels
newer than 3.2.y shouldn't need it.

-- >8 --
From: Johannes Weiner 
Date: Tue, 10 Jan 2012 15:07:42 -0800
Subject: mm: exclude reserved pages from dirtyable memory

commit ab8fabd46f811d5153d8a0cd2fac9a0d41fb593d upstream.

Per-zone dirty limits try to distribute page cache pages allocated for
writing across zones in proportion to the individual zone sizes, to reduce
the likelihood of reclaim having to write back individual pages from the
LRU lists in order to make progress.

This patch:

The amount of dirtyable pages should not include the full number of free
pages: there is a number of reserved pages that the page allocator and
kswapd always try to keep free.

The closer (reclaimable pages - dirty pages) is to the number of reserved
pages, the more likely it becomes for reclaim to run into dirty pages:

   +--+ ---
   |   anon   |  |
   +--+  |
   |  |  |
   |  |  -- dirty limit new-- flusher new
   |   file   |  | |
   |  |  | |
   |  |  -- dirty limit old-- flusher old
   |  ||
   +--+   --- reclaim
   | reserved |
   +--+
   |  kernel  |
   +--+

This patch introduces a per-zone dirty reserve that takes both the lowmem
reserve as well as the high watermark of the zone into account, and a
global sum of those per-zone values that is subtracted from the global
amount of dirtyable pages.  The lowmem reserve is unavailable to page
cache allocations and kswapd tries to keep the high watermark free.  We
don't want to end up in a situation where reclaim has to clean pages in
order to balance zones.

Not treating reserved pages as dirtyable on a global level is only a
conceptual fix.  In reality, dirty pages are not distributed equally
across zones and reclaim runs into dirty pages on a regular basis.

But it is important to get this right before tackling the problem on a
per-zone level, where the distance between reclaim and the dirty pages is
mostly much smaller in absolute numbers.
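The global accounting change is simple arithmetic: dirtyable memory is free plus reclaimable pages, minus the summed per-zone reserves that the allocator and kswapd keep free anyway. A sketch with made-up page counts (all numbers illustrative, not from the patch):

```c
/* Illustrative version of the dirtyable-memory calculation after the
 * patch: the dirty reserve is no longer counted as dirtyable. */
static unsigned long determine_dirtyable_sim(unsigned long free_pages,
					     unsigned long reclaimable,
					     unsigned long dirty_reserve)
{
	return free_pages + reclaimable - dirty_reserve;
}
```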

[a...@linux-foundation.org: fix highmem build]
Signed-off-by: Johannes Weiner 
Reviewed-by: Rik van Riel 
Reviewed-by: Michal Hocko 
Reviewed-by: Minchan Kim 
Acked-by: Mel Gorman 
Cc: KAMEZAWA Hiroyuki 
Cc: Christoph Hellwig 
Cc: Wu Fengguang 
Cc: Dave Chinner 
Cc: Jan Kara 
Cc: Shaohua Li 
Cc: Chris Mason 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Jonathan Nieder 
---
 include/linux/mmzone.h |  6 ++
 include/linux/swap.h   |  1 +
 mm/page-writeback.c    |  5 +++--
 mm/page_alloc.c| 19 +++
 4 files changed, 29 insertions(+), 2 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 25842b6e72e1..a594af3278bc 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -319,6 +319,12 @@ struct zone {
 */
unsigned long   lowmem_reserve[MAX_NR_ZONES];
 
+   /*
+* This is a per-zone reserve of pages that should not be
+* considered dirtyable memory.
+*/
+   unsigned long   dirty_balance_reserve;
+
 #ifdef CONFIG_NUMA
int node;
/*
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 67b3fa308988..3e60228e7299 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -207,6 +207,7 @@ struct swap_list_t {
 /* linux/mm/page_alloc.c */
 extern unsigned long totalram_pages;
 extern unsigned long totalreserve_pages;
+extern unsigned long dirty_balance_reserve;
 extern unsigned int nr_free_buffer_pages(void);
 extern unsigned int nr_free_pagecache_pages(void);
 
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 50f08241f981..f620e7b0dc26 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -320,7 +320,7 @@ static unsigned long highmem_dirtyable_memory(unsigned long total)
struct zone *z = &NODE_DATA(node)->node_zones[ZONE_HIGHMEM];
 
x += zone_page_state(z, NR_FREE_PAGES) +
-zone_reclaimable_pages(z);
+zone_reclaimable_pages(z) - z->dirty_balance_reserve;
}
/*
 * Make sure that the number of highmem pages is never larger
@@ -344,7 +344,8 @@ unsigned long determine_dirtyable_memory(void)
 {
unsigned long x;
 
-   x = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
+   x = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages() -
+   dirty_balance_reserve;
 
if (!vm_highmem_is_dirtyable)
x -= highmem_dirtyable_memory(x);
diff --git a/mm/page_alloc.c 

Re: [PATCH 1/2] spi: spi-gpio: Add checks for the dt properties

2013-01-25 Thread Mark Brown
On Fri, Jan 25, 2013 at 09:39:34AM +0100, Maxime Ripard wrote:
> The bindings assumed that the gpios properties were always there, which
> made the NO_TX and NO_RX mode not usable from device tree. Add extra
> checks to make sure that the driver can work if either MOSI or MISO is
> not used.

Applied, thanks.




Re: [PATCH] regulators/db8500: Fix compile failure for drivers/regulator/dbx500-prcmu.c

2013-01-25 Thread Mark Brown
On Thu, Jan 24, 2013 at 10:29:26AM -0500, Steven Rostedt wrote:
> Building for the snowball board, I ran into this compile failure:

Applied, thanks.  Please use subject lines appropriate for the subsystem
(I see I let the original one through).




Re: [PATCH v3 04/10] spi/pxa2xx: convert to the common clk framework

2013-01-25 Thread Mark Brown
On Tue, Jan 22, 2013 at 12:26:27PM +0200, Mika Westerberg wrote:
> Convert clk_enable() to clk_prepare_enable() and clk_disable() to
> clk_disable_unprepare() respectively in order to support the common clk
> framework. Otherwise we get warnings on the console as the clock is not
> prepared before it is enabled.

Applied, thanks.




Re: [PATCH v3 03/10] spi/pxa2xx: convert to the pump message infrastructure

2013-01-25 Thread Mark Brown
On Tue, Jan 22, 2013 at 12:26:26PM +0200, Mika Westerberg wrote:
> The SPI core provides infrastructure for standard message queueing so use
> that instead of handling everything in the driver. This simplifies the
> driver.

Applied, thanks.




Re: [PATCH v3 02/10] spi/pxa2xx: fix warnings when compiling a 64-bit kernel

2013-01-25 Thread Mark Brown
On Tue, Jan 22, 2013 at 12:26:25PM +0200, Mika Westerberg wrote:
> Fix following warnings seen when compiling 64-bit:

Applied, thanks.




Re: [PATCH v3 01/10] spi/pxa2xx: allow building on a 64-bit kernel

2013-01-25 Thread Mark Brown
On Tue, Jan 22, 2013 at 12:26:24PM +0200, Mika Westerberg wrote:
> We are going to use it on 64-bit kernel on Intel Lynxpoint so make sure we
> can build it into such kernel.

Applied, thanks.




Re: [PATCH -v4 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning

2013-01-25 Thread Mike Galbraith
On Fri, 2013-01-25 at 14:05 -0500, Rik van Riel wrote:

> The performance issue observed with AIM7 is still a mystery.

Hm.  AIM7 mystery _may_ be the same crud I see on a 4 node 40 core box.
Stock scheduler knobs are too preempt happy, produce unstable results.
I twiddle them as below to stabilize results.

I'm testing a load balancing series from Alex Shi with AIM7 and whatnot,
added your series on top of it and retested.  What I see is
improvement. 

Oodles of numbers follow.  Sorry that your numbers are mixed in with my
numbers, but this is just an excerpt from my test log, and I'm too lazy
to reformat and filter.  You can save wear and tear on your eyeballs by
just poking 'D'.  There does appear to be evidence that your patch set
improved this load though, so in case you want to see numbers, here come
a bunch, a quick scroll-by may be worth it.

The very heavy load end did not improve, which seems odd, but whatever.
Numbers...

sched_latency_ns = 24ms
sched_min_granularity_ns = 8ms
sched_wakeup_granularity_ns = 10ms

aim7 compute
 3.8.0-performance   3.8.0-balance  
 3.8.0-powersaving
Tasksjobs/min  jti  jobs/min/task  real   cpujobs/min  jti  
jobs/min/task  real   cpujobs/min  jti  jobs/min/task  real 
  cpu
1  432.86  100   432.8571 14.00  3.99  433.48  100  
 433.4764 13.98  3.97  433.17  100   433.1665 13.99 
 3.98
1  437.23  100   437.2294 13.86  3.85  436.60  100  
 436.5994 13.88  3.86  435.66  100   435.6578 13.91 
 3.90
1  434.10  100   434.0974 13.96  3.95  436.29  100  
 436.2851 13.89  3.89  436.29  100   436.2851 13.89 
 3.87
5 2400.95   99   480.1902 12.62 12.49 2554.81   98  
 510.9612 11.86  7.55 2487.68   98   497.5369 12.18 
 8.22
5 2341.58   99   468.3153 12.94 13.95 2578.72   99  
 515.7447 11.75  7.25 2527.11   99   505.4212 11.99 
 7.90
5 2350.66   99   470.1319 12.89 13.66 2600.86   99  
 520.1717 11.65  7.09 2508.28   98   501.6556 12.08 
 8.24
   10 4291.78   99   429.1785 14.12 40.14 5334.51   99  
 533.4507 11.36 11.13 5183.92   98   518.3918 11.69 
12.15
   10 4334.76   99   433.4764 13.98 38.70 5311.13   99  
 531.1131 11.41 11.23 5215.15   99   521.5146 11.62 
12.53
   10 4273.62   99   427.3625 14.18 40.29 5287.96   99  
 528.7958 11.46 11.46 5144.31   98   514.4312 11.78 
12.32
   20 8487.39   94   424.3697 14.28 63.1410594.41   99  
 529.7203 11.44 23.7210575.92   99   528.7958 11.46 
22.08
   20 8387.54   97   419.3772 14.45 77.0110575.92   98  
 528.7958 11.46 23.4110520.83   99   526.0417 11.52 
21.88
   20 8713.16   95   435.6578 13.91 55.1010659.63   99  
 532.9815 11.37 24.1710539.13   99   526.9565 11.50 
22.13
   4016786.70   99   419.6676 14.44170.0819469.88   98  
 486.7470 12.45 60.7819967.05   98   499.1763 12.14 
51.40
   4016728.78   99   418.2195 14.49172.9619627.53   98  
 490.6883 12.35 65.2620386.88   98   509.6720 11.89 
46.91
   4016763.49   99   419.0871 14.46171.4220033.06   98  
 500.8264 12.10 51.4420682.59   98   517.0648 11.72 
42.45
   8033024.52   98   412.8065 14.68355.1033205.48   98  
 415.0685 14.60336.9033690.06   97   421.1258 14.39 
   248.91
   8033002.04   99   412.5255 14.69356.2733949.58   96  
 424.3697 14.28283.8733160.05   97   414.5007 14.62 
   264.85
   8033047.03   99   413.0879 14.67355.2233137.39   98  
 414.2174 14.63338.9233526.97   97   419.0871 14.46 
   257.31
  16064254.47   98   401.5905 15.09391.3064000.00   98  
 400. 15.15396.8765073.83   97   406.7114 14.90 
   371.09
  16064468.09   98   402.9255 15.04390.2864553.93   98  
 403.4621 15.02389.4964640.00   98   404. 15.00 
   379.82
  16064297.08   98   401.8568 15.08389.4564856.19   98  
 405.3512 14.95383.6464683.12   98   404.2695 14.99 
   379.43
  320   121579.94   98  

Re: [PATCH RESEND] ARM: dts: max77686: Add DTS file for max77686 PMIC

2013-01-25 Thread Dongjin Kim
Hello Mark,

Yes, this is not an ARM-specific chip at all. I just wanted the format
to be reviewed by you and others before integrating it into my board
file. I had sent a similar one before,
https://patchwork.kernel.org/patch/1287711, and you advised that it was
too board specific. I plan to integrate it the way OMAP boards do with
twl6030.dtsi and twl6040.dtsi.

It would be nice if you could specify somewhere a directory for such device files.

Regards,
Dongjin.

On Sat, Jan 26, 2013 at 2:06 PM, Mark Brown
 wrote:
> On Fri, Jan 25, 2013 at 03:46:08AM +0900, Dongjin Kim wrote:
>
>> ---
>>  arch/arm/boot/dts/max77686.dtsi |  156 
>> +++
>
> Why is this in arch/arm?  This isn't an ARM-specific chip.


[RFC PATCH 2/4] lib: add support for LZ4-compressed kernels

2013-01-25 Thread Kyungsik Lee
This patch adds support for extracting LZ4-compressed kernel images,
as well as LZ4-compressed ramdisk images in the kernel boot process.

This depends on the patch below:
decompressors: add lz4 decompressor module

Signed-off-by: Kyungsik Lee 
---
 include/linux/decompress/unlz4.h |  10 ++
 init/Kconfig |  13 ++-
 lib/Kconfig  |   7 ++
 lib/Makefile |   2 +
 lib/decompress.c |   5 +
 lib/decompress_unlz4.c   | 199 +++
 lib/lz4/Makefile |   1 +
 lib/lz4/lz4_decompress.c |   2 +-
 scripts/Makefile.lib |   5 +
 usr/Kconfig  |   9 ++
 10 files changed, 251 insertions(+), 2 deletions(-)
 create mode 100644 include/linux/decompress/unlz4.h
 create mode 100644 lib/decompress_unlz4.c
 create mode 100644 lib/lz4/Makefile

diff --git a/include/linux/decompress/unlz4.h b/include/linux/decompress/unlz4.h
new file mode 100644
index 000..d5b68bf
--- /dev/null
+++ b/include/linux/decompress/unlz4.h
@@ -0,0 +1,10 @@
+#ifndef DECOMPRESS_UNLZ4_H
+#define DECOMPRESS_UNLZ4_H
+
+int unlz4(unsigned char *inbuf, int len,
+   int(*fill)(void*, unsigned int),
+   int(*flush)(void*, unsigned int),
+   unsigned char *output,
+   int *pos,
+   void(*error)(char *x));
+#endif
diff --git a/init/Kconfig b/init/Kconfig
index 1aefe1a..be3753e 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -102,10 +102,13 @@ config HAVE_KERNEL_XZ
 config HAVE_KERNEL_LZO
bool
 
+config HAVE_KERNEL_LZ4
+   bool
+
 choice
prompt "Kernel compression mode"
default KERNEL_GZIP
-   depends on HAVE_KERNEL_GZIP || HAVE_KERNEL_BZIP2 || HAVE_KERNEL_LZMA || HAVE_KERNEL_XZ || HAVE_KERNEL_LZO
+   depends on HAVE_KERNEL_GZIP || HAVE_KERNEL_BZIP2 || HAVE_KERNEL_LZMA || HAVE_KERNEL_XZ || HAVE_KERNEL_LZO || HAVE_KERNEL_LZ4
help
  The linux kernel is a kind of self-extracting executable.
  Several compression algorithms are available, which differ
@@ -172,6 +175,14 @@ config KERNEL_LZO
  size is about 10% bigger than gzip; however its speed
  (both compression and decompression) is the fastest.
 
+config KERNEL_LZ4
+   bool "LZ4"
+   depends on HAVE_KERNEL_LZ4
+   help
+ Its compression ratio is worse than LZO. The size of the kernel
+ is about 5% bigger than LZO. But the decompression speed is
+ faster than LZO.
+
 endchoice
 
 config DEFAULT_HOSTNAME
diff --git a/lib/Kconfig b/lib/Kconfig
index 75cdb77..b108047 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -189,6 +189,9 @@ config LZO_COMPRESS
 config LZO_DECOMPRESS
tristate
 
+config LZ4_DECOMPRESS
+   tristate
+
 source "lib/xz/Kconfig"
 
 #
@@ -213,6 +216,10 @@ config DECOMPRESS_LZO
select LZO_DECOMPRESS
tristate
 
+config DECOMPRESS_LZ4
+   select LZ4_DECOMPRESS
+   tristate
+
 #
 # Generic allocator support is selected if needed
 #
diff --git a/lib/Makefile b/lib/Makefile
index 02ed6c0..c2073bf 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -72,6 +72,7 @@ obj-$(CONFIG_REED_SOLOMON) += reed_solomon/
 obj-$(CONFIG_BCH) += bch.o
 obj-$(CONFIG_LZO_COMPRESS) += lzo/
 obj-$(CONFIG_LZO_DECOMPRESS) += lzo/
+obj-$(CONFIG_LZ4_DECOMPRESS) += lz4/
 obj-$(CONFIG_XZ_DEC) += xz/
 obj-$(CONFIG_RAID6_PQ) += raid6/
 
@@ -80,6 +81,7 @@ lib-$(CONFIG_DECOMPRESS_BZIP2) += decompress_bunzip2.o
 lib-$(CONFIG_DECOMPRESS_LZMA) += decompress_unlzma.o
 lib-$(CONFIG_DECOMPRESS_XZ) += decompress_unxz.o
 lib-$(CONFIG_DECOMPRESS_LZO) += decompress_unlzo.o
+lib-$(CONFIG_DECOMPRESS_LZ4) += decompress_unlz4.o
 
 obj-$(CONFIG_TEXTSEARCH) += textsearch.o
 obj-$(CONFIG_TEXTSEARCH_KMP) += ts_kmp.o
diff --git a/lib/decompress.c b/lib/decompress.c
index 31a8042..c70810e 100644
--- a/lib/decompress.c
+++ b/lib/decompress.c
@@ -11,6 +11,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -31,6 +32,9 @@
 #ifndef CONFIG_DECOMPRESS_LZO
 # define unlzo NULL
 #endif
+#ifndef CONFIG_DECOMPRESS_LZ4
+# define unlz4 NULL
+#endif
 
 struct compress_format {
unsigned char magic[2];
@@ -45,6 +49,7 @@ static const struct compress_format compressed_formats[] __initdata = {
{ {0x5d, 0x00}, "lzma", unlzma },
{ {0xfd, 0x37}, "xz", unxz },
{ {0x89, 0x4c}, "lzo", unlzo },
+   { {0x02, 0x21}, "lz4", unlz4 },
{ {0, 0}, NULL, NULL }
 };
 
diff --git a/lib/decompress_unlz4.c b/lib/decompress_unlz4.c
new file mode 100644
index 000..6b6a8d0
--- /dev/null
+++ b/lib/decompress_unlz4.c
@@ -0,0 +1,199 @@
+/*
+ * LZ4 decompressor for the Linux kernel.
+ *
+ * Linux kernel adaptation:
+ * Copyright (C) 2013, LG Electronics, Kyungsik Lee 
+ *
+ * Based on LZ4 implementation by Yann Collet.
+ *
+ * LZ4 - Fast LZ compression algorithm
+ * Copyright (C) 2011-2012, Yann Collet.
+ * BSD 2-Clause License (http://www.opensource.org/licenses/bsd-license.php)
+ *
+ * Redistribution 
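The compressed_formats[] table added above dispatches on a two-byte magic at the start of the image (0x02 0x21 for the LZ4 legacy frame). A user-space sketch of the same lookup idea (all `_sim` names are illustrative, not the kernel's):

```c
#include <stddef.h>
#include <string.h>

/* Mirror of the kernel's compress_format table: two magic bytes
 * select the decompressor by name. */
struct fmt_sim {
	unsigned char magic[2];
	const char *name;
};

static const struct fmt_sim formats_sim[] = {
	{ {0x1f, 0x8b}, "gzip" },
	{ {0x02, 0x21}, "lz4" },   /* LZ4 legacy frame magic, as in the patch */
	{ {0, 0}, NULL },
};

/* Return the format name for a buffer prefix, or NULL if unknown. */
static const char *identify_sim(const unsigned char *buf, size_t len)
{
	const struct fmt_sim *f;

	if (len < 2)
		return NULL;
	for (f = formats_sim; f->name; f++)
		if (buf[0] == f->magic[0] && buf[1] == f->magic[1])
			return f->name;
	return NULL;
}
```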

[RFC PATCH 3/4] arm: add support for LZ4-compressed kernels

2013-01-25 Thread Kyungsik Lee
This patch integrates the LZ4 decompression code into the ARM pre-boot code.
It depends on the two patches below:

lib: add support for LZ4-compressed kernels
decompressors: add lz4 decompressor module

Signed-off-by: Kyungsik Lee 
---
 arch/arm/Kconfig  | 1 +
 arch/arm/boot/compressed/.gitignore   | 1 +
 arch/arm/boot/compressed/Makefile | 3 ++-
 arch/arm/boot/compressed/decompress.c | 4 
 arch/arm/boot/compressed/piggy.lz4.S  | 6 ++
 5 files changed, 14 insertions(+), 1 deletion(-)
 create mode 100644 arch/arm/boot/compressed/piggy.lz4.S

diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 91f8d78..1b3621d 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -37,6 +37,7 @@ config ARM
select HAVE_HW_BREAKPOINT if (PERF_EVENTS && (CPU_V6 || CPU_V6K || 
CPU_V7))
select HAVE_IDE if PCI || ISA || PCMCIA
select HAVE_KERNEL_GZIP
+   select HAVE_KERNEL_LZ4
select HAVE_KERNEL_LZMA
select HAVE_KERNEL_LZO
select HAVE_KERNEL_XZ
diff --git a/arch/arm/boot/compressed/.gitignore b/arch/arm/boot/compressed/.gitignore
index f79a08e..47279aa 100644
--- a/arch/arm/boot/compressed/.gitignore
+++ b/arch/arm/boot/compressed/.gitignore
@@ -6,6 +6,7 @@ piggy.gzip
 piggy.lzo
 piggy.lzma
 piggy.xzkern
+piggy.lz4
 vmlinux
 vmlinux.lds
 
diff --git a/arch/arm/boot/compressed/Makefile b/arch/arm/boot/compressed/Makefile
index 5cad8a6..8b5c79a 100644
--- a/arch/arm/boot/compressed/Makefile
+++ b/arch/arm/boot/compressed/Makefile
@@ -88,6 +88,7 @@ suffix_$(CONFIG_KERNEL_GZIP) = gzip
 suffix_$(CONFIG_KERNEL_LZO)  = lzo
 suffix_$(CONFIG_KERNEL_LZMA) = lzma
 suffix_$(CONFIG_KERNEL_XZ)   = xzkern
+suffix_$(CONFIG_KERNEL_LZ4)  = lz4
 
 # Borrowed libfdt files for the ATAG compatibility mode
 
@@ -112,7 +113,7 @@ targets   := vmlinux vmlinux.lds \
 font.o font.c head.o misc.o $(OBJS)
 
 # Make sure files are removed during clean
-extra-y   += piggy.gzip piggy.lzo piggy.lzma piggy.xzkern \
+extra-y   += piggy.gzip piggy.lzo piggy.lzma piggy.xzkern piggy.lz4 \
 lib1funcs.S ashldi3.S $(libfdt) $(libfdt_hdrs)
 
 ifeq ($(CONFIG_FUNCTION_TRACER),y)
diff --git a/arch/arm/boot/compressed/decompress.c b/arch/arm/boot/compressed/decompress.c
index 9deb56a..a95f071 100644
--- a/arch/arm/boot/compressed/decompress.c
+++ b/arch/arm/boot/compressed/decompress.c
@@ -53,6 +53,10 @@ extern char * strstr(const char * s1, const char *s2);
 #include "../../../../lib/decompress_unxz.c"
 #endif
 
+#ifdef CONFIG_KERNEL_LZ4
+#include "../../../../lib/decompress_unlz4.c"
+#endif
+
 int do_decompress(u8 *input, int len, u8 *output, void (*error)(char *x))
 {
return decompress(input, len, NULL, NULL, output, NULL, error);
diff --git a/arch/arm/boot/compressed/piggy.lz4.S b/arch/arm/boot/compressed/piggy.lz4.S
new file mode 100644
index 000..3d9a575
--- /dev/null
+++ b/arch/arm/boot/compressed/piggy.lz4.S
@@ -0,0 +1,6 @@
+   .section .piggydata,#alloc
+   .globl  input_data
+input_data:
+   .incbin "arch/arm/boot/compressed/piggy.lz4"
+   .globl  input_data_end
+input_data_end:
-- 
1.8.0.3



[RFC PATCH 1/4] decompressors: add lz4 decompressor module

2013-01-25 Thread Kyungsik Lee
This patch adds support for LZ4 decompression in the kernel.
LZ4 Decompression APIs for kernel are based on LZ4 implementation
by Yann Collet.

LZ4 homepage : http://fastcompression.blogspot.com/p/lz4.html
LZ4 source repository : http://code.google.com/p/lz4/

Signed-off-by: Kyungsik Lee 
---
 include/linux/lz4.h  |  62 +++
 lib/lz4/lz4_decompress.c | 199 +++
 lib/lz4/lz4defs.h| 129 ++
 3 files changed, 390 insertions(+)
 create mode 100644 include/linux/lz4.h
 create mode 100644 lib/lz4/lz4_decompress.c
 create mode 100644 lib/lz4/lz4defs.h

diff --git a/include/linux/lz4.h b/include/linux/lz4.h
new file mode 100644
index 000..df03dd8
--- /dev/null
+++ b/include/linux/lz4.h
@@ -0,0 +1,62 @@
+#ifndef __LZ4_H__
+#define __LZ4_H__
+/*
+ * LZ4 Decompressor Kernel Interface
+ *
+ * Copyright (C) 2013, LG Electronics, Kyungsik Lee 
+ * Based on LZ4 implementation by Yann Collet.
+ *
+ * LZ4 - Fast LZ compression algorithm
+ * Copyright (C) 2011-2012, Yann Collet.
+ * BSD 2-Clause License (http://www.opensource.org/licenses/bsd-license.php)
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are
+ * met:
+ *
+ * * Redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer.
+ * * Redistributions in binary form must reproduce the above
+ * copyright notice, this list of conditions and the following disclaimer
+ * in the documentation and/or other materials provided with the
+ * distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ * You can contact the author at :
+ * - LZ4 homepage : http://fastcompression.blogspot.com/p/lz4.html
+ * - LZ4 source repository : http://code.google.com/p/lz4/
+ */
+
+
+/*
+ * LZ4_COMPRESSBOUND()
+ * Provides the maximum size that LZ4 may output in a "worst case" scenario
+ * (input data not compressible)
+ */
+#define LZ4_COMPRESSBOUND(isize) (isize + ((isize)/255) + 16)
+
+/*
+ * lz4_decompress()
+ * src : source address of the compressed data
+ * src_len : is the input size, therefore the compressed size
+ * dest: output buffer address of the decompressed data
+ * dest_len: is the size of the destination buffer
+ * (which must be already allocated)
+ * return  : Success if return 0
+ *   Error if return (< 0)
+ * note :  Destination buffer must be already allocated.
+ */
+int lz4_decompress(const char *src, size_t src_len, char *dest,
+   size_t *dest_len);
+#endif
diff --git a/lib/lz4/lz4_decompress.c b/lib/lz4/lz4_decompress.c
new file mode 100644
index 000..e8beb6b
--- /dev/null
+++ b/lib/lz4/lz4_decompress.c
@@ -0,0 +1,199 @@
+/*
+ * LZ4 Decompressor for Linux kernel
+ *
+ * Copyright (C) 2013 LG Electronics Co., Ltd. (http://www.lge.com/)
+ *
+ * Based on LZ4 implementation by Yann Collet.
+ *
+ * LZ4 - Fast LZ compression algorithm
+ * Copyright (C) 2011-2012, Yann Collet.
+ * BSD 2-Clause License (http://www.opensource.org/licenses/bsd-license.php)
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are
+ * met:
+ *
+ * * Redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer.
+ * * Redistributions in binary form must reproduce the above
+ * copyright notice, this list of conditions and the following disclaimer
+ * in the documentation and/or other materials provided with the
+ * distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, 
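The LZ4_COMPRESSBOUND() macro declared in the header above gives the worst-case output size for incompressible input; a quick check of the arithmetic:

```c
/* Same formula as the kernel header: worst-case compressed size for
 * an input of isize bytes (incompressible data expands slightly). */
#define LZ4_COMPRESSBOUND(isize) (isize + ((isize)/255) + 16)

static unsigned long lz4_bound(unsigned long isize)
{
	return LZ4_COMPRESSBOUND(isize);
}
```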

[RFC PATCH 4/4] x86: add support for LZ4-compressed kernels

2013-01-25 Thread Kyungsik Lee
This patch integrates the LZ4 decompression code into the x86 pre-boot code.
It depends on the two patches below:

lib: add support for LZ4-compressed kernels
decompressors: add lz4 decompressor module

Signed-off-by: Kyungsik Lee 
---
 arch/x86/Kconfig  | 1 +
 arch/x86/boot/compressed/Makefile | 5 -
 arch/x86/boot/compressed/misc.c   | 4 
 3 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 8c185d0..7142bef 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -62,6 +62,7 @@ config X86
select HAVE_KERNEL_LZMA
select HAVE_KERNEL_XZ
select HAVE_KERNEL_LZO
+   select HAVE_KERNEL_LZ4
select HAVE_HW_BREAKPOINT
select HAVE_MIXED_BREAKPOINTS_REGS
select PERF_EVENTS
diff --git a/arch/x86/boot/compressed/Makefile b/arch/x86/boot/compressed/Makefile
index 8a84501..c275db5 100644
--- a/arch/x86/boot/compressed/Makefile
+++ b/arch/x86/boot/compressed/Makefile
@@ -4,7 +4,7 @@
 # create a compressed vmlinux image from the original vmlinux
 #
 
-targets := vmlinux.lds vmlinux vmlinux.bin vmlinux.bin.gz vmlinux.bin.bz2 vmlinux.bin.lzma vmlinux.bin.xz vmlinux.bin.lzo head_$(BITS).o misc.o string.o cmdline.o early_serial_console.o piggy.o
+targets := vmlinux.lds vmlinux vmlinux.bin vmlinux.bin.gz vmlinux.bin.bz2 vmlinux.bin.lzma vmlinux.bin.xz vmlinux.bin.lzo vmlinux.bin.lz4 head_$(BITS).o misc.o string.o cmdline.o early_serial_console.o piggy.o
 
 KBUILD_CFLAGS := -m$(BITS) -D__KERNEL__ $(LINUX_INCLUDE) -O2
 KBUILD_CFLAGS += -fno-strict-aliasing -fPIC
@@ -64,12 +64,15 @@ $(obj)/vmlinux.bin.xz: $(vmlinux.bin.all-y) FORCE
$(call if_changed,xzkern)
 $(obj)/vmlinux.bin.lzo: $(vmlinux.bin.all-y) FORCE
$(call if_changed,lzo)
+$(obj)/vmlinux.bin.lz4: $(vmlinux.bin.all-y) FORCE
+   $(call if_changed,lz4)
 
 suffix-$(CONFIG_KERNEL_GZIP)   := gz
 suffix-$(CONFIG_KERNEL_BZIP2)  := bz2
 suffix-$(CONFIG_KERNEL_LZMA)   := lzma
 suffix-$(CONFIG_KERNEL_XZ) := xz
 suffix-$(CONFIG_KERNEL_LZO):= lzo
+suffix-$(CONFIG_KERNEL_LZ4):= lz4
 
 quiet_cmd_mkpiggy = MKPIGGY $@
   cmd_mkpiggy = $(obj)/mkpiggy $< > $@ || ( rm -f $@ ; false )
diff --git a/arch/x86/boot/compressed/misc.c b/arch/x86/boot/compressed/misc.c
index 88f7ff6..166a0a8 100644
--- a/arch/x86/boot/compressed/misc.c
+++ b/arch/x86/boot/compressed/misc.c
@@ -145,6 +145,10 @@ static int lines, cols;
 #include "../../../../lib/decompress_unlzo.c"
 #endif
 
+#ifdef CONFIG_KERNEL_LZ4
+#include "../../../../lib/decompress_unlz4.c"
+#endif
+
 static void scroll(void)
 {
int i;
-- 
1.8.0.3



Re: [PATCHv2] ARM: mxs: dt: Add Crystalfontz CFA-10037 device tree support

2013-01-25 Thread Shawn Guo
On Fri, Jan 25, 2013 at 10:00:35AM +0100, Maxime Ripard wrote:
> The CFA-10037 is another expansion board for the CFA-10036 module, with
> only a USB Host, an Ethernet device and a lot of gpios.
> 
> Signed-off-by: Maxime Ripard 

Applied, thanks.



Re: [PATCH v2 1/3] pwm: Add pwm_cansleep() as exported API to users

2013-01-25 Thread Thierry Reding
On Fri, Jan 25, 2013 at 02:44:29PM +0100, Florian Vaussard wrote:
> Calls to some external PWM chips can sleep. To help users,
> add pwm_cansleep() API.
> 
> Signed-off-by: Florian Vaussard 
> ---
>  drivers/pwm/core.c  |   12 
>  include/linux/pwm.h |   10 ++
>  2 files changed, 22 insertions(+), 0 deletions(-)
> 
> diff --git a/drivers/pwm/core.c b/drivers/pwm/core.c
> index 4a13da4..e737f5f 100644
> --- a/drivers/pwm/core.c
> +++ b/drivers/pwm/core.c
> @@ -763,6 +763,18 @@ void devm_pwm_put(struct device *dev, struct pwm_device *pwm)
>  }
>  EXPORT_SYMBOL_GPL(devm_pwm_put);
>  
> +/**
> +  * pwm_cansleep() - report whether pwm access will sleep

"... whether PWM access..." please.

> +  * @pwm: PWM device
> +  *
> +  * It returns nonzero if accessing the PWM can sleep.
> +  */
> +int pwm_cansleep(struct pwm_device *pwm)

I actually liked pwm_can_sleep() better. I find it to be more consistent
with the naming of other function names. It would furthermore match the
field name.

> +{
> + return pwm->chip->can_sleep;
> +}
> +EXPORT_SYMBOL_GPL(pwm_cansleep);

Would it make sense to check for NULL pointers here? I guess that
passing NULL into the function could be considered a programming error
and an oops would be okay, but in that case there's no point in making
the function return an int. Also see my next comment.

> +
>  #ifdef CONFIG_DEBUG_FS
>  static void pwm_dbg_show(struct pwm_chip *chip, struct seq_file *s)
>  {
> diff --git a/include/linux/pwm.h b/include/linux/pwm.h
> index 70655a2..e2cb5c7 100644
> --- a/include/linux/pwm.h
> +++ b/include/linux/pwm.h
> @@ -146,6 +146,8 @@ struct pwm_ops {
>   * @base: number of first PWM controlled by this chip
>   * @npwm: number of PWMs controlled by this chip
>   * @pwms: array of PWM devices allocated by the framework
> + * @can_sleep: flag must be set iff config()/enable()/disable() methods sleep,
> + *  as they must while accessing PWM chips over I2C or SPI
>   */
>  struct pwm_chip {
>   struct device   *dev;
> @@ -159,6 +161,7 @@ struct pwm_chip {
>   struct pwm_device * (*of_xlate)(struct pwm_chip *pc,
>   const struct of_phandle_args *args);
>   unsigned intof_pwm_n_cells;
> + unsigned intcan_sleep:1;

What's the reason for making this a bitfield? Couldn't we just use a
bool instead?

Thierry




[PATCH] regulator: lp8755: Use LP8755_BUCK_MAX instead of magic number

2013-01-25 Thread Axel Lin
Signed-off-by: Axel Lin 
---
 drivers/regulator/lp8755.c |6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/regulator/lp8755.c b/drivers/regulator/lp8755.c
index 8b1ce0f..f0f6ea0 100644
--- a/drivers/regulator/lp8755.c
+++ b/drivers/regulator/lp8755.c
@@ -373,7 +373,7 @@ static irqreturn_t lp8755_irq_handler(int irq, void *data)
goto err_i2c;
 
/* sent power fault detection event to specific regulator */
-   for (icnt = 0; icnt < 6; icnt++)
+   for (icnt = 0; icnt < LP8755_BUCK_MAX; icnt++)
if ((flag0 & (0x4 << icnt))
&& (pchip->irqmask & (0x04 << icnt))
&& (pchip->rdev[icnt] != NULL))
@@ -508,7 +508,7 @@ err_irq:
 
 err_regulator:
/* output disable */
-   for (icnt = 0; icnt < 0x06; icnt++)
+   for (icnt = 0; icnt < LP8755_BUCK_MAX; icnt++)
lp8755_write(pchip, icnt, 0x00);
 
return ret;
@@ -522,7 +522,7 @@ static int lp8755_remove(struct i2c_client *client)
for (icnt = 0; icnt < mphase_buck[pchip->mphase].nreg; icnt++)
regulator_unregister(pchip->rdev[icnt]);
 
-   for (icnt = 0; icnt < 0x06; icnt++)
+   for (icnt = 0; icnt < LP8755_BUCK_MAX; icnt++)
lp8755_write(pchip, icnt, 0x00);
 
if (pchip->irq != 0)
-- 
1.7.9.5





[PATCH] mmc: fix to refer NULL pointer

2013-01-25 Thread Joonyoung Shim
Check whether host->sdio_irq_thread is NULL before calling
wake_up_process() on it.

Signed-off-by: Joonyoung Shim 
---
Currently a kernel panic from a NULL pointer dereference of
host->sdio_irq_thread occurs on the Trats board using the Samsung
SDHCI driver.

 include/linux/mmc/host.h |3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/linux/mmc/host.h b/include/linux/mmc/host.h
index 61a10c1..2950fea 100644
--- a/include/linux/mmc/host.h
+++ b/include/linux/mmc/host.h
@@ -372,7 +372,8 @@ static inline void mmc_signal_sdio_irq(struct mmc_host *host)
 {
host->ops->enable_sdio_irq(host, 0);
host->sdio_irq_pending = true;
-   wake_up_process(host->sdio_irq_thread);
+   if (host->sdio_irq_thread)
+   wake_up_process(host->sdio_irq_thread);
 }
 
 #ifdef CONFIG_REGULATOR
-- 
1.7.9.5



Re: [PATCH 5/6] ARM: regulator: add tps6507x device tree data

2013-01-25 Thread Mark Brown
On Fri, Jan 25, 2013 at 06:29:49AM +, Vishwanathrao Badarkhe, Manish wrote:
> On Thu, Jan 24, 2013 at 17:30:51, Mark Brown wrote:

> I too doubt whether it should be in an architecture-specific folder,

> My code is in reference to below patch:
> arm/dts: regulator: Add tps65910 device tree data (d5d08e2e1672da627d7c9d34a9dc1089c653e23a)

> Could you please suggest where it can be moved?

We should have somewhere to put this sort of generic stuff, yes.  Not
sure where, possibly under drivers/of or some non-drivers part of the
tree.




Re: [PATCH RESEND] ARM: dts: max77686: Add DTS file for max77686 PMIC

2013-01-25 Thread Mark Brown
On Fri, Jan 25, 2013 at 03:46:08AM +0900, Dongjin Kim wrote:

> ---
>  arch/arm/boot/dts/max77686.dtsi |  156 +++

Why is this in arch/arm?  This isn't an ARM-specific chip.




[PATCH v2] mm: clean up soft_offline_page()

2013-01-25 Thread Naoya Horiguchi
Currently soft_offline_page() is hard to maintain because it has many
return points and goto statements. All of this mess comes from get_any_page().
That function should only get the page refcount, as its name implies, but it
also performs page-isolating actions such as SetPageHWPoison() and dequeuing
hugepages. This patch corrects that and introduces internal subroutines to
make the soft-offlining code more readable and maintainable.

ChangeLog v2:
  - receive returned value from __soft_offline_page and soft_offline_huge_page
  - place __soft_offline_page after soft_offline_page to reduce the diff
  - rebased onto mmotm-2013-01-23-17-04
  - add comment on double checks of PageHWpoison

Signed-off-by: Naoya Horiguchi 
---
 mm/memory-failure.c | 154 
 1 file changed, 83 insertions(+), 71 deletions(-)

diff --git mmotm-2013-01-23-17-04.orig/mm/memory-failure.c mmotm-2013-01-23-17-04/mm/memory-failure.c
index c95e19a..302625b 100644
--- mmotm-2013-01-23-17-04.orig/mm/memory-failure.c
+++ mmotm-2013-01-23-17-04/mm/memory-failure.c
@@ -1368,7 +1368,7 @@ static struct page *new_page(struct page *p, unsigned long private, int **x)
  * that is not free, and 1 for any other page type.
  * For 1 the page is returned with increased page count, otherwise not.
  */
-static int get_any_page(struct page *p, unsigned long pfn, int flags)
+static int __get_any_page(struct page *p, unsigned long pfn, int flags)
 {
int ret;
 
@@ -1393,11 +1393,9 @@ static int get_any_page(struct page *p, unsigned long pfn, int flags)
if (!get_page_unless_zero(compound_head(p))) {
if (PageHuge(p)) {
pr_info("%s: %#lx free huge page\n", __func__, pfn);
-   ret = dequeue_hwpoisoned_huge_page(compound_head(p));
+   ret = 0;
} else if (is_free_buddy_page(p)) {
pr_info("%s: %#lx free buddy page\n", __func__, pfn);
-   /* Set hwpoison bit while page is still isolated */
-   SetPageHWPoison(p);
ret = 0;
} else {
pr_info("%s: %#lx: unknown zero refcount page type %lx\n",
@@ -1413,42 +1411,62 @@ static int get_any_page(struct page *p, unsigned long pfn, int flags)
return ret;
 }
 
+static int get_any_page(struct page *page, unsigned long pfn, int flags)
+{
+   int ret = __get_any_page(page, pfn, flags);
+
+   if (ret == 1 && !PageHuge(page) && !PageLRU(page)) {
+   /*
+* Try to free it.
+*/
+   put_page(page);
+   shake_page(page, 1);
+
+   /*
+* Did it turn free?
+*/
+   ret = __get_any_page(page, pfn, 0);
+   if (!PageLRU(page)) {
+   pr_info("soft_offline: %#lx: unknown non LRU page type %lx\n",
+   pfn, page->flags);
+   return -EIO;
+   }
+   }
+   return ret;
+}
+
 static int soft_offline_huge_page(struct page *page, int flags)
 {
int ret;
unsigned long pfn = page_to_pfn(page);
struct page *hpage = compound_head(page);
 
+   /*
+* This double-check of PageHWPoison is to avoid the race with
+* memory_failure(). See also comment in __soft_offline_page().
+*/
+   lock_page(hpage);
if (PageHWPoison(hpage)) {
+   unlock_page(hpage);
+   put_page(hpage);
pr_info("soft offline: %#lx hugepage already poisoned\n", pfn);
-   ret = -EBUSY;
-   goto out;
+   return -EBUSY;
}
-
-   ret = get_any_page(page, pfn, flags);
-   if (ret < 0)
-   goto out;
-   if (ret == 0)
-   goto done;
+   unlock_page(hpage);
 
/* Keep page count to indicate a given hugepage is isolated. */
ret = migrate_huge_page(hpage, new_page, MPOL_MF_MOVE_ALL, false,
MIGRATE_SYNC);
put_page(hpage);
-   if (ret) {
+   if (ret)
pr_info("soft offline: %#lx: migration failed %d, type %lx\n",
pfn, ret, page->flags);
-   goto out;
-   }
-done:
/* keep elevated page count for bad page */
atomic_long_add(1 << compound_trans_order(hpage), &num_poisoned_pages);
-   set_page_hwpoison_huge_page(hpage);
-   dequeue_hwpoisoned_huge_page(hpage);
-out:
return ret;
 }
 
+static int __soft_offline_page(struct page *page, int flags);
+
 /**
  * soft_offline_page - Soft offline a page.
  * @page: page to offline
@@ -1477,62 +1495,60 @@ int soft_offline_page(struct page *page, int flags)
unsigned long pfn = page_to_pfn(page);
struct page *hpage = compound_trans_head(page);
 
-   if (PageHuge(page)) {
-   ret = 

Re: [PATCH 1/2] i2c-core: Add gpio based bus arbitration implementation

2013-01-25 Thread Mark Brown
On Thu, Jan 24, 2013 at 12:39:48PM +0100, Wolfram Sang wrote:
> On Thu, Jan 24, 2013 at 07:18:47PM +0800, Mark Brown wrote:

> > A read is typically implemented as a write of the register address
> > followed by a read of the value, usually with the ability to free the
> > bus in between.  If two devices attempt to access the register map
> > simultaneously this results in the address going wrong.

> Could happen. But in what situations will one not use repeated start
> here? Especially when designing a multi-master bus?

Well, you're depending on the specific drivers doing things that way and
it's actually quite rare for the controller drivers in Linux to support
I2C_M_NOSTART which discourages this.




Re: [PATCH 11/19] regmap: avoid undefined return from regmap_read_debugfs

2013-01-25 Thread Mark Brown
On Sat, Jan 26, 2013 at 12:42:26PM +0800, Mark Brown wrote:
> On Fri, Jan 25, 2013 at 02:14:28PM +, Arnd Bergmann wrote:
> > Gcc warns about the case where regmap_read_debugfs tries

> Are you sure about that function name?

> > to walk an empty map->debugfs_off_cache list, which results
> > in uninitialized variable getting returned.

> > Setting this variable to 0 first avoids the warning and
> > the potentially undefined value.

> This probably won't apply against current code as there's already a
> better fix there, in general just picking a value to initialise masks
> errors.

Resending with corrected list address; to be clear please don't send
this.




Re: [PATCH 11/19] regmap: avoid undefined return from regmap_read_debugfs

2013-01-25 Thread Mark Brown
On Fri, Jan 25, 2013 at 02:14:28PM +, Arnd Bergmann wrote:
> Gcc warns about the case where regmap_read_debugfs tries

Are you sure about that function name?

> to walk an empty map->debugfs_off_cache list, which results
> in uninitialized variable getting returned.

> Setting this variable to 0 first avoids the warning and
> the potentially undefined value.

This probably won't apply against current code as there's already a
better fix there, in general just picking a value to initialise masks
errors.




Re: [PATCH 1/2] media: add support for decoder subdevs along with sensor and others

2013-01-25 Thread Prabhakar Lad
Hi Sylwester,

On Sat, Jan 26, 2013 at 1:24 AM, Sylwester Nawrocki
 wrote:
> Hi Prahakar,
>
>
> On 01/25/2013 08:01 AM, Prabhakar Lad wrote:
>>
>> From: Manjunath Hadli
>>
>> A lot of SOCs including Texas Instruments Davinci family mainly use
>> video decoders as input devices. Here the initial subdevice node
>> from where the input really comes is this decoder, for which support
>> is needed as part of the Media Controller infrastructure. This patch
>> adds an additional flag to include the decoders along with others,
>> such as the sensor and lens.
>>
>> Signed-off-by: Manjunath Hadli
>> Signed-off-by: Lad, Prabhakar
>> ---
>>   include/uapi/linux/media.h |1 +
>>   1 files changed, 1 insertions(+), 0 deletions(-)
>>
>> diff --git a/include/uapi/linux/media.h b/include/uapi/linux/media.h
>> index 0ef8833..fa44ed9 100644
>> --- a/include/uapi/linux/media.h
>> +++ b/include/uapi/linux/media.h
>> @@ -56,6 +56,7 @@ struct media_device_info {
>>   #define MEDIA_ENT_T_V4L2_SUBDEV_SENSOR  (MEDIA_ENT_T_V4L2_SUBDEV + 1)
>>   #define MEDIA_ENT_T_V4L2_SUBDEV_FLASH   (MEDIA_ENT_T_V4L2_SUBDEV + 2)
>>   #define MEDIA_ENT_T_V4L2_SUBDEV_LENS    (MEDIA_ENT_T_V4L2_SUBDEV + 3)
>> +#define MEDIA_ENT_T_V4L2_SUBDEV_DECODER (MEDIA_ENT_T_V4L2_SUBDEV + 4)
>
>
> Such a new entity type needs to be documented in the media DocBook [1].
> It probably also deserves a comment here, as DECODER isn't that obvious
> like the other already existing entity types. I heard people referring
> to a device that encodes analog (composite) video signal into its digital
> representation as an ENCODER. :)
>
>
Thanks for pointing it out :), I'll document it and post a v2.

Regards,
--Prabhakar Lad

> [1] http://hverkuil.home.xs4all.nl/spec/media.html#media-ioc-enum-entities
>
> --
>
> Regards,
> Sylwester


Re: [PATCH v5 7/8] fat (exportfs): rebuild directory-inode if fat_dget() fails

2013-01-25 Thread OGAWA Hirofumi
Namjae Jeon  writes:

> 2013/1/20, OGAWA Hirofumi :
>> Namjae Jeon  writes:
>>
>>> We rewrite patch as your suggestion using dummy inode. Would please
>>> you review below patch code ?
>>
>> Looks like good as initial. Clean and shorter.
>>
>> Next is, we have to think about race. I.e. if real inode was made, what
>> happens? Is there no race?
> Hi OGAWA.
>
> Although I checked several routines for the hang case you mentioned, I
> didn't find anything.
> And there is no race in the test results either. Am I missing something?
> Let me know your opinion.

Hm, it's read-only, so there may be no race for now; I'm sure there is a
race on the write path though.

Thanks.
-- 
OGAWA Hirofumi 


Re: [PATCH] Negative (setpoint-dirty) in bdi_position_ratio()

2013-01-25 Thread paul . szabo
Dear Fengguang (et al),

> There are 260MB reclaimable slab pages in the normal zone, however we
> somehow failed to reclaim them. ...

Could the problem be that without CONFIG_NUMA, zone_reclaim_mode stays
at zero and anyway zone_reclaim() does nothing in include/linux/swap.h ?

Though... there is no CONFIG_NUMA nor /proc/sys/vm/zone_reclaim_mode in
the Ubuntu non-PAE "plain" HIGHMEM4G kernel, and still it handles the
"sleep test" just fine.

Where does reclaiming happen (or where is it meant to happen)?

Thanks, Paul

Paul Szabo   p...@maths.usyd.edu.au   http://www.maths.usyd.edu.au/u/psz/
School of Mathematics and Statistics   University of SydneyAustralia


Re: [PATCH 1/2]linux-usb:Define a new macro for USB storage match rules

2013-01-25 Thread Greg KH
On Fri, Jan 25, 2013 at 07:10:29PM -0800, Matthew Dharm wrote:
> I suggest one of two options:
> 
> 1) Setup an alternative mail client.  There are many to choose from
> which will not damage your patches.  I personally like 'mutt' (which
> you should be able to install on your linux machine).   Others may be
> able to recommend ones that work for them; in general, I think you
> will find that most e-mail clients that run on Linux will be suitable.

The file, Documentation/email_clients.txt will help out here.

> 2) If you plan on contributing to the linux kernel in the future, it
> may be worth your time to setup a repo on github that Greg can then
> directly pull from.  All you would need to do is send Greg a "pull
> request" indicating the URL of the branch in your repo that he should
> pull from.  Greg can then pull directly from your repo, bypassing this
> issue entirely.

No, sorry, I only pull trees from a _very_ few people, patches are what
I prefer for almost all stuff.  Only subsystem maintainers who I have
been working with for many years will I pull trees from.

sorry,

greg k-h


Re: Bug#695182: [PATCH] Subtract min_free_kbytes from dirtyable memory

2013-01-25 Thread Ben Hutchings
On Sat, 2013-01-26 at 14:07 +1100, paul.sz...@sydney.edu.au wrote:
> Dear Ben,
> 
> > ... the mm maintainers are probably much better placed ...
> 
> Exactly. Now I wonder: are you one of them?

Hah, no.

Ben.

-- 
Ben Hutchings
Any smoothly functioning technology is indistinguishable from a rigged demo.




Re: [PATCH 1/2]linux-usb:Define a new macro for USB storage match rules

2013-01-25 Thread Matthew Dharm
On Fri, Jan 25, 2013 at 6:05 PM, Greg KH  wrote:
> On Sat, Jan 26, 2013 at 01:39:50AM +, Fangxiaozhi (Franko) wrote:
>>
>>
>> > -Original Message-
>> > From: Greg KH [mailto:g...@kroah.com]
>> > Sent: Saturday, January 26, 2013 1:45 AM
>> > To: Fangxiaozhi (Franko)
>> > Cc: Sergei Shtylyov; linux-...@vger.kernel.org; 
>> > linux-kernel@vger.kernel.org;
>> > Xueguiying (Zihan); Linlei (Lei Lin); Yili (Neil); Wangyuhua (Roger, 
>> > Credit);
>> > Huqiao (C); ba...@ti.com; mdharm-...@one-eyed-alien.net;
>> > sebast...@breakpoint.cc
>> > Subject: Re: [PATCH 1/2]linux-usb:Define a new macro for USB storage match
>> > rules
>> >
>> > On Fri, Jan 25, 2013 at 04:18:34PM +0400, Sergei Shtylyov wrote:
>> > > Hello.
>> > >
>> > > On 25-01-2013 6:44, fangxiaozhi 00110321 wrote:
>> > >
>> > > >From: fangxiaozhi 
>> > >
>> > > >1. Define a new macro for USB storage match rules:
>> > > > matching with Vendor ID and interface descriptors.
>> > >
>> > > >Signed-off-by: fangxiaozhi 
>> > > >
>> > > >
>> > > >  diff -uprN linux-3.8-rc4_orig/drivers/usb/storage/usb.c
>> > > >linux-3.8-rc4/drivers/usb/storage/usb.c
>> > > >--- linux-3.8-rc4_orig/drivers/usb/storage/usb.c 2013-01-22
>> > > >14:12:42.595238727 +0800
>> > > >+++ linux-3.8-rc4/drivers/usb/storage/usb.c 2013-01-22
>> > > >+++ 14:16:01.398250305 +0800
>> > > >@@ -120,6 +120,17 @@ MODULE_PARM_DESC(quirks, "supplemental l
>> > > >   .useTransport = use_transport, \
>> > > >  }
>> > > >
>> > > >+#define UNUSUAL_VENDOR_INTF(idVendor, cl, sc, pr, \
>> > > >+ vendor_name, product_name, use_protocol, use_transport, \
>> > > >+ init_function, Flags) \
>> > > >+{ \
>> > > >+ .vendorName = vendor_name, \
>> > > >+ .productName = product_name, \
>> > > >+ .useProtocol = use_protocol, \
>> > > >+ .useTransport = use_transport, \
>> > > >+ .initFunction = init_function, \
>> > > >+}
>> > >
>> > >   Shouldn't the field initilaizers be indented with tab, not space?
>> >
>> > Yes it must.  fangxiaozhi, please always run your patches through the
>> > scripts/checkpatch.pl tool before sending them out (note, you will have to
>> > ignore the CamelCase warnings your patch produces, but not the other
>> > ones.)
>> >
>> -What's wrong with it?
>> -I have checked the patches with scripts/checkpatch.pl before sending.
>> -There is no other warning or error in my patches except CamelCase warnings.
>> -So what's wrong now?
>
> Then your email client messed up the patches and put spaces in the code
> instead of tabs.  Try looking at the message on the mailing list and run
> that through checkpatch, it will show you the problems.
>
> What I received isn't ok, sorry.

Fangxiaozhi --

According to the headers of your E-mail, you are using MS Outlook to
send your patches.  Outlook commonly mangles patches, unfortunately.
It is not a very good e-mail client.

I suggest one of two options:

1) Setup an alternative mail client.  There are many to choose from
which will not damage your patches.  I personally like 'mutt' (which
you should be able to install on your linux machine).   Others may be
able to recommend ones that work for them; in general, I think you
will find that most e-mail clients that run on Linux will be suitable.

2) If you plan on contributing to the linux kernel in the future, it
may be worth your time to setup a repo on github that Greg can then
directly pull from.  All you would need to do is send Greg a "pull
request" indicating the URL of the branch in your repo that he should
pull from.  Greg can then pull directly from your repo, bypassing this
issue entirely.

Matt


--
Matthew Dharm
Maintainer, USB Mass Storage driver for Linux


Re: Bug#695182: [PATCH] Subtract min_free_kbytes from dirtyable memory

2013-01-25 Thread paul . szabo
Dear Ben,

> ... the mm maintainers are probably much better placed ...

Exactly. Now I wonder: are you one of them?

Thanks, Paul

Paul Szabo   p...@maths.usyd.edu.au   http://www.maths.usyd.edu.au/u/psz/
School of Mathematics and Statistics   University of SydneyAustralia




Re: [PATCH] extcon: arizona: Use regulated mode for microphone supply when detecting

2013-01-25 Thread Chanwoo Choi

On 01/25/2013 06:16 PM, Mark Brown wrote:

When starting microphone detection some headsets should be exposed to
the fully regulated microphone bias in order to ensure that they behave
in an optimal fashion.

Signed-off-by: Mark Brown
---
  drivers/extcon/Kconfig  |2 +-
  drivers/extcon/extcon-arizona.c |   62 +++
  2 files changed, 63 insertions(+), 1 deletion(-)

Applied it.

Thanks,
Chanwoo Choi


Re: [PATCH v2] userns: improve uid/gid map collision detection

2013-01-25 Thread Eric W. Biederman
Aristeu Rozanski  writes:

> On Thu, Jan 24, 2013 at 04:46:12PM -0800, Andrew Morton wrote:
>> eek, a macro!  Macros are always bad.
>> 
>> This one is bad because
>> 
>> a) it's a macro
>> 
>> b) it evaluates its args multiple times and hence will cause nasty
>>bugs if called with expressions-with-side-effects.
>> 
>> c) it evaluates its args multiple times and if called with
>>non-trivial expressions the compiler might not be able to CSE those
>>expressions, leading to code bloat.
>> 
>> Add lo, this patch:
>> 
>> --- 
>> a/kernel/user_namespace.c~userns-improve-uid-gid-map-collision-detection-fix
>> +++ a/kernel/user_namespace.c
>> @@ -521,7 +521,11 @@ struct seq_operations proc_projid_seq_op
>>  
>>  static DEFINE_MUTEX(id_map_mutex);
>>  
>> -#define in_range(b,first,len) ((b)>=(first)&&(b)<(first)+(len))
>> +static bool in_range(u32 b, u32 first, u32 len)
>> +{
>> +return b >= first && b < first + len;
>> +}
>> +
>>  static inline int extent_collision(struct uid_gid_map *new_map,
>> struct uid_gid_extent *extent)
>>  {
>> 
>> reduces the user_namespace.o text from 4822 bytes to 4727 with
>> gcc-4.4.4.  This is a remarkably large difference.
>
> thanks Andrew
>
> (I see Eric already answered about the config option)

Aristeu, after looking at both my version and yours I am going with
mine.  While my code is a little wordier, I have half the number of
comparisons your code does, and I took the time to kill the variable
that introducing a function to test for range collisions makes unnecessary.
On Andrew's size metric my version seems noticeably smaller as well.

 size $PWD-build/kernel/user_namespace.o
   text    data     bss     dec     hex filename
   4376     144       0    4520    11a8 /home/eric/projects/linux/linux-userns-devel-build/kernel/user_namespace.o


Short of something unexpected I plan to push all my code to linux-next
sometime tomorrow.

Eric



[PATCH review 6/6] userns: Allow the userns root to mount tmpfs.

2013-01-25 Thread Eric W. Biederman

There is no backing store to tmpfs and file creation rules are the
same as for any other filesystem so it is semantically safe to allow
unprivileged users to mount it.  ramfs is safe for the same reasons so
allow either flavor of tmpfs to be mounted by a user namespace root
user.

The memory control group successfully limits how much memory tmpfs can
consume; on any system that cares about a user namespace root using
tmpfs to exhaust memory, the memory control group can be deployed.

Signed-off-by: "Eric W. Biederman" 
---
 mm/shmem.c |2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 5c90d84..197ca5e 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2766,6 +2766,7 @@ static struct file_system_type shmem_fs_type = {
.name   = "tmpfs",
.mount  = shmem_mount,
.kill_sb= kill_litter_super,
+   .fs_flags   = FS_USERNS_MOUNT,
 };
 
 int __init shmem_init(void)
@@ -2823,6 +2824,7 @@ static struct file_system_type shmem_fs_type = {
.name   = "tmpfs",
.mount  = ramfs_mount,
.kill_sb= kill_litter_super,
+   .fs_flags   = FS_USERNS_MOUNT,
 };
 
 int __init shmem_init(void)
-- 
1.7.5.4



[PATCH review 5/6] userns: Allow the userns root to mount ramfs.

2013-01-25 Thread Eric W. Biederman

There is no backing store to ramfs and file creation
rules are the same as for any other filesystem so
it is semantically safe to allow unprivileged users
to mount it.

The memory control group successfully limits how much
memory ramfs can consume; on any system that cares about
a user namespace root using ramfs to exhaust memory,
the memory control group can be deployed.

Signed-off-by: "Eric W. Biederman" 
---
 fs/ramfs/inode.c |1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/fs/ramfs/inode.c b/fs/ramfs/inode.c
index eab8c09..c24f1e1 100644
--- a/fs/ramfs/inode.c
+++ b/fs/ramfs/inode.c
@@ -260,6 +260,7 @@ static struct file_system_type ramfs_fs_type = {
.name   = "ramfs",
.mount  = ramfs_mount,
.kill_sb= ramfs_kill_sb,
+   .fs_flags   = FS_USERNS_MOUNT,
 };
 static struct file_system_type rootfs_fs_type = {
.name   = "rootfs",
-- 
1.7.5.4



[PATCH review 4/6] userns: Allow the userns root to mount of devpts

2013-01-25 Thread Eric W. Biederman

- The context in which devpts is mounted has no effect on the creation
  of ptys as the /dev/ptmx interface has been used by unprivileged
  users for many years.

- Only support unprivileged mounts in combination with the newinstance
  option to ensure that mounting of /dev/pts in a user namespace will
  not allow the options of an existing mount of devpts to be modified.

- Create /dev/pts/ptmx as the root user in the user namespace that
  mounts devpts so that its permissions can be changed.

Signed-off-by: "Eric W. Biederman" 
---
 fs/devpts/inode.c |   18 ++
 1 files changed, 18 insertions(+), 0 deletions(-)

diff --git a/fs/devpts/inode.c b/fs/devpts/inode.c
index 472e6be..073d30b 100644
--- a/fs/devpts/inode.c
+++ b/fs/devpts/inode.c
@@ -243,6 +243,13 @@ static int mknod_ptmx(struct super_block *sb)
struct dentry *root = sb->s_root;
struct pts_fs_info *fsi = DEVPTS_SB(sb);
struct pts_mount_opts *opts = &fsi->mount_opts;
+   kuid_t root_uid;
+   kgid_t root_gid;
+
+   root_uid = make_kuid(current_user_ns(), 0);
+   root_gid = make_kgid(current_user_ns(), 0);
+   if (!uid_valid(root_uid) || !gid_valid(root_gid))
+   return -EINVAL;
 
mutex_lock(&root->d_inode->i_mutex);
 
@@ -273,6 +280,8 @@ static int mknod_ptmx(struct super_block *sb)
 
mode = S_IFCHR|opts->ptmxmode;
init_special_inode(inode, mode, MKDEV(TTYAUX_MAJOR, 2));
+   inode->i_uid = root_uid;
+   inode->i_gid = root_gid;
 
d_add(dentry, inode);
 
@@ -438,6 +447,12 @@ static struct dentry *devpts_mount(struct file_system_type *fs_type,
if (error)
return ERR_PTR(error);
 
+   /* Require newinstance for all user namespace mounts to ensure
+* the mount options are not changed.
+*/
+   if ((current_user_ns() != &init_user_ns) && !opts.newinstance)
+   return ERR_PTR(-EINVAL);
+
if (opts.newinstance)
s = sget(fs_type, NULL, set_anon_super, flags, NULL);
else
@@ -491,6 +506,9 @@ static struct file_system_type devpts_fs_type = {
.name   = "devpts",
.mount  = devpts_mount,
.kill_sb= devpts_kill_sb,
+#ifdef CONFIG_DEVPTS_MULTIPLE_INSTANCES
+   .fs_flags   = FS_USERNS_MOUNT | FS_USERNS_DEV_MOUNT,
+#endif
 };
 
 /*
-- 
1.7.5.4

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH review 3/6] userns: Recommend use of memory control groups.

2013-01-25 Thread Eric W. Biederman

In the help text describing user namespaces recommend use of memory
control groups.  In many cases memory control groups are the only
mechanism there is to limit how much memory a user who can create
user namespaces can use.

Signed-off-by: "Eric W. Biederman" 
---
 Documentation/namespaces/resource-control.txt |   10 ++
 init/Kconfig  |7 +++
 2 files changed, 17 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/namespaces/resource-control.txt

diff --git a/Documentation/namespaces/resource-control.txt b/Documentation/namespaces/resource-control.txt
new file mode 100644
index 000..3d8178a
--- /dev/null
+++ b/Documentation/namespaces/resource-control.txt
@@ -0,0 +1,10 @@
+There are a lot of kinds of objects in the kernel that don't have
+individual limits or that have limits that are ineffective when a set
+of processes is allowed to switch user ids.  With user namespaces
+enabled in a kernel, for people who don't trust their users or their
+users' programs to play nice, this problem becomes more acute.
+
+Therefore it is recommended that memory control groups be enabled in
+kernels that enable user namespaces, and it is further recommended
+that userspace configure memory control groups to limit how much
+memory users they don't trust to play nice can use.
diff --git a/init/Kconfig b/init/Kconfig
index 7d30240..c8c58bd 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1035,6 +1035,13 @@ config USER_NS
help
  This allows containers, i.e. vservers, to use user namespaces
  to provide different user info for different servers.
+
+ When user namespaces are enabled in the kernel it is
+ recommended that the MEMCG and MEMCG_KMEM options also be
+ enabled and that user-space use the memory control groups to
+ limit the amount of memory unprivileged users can use.
+
  If unsure, say N.
 
 config PID_NS
-- 
1.7.5.4



[PATCH review 2/6] userns: Allow any uid or gid mappings that don't overlap.

2013-01-25 Thread Eric W. Biederman

When I initially wrote the code for /proc/<pid>/uid_map, I was lazy
and avoided duplicate mappings by the simple expedient of ensuring
that the first number in a new extent was greater than any number in
the previous extent.

Unfortunately that precludes a number of valid mappings, and someone
noticed and complained.  So use a simple check to ensure that ranges
in the mapping extents don't overlap.

Signed-off-by: "Eric W. Biederman" 
---
 kernel/user_namespace.c |   45 +++--
 1 files changed, 39 insertions(+), 6 deletions(-)

diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index 24f8ec3..8b65083 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -520,6 +520,42 @@ struct seq_operations proc_projid_seq_operations = {
.show = projid_m_show,
 };
 
+static bool mappings_overlap(struct uid_gid_map *new_map, struct uid_gid_extent *extent)
+{
+   u32 upper_first, lower_first, upper_last, lower_last;
+   unsigned idx;
+
+   upper_first = extent->first;
+   lower_first = extent->lower_first;
+   upper_last = upper_first + extent->count - 1;
+   lower_last = lower_first + extent->count - 1;
+
+   for (idx = 0; idx < new_map->nr_extents; idx++) {
+   u32 prev_upper_first, prev_lower_first;
+   u32 prev_upper_last, prev_lower_last;
+   struct uid_gid_extent *prev;
+
+   prev = &new_map->extent[idx];
+
+   prev_upper_first = prev->first;
+   prev_lower_first = prev->lower_first;
+   prev_upper_last = prev_upper_first + prev->count - 1;
+   prev_lower_last = prev_lower_first + prev->count - 1;
+
+   /* Does the upper range intersect a previous extent? */
+   if ((prev_upper_first <= upper_last) &&
+   (prev_upper_last >= upper_first))
+   return true;
+
+   /* Does the lower range intersect a previous extent? */
+   if ((prev_lower_first <= lower_last) &&
+   (prev_lower_last >= lower_first))
+   return true;
+   }
+   return false;
+}
+
+
 static DEFINE_MUTEX(id_map_mutex);
 
 static ssize_t map_write(struct file *file, const char __user *buf,
@@ -532,7 +568,7 @@ static ssize_t map_write(struct file *file, const char __user *buf,
struct user_namespace *ns = seq->private;
struct uid_gid_map new_map;
unsigned idx;
-   struct uid_gid_extent *extent, *last = NULL;
+   struct uid_gid_extent *extent = NULL;
unsigned long page = 0;
char *kbuf, *pos, *next_line;
ssize_t ret = -EINVAL;
@@ -635,14 +671,11 @@ static ssize_t map_write(struct file *file, const char __user *buf,
if ((extent->lower_first + extent->count) <= extent->lower_first)
goto out;
 
-   /* For now only accept extents that are strictly in order */
-   if (last &&
-   (((last->first + last->count) > extent->first) ||
-((last->lower_first + last->count) > extent->lower_first)))
+   /* Do the ranges in extent overlap any previous extents? */
+   if (mappings_overlap(&new_map, extent))
goto out;
 
new_map.nr_extents++;
-   last = extent;
 
/* Fail if the file contains too many extents */
if ((new_map.nr_extents == UID_GID_MAP_MAX_EXTENTS) &&
-- 
1.7.5.4



[PATCH review 1/6] userns: Avoid recursion in put_user_ns

2013-01-25 Thread Eric W. Biederman

When freeing a deeply nested user namespace, free_user_ns calls
put_user_ns on its parent, which may in turn call free_user_ns again.
When -fno-optimize-sibling-calls is passed to gcc one stack frame per
user namespace is left on the stack, potentially overflowing the
kernel stack.  CONFIG_FRAME_POINTER forces -fno-optimize-sibling-calls
so we can't count on gcc to optimize this code.

Remove struct kref and use a plain atomic_t, making the code more
flexible and easier to comprehend.  Make the loop in free_user_ns
explicit to guarantee that the stack does not overflow with
CONFIG_FRAME_POINTER enabled.

I have tested this fix with a simple program that uses unshare to
create a deeply nested user namespace structure and then calls exit.
With 1000 nested user namespaces, running my test program before this
change causes the kernel to die a horrible death.  With 10,000,000
nested user namespaces after this change, my test program runs to
completion and causes no harm.

Pointed-out-by: Vasily Kulikov 
Signed-off-by: "Eric W. Biederman" 
---
 include/linux/user_namespace.h |   10 +-
 kernel/user.c  |4 +---
 kernel/user_namespace.c|   17 +
 3 files changed, 15 insertions(+), 16 deletions(-)

diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index b9bd2e6..4ce0093 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -21,7 +21,7 @@ struct user_namespace {
struct uid_gid_map  uid_map;
struct uid_gid_map  gid_map;
struct uid_gid_map  projid_map;
-   struct kref kref;
+   atomic_tcount;
struct user_namespace   *parent;
kuid_t  owner;
kgid_t  group;
@@ -35,18 +35,18 @@ extern struct user_namespace init_user_ns;
 static inline struct user_namespace *get_user_ns(struct user_namespace *ns)
 {
if (ns)
-   kref_get(&ns->kref);
+   atomic_inc(&ns->count);
return ns;
 }
 
 extern int create_user_ns(struct cred *new);
 extern int unshare_userns(unsigned long unshare_flags, struct cred **new_cred);
-extern void free_user_ns(struct kref *kref);
+extern void free_user_ns(struct user_namespace *ns);
 
 static inline void put_user_ns(struct user_namespace *ns)
 {
-   if (ns)
-   kref_put(&ns->kref, free_user_ns);
+   if (ns && atomic_dec_and_test(&ns->count))
+   free_user_ns(ns);
 }
 
 struct seq_operations;
diff --git a/kernel/user.c b/kernel/user.c
index 33acb5e..57ebfd4 100644
--- a/kernel/user.c
+++ b/kernel/user.c
@@ -47,9 +47,7 @@ struct user_namespace init_user_ns = {
.count = 4294967295U,
},
},
-   .kref = {
-   .refcount   = ATOMIC_INIT(3),
-   },
+   .count = ATOMIC_INIT(3),
.owner = GLOBAL_ROOT_UID,
.group = GLOBAL_ROOT_GID,
.proc_inum = PROC_USER_INIT_INO,
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index 2b042c4..24f8ec3 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -78,7 +78,7 @@ int create_user_ns(struct cred *new)
return ret;
}
 
-   kref_init(&ns->kref);
+   atomic_set(&ns->count, 1);
/* Leave the new->user_ns reference with the new user namespace. */
ns->parent = parent_ns;
ns->owner = owner;
@@ -104,15 +104,16 @@ int unshare_userns(unsigned long unshare_flags, struct cred **new_cred)
return create_user_ns(cred);
 }
 
-void free_user_ns(struct kref *kref)
+void free_user_ns(struct user_namespace *ns)
 {
-   struct user_namespace *parent, *ns =
-   container_of(kref, struct user_namespace, kref);
+   struct user_namespace *parent;
 
-   parent = ns->parent;
-   proc_free_inum(ns->proc_inum);
-   kmem_cache_free(user_ns_cachep, ns);
-   put_user_ns(parent);
+   do {
+   parent = ns->parent;
+   proc_free_inum(ns->proc_inum);
+   kmem_cache_free(user_ns_cachep, ns);
+   ns = parent;
+   } while (atomic_dec_and_test(&parent->count));
 }
 EXPORT_SYMBOL(free_user_ns);
 
-- 
1.7.5.4



[PATCH review 0/6] miscellaneous user namespace patches

2013-01-25 Thread Eric W. Biederman

Now that I have done my worst to infect user space with some
basic tools for using user namespaces, this is my first round of patches
aimed at the 3.9 merge window.

This documents that if you care about limiting resources you want
to configure the memory control group when user namespaces are
enabled.

This enables the user namespace root to mount devpts, ramfs and tmpfs.
Functionality that is needed for practical uses of the user namespace.

This includes my patch to allow more flexibility in the input
allowed in uid_map and gid_map.

 Documentation/namespaces/resource-control.txt |   10 
 fs/devpts/inode.c |   18 +++
 fs/ramfs/inode.c  |1 +
 include/linux/user_namespace.h|   10 ++--
 init/Kconfig  |7 +++
 kernel/user.c |4 +-
 kernel/user_namespace.c   |   62 +++--
 mm/shmem.c|2 +
 8 files changed, 92 insertions(+), 22 deletions(-)

Eric W. Biederman (6):
  userns: Avoid recursion in put_user_ns
  userns: Allow any uid or gid mappings that don't overlap.
  userns: Recommend use of memory control groups.
  userns: Allow the userns root to mount devpts
  userns: Allow the userns root to mount ramfs.
  userns: Allow the userns root to mount tmpfs.


[PATCH 11/11] ksm: stop hotremove lockdep warning

2013-01-25 Thread Hugh Dickins
Complaints are rare, but lockdep still does not understand the way
ksm_memory_callback(MEM_GOING_OFFLINE) takes ksm_thread_mutex, and
holds it until the ksm_memory_callback(MEM_OFFLINE): that appears
to be a problem because notifier callbacks are made under down_read
of blocking_notifier_head->rwsem (so first the mutex is taken while
holding the rwsem, then later the rwsem is taken while still holding
the mutex); but is not in fact a problem because mem_hotplug_mutex
is held throughout the dance.

There was an attempt to fix this with mutex_lock_nested(); but if that
happened to fool lockdep two years ago, apparently it does so no longer.

I had hoped to eradicate this issue in extending KSM page migration not
to need the ksm_thread_mutex.  But then realized that although the page
migration itself is safe, we do still need to lock out ksmd and other
users of get_ksm_page() while offlining memory - at some point between
MEM_GOING_OFFLINE and MEM_OFFLINE, the struct pages themselves may
vanish, and get_ksm_page()'s accesses to them become a violation.

So, give up on holding ksm_thread_mutex itself from MEM_GOING_OFFLINE to
MEM_OFFLINE, and add a KSM_RUN_OFFLINE flag, and wait_while_offlining()
checks, to achieve the same lockout without being caught by lockdep.
This is less elegant for KSM, but it's more important to keep lockdep
useful to other users - and I apologize for how long it took to fix.

Reported-by: Gerald Schaefer 
Signed-off-by: Hugh Dickins 
---
 mm/ksm.c |   55 +++--
 1 file changed, 41 insertions(+), 14 deletions(-)

--- mmotm.orig/mm/ksm.c 2013-01-25 14:37:06.880206290 -0800
+++ mmotm/mm/ksm.c  2013-01-25 14:38:53.984208836 -0800
@@ -226,7 +226,9 @@ static unsigned int ksm_merge_across_nod
 #define KSM_RUN_STOP   0
 #define KSM_RUN_MERGE  1
 #define KSM_RUN_UNMERGE2
-static unsigned int ksm_run = KSM_RUN_STOP;
+#define KSM_RUN_OFFLINE4
+static unsigned long ksm_run = KSM_RUN_STOP;
+static void wait_while_offlining(void);
 
 static DECLARE_WAIT_QUEUE_HEAD(ksm_thread_wait);
 static DEFINE_MUTEX(ksm_thread_mutex);
@@ -1700,6 +1702,7 @@ static int ksm_scan_thread(void *nothing
 
while (!kthread_should_stop()) {
mutex_lock(&ksm_thread_mutex);
+   wait_while_offlining();
if (ksmd_should_run())
ksm_do_scan(ksm_thread_pages_to_scan);
mutex_unlock(&ksm_thread_mutex);
@@ -2056,6 +2059,22 @@ void ksm_migrate_page(struct page *newpa
 #endif /* CONFIG_MIGRATION */
 
 #ifdef CONFIG_MEMORY_HOTREMOVE
+static int just_wait(void *word)
+{
+   schedule();
+   return 0;
+}
+
+static void wait_while_offlining(void)
+{
+   while (ksm_run & KSM_RUN_OFFLINE) {
+   mutex_unlock(&ksm_thread_mutex);
+   wait_on_bit(&ksm_run, ilog2(KSM_RUN_OFFLINE),
+   just_wait, TASK_UNINTERRUPTIBLE);
+   mutex_lock(&ksm_thread_mutex);
+   }
+}
+
 static void ksm_check_stable_tree(unsigned long start_pfn,
  unsigned long end_pfn)
 {
@@ -2098,15 +2117,15 @@ static int ksm_memory_callback(struct no
switch (action) {
case MEM_GOING_OFFLINE:
/*
-* Keep it very simple for now: just lock out ksmd and
-* MADV_UNMERGEABLE while any memory is going offline.
-* mutex_lock_nested() is necessary because lockdep was alarmed
-* that here we take ksm_thread_mutex inside notifier chain
-* mutex, and later take notifier chain mutex inside
-* ksm_thread_mutex to unlock it.   But that's safe because both
-* are inside mem_hotplug_mutex.
+* Prevent ksm_do_scan(), unmerge_and_remove_all_rmap_items()
+* and remove_all_stable_nodes() while memory is going offline:
+* it is unsafe for them to touch the stable tree at this time.
+* But unmerge_ksm_pages(), rmap lookups and other entry points
+* which do not need the ksm_thread_mutex are all safe.
 */
-   mutex_lock_nested(&ksm_thread_mutex, SINGLE_DEPTH_NESTING);
+   mutex_lock(&ksm_thread_mutex);
+   ksm_run |= KSM_RUN_OFFLINE;
+   mutex_unlock(&ksm_thread_mutex);
break;
 
case MEM_OFFLINE:
@@ -2122,11 +2141,20 @@ static int ksm_memory_callback(struct no
/* fallthrough */
 
case MEM_CANCEL_OFFLINE:
+   mutex_lock(&ksm_thread_mutex);
+   ksm_run &= ~KSM_RUN_OFFLINE;
mutex_unlock(&ksm_thread_mutex);
+
+   smp_mb();   /* wake_up_bit advises this */
+   wake_up_bit(&ksm_run, ilog2(KSM_RUN_OFFLINE));
break;
}
return NOTIFY_OK;
 }
+#else
+static void wait_while_offlining(void)
+{
+}
 #endif /* CONFIG_MEMORY_HOTREMOVE */
 
 #ifdef CONFIG_SYSFS
@@ -2189,7 

[PATCH 10/11] mm: remove offlining arg to migrate_pages

2013-01-25 Thread Hugh Dickins
No functional change, but the only purpose of the offlining argument
to migrate_pages() etc, was to ensure that __unmap_and_move() could
migrate a KSM page for memory hotremove (which took ksm_thread_mutex)
but not for other callers.  Now all cases are safe, remove the arg.

Signed-off-by: Hugh Dickins 
---
 include/linux/migrate.h |   14 ++
 mm/compaction.c |2 +-
 mm/memory-failure.c |7 +++
 mm/memory_hotplug.c |3 +--
 mm/mempolicy.c  |8 +++-
 mm/migrate.c|   35 +--
 mm/page_alloc.c |6 ++
 7 files changed, 29 insertions(+), 46 deletions(-)

--- mmotm.orig/include/linux/migrate.h  2013-01-24 12:28:38.740127550 -0800
+++ mmotm/include/linux/migrate.h   2013-01-25 14:38:51.468208776 -0800
@@ -40,11 +40,9 @@ extern void putback_movable_pages(struct
 extern int migrate_page(struct address_space *,
struct page *, struct page *, enum migrate_mode);
 extern int migrate_pages(struct list_head *l, new_page_t x,
-   unsigned long private, bool offlining,
-   enum migrate_mode mode, int reason);
+   unsigned long private, enum migrate_mode mode, int reason);
 extern int migrate_huge_page(struct page *, new_page_t x,
-   unsigned long private, bool offlining,
-   enum migrate_mode mode);
+   unsigned long private, enum migrate_mode mode);
 
 extern int fail_migrate_page(struct address_space *,
struct page *, struct page *);
@@ -62,11 +60,11 @@ extern int migrate_huge_page_move_mappin
 static inline void putback_lru_pages(struct list_head *l) {}
 static inline void putback_movable_pages(struct list_head *l) {}
 static inline int migrate_pages(struct list_head *l, new_page_t x,
-   unsigned long private, bool offlining,
-   enum migrate_mode mode, int reason) { return -ENOSYS; }
+   unsigned long private, enum migrate_mode mode, int reason)
+   { return -ENOSYS; }
 static inline int migrate_huge_page(struct page *page, new_page_t x,
-   unsigned long private, bool offlining,
-   enum migrate_mode mode) { return -ENOSYS; }
+   unsigned long private, enum migrate_mode mode)
+   { return -ENOSYS; }
 
 static inline int migrate_prep(void) { return -ENOSYS; }
 static inline int migrate_prep_local(void) { return -ENOSYS; }
--- mmotm.orig/mm/compaction.c  2013-01-24 12:28:38.740127550 -0800
+++ mmotm/mm/compaction.c   2013-01-25 14:38:51.472208776 -0800
@@ -980,7 +980,7 @@ static int compact_zone(struct zone *zon
 
nr_migrate = cc->nr_migratepages;
err = migrate_pages(&cc->migratepages, compaction_alloc,
-   (unsigned long)cc, false,
+   (unsigned long)cc,
cc->sync ? MIGRATE_SYNC_LIGHT : MIGRATE_ASYNC,
MR_COMPACTION);
update_nr_listpages(cc);
--- mmotm.orig/mm/memory-failure.c  2013-01-24 12:28:38.740127550 -0800
+++ mmotm/mm/memory-failure.c   2013-01-25 14:38:51.472208776 -0800
@@ -1432,7 +1432,7 @@ static int soft_offline_huge_page(struct
goto done;
 
/* Keep page count to indicate a given hugepage is isolated. */
-   ret = migrate_huge_page(hpage, new_page, MPOL_MF_MOVE_ALL, false,
+   ret = migrate_huge_page(hpage, new_page, MPOL_MF_MOVE_ALL,
MIGRATE_SYNC);
put_page(hpage);
if (ret) {
@@ -1564,11 +1564,10 @@ int soft_offline_page(struct page *page,
if (!ret) {
LIST_HEAD(pagelist);
inc_zone_page_state(page, NR_ISOLATED_ANON +
-   page_is_file_cache(page));
+   page_is_file_cache(page));
list_add(&page->lru, &pagelist);
ret = migrate_pages(, new_page, MPOL_MF_MOVE_ALL,
-   false, MIGRATE_SYNC,
-   MR_MEMORY_FAILURE);
+   MIGRATE_SYNC, MR_MEMORY_FAILURE);
if (ret) {
putback_lru_pages(&pagelist);
pr_info("soft offline: %#lx: migration failed %d, type %lx\n",
--- mmotm.orig/mm/memory_hotplug.c  2013-01-24 12:28:38.740127550 -0800
+++ mmotm/mm/memory_hotplug.c   2013-01-25 14:38:51.472208776 -0800
@@ -1283,8 +1283,7 @@ do_migrate_range(unsigned long start_pfn
 * migrate_pages returns # of failed pages.
 */
ret = migrate_pages(&source, alloc_migrate_target, 0,
-   true, MIGRATE_SYNC,
-   MR_MEMORY_HOTPLUG);
+   MIGRATE_SYNC, 

[PATCH 9/11] ksm: enable KSM page migration

2013-01-25 Thread Hugh Dickins
Migration of KSM pages is now safe: remove the PageKsm restrictions from
mempolicy.c and migrate.c.

But keep PageKsm out of __unmap_and_move()'s anon_vma contortions, which
are irrelevant to KSM: it looks as if that code was preventing hotremove
migration of KSM pages, unless they happened to be in swapcache.

There is some question as to whether enforcing a NUMA mempolicy migration
ought to migrate KSM pages, mapped into entirely unrelated processes; but
moving page_mapcount > 1 is only permitted with MPOL_MF_MOVE_ALL anyway,
and it seems reasonable to assume that you wouldn't set MADV_MERGEABLE on
any area where this is a worry.

Signed-off-by: Hugh Dickins 
---
 mm/mempolicy.c |3 +--
 mm/migrate.c   |   21 +++--
 2 files changed, 4 insertions(+), 20 deletions(-)

--- mmotm.orig/mm/mempolicy.c   2013-01-24 12:28:38.848127553 -0800
+++ mmotm/mm/mempolicy.c2013-01-25 14:38:49.596208731 -0800
@@ -496,9 +496,8 @@ static int check_pte_range(struct vm_are
/*
 * vm_normal_page() filters out zero pages, but there might
 * still be PageReserved pages to skip, perhaps in a VDSO.
-* And we cannot move PageKsm pages sensibly or safely yet.
 */
-   if (PageReserved(page) || PageKsm(page))
+   if (PageReserved(page))
continue;
nid = page_to_nid(page);
if (node_isset(nid, *nodes) == !!(flags & MPOL_MF_INVERT))
--- mmotm.orig/mm/migrate.c 2013-01-25 14:37:03.832206218 -0800
+++ mmotm/mm/migrate.c  2013-01-25 14:38:49.596208731 -0800
@@ -731,20 +731,6 @@ static int __unmap_and_move(struct page
lock_page(page);
}
 
-   /*
-* Only memory hotplug's offline_pages() caller has locked out KSM,
-* and can safely migrate a KSM page.  The other cases have skipped
-* PageKsm along with PageReserved - but it is only now when we have
-* the page lock that we can be certain it will not go KSM beneath us
-* (KSM will not upgrade a page from PageAnon to PageKsm when it sees
-* its pagecount raised, but only here do we take the page lock which
-* serializes that).
-*/
-   if (PageKsm(page) && !offlining) {
-   rc = -EBUSY;
-   goto unlock;
-   }
-
/* charge against new page */
mem_cgroup_prepare_migration(page, newpage, );
 
@@ -771,7 +757,7 @@ static int __unmap_and_move(struct page
 * File Caches may use write_page() or lock_page() in migration, then,
 * just care Anon page here.
 */
-   if (PageAnon(page)) {
+   if (PageAnon(page) && !PageKsm(page)) {
/*
 * Only page_lock_anon_vma_read() understands the subtleties of
 * getting a hold on an anon_vma from outside one of its mms.
@@ -851,7 +837,6 @@ uncharge:
mem_cgroup_end_migration(mem, page, newpage,
 (rc == MIGRATEPAGE_SUCCESS ||
  rc == MIGRATEPAGE_BALLOON_SUCCESS));
-unlock:
unlock_page(page);
 out:
return rc;
@@ -1156,7 +1141,7 @@ static int do_move_page_to_node_array(st
goto set_status;
 
/* Use PageReserved to check for zero page */
-   if (PageReserved(page) || PageKsm(page))
+   if (PageReserved(page))
goto put_and_set;
 
pp->page = page;
@@ -1318,7 +1303,7 @@ static void do_pages_stat_array(struct m
 
err = -ENOENT;
/* Use PageReserved to check for zero page */
-   if (!page || PageReserved(page) || PageKsm(page))
+   if (!page || PageReserved(page))
goto set_status;
 
err = page_to_nid(page);


[PATCH 8/11] ksm: make !merge_across_nodes migration safe

2013-01-25 Thread Hugh Dickins
The new KSM NUMA merge_across_nodes knob introduces a problem, when it's
set to non-default 0: if a KSM page is migrated to a different NUMA node,
how do we migrate its stable node to the right tree?  And what if that
collides with an existing stable node?

ksm_migrate_page() can do no more than it's already doing, updating
stable_node->kpfn: the stable tree itself cannot be manipulated without
holding ksm_thread_mutex.  So accept that a stable tree may temporarily
indicate a page belonging to the wrong NUMA node, leave updating until
the next pass of ksmd, just be careful not to merge other pages on to a
misplaced page.  Note nid of holding tree in stable_node, and recognize
that it will not always match nid of kpfn.

A misplaced KSM page is discovered, either when ksm_do_scan() next comes
around to one of its rmap_items (we now have to go to cmp_and_merge_page
even on pages in a stable tree), or when stable_tree_search() arrives at
a matching node for another page, and this node page is found misplaced.

In each case, move the misplaced stable_node to a list of migrate_nodes
(and use the address of migrate_nodes as magic by which to identify them):
we don't need them in a tree.  If stable_tree_search() finds no match for
a page, but it's currently exiled to this list, then slot its stable_node
right there into the tree, bringing all of its mappings with it; otherwise
they get migrated one by one to the original page of the colliding node.
stable_tree_search() is now modelled more like stable_tree_insert(),
in order to handle these insertions of migrated nodes.

remove_node_from_stable_tree(), remove_all_stable_nodes() and
ksm_check_stable_tree() have to handle the migrate_nodes list as well as
the stable tree itself.  Less obviously, we do need to prune the list of
stale entries from time to time (scan_get_next_rmap_item() does it once
each full scan): whereas stale nodes in the stable tree get naturally
pruned as searches try to brush past them, these migrate_nodes may get
forgotten and accumulate.

Signed-off-by: Hugh Dickins 
---
 mm/ksm.c |  164 +++--
 1 file changed, 134 insertions(+), 30 deletions(-)

--- mmotm.orig/mm/ksm.c 2013-01-25 14:37:03.832206218 -0800
+++ mmotm/mm/ksm.c  2013-01-25 14:37:06.880206290 -0800
@@ -122,13 +122,25 @@ struct ksm_scan {
 /**
  * struct stable_node - node of the stable rbtree
  * @node: rb node of this ksm page in the stable tree
+ * @head: (overlaying parent) &migrate_nodes indicates temporarily on that list
+ * @list: linked into migrate_nodes, pending placement in the proper node tree
  * @hlist: hlist head of rmap_items using this ksm page
- * @kpfn: page frame number of this ksm page
+ * @kpfn: page frame number of this ksm page (perhaps temporarily on wrong nid)
+ * @nid: NUMA node id of stable tree in which linked (may not match kpfn)
  */
 struct stable_node {
-   struct rb_node node;
+   union {
+   struct rb_node node;/* when node of stable tree */
+   struct {/* when listed for migration */
+   struct list_head *head;
+   struct list_head list;
+   };
+   };
struct hlist_head hlist;
unsigned long kpfn;
+#ifdef CONFIG_NUMA
+   int nid;
+#endif
 };
 
 /**
@@ -169,6 +181,9 @@ struct rmap_item {
 static struct rb_root root_unstable_tree[MAX_NUMNODES];
 static struct rb_root root_stable_tree[MAX_NUMNODES];
 
+/* Recently migrated nodes of stable tree, pending proper placement */
+static LIST_HEAD(migrate_nodes);
+
 #define MM_SLOTS_HASH_BITS 10
 static DEFINE_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
 
@@ -311,11 +326,6 @@ static void insert_to_mm_slots_hash(stru
hash_add(mm_slots_hash, &mm_slot->link, (unsigned long)mm);
 }
 
-static inline int in_stable_tree(struct rmap_item *rmap_item)
-{
-   return rmap_item->address & STABLE_FLAG;
-}
-
 /*
  * ksmd, and unmerge_and_remove_all_rmap_items(), must not touch an mm's
  * page tables after it has passed through ksm_exit() - which, if necessary,
@@ -476,7 +486,6 @@ static void remove_node_from_stable_tree
 {
struct rmap_item *rmap_item;
struct hlist_node *hlist;
-   int nid;
 
hlist_for_each_entry(rmap_item, hlist, &stable_node->hlist, hlist) {
if (rmap_item->hlist.next)
@@ -488,8 +497,11 @@ static void remove_node_from_stable_tree
cond_resched();
}
 
-   nid = get_kpfn_nid(stable_node->kpfn);
-   rb_erase(&stable_node->node, &root_stable_tree[nid]);
+   if (stable_node->head == &migrate_nodes)
+   list_del(&stable_node->list);
+   else
+   rb_erase(&stable_node->node,
+&root_stable_tree[NUMA(stable_node->nid)]);
free_stable_node(stable_node);
 }
 
@@ -712,6 +724,7 @@ static int remove_stable_node(struct sta
 static int remove_all_stable_nodes(void)
 {
struct stable_node *stable_node;
+   struct list_head *this, *next;
  

Re: [PATCH 1/2]linux-usb:Define a new macro for USB storage match rules

2013-01-25 Thread Greg KH
On Sat, Jan 26, 2013 at 01:39:50AM +, Fangxiaozhi (Franko) wrote:
> 
> 
> > -Original Message-
> > From: Greg KH [mailto:g...@kroah.com]
> > Sent: Saturday, January 26, 2013 1:45 AM
> > To: Fangxiaozhi (Franko)
> > Cc: Sergei Shtylyov; linux-...@vger.kernel.org; 
> > linux-kernel@vger.kernel.org;
> > Xueguiying (Zihan); Linlei (Lei Lin); Yili (Neil); Wangyuhua (Roger, 
> > Credit);
> > Huqiao (C); ba...@ti.com; mdharm-...@one-eyed-alien.net;
> > sebast...@breakpoint.cc
> > Subject: Re: [PATCH 1/2]linux-usb:Define a new macro for USB storage match
> > rules
> > 
> > On Fri, Jan 25, 2013 at 04:18:34PM +0400, Sergei Shtylyov wrote:
> > > Hello.
> > >
> > > On 25-01-2013 6:44, fangxiaozhi 00110321 wrote:
> > >
> > > >From: fangxiaozhi 
> > >
> > > >1. Define a new macro for USB storage match rules:
> > > > matching with Vendor ID and interface descriptors.
> > >
> > > >Signed-off-by: fangxiaozhi 
> > > >
> > > >
> > > >  diff -uprN linux-3.8-rc4_orig/drivers/usb/storage/usb.c
> > > >linux-3.8-rc4/drivers/usb/storage/usb.c
> > > >--- linux-3.8-rc4_orig/drivers/usb/storage/usb.c 2013-01-22
> > > >14:12:42.595238727 +0800
> > > >+++ linux-3.8-rc4/drivers/usb/storage/usb.c 2013-01-22
> > > >+++ 14:16:01.398250305 +0800
> > > >@@ -120,6 +120,17 @@ MODULE_PARM_DESC(quirks, "supplemental l
> > > >   .useTransport = use_transport, \
> > > >  }
> > > >
> > > >+#define UNUSUAL_VENDOR_INTF(idVendor, cl, sc, pr, \
> > > >+ vendor_name, product_name, use_protocol, use_transport, \
> > > >+ init_function, Flags) \
> > > >+{ \
> > > >+ .vendorName = vendor_name, \
> > > >+ .productName = product_name, \
> > > >+ .useProtocol = use_protocol, \
> > > >+ .useTransport = use_transport, \
> > > >+ .initFunction = init_function, \
> > > >+}
> > >
> > >   Shouldn't the field initilaizers be indented with tab, not space?
> > 
> > Yes it must.  fangxiaozhi, please always run your patches through the
> > scripts/checkpatch.pl tool before sending them out (note, you will have to
> > ignore the CamelCase warnings your patch produces, but not the other
> > ones.)
> > 
> -What's wrong with it?
> -I have checked the patches with scripts/checkpatch.pl before sending.
> -There is no other warning or error in my patches except CamelCase 
> warnings.
> -So what's wrong now?

Then your email client messed up the patches and put spaces in the code
instead of tabs.  Try looking at the message on the mailing list and run
that through checkpatch, it will show you the problems.

What I received isn't ok, sorry.

greg k-h


[PATCH 7/11] ksm: make KSM page migration possible

2013-01-25 Thread Hugh Dickins
KSM page migration is already supported in the case of memory hotremove,
which takes the ksm_thread_mutex across all its migrations to keep life
simple.

But the new KSM NUMA merge_across_nodes knob introduces a problem, when
it's set to non-default 0: if a KSM page is migrated to a different NUMA
node, how do we migrate its stable node to the right tree?  And what if
that collides with an existing stable node?

So far there's no provision for that, and this patch does not attempt
to deal with it either.  But how will I test a solution, when I don't
know how to hotremove memory?  The best answer is to enable KSM page
migration in all cases now, and test more common cases.  With THP and
compaction added since KSM came in, page migration is now mainstream,
and it's a shame that a KSM page can frustrate freeing a page block.

Without worrying about merge_across_nodes 0 for now, this patch gets
KSM page migration working reliably for default merge_across_nodes 1
(but leave the patch enabling it until near the end of the series).

It's much simpler than I'd originally imagined, and does not require
an additional tier of locking: page migration relies on the page lock,
KSM page reclaim relies on the page lock, the page lock is enough for
KSM page migration too.

Almost all the care has to be in get_ksm_page(): that's the function
which worries about when a stable node is stale and should be freed,
now it also has to worry about the KSM page being migrated.

The only new overhead is an additional put/get/lock/unlock_page when
stable_tree_search() arrives at a matching node: to make sure migration
respects the raised page count, and so does not migrate the page while
we're busy with it here.  That's probably avoidable, either by changing
internal interfaces from using kpage to stable_node, or by moving the
ksm_migrate_page() callsite into a page_freeze_refs() section (even if
not swapcache); but this works well, I've no urge to pull it apart now.

(Descents of the stable tree may pass through nodes whose KSM pages are
under migration: being unlocked, the raised page count does not prevent
that, nor need it: it's safe to memcmp against either old or new page.)

You might worry about mremap, and whether page migration's rmap_walk
to remove migration entries will find all the KSM locations where it
inserted earlier: that should already be handled, by the satisfyingly
heavy hammer of move_vma()'s call to ksm_madvise(,,,MADV_UNMERGEABLE,).

Signed-off-by: Hugh Dickins 
---
 mm/ksm.c |   94 ++---
 mm/migrate.c |5 ++
 2 files changed, 77 insertions(+), 22 deletions(-)

--- mmotm.orig/mm/ksm.c 2013-01-25 14:37:00.768206145 -0800
+++ mmotm/mm/ksm.c  2013-01-25 14:37:03.832206218 -0800
@@ -499,6 +499,7 @@ static void remove_node_from_stable_tree
  * In which case we can trust the content of the page, and it
  * returns the gotten page; but if the page has now been zapped,
  * remove the stale node from the stable tree and return NULL.
+ * But beware, the stable node's page might be being migrated.
  *
  * You would expect the stable_node to hold a reference to the ksm page.
  * But if it increments the page's count, swapping out has to wait for
@@ -509,44 +510,77 @@ static void remove_node_from_stable_tree
  * pointing back to this stable node.  This relies on freeing a PageAnon
  * page to reset its page->mapping to NULL, and relies on no other use of
  * a page to put something that might look like our key in page->mapping.
- *
- * include/linux/pagemap.h page_cache_get_speculative() is a good reference,
- * but this is different - made simpler by ksm_thread_mutex being held, but
- * interesting for assuming that no other use of the struct page could ever
- * put our expected_mapping into page->mapping (or a field of the union which
- * coincides with page->mapping).
- *
- * Note: it is possible that get_ksm_page() will return NULL one moment,
- * then page the next, if the page is in between page_freeze_refs() and
- * page_unfreeze_refs(): this shouldn't be a problem anywhere, the page
  * is on its way to being freed; but it is an anomaly to bear in mind.
  */
 static struct page *get_ksm_page(struct stable_node *stable_node, bool locked)
 {
struct page *page;
void *expected_mapping;
+   unsigned long kpfn;
 
-   page = pfn_to_page(stable_node->kpfn);
expected_mapping = (void *)stable_node +
(PAGE_MAPPING_ANON | PAGE_MAPPING_KSM);
-   if (page->mapping != expected_mapping)
-   goto stale;
-   if (!get_page_unless_zero(page))
+again:
+   kpfn = ACCESS_ONCE(stable_node->kpfn);
+   page = pfn_to_page(kpfn);
+
+   /*
+* page is computed from kpfn, so on most architectures reading
+* page->mapping is naturally ordered after reading node->kpfn,
+* but on Alpha we need to be more careful.
+*/
+   smp_read_barrier_depends();
+   if 

[PATCH 6/11] ksm: remove old stable nodes more thoroughly

2013-01-25 Thread Hugh Dickins
Switching merge_across_nodes after running KSM is liable to oops on stale
nodes still left over from the previous stable tree.  It's not something
that people will often want to do, but it would be lame to demand a reboot
when they're trying to determine which merge_across_nodes setting is best.

How can this happen?  We only permit switching merge_across_nodes when
pages_shared is 0, and usually set run 2 to force that beforehand, which
ought to unmerge everything: yet oopses still occur when you then run 1.

Three causes:

1. The old stable tree (built according to the inverse merge_across_nodes)
has not been fully torn down.  A stable node lingers until get_ksm_page()
notices that the page it references no longer references it: but the page
is not necessarily freed as soon as expected, particularly when swapcache.

Fix this with a pass through the old stable tree, applying get_ksm_page()
to each of the remaining nodes (most found stale and removed immediately),
with forced removal of any left over.  Unless the page is still mapped:
I've not seen that case, it shouldn't occur, but better to WARN_ON_ONCE
and EBUSY than BUG.

2. __ksm_enter() has a nice little optimization, to insert the new mm
just behind ksmd's cursor, so there's a full pass for it to stabilize
(or be removed) before ksmd addresses it.  Nice when ksmd is running,
but not so nice when we're trying to unmerge all mms: we were missing
those mms forked and inserted behind the unmerge cursor.  Easily fixed
by inserting at the end when KSM_RUN_UNMERGE.

3. It is possible for a KSM page to be faulted back from swapcache into
an mm, just after unmerge_and_remove_all_rmap_items() scanned past it.
Fix this by copying on fault when KSM_RUN_UNMERGE: but that is private
to ksm.c, so dissolve the distinction between ksm_might_need_to_copy()
and ksm_does_need_to_copy(), doing it all in the one call into ksm.c.

A long outstanding, unrelated bugfix sneaks in with that third fix:
ksm_does_need_to_copy() would copy from a !PageUptodate page (implying
I/O error when read in from swap) to a page which it then marks Uptodate.
Fix this case by not copying, letting do_swap_page() discover the error.

Signed-off-by: Hugh Dickins 
---
 include/linux/ksm.h |   18 ++---
 mm/ksm.c|   83 +++---
 mm/memory.c |   19 -
 3 files changed, 92 insertions(+), 28 deletions(-)

--- mmotm.orig/include/linux/ksm.h  2013-01-25 14:27:58.220193250 -0800
+++ mmotm/include/linux/ksm.h   2013-01-25 14:37:00.764206145 -0800
@@ -16,9 +16,6 @@
 struct stable_node;
 struct mem_cgroup;
 
-struct page *ksm_does_need_to_copy(struct page *page,
-   struct vm_area_struct *vma, unsigned long address);
-
 #ifdef CONFIG_KSM
 int ksm_madvise(struct vm_area_struct *vma, unsigned long start,
unsigned long end, int advice, unsigned long *vm_flags);
@@ -73,15 +70,8 @@ static inline void set_page_stable_node(
  * We'd like to make this conditional on vma->vm_flags & VM_MERGEABLE,
  * but what if the vma was unmerged while the page was swapped out?
  */
-static inline int ksm_might_need_to_copy(struct page *page,
-   struct vm_area_struct *vma, unsigned long address)
-{
-   struct anon_vma *anon_vma = page_anon_vma(page);
-
-   return anon_vma &&
-   (anon_vma->root != vma->anon_vma->root ||
-page->index != linear_page_index(vma, address));
-}
+struct page *ksm_might_need_to_copy(struct page *page,
+   struct vm_area_struct *vma, unsigned long address);
 
 int page_referenced_ksm(struct page *page,
struct mem_cgroup *memcg, unsigned long *vm_flags);
@@ -113,10 +103,10 @@ static inline int ksm_madvise(struct vm_
return 0;
 }
 
-static inline int ksm_might_need_to_copy(struct page *page,
+static inline struct page *ksm_might_need_to_copy(struct page *page,
struct vm_area_struct *vma, unsigned long address)
 {
-   return 0;
+   return page;
 }
 
 static inline int page_referenced_ksm(struct page *page,
--- mmotm.orig/mm/ksm.c 2013-01-25 14:36:58.856206099 -0800
+++ mmotm/mm/ksm.c  2013-01-25 14:37:00.768206145 -0800
@@ -644,6 +644,57 @@ static int unmerge_ksm_pages(struct vm_a
 /*
  * Only called through the sysfs control interface:
  */
+static int remove_stable_node(struct stable_node *stable_node)
+{
+   struct page *page;
+   int err;
+
+   page = get_ksm_page(stable_node, true);
+   if (!page) {
+   /*
+* get_ksm_page did remove_node_from_stable_tree itself.
+*/
+   return 0;
+   }
+
+   if (WARN_ON_ONCE(page_mapped(page)))
+   err = -EBUSY;
+   else {
+   /*
+* This page might be in a pagevec waiting to be freed,
+* or it might be PageSwapCache (perhaps under writeback),
+* or it might have been 

[PATCH 5/11] ksm: get_ksm_page locked

2013-01-25 Thread Hugh Dickins
In some places where get_ksm_page() is used, we need the page to be locked.

When KSM migration is fully enabled, we shall want that to make sure that
the page just acquired cannot be migrated beneath us (raised page count is
only effective when there is serialization to make sure migration notices).
Whereas when navigating through the stable tree, we certainly do not want
to lock each node (raised page count is enough to guarantee the memcmps,
even if page is migrated to another node).

Since we're about to add another use case, add the locked argument to
get_ksm_page() now.

Hmm, what's that rcu_read_lock() about?  Complete misunderstanding, I
really got the wrong end of the stick on that!  There's a configuration
in which page_cache_get_speculative() can do something cheaper than
get_page_unless_zero(), relying on its caller's rcu_read_lock() to have
disabled preemption for it.  There's no need for rcu_read_lock() around
get_page_unless_zero() (and mapping checks) here.  Cut out that
silliness before making this any harder to understand.

Signed-off-by: Hugh Dickins 
---
 mm/ksm.c |   23 +--
 1 file changed, 13 insertions(+), 10 deletions(-)

--- mmotm.orig/mm/ksm.c 2013-01-25 14:36:53.244205966 -0800
+++ mmotm/mm/ksm.c  2013-01-25 14:36:58.856206099 -0800
@@ -514,15 +514,14 @@ static void remove_node_from_stable_tree
  * but this is different - made simpler by ksm_thread_mutex being held, but
  * interesting for assuming that no other use of the struct page could ever
  * put our expected_mapping into page->mapping (or a field of the union which
- * coincides with page->mapping).  The RCU calls are not for KSM at all, but
- * to keep the page_count protocol described with page_cache_get_speculative.
+ * coincides with page->mapping).
  *
  * Note: it is possible that get_ksm_page() will return NULL one moment,
  * then page the next, if the page is in between page_freeze_refs() and
  * page_unfreeze_refs(): this shouldn't be a problem anywhere, the page
  * is on its way to being freed; but it is an anomaly to bear in mind.
  */
-static struct page *get_ksm_page(struct stable_node *stable_node)
+static struct page *get_ksm_page(struct stable_node *stable_node, bool locked)
 {
struct page *page;
void *expected_mapping;
@@ -530,7 +529,6 @@ static struct page *get_ksm_page(struct
page = pfn_to_page(stable_node->kpfn);
expected_mapping = (void *)stable_node +
(PAGE_MAPPING_ANON | PAGE_MAPPING_KSM);
-   rcu_read_lock();
if (page->mapping != expected_mapping)
goto stale;
if (!get_page_unless_zero(page))
@@ -539,10 +537,16 @@ static struct page *get_ksm_page(struct
put_page(page);
goto stale;
}
-   rcu_read_unlock();
+   if (locked) {
+   lock_page(page);
+   if (page->mapping != expected_mapping) {
+   unlock_page(page);
+   put_page(page);
+   goto stale;
+   }
+   }
return page;
 stale:
-   rcu_read_unlock();
remove_node_from_stable_tree(stable_node);
return NULL;
 }
@@ -558,11 +562,10 @@ static void remove_rmap_item_from_tree(s
struct page *page;
 
stable_node = rmap_item->head;
-   page = get_ksm_page(stable_node);
+   page = get_ksm_page(stable_node, true);
if (!page)
goto out;
 
-   lock_page(page);
	hlist_del(&rmap_item->hlist);
unlock_page(page);
put_page(page);
@@ -1042,7 +1045,7 @@ static struct page *stable_tree_search(s
 
cond_resched();
stable_node = rb_entry(node, struct stable_node, node);
-   tree_page = get_ksm_page(stable_node);
+   tree_page = get_ksm_page(stable_node, false);
if (!tree_page)
return NULL;
 
@@ -1086,7 +1089,7 @@ static struct stable_node *stable_tree_i
 
cond_resched();
stable_node = rb_entry(*new, struct stable_node, node);
-   tree_page = get_ksm_page(stable_node);
+   tree_page = get_ksm_page(stable_node, false);
if (!tree_page)
return NULL;
 


[PATCH 4/11] ksm: reorganize ksm_check_stable_tree

2013-01-25 Thread Hugh Dickins
Memory hotremove's ksm_check_stable_tree() is pitifully inefficient
(restarting whenever it finds a stale node to remove), but rearrange
so that at least it does not needlessly restart from nid 0 each time.
And add a couple of comments: here is why we keep pfn instead of page.

Signed-off-by: Hugh Dickins 
---
 mm/ksm.c |   38 ++
 1 file changed, 22 insertions(+), 16 deletions(-)

--- mmotm.orig/mm/ksm.c 2013-01-25 14:36:52.152205940 -0800
+++ mmotm/mm/ksm.c  2013-01-25 14:36:53.244205966 -0800
@@ -1830,31 +1830,36 @@ void ksm_migrate_page(struct page *newpa
 #endif /* CONFIG_MIGRATION */
 
 #ifdef CONFIG_MEMORY_HOTREMOVE
-static struct stable_node *ksm_check_stable_tree(unsigned long start_pfn,
-unsigned long end_pfn)
+static void ksm_check_stable_tree(unsigned long start_pfn,
+ unsigned long end_pfn)
 {
+   struct stable_node *stable_node;
struct rb_node *node;
int nid;
 
-   for (nid = 0; nid < nr_node_ids; nid++)
-   for (node = rb_first(&root_stable_tree[nid]); node;
-   node = rb_next(node)) {
-   struct stable_node *stable_node;
-
+   for (nid = 0; nid < nr_node_ids; nid++) {
+   node = rb_first(&root_stable_tree[nid]);
+   while (node) {
stable_node = rb_entry(node, struct stable_node, node);
if (stable_node->kpfn >= start_pfn &&
-   stable_node->kpfn < end_pfn)
-   return stable_node;
+   stable_node->kpfn < end_pfn) {
+   /*
+* Don't get_ksm_page, page has already gone:
+* which is why we keep kpfn instead of page*
+*/
+   remove_node_from_stable_tree(stable_node);
+   node = rb_first(&root_stable_tree[nid]);
+   } else
+   node = rb_next(node);
+   cond_resched();
}
-
-   return NULL;
+   }
 }
 
 static int ksm_memory_callback(struct notifier_block *self,
   unsigned long action, void *arg)
 {
struct memory_notify *mn = arg;
-   struct stable_node *stable_node;
 
switch (action) {
case MEM_GOING_OFFLINE:
@@ -1874,11 +1879,12 @@ static int ksm_memory_callback(struct no
/*
 * Most of the work is done by page migration; but there might
 * be a few stable_nodes left over, still pointing to struct
-* pages which have been offlined: prune those from the tree.
+* pages which have been offlined: prune those from the tree,
+* otherwise get_ksm_page() might later try to access a
+* non-existent struct page.
 */
-   while ((stable_node = ksm_check_stable_tree(mn->start_pfn,
-   mn->start_pfn + mn->nr_pages)) != NULL)
-   remove_node_from_stable_tree(stable_node);
+   ksm_check_stable_tree(mn->start_pfn,
+ mn->start_pfn + mn->nr_pages);
/* fallthrough */
 
case MEM_CANCEL_OFFLINE:


[PATCH 3/11] ksm: trivial tidyups

2013-01-25 Thread Hugh Dickins
Add NUMA() and DO_NUMA() macros to minimize blight of #ifdef CONFIG_NUMAs
(but indeed we don't want to expand struct rmap_item by nid when not NUMA).
Add comment, remove "unsigned" from rmap_item->nid, as "int nid" elsewhere.
Define ksm_merge_across_nodes 1U when #ifndef NUMA to help optimizing out.
Use ?: in get_kpfn_nid().  Adjust a few comments noticed in ongoing work.

Leave stable_tree_insert()'s rb_linkage until after the node has been set
up, as unstable_tree_search_insert() does: ksm_thread_mutex and page lock
make either way safe, but we're going to copy and I prefer this precedent.

Signed-off-by: Hugh Dickins 
---
 mm/ksm.c |   48 ++--
 1 file changed, 22 insertions(+), 26 deletions(-)

--- mmotm.orig/mm/ksm.c 2013-01-25 14:36:38.608205618 -0800
+++ mmotm/mm/ksm.c  2013-01-25 14:36:52.152205940 -0800
@@ -41,6 +41,14 @@
#include <asm/tlbflush.h>
 #include "internal.h"
 
+#ifdef CONFIG_NUMA
+#define NUMA(x)(x)
+#define DO_NUMA(x) (x)
+#else
+#define NUMA(x)(0)
+#define DO_NUMA(x) do { } while (0)
+#endif
+
 /*
  * A few notes about the KSM scanning process,
  * to make it easier to understand the data structures below:
@@ -130,6 +138,7 @@ struct stable_node {
  * @mm: the memory structure this rmap_item is pointing into
  * @address: the virtual address this rmap_item tracks (+ flags in low bits)
  * @oldchecksum: previous checksum of the page at that virtual address
+ * @nid: NUMA node id of unstable tree in which linked (may not match page)
  * @node: rb node of this rmap_item in the unstable tree
  * @head: pointer to stable_node heading this list in the stable tree
  * @hlist: link into hlist of rmap_items hanging off that stable_node
@@ -141,7 +150,7 @@ struct rmap_item {
unsigned long address;  /* + low bits used for flags below */
unsigned int oldchecksum;   /* when unstable */
 #ifdef CONFIG_NUMA
-   unsigned int nid;
+   int nid;
 #endif
union {
struct rb_node node;/* when node of unstable tree */
@@ -192,8 +201,12 @@ static unsigned int ksm_thread_pages_to_
 /* Milliseconds ksmd should sleep between batches */
 static unsigned int ksm_thread_sleep_millisecs = 20;
 
+#ifdef CONFIG_NUMA
 /* Zeroed when merging across nodes is not allowed */
 static unsigned int ksm_merge_across_nodes = 1;
+#else
+#define ksm_merge_across_nodes 1U
+#endif
 
 #define KSM_RUN_STOP   0
 #define KSM_RUN_MERGE  1
@@ -456,10 +469,7 @@ out:   page = NULL;
  */
 static inline int get_kpfn_nid(unsigned long kpfn)
 {
-   if (ksm_merge_across_nodes)
-   return 0;
-   else
-   return pfn_to_nid(kpfn);
+   return ksm_merge_across_nodes ? 0 : pfn_to_nid(kpfn);
 }
 
 static void remove_node_from_stable_tree(struct stable_node *stable_node)
@@ -479,7 +489,6 @@ static void remove_node_from_stable_tree
}
 
nid = get_kpfn_nid(stable_node->kpfn);
-
	rb_erase(&stable_node->node, &root_stable_tree[nid]);
free_stable_node(stable_node);
 }
@@ -578,13 +587,8 @@ static void remove_rmap_item_from_tree(s
age = (unsigned char)(ksm_scan.seqnr - rmap_item->address);
BUG_ON(age > 1);
if (!age)
-#ifdef CONFIG_NUMA
	rb_erase(&rmap_item->node,
-		 &root_unstable_tree[rmap_item->nid]);
-#else
-	rb_erase(&rmap_item->node, &root_unstable_tree[0]);
-#endif
-
+		 &root_unstable_tree[NUMA(rmap_item->nid)]);
ksm_pages_unshared--;
rmap_item->address &= PAGE_MASK;
}
@@ -604,7 +608,7 @@ static void remove_trailing_rmap_items(s
 }
 
 /*
- * Though it's very tempting to unmerge in_stable_tree(rmap_item)s rather
+ * Though it's very tempting to unmerge rmap_items from stable tree rather
  * than check every pte of a given vma, the locking doesn't quite work for
  * that - an rmap_item is assigned to the stable tree after inserting ksm
  * page and upping mmap_sem.  Nor does it fit with the way we skip dup'ing
@@ -1058,7 +1062,7 @@ static struct page *stable_tree_search(s
 }
 
 /*
- * stable_tree_insert - insert rmap_item pointing to new ksm page
+ * stable_tree_insert - insert stable tree node pointing to new ksm page
  * into the stable tree.
  *
  * This function returns the stable tree node just allocated on success,
@@ -1108,13 +1112,11 @@ static struct stable_node *stable_tree_i
if (!stable_node)
return NULL;
 
-	rb_link_node(&stable_node->node, parent, new);
-	rb_insert_color(&stable_node->node, &root_stable_tree[nid]);
-
	INIT_HLIST_HEAD(&stable_node->hlist);
-
	stable_node->kpfn = kpfn;
	set_page_stable_node(kpage, stable_node);
+	rb_link_node(&stable_node->node, parent, new);
+	rb_insert_color(&stable_node->node, &root_stable_tree[nid]);
 
return stable_node;
 }
@@ -1170,8 +1172,6 @@ struct rmap_item *unstable_tree_search_i
 * 

[PATCH 2/11] ksm: add sysfs ABI Documentation

2013-01-25 Thread Hugh Dickins
From: Petr Holasek 

This patch adds sysfs documentation for Kernel Samepage Merging (KSM)
including new merge_across_nodes knob.

Signed-off-by: Petr Holasek 
Signed-off-by: Hugh Dickins 
---
 Documentation/ABI/testing/sysfs-kernel-mm-ksm |   52 
 1 file changed, 52 insertions(+)
 create mode 100644 Documentation/ABI/testing/sysfs-kernel-mm-ksm

--- /dev/null   1970-01-01 00:00:00.0 +
+++ mmotm/Documentation/ABI/testing/sysfs-kernel-mm-ksm 2013-01-25 14:36:50.660205905 -0800
@@ -0,0 +1,52 @@
+What:  /sys/kernel/mm/ksm
+Date:  September 2009
+KernelVersion: 2.6.32
+Contact:   Linux memory management mailing list 
+Description:   Interface for Kernel Samepage Merging (KSM)
+
+What:  /sys/kernel/mm/ksm/full_scans
+What:  /sys/kernel/mm/ksm/pages_shared
+What:  /sys/kernel/mm/ksm/pages_sharing
+What:  /sys/kernel/mm/ksm/pages_to_scan
+What:  /sys/kernel/mm/ksm/pages_unshared
+What:  /sys/kernel/mm/ksm/pages_volatile
+What:  /sys/kernel/mm/ksm/run
+What:  /sys/kernel/mm/ksm/sleep_millisecs
+Date:  September 2009
+Contact:   Linux memory management mailing list 
+Description:   Kernel Samepage Merging daemon sysfs interface
+
+   full_scans: how many times all mergeable areas have been
+   scanned.
+
+   pages_shared: how many shared pages are being used.
+
+   pages_sharing: how many more sites are sharing them i.e. how
+   much saved.
+
+   pages_to_scan: how many present pages to scan before ksmd goes
+   to sleep.
+
+   pages_unshared: how many pages unique but repeatedly checked
+   for merging.
+
+   pages_volatile: how many pages changing too fast to be placed
+   in a tree.
+
+   run: write 0 to disable ksm, read 0 while ksm is disabled.
+   write 1 to run ksm, read 1 while ksm is running.
+   write 2 to disable ksm and unmerge all its pages.
+
+   sleep_millisecs: how many milliseconds ksm should sleep between
+   scans.
+
+   See Documentation/vm/ksm.txt for more information.
+
+What:  /sys/kernel/mm/ksm/merge_across_nodes
+Date:  January 2013
+KernelVersion: 3.9
+Contact:   Linux memory management mailing list 
+Description:   Control merging pages across different NUMA nodes.
+
+   When it is set to 0, only pages from the same node are merged,
+   otherwise pages from all nodes can be merged together (default).


[tip:x86/mm] x86, kvm: Fix kvm's use of __pa() on percpu areas

2013-01-25 Thread tip-bot for Dave Hansen
Commit-ID:  5dfd486c4750c9278c63fa96e6e85bdd2fb58e9d
Gitweb: http://git.kernel.org/tip/5dfd486c4750c9278c63fa96e6e85bdd2fb58e9d
Author: Dave Hansen 
AuthorDate: Tue, 22 Jan 2013 13:24:35 -0800
Committer:  H. Peter Anvin 
CommitDate: Fri, 25 Jan 2013 16:34:55 -0800

x86, kvm: Fix kvm's use of __pa() on percpu areas

In short, it is illegal to call __pa() on an address holding
a percpu variable.  This replaces those __pa() calls with
slow_virt_to_phys().  All of the cases in this patch are
in boot time (or CPU hotplug time at worst) code, so the
slow pagetable walking in slow_virt_to_phys() is not expected
to have a performance impact.

The times when this actually matters are pretty obscure
(certain 32-bit NUMA systems), but it _does_ happen.  It is
important to keep KVM guests working on these systems because
the real hardware is getting harder and harder to find.

This bug first manifested for me as a plain hang at boot
after this message:

CPU 0 irqstacks, hard=f3018000 soft=f301a000

or, sometimes, it would actually make it out to the console:

[0.00] BUG: unable to handle kernel paging request at 

I eventually traced it down to the KVM async pagefault code.
This can be worked around by disabling that code either at
compile-time, or on the kernel command-line.

The kvm async pagefault code was injecting page faults into
the guest, which the guest misinterpreted because its
"reason" was not being properly sent from the host.

The guest passes the physical address of a per-cpu async page
fault structure via an MSR to the host.  Since __pa() is
broken on percpu data, the physical address it sent was
basically bogus and the host went scribbling on random data.
The guest never saw the real reason for the page fault (it
was injected by the host), assumed that the kernel had taken
a _real_ page fault, and panic()'d.  The behavior varied,
though, depending on what got corrupted by the bad write.

Signed-off-by: Dave Hansen 
Link: http://lkml.kernel.org/r/20130122212435.49056...@kernel.stglabs.ibm.com
Acked-by: Rik van Riel 
Reviewed-by: Marcelo Tosatti 
Signed-off-by: H. Peter Anvin 
---
 arch/x86/kernel/kvm.c  | 9 +
 arch/x86/kernel/kvmclock.c | 4 ++--
 2 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 9c2bd8b..aa7e58b 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -297,9 +297,9 @@ static void kvm_register_steal_time(void)
 
memset(st, 0, sizeof(*st));
 
-   wrmsrl(MSR_KVM_STEAL_TIME, (__pa(st) | KVM_MSR_ENABLED));
+   wrmsrl(MSR_KVM_STEAL_TIME, (slow_virt_to_phys(st) | KVM_MSR_ENABLED));
printk(KERN_INFO "kvm-stealtime: cpu %d, msr %lx\n",
-   cpu, __pa(st));
+   cpu, slow_virt_to_phys(st));
 }
 
 static DEFINE_PER_CPU(unsigned long, kvm_apic_eoi) = KVM_PV_EOI_DISABLED;
@@ -324,7 +324,7 @@ void __cpuinit kvm_guest_cpu_init(void)
return;
 
if (kvm_para_has_feature(KVM_FEATURE_ASYNC_PF) && kvmapf) {
-   u64 pa = __pa(&__get_cpu_var(apf_reason));
+   u64 pa = slow_virt_to_phys(&__get_cpu_var(apf_reason));
 
 #ifdef CONFIG_PREEMPT
pa |= KVM_ASYNC_PF_SEND_ALWAYS;
@@ -340,7 +340,8 @@ void __cpuinit kvm_guest_cpu_init(void)
/* Size alignment is implied but just to make it explicit. */
BUILD_BUG_ON(__alignof__(kvm_apic_eoi) < 4);
__get_cpu_var(kvm_apic_eoi) = 0;
-   pa = __pa(&__get_cpu_var(kvm_apic_eoi)) | KVM_MSR_ENABLED;
+   pa = slow_virt_to_phys(&__get_cpu_var(kvm_apic_eoi))
+   | KVM_MSR_ENABLED;
wrmsrl(MSR_KVM_PV_EOI_EN, pa);
}
 
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index 220a360..9f966dc 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -162,8 +162,8 @@ int kvm_register_clock(char *txt)
int low, high, ret;
	struct pvclock_vcpu_time_info *src = &hv_clock[cpu].pvti;
 
-   low = (int)__pa(src) | 1;
-   high = ((u64)__pa(src) >> 32);
+   low = (int)slow_virt_to_phys(src) | 1;
+   high = ((u64)slow_virt_to_phys(src) >> 32);
ret = native_write_msr_safe(msr_kvm_system_time, low, high);
printk(KERN_INFO "kvm-clock: cpu %d, msr %x:%x, %s\n",
   cpu, high, low, txt);


[tip:x86/mm] x86, mm: Create slow_virt_to_phys()

2013-01-25 Thread tip-bot for Dave Hansen
Commit-ID:  d765653445129b7c476758040e3079480775f80a
Gitweb: http://git.kernel.org/tip/d765653445129b7c476758040e3079480775f80a
Author: Dave Hansen 
AuthorDate: Tue, 22 Jan 2013 13:24:33 -0800
Committer:  H. Peter Anvin 
CommitDate: Fri, 25 Jan 2013 16:33:23 -0800

x86, mm: Create slow_virt_to_phys()

This is necessary because __pa() does not work on some kinds of
memory, like vmalloc() or the alloc_remap() areas on 32-bit
NUMA systems.  We have some functions to do conversions _like_
this in the vmalloc() code (like vmalloc_to_page()), but they
do not work on sizes other than 4k pages.  We would potentially
need to be able to handle all the page sizes that we use for
the kernel linear mapping (4k, 2M, 1G).

In practice, on 32-bit NUMA systems, the percpu areas get stuck
in the alloc_remap() area.  Any __pa() call on them will break
and basically return garbage.

This patch introduces a new function slow_virt_to_phys(), which
walks the kernel page tables on x86 and should do precisely
the same logical thing as __pa(), but actually work on a wider
range of memory.  It should work on the normal linear mapping,
vmalloc(), kmap(), etc...

Signed-off-by: Dave Hansen 
Link: http://lkml.kernel.org/r/20130122212433.4d1fc...@kernel.stglabs.ibm.com
Acked-by: Rik van Riel 
Signed-off-by: H. Peter Anvin 
---
 arch/x86/include/asm/pgtable_types.h |  1 +
 arch/x86/mm/pageattr.c   | 31 +++
 2 files changed, 32 insertions(+)

diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 6c297e7..9f82690 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -352,6 +352,7 @@ static inline void update_page_count(int level, unsigned long pages) { }
  * as a pte too.
  */
 extern pte_t *lookup_address(unsigned long address, unsigned int *level);
+extern phys_addr_t slow_virt_to_phys(void *__address);
 
 #endif /* !__ASSEMBLY__ */
 
diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index 2a5c9ab..6d13d2a 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -364,6 +364,37 @@ pte_t *lookup_address(unsigned long address, unsigned int *level)
 EXPORT_SYMBOL_GPL(lookup_address);
 
 /*
+ * This is necessary because __pa() does not work on some
+ * kinds of memory, like vmalloc() or the alloc_remap()
+ * areas on 32-bit NUMA systems.  The percpu areas can
+ * end up in this kind of memory, for instance.
+ *
+ * This could be optimized, but it is only intended to be
+ * used at initialization time, and keeping it
+ * unoptimized should increase the testing coverage for
+ * the more obscure platforms.
+ */
+phys_addr_t slow_virt_to_phys(void *__virt_addr)
+{
+   unsigned long virt_addr = (unsigned long)__virt_addr;
+   phys_addr_t phys_addr;
+   unsigned long offset;
+   enum pg_level level;
+   unsigned long psize;
+   unsigned long pmask;
+   pte_t *pte;
+
+   pte = lookup_address(virt_addr, &level);
+   BUG_ON(!pte);
+   psize = page_level_size(level);
+   pmask = page_level_mask(level);
+   offset = virt_addr & ~pmask;
+   phys_addr = pte_pfn(*pte) << PAGE_SHIFT;
+   return (phys_addr | offset);
+}
+EXPORT_SYMBOL_GPL(slow_virt_to_phys);
+
+/*
  * Set the new pmd in all the pgds we know about:
  */
 static void __set_pmd_pte(pte_t *kpte, unsigned long address, pte_t pte)


[PATCH 1/11] ksm: allow trees per NUMA node

2013-01-25 Thread Hugh Dickins
From: Petr Holasek 

Introduces a new sysfs boolean knob /sys/kernel/mm/ksm/merge_across_nodes
which controls merging of pages across different NUMA nodes.
When it is set to zero, only pages from the same node are merged;
otherwise pages from all nodes can be merged together (default behavior).

A typical use-case would be a lot of KVM guests on a NUMA machine,
where cpus on more distant nodes would see a significant increase
in access latency to the merged ksm page. A sysfs knob was chosen
for greater flexibility, since some users still prefer a higher amount
of saved physical memory regardless of access latency.

Every numa node has its own stable & unstable trees, for faster
searching and inserting. Changing the merge_across_nodes value is possible
only when there are no ksm shared pages in the system.

I've tested this patch on numa machines with 2, 4 and 8 nodes and
measured speed of memory access inside of KVM guests with memory pinned
to one of nodes with this benchmark:

http://pholasek.fedorapeople.org/alloc_pg.c

Population standard deviations of access times in percentage of average
were following:

merge_across_nodes=1
2 nodes 1.4%
4 nodes 1.6%
8 nodes 1.7%

merge_across_nodes=0
2 nodes 1%
4 nodes 0.32%
8 nodes 0.018%

RFC: https://lkml.org/lkml/2011/11/30/91
v1: https://lkml.org/lkml/2012/1/23/46
v2: https://lkml.org/lkml/2012/6/29/105
v3: https://lkml.org/lkml/2012/9/14/550
v4: https://lkml.org/lkml/2012/9/23/137
v5: https://lkml.org/lkml/2012/12/10/540
v6: https://lkml.org/lkml/2012/12/23/154
v7: https://lkml.org/lkml/2012/12/27/225

Hugh notes that this patch brings two problems, whose solution needs
further support in mm/ksm.c, which follows in subsequent patches:
1) switching merge_across_nodes after running KSM is liable to oops
   on stale nodes still left over from the previous stable tree;
2) memory hotremove may migrate KSM pages, but there is no provision
   here for !merge_across_nodes to migrate nodes to the proper tree.

Signed-off-by: Petr Holasek 
Signed-off-by: Hugh Dickins 
Acked-by: Rik van Riel 
---
 Documentation/vm/ksm.txt |7 +
 mm/ksm.c |  151 -
 2 files changed, 139 insertions(+), 19 deletions(-)

--- mmotm.orig/Documentation/vm/ksm.txt 2013-01-25 14:36:31.724205455 -0800
+++ mmotm/Documentation/vm/ksm.txt  2013-01-25 14:36:38.608205618 -0800
@@ -58,6 +58,13 @@ sleep_millisecs  - how many milliseconds
e.g. "echo 20 > /sys/kernel/mm/ksm/sleep_millisecs"
Default: 20 (chosen for demonstration purposes)
 
+merge_across_nodes - specifies if pages from different numa nodes can be merged.
+   When set to 0, ksm merges only pages which physically
+   reside in the memory area of same NUMA node. It brings
+   lower latency to access to shared page. Value can be
+   changed only when there is no ksm shared pages in system.
+   Default: 1
+
 run  - set 0 to stop ksmd from running but keep merged pages,
set 1 to run ksmd e.g. "echo 1 > /sys/kernel/mm/ksm/run",
set 2 to stop ksmd and unmerge all pages currently merged,
--- mmotm.orig/mm/ksm.c 2013-01-25 14:36:31.724205455 -0800
+++ mmotm/mm/ksm.c  2013-01-25 14:36:38.608205618 -0800
@@ -36,6 +36,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include "internal.h"
@@ -139,6 +140,9 @@ struct rmap_item {
struct mm_struct *mm;
unsigned long address;  /* + low bits used for flags below */
unsigned int oldchecksum;   /* when unstable */
+#ifdef CONFIG_NUMA
+   unsigned int nid;
+#endif
union {
struct rb_node node;/* when node of unstable tree */
struct {/* when listed from stable tree */
@@ -153,8 +157,8 @@ struct rmap_item {
 #define STABLE_FLAG0x200   /* is listed from the stable tree */
 
 /* The stable and unstable tree heads */
-static struct rb_root root_stable_tree = RB_ROOT;
-static struct rb_root root_unstable_tree = RB_ROOT;
+static struct rb_root root_unstable_tree[MAX_NUMNODES];
+static struct rb_root root_stable_tree[MAX_NUMNODES];
 
 #define MM_SLOTS_HASH_BITS 10
 static DEFINE_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
@@ -188,6 +192,9 @@ static unsigned int ksm_thread_pages_to_
 /* Milliseconds ksmd should sleep between batches */
 static unsigned int ksm_thread_sleep_millisecs = 20;
 
+/* Zeroed when merging across nodes is not allowed */
+static unsigned int ksm_merge_across_nodes = 1;
+
 #define KSM_RUN_STOP   0
 #define KSM_RUN_MERGE  1
 #define KSM_RUN_UNMERGE2
@@ -441,10 +448,25 @@ out:  page = NULL;
return page;
 }
 
+/*
+ * This helper is used for getting right index into array of tree roots.
+ * When merge_across_nodes knob is set to 1, there are only two rb-trees for
+ * stable and unstable pages from all nodes with roots in index 0. 

[tip:x86/mm] x86, mm: Use new pagetable helpers in try_preserve_large_page()

2013-01-25 Thread tip-bot for Dave Hansen
Commit-ID:  f3c4fbb68e93b10c781c0cc462a9d80770244da6
Gitweb: http://git.kernel.org/tip/f3c4fbb68e93b10c781c0cc462a9d80770244da6
Author: Dave Hansen 
AuthorDate: Tue, 22 Jan 2013 13:24:32 -0800
Committer:  H. Peter Anvin 
CommitDate: Fri, 25 Jan 2013 16:33:23 -0800

x86, mm: Use new pagetable helpers in try_preserve_large_page()

try_preserve_large_page() can be slightly simplified by using
the new page_level_*() helpers.  This also moves the 'level'
over to the new pg_level enum type.

Signed-off-by: Dave Hansen 
Link: http://lkml.kernel.org/r/20130122212432.14f3d...@kernel.stglabs.ibm.com
Signed-off-by: H. Peter Anvin 
---
 arch/x86/mm/pageattr.c | 11 ---
 1 file changed, 4 insertions(+), 7 deletions(-)

diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index 40f92f3..2a5c9ab 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -396,7 +396,7 @@ try_preserve_large_page(pte_t *kpte, unsigned long address,
pte_t new_pte, old_pte, *tmp;
pgprot_t old_prot, new_prot, req_prot;
int i, do_split = 1;
-   unsigned int level;
+   enum pg_level level;
 
if (cpa->force_split)
return 1;
@@ -412,15 +412,12 @@ try_preserve_large_page(pte_t *kpte, unsigned long address,
 
switch (level) {
case PG_LEVEL_2M:
-   psize = PMD_PAGE_SIZE;
-   pmask = PMD_PAGE_MASK;
-   break;
 #ifdef CONFIG_X86_64
case PG_LEVEL_1G:
-   psize = PUD_PAGE_SIZE;
-   pmask = PUD_PAGE_MASK;
-   break;
 #endif
+   psize = page_level_size(level);
+   pmask = page_level_mask(level);
+   break;
default:
do_split = -EINVAL;
goto out_unlock;


[PATCH 0/11] ksm: NUMA trees and page migration

2013-01-25 Thread Hugh Dickins
Here's a KSM series, based on mmotm 2013-01-23-17-04: starting with
Petr's v7 "KSM: numa awareness sysfs knob"; then fixing the two issues
we had with that, fully enabling KSM page migration on the way.

(A different kind of KSM/NUMA issue which I've certainly not begun to
address here: when KSM pages are unmerged, there's usually no sense
in preferring to allocate the new pages local to the caller's node.)

Petr, I have intentionally changed the titles of yours: partly because
your "sysfs knob" understated it, but mainly because I think gmail is
liable to assign 1/11 and 2/11 to your earlier December thread, making
them vanish from this series.  I hope a change of title prevents that.

 1 ksm: allow trees per NUMA node
 2 ksm: add sysfs ABI Documentation
 3 ksm: trivial tidyups
 4 ksm: reorganize ksm_check_stable_tree
 5 ksm: get_ksm_page locked
 6 ksm: remove old stable nodes more thoroughly
 7 ksm: make KSM page migration possible
 8 ksm: make !merge_across_nodes migration safe
 9 mm: enable KSM page migration
10 mm: remove offlining arg to migrate_pages
11 ksm: stop hotremove lockdep warning

 Documentation/ABI/testing/sysfs-kernel-mm-ksm |   52 +
 Documentation/vm/ksm.txt  |7 
 include/linux/ksm.h   |   18 
 include/linux/migrate.h   |   14 
 mm/compaction.c   |2 
 mm/ksm.c  |  566 +---
 mm/memory-failure.c   |7 
 mm/memory.c   |   19 
 mm/memory_hotplug.c   |3 
 mm/mempolicy.c|   11 
 mm/migrate.c  |   61 -
 mm/page_alloc.c   |6 
 12 files changed, 580 insertions(+), 186 deletions(-)

Hugh


[tip:x86/mm] x86, mm: Pagetable level size/shift/mask helpers

2013-01-25 Thread tip-bot for Dave Hansen
Commit-ID:  4cbeb51b860c57ba8b2ae50c4016ee7a41f5fbd5
Gitweb: http://git.kernel.org/tip/4cbeb51b860c57ba8b2ae50c4016ee7a41f5fbd5
Author: Dave Hansen 
AuthorDate: Tue, 22 Jan 2013 13:24:31 -0800
Committer:  H. Peter Anvin 
CommitDate: Fri, 25 Jan 2013 16:33:22 -0800

x86, mm: Pagetable level size/shift/mask helpers

I plan to use lookup_address() to walk the kernel pagetables
in a later patch.  It returns a "pte" and the level in the
pagetables where the "pte" was found.  The level is just an
enum and needs to be converted to a useful value in order to
do address calculations with it.  These helpers will be used
in at least two places.

This also gives the anonymous enum a real name so that no one
gets confused about what they should be passing in to these
helpers.

"PTE_SHIFT" was chosen for naming consistency with the other
pagetable levels (PGD/PUD/PMD_SHIFT).

Cc: H. Peter Anvin 
Signed-off-by: Dave Hansen 
Link: http://lkml.kernel.org/r/20130122212431.405d3...@kernel.stglabs.ibm.com
Signed-off-by: H. Peter Anvin 
---
 arch/x86/include/asm/pgtable.h   | 14 ++
 arch/x86/include/asm/pgtable_types.h |  2 +-
 2 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 5199db2..bc28e6f 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -390,6 +390,7 @@ pte_t *populate_extra_pte(unsigned long vaddr);
 
 #ifndef __ASSEMBLY__
 #include 
+#include 
 
 static inline int pte_none(pte_t pte)
 {
@@ -781,6 +782,19 @@ static inline void clone_pgd_range(pgd_t *dst, pgd_t *src, int count)
memcpy(dst, src, count * sizeof(pgd_t));
 }
 
+#define PTE_SHIFT ilog2(PTRS_PER_PTE)
+static inline int page_level_shift(enum pg_level level)
+{
+   return (PAGE_SHIFT - PTE_SHIFT) + level * PTE_SHIFT;
+}
+static inline unsigned long page_level_size(enum pg_level level)
+{
+   return 1UL << page_level_shift(level);
+}
+static inline unsigned long page_level_mask(enum pg_level level)
+{
+   return ~(page_level_size(level) - 1);
+}
 
 #include 
 #endif /* __ASSEMBLY__ */
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 3c32db8..6c297e7 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -331,7 +331,7 @@ extern void native_pagetable_init(void);
 struct seq_file;
 extern void arch_report_meminfo(struct seq_file *m);
 
-enum {
+enum pg_level {
PG_LEVEL_NONE,
PG_LEVEL_4K,
PG_LEVEL_2M,


[PATCH 1/1] Drivers: scsi: storvsc: Initialize the sglist

2013-01-25 Thread K. Y. Srinivasan
Properly initialize scatterlist before using it.

Signed-off-by: K. Y. Srinivasan 
Cc: sta...@vger.kernel.org
---
 drivers/scsi/storvsc_drv.c |1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
index 270b3cf..5ada1d0 100644
--- a/drivers/scsi/storvsc_drv.c
+++ b/drivers/scsi/storvsc_drv.c
@@ -467,6 +467,7 @@ static struct scatterlist *create_bounce_buffer(struct scatterlist *sgl,
if (!bounce_sgl)
return NULL;
 
+   sg_init_table(bounce_sgl, num_pages);
for (i = 0; i < num_pages; i++) {
page_buf = alloc_page(GFP_ATOMIC);
if (!page_buf)
-- 
1.7.4.1



[tip:x86/mm] x86, mm: Make DEBUG_VIRTUAL work earlier in boot

2013-01-25 Thread tip-bot for Dave Hansen
Commit-ID:  a25b9316841c5afa226f8f70a457861b35276a92
Gitweb: http://git.kernel.org/tip/a25b9316841c5afa226f8f70a457861b35276a92
Author: Dave Hansen 
AuthorDate: Tue, 22 Jan 2013 13:24:30 -0800
Committer:  H. Peter Anvin 
CommitDate: Fri, 25 Jan 2013 16:33:22 -0800

x86, mm: Make DEBUG_VIRTUAL work earlier in boot

The KVM code has some repeated bugs in it around use of __pa() on
per-cpu data.  Those data are not in an area on which using
__pa() is valid.  However, they are also called early enough in
boot that __vmalloc_start_set is not set, and thus the
CONFIG_DEBUG_VIRTUAL debugging does not catch them.

This adds a check to also verify __pa() calls against max_low_pfn,
which we can use earlier in boot than is_vmalloc_addr().  However,
if we are super-early in boot, max_low_pfn=0 and this will trip
on every call, so also make sure that max_low_pfn is set before
we try to use it.

With this patch applied, CONFIG_DEBUG_VIRTUAL will actually
catch the bug I was chasing (and fix later in this series).

I'd love to find a generic way so that any __pa() call on percpu
areas could do a BUG_ON(), but there don't appear to be any nice
and easy ways to check if an address is a percpu one.  Anybody
have ideas on a way to do this?

Signed-off-by: Dave Hansen 
Link: http://lkml.kernel.org/r/20130122212430.f46f8...@kernel.stglabs.ibm.com
Signed-off-by: H. Peter Anvin 
---
 arch/x86/mm/numa.c | 2 +-
 arch/x86/mm/pat.c  | 4 ++--
 arch/x86/mm/physaddr.c | 9 -
 3 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 2d125be..76604eb 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -219,7 +219,7 @@ static void __init setup_node_data(int nid, u64 start, u64 end)
 */
nd = alloc_remap(nid, nd_size);
if (nd) {
-   nd_pa = __pa(nd);
+   nd_pa = __phys_addr_nodebug(nd);
remapped = true;
} else {
nd_pa = memblock_alloc_nid(nd_size, SMP_CACHE_BYTES, nid);
diff --git a/arch/x86/mm/pat.c b/arch/x86/mm/pat.c
index 0eb572e..2610bd9 100644
--- a/arch/x86/mm/pat.c
+++ b/arch/x86/mm/pat.c
@@ -560,10 +560,10 @@ int kernel_map_sync_memtype(u64 base, unsigned long size, unsigned long flags)
 {
unsigned long id_sz;
 
-   if (base >= __pa(high_memory))
+   if (base > __pa(high_memory-1))
return 0;
 
-   id_sz = (__pa(high_memory) < base + size) ?
+   id_sz = (__pa(high_memory-1) <= base + size) ?
__pa(high_memory) - base :
size;
 
diff --git a/arch/x86/mm/physaddr.c b/arch/x86/mm/physaddr.c
index c73fedd..e666cbb 100644
--- a/arch/x86/mm/physaddr.c
+++ b/arch/x86/mm/physaddr.c
@@ -1,3 +1,4 @@
+#include 
 #include 
 #include 
 #include 
@@ -68,10 +69,16 @@ EXPORT_SYMBOL(__virt_addr_valid);
 #ifdef CONFIG_DEBUG_VIRTUAL
 unsigned long __phys_addr(unsigned long x)
 {
+   unsigned long phys_addr = x - PAGE_OFFSET;
/* VMALLOC_* aren't constants  */
VIRTUAL_BUG_ON(x < PAGE_OFFSET);
VIRTUAL_BUG_ON(__vmalloc_start_set && is_vmalloc_addr((void *) x));
-   return x - PAGE_OFFSET;
+   /* max_low_pfn is set early, but not _that_ early */
+   if (max_low_pfn) {
+   VIRTUAL_BUG_ON((phys_addr >> PAGE_SHIFT) > max_low_pfn);
+   BUG_ON(slow_virt_to_phys((void *)x) != phys_addr);
+   }
+   return phys_addr;
 }
 EXPORT_SYMBOL(__phys_addr);
 #endif


[RFC patch v2 7/7] sched: consider runnable load average in effective_load

2013-01-25 Thread Alex Shi
effective_load() calculates the load change as seen from the
root_task_group. It needs to take the runnable average of the
changed task into account.

Thanks to Morten Rasmussen for the reminder of this.

Signed-off-by: Alex Shi 
---
 kernel/sched/fair.c | 27 ---
 1 file changed, 20 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 84bb3f7..8066a61 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2981,7 +2981,8 @@ static void task_waking_fair(struct task_struct *p)
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
 /*
- * effective_load() calculates the load change as seen from the root_task_group
+ * effective_load() calculates the runnable load average change as seen from
+ * the root_task_group
  *
  * Adding load to a group doesn't make a group heavier, but can cause movement
  * of group shares between cpus. Assuming the shares were perfectly aligned one
@@ -3029,6 +3030,9 @@ static void task_waking_fair(struct task_struct *p)
  * Therefore the effective change in loads on CPU 0 would be 5/56 (3/8 - 2/7)
  * times the weight of the group. The effect on CPU 1 would be -4/56 (4/8 -
  * 4/7) times the weight of the group.
+ *
+ * After getting the effective_load of the moved load, scale it by the
+ * sched entity's runnable avg.
  */
 static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
 {
@@ -3103,6 +3107,7 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
struct task_group *tg;
unsigned long weight;
int balanced;
+   int runnable_avg;
 
idx   = sd->wake_idx;
this_cpu  = smp_processor_id();
@@ -3118,13 +3123,19 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
if (sync) {
tg = task_group(current);
weight = current->se.load.weight;
+   runnable_avg = current->se.avg.runnable_avg_sum * NICE_0_LOAD
+   / (current->se.avg.runnable_avg_period + 1);
 
-   this_load += effective_load(tg, this_cpu, -weight, -weight);
-   load += effective_load(tg, prev_cpu, 0, -weight);
+   this_load += effective_load(tg, this_cpu, -weight, -weight)
+   * runnable_avg >> NICE_0_SHIFT;
+   load += effective_load(tg, prev_cpu, 0, -weight)
+   * runnable_avg >> NICE_0_SHIFT;
}
 
tg = task_group(p);
weight = p->se.load.weight;
+   runnable_avg = p->se.avg.runnable_avg_sum * NICE_0_LOAD
+   / (p->se.avg.runnable_avg_period + 1);
 
/*
 * In low-load situations, where prev_cpu is idle and this_cpu is idle
@@ -3136,16 +3147,18 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
 * task to be woken on this_cpu.
 */
if (this_load > 0) {
-   s64 this_eff_load, prev_eff_load;
+   s64 this_eff_load, prev_eff_load, tmp_eff_load;
 
this_eff_load = 100;
this_eff_load *= power_of(prev_cpu);
-   this_eff_load *= this_load +
-   effective_load(tg, this_cpu, weight, weight);
+   tmp_eff_load = effective_load(tg, this_cpu, weight, weight)
+   * runnable_avg >> NICE_0_SHIFT;
+   this_eff_load *= this_load + tmp_eff_load;
 
prev_eff_load = 100 + (sd->imbalance_pct - 100) / 2;
prev_eff_load *= power_of(this_cpu);
-   prev_eff_load *= load + effective_load(tg, prev_cpu, 0, weight);
+   prev_eff_load *= load + (effective_load(tg, prev_cpu, 0, weight)
+   * runnable_avg >> NICE_0_SHIFT);
 
balanced = this_eff_load <= prev_eff_load;
} else
-- 
1.7.12



[RFC patch v2 1/7] sched: give initial value for runnable avg of sched entities.

2013-01-25 Thread Alex Shi
We need to initialize se.avg.{decay_count, load_avg_contrib} to zero
after a new task is forked.
Otherwise, random values in those variables cause a mess when the new
task is enqueued:
enqueue_task_fair
enqueue_entity
enqueue_entity_load_avg

Signed-off-by: Alex Shi 
---
 kernel/sched/core.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 257002c..66c1718 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1558,6 +1558,8 @@ static void __sched_fork(struct task_struct *p)
 #if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
p->se.avg.runnable_avg_period = 0;
p->se.avg.runnable_avg_sum = 0;
+   p->se.avg.decay_count = 0;
+   p->se.avg.load_avg_contrib = 0;
 #endif
 #ifdef CONFIG_SCHEDSTATS
memset(&p->se.statistics, 0, sizeof(p->se.statistics));
-- 
1.7.12



[RFC patch v2 5/7] sched: compute runnable load avg in cpu_load and cpu_avg_load_per_task

2013-01-25 Thread Alex Shi
They are the base values used in load balance. Update them with the rq
runnable load average, and load balancing will then consider the
runnable load avg naturally.

Signed-off-by: Alex Shi 
---
 kernel/sched/core.c | 4 ++--
 kernel/sched/fair.c | 4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 4f4714e..5da13ff 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2539,7 +2539,7 @@ static void __update_cpu_load(struct rq *this_rq, unsigned long this_load,
 void update_idle_cpu_load(struct rq *this_rq)
 {
unsigned long curr_jiffies = ACCESS_ONCE(jiffies);
-   unsigned long load = this_rq->load.weight;
+   unsigned long load = (unsigned long)this_rq->cfs.runnable_load_avg;
unsigned long pending_updates;
 
/*
@@ -2589,7 +2589,7 @@ static void update_cpu_load_active(struct rq *this_rq)
 * See the mess around update_idle_cpu_load() / update_cpu_load_nohz().
 */
this_rq->last_load_update_tick = jiffies;
-   __update_cpu_load(this_rq, this_rq->load.weight, 1);
+   __update_cpu_load(this_rq, this_rq->cfs.runnable_load_avg, 1);
 
calc_load_account_active(this_rq);
 }
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 017e040..729221b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2905,7 +2905,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 /* Used instead of source_load when we know the type == 0 */
 static unsigned long weighted_cpuload(const int cpu)
 {
-   return cpu_rq(cpu)->load.weight;
+   return (unsigned long)cpu_rq(cpu)->cfs.runnable_load_avg;
 }
 
 /*
@@ -2952,7 +2952,7 @@ static unsigned long cpu_avg_load_per_task(int cpu)
unsigned long nr_running = ACCESS_ONCE(rq->nr_running);
 
if (nr_running)
-   return rq->load.weight / nr_running;
+   return (unsigned long)rq->cfs.runnable_load_avg / nr_running;
 
return 0;
 }
-- 
1.7.12



[RFC patch v2 6/7] sched: consider runnable load average in move_tasks

2013-01-25 Thread Alex Shi
Besides its use of the runnable load average in the background,
move_tasks() is also a key function in load balance. We need to consider
the runnable load average in it as well, to get an apples-to-apples load
comparison.

Signed-off-by: Alex Shi 
---
 kernel/sched/fair.c | 11 ++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 729221b..84bb3f7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3978,6 +3978,15 @@ static unsigned long task_h_load(struct task_struct *p);
 
 static const unsigned int sched_nr_migrate_break = 32;
 
+static unsigned long task_h_load_avg(struct task_struct *p)
+{
+   u32 period = p->se.avg.runnable_avg_period;
+   if (!period)
+   return 0;
+
+   return task_h_load(p) * p->se.avg.runnable_avg_sum / period;
+}
+
 /*
  * move_tasks tries to move up to imbalance weighted load from busiest to
  * this_rq, as part of a balancing operation within domain "sd".
@@ -4013,7 +4022,7 @@ static int move_tasks(struct lb_env *env)
if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
goto next;
 
-   load = task_h_load(p);
+   load = task_h_load_avg(p);
 
if (sched_feat(LB_MIN) && load < 16 && !env->sd->nr_balance_failed)
goto next;
-- 
1.7.12



[RFC patch v2 2/7] sched: set initial load avg of new forked task

2013-01-25 Thread Alex Shi
A new task has no runnable sum when it first becomes runnable, so its
runnable load is zero. If we use runnable load in balancing, that makes
burst forking select only a few idle cpus to assign the new tasks to.

Set the initial load avg of a new forked task to its load weight to
resolve this issue.

Signed-off-by: Alex Shi 
Reviewed-by: Preeti U Murthy 
---
 include/linux/sched.h |  1 +
 kernel/sched/core.c   |  2 +-
 kernel/sched/fair.c   | 11 +--
 3 files changed, 11 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6fc8f45..b8738c0 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1069,6 +1069,7 @@ struct sched_domain;
 #else
 #define ENQUEUE_WAKING 0
 #endif
+#define ENQUEUE_NEWTASK8
 
 #define DEQUEUE_SLEEP  1
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 66c1718..66ce1f1 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1705,7 +1705,7 @@ void wake_up_new_task(struct task_struct *p)
 #endif
 
rq = __task_rq_lock(p);
-   activate_task(rq, p, 0);
+   activate_task(rq, p, ENQUEUE_NEWTASK);
p->on_rq = 1;
trace_sched_wakeup_new(p, true);
check_preempt_curr(rq, p, WF_FORK);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5eea870..1384297 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1503,8 +1503,9 @@ static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
 /* Add the load generated by se into cfs_rq's child load-average */
 static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
  struct sched_entity *se,
- int wakeup)
+ int flags)
 {
+   int wakeup = flags & ENQUEUE_WAKEUP;
/*
 * We track migrations using entity decay_count <= 0, on a wake-up
 * migration we use a negative decay count to track the remote decays
@@ -1538,6 +1539,12 @@ static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
update_entity_load_avg(se, 0);
}
 
+   /*
+* Set the initial load avg of a new task to its load weight,
+* to avoid a fork burst making a few cpus too heavy.
+*/
+   if (flags & ENQUEUE_NEWTASK)
+   se->avg.load_avg_contrib = se->load.weight;
cfs_rq->runnable_load_avg += se->avg.load_avg_contrib;
/* we force update consideration on load-balancer moves */
update_cfs_rq_blocked_load(cfs_rq, !wakeup);
@@ -1701,7 +1708,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 * Update run-time statistics of the 'current'.
 */
update_curr(cfs_rq);
-   enqueue_entity_load_avg(cfs_rq, se, flags & ENQUEUE_WAKEUP);
+   enqueue_entity_load_avg(cfs_rq, se, flags);
account_entity_enqueue(cfs_rq, se);
update_cfs_shares(cfs_rq);
 
-- 
1.7.12



[RFC patch v2 3/7] Revert "sched: Introduce temporary FAIR_GROUP_SCHED dependency for load-tracking"

2013-01-25 Thread Alex Shi
Remove the CONFIG_FAIR_GROUP_SCHED guard that covers the runnable info,
so that we can use the runnable load variables.

Signed-off-by: Alex Shi 
---
 include/linux/sched.h |  8 +---
 kernel/sched/core.c   |  7 +--
 kernel/sched/fair.c   | 13 ++---
 kernel/sched/sched.h  |  9 +
 4 files changed, 5 insertions(+), 32 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index b8738c0..e55fa95 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1195,13 +1195,7 @@ struct sched_entity {
/* rq "owned" by this entity/group: */
struct cfs_rq   *my_q;
 #endif
-/*
- * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
- * removed when useful for applications beyond shares distribution (e.g.
- * load-balance).
- */
-#if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
-   /* Per-entity load-tracking */
+#ifdef CONFIG_SMP
struct sched_avgavg;
 #endif
 };
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 66ce1f1..dbab4b3 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1550,12 +1550,7 @@ static void __sched_fork(struct task_struct *p)
p->se.vruntime  = 0;
INIT_LIST_HEAD(&p->se.group_node);
 
-/*
- * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
- * removed when useful for applications beyond shares distribution (e.g.
- * load-balance).
- */
-#if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
+#ifdef CONFIG_SMP
p->se.avg.runnable_avg_period = 0;
p->se.avg.runnable_avg_sum = 0;
p->se.avg.decay_count = 0;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1384297..017e040 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1109,8 +1109,7 @@ static inline void update_cfs_shares(struct cfs_rq *cfs_rq)
 }
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
-/* Only depends on SMP, FAIR_GROUP_SCHED may be removed when useful in lb */
-#if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
+#ifdef CONFIG_SMP
 /*
  * We choose a half-life close to 1 scheduling period.
  * Note: The tables below are dependent on this value.
@@ -3410,12 +3409,6 @@ unlock:
 }
 
 /*
- * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
- * removed when useful for applications beyond shares distribution (e.g.
- * load-balance).
- */
-#ifdef CONFIG_FAIR_GROUP_SCHED
-/*
  * Called immediately before a task is migrated to a new cpu; task_cpu(p) and
  * cfs_rq_of(p) references at time of call are still valid and identify the
  * previous cpu.  However, the caller only guarantees p->pi_lock is held; no
@@ -3438,7 +3431,6 @@ migrate_task_rq_fair(struct task_struct *p, int next_cpu)
atomic64_add(se->avg.load_avg_contrib, &cfs_rq->removed_load);
}
 }
-#endif
 #endif /* CONFIG_SMP */
 
 static unsigned long
@@ -6130,9 +6122,8 @@ const struct sched_class fair_sched_class = {
 
 #ifdef CONFIG_SMP
.select_task_rq = select_task_rq_fair,
-#ifdef CONFIG_FAIR_GROUP_SCHED
.migrate_task_rq= migrate_task_rq_fair,
-#endif
+
.rq_online  = rq_online_fair,
.rq_offline = rq_offline_fair,
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index fc88644..ae3511e 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -225,12 +225,6 @@ struct cfs_rq {
 #endif
 
 #ifdef CONFIG_SMP
-/*
- * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
- * removed when useful for applications beyond shares distribution (e.g.
- * load-balance).
- */
-#ifdef CONFIG_FAIR_GROUP_SCHED
/*
 * CFS Load tracking
 * Under CFS, load is tracked on a per-entity basis and aggregated up.
@@ -240,8 +234,7 @@ struct cfs_rq {
u64 runnable_load_avg, blocked_load_avg;
atomic64_t decay_counter, removed_load;
u64 last_decay;
-#endif /* CONFIG_FAIR_GROUP_SCHED */
-/* These always depend on CONFIG_FAIR_GROUP_SCHED */
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
u32 tg_runnable_contrib;
u64 tg_load_contrib;
-- 
1.7.12



[RFC patch v2 4/7] sched: update cpu load after task_tick.

2013-01-25 Thread Alex Shi
To get the latest runnable info, we need to do this cpu load update
after task_tick().

Signed-off-by: Alex Shi 
---
 kernel/sched/core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index dbab4b3..4f4714e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2695,8 +2695,8 @@ void scheduler_tick(void)
 
raw_spin_lock(&rq->lock);
update_rq_clock(rq);
-   update_cpu_load_active(rq);
curr->sched_class->task_tick(rq, curr, 0);
+   update_cpu_load_active(rq);
raw_spin_unlock(&rq->lock);
 
perf_event_task_tick();
-- 
1.7.12



[RFC patch v2] sched: use runnable load avg in cfs balance instead of instant load

2013-01-25 Thread Alex Shi
This patchset can be used, but it causes the burst-waking benchmark aim9
to drop 5~7% on my 2-socket machine. The reason is that the runnable load
of newly woken tasks is too light in their early stage, which causes
imbalance in balancing.

So it is immature, and just a reference for anyone who wants to go further.

V2 changes:
1, included patches 1~3, which were previously sent with the power-aware
scheduling series
2, removed the CONFIG_FAIR_GROUP_SCHED mask in the 5th patch.

Thanks to Ingo for comments, and to Fengguang's kbuild system for testing.
Now it is an independent patchset based on Linus' tree.

Thanks 
Alex

[RFC patch v2 1/7] sched: give initial value for runnable avg of
[RFC patch v2 2/7] sched: set initial load avg of new forked task
[RFC patch v2 3/7] Revert "sched: Introduce temporary
[RFC patch v2 4/7] sched: update cpu load after task_tick.
[RFC patch v2 5/7] sched: compute runnable load avg in cpu_load and
[RFC patch v2 6/7] sched: consider runnable load average in
[RFC patch v2 7/7] sched: consider runnable load average in


Re: [RFC] ACPI scan handlers

2013-01-25 Thread Rafael J. Wysocki
On Friday, January 25, 2013 04:07:38 PM Toshi Kani wrote:
> On Fri, 2013-01-25 at 23:11 +0100, Rafael J. Wysocki wrote:
> > On Friday, January 25, 2013 09:52:21 AM Toshi Kani wrote:
> > > On Thu, 2013-01-24 at 01:26 +0100, Rafael J. Wysocki wrote:
>  :
> > > > 
> > > > I wonder if anyone is seeing any major problems with this at the high 
> > > > level.
> > 
> > First of all, thanks for the response. :-)
> > 
> > > I agree that the current model is a mess.  As shown below, it requires
> > > that .add() at boot-time only performs acpi dev init, and .add() at
> > > hot-add needs both acpi dev init and device on-lining.
> > 
> > I'm not sure what you're talking about, though.
> > 
> > You seem to be confusing ACPI device nodes (i.e. things represented by 
> > struct
> > acpi_device objects) with devices, but they are different things.  They are
> > just used to store static information extracted from device objects in the
> > ACPI namespace and to expose those objects (and possibly some of their
> > properties) via sysfs.  Device objects in the ACPI namespace are not 
> > devices,
> > however, and they don't even need to represent devices (for example, the
> > _SB thing, which is represented by struct acpi_device, is hardly a device).
> > 
> > So the role of struct acpi_device things is analogous to the role of
> > struct device_node things in the Device Trees world.  In fact, no drivers
> > should ever bind to them and in my opinion it was a grievous mistake to
> > let them do that.  But I'm digressing.
> > 
> > So, when you're saying "acpi dev", I'm not sure if you think about a device 
> > node
> > or a device (possibly) represented by that node.  If you mean device node, 
> > then
> > I'm not sure what "acpi dev init" means, because device nodes by definition
> > don't require any initialization beyond what acpi_add_single_object() does
> > (and they don't require any off-lining beyond what acpi_device_unregister()
> > does, for that matter).  In turn, if you mean "device represented by the 
> > given
> > device node", then you can't even say "ACPI device" about it, because it 
> > very
> > well may be a PCI device, or a USB device, or a SATA device etc.
> 
> Let me clarify my point with the ACPI memory driver as an example since
> it is the one that has caused a problem in .remove().
> 
> acpi_memory_device_add() implements .add() and does two things below.
> 
>  1. Call _CRS and initialize a list of struct acpi_memory_info that is
> attached to acpi_device->driver_data.  This step is what I described as
> "acpi dev init".  ACPI drivers perform driver-specific initialization to
> ACPI device objects.
>
>  2. Call add_memory() to add a target memory range to the mm module.
> This step is what I described as "on-lining".  This step is not
> necessary at boot-time since the mm module has already on-lined the
> memory ranges at early boot-time.  At hot-add, however, it needs to call
> add_memory() with the current framework.

I see.

OK, so that does handle the "struct acpi_device has been registered" event,
both on boot and hot-add.  The interactions with mm are tricky, I agree, but
that's not what I want to address at this point.

> Similarly, acpi_memory_device_remove() implements .remove() and does two
> things below.
> 
>  1. Call remove_memory() to offline a target memory range.  This step,
> "off-lining", can fail since the mm module may or may not be able to
> delete non-movable ranges.  This failure cannot be handled properly and
> causes the system to crash at this point.

Well, if the system administrator wants to crash the system this way, it's
basically up to him.  So that should be done by .detach() anyway in that case.

>  2. Free up the list of struct acpi_memory_info.  This step deletes
> driver-specific data from an ACPI device object.

OK

> > That's part of the whole confusion, by the way.
> > 
> > If the device represented by an ACPI device node is on a natively enumerated
> > bus, like PCI, then its native bus' init code initializes the device and
> > creates a "physical" device object for it, like struct pci_dev, which is 
> > then
> > "glued" to the corresponding struct acpi_device by acpi_bind_one().  Then, 
> > it
> > is clear which is which and there's no confusion.  The confusion starts when
> > there's no native enumeration and we only have the struct acpi_device thing,
> > because then everybody seems to think "oh, there's no physical device object
> > now, so this must be something different", but the *only* difference is that
> > there is no native bus' init code now and we should still be creating a
> > "physical device" object for the device and we should "glue" it to the
> > existing struct acpi_device like in the natively enumerated case.
> > 
> > > It then requires .remove() to perform both off-lining and acpi dev
> > > delete.  .remove() must succeed, but off-lining can fail.  
> > >
> > >  acpi dev   online
> > > ||=|
> > > 
> > >add @ boot
> > > 

RE: [PATCH 1/2]linux-usb:Define a new macro for USB storage match rules

2013-01-25 Thread Fangxiaozhi (Franko)


> -Original Message-
> From: Greg KH [mailto:g...@kroah.com]
> Sent: Saturday, January 26, 2013 1:45 AM
> To: Fangxiaozhi (Franko)
> Cc: Sergei Shtylyov; linux-...@vger.kernel.org; linux-kernel@vger.kernel.org;
> Xueguiying (Zihan); Linlei (Lei Lin); Yili (Neil); Wangyuhua (Roger, Credit);
> Huqiao (C); ba...@ti.com; mdharm-...@one-eyed-alien.net;
> sebast...@breakpoint.cc
> Subject: Re: [PATCH 1/2]linux-usb:Define a new macro for USB storage match
> rules
> 
> On Fri, Jan 25, 2013 at 04:18:34PM +0400, Sergei Shtylyov wrote:
> > Hello.
> >
> > On 25-01-2013 6:44, fangxiaozhi 00110321 wrote:
> >
> > >From: fangxiaozhi 
> >
> > >1. Define a new macro for USB storage match rules:
> > > matching with Vendor ID and interface descriptors.
> >
> > >Signed-off-by: fangxiaozhi 
> > >
> > >
> > >  diff -uprN linux-3.8-rc4_orig/drivers/usb/storage/usb.c
> > >linux-3.8-rc4/drivers/usb/storage/usb.c
> > >--- linux-3.8-rc4_orig/drivers/usb/storage/usb.c 2013-01-22
> > >14:12:42.595238727 +0800
> > >+++ linux-3.8-rc4/drivers/usb/storage/usb.c 2013-01-22
> > >+++ 14:16:01.398250305 +0800
> > >@@ -120,6 +120,17 @@ MODULE_PARM_DESC(quirks, "supplemental l
> > >   .useTransport = use_transport, \
> > >  }
> > >
> > >+#define UNUSUAL_VENDOR_INTF(idVendor, cl, sc, pr, \
> > >+ vendor_name, product_name, use_protocol, use_transport, \
> > >+ init_function, Flags) \
> > >+{ \
> > >+ .vendorName = vendor_name, \
> > >+ .productName = product_name, \
> > >+ .useProtocol = use_protocol, \
> > >+ .useTransport = use_transport, \
> > >+ .initFunction = init_function, \
> > >+}
> >
> >   Shouldn't the field initializers be indented with a tab, not spaces?
> 
> Yes it must.  fangxiaozhi, please always run your patches through the
> scripts/checkpatch.pl tool before sending them out (note, you will have to
> ignore the CamelCase warnings your patch produces, but not the other
> ones.)
> 
-What's wrong with it?
-I have checked the patches with scripts/checkpatch.pl before sending.
-There is no other warning or error in my patches except CamelCase warnings.
-So what's wrong now?

> Please do that on both of these patches and resend them.
> 
> thanks,
> 
> greg k-h


[PATCH 09/14] dlm: use idr_for_each_entry() in recover_idr_clear() error path

2013-01-25 Thread Tejun Heo
Convert recover_idr_clear() to use idr_for_each_entry() instead of
idr_for_each().  It's somewhat less efficient this way but it
shouldn't matter in an error path.  This is to help with deprecation
of idr_remove_all().

Only compile tested.

Signed-off-by: Tejun Heo 
Cc: Christine Caulfield 
Cc: David Teigland 
Cc: cluster-de...@redhat.com
---
This patch depends on an earlier idr patch and I think it would be
best to route these together through -mm.  Christine, David, can you
please ack this?

Thanks.

 fs/dlm/recover.c | 23 ++-
 1 file changed, 10 insertions(+), 13 deletions(-)

diff --git a/fs/dlm/recover.c b/fs/dlm/recover.c
index aedea28..b2856e7 100644
--- a/fs/dlm/recover.c
+++ b/fs/dlm/recover.c
@@ -351,23 +351,20 @@ static struct dlm_rsb *recover_idr_find(struct dlm_ls *ls, uint64_t id)
return r;
 }
 
-static int recover_idr_clear_rsb(int id, void *p, void *data)
+static void recover_idr_clear(struct dlm_ls *ls)
 {
-   struct dlm_ls *ls = data;
-   struct dlm_rsb *r = p;
+   struct dlm_rsb *r;
+   int id;
 
-   r->res_id = 0;
-   r->res_recover_locks_count = 0;
-   ls->ls_recover_list_count--;
+   spin_lock(&ls->ls_recover_idr_lock);
 
-   dlm_put_rsb(r);
-   return 0;
-}
+   idr_for_each_entry(&ls->ls_recover_idr, r, id) {
+   r->res_id = 0;
+   r->res_recover_locks_count = 0;
+   ls->ls_recover_list_count--;
 
-static void recover_idr_clear(struct dlm_ls *ls)
-{
-   spin_lock(&ls->ls_recover_idr_lock);
-   idr_for_each(&ls->ls_recover_idr, recover_idr_clear_rsb, ls);
+   dlm_put_rsb(r);
+   }
	idr_remove_all(&ls->ls_recover_idr);
 
if (ls->ls_recover_list_count != 0) {
-- 
1.8.1



[PATCH 10/14] dlm: don't use idr_remove_all()

2013-01-25 Thread Tejun Heo
idr_destroy() can destroy idr by itself and idr_remove_all() is being
deprecated.

The conversion isn't completely trivial for recover_idr_clear() as
it's the only place in the kernel which makes legitimate use of
idr_remove_all() w/o idr_destroy().  Replace it with an idr_remove()
call inside the idr_for_each_entry() loop.  It goes on top so that it
matches the operation order in recover_idr_del().

Only compile tested.

Signed-off-by: Tejun Heo 
Cc: Christine Caulfield 
Cc: David Teigland 
Cc: cluster-de...@redhat.com
---
This patch depends on an earlier idr patch and given the trivial
nature of the patch, I think it would be best to route these together
through -mm.  Please holler if there's any objection.

Thanks.

 fs/dlm/lockspace.c | 1 -
 fs/dlm/recover.c   | 2 +-
 2 files changed, 1 insertion(+), 2 deletions(-)

diff --git a/fs/dlm/lockspace.c b/fs/dlm/lockspace.c
index 2e99fb0..3ca79d3 100644
--- a/fs/dlm/lockspace.c
+++ b/fs/dlm/lockspace.c
@@ -796,7 +796,6 @@ static int release_lockspace(struct dlm_ls *ls, int force)
 */
 
	idr_for_each(&ls->ls_lkbidr, lkb_idr_free, ls);
-   idr_remove_all(&ls->ls_lkbidr);
	idr_destroy(&ls->ls_lkbidr);
 
/*
diff --git a/fs/dlm/recover.c b/fs/dlm/recover.c
index b2856e7..236d108 100644
--- a/fs/dlm/recover.c
+++ b/fs/dlm/recover.c
@@ -359,13 +359,13 @@ static void recover_idr_clear(struct dlm_ls *ls)
	spin_lock(&ls->ls_recover_idr_lock);
 
	idr_for_each_entry(&ls->ls_recover_idr, r, id) {
+   idr_remove(&ls->ls_recover_idr, id);
	r->res_id = 0;
	r->res_recover_locks_count = 0;
	ls->ls_recover_list_count--;
 
	dlm_put_rsb(r);
	}
-   idr_remove_all(&ls->ls_recover_idr);
 
if (ls->ls_recover_list_count != 0) {
log_error(ls, "warning: recover_list_count %d",
-- 
1.8.1



[PATCH 02/14] atm/nicstar: don't use idr_remove_all()

2013-01-25 Thread Tejun Heo
idr_destroy() can destroy idr by itself and idr_remove_all() is being
deprecated.  Drop its usage.

Signed-off-by: Tejun Heo 
Cc: Chas Williams 
Cc: net...@vger.kernel.org
---
This patch depends on an earlier idr patch and given the trivial
nature of the patch, I think it would be best to route these together
through -mm.  Please holler if there's any objection.

Thanks.

 drivers/atm/nicstar.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/atm/nicstar.c b/drivers/atm/nicstar.c
index ed1d2b7..628787e 100644
--- a/drivers/atm/nicstar.c
+++ b/drivers/atm/nicstar.c
@@ -251,7 +251,6 @@ static void nicstar_remove_one(struct pci_dev *pcidev)
if (card->scd2vc[j] != NULL)
free_scq(card, card->scd2vc[j]->scq, 
card->scd2vc[j]->tx_vcc);
}
-   idr_remove_all(&card->idr);
	idr_destroy(&card->idr);
pci_free_consistent(card->pcidev, NS_RSQSIZE + NS_RSQ_ALIGNMENT,
card->rsq.org, card->rsq.dma);
-- 
1.8.1



[PATCHSET] idr: deprecate idr_remove_all()

2013-01-25 Thread Tejun Heo
Hello,

(Andrew, I think this one is best routed through -mm.  Please read on)

idr is one of the areas with a much higher concentration of bad
interface and implementation decisions.  This patchset removes one of
those oddities - idr_remove_all().  idr needs two steps for
destruction - idr_remove_all() followed by idr_destroy().
idr_remove_all() releases all IDs in use but doesn't release buffered
idr_layers.  idr_destroy() frees buffered idr_layers but doesn't
bother with in-use idr_layers.

For added fun, calling idr_remove() on all allocated IDs doesn't
necessarily free all in-use idr_layers, so idr_for_each_entry()
idr_remove(); followed by idr_destroy() may still leak memory.

This confuses people.  Some correctly use both.  Many forget to call
idr_remove_all() and others forget idr_destroy() and they all leak
memory.  Even ida - something tightly coupled w/ idr - forgets to do
idr_remove_all() (although it's my fault).

This is just a bad interface.  While remove_all in itself might not be
that bad, there is only one legitimate user of idr_remove_all() which
can be converted to idr_remove() relatively easily, so I think it'd be
better to deprecate and later unexport it than keeping it around.

This patchset contains the following 14 patches.

 0001-idr-make-idr_destroy-imply-idr_remove_all.patch
 0002-atm-nicstar-don-t-use-idr_remove_all.patch
 0003-block-loop-don-t-use-idr_remove_all.patch
 0004-firewire-don-t-use-idr_remove_all.patch
 0005-drm-don-t-use-idr_remove_all.patch
 0006-dm-don-t-use-idr_remove_all.patch
 0007-remoteproc-don-t-use-idr_remove_all.patch
 0008-rpmsg-don-t-use-idr_remove_all.patch
 0009-dlm-use-idr_for_each_entry-in-recover_idr_clear-erro.patch
 0010-dlm-don-t-use-idr_remove_all.patch
 0011-nfs-idr_destroy-no-longer-needs-idr_remove_all.patch
 0012-inotify-don-t-use-idr_remove_all.patch
 0013-cgroup-don-t-use-idr_remove_all.patch
 0014-idr-deprecate-idr_remove_all.patch

0001 makes idr_destroy() imply idr_remove_all().  0002-0013 remove
uses of idr_remove_all().  0014 marks idr_remove_all() deprecated.
The patches are on top of the current linus#master 66e2d3e8c2 and also
applies on top of the current -mm.  It's available in the following
git branch.

 git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc.git 
deprecate-idr_remove_all

As changes to most are trivial and have dependency on the first patch,
I think it would be best to route these together.  The only
non-trivial change is 0009 and 0010 which converts idr_for_each() to
idr_for_each_entry() and then replaces idr_remove_all() with
idr_remove() inside the for_each_entry loop.  Definitely want to get acks
from dlm people.

Andrew, once people agree with the series, can you please route these
through -mm?

diffstat follows.  Thanks.

 drivers/atm/nicstar.c   |1 -
 drivers/block/loop.c|1 -
 drivers/firewire/core-cdev.c|1 -
 drivers/gpu/drm/drm_context.c   |2 +-
 drivers/gpu/drm/drm_crtc.c  |1 -
 drivers/gpu/drm/drm_drv.c   |1 -
 drivers/gpu/drm/drm_gem.c   |2 --
 drivers/gpu/drm/exynos/exynos_drm_ipp.c |4 
 drivers/gpu/drm/sis/sis_drv.c   |1 -
 drivers/gpu/drm/via/via_map.c   |1 -
 drivers/md/dm.c |1 -
 drivers/remoteproc/remoteproc_core.c|1 -
 drivers/rpmsg/virtio_rpmsg_bus.c|1 -
 fs/dlm/lockspace.c  |1 -
 fs/dlm/recover.c|   25 +++--
 fs/nfs/client.c |1 -
 fs/notify/inotify/inotify_fsnotify.c|1 -
 include/linux/idr.h |   14 +-
 kernel/cgroup.c |4 +---
 lib/idr.c   |   28 +---
 20 files changed, 39 insertions(+), 53 deletions(-)

--
tejun


[PATCH 11/14] nfs: idr_destroy() no longer needs idr_remove_all()

2013-01-25 Thread Tejun Heo
idr_destroy() can destroy idr by itself and idr_remove_all() is being
deprecated.  Drop reference to idr_remove_all().  Note that the code
wasn't completely correct before because idr_remove() on all entries
doesn't necessarily release all idr_layers which could lead to memory
leak.

Signed-off-by: Tejun Heo 
Cc: "J. Bruce Fields" 
Cc: linux-...@vger.kernel.org
---
This patch depends on an earlier idr patch and given the trivial
nature of the patch, I think it would be best to route these together
through -mm.  Please holler if there's any objection.

Thanks.

 fs/nfs/client.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/fs/nfs/client.c b/fs/nfs/client.c
index 9f3c664..84d8eae 100644
--- a/fs/nfs/client.c
+++ b/fs/nfs/client.c
@@ -197,7 +197,6 @@ error_0:
 EXPORT_SYMBOL_GPL(nfs_alloc_client);
 
 #if IS_ENABLED(CONFIG_NFS_V4)
-/* idr_remove_all is not needed as all id's are removed by nfs_put_client */
 void nfs_cleanup_cb_ident_idr(struct net *net)
 {
struct nfs_net *nn = net_generic(net, nfs_net_id);
-- 
1.8.1



[PATCH 13/14] cgroup: don't use idr_remove_all()

2013-01-25 Thread Tejun Heo
idr_destroy() can destroy idr by itself and idr_remove_all() is being
deprecated.  Drop its usage.

Signed-off-by: Tejun Heo 
Cc: Li Zefan 
Cc: contain...@lists.linux-foundation.org
Cc: cgro...@vger.kernel.org
---
This patch depends on an earlier idr patch and given the trivial
nature of the patch, I think it would be best to route these together
through -mm.  Please holler if there's any objection.

Thanks.

 kernel/cgroup.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 4855892..6b18c5c 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -4567,10 +4567,8 @@ void cgroup_unload_subsys(struct cgroup_subsys *ss)
offline_css(ss, dummytop);
ss->active = 0;
 
-   if (ss->use_id) {
-   idr_remove_all(&ss->idr);
+   if (ss->use_id)
	idr_destroy(&ss->idr);
-   }
 
/* deassign the subsys_id */
subsys[ss->subsys_id] = NULL;
-- 
1.8.1



[PATCH 14/14] idr: deprecate idr_remove_all()

2013-01-25 Thread Tejun Heo
There was only one legitimate use of idr_remove_all() and a lot more
of incorrect uses (or lack of it).  Now that idr_destroy() implies
idr_remove_all() and all the in-kernel users updated not to use it,
there's no reason to keep it around.  Mark it deprecated so that we
can later unexport it.

idr_remove_all() is made an inline function calling __idr_remove_all()
to avoid triggering a deprecated warning on EXPORT_SYMBOL().

Signed-off-by: Tejun Heo 
---
 include/linux/idr.h | 14 +-
 lib/idr.c   | 10 +++---
 2 files changed, 16 insertions(+), 8 deletions(-)

diff --git a/include/linux/idr.h b/include/linux/idr.h
index de7e190..1b932e7 100644
--- a/include/linux/idr.h
+++ b/include/linux/idr.h
@@ -110,10 +110,22 @@ int idr_for_each(struct idr *idp,
 void *idr_get_next(struct idr *idp, int *nextid);
 void *idr_replace(struct idr *idp, void *ptr, int id);
 void idr_remove(struct idr *idp, int id);
-void idr_remove_all(struct idr *idp);
 void idr_destroy(struct idr *idp);
 void idr_init(struct idr *idp);
 
void __idr_remove_all(struct idr *idp);	/* don't use */
+
+/**
+ * idr_remove_all - remove all ids from the given idr tree
+ * @idp: idr handle
+ *
+ * If you're trying to destroy @idp, calling idr_destroy() is enough.
+ * This is going away.  Don't use.
+ */
+static inline void __deprecated idr_remove_all(struct idr *idp)
+{
+   __idr_remove_all(idp);
+}
 
 /*
  * IDA - IDR based id allocator, use when translation from id to
diff --git a/lib/idr.c b/lib/idr.c
index 1e47832..1408e93 100644
--- a/lib/idr.c
+++ b/lib/idr.c
@@ -433,11 +433,7 @@ void idr_remove(struct idr *idp, int id)
 }
 EXPORT_SYMBOL(idr_remove);
 
-/**
- * idr_remove_all - remove all ids from the given idr tree
- * @idp: idr handle
- */
-void idr_remove_all(struct idr *idp)
+void __idr_remove_all(struct idr *idp)
 {
int n, id, max;
int bt_mask;
@@ -470,7 +466,7 @@ void idr_remove_all(struct idr *idp)
}
idp->layers = 0;
 }
-EXPORT_SYMBOL(idr_remove_all);
+EXPORT_SYMBOL(__idr_remove_all);
 
 /**
  * idr_destroy - release all cached layers within an idr tree
@@ -487,7 +483,7 @@ EXPORT_SYMBOL(idr_remove_all);
  */
 void idr_destroy(struct idr *idp)
 {
-   idr_remove_all(idp);
+   __idr_remove_all(idp);
 
while (idp->id_free_cnt) {
struct idr_layer *p = get_from_free_list(idp);
-- 
1.8.1



[PATCH 12/14] inotify: don't use idr_remove_all()

2013-01-25 Thread Tejun Heo
idr_destroy() can destroy idr by itself and idr_remove_all() is being
deprecated.  Drop its usage.

Signed-off-by: Tejun Heo 
Cc: John McCutchan 
Cc: Robert Love 
Cc: Eric Paris 
---
This patch depends on an earlier idr patch and given the trivial
nature of the patch, I think it would be best to route these together
through -mm.  Please holler if there's any objection.

Thanks.

 fs/notify/inotify/inotify_fsnotify.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/fs/notify/inotify/inotify_fsnotify.c b/fs/notify/inotify/inotify_fsnotify.c
index 871569c..4216308 100644
--- a/fs/notify/inotify/inotify_fsnotify.c
+++ b/fs/notify/inotify/inotify_fsnotify.c
@@ -197,7 +197,6 @@ static void inotify_free_group_priv(struct fsnotify_group *group)
 {
/* ideally the idr is empty and we won't hit the BUG in the callback */
	idr_for_each(&group->inotify_data.idr, idr_callback, group);
-   idr_remove_all(&group->inotify_data.idr);
	idr_destroy(&group->inotify_data.idr);
	atomic_dec(&group->inotify_data.user->inotify_devs);
free_uid(group->inotify_data.user);
-- 
1.8.1



[PATCH 06/14] dm: don't use idr_remove_all()

2013-01-25 Thread Tejun Heo
idr_destroy() can destroy idr by itself and idr_remove_all() is being
deprecated.  Drop its usage.

Signed-off-by: Tejun Heo 
Cc: Alasdair Kergon 
Cc: dm-de...@redhat.com
---
This patch depends on an earlier idr patch and given the trivial
nature of the patch, I think it would be best to route these together
through -mm.  Please holler if there's any objection.

Thanks.

 drivers/md/dm.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index c72e4d5..ea1a6ca 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -318,7 +318,6 @@ static void __exit dm_exit(void)
/*
 * Should be empty by this point.
 */
-   idr_remove_all(&_minor_idr);
idr_destroy(&_minor_idr);
 }
 
-- 
1.8.1



[PATCH 08/14] rpmsg: don't use idr_remove_all()

2013-01-25 Thread Tejun Heo
idr_destroy() can destroy idr by itself and idr_remove_all() is being
deprecated.  Drop its usage.

Signed-off-by: Tejun Heo 
Cc: Ohad Ben-Cohen 
---
This patch depends on an earlier idr patch and given the trivial
nature of the patch, I think it would be best to route these together
through -mm.  Please holler if there's any objection.

Thanks.

 drivers/rpmsg/virtio_rpmsg_bus.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/rpmsg/virtio_rpmsg_bus.c b/drivers/rpmsg/virtio_rpmsg_bus.c
index f1e3239..aa334b6 100644
--- a/drivers/rpmsg/virtio_rpmsg_bus.c
+++ b/drivers/rpmsg/virtio_rpmsg_bus.c
@@ -1036,7 +1036,6 @@ static void rpmsg_remove(struct virtio_device *vdev)
if (vrp->ns_ept)
__rpmsg_destroy_ept(vrp, vrp->ns_ept);
 
-   idr_remove_all(&vrp->endpoints);
	idr_destroy(&vrp->endpoints);
 
vdev->config->del_vqs(vrp->vdev);
-- 
1.8.1



[PATCH 05/14] drm: don't use idr_remove_all()

2013-01-25 Thread Tejun Heo
idr_destroy() can destroy idr by itself and idr_remove_all() is being
deprecated.  Drop its usage.

* drm_ctxbitmap_cleanup() was calling idr_remove_all() but forgetting
  idr_destroy() thus leaking all buffered free idr_layers.  Replace it
  with idr_destroy().

Signed-off-by: Tejun Heo 
Cc: David Airlie 
Cc: dri-de...@lists.freedesktop.org
Cc: Inki Dae 
Cc: Joonyoung Shim 
Cc: Seung-Woo Kim 
Cc: Kyungmin Park 
---
This patch depends on an earlier idr patch and given the trivial
nature of the patch, I think it would be best to route these together
through -mm.  Please holler if there's any objection.

Thanks.

 drivers/gpu/drm/drm_context.c   | 2 +-
 drivers/gpu/drm/drm_crtc.c  | 1 -
 drivers/gpu/drm/drm_drv.c   | 1 -
 drivers/gpu/drm/drm_gem.c   | 2 --
 drivers/gpu/drm/exynos/exynos_drm_ipp.c | 4 
 drivers/gpu/drm/sis/sis_drv.c   | 1 -
 drivers/gpu/drm/via/via_map.c   | 1 -
 7 files changed, 1 insertion(+), 11 deletions(-)

diff --git a/drivers/gpu/drm/drm_context.c b/drivers/gpu/drm/drm_context.c
index 45adf97..75f62c5 100644
--- a/drivers/gpu/drm/drm_context.c
+++ b/drivers/gpu/drm/drm_context.c
@@ -118,7 +118,7 @@ int drm_ctxbitmap_init(struct drm_device * dev)
 void drm_ctxbitmap_cleanup(struct drm_device * dev)
 {
	mutex_lock(&dev->struct_mutex);
-   idr_remove_all(&dev->ctx_idr);
+   idr_destroy(&dev->ctx_idr);
	mutex_unlock(&dev->struct_mutex);
 }
 
diff --git a/drivers/gpu/drm/drm_crtc.c b/drivers/gpu/drm/drm_crtc.c
index f2d667b..9b39d1f 100644
--- a/drivers/gpu/drm/drm_crtc.c
+++ b/drivers/gpu/drm/drm_crtc.c
@@ -1102,7 +1102,6 @@ void drm_mode_config_cleanup(struct drm_device *dev)
crtc->funcs->destroy(crtc);
}
 
-   idr_remove_all(&dev->mode_config.crtc_idr);
	idr_destroy(&dev->mode_config.crtc_idr);
 }
 EXPORT_SYMBOL(drm_mode_config_cleanup);
diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
index be174ca..25f91cd 100644
--- a/drivers/gpu/drm/drm_drv.c
+++ b/drivers/gpu/drm/drm_drv.c
@@ -297,7 +297,6 @@ static void __exit drm_core_exit(void)
 
unregister_chrdev(DRM_MAJOR, "drm");
 
-   idr_remove_all(&drm_minors_idr);
	idr_destroy(&drm_minors_idr);
 }
 
diff --git a/drivers/gpu/drm/drm_gem.c b/drivers/gpu/drm/drm_gem.c
index 24efae4..e775859 100644
--- a/drivers/gpu/drm/drm_gem.c
+++ b/drivers/gpu/drm/drm_gem.c
@@ -561,8 +561,6 @@ drm_gem_release(struct drm_device *dev, struct drm_file *file_private)
 {
	idr_for_each(&file_private->object_idr,
		 &drm_gem_object_release_handle, file_private);
-
-   idr_remove_all(&file_private->object_idr);
	idr_destroy(&file_private->object_idr);
 }
 
diff --git a/drivers/gpu/drm/exynos/exynos_drm_ipp.c b/drivers/gpu/drm/exynos/exynos_drm_ipp.c
index 0bda964..49278f0 100644
--- a/drivers/gpu/drm/exynos/exynos_drm_ipp.c
+++ b/drivers/gpu/drm/exynos/exynos_drm_ipp.c
@@ -1786,8 +1786,6 @@ err_iommu:
drm_iommu_detach_device(drm_dev, ippdrv->dev);
 
 err_idr:
-   idr_remove_all(&ctx->ipp_idr);
-   idr_remove_all(&ctx->prop_idr);
	idr_destroy(&ctx->ipp_idr);
	idr_destroy(&ctx->prop_idr);
return ret;
@@ -1965,8 +1963,6 @@ static int ipp_remove(struct platform_device *pdev)
exynos_drm_subdrv_unregister(>subdrv);
 
/* remove,destroy ipp idr */
-   idr_remove_all(&ctx->ipp_idr);
-   idr_remove_all(&ctx->prop_idr);
	idr_destroy(&ctx->ipp_idr);
	idr_destroy(&ctx->prop_idr);
 
diff --git a/drivers/gpu/drm/sis/sis_drv.c b/drivers/gpu/drm/sis/sis_drv.c
index 841065b..5a5325e 100644
--- a/drivers/gpu/drm/sis/sis_drv.c
+++ b/drivers/gpu/drm/sis/sis_drv.c
@@ -58,7 +58,6 @@ static int sis_driver_unload(struct drm_device *dev)
 {
drm_sis_private_t *dev_priv = dev->dev_private;
 
-   idr_remove_all(&dev_priv->object_idr);
	idr_destroy(&dev_priv->object_idr);
 
kfree(dev_priv);
diff --git a/drivers/gpu/drm/via/via_map.c b/drivers/gpu/drm/via/via_map.c
index c0f1cc7..d0ab3fb 100644
--- a/drivers/gpu/drm/via/via_map.c
+++ b/drivers/gpu/drm/via/via_map.c
@@ -120,7 +120,6 @@ int via_driver_unload(struct drm_device *dev)
 {
drm_via_private_t *dev_priv = dev->dev_private;
 
-   idr_remove_all(&dev_priv->object_idr);
	idr_destroy(&dev_priv->object_idr);
 
kfree(dev_priv);
-- 
1.8.1



[PATCH 03/14] block/loop: don't use idr_remove_all()

2013-01-25 Thread Tejun Heo
idr_destroy() can destroy idr by itself and idr_remove_all() is being
deprecated.  Drop its usage.

Signed-off-by: Tejun Heo 
Cc: Jens Axboe 
---
This patch depends on an earlier idr patch and given the trivial
nature of the patch, I think it would be best to route these together
through -mm.  Please holler if there's any objection.

Thanks.

 drivers/block/loop.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index ae12512..3b9c32b 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1911,7 +1911,6 @@ static void __exit loop_exit(void)
range = max_loop ? max_loop << part_shift : 1UL << MINORBITS;
 
	idr_for_each(&loop_index_idr, &loop_exit_cb, NULL);
-   idr_remove_all(&loop_index_idr);
	idr_destroy(&loop_index_idr);
 
blk_unregister_region(MKDEV(LOOP_MAJOR, 0), range);
-- 
1.8.1



[PATCH 04/14] firewire: don't use idr_remove_all()

2013-01-25 Thread Tejun Heo
idr_destroy() can destroy idr by itself and idr_remove_all() is being
deprecated.  Drop its usage.

Signed-off-by: Tejun Heo 
Cc: Stefan Richter 
Cc: linux1394-de...@lists.sourceforge.net
---
This patch depends on an earlier idr patch and given the trivial
nature of the patch, I think it would be best to route these together
through -mm.  Please holler if there's any objection.

Thanks.

 drivers/firewire/core-cdev.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/firewire/core-cdev.c b/drivers/firewire/core-cdev.c
index f8d2287..68c3138 100644
--- a/drivers/firewire/core-cdev.c
+++ b/drivers/firewire/core-cdev.c
@@ -1779,7 +1779,6 @@ static int fw_device_op_release(struct inode *inode, struct file *file)
wait_event(client->tx_flush_wait, !has_outbound_transactions(client));
 
	idr_for_each(&client->resource_idr, shutdown_resource, client);
-   idr_remove_all(&client->resource_idr);
	idr_destroy(&client->resource_idr);
 
	list_for_each_entry_safe(event, next_event, &client->event_list, link)
-- 
1.8.1



[PATCH 07/14] remoteproc: don't use idr_remove_all()

2013-01-25 Thread Tejun Heo
idr_destroy() can destroy idr by itself and idr_remove_all() is being
deprecated.  Drop its usage.

Signed-off-by: Tejun Heo 
Cc: Ohad Ben-Cohen 
---
This patch depends on an earlier idr patch and given the trivial
nature of the patch, I think it would be best to route these together
through -mm.  Please holler if there's any objection.

Thanks.

 drivers/remoteproc/remoteproc_core.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/remoteproc/remoteproc_core.c b/drivers/remoteproc/remoteproc_core.c
index dd3bfaf..634d367 100644
--- a/drivers/remoteproc/remoteproc_core.c
+++ b/drivers/remoteproc/remoteproc_core.c
@@ -1180,7 +1180,6 @@ static void rproc_type_release(struct device *dev)
 
rproc_delete_debug_dir(rproc);
 
-   idr_remove_all(&rproc->notifyids);
idr_destroy(&rproc->notifyids);
 
if (rproc->index >= 0)
-- 
1.8.1



[PATCH 01/14] idr: make idr_destroy() imply idr_remove_all()

2013-01-25 Thread Tejun Heo
idr is silly in quite a few ways, one of which is how it's supposed to
be destroyed - idr_destroy() doesn't release IDs and doesn't even
whine if the idr isn't empty.  If the caller forgets idr_remove_all(),
it simply leaks memory.

Even ida gets this wrong and leaks memory on destruction.  There is
absolutely no reason not to call idr_remove_all() from idr_destroy().
Nobody is abusing idr_destroy() to shrink the free layer buffer while
continuing to use the idr afterwards, so it's safe to do
remove_all from destroy.

In the whole kernel, there is only one place where idr_remove_all() is
legitimately used without idr_destroy() following it, while there are
quite a few places where the caller forgets either idr_remove_all() or
idr_destroy(), leaking memory.

This patch makes idr_destroy() call idr_remove_all() and updates the
function description accordingly.

Signed-off-by: Tejun Heo 
---
 lib/idr.c | 20 +++-
 1 file changed, 11 insertions(+), 9 deletions(-)

diff --git a/lib/idr.c b/lib/idr.c
index 6482390..1e47832 100644
--- a/lib/idr.c
+++ b/lib/idr.c
@@ -436,15 +436,6 @@ EXPORT_SYMBOL(idr_remove);
 /**
  * idr_remove_all - remove all ids from the given idr tree
  * @idp: idr handle
- *
- * idr_destroy() only frees up unused, cached idp_layers, but this
- * function will remove all id mappings and leave all idp_layers
- * unused.
- *
- * A typical clean-up sequence for objects stored in an idr tree will
- * use idr_for_each() to free all objects, if necessay, then
- * idr_remove_all() to remove all ids, and idr_destroy() to free
- * up the cached idr_layers.
  */
 void idr_remove_all(struct idr *idp)
 {
@@ -484,9 +475,20 @@ EXPORT_SYMBOL(idr_remove_all);
 /**
  * idr_destroy - release all cached layers within an idr tree
  * @idp: idr handle
+ *
+ * Free all id mappings and all idp_layers.  After this function, @idp is
+ * completely unused and can be freed / recycled.  The caller is
+ * responsible for ensuring that no one else accesses @idp during or after
+ * idr_destroy().
+ *
+ * A typical clean-up sequence for objects stored in an idr tree will use
+ * idr_for_each() to free all objects, if necessary, then idr_destroy() to
+ * free up the id mappings and cached idr_layers.
  */
 void idr_destroy(struct idr *idp)
 {
+   idr_remove_all(idp);
+
while (idp->id_free_cnt) {
struct idr_layer *p = get_from_free_list(idp);
kmem_cache_free(idr_layer_cache, p);
-- 
1.8.1



Re: [PATCH 3/3] acpi, memory-hotplug: Support getting hotplug info from SRAT.

2013-01-25 Thread H. Peter Anvin
On 01/25/2013 05:12 PM, Andrew Morton wrote:
> On Fri, 25 Jan 2013 17:42:09 +0800
> Tang Chen  wrote:
> 
>> NOTE: Using this way will cause NUMA performance down because the whole node
>>   will be set as ZONE_MOVABLE, and kernel cannot use memory on it.
>>   If users don't want to lose NUMA performance, just don't use it.
> 
> I agree with this, but it means that nobody will test any of your new code.
> 
> To get improved testing coverage, can you think of any temporary
> testing-only patch which will cause testers to exercise the
> memory-hotplug changes?
> 

There is another problem: if ALL the nodes in the system support
hotpluggable memory, what happens?

-hpa



Re: [PATCH v10 00/11] PCI, ACPI: pci root bus hotplug support / pci match_driver

2013-01-25 Thread Jiang Liu
On 2013-1-26 8:04, Bjorn Helgaas wrote:
> On Tue, Jan 22, 2013 at 3:19 PM, Yinghai Lu  wrote:
>> On Tue, Jan 22, 2013 at 2:09 PM, Rafael J. Wysocki  wrote:
>>> On Monday, January 21, 2013 01:20:41 PM Yinghai Lu wrote:
 It includes
 1. preparing patches for pci root bus hotadd/hotremove support
 2. move root bus hotadd from acpiphp to pci_root.c
 3. add hot-remove support
 4. add acpi_hp_work to be shared with acpiphp and root-bus hotplug
 5. add match_driver to add pci device to device tree early but
not attach driver for hotplug path.

 based on pci/next + pm/acpi-scan

 could get from
 
 git://git.kernel.org/pub/scm/linux/kernel/git/yinghai/linux-yinghai.git 
 for-pci-root-bus-hotplug

 -v9: merges several patches together for easy review, requested by Rafael.
 -v10: address comments from Rafael.

 Jiang Liu (2):
   PCI: Fix a device reference count leakage issue in pci_dev_present()
   PCI: make PCI device create/destroy logic symmetric

 Tang Chen (1):
   PCI, ACPI: debug print for installation of acpi root bridge's
 notifier

 Yinghai Lu (8):
   PCI, acpiphp: Add is_hotplug_bridge detection
   PCI: Add root bus children dev's res to fail list
   PCI: Set dev_node early for pci_dev
   PCI, ACPI, acpiphp: Rename alloc_acpiphp_hp_work() to alloc_acpi_hp_work
   PCI, acpiphp: Move and enhance hotplug support of pci host bridge
   PCI, acpiphp: Don't bailout even no slots found yet.
   PCI: Skip attaching driver in device_add()
   PCI: Put pci dev to device tree as early as possible
>>>
>>> OK
>>>
>>> Please feel free to add
>>>
>>> Acked-by: Rafael J. Wysocki 
>>>
>>> to all of the patches in this series I haven't acked already.
> 
> I first pulled in
> "git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git
> acpi-scan" again (to pci/acpi-scan2), added your acks, Rafael, and put
> this series on a pci/yinghai-root-bus branch based on pci/acpi-scan2.
> 
> I reworked some of the changelogs a bit, but I don't think I made any
> code changes except that in [10/11] I just inlined the
> pci_bus_attach_device() code rather than making a new function, since
> it's small, there's only one caller, and I didn't think we needed any
> more pci_* and pci_bus_* functions than we already have.
> 
> Let me know if I messed anything up.
Great, so I could rebase my PCI notification related work to this branch.
I'm trying to resolve conflicts between acpi-scan and pci-root-bus-hotplug
last night.

Thanks!

> 
> Bjorn
> 
> .
> 




Re: [PATCH]cputime: make bool type for steal ticks

2013-01-25 Thread Joe Perches
On Sat, 2013-01-26 at 01:45 +0100, Frederic Weisbecker wrote:
> > diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
[]
> > @@ -282,7 +282,7 @@ static __always_inline bool steal_account_process_tick(void)
[]
> > -   return st;
> > +   return !!st;
> 
> I would expect gcc to perform the semantic "!!" cast implicitly. I
> just did some basic tests locally and it does.
> I prefer to be paranoid and not do any assumption though, unless I'm
> told gcc always guarantees this correct implicit cast. I'm queuing
> this patch and will send it to Ingo.

It's unnecessary.

6.3.1.2p1:

"When any scalar value is converted to _Bool, the result is 0
if the value compares equal to 0; otherwise, the result is 1."




[PATCH] checkpatch: Fix $Float creation of match variables

2013-01-25 Thread Joe Perches
commit 74349bccedb
("checkpatch: add support for floating point constants")
added an unnecessary match variable that caused
tests that used a $Constant or $LvalOrFunc to have
one too many matches.

This causes problems with usleep_range, min/max and
other extended tests.

Avoid using match variables in $Float.
Avoid using match variables in $Assignment too.

Signed-off-by: Joe Perches 
---
 scripts/checkpatch.pl | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
index 9de3a69..3d0f577 100755
--- a/scripts/checkpatch.pl
+++ b/scripts/checkpatch.pl
@@ -230,12 +230,12 @@ our $Inline   = qr{inline|__always_inline|noinline};
 our $Member= qr{->$Ident|\.$Ident|\[[^]]*\]};
 our $Lval  = qr{$Ident(?:$Member)*};
 
-our $Float_hex = qr{(?i:0x[0-9a-f]+p-?[0-9]+[fl]?)};
-our $Float_dec = qr{(?i:((?:[0-9]+\.[0-9]*|[0-9]*\.[0-9]+)(?:e-?[0-9]+)?[fl]?))};
-our $Float_int = qr{(?i:[0-9]+e-?[0-9]+[fl]?)};
+our $Float_hex = qr{(?i)0x[0-9a-f]+p-?[0-9]+[fl]?};
+our $Float_dec = qr{(?i)(?:[0-9]+\.[0-9]*|[0-9]*\.[0-9]+)(?:e-?[0-9]+)?[fl]?};
+our $Float_int = qr{(?i)[0-9]+e-?[0-9]+[fl]?};
 our $Float = qr{$Float_hex|$Float_dec|$Float_int};
-our $Constant  = qr{(?:$Float|(?i:(?:0x[0-9a-f]+|[0-9]+)[ul]*))};
-our $Assignment= qr{(?:\*\=|/=|%=|\+=|-=|<<=|>>=|&=|\^=|\|=|=)};
+our $Constant  = qr{$Float|(?i)(?:0x[0-9a-f]+|[0-9]+)[ul]*};
+our $Assignment= qr{\*\=|/=|%=|\+=|-=|<<=|>>=|&=|\^=|\|=|=};
 our $Compare= qr{<=|>=|==|!=|<|>};
 our $Operators = qr{
<=|>=|==|!=|




Re: [PATCH 3/3] acpi, memory-hotplug: Support getting hotplug info from SRAT.

2013-01-25 Thread Andrew Morton
On Fri, 25 Jan 2013 17:42:09 +0800
Tang Chen  wrote:

> NOTE: Using this way will cause NUMA performance down because the whole node
>   will be set as ZONE_MOVABLE, and kernel cannot use memory on it.
>   If users don't want to lose NUMA performance, just don't use it.

I agree with this, but it means that nobody will test any of your new code.

To get improved testing coverage, can you think of any temporary
testing-only patch which will cause testers to exercise the
memory-hotplug changes?



Re: [PATCH v2 3/3] timekeeping: Add CONFIG_HAS_PERSISTENT_CLOCK option

2013-01-25 Thread John Stultz

On 01/22/2013 11:49 AM, John Stultz wrote:

On 01/22/2013 11:44 AM, Jason Gunthorpe wrote:

On Tue, Jan 15, 2013 at 11:50:18AM -0800, John Stultz wrote:

On 01/15/2013 08:09 AM, Feng Tang wrote:

Make the persistent clock check a kernel config option, so that some
platform can explicitely select it, also make CONFIG_RTC_HCTOSYS 
depends

on its non-existence, which could prevent the persistent clock and RTC
code from doing similar thing twice during system's 
init/suspend/resume

phases.

If the CONFIG_HAS_PERSISTENT_CLOCK=n, then no change happens for 
kernel

which still does the persistent clock check in timekeeping_init().

Cc: Thomas Gleixner 
Suggested-by: John Stultz 
Signed-off-by: Feng Tang 

Applied. I also added a dependency for Jason's CONFIG_RTC_SYSTOHC.

Sort of an ugly config name, since I gather ARM should always set this
to 'n'...

CONFIG_USE_ONLY_PERSISTENT_CLOCK ?

(Sigh. I got this seemingly microseconds after I sent the pull request :)

So yea, fair point, there could be some confusion. But 
ONLY_PERSISTENT_CLOCK isn't quite right either,  more like 
CONFIG_HAS_PERSISTENT_CLOCK_ALWAYS or something.




Decided upon CONFIG_ALWAYS_USE_PERSISTENT_CLOCK which I think is clear 
enough.


Let me know if you object or have a better idea.

thanks
-john



Re: [tip:x86/asm] x86/xor: Make virtualization friendly

2013-01-25 Thread H. Peter Anvin

On 01/25/2013 02:15 PM, H. Peter Anvin wrote:

On 01/25/2013 02:11 PM, H. Peter Anvin wrote:

On 01/25/2013 02:43 AM, tip-bot for Jan Beulich wrote:

Commit-ID:  05fbf4d6fc6a3c0c3e63b77979c9311596716d10
Gitweb: http://git.kernel.org/tip/05fbf4d6fc6a3c0c3e63b77979c9311596716d10
Author: Jan Beulich 
AuthorDate: Fri, 2 Nov 2012 14:21:23 +
Committer:  Ingo Molnar 
CommitDate: Fri, 25 Jan 2013 09:23:51 +0100

x86/xor: Make virtualization friendly

In virtualized environments, the CR0.TS management needed here
can be a lot slower than anticipated by the original authors of
this code, which particularly means that in such cases forcing
the use of SSE- (or MMX-) based implementations is not desirable
- actual measurements should always be done in that case.

For consistency, pull into the shared (32- and 64-bit) header
not only the inclusion of the generic code, but also that of the
AVX variants.



This patch is wrong and should be dropped.  I verified it with the KVM
people that they do NOT want this change.  It is a Xen-specific problem.



FWIW: I have dropped this patch from tip:x86/asm.



The bottom line, I guess, is that we need something like 
cpu_has_slow_kernel_fpu or something like that, and set it for 
specifically affected hypervisors?


Do we know if Hyper-V has performance issues with CR0.TS?

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.



[PATCH 1/2] PCI: introduce accessor to retrieve PCIe Capabilities Register

2013-01-25 Thread Myron Stowe
Provide an accessor to retrieve the PCI Express device's Capabilities
Register.

Signed-off-by: Myron Stowe 
---

 include/linux/pci.h |9 +
 1 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/include/linux/pci.h b/include/linux/pci.h
index 15472d6..78581e1 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -1693,6 +1693,15 @@ static inline bool pci_is_pcie(struct pci_dev *dev)
 }
 
 /**
+ * pcie_caps_reg - get the PCIe Capabilities Register
+ * @dev: PCI device
+ */
+static inline u16 pcie_caps_reg(const struct pci_dev *dev)
+{
+   return dev->pcie_flags_reg;
+}
+
+/**
  * pci_pcie_type - get the PCIe device/port type
  * @dev: PCI device
  */



[PATCH 2/2] PCI: Use PCI Express Capability accessors

2013-01-25 Thread Myron Stowe
Use PCI Express Capability access functions to simplify device
Capabilities Register usages.

Signed-off-by: Myron Stowe 
---

 drivers/pci/access.c|4 ++--
 drivers/pci/pcie/portdrv_core.c |2 +-
 include/linux/pci.h |2 +-
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/pci/access.c b/drivers/pci/access.c
index 3af0478..5278ac6 100644
--- a/drivers/pci/access.c
+++ b/drivers/pci/access.c
@@ -472,7 +472,7 @@ EXPORT_SYMBOL_GPL(pci_cfg_access_unlock);
 
 static inline int pcie_cap_version(const struct pci_dev *dev)
 {
-   return dev->pcie_flags_reg & PCI_EXP_FLAGS_VERS;
+   return pcie_caps_reg(dev) & PCI_EXP_FLAGS_VERS;
 }
 
 static inline bool pcie_cap_has_devctl(const struct pci_dev *dev)
@@ -497,7 +497,7 @@ static inline bool pcie_cap_has_sltctl(const struct pci_dev *dev)
return pcie_cap_version(dev) > 1 ||
   type == PCI_EXP_TYPE_ROOT_PORT ||
   (type == PCI_EXP_TYPE_DOWNSTREAM &&
-   dev->pcie_flags_reg & PCI_EXP_FLAGS_SLOT);
+   pcie_caps_reg(dev) & PCI_EXP_FLAGS_SLOT);
 }
 
 static inline bool pcie_cap_has_rtctl(const struct pci_dev *dev)
diff --git a/drivers/pci/pcie/portdrv_core.c b/drivers/pci/pcie/portdrv_core.c
index b42133a..31063ac 100644
--- a/drivers/pci/pcie/portdrv_core.c
+++ b/drivers/pci/pcie/portdrv_core.c
@@ -272,7 +272,7 @@ static int get_port_device_capability(struct pci_dev *dev)
 
/* Hot-Plug Capable */
if ((cap_mask & PCIE_PORT_SERVICE_HP) &&
-   dev->pcie_flags_reg & PCI_EXP_FLAGS_SLOT) {
+   pcie_caps_reg(dev) & PCI_EXP_FLAGS_SLOT) {
pcie_capability_read_dword(dev, PCI_EXP_SLTCAP, &reg32);
if (reg32 & PCI_EXP_SLTCAP_HPC) {
services |= PCIE_PORT_SERVICE_HP;
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 78581e1..63b3628 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -1707,7 +1707,7 @@ static inline u16 pcie_caps_reg(const struct pci_dev *dev)
  */
 static inline int pci_pcie_type(const struct pci_dev *dev)
 {
-   return (dev->pcie_flags_reg & PCI_EXP_FLAGS_TYPE) >> 4;
+   return (pcie_caps_reg(dev) & PCI_EXP_FLAGS_TYPE) >> 4;
 }
 
 void pci_request_acs(void);



[PATCH 0/2] Extend interfaces to access PCIe capabilities registers

2013-01-25 Thread Myron Stowe
This series is a minor extension to Jiang Liu's recent efforts - [PATCH v3
00/32] provide interfaces to access PCIe capabilities registers - which
adds an additional PCI Express accessor for obtaining a device's
Capabilities Register.

Reference: https://lkml.org/lkml/2012/8/1/253
---

Myron Stowe (2):
  PCI: Use PCI Express Capability accessors
  PCI: introduce accessor to retrieve PCIe Capabilities Register


 drivers/pci/access.c|4 ++--
 drivers/pci/pcie/portdrv_core.c |2 +-
 include/linux/pci.h |   11 ++-
 3 files changed, 13 insertions(+), 4 deletions(-)

-- 


Re: [PATCH 19/19] [INCOMPLETE] ARM: make return_address available for ARM_UNWIND

2013-01-25 Thread Arnd Bergmann
On Friday 25 January 2013, Dave Martin wrote:
> On Fri, Jan 25, 2013 at 11:44:14AM -0500, Steven Rostedt wrote:
> > [ I got an error with linux-arm-ker...@list.infradead.org and had to
> > remove from CC ]
> 
> Blame Arnd :)
>

Sorry about that, I now posted the entire series again with the right
mailing list address.

Arnd


Re: [PATCH]cputime: make bool type for steal ticks

2013-01-25 Thread Frederic Weisbecker
2012/11/16 liguang :
> Signed-off-by: liguang 
> ---
>  kernel/sched/cputime.c |2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
> index 81b763b..d2c24c1 100644
> --- a/kernel/sched/cputime.c
> +++ b/kernel/sched/cputime.c
> @@ -282,7 +282,7 @@ static __always_inline bool steal_account_process_tick(void)
> this_rq()->prev_steal_time += st * TICK_NSEC;
>
> account_steal_time(st);
> -   return st;
> +   return !!st;

I would expect gcc to perform the semantic "!!" cast implicitly. I
just did some basic tests locally and it does.
I prefer to be paranoid and not do any assumption though, unless I'm
told gcc always guarantees this correct implicit cast. I'm queuing
this patch and will send it to Ingo.

Thanks!

> }
>  #endif
> return false;
> --
> 1.7.1
>


Re: [PATCH 3/3] acpi, memory-hotplug: Support getting hotplug info from SRAT.

2013-01-25 Thread Andrew Morton
On Fri, 25 Jan 2013 17:42:09 +0800
Tang Chen  wrote:

> We now provide an option for users who don't want to specify physical
> memory address in kernel commandline.
> 
> /*
>  * For movablemem_map=acpi:
>  *
>  * SRAT:|_| |_| |_| |_| ..
>  * node id:0   1 1   2
>  * hotpluggable:   n   y y   n
>  * movablemem_map:  |_| |_|
>  *
>  * Using movablemem_map, we can prevent memblock from allocating 
> memory
>  * on ZONE_MOVABLE at boot time.
>  */
> 
> So user just specify movablemem_map=acpi, and the kernel will use hotpluggable
> info in SRAT to determine which memory ranges should be set as ZONE_MOVABLE.

Well, as a reult of my previous hackery, arch/x86/mm/srat.c now looks
rather different.  Please check it carefully and runtime test this code
when it appears in linux-next?


/*
 * ACPI 3.0 based NUMA setup
 * Copyright 2004 Andi Kleen, SuSE Labs.
 *
 * Reads the ACPI SRAT table to figure out what memory belongs to which CPUs.
 *
 * Called from acpi_numa_init while reading the SRAT and SLIT tables.
 * Assumes all memory regions belonging to a single proximity domain
 * are in one chunk. Holes between them will be included in the node.
 */

#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 

int acpi_numa __initdata;

static __init int setup_node(int pxm)
{
return acpi_map_pxm_to_node(pxm);
}

static __init void bad_srat(void)
{
printk(KERN_ERR "SRAT: SRAT not used.\n");
acpi_numa = -1;
}

static __init inline int srat_disabled(void)
{
return acpi_numa < 0;
}

/* Callback for SLIT parsing */
void __init acpi_numa_slit_init(struct acpi_table_slit *slit)
{
int i, j;

for (i = 0; i < slit->locality_count; i++)
for (j = 0; j < slit->locality_count; j++)
numa_set_distance(pxm_to_node(i), pxm_to_node(j),
slit->entry[slit->locality_count * i + j]);
}

/* Callback for Proximity Domain -> x2APIC mapping */
void __init
acpi_numa_x2apic_affinity_init(struct acpi_srat_x2apic_cpu_affinity *pa)
{
int pxm, node;
int apic_id;

if (srat_disabled())
return;
if (pa->header.length < sizeof(struct acpi_srat_x2apic_cpu_affinity)) {
bad_srat();
return;
}
if ((pa->flags & ACPI_SRAT_CPU_ENABLED) == 0)
return;
pxm = pa->proximity_domain;
apic_id = pa->apic_id;
if (!apic->apic_id_valid(apic_id)) {
printk(KERN_INFO "SRAT: PXM %u -> X2APIC 0x%04x ignored\n",
 pxm, apic_id);
return;
}
node = setup_node(pxm);
if (node < 0) {
printk(KERN_ERR "SRAT: Too many proximity domains %x\n", pxm);
bad_srat();
return;
}

if (apic_id >= MAX_LOCAL_APIC) {
printk(KERN_INFO "SRAT: PXM %u -> APIC 0x%04x -> Node %u skipped apicid that is too big\n", pxm, apic_id, node);
return;
}
set_apicid_to_node(apic_id, node);
node_set(node, numa_nodes_parsed);
acpi_numa = 1;
printk(KERN_INFO "SRAT: PXM %u -> APIC 0x%04x -> Node %u\n",
   pxm, apic_id, node);
}

/* Callback for Proximity Domain -> LAPIC mapping */
void __init
acpi_numa_processor_affinity_init(struct acpi_srat_cpu_affinity *pa)
{
int pxm, node;
int apic_id;

if (srat_disabled())
return;
if (pa->header.length != sizeof(struct acpi_srat_cpu_affinity)) {
bad_srat();
return;
}
if ((pa->flags & ACPI_SRAT_CPU_ENABLED) == 0)
return;
pxm = pa->proximity_domain_lo;
if (acpi_srat_revision >= 2)
pxm |= *((unsigned int*)pa->proximity_domain_hi) << 8;
node = setup_node(pxm);
if (node < 0) {
printk(KERN_ERR "SRAT: Too many proximity domains %x\n", pxm);
bad_srat();
return;
}

if (get_uv_system_type() >= UV_X2APIC)
apic_id = (pa->apic_id << 8) | pa->local_sapic_eid;
else
apic_id = pa->apic_id;

if (apic_id >= MAX_LOCAL_APIC) {
printk(KERN_INFO "SRAT: PXM %u -> APIC 0x%02x -> Node %u skipped apicid that is too big\n", pxm, apic_id, node);
return;
}

set_apicid_to_node(apic_id, node);
node_set(node, numa_nodes_parsed);
acpi_numa = 1;
printk(KERN_INFO "SRAT: PXM %u -> APIC 0x%02x -> Node %u\n",
   pxm, apic_id, node);
}

#ifdef CONFIG_MEMORY_HOTPLUG
static inline int save_add_info(void) 

Re: [PATCH 2/3] acpi, memory-hotplug: Extend movablemem_map ranges to the end of node.

2013-01-25 Thread Andrew Morton
On Fri, 25 Jan 2013 17:42:08 +0800
Tang Chen  wrote:

> When implementing movablemem_map boot option, we introduced an array
> movablemem_map.map[] to store the memory ranges to be set as ZONE_MOVABLE.
> 
> Since ZONE_MOVABLE is the latst zone of a node, if user didn't specify
> the whole node memory range, we need to extend it to the node end so that
> we can use it to prevent memblock from allocating memory in the ranges
> user didn't specify.
> 
> We now implement movablemem_map boot option like this:
> /*
>  * For movablemem_map=nn[KMG]@ss[KMG]:
>  *
>  * SRAT:|_| |_| |_| |_| ..
>  * node id:0   1 1   2
>  * user specified:|__| |___|
>  * movablemem_map:|___| |_||__| ..
>  *
>  * Using movablemem_map, we can prevent memblock from allocating 
> memory
>  * on ZONE_MOVABLE at boot time.
>  *
>  * NOTE: In this case, SRAT info will be ignored.
>  */
> 

The patch generates a bunch of rejects, partly due to linux-next
changes but I think I fixed everything up OK.

> index 4ddf497..f841d0e 100644
> --- a/arch/x86/mm/srat.c
> +++ b/arch/x86/mm/srat.c
> @@ -141,11 +141,16 @@ static inline int save_add_info(void) {return 1;}
>  static inline int save_add_info(void) {return 0;}
>  #endif
>  
> +#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
> +extern struct movablemem_map movablemem_map;
> +#endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */

Well.

a) we shouldn't put extern declarations in C files - put them in
   headers so we can be assured that all compilation units agree on the
   type.

b) the ifdefs are unneeded - a unused extern declaration is OK (as
   long as the type itself is always defined!)

c) movablemem_map is already declared in memblock.h.

So I zapped the above three lines.

> @@ -178,9 +185,57 @@ acpi_numa_memory_affinity_init(struct acpi_srat_mem_affinity *ma)
>  
>   node_set(node, numa_nodes_parsed);
>  
> - printk(KERN_INFO "SRAT: Node %u PXM %u [mem %#010Lx-%#010Lx]\n",
> + printk(KERN_INFO "SRAT: Node %u PXM %u [mem %#010Lx-%#010Lx] %s\n",
>  node, pxm,
> -(unsigned long long) start, (unsigned long long) end - 1);
> +(unsigned long long) start, (unsigned long long) end - 1,
> +hotpluggable ? "Hot Pluggable": "");
> +
> +#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
> + int overlap;
> + unsigned long start_pfn, end_pfn;

no, we don't put declarations of locals in the middle of C statements
like this:

arch/x86/mm/srat.c: In function 'acpi_numa_memory_affinity_init':
arch/x86/mm/srat.c:185: warning: ISO C90 forbids mixed declarations and code

Did your compiler not emit this warning?

I fixed this by moving the code into a new function
"handle_movablemem".  Feel free to suggest a more appropriate name!


From: Andrew Morton 
Subject: acpi-memory-hotplug-extend-movablemem_map-ranges-to-the-end-of-node-fix

clean up code, fix build warning

Cc: "Brown, Len" 
Cc: "H. Peter Anvin" 
Cc: Ingo Molnar 
Cc: Jiang Liu 
Cc: Jianguo Wu 
Cc: KOSAKI Motohiro 
Cc: Kamezawa Hiroyuki 
Cc: Lai Jiangshan 
Cc: Len Brown 
Cc: Tang Chen 
Cc: Thomas Gleixner 
Cc: Wu Jianguo 
Cc: Yasuaki Ishimatsu 
Signed-off-by: Andrew Morton 
---

 arch/x86/mm/srat.c |   93 ++-
 1 file changed, 49 insertions(+), 44 deletions(-)

diff -puN arch/x86/mm/srat.c~acpi-memory-hotplug-extend-movablemem_map-ranges-to-the-end-of-node-fix arch/x86/mm/srat.c
--- a/arch/x86/mm/srat.c~acpi-memory-hotplug-extend-movablemem_map-ranges-to-the-end-of-node-fix
+++ a/arch/x86/mm/srat.c
@@ -142,50 +142,8 @@ static inline int save_add_info(void) {r
 #endif
 
 #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
-extern struct movablemem_map movablemem_map;
-#endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
-
-/* Callback for parsing of the Proximity Domain <-> Memory Area mappings */
-int __init
-acpi_numa_memory_affinity_init(struct acpi_srat_mem_affinity *ma)
+static void __init handle_movablemem(int node, u64 start, u64 end)
 {
-   u64 start, end;
-   u32 hotpluggable;
-   int node, pxm;
-
-   if (srat_disabled())
-   goto out_err;
-   if (ma->header.length != sizeof(struct acpi_srat_mem_affinity))
-   goto out_err_bad_srat;
-   if ((ma->flags & ACPI_SRAT_MEM_ENABLED) == 0)
-   goto out_err;
-   hotpluggable = ma->flags & ACPI_SRAT_MEM_HOT_PLUGGABLE;
-   if (hotpluggable && !save_add_info())
-   goto out_err;
-
-   start = ma->base_address;
-   end = start + ma->length;
-   pxm = ma->proximity_domain;
-   if (acpi_srat_revision <= 1)
-   pxm &= 0xff;
-
-   node = setup_node(pxm);
-   if (node < 0) {
-   printk(KERN_ERR "SRAT: Too many proximity domains.\n");
-   goto out_err_bad_srat;
-   }
-
