Re: [PATCH] acpi: video: improve quirk check

2013-08-03 Thread Felipe Contreras
On Sat, Aug 3, 2013 at 8:47 PM, Aaron Lu  wrote:
> On Sun, Aug 4, 2013 at 6:20 AM, Felipe Contreras
>  wrote:
>> On Sat, Aug 3, 2013 at 4:40 PM, Rafael J. Wysocki  wrote:

>>> Do we still need to revert commit efaa14c if this patch is applied?
>>
>> I guess not. At least on this machine changing the backlight works
>> correctly, whereas in v3.11-rc3 it behaved erratically, and v3.7-v3.10
>> didn't work at all. I cannot see how it would negatively affect other
>> machines.
>
> That commit makes the hotkeys emit notifications, and it is not the
> cause of the "booting into a black screen" problem; that problem is
> due to a broken _BQC.

The broken _BQC has been there for quite some time, hasn't it?

Either way, without efaa14c, changing the backlight doesn't work at
all, so there's no black screen, because there cannot be.

> BTW, efaa14c will also turn the screen off at level 0 according
> to Felipe, who considers this a bug. But since it is required to
> let firmware emit notifications on hotkey press, I think users will
> want it.

With or without efaa14c, level 0 turns the screen off, or at least it
would if the control worked at all. So efaa14c is one step forward
but two steps back. My patch removes the two steps back, but we are
still not at the level of what my blacklisting patch does; for that we
would need to fix two issues:

1. Fix the retrieval of the last level at boot
2. Fix level 0 (yes, I consider that a regression)

But we cannot achieve either of those for v3.11; the only
possibilities seem to be either a) revert efaa14c, or b) keep it and
apply my patch. Nothing else seems to be a possible or sensible
option, and I vote for b).

-- 
Felipe Contreras
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] acpi: video: improve quirk check

2013-08-03 Thread Felipe Contreras
On Sat, Aug 3, 2013 at 8:18 PM, Aaron Lu  wrote:
> On 08/03/2013 07:34 PM, Rafael J. Wysocki wrote:
>> On Saturday, August 03, 2013 04:14:04 PM Aaron Lu wrote:
>>> On 08/03/2013 07:47 AM, Rafael J. Wysocki wrote:
>>>> On Friday, August 02, 2013 02:37:09 PM Felipe Contreras wrote:
>>>>> If the _BCL package is descending, the first level (br->levels[2]) will
>>>>> be 0, and if the number of levels matches the number of steps, we might
>>>>> confuse a returned level to mean the index.
>>>>>
>>>>> For example:
>>>>>
>>>>>   current_level = max_level = 100
>>>>>   test_level = 0
>>>>>   returned level = 100
>>>>>
>>>>> In this case 100 means the level, not the index, and _BCM failed. But if
>>>>> the _BCL package is descending, the index of level 0 is also 100, so we
>>>>> assume _BQC is indexed, when it's not.
>>>>>
>>>>> This causes all _BQC calls to return bogus values causing weird behavior
>>>>> from the user's perspective. For example: xbacklight -set 10; xbacklight
>>>>> -set 20; would flash to 90% and then slowly down to the desired level
>>>>> (20).
>>>>>
>>>>> The solution is simple; test anything other than the first level (e.g.
>>>>> 1).
>>>>>
>>>>> Signed-off-by: Felipe Contreras 
>>>>
>>>> Looks reasonable.
>>>>
>>>> Aaron, what do you think?
>>>
>>> Yes, the patch is correct, but I still prefer my own version :-)
>>> https://github.com/aaronlu/linux/commit/0a3d2c5b59caf80ae5bb1ca1fda0f7bf448b38c9
>>>
>>> In case you want to take mine and mine needs refresh, please let me know
>>> and I can do the re-base, thanks.
>>
>> Well, I prefer simpler, unless there's a good reason to use more complicated.
>>
>> Why exactly do you think your version is better?
>
> As explained here:
> https://lkml.org/lkml/2013/8/2/81
> https://lkml.org/lkml/2013/8/2/112
>
> And for the demo broken _BQC, my patch will disable _BQC while still
> making the backlight work, whereas this patch tests the max
> brightness level and may fail.

Yes, but that problem can *only* happen in such a simplified _BCL,
which is very very unlikely. Still, it would make sense to fix the
code for that case.

However, we can fix the problem first for the real known cases with my
simple one-liner patch that can even be merged for v3.11, and *later*
fix the issue for the synthetic unlikely case.

Personally I think there are better ways to fix the code for the
synthetic case than what your patch does, which will also make _BQC
work. That can be discussed later though; the one-liner is simple, and
it works.
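
To make the false positive from the quoted patch description concrete,
here is a simplified userspace model of the index-versus-level guess. It
is only a sketch under stated assumptions: bqc_looks_indexed() and
demo_descending_0_to_100() are hypothetical names, the table holds just
the sorted levels (so levels[0] stands in for br->levels[2]), and the
firmware is assumed to ignore _BCM and keep reporting level 100.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Guess whether a _BQC return value is an index into the level table.
 * levels[] is sorted ascending; "reversed" means the firmware's own
 * _BCL package was descending, so firmware index i maps to n - 1 - i.
 */
static bool bqc_looks_indexed(const int *levels, size_t n, bool reversed,
                              int test_level, int returned)
{
    size_t idx;

    assert(levels != NULL && n > 0);

    if (returned == test_level)
        return false;            /* _BQC echoed the level we set */
    if (returned < 0 || (size_t)returned >= n)
        return false;            /* cannot be an index at all */

    idx = reversed ? n - 1 - (size_t)returned : (size_t)returned;
    return levels[idx] == test_level;
}

/*
 * Hypothetical firmware: 101 levels (0..100) listed in descending order,
 * _BCM fails, and _BQC keeps returning the level 100. test_index selects
 * which sorted level is used as the probe value.
 */
static bool demo_descending_0_to_100(size_t test_index)
{
    int levels[101];
    int i;

    assert(test_index <= 100);
    for (i = 0; i <= 100; i++)
        levels[i] = i;

    return bqc_looks_indexed(levels, 101, true, levels[test_index], 100);
}
```

Under these assumptions, probing with the first level
(demo_descending_0_to_100(0)) reports a bogus index match, while probing
with the second level (demo_descending_0_to_100(1)) does not.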

-- 
Felipe Contreras


notice

2013-08-03 Thread Western Union

--
Seven hundred fifty thousand dollars have been deposited for you by Western
Union. Send your name, telephone number, address, occupation





Re: [PATCH 01/18] mm, hugetlb: protect reserved pages when softofflining requests the pages

2013-08-03 Thread Hillf Danton
On Fri, Aug 2, 2013 at 12:17 AM, Aneesh Kumar K.V
 wrote:
> Hillf Danton  writes:
>
>> On Wed, Jul 31, 2013 at 2:37 PM, Joonsoo Kim  wrote:
>>> On Wed, Jul 31, 2013 at 02:21:38PM +0800, Hillf Danton wrote:
>>>> On Wed, Jul 31, 2013 at 12:41 PM, Joonsoo Kim wrote:
>>>> > On Wed, Jul 31, 2013 at 10:49:24AM +0800, Hillf Danton wrote:
>>>> >> On Wed, Jul 31, 2013 at 10:27 AM, Joonsoo Kim wrote:
>>>> >> > On Mon, Jul 29, 2013 at 03:24:46PM +0800, Hillf Danton wrote:
>>>> >> >> On Mon, Jul 29, 2013 at 1:31 PM, Joonsoo Kim wrote:
>>>> >> >> > alloc_huge_page_node() use dequeue_huge_page_node() without
>>>> >> >> > any validation check, so it can steal reserved page unconditionally.
>>>> >> >>
>>>> >> >> Well, why is it illegal to use reserved page here?
>>>> >> >
>>>> >> > If we use reserved page here, other processes which are promised to use
>>>> >> > enough hugepages cannot get enough hugepages and can die. This is
>>>> >> > unexpected result to them.
>>>> >> >
>>>> >> But, how do you determine that a huge page is requested by a process
>>>> >> that is not allowed to use reserved pages?
>>>> >
>>>> > Reserved page is just one for each address or file offset. If we need to
>>>> > move this page, this means that it already use it's own reserved page, this
>>>> > page is it. So we should not use other reserved page for moving this page.
>>>> >
>>>> Hm, how do you determine "this page" is not buddy?
>>>
>>> If this page comes from the buddy, it doesn't matter. It implies that
>>> this mapping cannot use reserved page pool, because we always allocate
>>> a page from reserved page pool first.
>>>
>> A buddy page also implies, if the mapping can use reserved pages, that no
>> reserved page was available when requested. Now we can try reserved
>> page again.
>
> I didn't quite get that. My understanding is, the new page we are

Neither did I ;)

> allocating here for soft offline should not be allocated from the
> reserve pool. If we do that we may possibly have allocation failure
> later for a request that had done page reservation. Now to
> avoid that we make sure we have enough free pages outside reserve pool
> so that we can dequeue the huge page. If not we use buddy to allocate
> the hugepage.
>
What if it is a mapping with HPAGE_RESV_OWNER set?
Or can we block owner from using any page available here?


Re: [RFC/PATCH 0/2] ext4: Transparent Decompression Support

2013-08-03 Thread Jörn Engel
On Sat, 3 August 2013 20:33:16 -0400, Theodore Ts'o wrote:
> 
> P.P.S.  At least in theory, nothing of what I've described here has to
> be ext4 specific.  We could implement this in the VFS layer, at which
> point not only ext4 would benefit, but also btrfs, xfs, f2fs, etc.

Except for an inode bit that needs to be stored in the filesystem,
agreed.  The ugliness I see is in detecting how to treat the
filesystem at hand.

Filesystems with mandatory compression (jffs2, ubifs,...):
- Just write the file, nothing to do.
Filesystems with optional compression (logfs, ext2compr,...):
- You may or may not want to chattr between file creation and writing
  the payload.
Filesystems without compression (ext[234], xfs,...):
- Just write the file, nothing can be done.
- Alternatively fall back to a userspace version.
Filesystems with optional uncompression (what is being proposed):
- Write the file in compressed form, close, chattr.

I would like to see the compression side done in the kernel as well.
Then we can chattr right after creat() and, if that fails, either
proceed anyway or go to a userspace fallback.  All decisions can be
made early on and we don't have to share the format with lots of
userspace.

Sure, we still have to share the format with fsck and similar
filesystem tools.  But not with installers.

Jörn

--
One must not confuse what seems improbable and unnatural to us
with what is absolutely impossible.
-- Carl Friedrich Gauß


Re: [git pull] drm build fix

2013-08-03 Thread Linus Torvalds
On Sat, Aug 3, 2013 at 6:08 PM, Dave Airlie  wrote:
>
> Alex Deucher (1):
>   drm/radeon: fix 64 bit divide in SI spm code

That code is stupid. You're asking for a 64-by-64 divide, when the
divisor is clearly an "int" (100 and 1000 respectively).

Why is it doing "div64_s64()" instead of the simpler and faster "div_s64()"?

A full 64-by-64 divide is _expensive_ on 32-bit architecture (up to
four divide instructions, each potentially expensive in its own
right), which is the whole reason why we have that "math64.h" to begin
with - to make people aware of it.

Now, our lib/div64.c routines do notice that the upper bits of the
divisor are zero and end up using the simpler 64-by-32 divide
functions, but why explicitly ask for those more complex functions to
begin with?

So the code is likely not performance critical, and hey, our library
routines do the simple optimizations to avoid the trivially excessive
divide instructions, so this "doesn't matter". Except for the
annoyance factor of you using a more complicated function for no
reason.
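
For readers without the kernel tree at hand, the distinction can be
sketched in plain C. The two functions below are userspace stand-ins
(the sketch_ names are made up here) that only mirror the argument types
of div64_s64() and div_s64(); they are not the kernel implementations
and say nothing about how many divide instructions either one costs.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for a 64-by-64 signed divide: both operands are 64 bits wide. */
static int64_t sketch_div64_s64(int64_t dividend, int64_t divisor)
{
    assert(divisor != 0);
    return dividend / divisor;
}

/* Stand-in for a 64-by-32 signed divide: the divisor is only 32 bits wide. */
static int64_t sketch_div_s64(int64_t dividend, int32_t divisor)
{
    assert(divisor != 0);
    return dividend / divisor;
}
```

When the divisor is a small constant such as 100 or 1000, both calls
give the same quotient, so the narrower signature is sufficient.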

   Linus


Re: [RFC/PATCH 0/2] ext4: Transparent Decompression Support

2013-08-03 Thread Theodore Ts'o
On Fri, Jul 26, 2013 at 09:20:34AM -0400, Jörn Engel wrote:
> 
> I don't think the e2compr patches are strictly necessary.  They are a
> good option, but not the only one.

Sorry for not chiming in earlier; I've been travelling this past week,
and between that and a bunch of other things I've fallen a bit behind
on my e-mail.

> One trick to simplify the problem is to make Dhaval's compressed files
> strictly read-only.  It will require some dance to load the compressed
> content, flip the switch, then uncompress data on the fly and disallow
> writes.  Not the most pleasing of interfaces, but yet another option.

Yeah, this is something that I've wanted for a while.  (In fact a few
years ago I shopped around this design to some folks who were
associated with Firefox.)  MacOS has something rather similar to this.
I haven't had a chance to look at Dhaval's patches yet, but the way
I've been thinking about this is that the compression and building the
table mapping compressed clusters to byte offsets in the file would be
done in userspace.  Once the compressed file plus the table is written
to the disk, the userspace program would then close the file
descriptor, and then set the "compressed" bit.

When the bit is set, we flush all of its pages from the page cache,
and the file becomes immutable.  At that point, the kernel will handle
the decompression, by implementing readpages() by reading the pages
into the buffer cache, and then decompressing the compressed cluster
of pages into the page cache.  This gives us transparent compression,
with a fraction of the complexity of supporting read/write
compression.  In addition, since we don't have to worry about rewriting a
cluster (and having the modified compressed cluster taking up more
space), the on-disk representation can be a lot more efficient, since
you don't have to use a stacker-style design.
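
As a rough illustration of the table mentioned above, here is a sketch
of how a reader could map an uncompressed file position to the
compressed extent that holds it. Nothing here comes from Dhaval's
patches: struct cluster_map, cluster_lookup() and the demo table are
hypothetical, and the layout is an in-memory assumption, not a proposed
on-disk format.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory form of the table; not a proposed on-disk format. */
struct cluster_map {
    uint32_t cluster_size;   /* uncompressed bytes covered by one cluster */
    size_t nr_clusters;
    const uint64_t *start;   /* nr_clusters + 1 offsets into the compressed data */
};

/*
 * Find the compressed extent that holds uncompressed file position pos.
 * Returns 0 and fills *off / *len on success, -1 if pos is past the end.
 */
static int cluster_lookup(const struct cluster_map *m, uint64_t pos,
                          uint64_t *off, uint64_t *len)
{
    uint64_t idx;

    assert(m != NULL && m->cluster_size != 0);
    idx = pos / m->cluster_size;
    if (idx >= m->nr_clusters)
        return -1;

    *off = m->start[idx];
    *len = m->start[idx + 1] - m->start[idx];
    return 0;
}

/* Example: three 64 KiB clusters compressed to 20000, 25000 and 5000 bytes. */
static const uint64_t demo_start[] = { 0, 20000, 45000, 50000 };
static const struct cluster_map demo_map = { 65536, 3, demo_start };

/* Compressed length of the cluster covering pos, or -1 if out of range. */
static int64_t demo_extent_len(uint64_t pos)
{
    uint64_t off, len;

    if (cluster_lookup(&demo_map, pos, &off, &len) != 0)
        return -1;
    return (int64_t)len;
}
```

Because the file is immutable once the bit is set, a flat array of
offsets like this never needs to be updated in place, which is the
simplification the paragraph above relies on.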

One of the cool things about this design is that the vast majority of
files on a typical distribution are write-once, and better yet, they
are written by the package manager.  So once you teach dpkg, rpm, and
the Android package installer how to write the file in this compressed
format and set the compressed bit, we can get the vast majority of the
benefits of using compressed files with minimal effort.

- Ted

P.S.  This is interesting not just for systems with slow HDD's, but
also for cheap, single-channel MMC flash, the kind found in low-end
handset and embedded systems.

P.P.S.  At least in theory, nothing of what I've described here has to
be ext4 specific.  We could implement this in the VFS layer, at which
point not only ext4 would benefit, but also btrfs, xfs, f2fs, etc.


[PATCH 03/23] thp: compile-time and sysfs knob for thp pagecache

2013-08-03 Thread Kirill A. Shutemov
From: "Kirill A. Shutemov" 

For now, TRANSPARENT_HUGEPAGE_PAGECACHE is only implemented for x86_64.

Radix tree preload overhead can be significant on BASE_SMALL systems, so
let's add a dependency on !BASE_SMALL.

/sys/kernel/mm/transparent_hugepage/page_cache is a runtime knob for the
feature. It's enabled by default if TRANSPARENT_HUGEPAGE_PAGECACHE is
enabled.

Signed-off-by: Kirill A. Shutemov 
---
 Documentation/vm/transhuge.txt |  9 +
 include/linux/huge_mm.h|  9 +
 mm/Kconfig | 12 
 mm/huge_memory.c   | 23 +++
 4 files changed, 53 insertions(+)

diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt
index 4a63953..4cc15c4 100644
--- a/Documentation/vm/transhuge.txt
+++ b/Documentation/vm/transhuge.txt
@@ -103,6 +103,15 @@ echo always >/sys/kernel/mm/transparent_hugepage/enabled
 echo madvise >/sys/kernel/mm/transparent_hugepage/enabled
 echo never >/sys/kernel/mm/transparent_hugepage/enabled
 
+If TRANSPARENT_HUGEPAGE_PAGECACHE is enabled kernel will use huge pages in
+page cache if possible. It can be disabled and re-enabled via sysfs:
+
+echo 0 >/sys/kernel/mm/transparent_hugepage/page_cache
+echo 1 >/sys/kernel/mm/transparent_hugepage/page_cache
+
+If it's disabled kernel will not add new huge pages to page cache and
+split them on mapping, but already mapped pages will stay intact.
+
 It's also possible to limit defrag efforts in the VM to generate
 hugepages in case they're not immediately free to madvise regions or
 to never try to defrag memory and simply fallback to regular pages
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 3935428..1534e1e 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -40,6 +40,7 @@ enum transparent_hugepage_flag {
TRANSPARENT_HUGEPAGE_DEFRAG_FLAG,
TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG,
TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG,
+   TRANSPARENT_HUGEPAGE_PAGECACHE,
TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG,
 #ifdef CONFIG_DEBUG_VM
TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG,
@@ -229,4 +230,12 @@ static inline int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_str
 
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
+static inline bool transparent_hugepage_pagecache(void)
+{
+   if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE))
+   return false;
+   if (!(transparent_hugepage_flags & (1

[PATCH 07/23] mm: trace filemap: dump page order

2013-08-03 Thread Kirill A. Shutemov
From: "Kirill A. Shutemov" 

Dump page order to trace to be able to distinguish between small page
and huge page in page cache.

Signed-off-by: Kirill A. Shutemov 
Acked-by: Dave Hansen 
---
 include/trace/events/filemap.h | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/include/trace/events/filemap.h b/include/trace/events/filemap.h
index 0421f49..7e14b13 100644
--- a/include/trace/events/filemap.h
+++ b/include/trace/events/filemap.h
@@ -21,6 +21,7 @@ DECLARE_EVENT_CLASS(mm_filemap_op_page_cache,
__field(struct page *, page)
__field(unsigned long, i_ino)
__field(unsigned long, index)
+   __field(int, order)
__field(dev_t, s_dev)
),
 
@@ -28,18 +29,20 @@ DECLARE_EVENT_CLASS(mm_filemap_op_page_cache,
__entry->page = page;
__entry->i_ino = page->mapping->host->i_ino;
__entry->index = page->index;
+   __entry->order = compound_order(page);
if (page->mapping->host->i_sb)
__entry->s_dev = page->mapping->host->i_sb->s_dev;
else
__entry->s_dev = page->mapping->host->i_rdev;
),
 
-   TP_printk("dev %d:%d ino %lx page=%p pfn=%lu ofs=%lu",
+   TP_printk("dev %d:%d ino %lx page=%p pfn=%lu ofs=%lu order=%d",
MAJOR(__entry->s_dev), MINOR(__entry->s_dev),
__entry->i_ino,
__entry->page,
page_to_pfn(__entry->page),
-   __entry->index << PAGE_SHIFT)
+   __entry->index << PAGE_SHIFT,
+   __entry->order)
 );
 
 DEFINE_EVENT(mm_filemap_op_page_cache, mm_filemap_delete_from_page_cache,
-- 
1.8.3.2



[PATCH 05/23] thp: represent file thp pages in meminfo and friends

2013-08-03 Thread Kirill A. Shutemov
From: "Kirill A. Shutemov" 

The patch adds a new zone stat to count file transparent huge pages and
adjusts related places.

For now we don't count mapped or dirty file thp pages separately.

The patch depends on patch
 thp: account anon transparent huge pages into NR_ANON_PAGES
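
The diff below reports the new counter in kB by multiplying the number
of huge pages by HPAGE_PMD_NR and passing the result through K(). As a
back-of-the-envelope sketch, assuming 4 KiB base pages and 2 MiB huge
pages (typical for x86_64) and modelling K() as a shift from pages to
kB, the arithmetic looks like this. The SKETCH_ names and file_thp_kb()
are illustrative only.

```c
#include <assert.h>

/* Assumed values for x86_64 with 4 KiB base pages and 2 MiB huge pages. */
#define SKETCH_PAGE_SHIFT   12
#define SKETCH_HPAGE_PMD_NR 512UL

/* Assumed shape of the K() helper: number of base pages to kB. */
#define SKETCH_K(x) ((x) << (SKETCH_PAGE_SHIFT - 10))

/* kB value that would be printed for nr_huge_pages file huge pages. */
static unsigned long file_thp_kb(unsigned long nr_huge_pages)
{
    /* keep the multiplication from overflowing in this toy model */
    assert(nr_huge_pages <= (~0UL >> 11));
    return SKETCH_K(nr_huge_pages * SKETCH_HPAGE_PMD_NR);
}
```

With those assumptions, three file huge pages would show up as 6144 kB.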

Signed-off-by: Kirill A. Shutemov 
Acked-by: Dave Hansen 
---
 drivers/base/node.c| 4 
 fs/proc/meminfo.c  | 3 +++
 include/linux/mmzone.h | 1 +
 mm/vmstat.c| 1 +
 4 files changed, 9 insertions(+)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index bc9f43b..de261f5 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -119,6 +119,7 @@ static ssize_t node_read_meminfo(struct device *dev,
   "Node %d SUnreclaim: %8lu kB\n"
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
   "Node %d AnonHugePages:  %8lu kB\n"
+  "Node %d FileHugePages:  %8lu kB\n"
 #endif
,
   nid, K(node_page_state(nid, NR_FILE_DIRTY)),
@@ -140,6 +141,9 @@ static ssize_t node_read_meminfo(struct device *dev,
   nid, K(node_page_state(nid, NR_SLAB_UNRECLAIMABLE))
, nid,
K(node_page_state(nid, NR_ANON_TRANSPARENT_HUGEPAGES) *
+   HPAGE_PMD_NR)
+   , nid,
+   K(node_page_state(nid, NR_FILE_TRANSPARENT_HUGEPAGES) *
HPAGE_PMD_NR));
 #else
   nid, K(node_page_state(nid, NR_SLAB_UNRECLAIMABLE)));
diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index 59d85d6..a62952c 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -104,6 +104,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
 #endif
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
"AnonHugePages:  %8lu kB\n"
+   "FileHugePages:  %8lu kB\n"
 #endif
,
K(i.totalram),
@@ -158,6 +159,8 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
,K(global_page_state(NR_ANON_TRANSPARENT_HUGEPAGES) *
   HPAGE_PMD_NR)
+   ,K(global_page_state(NR_FILE_TRANSPARENT_HUGEPAGES) *
+  HPAGE_PMD_NR)
 #endif
);
 
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 0c41d59..ba81833 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -142,6 +142,7 @@ enum zone_stat_item {
NUMA_OTHER, /* allocation from other node */
 #endif
NR_ANON_TRANSPARENT_HUGEPAGES,
+   NR_FILE_TRANSPARENT_HUGEPAGES,
NR_FREE_CMA_PAGES,
NR_VM_ZONE_STAT_ITEMS };
 
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 87228c5..ffe3fbd 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -739,6 +739,7 @@ const char * const vmstat_text[] = {
"numa_other",
 #endif
"nr_anon_transparent_hugepages",
+   "nr_file_transparent_hugepages",
"nr_free_cma",
"nr_dirty_threshold",
"nr_dirty_background_threshold",
-- 
1.8.3.2



[PATCH 13/23] thp, mm: allocate huge pages in grab_cache_page_write_begin()

2013-08-03 Thread Kirill A. Shutemov
From: "Kirill A. Shutemov" 

Try to allocate huge page if flags has AOP_FLAG_TRANSHUGE.

If, for some reason, it's not possible to allocate a huge page at this
position, it returns NULL. Caller should take care of fallback to
small pages.
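
The caller-side fallback is not part of this patch, so the following is
only a toy userspace model of the contract described above. The
sketch_ functions are hypothetical, the "allocator" just returns a page
order (or -1 where the real function would return NULL), and order 9
assumes 2 MiB huge pages on 4 KiB base pages.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_FLAG_TRANSHUGE 0x0008u  /* same value as AOP_FLAG_TRANSHUGE in the diff */
#define SKETCH_HUGE_ORDER     9        /* assumed: 2 MiB huge page, 4 KiB base page */

/*
 * Toy allocator: returns the order of the page it "allocated", or -1 where
 * the real function would return NULL. huge_ok says whether a huge page
 * can be had at this position.
 */
static int sketch_grab_page(unsigned int flags, bool huge_ok)
{
    if (flags & SKETCH_FLAG_TRANSHUGE)
        return huge_ok ? SKETCH_HUGE_ORDER : -1;
    return 0;
}

/* Caller side: ask for a huge page first, retry with a small page on failure. */
static int sketch_write_begin(unsigned int flags, bool huge_ok)
{
    int order = sketch_grab_page(flags, huge_ok);

    if (order < 0 && (flags & SKETCH_FLAG_TRANSHUGE))
        order = sketch_grab_page(flags & ~SKETCH_FLAG_TRANSHUGE, huge_ok);

    assert(order >= 0);   /* small pages never fail in this toy model */
    return order;
}
```

The point of the model is only that the huge-page request fails cleanly
and the retry drops the flag rather than silently downgrading inside
the allocator.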

Signed-off-by: Kirill A. Shutemov 
---
 include/linux/fs.h |  1 +
 mm/filemap.c   | 24 ++--
 2 files changed, 23 insertions(+), 2 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index b09ddc0..d5f58b3 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -282,6 +282,7 @@ enum positive_aop_returns {
 #define AOP_FLAG_NOFS  0x0004 /* used by filesystem to direct
* helper code (eg buffer layer)
* to clear GFP_FS from alloc */
+#define AOP_FLAG_TRANSHUGE 0x0008 /* allocate transhuge page */
 
 /*
  * oh the beauties of C type declarations.
diff --git a/mm/filemap.c b/mm/filemap.c
index 28f4927..b17ebb9 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2313,18 +2313,38 @@ struct page *grab_cache_page_write_begin(struct address_space *mapping,
gfp_t gfp_mask;
struct page *page;
gfp_t gfp_notmask = 0;
+   bool must_use_thp = (flags & AOP_FLAG_TRANSHUGE) &&
+   IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE);
+
 
gfp_mask = mapping_gfp_mask(mapping);
+   if (must_use_thp) {
+   BUG_ON(index & HPAGE_CACHE_INDEX_MASK);
+   BUG_ON(!(gfp_mask & __GFP_COMP));
+   }
if (mapping_cap_account_dirty(mapping))
gfp_mask |= __GFP_WRITE;
if (flags & AOP_FLAG_NOFS)
gfp_notmask = __GFP_FS;
 repeat:
page = find_lock_page(mapping, index);
-   if (page)
+   if (page) {
+   if (must_use_thp && !PageTransHuge(page)) {
+   unlock_page(page);
+   page_cache_release(page);
+   return NULL;
+   }
goto found;
+   }
 
-   page = __page_cache_alloc(gfp_mask & ~gfp_notmask);
+   if (must_use_thp) {
+   page = alloc_pages(gfp_mask & ~gfp_notmask, HPAGE_PMD_ORDER);
+   if (page)
+   count_vm_event(THP_WRITE_ALLOC);
+   else
+   count_vm_event(THP_WRITE_ALLOC_FAILED);
+   } else
+   page = __page_cache_alloc(gfp_mask & ~gfp_notmask);
if (!page)
return NULL;
status = add_to_page_cache_lru(page, mapping, index,
-- 
1.8.3.2



[PATCH 08/23] block: implement add_bdi_stat()

2013-08-03 Thread Kirill A. Shutemov
From: "Kirill A. Shutemov" 

We're going to add/remove a number of page cache entries at once. This
patch implements add_bdi_stat(), which adjusts bdi stats by an arbitrary
amount. It's required for batched page cache manipulations.

Signed-off-by: Kirill A. Shutemov 
---
 include/linux/backing-dev.h | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index c388155..7060180 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -166,6 +166,16 @@ static inline void __dec_bdi_stat(struct backing_dev_info *bdi,
__add_bdi_stat(bdi, item, -1);
 }
 
+static inline void add_bdi_stat(struct backing_dev_info *bdi,
+   enum bdi_stat_item item, s64 amount)
+{
+   unsigned long flags;
+
+   local_irq_save(flags);
+   __add_bdi_stat(bdi, item, amount);
+   local_irq_restore(flags);
+}
+
 static inline void dec_bdi_stat(struct backing_dev_info *bdi,
enum bdi_stat_item item)
 {
-- 
1.8.3.2



[PATCH 12/23] thp, mm: add event counters for huge page alloc on file write or read

2013-08-03 Thread Kirill A. Shutemov
From: "Kirill A. Shutemov" 

Existing stats specify the source of a thp page: fault or collapse. We're
going to allocate a new huge page with write(2) and read(2). It's neither
fault nor collapse.

Let's introduce new events for that.

Signed-off-by: Kirill A. Shutemov 
---
 Documentation/vm/transhuge.txt | 7 +++
 include/linux/huge_mm.h| 5 +
 include/linux/vm_event_item.h  | 4 
 mm/vmstat.c| 4 
 4 files changed, 20 insertions(+)

diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt
index 4cc15c4..a78f738 100644
--- a/Documentation/vm/transhuge.txt
+++ b/Documentation/vm/transhuge.txt
@@ -202,6 +202,10 @@ thp_collapse_alloc is incremented by khugepaged when it has found
a range of pages to collapse into one huge page and has
successfully allocated a new huge page to store the data.
 
+thp_write_alloc and thp_read_alloc are incremented every time a huge
+   page is successfully allocated to handle write(2) to a file or
+   read(2) from file.
+
 thp_fault_fallback is incremented if a page fault fails to allocate
a huge page and instead falls back to using small pages.
 
@@ -209,6 +213,9 @@ thp_collapse_alloc_failed is incremented if khugepaged found a range
of pages that should be collapsed into one huge page but failed
the allocation.
 
+thp_write_alloc_failed and thp_read_alloc_failed are incremented if
+   huge page allocation failed when tried on write(2) or read(2).
+
 thp_split is incremented every time a huge page is split into base
pages. This can happen for a variety of reasons but a common
reason is that a huge page is old and is being reclaimed.
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 4dc66c9..9a0a114 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -183,6 +183,11 @@ extern int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vm
 #define HPAGE_PMD_MASK ({ BUILD_BUG(); 0; })
 #define HPAGE_PMD_SIZE ({ BUILD_BUG(); 0; })
 
+#define THP_WRITE_ALLOC({ BUILD_BUG(); 0; })
+#define THP_WRITE_ALLOC_FAILED ({ BUILD_BUG(); 0; })
+#define THP_READ_ALLOC ({ BUILD_BUG(); 0; })
+#define THP_READ_ALLOC_FAILED  ({ BUILD_BUG(); 0; })
+
 #define hpage_nr_pages(x) 1
 
 #define transparent_hugepage_enabled(__vma) 0
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 1855f0a..8e071bb 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -66,6 +66,10 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
THP_FAULT_FALLBACK,
THP_COLLAPSE_ALLOC,
THP_COLLAPSE_ALLOC_FAILED,
+   THP_WRITE_ALLOC,
+   THP_WRITE_ALLOC_FAILED,
+   THP_READ_ALLOC,
+   THP_READ_ALLOC_FAILED,
THP_SPLIT,
THP_ZERO_PAGE_ALLOC,
THP_ZERO_PAGE_ALLOC_FAILED,
diff --git a/mm/vmstat.c b/mm/vmstat.c
index ffe3fbd..a80ea59 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -815,6 +815,10 @@ const char * const vmstat_text[] = {
"thp_fault_fallback",
"thp_collapse_alloc",
"thp_collapse_alloc_failed",
+   "thp_write_alloc",
+   "thp_write_alloc_failed",
+   "thp_read_alloc",
+   "thp_read_alloc_failed",
"thp_split",
"thp_zero_page_alloc",
"thp_zero_page_alloc_failed",
-- 
1.8.3.2



[PATCH 06/23] thp, mm: rewrite add_to_page_cache_locked() to support huge pages

2013-08-03 Thread Kirill A. Shutemov
From: "Kirill A. Shutemov" 

For a huge page we add HPAGE_CACHE_NR pages to the radix tree at once: the
head page for the specified index and HPAGE_CACHE_NR-1 tail pages for the
following indexes.
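
The all-or-nothing insertion with rollback that the diff implements can
be modelled in userspace roughly as follows. This is a sketch, not the
kernel code: a fixed array stands in for the radix tree, locking and
preloading are left out, and slot_insert(), insert_range() and the demo
helpers are hypothetical names.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define NSLOTS 16
static const void *slots[NSLOTS];   /* toy stand-in for the radix tree */

static int slot_insert(size_t idx, const void *item)
{
    assert(idx < NSLOTS && item != NULL);
    if (slots[idx])
        return -EEXIST;
    slots[idx] = item;
    return 0;
}

/* Insert nr consecutive entries starting at offset, or none at all. */
static int insert_range(size_t offset, const char *pages, int nr)
{
    int i, error = 0;

    for (i = 0; i < nr; i++) {
        error = slot_insert(offset + i, pages + i);
        if (error)
            goto rollback;
    }
    return 0;

rollback:
    if (i != 0)
        error = -ENOSPC;   /* collided in the middle: no room for a huge page */
    /* entry i was never inserted, so undo only entries i - 1 down to 0 */
    for (i--; i >= 0; i--)
        slots[offset + i] = NULL;
    return error;
}

/* Demo: a small page already sits at index 6; try a 4-entry insert at 4. */
static int demo_collision(void)
{
    static char small_page, huge[4];

    memset(slots, 0, sizeof(slots));
    slot_insert(6, &small_page);
    return insert_range(4, huge, 4);
}

static int demo_slot_used(size_t idx)
{
    assert(idx < NSLOTS);
    return slots[idx] != NULL;
}
```

In the demo, the insert fails with -ENOSPC, slots 4 and 5 are emptied
again, and the pre-existing entry at slot 6 is left untouched.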

Signed-off-by: Kirill A. Shutemov 
Acked-by: Dave Hansen 
---
 include/linux/huge_mm.h| 24 ++
 include/linux/page-flags.h | 33 ++
 mm/filemap.c   | 50 +++---
 3 files changed, 95 insertions(+), 12 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 1534e1e..4dc66c9 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -230,6 +230,20 @@ static inline int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_str
 
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE
+
+#define HPAGE_CACHE_ORDER  (HPAGE_SHIFT - PAGE_CACHE_SHIFT)
+#define HPAGE_CACHE_NR (1L << HPAGE_CACHE_ORDER)
+#define HPAGE_CACHE_INDEX_MASK (HPAGE_CACHE_NR - 1)
+
+#else
+
+#define HPAGE_CACHE_ORDER  ({ BUILD_BUG(); 0; })
+#define HPAGE_CACHE_NR ({ BUILD_BUG(); 0; })
+#define HPAGE_CACHE_INDEX_MASK ({ BUILD_BUG(); 0; })
+
+#endif
+
 static inline bool transparent_hugepage_pagecache(void)
 {
if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE))
@@ -238,4 +252,14 @@ static inline bool transparent_hugepage_pagecache(void)
return false;
return transparent_hugepage_flags & (1 RADIX_TREE_PRELOAD_NR);
+
+   nr = hpagecache_nr_pages(page);
+
+   error = radix_tree_maybe_preload_contig(nr, gfp_mask & ~__GFP_HIGHMEM);
if (error) {
mem_cgroup_uncharge_cache_page(page);
return error;
}
 
-   page_cache_get(page);
-   page->mapping = mapping;
-   page->index = offset;
-
spin_lock_irq(&mapping->tree_lock);
-   error = radix_tree_insert(&mapping->page_tree, offset, page);
+   page_cache_get(page);
+   for (i = 0; i < nr; i++) {
+   error = radix_tree_insert(&mapping->page_tree,
+   offset + i, page + i);
+   /*
+* In the middle of THP we can collide with small page which was
+* established before THP page cache is enabled or by other VMA
+* with bad alignment (most likely MAP_FIXED).
+*/
+   if (error)
+   goto err_insert;
+   page[i].index = offset + i;
+   page[i].mapping = mapping;
+   }
radix_tree_preload_end();
-   if (unlikely(error))
-   goto err_insert;
-   mapping->nrpages++;
-   __inc_zone_page_state(page, NR_FILE_PAGES);
+   mapping->nrpages += nr;
+   __mod_zone_page_state(page_zone(page), NR_FILE_PAGES, nr);
+   if (PageTransHuge(page))
+   __inc_zone_page_state(page, NR_FILE_TRANSPARENT_HUGEPAGES);
spin_unlock_irq(&mapping->tree_lock);
trace_mm_filemap_add_to_page_cache(page);
return 0;
 err_insert:
-   page->mapping = NULL;
-   /* Leave page->index set: truncation relies upon it */
+   radix_tree_preload_end();
+   if (i != 0)
+   error = -ENOSPC; /* no space for a huge page */
+
+   /* page[i] was not inserted to tree, skip it */
+   i--;
+
+   for (; i >= 0; i--) {
+   /* Leave page->index set: truncation relies upon it */
+   page[i].mapping = NULL;
+   radix_tree_delete(&mapping->page_tree, offset + i);
+   }
spin_unlock_irq(&mapping->tree_lock);
mem_cgroup_uncharge_cache_page(page);
page_cache_release(page);
-- 
1.8.3.2



[PATCH 11/23] thp, mm: handle tail pages in page_cache_get_speculative()

2013-08-03 Thread Kirill A. Shutemov
From: "Kirill A. Shutemov" 

For tail pages we need to take two references:
 - ->_count of the head page;
 - ->_mapcount of the tail page.

To protect against splitting we take the compound lock and re-check that
we still have a tail page before taking the ->_mapcount reference.
If the page was split, we drop the ->_count reference on the head page and
return 0 to tell the caller that it must retry.
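
The sequence described above can be sketched in user space. This is only
an illustration under stated assumptions: struct toy_page, its fields and
the simulate_split parameter are hypothetical stand-ins, the compound lock
is omitted because the sketch is single-threaded, and the counters start
at 0 rather than at the kernel's real initial values.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for struct page; all names here are illustrative only. */
struct toy_page {
	atomic_int count;       /* plays the role of ->_count */
	atomic_int mapcount;    /* plays the role of ->_mapcount */
	bool is_tail;           /* plays the role of PageTransTail() */
	struct toy_page *head;  /* head page, or the page itself */
};

/* Take a reference unless the count has already dropped to zero. */
static bool toy_get_unless_zero(struct toy_page *p)
{
	int old = atomic_load(&p->count);

	while (old != 0) {
		if (atomic_compare_exchange_weak(&p->count, &old, old + 1))
			return true;
	}
	return false;
}

/*
 * Returns 1 on success and 0 if the caller must retry.
 * simulate_split stands in for a split racing with us between the
 * first tail check and the re-check.
 */
static int toy_get_speculative(struct toy_page *page, bool simulate_split)
{
	struct toy_page *head;
	int got = 0;

	assert(page != NULL && page->head != NULL);
	head = page->head;

	if (!toy_get_unless_zero(head))
		return 0;       /* freed or about to be freed: retry */
	if (!page->is_tail)
		return 1;       /* head or small page: one reference is enough */

	if (simulate_split)
		page->is_tail = false;

	/* In the kernel this re-check runs under the compound lock. */
	if (page->is_tail) {
		atomic_fetch_add(&page->mapcount, 1);
		got = 1;
	} else {
		/* the page was split: give the head reference back */
		atomic_fetch_sub(&head->count, 1);
	}
	return got;
}
```

The point of the second check is that a failed attempt leaves both
counters exactly as it found them, so the caller can simply retry.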

Signed-off-by: Kirill A. Shutemov 
---
 include/linux/pagemap.h | 26 ++
 1 file changed, 22 insertions(+), 4 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 47b5082..d459b38 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -161,6 +161,8 @@ void release_pages(struct page **pages, int nr, int cold);
  */
 static inline int page_cache_get_speculative(struct page *page)
 {
+   struct page *page_head = compound_trans_head(page);
+
VM_BUG_ON(in_interrupt());
 
 #ifdef CONFIG_TINY_RCU
@@ -176,11 +178,11 @@ static inline int page_cache_get_speculative(struct page 
*page)
 * disabling preempt, and hence no need for the "speculative get" that
 * SMP requires.
 */
-   VM_BUG_ON(page_count(page) == 0);
-   atomic_inc(&page->_count);
+   VM_BUG_ON(page_count(page_head) == 0);
+   atomic_inc(&page_head->_count);
 
 #else
-   if (unlikely(!get_page_unless_zero(page))) {
+   if (unlikely(!get_page_unless_zero(page_head))) {
/*
 * Either the page has been freed, or will be freed.
 * In either case, retry here and the caller should
@@ -189,7 +191,23 @@ static inline int page_cache_get_speculative(struct page 
*page)
return 0;
}
 #endif
-   VM_BUG_ON(PageTail(page));
+
+   if (unlikely(PageTransTail(page))) {
+   unsigned long flags;
+   int got = 0;
+
+   flags = compound_lock_irqsave(page_head);
+   if (likely(PageTransTail(page))) {
+   atomic_inc(&page->_mapcount);
+   got = 1;
+   }
+   compound_unlock_irqrestore(page_head, flags);
+
+   if (unlikely(!got))
+   atomic_dec(&page_head->_count);
+
+   return got;
+   }
 
return 1;
 }
-- 
1.8.3.2



[PATCH 10/23] thp, mm: warn if we try to use replace_page_cache_page() with THP

2013-08-03 Thread Kirill A. Shutemov
From: "Kirill A. Shutemov" 

replace_page_cache_page() is only used by FUSE. It's unlikely that we
will support THP in FUSE page cache any time soon.

Let's postpone implementation of THP handling in replace_page_cache_page()
until someone needs it. Return -EINVAL and WARN_ONCE() for now.

Signed-off-by: Kirill A. Shutemov 
---
 mm/filemap.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/mm/filemap.c b/mm/filemap.c
index b75bdf5..28f4927 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -418,6 +418,10 @@ int replace_page_cache_page(struct page *old, struct page 
*new, gfp_t gfp_mask)
 {
int error;
 
+   if (WARN_ONCE(PageTransHuge(old) || PageTransHuge(new),
+"unexpected transhuge page\n"))
+   return -EINVAL;
+
VM_BUG_ON(!PageLocked(old));
VM_BUG_ON(!PageLocked(new));
VM_BUG_ON(new->mapping);
-- 
1.8.3.2



[PATCH 15/23] mm, fs: avoid page allocation beyond i_size on read

2013-08-03 Thread Kirill A. Shutemov
From: "Kirill A. Shutemov" 

I've noticed that we allocate an unneeded page for the page cache on a
read beyond i_size. Simple test case (I checked it on ramfs):

$ touch testfile
$ cat testfile

It triggers the 'no_cached_page' code path in do_generic_file_read().

Looks like it's a regression since commit a32ea1e. Let's fix it.
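
As a rough user-space illustration of the bounds check this patch adds,
the predicate below answers "does page 'index' lie entirely beyond the
end of a file of 'isize' bytes?". The function name, the unsigned types
and the 4 KiB page size are assumptions made for the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12	/* assumes 4 KiB pages */

/* True when there is nothing to read at page 'index'. */
static bool toy_index_beyond_isize(uint64_t isize, uint64_t index)
{
	/* an empty file has no last byte, so isize - 1 must not be used */
	if (isize == 0)
		return true;
	/* (isize - 1) >> shift is the index of the page holding the last byte */
	return index > ((isize - 1) >> TOY_PAGE_SHIFT);
}
```

With the empty 'testfile' above, isize is 0 and the check fires for
index 0, so no page needs to be allocated.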

Signed-off-by: Kirill A. Shutemov 
Cc: NeilBrown 
---
 mm/filemap.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/mm/filemap.c b/mm/filemap.c
index 066bbff..c31d296 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1163,6 +1163,10 @@ static void do_generic_file_read(struct file *filp, 
loff_t *ppos,
loff_t isize;
unsigned long nr, ret;
 
+   isize = i_size_read(inode);
+   if (!isize || index > (isize - 1) >> PAGE_CACHE_SHIFT)
+   goto out;
+
cond_resched();
 find_page:
page = find_get_page(mapping, index);
-- 
1.8.3.2



[PATCH 20/23] thp: handle file pages in split_huge_page()

2013-08-03 Thread Kirill A. Shutemov
From: "Kirill A. Shutemov" 

The base scheme is the same as for anonymous pages, but we walk
mapping->i_mmap rather than anon_vma->rb_root.

When we add a huge page to page cache we take only a reference to the head
page, but on split we need to take an additional reference to all tail pages
since they are still in page cache after splitting.
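
To make the extra reference concrete, here is a small sketch of the sum
that is added to each tail page's count on split. It mirrors the
expression used in __split_huge_page_refcount() in the diff below; the
function and parameter names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Number of references added to a tail page when a huge page is split.
 * Names are illustrative; only the arithmetic follows the patch.
 */
static int toy_tail_refs_to_add(bool is_anon, int head_mapcount,
				int tail_mapcount)
{
	/* file pages stay in page cache after the split: one extra pin */
	int initial_tail_refcount = is_anon ? 0 : 1;

	assert(head_mapcount >= 0 && tail_mapcount >= 0);
	return initial_tail_refcount + head_mapcount + tail_mapcount + 1;
}
```

For the same mapcounts, a file-backed tail page therefore ends up with
exactly one more reference than an anonymous one.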

Signed-off-by: Kirill A. Shutemov 
---
 mm/huge_memory.c | 89 +++-
 1 file changed, 76 insertions(+), 13 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 523946c..d7c6830 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1580,6 +1580,7 @@ static void __split_huge_page_refcount(struct page *page,
struct zone *zone = page_zone(page);
struct lruvec *lruvec;
int tail_count = 0;
+   int initial_tail_refcount;
 
/* prevent PageLRU to go away from under us, and freeze lru stats */
spin_lock_irq(&zone->lru_lock);
@@ -1589,6 +1590,13 @@ static void __split_huge_page_refcount(struct page *page,
/* complete memcg works before add pages to LRU */
mem_cgroup_split_huge_fixup(page);
 
+   /*
+* When we add a huge page to page cache we take only a reference to the
+* head page, but on split we need to take an additional reference to all
+* tail pages since they are still in page cache after splitting.
+*/
+   initial_tail_refcount = PageAnon(page) ? 0 : 1;
+
for (i = HPAGE_PMD_NR - 1; i >= 1; i--) {
struct page *page_tail = page + i;
 
@@ -1611,8 +1619,9 @@ static void __split_huge_page_refcount(struct page *page,
 * atomic_set() here would be safe on all archs (and
 * not only on x86), it's safer to use atomic_add().
 */
-   atomic_add(page_mapcount(page) + page_mapcount(page_tail) + 1,
-  &page_tail->_count);
+   atomic_add(initial_tail_refcount + page_mapcount(page) +
+   page_mapcount(page_tail) + 1,
+   &page_tail->_count);
 
/* after clearing PageTail the gup refcount can be released */
smp_mb();
@@ -1651,23 +1660,23 @@ static void __split_huge_page_refcount(struct page 
*page,
*/
page_tail->_mapcount = page->_mapcount;
 
-   BUG_ON(page_tail->mapping);
page_tail->mapping = page->mapping;
 
page_tail->index = page->index + i;
page_nid_xchg_last(page_tail, page_nid_last(page));
 
-   BUG_ON(!PageAnon(page_tail));
BUG_ON(!PageUptodate(page_tail));
BUG_ON(!PageDirty(page_tail));
-   BUG_ON(!PageSwapBacked(page_tail));
 
lru_add_page_tail(page, page_tail, lruvec, list);
}
atomic_sub(tail_count, &page->_count);
BUG_ON(atomic_read(&page->_count) <= 0);
 
-   __mod_zone_page_state(zone, NR_ANON_TRANSPARENT_HUGEPAGES, -1);
+   if (PageAnon(page))
+   __mod_zone_page_state(zone, NR_ANON_TRANSPARENT_HUGEPAGES, -1);
+   else
+   __mod_zone_page_state(zone, NR_FILE_TRANSPARENT_HUGEPAGES, -1);
 
ClearPageCompound(page);
compound_unlock(page);
@@ -1767,7 +1776,7 @@ static int __split_huge_page_map(struct page *page,
 }
 
 /* must be called with anon_vma->root->rwsem held */
-static void __split_huge_page(struct page *page,
+static void __split_anon_huge_page(struct page *page,
  struct anon_vma *anon_vma,
  struct list_head *list)
 {
@@ -1791,7 +1800,7 @@ static void __split_huge_page(struct page *page,
 * and establishes a child pmd before
 * __split_huge_page_splitting() freezes the parent pmd (so if
 * we fail to prevent copy_huge_pmd() from running until the
-* whole __split_huge_page() is complete), we will still see
+* whole __split_anon_huge_page() is complete), we will still see
 * the newly established pmd of the child later during the
 * walk, to be able to set it as pmd_trans_splitting too.
 */
@@ -1822,14 +1831,11 @@ static void __split_huge_page(struct page *page,
  * from the hugepage.
  * Return 0 if the hugepage is split successfully otherwise return 1.
  */
-int split_huge_page_to_list(struct page *page, struct list_head *list)
+static int split_anon_huge_page(struct page *page, struct list_head *list)
 {
struct anon_vma *anon_vma;
int ret = 1;
 
-   BUG_ON(is_huge_zero_page(page));
-   BUG_ON(!PageAnon(page));
-
/*
 * The caller does not necessarily hold an mmap_sem that would prevent
 * the anon_vma disappearing so we first we take a reference to it
@@ -1847,7 +1853,7 @@ int split_huge_page_to_list(struct page *page, struct 
list_head *list)
goto out_unlock;
 
BUG_ON(!PageSwapBacked(pag

[PATCHv5 00/23] Transparent huge page cache: phase 1, everything but mmap()

2013-08-03 Thread Kirill A. Shutemov
From: "Kirill A. Shutemov" 

This is the second part of my transparent huge page cache work.
It brings thp support to ramfs, but without mmap(), which will be posted
separately.

Intro
-----

The goal of the project is to prepare kernel infrastructure to handle huge
pages in page cache.

To prove that the proposed changes are functional we enable the feature
for the simplest file system -- ramfs. ramfs is not that useful by
itself, but it's a good pilot project.

Design overview
---------------

Every huge page is represented in page cache radix-tree by HPAGE_PMD_NR
(512 on x86-64) entries: one entry for head page and HPAGE_PMD_NR-1 entries
for tail pages.

Radix tree manipulations are implemented in a batched way: we add and remove
a whole huge page at once, under one tree_lock. To make it possible, we
extended the radix-tree interface to pre-allocate enough memory to
insert a number of *contiguous* elements (kudos to Matthew Wilcox).
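
The all-or-nothing insert can be sketched in user space with a flat
array standing in for the radix tree. Everything named toy_* is
hypothetical, and locking and preloading are left out; the error codes
follow the convention of the add_to_page_cache_locked() rewrite in this
series (-ENOSPC when the collision happens in the middle of the range).

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define TOY_SLOTS 1024

/* Flat array standing in for the page cache radix tree. */
struct toy_tree {
	const void *slot[TOY_SLOTS];
};

/*
 * Insert 'nr' contiguous entries starting at 'offset', or none at all.
 * Returns 0 on success, -EEXIST if the first slot is already taken and
 * -ENOSPC if a later slot is taken (no room for the whole huge page).
 */
static int toy_insert_range(struct toy_tree *t, size_t offset, size_t nr,
			    const char *pages)
{
	size_t i;
	int error;

	assert(nr > 0 && nr <= TOY_SLOTS && offset <= TOY_SLOTS - nr);

	for (i = 0; i < nr; i++) {
		if (t->slot[offset + i] != NULL)
			break;
		t->slot[offset + i] = pages + i;
	}
	if (i == nr)
		return 0;

	/* slot offset + i was not ours; undo the i entries before it */
	error = i ? -ENOSPC : -EEXIST;
	while (i-- > 0)
		t->slot[offset + i] = NULL;
	return error;
}
```

A caller that gets -ENOSPC can fall back to inserting a single small
page, which is the fallback the read path in this series takes.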

Huge pages can be added to page cache in three ways:
 - write(2) to file or page;
 - read(2) from sparse file;
 - fault sparse file.

Potentially, one more way is collapsing small pages, but that's outside the
initial implementation.

For now we still write/read at most PAGE_CACHE_SIZE bytes at a time. There's
some room for speed up later.

Since mmap() isn't targeted for this patchset, we just split a huge page on
page fault.

To minimize memory overhead for small files we set up a fops->release helper
-- simple_thp_release() -- which splits the last page in the file when the
last writer goes away.

truncate_inode_pages_range() drops whole huge page at once if it's fully
inside the range. If a huge page is only partly in the range we zero out
the part, exactly like we do for partial small pages.
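
A simplified sketch of that decision for a single huge page, using byte
offsets with an inclusive end as truncate_inode_pages_range() does. The
toy_* names and the 2 MiB huge page size are assumptions; the real code
works on page indexes and pagevecs and is considerably more involved.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_HPAGE_SIZE (2ULL << 20)	/* assumes 2 MiB huge pages */

enum toy_action { TOY_SKIP, TOY_DROP, TOY_ZERO_PART };

struct toy_result {
	enum toy_action action;
	uint64_t zero_from;	/* offset inside the huge page */
	uint64_t zero_len;	/* only valid for TOY_ZERO_PART */
};

/*
 * Decide what to do with the huge page starting at byte 'hstart' when
 * truncating the byte range [lstart, lend] (both ends inclusive).
 */
static struct toy_result toy_truncate_huge(uint64_t hstart, uint64_t lstart,
					   uint64_t lend)
{
	struct toy_result r = { TOY_SKIP, 0, 0 };
	uint64_t hlast = hstart + TOY_HPAGE_SIZE - 1;
	uint64_t from, to;

	assert(lstart <= lend);

	if (lend < hstart || lstart > hlast)
		return r;			/* no overlap at all */
	if (lstart <= hstart && lend >= hlast) {
		r.action = TOY_DROP;		/* fully inside the range */
		return r;
	}

	/* partial overlap: zero only the bytes that are in the range */
	from = lstart > hstart ? lstart : hstart;
	to = lend < hlast ? lend : hlast;
	r.action = TOY_ZERO_PART;
	r.zero_from = from - hstart;
	r.zero_len = to - from + 1;
	return r;
}
```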

split_huge_page() for file pages works similarly to anon pages, but we
walk mapping->i_mmap rather than anon_vma->rb_root. At the end we call
truncate_inode_pages() to drop small pages beyond i_size, if any.

The locking model around split_huge_page() is rather complicated and I
still don't feel confident enough with it. Looks like we need to
serialize over i_mutex in split_huge_page(), but it breaks the locking
order for i_mutex->mmap_sem. I don't see how it can be fixed easily.
Any ideas are welcome.

Performance indicators will be posted separately.

Please review.

Kirill A. Shutemov (23):
  radix-tree: implement preload for multiple contiguous elements
  memcg, thp: charge huge cache pages
  thp: compile-time and sysfs knob for thp pagecache
  thp, mm: introduce mapping_can_have_hugepages() predicate
  thp: represent file thp pages in meminfo and friends
  thp, mm: rewrite add_to_page_cache_locked() to support huge pages
  mm: trace filemap: dump page order
  block: implement add_bdi_stat()
  thp, mm: rewrite delete_from_page_cache() to support huge pages
  thp, mm: warn if we try to use replace_page_cache_page() with THP
  thp, mm: handle tail pages in page_cache_get_speculative()
  thp, mm: add event counters for huge page alloc on file write or read
  thp, mm: allocate huge pages in grab_cache_page_write_begin()
  thp, mm: naive support of thp in generic_perform_write
  mm, fs: avoid page allocation beyond i_size on read
  thp, mm: handle transhuge pages in do_generic_file_read()
  thp, libfs: initial thp support
  thp: libfs: introduce simple_thp_release()
  truncate: support huge pages
  thp: handle file pages in split_huge_page()
  thp: wait_split_huge_page(): serialize over i_mmap_mutex too
  thp, mm: split huge page on mmap file page
  ramfs: enable transparent huge page cache

 Documentation/vm/transhuge.txt |  16 
 drivers/base/node.c|   4 +
 fs/libfs.c |  80 ++-
 fs/proc/meminfo.c  |   3 +
 fs/ramfs/file-mmu.c|   3 +-
 fs/ramfs/inode.c   |   6 +-
 include/linux/backing-dev.h|  10 +++
 include/linux/fs.h |  10 +++
 include/linux/huge_mm.h|  53 -
 include/linux/mmzone.h |   1 +
 include/linux/page-flags.h |  33 
 include/linux/pagemap.h|  48 +++-
 include/linux/radix-tree.h |  11 +++
 include/linux/vm_event_item.h  |   4 +
 include/trace/events/filemap.h |   7 +-
 lib/radix-tree.c   |  41 +++---
 mm/Kconfig |  12 +++
 mm/filemap.c   | 171 +++--
 mm/huge_memory.c   | 116 
 mm/memcontrol.c|   2 -
 mm/memory.c|   4 +-
 mm/truncate.c  | 108 --
 mm/vmstat.c|   5 ++
 23 files changed, 658 insertions(+), 90 deletions(-)

-- 
1.8.3.2



[PATCH 18/23] thp: libfs: introduce simple_thp_release()

2013-08-03 Thread Kirill A. Shutemov
From: "Kirill A. Shutemov" 

simple_thp_release() is a dummy implementation of fops->release with
transparent huge page support. It's required to minimize memory overhead
of huge pages for small files.

It checks whether we should split the last page in the file to give
memory back to the system.

We split the page if it meets the following criteria:
 - nobody has the file open for write;
 - splitting will actually free some memory (at least one small page);
 - it's a huge page ;)
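
The three criteria can be written down as one predicate. This is a
sketch that follows the arithmetic of the patch below, with hypothetical
names and the assumption of 4 KiB small pages and 512 small pages per
huge page (x86-64).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT		12	/* assumes 4 KiB pages */
#define TOY_HPAGE_NR		512u	/* small pages per huge page */
#define TOY_HPAGE_INDEX_MASK	(TOY_HPAGE_NR - 1)

/*
 * writecount:        writers of the inode, including the file being closed
 * closing_is_writer: the file being released was opened for write
 * isize:             file size in bytes
 * last_is_huge:      the page covering the end of the file is a huge page
 */
static bool toy_should_split(int writecount, bool closing_is_writer,
			     uint64_t isize, bool last_is_huge)
{
	uint64_t last_index = isize >> TOY_PAGE_SHIFT;

	/* somebody other than us still writes to the file */
	if (writecount != (closing_is_writer ? 1 : 0))
		return false;
	/* the last huge page is used up to its final small page */
	if ((last_index & TOY_HPAGE_INDEX_MASK) + 1 == TOY_HPAGE_NR)
		return false;
	return last_is_huge;
}
```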

Signed-off-by: Kirill A. Shutemov 
---
 fs/libfs.c | 27 +++
 include/linux/fs.h |  2 ++
 2 files changed, 29 insertions(+)

diff --git a/fs/libfs.c b/fs/libfs.c
index 934778b..c43b055 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -488,6 +488,33 @@ int simple_thp_write_begin(struct file *file, struct 
address_space *mapping,
}
return 0;
 }
+
+int simple_thp_release(struct inode *inode, struct file *file)
+{
+   pgoff_t last_index;
+   struct page *page;
+
+   /* check if anybody still writes to file */
+   if (atomic_read(&inode->i_writecount) != !!(file->f_mode & FMODE_WRITE))
+   return 0;
+
+   last_index = i_size_read(inode) >> PAGE_CACHE_SHIFT;
+
+   /* check if splitting the page will free any memory */
+   if ((last_index & HPAGE_CACHE_INDEX_MASK) + 1 == HPAGE_CACHE_NR)
+   return 0;
+
+   page = find_get_page(file->f_mapping,
+   last_index & ~HPAGE_CACHE_INDEX_MASK);
+   if (!page)
+   return 0;
+
+   if (PageTransHuge(page))
+   split_huge_page(page);
+
+   page_cache_release(page);
+   return 0;
+}
 #endif
 
 /*
diff --git a/include/linux/fs.h b/include/linux/fs.h
index c1dbf43..b594f10 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2557,8 +2557,10 @@ extern int simple_write_end(struct file *file, struct 
address_space *mapping,
 extern int simple_thp_write_begin(struct file *file,
struct address_space *mapping, loff_t pos, unsigned len,
unsigned flags, struct page **pagep, void **fsdata);
+extern int simple_thp_release(struct inode *inode, struct file *file);
 #else
 #define simple_thp_write_begin simple_write_begin
+#define simple_thp_release NULL
 #endif
 
 extern struct dentry *simple_lookup(struct inode *, struct dentry *, unsigned 
int flags);
-- 
1.8.3.2



[PATCH 19/23] truncate: support huge pages

2013-08-03 Thread Kirill A. Shutemov
From: "Kirill A. Shutemov" 

truncate_inode_pages_range() drops whole huge page at once if it's fully
inside the range.

If a huge page is only partly in the range we zero out the part,
exactly like we do for partial small pages.

invalidate_mapping_pages() just skips huge pages if they are not fully
in the range.

Signed-off-by: Kirill A. Shutemov 
---
 mm/truncate.c | 108 +-
 1 file changed, 84 insertions(+), 24 deletions(-)

diff --git a/mm/truncate.c b/mm/truncate.c
index 353b683..fcef7cb 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -205,8 +205,7 @@ void truncate_inode_pages_range(struct address_space 
*mapping,
 {
pgoff_t start;  /* inclusive */
pgoff_t end;/* exclusive */
-   unsigned intpartial_start;  /* inclusive */
-   unsigned intpartial_end;/* exclusive */
+   boolpartial_thp_start = false, partial_thp_end = false;
struct pagevec  pvec;
pgoff_t index;
int i;
@@ -215,15 +214,9 @@ void truncate_inode_pages_range(struct address_space 
*mapping,
if (mapping->nrpages == 0)
return;
 
-   /* Offsets within partial pages */
-   partial_start = lstart & (PAGE_CACHE_SIZE - 1);
-   partial_end = (lend + 1) & (PAGE_CACHE_SIZE - 1);
-
/*
 * 'start' and 'end' always covers the range of pages to be fully
-* truncated. Partial pages are covered with 'partial_start' at the
-* start of the range and 'partial_end' at the end of the range.
-* Note that 'end' is exclusive while 'lend' is inclusive.
+* truncated. Note that 'end' is exclusive while 'lend' is inclusive.
 */
start = (lstart + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
if (lend == -1)
@@ -249,6 +242,23 @@ void truncate_inode_pages_range(struct address_space 
*mapping,
if (index >= end)
break;
 
+   if (PageTransTailCache(page)) {
+   /* part of already handled huge page */
+   if (!page->mapping)
+   continue;
+   /* the range starts in middle of huge page */
+   partial_thp_start = true;
+   start = index & ~HPAGE_CACHE_INDEX_MASK;
+   continue;
+   }
+   /* the range ends on huge page */
+   if (PageTransHugeCache(page) &&
+   index == (end & ~HPAGE_CACHE_INDEX_MASK)) {
+   partial_thp_end = true;
+   end = index;
+   break;
+   }
+
if (!trylock_page(page))
continue;
WARN_ON(page->index != index);
@@ -265,34 +275,74 @@ void truncate_inode_pages_range(struct address_space 
*mapping,
index++;
}
 
-   if (partial_start) {
-   struct page *page = find_lock_page(mapping, start - 1);
+   if (partial_thp_start || lstart & ~PAGE_CACHE_MASK) {
+   pgoff_t off;
+   struct page *page;
+   unsigned pstart, pend;
+   void (*zero_segment)(struct page *page,
+   unsigned start, unsigned len);
+retry_partial_start:
+   if (partial_thp_start) {
+   zero_segment = zero_huge_user_segment;
+   off = (start - 1) & ~HPAGE_CACHE_INDEX_MASK;
+   pstart = lstart & ~HPAGE_PMD_MASK;
+   if ((end & ~HPAGE_CACHE_INDEX_MASK) == off)
+   pend = (lend - 1) & ~HPAGE_PMD_MASK;
+   else
+   pend = HPAGE_PMD_SIZE;
+   } else {
+   zero_segment = zero_user_segment;
+   off = start - 1;
+   pstart = lstart & ~PAGE_CACHE_MASK;
+   if (start > end)
+   pend = (lend - 1) & ~PAGE_CACHE_MASK;
+   else
+   pend = PAGE_CACHE_SIZE;
+   }
+
+   page = find_get_page(mapping, off);
if (page) {
-   unsigned int top = PAGE_CACHE_SIZE;
-   if (start > end) {
-   /* Truncation within a single page */
-   top = partial_end;
-   partial_end = 0;
+   /* the last tail page*/
+   if (PageTransTailCache(page)) {
+   partial_thp_start = true;
+   page_cache_release(page);

[PATCH 21/23] thp: wait_split_huge_page(): serialize over i_mmap_mutex too

2013-08-03 Thread Kirill A. Shutemov
From: "Kirill A. Shutemov" 

We're going to have huge pages backed by files, so we need to modify
wait_split_huge_page() to support that.

We have two options:
 - check whether the page is anon or not and serialize only over the
   required lock;
 - always serialize over both locks.

The current implementation, in fact, guarantees that *no* page on the vma
is splitting, not only the page the pmd points to.

For now I prefer the second option since it's the safest: we provide the
same level of guarantees.

Signed-off-by: Kirill A. Shutemov 
---
 include/linux/huge_mm.h | 15 ---
 mm/huge_memory.c|  4 ++--
 mm/memory.c |  4 ++--
 3 files changed, 16 insertions(+), 7 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 9a0a114..186f4f2 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -111,11 +111,20 @@ extern void __split_huge_page_pmd(struct vm_area_struct 
*vma,
__split_huge_page_pmd(__vma, __address, \
pmd);   \
}  while (0)
-#define wait_split_huge_page(__anon_vma, __pmd) \
+#define wait_split_huge_page(__vma, __pmd) \
do {\
pmd_t *pmd = (__pmd);   \
-   anon_vma_lock_write(__anon_vma);\
-   anon_vma_unlock_write(__anon_vma);  \
+   struct address_space *__mapping = (__vma)->vm_file ?\
+   (__vma)->vm_file->f_mapping : NULL; \
+   struct anon_vma *__anon_vma = (__vma)->anon_vma;\
+   if (__mapping)  \
+   mutex_lock(&__mapping->i_mmap_mutex);   \
+   if (__anon_vma) {   \
+   anon_vma_lock_write(__anon_vma);\
+   anon_vma_unlock_write(__anon_vma);  \
+   }   \
+   if (__mapping)  \
+   mutex_unlock(&__mapping->i_mmap_mutex); \
BUG_ON(pmd_trans_splitting(*pmd) || \
   pmd_trans_huge(*pmd));   \
} while (0)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d7c6830..9af643d 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -911,7 +911,7 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct 
mm_struct *src_mm,
spin_unlock(&dst_mm->page_table_lock);
pte_free(dst_mm, pgtable);
 
-   wait_split_huge_page(vma->anon_vma, src_pmd); /* src_vma */
+   wait_split_huge_page(vma, src_pmd); /* src_vma */
goto out;
}
src_page = pmd_page(pmd);
@@ -1493,7 +1493,7 @@ int __pmd_trans_huge_lock(pmd_t *pmd, struct 
vm_area_struct *vma)
if (likely(pmd_trans_huge(*pmd))) {
if (unlikely(pmd_trans_splitting(*pmd))) {
spin_unlock(&vma->vm_mm->page_table_lock);
-   wait_split_huge_page(vma->anon_vma, pmd);
+   wait_split_huge_page(vma, pmd);
return -1;
} else {
/* Thp mapped by 'pmd' is stable, so we can
diff --git a/mm/memory.c b/mm/memory.c
index 7d35f90..ea74ab1 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -609,7 +609,7 @@ int __pte_alloc(struct mm_struct *mm, struct vm_area_struct 
*vma,
if (new)
pte_free(mm, new);
if (wait_split_huge_page)
-   wait_split_huge_page(vma->anon_vma, pmd);
+   wait_split_huge_page(vma, pmd);
return 0;
 }
 
@@ -1522,7 +1522,7 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
if (likely(pmd_trans_huge(*pmd))) {
if (unlikely(pmd_trans_splitting(*pmd))) {
spin_unlock(&mm->page_table_lock);
-   wait_split_huge_page(vma->anon_vma, pmd);
+   wait_split_huge_page(vma, pmd);
} else {
page = follow_trans_huge_pmd(vma, address,
 pmd, flags);
-- 
1.8.3.2



[PATCH 22/23] thp, mm: split huge page on mmap file page

2013-08-03 Thread Kirill A. Shutemov
From: "Kirill A. Shutemov" 

We are not ready to mmap file-backed transparent huge pages. Let's split
them on fault attempt.

Later we'll implement mmap() properly and this code path will be used for
fallback cases.

Signed-off-by: Kirill A. Shutemov 
---
 mm/filemap.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/filemap.c b/mm/filemap.c
index ed65af5..f7857ef 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1743,6 +1743,8 @@ retry_find:
goto no_cached_page;
}
 
+   if (PageTransCompound(page))
+   split_huge_page(compound_trans_head(page));
if (!lock_page_or_retry(page, vma->vm_mm, vmf->flags)) {
page_cache_release(page);
return ret | VM_FAULT_RETRY;
-- 
1.8.3.2



[PATCH 16/23] thp, mm: handle transhuge pages in do_generic_file_read()

2013-08-03 Thread Kirill A. Shutemov
From: "Kirill A. Shutemov" 

If a transhuge page is already in page cache (up to date and not
readahead) we take the usual path: read from the relevant subpage (head or
tail).

If the page is not cached (sparse file in the ramfs case) and the mapping
can have huge pages, we try to allocate a new one and read it.

If a page is not up to date or in readahead, we have to move 'page' to
the head page of the compound page, since it represents the state of the
whole transhuge page. We will switch back to the relevant subpage when the
page is ready to be read ('page_ok' label).
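
Switching between the head page and the relevant subpage is plain index
arithmetic. A minimal sketch, assuming 512 small pages per huge page;
the toy_* names are made up for the illustration.

```c
#include <assert.h>

#define TOY_HPAGE_NR		512UL	/* small pages per huge page */
#define TOY_HPAGE_INDEX_MASK	(TOY_HPAGE_NR - 1)

/* Page cache index of the head page covering 'index'. */
static unsigned long toy_head_index(unsigned long index)
{
	return index & ~TOY_HPAGE_INDEX_MASK;
}

/* Position of 'index' inside its huge page; 0 means the head page. */
static unsigned long toy_subpage_offset(unsigned long index)
{
	unsigned long off = index & TOY_HPAGE_INDEX_MASK;

	assert(off < TOY_HPAGE_NR);
	return off;
}
```

Going back from the head to the subpage is then "head + offset", which
is what the 'page_ok' hunk below does with 'page += off'.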

Signed-off-by: Kirill A. Shutemov 
---
 mm/filemap.c | 57 +++--
 1 file changed, 55 insertions(+), 2 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index c31d296..ed65af5 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1175,8 +1175,28 @@ find_page:
ra, filp,
index, last_index - index);
page = find_get_page(mapping, index);
-   if (unlikely(page == NULL))
-   goto no_cached_page;
+   if (unlikely(page == NULL)) {
+   if (mapping_can_have_hugepages(mapping))
+   goto no_cached_page_thp;
+   else
+   goto no_cached_page;
+   }
+   }
+   if (PageTransCompound(page)) {
+   struct page *head = compound_trans_head(page);
+
+   if (!PageReadahead(head) && PageUptodate(page))
+   goto page_ok;
+
+   /*
+* Switch 'page' to head page. That's needed to handle
+* readahead or make page uptodate.
+* It will be switched back to the right tail page at
+* the beginning of 'page_ok'.
+*/
+   page_cache_get(head);
+   page_cache_release(page);
+   page = head;
}
if (PageReadahead(page)) {
page_cache_async_readahead(mapping,
@@ -1198,6 +1218,18 @@ find_page:
unlock_page(page);
}
 page_ok:
+   /* Switch back to relevant tail page, if needed */
+   if (PageTransCompoundCache(page) && !PageTransTail(page)) {
+   int off = index & HPAGE_CACHE_INDEX_MASK;
+   if (off){
+   page_cache_get(page + off);
+   page_cache_release(page);
+   page += off;
+   }
+   }
+
+   VM_BUG_ON(page->index != index);
+
/*
 * i_size must be checked after we know the page is Uptodate.
 *
@@ -1329,6 +1361,27 @@ readpage_error:
page_cache_release(page);
goto out;
 
+no_cached_page_thp:
+   page = alloc_pages(mapping_gfp_mask(mapping) | __GFP_COLD,
+   HPAGE_PMD_ORDER);
+   if (!page) {
+   count_vm_event(THP_READ_ALLOC_FAILED);
+   goto no_cached_page;
+   }
+   count_vm_event(THP_READ_ALLOC);
+
+   error = add_to_page_cache_lru(page, mapping,
+   index & ~HPAGE_CACHE_INDEX_MASK, GFP_KERNEL);
+   if (!error)
+   goto readpage;
+
+   page_cache_release(page);
+   if (error != -EEXIST && error != -ENOSPC) {
+   desc->error = error;
+   goto out;
+   }
+
+   /* Fallback to small page */
 no_cached_page:
/*
 * Ok, it wasn't cached, so we need to create a new
-- 
1.8.3.2



[PATCH 17/23] thp, libfs: initial thp support

2013-08-03 Thread Kirill A. Shutemov
From: "Kirill A. Shutemov" 

simple_readpage() and simple_write_end() are modified to handle huge
pages.

simple_thp_write_begin() is introduced to allocate huge pages on write.

Signed-off-by: Kirill A. Shutemov 
---
 fs/libfs.c  | 53 +
 include/linux/fs.h  |  7 +++
 include/linux/pagemap.h |  8 
 3 files changed, 64 insertions(+), 4 deletions(-)

diff --git a/fs/libfs.c b/fs/libfs.c
index 3a3a9b5..934778b 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -364,7 +364,7 @@ EXPORT_SYMBOL(simple_setattr);
 
 int simple_readpage(struct file *file, struct page *page)
 {
-   clear_highpage(page);
+   clear_pagecache_page(page);
flush_dcache_page(page);
SetPageUptodate(page);
unlock_page(page);
@@ -424,9 +424,14 @@ int simple_write_end(struct file *file, struct 
address_space *mapping,
 
/* zero the stale part of the page if we did a short copy */
if (copied < len) {
-   unsigned from = pos & (PAGE_CACHE_SIZE - 1);
-
-   zero_user(page, from + copied, len - copied);
+   unsigned from;
+   if (PageTransHuge(page)) {
+   from = pos & ~HPAGE_PMD_MASK;
+   zero_huge_user(page, from + copied, len - copied);
+   } else {
+   from = pos & ~PAGE_CACHE_MASK;
+   zero_user(page, from + copied, len - copied);
+   }
}
 
if (!PageUptodate(page))
@@ -445,6 +450,46 @@ int simple_write_end(struct file *file, struct 
address_space *mapping,
return copied;
 }
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE
+int simple_thp_write_begin(struct file *file, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned flags,
+   struct page **pagep, void **fsdata)
+{
+   struct page *page = NULL;
+   pgoff_t index;
+
+   index = pos >> PAGE_CACHE_SHIFT;
+
+   if (mapping_can_have_hugepages(mapping)) {
+   page = grab_cache_page_write_begin(mapping,
+   index & ~HPAGE_CACHE_INDEX_MASK,
+   flags | AOP_FLAG_TRANSHUGE);
+   /* fallback to small page */
+   if (!page) {
+   unsigned long offset;
+   offset = pos & ~PAGE_CACHE_MASK;
+   /* adjust the len to not cross small page boundary */
+   len = min_t(unsigned long,
+   len, PAGE_CACHE_SIZE - offset);
+   }
+   BUG_ON(page && !PageTransHuge(page));
+   }
+   if (!page)
+   return simple_write_begin(file, mapping, pos, len, flags,
+   pagep, fsdata);
+
+   *pagep = page;
+
+   if (!PageUptodate(page) && len != HPAGE_PMD_SIZE) {
+   unsigned from = pos & ~HPAGE_PMD_MASK;
+
+   zero_huge_user_segment(page, 0, from);
+   zero_huge_user_segment(page, from + len, HPAGE_PMD_SIZE);
+   }
+   return 0;
+}
+#endif
+
 /*
  * the inodes created here are not hashed. If you use iunique to generate
  * unique inode values later for this filesystem, then you must take care
diff --git a/include/linux/fs.h b/include/linux/fs.h
index d5f58b3..c1dbf43 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2553,6 +2553,13 @@ extern int simple_write_begin(struct file *file, struct 
address_space *mapping,
 extern int simple_write_end(struct file *file, struct address_space *mapping,
loff_t pos, unsigned len, unsigned copied,
struct page *page, void *fsdata);
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE
+extern int simple_thp_write_begin(struct file *file,
+   struct address_space *mapping, loff_t pos, unsigned len,
+   unsigned flags, struct page **pagep, void **fsdata);
+#else
+#define simple_thp_write_begin simple_write_begin
+#endif
 
 extern struct dentry *simple_lookup(struct inode *, struct dentry *, unsigned 
int flags);
 extern ssize_t generic_read_dir(struct file *, char __user *, size_t, loff_t 
*);
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index d459b38..eb484f2 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -591,4 +591,12 @@ static inline int add_to_page_cache(struct page *page,
return error;
 }
 
+static inline void clear_pagecache_page(struct page *page)
+{
+   if (PageTransHuge(page))
+   zero_huge_user(page, 0, HPAGE_PMD_SIZE);
+   else
+   clear_highpage(page);
+}
+
 #endif /* _LINUX_PAGEMAP_H */
-- 
1.8.3.2



[PATCH 23/23] ramfs: enable transparent huge page cache

2013-08-03 Thread Kirill A. Shutemov
From: "Kirill A. Shutemov" 

ramfs is the simplest fs from the page cache point of view. Let's start
enabling transparent huge page cache here.

ramfs pages are not movable[1] and switching to transhuge pages doesn't
affect that. We need to fix this eventually.

[1] http://lkml.org/lkml/2013/4/2/720

Signed-off-by: Kirill A. Shutemov 
---
 fs/ramfs/file-mmu.c | 3 ++-
 fs/ramfs/inode.c| 6 +-
 2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/fs/ramfs/file-mmu.c b/fs/ramfs/file-mmu.c
index 4884ac5..3236e41 100644
--- a/fs/ramfs/file-mmu.c
+++ b/fs/ramfs/file-mmu.c
@@ -32,7 +32,7 @@
 
 const struct address_space_operations ramfs_aops = {
.readpage   = simple_readpage,
-   .write_begin= simple_write_begin,
+   .write_begin= simple_thp_write_begin,
.write_end  = simple_write_end,
.set_page_dirty = __set_page_dirty_no_writeback,
 };
@@ -47,6 +47,7 @@ const struct file_operations ramfs_file_operations = {
.splice_read= generic_file_splice_read,
.splice_write   = generic_file_splice_write,
.llseek = generic_file_llseek,
+   .release= simple_thp_release,
 };
 
 const struct inode_operations ramfs_file_inode_operations = {
diff --git a/fs/ramfs/inode.c b/fs/ramfs/inode.c
index 39d1465..5dafdfc 100644
--- a/fs/ramfs/inode.c
+++ b/fs/ramfs/inode.c
@@ -61,7 +61,11 @@ struct inode *ramfs_get_inode(struct super_block *sb,
inode_init_owner(inode, dir, mode);
inode->i_mapping->a_ops = &ramfs_aops;
inode->i_mapping->backing_dev_info = &ramfs_backing_dev_info;
-   mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
+   /*
+* TODO: make ramfs pages movable
+*/
+   mapping_set_gfp_mask(inode->i_mapping,
+   GFP_TRANSHUGE & ~__GFP_MOVABLE);
mapping_set_unevictable(inode->i_mapping);
inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
switch (mode & S_IFMT) {
-- 
1.8.3.2



[PATCH 04/23] thp, mm: introduce mapping_can_have_hugepages() predicate

2013-08-03 Thread Kirill A. Shutemov
From: "Kirill A. Shutemov" 

Returns true if mapping can have huge pages. Just check for __GFP_COMP
in gfp mask of the mapping for now.

Signed-off-by: Kirill A. Shutemov 
---
 include/linux/pagemap.h | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index e8ca8cf..47b5082 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -84,6 +84,20 @@ static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask)
(__force unsigned long)mask;
 }
 
+static inline bool mapping_can_have_hugepages(struct address_space *m)
+{
+   gfp_t gfp_mask = mapping_gfp_mask(m);
+
+   if (!transparent_hugepage_pagecache())
+   return false;
+
+   /*
+* It's up to filesystem what gfp mask to use.
+* The only part of GFP_TRANSHUGE which matters for us is __GFP_COMP.
+*/
+   return !!(gfp_mask & __GFP_COMP);
+}
+
 /*
  * The page cache can done in larger chunks than
  * one page, because it allows for more efficient
-- 
1.8.3.2
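
The predicate above reduces to a single bit test on the mapping's gfp mask. A minimal user-space sketch, assuming made-up bit values (the `FLAG_*` and `MASK_TRANSHUGE` names are hypothetical stand-ins, not the kernel's `__GFP_*` definitions), suggests why a filesystem can clear the movable bit, as ramfs does in patch 23/23, and still qualify for huge pages:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit values for illustration; the kernel's real ones differ. */
#define FLAG_MOVABLE   (1u << 0)
#define FLAG_COMP      (1u << 1)
#define FLAG_HIGHUSER  (1u << 2)
#define MASK_TRANSHUGE (FLAG_HIGHUSER | FLAG_MOVABLE | FLAG_COMP)

/* Model of the predicate: feature enabled and the "compound" bit set. */
bool can_have_hugepages(unsigned int gfp_mask, bool thp_pagecache_enabled)
{
	if (!thp_pagecache_enabled)
		return false;
	return (gfp_mask & FLAG_COMP) != 0;
}

/* Small self-check showing the three interesting cases. */
void demo_gfp_check(void)
{
	/* Movable bit cleared, compound bit kept: still eligible. */
	assert(can_have_hugepages(MASK_TRANSHUGE & ~FLAG_MOVABLE, true));
	/* Plain mask without the compound bit: not eligible. */
	assert(!can_have_hugepages(FLAG_HIGHUSER, true));
	/* Feature disabled: never eligible. */
	assert(!can_have_hugepages(MASK_TRANSHUGE, false));
}
```

Only the compound bit matters in this model, which matches the comment in the patch that the rest of the mask is up to the filesystem.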



[PATCH 02/23] memcg, thp: charge huge cache pages

2013-08-03 Thread Kirill A. Shutemov
From: "Kirill A. Shutemov" 

mem_cgroup_cache_charge() has a check for PageCompound(). The check
prevents charging huge cache pages.

I don't see a reason why the check is present. Looks like it's just
legacy (introduced in 52d4b9a memcg: allocate all page_cgroup at boot).

Let's just drop it.

Signed-off-by: Kirill A. Shutemov 
Cc: Michal Hocko 
Cc: KAMEZAWA Hiroyuki 
Acked-by: Dave Hansen 
---
 mm/memcontrol.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b6cd870..dc50c1a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3921,8 +3921,6 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
 
if (mem_cgroup_disabled())
return 0;
-   if (PageCompound(page))
-   return 0;
 
if (!PageSwapCache(page))
ret = mem_cgroup_charge_common(page, mm, gfp_mask, type);
-- 
1.8.3.2



[PATCH 14/23] thp, mm: naive support of thp in generic_perform_write

2013-08-03 Thread Kirill A. Shutemov
From: "Kirill A. Shutemov" 

For now we still write/read at most PAGE_CACHE_SIZE bytes at a time.

This implementation doesn't cover address spaces with backing storage.

Signed-off-by: Kirill A. Shutemov 
---
 mm/filemap.c | 9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index b17ebb9..066bbff 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2382,6 +2382,7 @@ static ssize_t generic_perform_write(struct file *file,
unsigned long bytes;/* Bytes to write to page */
size_t copied;  /* Bytes copied from user */
void *fsdata;
+   int subpage_nr = 0;
 
offset = (pos & (PAGE_CACHE_SIZE - 1));
bytes = min_t(unsigned long, PAGE_CACHE_SIZE - offset,
@@ -2411,8 +2412,14 @@ again:
if (mapping_writably_mapped(mapping))
flush_dcache_page(page);
 
+   if (PageTransHuge(page)) {
+   off_t huge_offset = pos & ~HPAGE_PMD_MASK;
+   subpage_nr = huge_offset >> PAGE_CACHE_SHIFT;
+   }
+
pagefault_disable();
-   copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes);
+   copied = iov_iter_copy_from_user_atomic(page + subpage_nr, i,
+   offset, bytes);
pagefault_enable();
flush_dcache_page(page);
 
-- 
1.8.3.2
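
The subpage selection in the hunk above is plain mask-and-shift arithmetic on the file position. The following user-space sketch reproduces it under assumed x86-64 sizes (4 KiB base pages, 2 MiB huge pages); the `*_ASSUMED` constants and the `subpage_index()` helper are illustrative names, not kernel symbols:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed x86-64 geometry; not taken from the patch. */
#define PAGE_SHIFT_ASSUMED  12
#define HPAGE_SHIFT_ASSUMED 21
#define HPAGE_SIZE_ASSUMED  (UINT64_C(1) << HPAGE_SHIFT_ASSUMED)
#define HPAGE_MASK_ASSUMED  (~(HPAGE_SIZE_ASSUMED - 1))

/* Index of the small page, inside its huge page, that contains pos. */
unsigned int subpage_index(uint64_t pos)
{
	/* Keep only the offset within the huge page... */
	uint64_t huge_offset = pos & ~HPAGE_MASK_ASSUMED;

	/* ...and convert that byte offset to a small-page number. */
	return (unsigned int)(huge_offset >> PAGE_SHIFT_ASSUMED);
}

void demo_subpage_index(void)
{
	assert(subpage_index(0) == 0);
	assert(subpage_index(4096) == 1);
	/* Last byte of the first huge page lands in subpage 511. */
	assert(subpage_index(HPAGE_SIZE_ASSUMED - 1) == 511);
	/* The index wraps back to 0 at the next huge page boundary. */
	assert(subpage_index(HPAGE_SIZE_ASSUMED) == 0);
}
```

With these sizes a huge page holds 512 subpages, which is the same 512 figure patch 01/23 quotes for HPAGE_PMD_NR on x86-64.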



[PATCH 09/23] thp, mm: rewrite delete_from_page_cache() to support huge pages

2013-08-03 Thread Kirill A. Shutemov
From: "Kirill A. Shutemov" 

As with add_to_page_cache_locked(), we handle HPAGE_CACHE_NR pages at a
time.

Signed-off-by: Kirill A. Shutemov 
---
 mm/filemap.c | 21 +++--
 1 file changed, 15 insertions(+), 6 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 619e6cb..b75bdf5 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -115,6 +115,7 @@
 void __delete_from_page_cache(struct page *page)
 {
struct address_space *mapping = page->mapping;
+   int i, nr;
 
trace_mm_filemap_delete_from_page_cache(page);
/*
@@ -127,13 +128,21 @@ void __delete_from_page_cache(struct page *page)
else
cleancache_invalidate_page(mapping, page);
 
-   radix_tree_delete(&mapping->page_tree, page->index);
+   nr = hpagecache_nr_pages(page);
+   for (i = 0; i < nr; i++) {
+   page[i].mapping = NULL;
+   radix_tree_delete(&mapping->page_tree, page->index + i);
+   }
+   /* thp */
+   if (nr > 1)
+   __dec_zone_page_state(page, NR_FILE_TRANSPARENT_HUGEPAGES);
+
page->mapping = NULL;
/* Leave page->index set: truncation lookup relies upon it */
-   mapping->nrpages--;
-   __dec_zone_page_state(page, NR_FILE_PAGES);
+   mapping->nrpages -= nr;
+   __mod_zone_page_state(page_zone(page), NR_FILE_PAGES, -nr);
if (PageSwapBacked(page))
-   __dec_zone_page_state(page, NR_SHMEM);
+   __mod_zone_page_state(page_zone(page), NR_SHMEM, -nr);
BUG_ON(page_mapped(page));
 
/*
@@ -144,8 +153,8 @@ void __delete_from_page_cache(struct page *page)
 * having removed the page entirely.
 */
if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
-   dec_zone_page_state(page, NR_FILE_DIRTY);
-   dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
+   mod_zone_page_state(page_zone(page), NR_FILE_DIRTY, -nr);
+   add_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE, -nr);
}
 }
 
-- 
1.8.3.2



[PATCH 01/23] radix-tree: implement preload for multiple contiguous elements

2013-08-03 Thread Kirill A. Shutemov
From: "Kirill A. Shutemov" 

The radix tree is variable-height, so an insert operation not only has
to build the branch to its corresponding item, it also has to build the
branch to existing items if the size has to be increased (by
radix_tree_extend).

The worst case is a zero height tree with just a single item at index 0,
and then inserting an item at index ULONG_MAX. This requires 2 new branches
of RADIX_TREE_MAX_PATH size to be created, with only the root node shared.

The radix tree is usually protected by a spinlock, so we want to
preallocate the required memory before taking the lock.

Currently radix_tree_preload() only guarantees enough nodes to insert
one element. It's a hard limit. For transparent huge page cache we want
to insert HPAGE_PMD_NR (512 on x86-64) entries to address_space at once.

This patch introduces radix_tree_preload_count(). It preallocates
enough nodes to insert a number of *contiguous* elements.
The feature costs about 5KiB per CPU; details below.

Worst case for adding N contiguous items is adding entries at indexes
(ULONG_MAX - N) to ULONG_MAX. It requires nodes to insert single worst-case
item plus extra nodes if you cross the boundary from one node to the next.

Preload uses per-CPU array to store nodes. The total cost of preload is
"array size" * sizeof(void*) * NR_CPUS. We want to increase array size
to be able to handle 512 entries at once.

Size of array depends on system bitness and on RADIX_TREE_MAP_SHIFT.

We have three possible RADIX_TREE_MAP_SHIFT:

 #ifdef __KERNEL__
 #define RADIX_TREE_MAP_SHIFT   (CONFIG_BASE_SMALL ? 4 : 6)
 #else
 #define RADIX_TREE_MAP_SHIFT   3   /* For more stressful testing */
 #endif

On 64-bit system:
For RADIX_TREE_MAP_SHIFT=3, old array size is 43, new is 107.
For RADIX_TREE_MAP_SHIFT=4, old array size is 31, new is 63.
For RADIX_TREE_MAP_SHIFT=6, old array size is 21, new is 30.

On 32-bit system:
For RADIX_TREE_MAP_SHIFT=3, old array size is 21, new is 84.
For RADIX_TREE_MAP_SHIFT=4, old array size is 15, new is 46.
For RADIX_TREE_MAP_SHIFT=6, old array size is 11, new is 19.

On most machines we will have RADIX_TREE_MAP_SHIFT=6. In this case,
on 64-bit system the per-CPU feature overhead is
 for preload array:
   (30 - 21) * sizeof(void*) = 72 bytes
 plus, if the preload array is full
   (30 - 21) * sizeof(struct radix_tree_node) = 9 * 560 = 5040 bytes
 total: 5112 bytes

on 32-bit system the per-CPU feature overhead is
 for preload array:
   (19 - 11) * sizeof(void*) = 32 bytes
 plus, if the preload array is full
   (19 - 11) * sizeof(struct radix_tree_node) = 8 * 296 = 2368 bytes
 total: 2400 bytes
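
The two totals above can be rechecked with a few lines of arithmetic. This is only a sketch: `preload_overhead()` is a hypothetical helper, and the node sizes (560 and 296 bytes) and array sizes are the ones quoted in this message, not values computed from kernel headers:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Extra per-CPU bytes when the preload array grows from old_slots to
 * new_slots entries: the added pointer slots plus, when the array is
 * full, one preallocated node per added slot.
 */
size_t preload_overhead(size_t old_slots, size_t new_slots,
			size_t ptr_size, size_t node_size)
{
	size_t extra;

	assert(new_slots >= old_slots);
	extra = new_slots - old_slots;
	return extra * ptr_size + extra * node_size;
}
```

Calling `preload_overhead(21, 30, 8, 560)` gives 5112 and `preload_overhead(11, 19, 4, 296)` gives 2400, matching the 64-bit and 32-bit totals for RADIX_TREE_MAP_SHIFT=6.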

Since only THP uses batched preload at the moment, we disable (set max
preload to 1) it if !CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE. This can be
changed in the future.

Signed-off-by: Matthew Wilcox 
Signed-off-by: Kirill A. Shutemov 
Acked-by: Dave Hansen 
---
 include/linux/radix-tree.h | 11 +++
 lib/radix-tree.c   | 41 -
 2 files changed, 43 insertions(+), 9 deletions(-)

diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index 4039407..3bf0b3e 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -83,6 +83,16 @@ do { \
(root)->rnode = NULL;   \
 } while (0)
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE
+/*
+ * At the moment only THP uses preload for more than one item for batched
+ * pagecache manipulations.
+ */
+#define RADIX_TREE_PRELOAD_NR  512
+#else
+#define RADIX_TREE_PRELOAD_NR  1
+#endif
+
 /**
  * Radix-tree synchronization
  *
@@ -232,6 +242,7 @@ unsigned long radix_tree_prev_hole(struct radix_tree_root *root,
unsigned long index, unsigned long max_scan);
 int radix_tree_preload(gfp_t gfp_mask);
 int radix_tree_maybe_preload(gfp_t gfp_mask);
+int radix_tree_maybe_preload_contig(unsigned size, gfp_t gfp_mask);
 void radix_tree_init(void);
 void *radix_tree_tag_set(struct radix_tree_root *root,
unsigned long index, unsigned int tag);
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 7811ed3..99ab73c 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -82,16 +82,24 @@ static struct kmem_cache *radix_tree_node_cachep;
  * The worst case is a zero height tree with just a single item at index 0,
  * and then inserting an item at index ULONG_MAX. This requires 2 new branches
  * of RADIX_TREE_MAX_PATH size to be created, with only the root node shared.
+ *
+ * Worst case for adding N contiguous items is adding entries at indexes
+ * (ULONG_MAX - N) to ULONG_MAX. It requires nodes to insert single worst-case
+ * item plus extra nodes if you cross the boundary from one node to the next.
+ *
  * Hence:
  */
-#define RADIX_TREE_PRELOAD_SIZE (RADIX_TREE_MAX_PATH * 2 - 1)
+#define RADIX_TREE_PRELOAD_MIN (RADIX_TREE_MAX_PATH * 2 - 1)
+#define RADIX_TREE

Re: [PATCH 1/2] ext4: fix handling of nodelalloc parameter

2013-08-03 Thread Theodore Ts'o
On Fri, Aug 02, 2013 at 02:03:46PM +0200, Piotr Sarna wrote:
> Commit 26092bf ("ext4: use a table-driven handler for mount options")
> introduced buggy handling of nodelalloc parameter in mount command.
> 
> After explicitly using delalloc or nodelalloc parameter in mount command,
> MOPT_EXPLICIT flag is set. After that, a test ensures that "data=journal"
> and "delalloc" parameters are not simultaneously activated.
> Unluckily, the mentioned test reports a bug in both situations:
> - "data=journal,delalloc"
> - "data=journal,nodelalloc"
> whereas the second one is perfectly legal and acceptable.
> 
> A simple solution to this problem is in setting EXPLICIT_DELALLOC flag
> properly. This patch ensures that EXPLICIT_DELALLOC flag is set only
> if "delalloc" parameter was used, and not set in case of "nodelalloc".

Thanks for this bug report and patch.  There is an even simpler way of
fixing this that doesn't involve adding an additional check in the code,
though.  Just make the following change to the table entry:

 {Opt_nodelalloc, EXT4_MOUNT_DELALLOC,
-  MOPT_EXT4_ONLY | MOPT_CLEAR | MOPT_EXPLICIT},
+  MOPT_EXT4_ONLY | MOPT_CLEAR},

I'll send out a patch which does this...

- Ted
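
To see why dropping MOPT_EXPLICIT from the nodelalloc entry is enough, here is a simplified user-space model of the table-driven handling. The flag values, `struct mount_state`, and both helpers are hypothetical; the real ext4 code is more involved, and this sketch only assumes that the explicit flag is recorded whenever an entry carries MOPT_EXPLICIT and that the later sanity test rejects "data=journal" when it is set:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag values, for illustration only. */
#define MOPT_SET      0x1u
#define MOPT_CLEAR    0x2u
#define MOPT_EXPLICIT 0x4u

struct mount_state {
	bool delalloc;          /* delayed allocation currently enabled */
	bool explicit_delalloc; /* models the EXPLICIT_DELALLOC flag */
	bool data_journal;      /* "data=journal" was requested */
};

/* Apply one delalloc/nodelalloc table entry to the mount state. */
void apply_delalloc_opt(struct mount_state *st, unsigned int flags)
{
	assert(st != NULL);
	if (flags & MOPT_EXPLICIT)
		st->explicit_delalloc = true;
	if (flags & MOPT_SET)
		st->delalloc = true;
	if (flags & MOPT_CLEAR)
		st->delalloc = false;
}

/* Models the later test that refuses data=journal with explicit delalloc. */
bool mount_conflict(const struct mount_state *st)
{
	assert(st != NULL);
	return st->data_journal && st->explicit_delalloc;
}
```

In this model, applying `MOPT_CLEAR | MOPT_EXPLICIT` (the old nodelalloc entry) together with data=journal reports a conflict, while applying `MOPT_CLEAR` alone (the changed entry) does not.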


Re: [PATCH] acpi: video: improve quirk check

2013-08-03 Thread Aaron Lu
On Sun, Aug 4, 2013 at 6:20 AM, Felipe Contreras
 wrote:
> On Sat, Aug 3, 2013 at 4:40 PM, Rafael J. Wysocki  wrote:
>> On Saturday, August 03, 2013 03:24:16 PM Felipe Contreras wrote:
>>> On Sat, Aug 3, 2013 at 6:34 AM, Rafael J. Wysocki  wrote:
>>> > On Saturday, August 03, 2013 04:14:04 PM Aaron Lu wrote:
>>>
>>> >> Yes, the patch is correct, but I still prefer my own version :-)
>>> >> https://github.com/aaronlu/linux/commit/0a3d2c5b59caf80ae5bb1ca1fda0f7bf448b38c9
>>> >>
>>> >> In case you want to take mine and mine needs refresh, please let me know
>>> >> and I can do the re-base, thanks.
>>> >
>>> > Well, I prefer simpler, unless there's a good reason to use more 
>>> > complicated.
>>>
>>> Note that these are not exclusionary; his patch can be applied on top
>>> of mine. I don't think his patch is needed though.
>>
>> OK
>>
>> Do we still need to revert commit efaa14c if this patch is applied?
>
> I guess not. At least in this machine changing the backlight works
> correctly, whereas in v3.11-rc3 it was all weird, and v3.7-v3.10
> didn't work at all. I cannot see how it would affect negatively other
> machines.

That commit makes the firmware emit notifications on hotkey presses. It is
not the cause of the "booting into a black screen" problem; that problem is
due to a broken _BQC.

BTW, efaa14c will also turn the screen off at level 0 according to
Felipe, who considers this a bug. But since the commit is required to
let the firmware emit notifications on hotkey presses, I think users will
want it.

-Aaron

>
> That being said, the blacklisting is still needed, because 1. the
> level is not preserved between boots, and 2. level 0 turns off the
> screen, which I personally consider a regression.
>
> At least it boots to level 100 instead of 0.
>
> --
> Felipe Contreras


Re: [PATCH 1/1] EXT4: LazyInit Mount Bug Fix

2013-08-03 Thread Theodore Ts'o
On Mon, Jul 15, 2013 at 12:14:18PM +0530, Nitin Singla wrote:
> -sbi->s_itb_per_group = sbi->s_inodes_per_group /
> -sbi->s_inodes_per_block;
> +sbi->s_itb_per_group = DIV_ROUND_UP(sbi->s_inodes_per_group,
> +sbi->s_inodes_per_block);

This would only matter if s_inodes_per_group is not a multiple of
s_inodes_per_block.  Which is never supposed to happen; mke2fs doesn't
create file systems like this.

Ancient Android build systems, before the bug was fixed, did do this
in the past, but that was a long time ago.  Where did this file system
come from?

I could apply this patch, but you should be warned that there may be
other bugs hiding here for file systems like this, both in the kernel
and in e2fsprogs.

- Ted
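
For reference, the difference between the two expressions in the quoted hunk only shows up when the division leaves a remainder. A small sketch with made-up inode counts (the helper names are hypothetical; `DIV_ROUND_UP` is written out here the way the kernel macro is commonly defined):

```c
#include <assert.h>

#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/* Inode table blocks per group with plain (truncating) division. */
unsigned long itb_truncating(unsigned long inodes_per_group,
			     unsigned long inodes_per_block)
{
	assert(inodes_per_block != 0);
	return inodes_per_group / inodes_per_block;
}

/* The same quantity rounded up, as in the proposed change. */
unsigned long itb_rounded_up(unsigned long inodes_per_group,
			     unsigned long inodes_per_block)
{
	assert(inodes_per_block != 0);
	return DIV_ROUND_UP(inodes_per_group, inodes_per_block);
}
```

With 8192 inodes per group and 16 per block both helpers return 512. With 8200 inodes per group the truncating version still returns 512 while the rounded one returns 513, which is the case that mke2fs is not supposed to produce.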


[PATCH] regulator: 88pm800: Fix checking whether num_regulator is valid

2013-08-03 Thread Axel Lin
The code to check whether num_regulator is valid is wrong because it should
iterate all array entries rather than break from the for loop if
pdata->regulators[i] is NULL.

Signed-off-by: Axel Lin 
---
 drivers/regulator/88pm800.c | 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/drivers/regulator/88pm800.c b/drivers/regulator/88pm800.c
index c72fe95..58e9b74 100644
--- a/drivers/regulator/88pm800.c
+++ b/drivers/regulator/88pm800.c
@@ -299,10 +299,13 @@ static int pm800_regulator_probe(struct platform_device *pdev)
return -ENODEV;
}
} else if (pdata->num_regulators) {
-   /* Check whether num_regulator is valid. */
unsigned int count = 0;
-   for (i = 0; pdata->regulators[i]; i++)
-   count++;
+
+   /* Check whether num_regulator is valid. */
+   for (i = 0; i < ARRAY_SIZE(pdata->regulators); i++) {
+   if (pdata->regulators[i])
+   count++;
+   }
if (count != pdata->num_regulators)
return -EINVAL;
} else {
-- 
1.8.1.2
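
The difference between the two counting strategies can be shown in user space. In this sketch `struct pdata_model`, `MAX_REGULATORS`, and both helpers are hypothetical; the first helper models the old loop (with an added bound so the example cannot run off the array), and the second bounds the loop with `i < ARRAY_SIZE(...)` and counts every non-NULL slot:

```c
#include <assert.h>
#include <stddef.h>

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))
#define MAX_REGULATORS 4 /* hypothetical array length */

struct pdata_model {
	const void *regulators[MAX_REGULATORS];
	unsigned int num_regulators;
};

/* Old behaviour: stop at the first NULL entry (bound added for safety). */
unsigned int count_until_null(const struct pdata_model *p)
{
	unsigned int i, count = 0;

	assert(p != NULL);
	for (i = 0; i < ARRAY_SIZE(p->regulators) && p->regulators[i]; i++)
		count++;
	return count;
}

/* Intended behaviour: visit every slot and count the non-NULL ones. */
unsigned int count_all(const struct pdata_model *p)
{
	unsigned int i, count = 0;

	assert(p != NULL);
	for (i = 0; i < ARRAY_SIZE(p->regulators); i++) {
		if (p->regulators[i])
			count++;
	}
	return count;
}
```

For an array with entries in slots 0 and 2 only, the first helper reports 1 and the second reports 2, so only the second agrees with `num_regulators = 2`.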





Re: [PATCH] acpi: video: improve quirk check

2013-08-03 Thread Aaron Lu
On 08/03/2013 07:34 PM, Rafael J. Wysocki wrote:
> On Saturday, August 03, 2013 04:14:04 PM Aaron Lu wrote:
>> On 08/03/2013 07:47 AM, Rafael J. Wysocki wrote:
>>> On Friday, August 02, 2013 02:37:09 PM Felipe Contreras wrote:
>>>> If the _BCL package is descending, the first level (br->levels[2]) will
>>>> be 0, and if the number of levels matches the number of steps, we might
>>>> confuse a returned level to mean the index.
>>>>
>>>> For example:
>>>>
>>>>   current_level = max_level = 100
>>>>   test_level = 0
>>>>   returned level = 100
>>>>
>>>> In this case 100 means the level, not the index, and _BCM failed. But if
>>>> the _BCL package is descending, the index of level 0 is also 100, so we
>>>> assume _BQC is indexed, when it's not.
>>>>
>>>> This causes all _BQC calls to return bogus values causing weird behavior
>>>> from the user's perspective. For example: xbacklight -set 10; xbacklight
>>>> -set 20; would flash to 90% and then slowly down to the desired level
>>>> (20).
>>>>
>>>> The solution is simple; test anything other than the first level (e.g.
>>>> 1).
>>>>
>>>> Signed-off-by: Felipe Contreras 
>>>
>>> Looks reasonable.
>>>
>>> Aaron, what do you think?
>>
>> Yes, the patch is correct, but I still prefer my own version :-)
>> https://github.com/aaronlu/linux/commit/0a3d2c5b59caf80ae5bb1ca1fda0f7bf448b38c9
>>
>> In case you want to take mine and mine needs refresh, please let me know
>> and I can do the re-base, thanks.
> 
> Well, I prefer simpler, unless there's a good reason to use more complicated.
> 
> Why exactly do you think your version is better?

As explained here:
https://lkml.org/lkml/2013/8/2/81
https://lkml.org/lkml/2013/8/2/112

And for the demo broken _BQC, my patch will disable _BQC while still
making the backlight work, while this patch here tests the max
brightness level and may fail.

-Aaron
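
The false positive described in the quoted patch description can be modelled in a few lines. This is a simplified sketch, not the acpi video driver's code: it assumes a 101-entry descending package (levels 100 down to 0), and `fw_level_at()` and `looks_indexed()` are hypothetical names for the firmware table lookup and the quirk check:

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_LEVELS 101 /* model: levels 100, 99, ..., 0 */

/* Firmware-order table: index i holds level 100 - i. */
int fw_level_at(int index)
{
	assert(index >= 0 && index < NUM_LEVELS);
	return (NUM_LEVELS - 1) - index;
}

/*
 * Simplified quirk check: after requesting test_level, decide from the
 * value read back whether the firmware seems to report indexes.
 */
bool looks_indexed(int test_level, int returned)
{
	if (returned == test_level)
		return false; /* a plain level came back */
	if (returned < 0 || returned >= NUM_LEVELS)
		return false;
	return fw_level_at(returned) == test_level;
}
```

In this model, a failed set that leaves the level at 100 makes `looks_indexed(0, 100)` true, because index 100 happens to hold level 0. Testing with level 1 instead gives `looks_indexed(1, 100)` false, while a firmware that really reports indexes would return 99 and still be detected.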


Re: [PATCH 1/2] ARM: remove dmacap,memset from Device tree binding

2013-08-03 Thread Jason Cooper
DT Maintainers,

It's been a week with no comment.  Shall I assume it's ok to apply
this?

thx,

Jason.

On Thu, Jul 25, 2013 at 11:31:04AM -0400, Jason Cooper wrote:
> On Tue, Jul 02, 2013 at 12:54:12PM +0200, Sebastian Hesselbarth wrote:
> > DMA_MEMSET support has been removed, so update the device tree files
> > and corresponding binding documentation for Marvell SoCs.
> > 
> > Signed-off-by: Sebastian Hesselbarth 
> > ---
> > Cc: Russell King  
> > Cc: Jason Cooper 
> > Cc: Andrew Lunn 
> > Cc: Thomas Petazzoni 
> > Cc: Gregory CLEMENT  
> > Cc: devicetree-disc...@lists.ozlabs.org 
> > Cc: linux-kernel@vger.kernel.org 
> > Cc: linux-arm-ker...@lists.infradead.org 
> > ---
> >  Documentation/devicetree/bindings/dma/mv-xor.txt |2 --
> >  arch/arm/boot/dts/armada-370.dtsi|2 --
> >  arch/arm/boot/dts/armada-xp.dtsi |2 --
> >  arch/arm/boot/dts/dove.dtsi  |2 --
> >  arch/arm/boot/dts/kirkwood.dtsi  |2 --
> >  arch/arm/boot/dts/orion5x.dtsi   |1 -
> >  6 files changed, 0 insertions(+), 11 deletions(-)
> 
> Adding the new devicetree ml to the Cc:
> 
> I'm fine with the changes to the dts{i} files, but I think the binding
> document should be handled differently.
> 
> thx,
> 
> Jason.
> 
> > 
> > diff --git a/Documentation/devicetree/bindings/dma/mv-xor.txt b/Documentation/devicetree/bindings/dma/mv-xor.txt
> > index 7c6cb7f..68f7004 100644
> > --- a/Documentation/devicetree/bindings/dma/mv-xor.txt
> > +++ b/Documentation/devicetree/bindings/dma/mv-xor.txt
> > @@ -14,7 +14,6 @@ properties:
> >  
> >  And the following optional properties:
> >  - dmacap,memcpy to indicate that the XOR channel is capable of memcpy operations
> > -- dmacap,memset to indicate that the XOR channel is capable of memset operations
> >  - dmacap,xor to indicate that the XOR channel is capable of xor operations
> >  
> >  Example:
> > @@ -35,6 +34,5 @@ xor@d0060900 {
> >   interrupts = <52>;
> >   dmacap,memcpy;
> >   dmacap,xor;
> > - dmacap,memset;
> > };
> >  };
> > diff --git a/arch/arm/boot/dts/armada-370.dtsi b/arch/arm/boot/dts/armada-370.dtsi
> > index fa3dfc6..a315ad1 100644
> > --- a/arch/arm/boot/dts/armada-370.dtsi
> > +++ b/arch/arm/boot/dts/armada-370.dtsi
> > @@ -132,7 +132,6 @@
> > interrupts = <52>;
> > dmacap,memcpy;
> > dmacap,xor;
> > -   dmacap,memset;
> > };
> > };
> >  
> > @@ -151,7 +150,6 @@
> > interrupts = <95>;
> > dmacap,memcpy;
> > dmacap,xor;
> > -   dmacap,memset;
> > };
> > };
> >  
> > diff --git a/arch/arm/boot/dts/armada-xp.dtsi b/arch/arm/boot/dts/armada-xp.dtsi
> > index 416eb94..4b3dd56 100644
> > --- a/arch/arm/boot/dts/armada-xp.dtsi
> > +++ b/arch/arm/boot/dts/armada-xp.dtsi
> > @@ -114,7 +114,6 @@
> > interrupts = <52>;
> > dmacap,memcpy;
> > dmacap,xor;
> > -   dmacap,memset;
> > };
> > };
> >  
> > @@ -134,7 +133,6 @@
> > interrupts = <95>;
> > dmacap,memcpy;
> > dmacap,xor;
> > -   dmacap,memset;
> > };
> > };
> >  
> > diff --git a/arch/arm/boot/dts/dove.dtsi b/arch/arm/boot/dts/dove.dtsi
> > index 6cab468..2cef34f 100644
> > --- a/arch/arm/boot/dts/dove.dtsi
> > +++ b/arch/arm/boot/dts/dove.dtsi
> > @@ -232,7 +232,6 @@
> >  
> > channel1 {
> > interrupts = <40>;
> > -   dmacap,memset;
> > dmacap,memcpy;
> > dmacap,xor;
> > };
> > @@ -253,7 +252,6 @@
> >  
> > channel1 {
> > interrupts = <43>;
> > -   dmacap,memset;
> > dmacap,memcpy;
> > dmacap,xor;
> > };
> > diff --git a/arch/arm/boot/dts/kirkwood.dtsi b/arch/arm/boot/dts/kirkwood.dtsi
> > index 9809fc1..078637c 100644
> > --- a/arch/arm/boot/dts/kirkwood.dtsi
> > +++ b/arch/arm/boot/dts/kirkwood.dtsi
> > @@ -126,7 +126,6 @@
> >   interrupts = <6>;
> >   dmacap,memcpy;
> >   dmacap,xor;
> > - dmacap,memset;
> > };
> >   

[git pull] drm build fix

2013-08-03 Thread Dave Airlie

Hi Linus,

just a quick fix that a few people have reported, be nice to have in asap.

Dave.

The following changes since commit 72a67a94bcba71a5fddd6b3596a20604d2b5dcd6:

  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net (2013-08-03 15:00:23 -0700)

are available in the git repository at:


  git://people.freedesktop.org/~airlied/linux drm-fixes

for you to fetch changes up to adfb8e51332153016857194b85309150ac560286:

  drm/radeon: fix 64 bit divide in SI spm code (2013-08-04 11:03:14 +1000)


Alex Deucher (1):
  drm/radeon: fix 64 bit divide in SI spm code

 drivers/gpu/drm/radeon/si_dpm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)


Re: Cannot hot remove a memory device

2013-08-03 Thread Toshi Kani
On Sat, 2013-08-03 at 03:01 +0200, Rafael J. Wysocki wrote:
> On Friday, August 02, 2013 06:04:40 PM Toshi Kani wrote:
> > On Sat, 2013-08-03 at 01:43 +0200, Rafael J. Wysocki wrote:
> > > On Friday, August 02, 2013 03:46:15 PM Toshi Kani wrote:
> > > > On Thu, 2013-08-01 at 23:43 +0200, Rafael J. Wysocki wrote:
> > > > > Hi,
> > > > > 
> > > > > Thanks for your report.
> > > > > 
> > > > > On Thursday, August 01, 2013 05:37:21 PM Yasuaki Ishimatsu wrote:
> > > > > > By following commit, I cannot hot remove a memory device.
> > > > > > 
> > > > > > ACPI / memhotplug: Bind removable memory blocks to ACPI device nodes
> > > > > > commit e2ff39400d81233374e780b133496a2296643d7d
> > > > > > 
> > > > > > Details are follows:
> > > > > > When I add a memory device, acpi_memory_enable_device() always fails
> > > > > > as follows:
> > > > > > 
> > > > > > ...
> > > > > > [ 1271.114116]  [ea121c40-ea121c7f] PMD -> [880813c0-880813ff] on node 3
> > > > > > [ 1271.128682]  [ea121c80-ea121cbf] PMD -> [88081380-880813bf] on node 3
> > > > > > [ 1271.143298]  [ea121cc0-ea121cff] PMD -> [88081300-8808133f] on node 3
> > > > > > [ 1271.157799]  [ea121d00-ea121d3f] PMD -> [880812c0-880812ff] on node 3
> > > > > > [ 1271.172341]  [ea121d40-ea121d7f] PMD -> [88081280-880812bf] on node 3
> > > > > > [ 1271.186872]  [ea121d80-ea121dbf] PMD -> [88081240-8808127f] on node 3
> > > > > > [ 1271.201481]  [ea121dc0-ea121dff] PMD -> [88081200-8808123f] on node 3
> > > > > > [ 1271.216041]  [ea121e00-ea121e3f] PMD -> [880811c0-880811ff] on node 3
> > > > > > [ 1271.230623]  [ea121e40-ea121e7f] PMD -> [88081180-880811bf] on node 3
> > > > > > [ 1271.245148]  [ea121e80-ea121ebf] PMD -> [88081140-8808117f] on node 3
> > > > > > [ 1271.259683]  [ea121ec0-ea121eff] PMD -> [88081100-8808113f] on node 3
> > > > > > [ 1271.274194]  [ea121f00-ea121f3f] PMD -> [880810c0-880810ff] on node 3
> > > > > > [ 1271.288764]  [ea121f40-ea121f7f] PMD -> [88081080-880810bf] on node 3
> > > > 
> > > > It appears that each memory object only has 64MB of memory.  This is
> > > > less than the memory block size, which is 128MB.  This means that a
> > > > single memory block associates with two ACPI memory device objects.
> > > 
> > > That'd be bad.
> > > 
> > > How did that work before if that indeed is the case?
> > 
> > Well, it looks to me that it has never worked before...
> > 
> > > > > > ... 
> > > > > > [ 1271.325841] acpi PNP0C80:03: acpi_memory_enable_device() error
> > > > > 
> > > > > Well, the only new way acpi_memory_enable_device() can fail after 
> > > > > that commit
> > > > > is a failure in acpi_bind_memory_blocks().
> > > > 
> > > > Agreed.
> > > > 
> > > > > This means that either handle is NULL, which I think we can exclude, 
> > > > > because
> > > > > acpi_memory_enable_device() wouldn't be called at all if that were 
> > > > > the case, or
> > > > > there's a more subtle error in acpi_bind_one().
> > > > > 
> > > > > One that comes to mind is that we may be calling acpi_bind_one() 
> > > > > twice for the
> > > > > same memory region, in which it will trigger -EINVAL from the sanity 
> > > > > check in
> > > > > there.
> > > > 
> > > > I think it fails with -EINVAL at the place with dev_warn(dev, "ACPI
> > > > handle is already set\n").  When two ACPI memory objects associate with
> > > > a same memory block, the bind procedure of the 2nd ACPI memory object
> > > > sees that ACPI_HANDLE(dev) is already set to the 1st ACPI memory object.
> > > 
> > > That sound's plausible, but I wonder how we can fix that?
> > > 
> > > There's no way for a single physical device to have two different ACPI
> > > "companions".  It looks like the memory blocks should be 64 M each in that
> > > case.  Or we need to create two child devices for each memory block and
> > > associate each of them with an ACPI object.  That would lead to 
> > > complications
> > > in the user space interface, though.
> > 
> > Right.  Even bigger issue is that I do not think __add_pages() and
> > __remove_pages() can add / delete a memory chunk that is less than
> > 128MB.  128MB is the granularity of them.  So, we may just have to fail
> > this case gracefully.
> 
> Sigh.
> 
> BTW, why do you think they are 64 M each (it's late and I'm obviously tired)?

Oops!  Sorry, I had confused the above messages with the one in
init_memory_mapping(), which shows a memory range being added, i.e. the
size of an ACPI

Re: [PATCH] ACPI: Do not fail acpi_bind_one() if device is already bound correctly

2013-08-03 Thread Toshi Kani
On Sat, 2013-08-03 at 02:47 +0200, Rafael J. Wysocki wrote:
> On Friday, August 02, 2013 04:38:38 PM Toshi Kani wrote:
> > On Fri, 2013-08-02 at 00:33 +0200, Rafael J. Wysocki wrote:
> > > From: Rafael J. Wysocki 
> > > 
> > > Modify acpi_bind_one() so that it doesn't fail if the device
> > > represented by its first argument has already been bound to the
> > > given ACPI handle (second argument), because that is not a good
> > > enough reason for returning an error code.
> > 
> > While it seems reasonable to allow such case, I do not think we will hit
> > this case under the normal scenarios.  So, I do not think we need to
> > make this change now unless it actually solves Yasuaki's issue (which I
> > am guessing not).
> 
> In theory it should be possible to call acpi_bind_one() twice in a row
> for the same dev and the same handle without failure, that simply is
> logical.  The patch may not fix any problems visible now, but returning an
> error code in such a case is simply incorrect.

We changed acpi_bus_device_attach() to not call the handler or driver
again if it is already bound.  So, I was under the impression that we
prevent attaching the same device twice.  But I may be missing
something...

Thanks,
-Toshi




Re: [GIT PULL] ACPI and power management fixes for v3.11-rc4

2013-08-03 Thread Felipe Contreras
On Sat, Aug 3, 2013 at 6:27 PM, Igor Gnatenko
 wrote:
> On Sat, 2013-08-03 at 15:21 -0500, Felipe Contreras wrote:
>> On Sat, Aug 3, 2013 at 11:08 AM, Igor Gnatenko
>>  wrote:
>>
>> > I am opposed to this patch. On ThinkPad X230 I had problems with it.
>> > Felipe, come over to dark side. They have cookies.
>>
>> And v3.11-rc3 works's fine out-of-the box?
>>
>
> rc2 work's fine. Since rc3 Rafael reverted patch and I think backlight
> broken again (I don't have time to check rc3 yet).

You think?

We are talking about this patch [1]. Did you apply that patch on top
of rc3 and boot without any arguments?

If you did, what problems *exactly* did you have?

[1] 
https://bugzilla.kernel.org/attachment.cgi?id=107084&action=diff&context=patch&collapsed=&headers=1&format=raw

-- 
Felipe Contreras


Re: Kernel causes excessive delay mounting USB device

2013-08-03 Thread Dr. David Alan Gilbert
* Vern Clark (vmcl...@verizon.net) wrote:
> 
> Plugging in any USB flash stick takes 5-6 minutes before its mounted.
> 
> ===
> Using the current kernel-3.8.0-28-generic, the USB fails to load in proper 
> time.
> I found this message in syslog: timeout: killing '/sbin/blkid -o udev -p 
> /dev/sdb'. Right after which, the USB device was mounted.
> 
> Using an earlier kernel-3.2.37-030237-generic, fixed the problem.

Did you try going back to one of their previous 3.8.0 versions rather than a 
big jump like that?

> Aug  2 09:39:16 u kernel: [ 2269.914570] sd 11:0:0:0: [sdb] Assuming drive 
> cache: write through
> Aug  2 09:39:47 u kernel: [ 2300.739825] usb 1-8: reset high-speed USB device 
> number 11 using ehci-pci

Curious; I've just started getting something similar on the Ubuntu 3.10.0-6
kernels, and last night's v3.11-rc3 build.

It's been suggested perhaps:
http://www.spinics.net/lists/stable/msg16022.html

is the culprit, but I haven't tried it yet.

Dave
-- 
 -Open up your eyes, open up your mind, open up your code ---   
/ Dr. David Alan Gilbert|   Running GNU/Linux   | Happy  \ 
\ gro.gilbert @ treblig.org |   | In Hex /
 \ _|_ http://www.treblig.org   |___/


Re: RFC: named anonymous vmas

2013-08-03 Thread KOSAKI Motohiro
On Thu, Aug 1, 2013 at 4:36 AM, Rich Felker  wrote:
> On Thu, Aug 01, 2013 at 01:29:51AM -0700, Christoph Hellwig wrote:
>> Btw, FreeBSD has an extension to shm_open to create unnamed but fd
>> passable segments.  From their man page:
>>
>> As a FreeBSD extension, the constant SHM_ANON may be used for the path
>> argument to shm_open().  In this case, an anonymous, unnamed shared
>> memory object is created.  Since the object has no name, it cannot be
>> removed via a subsequent call to shm_unlink().  Instead, the shared
>> memory object will be garbage collected when the last reference to the
>> shared memory object is removed.  The shared memory object may be shared
>> with other processes by sharing the file descriptor via fork(2) or
>> sendmsg(2).  Attempting to open an anonymous shared memory object with
>> O_RDONLY will fail with EINVAL. All other flags are ignored.
>>
>> To me this sounds like the best way to expose this functionality to the
>> user.  Implementing it is another question as shm_open sits in libc,
>> we could either take it and shm_unlink to the kernel, or use O_TMPFILE
>> on tmpfs as the backend.
>
> I'm not sure what the purpose is. shm_open with a long random filename
> and O_EXCL|O_CREAT, followed immediately by shm_unlink, is just as
> good except in the case where you have a malicious user killing the
> process in between these two operations.

In practice, the filename length is limited by NAME_MAX (255 bytes), and
several people don't think that is long enough. The point is to have a
race-free API.
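
For reference, the shm_open() + shm_unlink() pattern Rich describes
could look roughly like the sketch below.  anon_shm_create() and the
"/anon-" name prefix are made up for this example, and taking the random
name from /dev/urandom is just one possible choice.  The comment marks
the window in which a killed process would leave the name behind.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: create a shared memory object of the given size
 * and drop its name immediately, so only the returned fd refers to it.
 * Returns the fd, or -1 on error.
 */
int anon_shm_create(size_t size)
{
    char name[64];
    unsigned int rnd[4];
    int fd, ufd, tries;

    assert(size > 0);

    for (tries = 0; tries < 16; tries++) {
        /* Pick a long random name to make collisions unlikely. */
        ufd = open("/dev/urandom", O_RDONLY);
        if (ufd < 0)
            return -1;
        if (read(ufd, rnd, sizeof(rnd)) != (ssize_t)sizeof(rnd)) {
            close(ufd);
            return -1;
        }
        close(ufd);
        snprintf(name, sizeof(name), "/anon-%08x%08x%08x%08x",
                 rnd[0], rnd[1], rnd[2], rnd[3]);

        fd = shm_open(name, O_RDWR | O_CREAT | O_EXCL, 0600);
        if (fd >= 0) {
            /* Race window: dying here leaves the name in place. */
            shm_unlink(name);
            if (ftruncate(fd, (off_t)size) != 0) {
                close(fd);
                return -1;
            }
            return fd;
        }
        if (errno != EEXIST)
            return -1;         /* real failure, not a name clash */
    }
    return -1;
}
```

The returned descriptor can then be mapped with mmap() and passed to
other processes via fork() or sendmsg().  On older glibc versions this
may need linking with -lrt.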


Re: [GIT PULL] ACPI and power management fixes for v3.11-rc4

2013-08-03 Thread Igor Gnatenko
On Sat, 2013-08-03 at 15:21 -0500, Felipe Contreras wrote:
> On Sat, Aug 3, 2013 at 11:08 AM, Igor Gnatenko
>  wrote:
> 
> > I am opposed to this patch. On ThinkPad X230 I had problems with it.
> > Felipe, come over to dark side. They have cookies.
> 
> And v3.11-rc3 works's fine out-of-the box?
> 

rc2 works fine. Since Rafael reverted the patch in rc3, I think the backlight
is broken again (I haven't had time to check rc3 yet).
-- 
Igor Gnatenko
Fedora release 19 (Schrödinger’s Cat)
Linux 3.10.4-300.fc19.x86_64



Re: [PATCH] Staging: olpc_dcon: replace some magic numbers

2013-08-03 Thread Dan Carpenter
On Sat, Aug 03, 2013 at 02:38:45PM -0700, Andres Salomon wrote:
> On Sat, 3 Aug 2013 23:36:15 +0200
> Jens Frederich  wrote:
> 
> > On Sat, Aug 3, 2013 at 11:16 PM, Andres Salomon 
> > wrote:
> > > Please Cc Daniel on these.  Cjb and myself are no longer at olpc.
> > >
> > 
> > Do you know what's with Jon Nettleton? He is also on the TODO list?
> 
> Let's ask.  Jon, do you still want to be Cc'd on DCON and other OLPC
> patches?

No one reads the TODO files...  Just update MAINTAINERS so that
get_maintainer.pl CC's the right people automatically.

regards,
dan carpenter



Re: [PATCH] Staging: olpc_dcon: replace some magic numbers

2013-08-03 Thread Dan Carpenter
On Sat, Aug 03, 2013 at 10:44:35PM +0200, Jens Frederich wrote:
> @@ -126,7 +127,7 @@ static int dcon_bus_stabilize(struct dcon_priv *dcon, int 
> is_powered_down)
>  power_up:
>   if (is_powered_down) {
>   x = 1;
> - x = olpc_ec_cmd(0x26, (unsigned char *)&x, 1, NULL, 0);
> + x = olpc_ec_cmd(EC_DCON_POWER_MODE, (u8 *)&x, 1, NULL, 0);

You didn't introduce this, but using "x" as the inbuf here is messy.
It should be a char instead of an int.  The code won't work on big
endian systems.  I know this hardware is only available on little
endian systems and that's why it's not a bug.  It's just an ugly
thing to do.

(Since you didn't introduce this, it means your patch is fine and
you can ignore this email.  I am just commenting in case anyone
wants to clean it up).
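
To show what I mean, here is a small user-space sketch.  model_ec_cmd()
is a made-up stub that only records the first byte it is handed; it is
not the real olpc_ec_cmd(), and the other helper names are invented for
this example too.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Byte most recently received by the hypothetical EC stub. */
static uint8_t last_sent;

/* Hypothetical stand-in for an EC command taking a 1-byte input buffer. */
int model_ec_cmd(const uint8_t *inbuf, size_t inlen)
{
    assert(inbuf != NULL && inlen == 1);
    last_sent = inbuf[0];
    return 0;
}

/* Passes the lowest-addressed byte of an int: endian dependent. */
uint8_t send_via_int(void)
{
    int x = 1;

    model_ec_cmd((const uint8_t *)&x, 1);
    return last_sent;
}

/* Passes a real one-byte variable: same result on any host. */
uint8_t send_via_u8(void)
{
    uint8_t mode = 1;

    model_ec_cmd(&mode, 1);
    return last_sent;
}

/* Returns 1 on a little-endian host, 0 on a big-endian one. */
int host_is_little_endian(void)
{
    const uint16_t probe = 1;
    uint8_t low;

    memcpy(&low, &probe, 1);
    return low == 1;
}
```

send_via_u8() always hands over the value 1, while send_via_int() hands
over 1 only on a little-endian host and 0 on a big-endian one.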

>   if (x) {
>   pr_warn("unable to force dcon to power up: %d!\n", x);
>   return x;

regards,
dan carpenter


Re: [ 00/99] 3.10.5-stable review

2013-08-03 Thread Greg Kroah-Hartman
On Sat, Aug 03, 2013 at 07:02:50AM -0700, Guenter Roeck wrote:
> On 08/02/2013 03:07 AM, Greg Kroah-Hartman wrote:
> >This is the start of the stable review cycle for the 3.10.5 release.
> >There are 99 patches in this series, all will be posted as a response
> >to this one.  If anyone has any issues with these being applied, please
> >let me know.
> >
> >Responses should be made by Sun Aug  4 10:00:45 UTC 2013.
> >Anything received after that time might be too late.
> >
> >The whole patch series can be found in one patch at:
> > kernel.org/pub/linux/kernel/v3.0/stable-review/patch-3.10.5-rc1.gz
> >and the diffstat can be found below.
> >
> Cross build looks good.
> 
> Summary:
>   Total builds: 64 Total build errors: 3
> 
> Details:
>   
> http://desktop.roeck-us.net/buildlogs/v3.10/v3.10.4-99-g069bbeb.2013-08-03.02:22:12

Thanks for testing and letting me know.

greg k-h


Re: [PATCH] acpi: video: improve quirk check

2013-08-03 Thread Felipe Contreras
On Sat, Aug 3, 2013 at 5:38 PM, Rafael J. Wysocki  wrote:
> On Saturday, August 03, 2013 05:20:33 PM Felipe Contreras wrote:
>> On Sat, Aug 3, 2013 at 4:40 PM, Rafael J. Wysocki  wrote:
>> > On Saturday, August 03, 2013 03:24:16 PM Felipe Contreras wrote:
>> >> On Sat, Aug 3, 2013 at 6:34 AM, Rafael J. Wysocki  wrote:
>> >> > On Saturday, August 03, 2013 04:14:04 PM Aaron Lu wrote:
>> >>
>> >> >> Yes, the patch is correct, but I still prefer my own version :-)
>> >> >> https://github.com/aaronlu/linux/commit/0a3d2c5b59caf80ae5bb1ca1fda0f7bf448b38c9
>> >> >>
>> >> >> In case you want to take mine and mine needs refresh, please let me 
>> >> >> know
>> >> >> and I can do the re-base, thanks.
>> >> >
>> >> > Well, I prefer simpler, unless there's a good reason to use more 
>> >> > complicated.
>> >>
>> >> Note that these are not exclusionary; his patch can be applied on top
>> >> of mine. I don't think his patch is needed though.
>> >
>> > OK
>> >
>> > Do we still need to revert commit efaa14c if this patch is applied?
>>
>> I guess not. At least in this machine changing the backlight works
>> correctly, whereas in v3.11-rc3 it was all weird, and v3.7-v3.10
>> didn't work at all. I cannot see how it would affect negatively other
>> machines.
>>
>> That being said, the blacklisting is still needed, because 1. the
>> level is not preserved between boots, and 2. level 0 turns off the
>> screen, which I personally consider a regression.
>>
>> At least it boots to level 100 instead of 0.
>
> OK
>
> I'll push this patch to Linus for -rc5 then without the revert of commit
> commit efaa14c.  That's all I'm going to do for 3.11 in the ACPI video
> area at this point.

That seems fair.

> As far as the blacklisting is concerned, I still have the blacklist of
> your Asus machine queued up for 3.12.  Since you're claiming that it
> doesn't have any side effects on that machine, I think I can apply it.
>
> However, for other machines to be added to that blacklist I need to see
> requests from their users with confirmation that there are no visible side
> effects there.

Good, that's the purpose of bug 60682.

https://bugzilla.kernel.org/show_bug.cgi?id=60682

-- 
Felipe Contreras


Re: Possible mmap() write() problem in SLES11 SP2 kernel

2013-08-03 Thread Hugh Dickins
On Thu, 1 Aug 2013, Ulrich Windl wrote:
> Hi folks!
> 
> I think I'd let you know (maybe I'm wrong, and the kernel is right):
> 
> I write a C-program that maps a file into an private writable map. Then I 
> modify the area a bit and use one write to write that area back to a file.
> 
> This worked fine in SLES11 kernel 3.0.74-0.6.10. However with kernel  
> 3.0.80-0.7 the write() fails with EFAULT if the output file is the same as 
> the input file.

I wonder if you actually did exactly the same on both kernels.

> 
> The strace is amazingly short (I removed the unrelated calls):

Providing that was very helpful.

> open("xxx", O_RDONLY)   = 3
> fstat(3, {st_mode=S_IFREG|0644, st_size=4416, ...}) = 0
> mmap(NULL, 4416, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3, 0) = 0x7f85ac045000
> close(3)= 0
> open("xxx", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3

The crucial point is the above O_TRUNC when you now open the file for
writing: that truncates the file to 0-length, which unmaps any pages
mapped from it into userspace.  Even the privately modified COW pages:
that often seems surprising, but it is how mmap versus truncate is
specified to work.

> write(3, 0x7f85ac045000, 4414)  = -1 EFAULT (Bad address)

If your program now touched a part of the mapping, it would get
SIGBUS, there being no pages of underlying object to page in from.
But since you're accessing the area from within a system call,
that simply fails with EFAULT.
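
A minimal sketch of the same sequence in C, for anyone who wants to try
it.  The helper name write_from_truncated_mapping() and the scratch path
under /tmp are made up for this example, and it sizes the file with
ftruncate() instead of using real data.  On Linux it would be expected
to report EFAULT, for the reason given above.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: map len bytes of a scratch file MAP_PRIVATE,
 * reopen the file with O_TRUNC, then write() from the mapping.
 * Returns the errno reported by write(), 0 if write() succeeded,
 * or -1 if the setup itself failed.
 */
int write_from_truncated_mapping(size_t len)
{
    char path[] = "/tmp/mmap-trunc-XXXXXX";
    char *map;
    int fd, result;

    assert(len > 0);

    fd = mkstemp(path);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, (off_t)len) != 0) {
        close(fd);
        unlink(path);
        return -1;
    }

    map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    close(fd);
    if (map == MAP_FAILED) {
        unlink(path);
        return -1;
    }

    map[0] = 'b';    /* private modification, creates a COW page */

    /* Same flags as in the strace: O_TRUNC cuts the file to 0 bytes. */
    fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0666);
    if (fd < 0) {
        munmap(map, len);
        unlink(path);
        return -1;
    }

    /* The source buffer is the mapping, which is no longer backed. */
    result = (write(fd, map, len) < 0) ? errno : 0;

    close(fd);
    munmap(map, len);
    unlink(path);
    return result;
}
```

Copying the data into an ordinary buffer before reopening the file with
O_TRUNC avoids the problem.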

> close(3)= 0
> munmap(0x7f85ac045000, 4414)= 0
> 
> I want to have your attention if this should work, and you get my attention 
> if this should not work.

It should not work.

> Note that the input file is closed before it's opened for write again. As the 
> output file is typically shorter than the input, I didn't want to use a 
> non-private mapping and a truncate, just in case you wonder...

(I didn't understand your logic there.)

Hugh


Re: [PATCH 001/001] CHAR DRIVERS: a simple device to give daemons a /sys-like interface

2013-08-03 Thread Greg Kroah-Hartman
On Fri, Aug 02, 2013 at 06:19:19PM -0700, Bob Smith wrote:
> This character device can give daemons an interface similar to
> the kernel's /sys and /proc interfaces.   It is a nice way to
> give user space drivers real device nodes in /dev.

Other comments about this patch:

> From 7ee4391af95b828179cf5627f8b431c3301c5057 Mon Sep 17 00:00:00 2001
> From: Bob Smith 
> Date: Fri, 2 Aug 2013 16:44:48 -0700
> Subject: [PATCH] PROXY, a driver that gives daemons a /sys like interface
> 

No signed-off-by:, or body of text here that explains what this is and
why it should be accepted.

> ---
>  Documentation/proxy.txt |   36 
>  drivers/char/Kconfig|8 +
>  drivers/char/Makefile   |2 +-
>  drivers/char/proxy.c|  539 
> +++
>  4 files changed, 584 insertions(+), 1 deletion(-)
>  create mode 100644 Documentation/proxy.txt
>  create mode 100644 drivers/char/proxy.c
> 
> diff --git a/Documentation/proxy.txt b/Documentation/proxy.txt
> new file mode 100644
> index 000..6b9206a
> --- /dev/null
> +++ b/Documentation/proxy.txt
> @@ -0,0 +1,36 @@
> +Proxy Character Devices
> +
> +
> +Proxy is a small character device that connects two user space
> +processes.  It is intended to give user space daemons a /sys like
> +interface for configuration and status.
> +
> +As an example consider a daemon that controls a stepper motor. The
> +daemon would create and open one proxy device to read and write
> +configuration (/dev/stepper/config) and another proxy device to
> +accept a motor step count (/dev/stepper/count).
> +Shell commands to illustrate this example:
> + $ stepper_daemon# start the stepper control daemon
> + $ # Set config to full steps, clockwise and 400 step/sec
> + $ echo "full, cw, 400" > /dev/stepper/config
> + $ # Now tell the motor to step 4000 steps
> + $ echo 4000 > /dev/stepper/count
> + $ sleep 2
> + $ # How many steps remain?
> + $ cat /dev/stepper/count
> +
> +
> +Proxy has some unique features that make ideal for providing a
> +/sys like interface.  It has no internal buffering.  The means
> +the daemon can not write until a client program is listening.
> +Both named pipes and pseudo-ttys have internal buffers.

So what is wrong with internal buffers?  Named pipes have been around
for a long time, they should be able to be used much like this, right?

> +Proxy will succeed on a write of zero bytes.  A zero byte write
> +gives the client an EOF.  The daemon in the example above would
> +use a zero byte write in the last command after it had written the
> +number of steps remaining.  No other IPC mechanism can close one
> +side of a device and leave the other side open.

No "direct" IPC, but you can always emulate this just fine with existing
IPC mechanisms.
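
As one possible illustration (a sketch, not a drop-in replacement for
the proposed driver): on a connected AF_UNIX stream socket, shutdown()
with SHUT_WR gives the peer an EOF while the other direction stays open.
daemon_reply_then_eof() is a made-up helper name, and this gives you a
socket rather than a node in /dev.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical daemon-side helper: send a reply, then half-close the
 * socket so the client reads EOF.  The daemon can still read from
 * the same socket afterwards.  Returns 0 on success, -1 on error.
 */
int daemon_reply_then_eof(int sock, const char *reply)
{
    size_t len;

    assert(reply != NULL);

    len = strlen(reply);
    if (write(sock, reply, len) != (ssize_t)len)
        return -1;
    return shutdown(sock, SHUT_WR);
}
```

A client reading from the other end gets the reply and then a
zero-length read, and it can still write back to the daemon.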

> +Proxy works well with select(), an important feature for daemons.
> +In contrast, the FUSE filesystem has some issues with select() on
> +the client side.

What are those issues?  Why not just fix them?

Adding a new IPC function to the kernel should not be buried down in
drivers/char/.  We have 10+ different IPC mechanisms already, some
simple, some more complex.  Are you _sure_ that none of the existing ones
will work for you?  Maybe a simple userspace library that wraps the
existing mechanisms would be better (no kernel changes needed, portable
to any kernel release, etc.)?

thanks,

greg k-h


Re: [GIT PULL] ACPI and power management fixes for v3.11-rc4

2013-08-03 Thread Felipe Contreras
On Sat, Aug 3, 2013 at 5:21 PM, Rafael J. Wysocki  wrote:
> On Saturday, August 03, 2013 05:06:10 PM Felipe Contreras wrote:
>> On Sat, Aug 3, 2013 at 4:59 PM, Rafael J. Wysocki  wrote:
>
> [...]
>
>> > Whatever you are thinking you will achieve this way, it doesn't work.
>>
>> It is the reality: v3.7 is broken, v3.8 is broken, v3.9 is broken,
>> v3.10 is broken, v3.11 is going to be broken, v3.12 will probably be
>> broken too, and perhaps even v3.13.
>
> Be precise and say "backlight control on a number of machines in broken in
> those kernels".  Yes, it is.  It needs to be fixed.  Not necessarily your way,
> though.

"My" way, can be done *today*. "Your" way can be done whenever it's ready.

>> Who benefits from this?
>
> Clearly, no one.
>
> And who benefits from your "crusade"?

Who benefits from yours?

It's all very simple; we ask the owners of these machines if
acpi_osi="!Windows 2012" makes things work better. If it does, they go
into the blacklist; if not, they don't. That would fix the problems
for these machines *today*. If later it turns out there are other
issues introduced, they get removed, just like you removed the
intel_backlight switch patch. Igor Gnatenko said he had problems, but
didn't say which ones. In bug #51231, tons of people reported
that things improved for ThinkPad X230.

And when the proper fix is done, everyone will be happy and we can
drop the blacklist.

Everybody benefits this way.

-- 
Felipe Contreras
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] acpi: video: improve quirk check

2013-08-03 Thread Rafael J. Wysocki
On Saturday, August 03, 2013 05:20:33 PM Felipe Contreras wrote:
> On Sat, Aug 3, 2013 at 4:40 PM, Rafael J. Wysocki  wrote:
> > On Saturday, August 03, 2013 03:24:16 PM Felipe Contreras wrote:
> >> On Sat, Aug 3, 2013 at 6:34 AM, Rafael J. Wysocki  wrote:
> >> > On Saturday, August 03, 2013 04:14:04 PM Aaron Lu wrote:
> >>
> >> >> Yes, the patch is correct, but I still prefer my own version :-)
> >> >> https://github.com/aaronlu/linux/commit/0a3d2c5b59caf80ae5bb1ca1fda0f7bf448b38c9
> >> >>
> >> >> In case you want to take mine and mine needs refresh, please let me know
> >> >> and I can do the re-base, thanks.
> >> >
> >> > Well, I prefer simpler, unless there's a good reason to use more 
> >> > complicated.
> >>
> >> Note that these are not exclusionary; his patch can be applied on top
> >> of mine. I don't think his patch is needed though.
> >
> > OK
> >
> > Do we still need to revert commit efaa14c if this patch is applied?
> 
> I guess not. At least in this machine changing the backlight works
> correctly, whereas in v3.11-rc3 it was all weird, and v3.7-v3.10
> didn't work at all. I cannot see how it would affect negatively other
> machines.
> 
> That being said, the blacklisting is still needed, because 1. the
> level is not preserved between boots, and 2. level 0 turns off the
> screen, which I personally consider a regression.
> 
> At least it boots to level 100 instead of 0.

OK

I'll push this patch to Linus for -rc5 then without the revert of commit
efaa14c.  That's all I'm going to do for 3.11 in the ACPI video
area at this point.

As far as the blacklisting is concerned, I still have the blacklist of
your Asus machine queued up for 3.12.  Since you're claiming that it
doesn't have any side effects on that machine, I think I can apply it.

However, for other machines to be added to that blacklist I need to see
requests from their users with confirmation that there are no visible side
effects there.

I think this is fair enough.

Thanks,
Rafael



Re: [PATCH v3 2/4] ASoc: kirkwood: merge kirkwood-i2c and kirkwood-dma

2013-08-03 Thread Russell King - ARM Linux
On Wed, Jul 31, 2013 at 08:18:06AM +0200, Jean-Francois Moine wrote:
> To avoid the declaration of a 'kirkwood-pcm-audio' device in the DT,
> this patch merges the kirkwood-i2c and kirkwood-dma drivers into one
> module associated with 'kirkwood-i2s'. 

I suggest holding off on this stuff at the moment.  I think Mark and Liam
(who now have a whole raft of emails from me today) have some work to do
to fix ASoC to work how they're saying it should work, because to me
ASoC seems to contradict what they're telling me.  To put it another way,
it must be buggy.

The DAPM stuff seems to be the worst thing since mouldy bread.  I'm
chasing what seems to be multiple bugs through this stuff, and many of
them are not particularly nice.

For example - we register multiple copies of DAPM widgets for the same
set of declarations if both a CPU DAI and a platform share the same
struct device, and thereby end up overwriting some pointers in the DAI
structure.  It seems that CPU DAIs themselves aren't supposed to have
DAPM widgets, but the core creates some... but there's no explicit
cleanup of them unlike the other DAPMs, and there's no debugfs for them.

The bias levels are stuck at off/standby all the time, which makes any
kind of PM with spdif-dit impossible - or even any routing between
stream input and output.

Basically, I'm tearing my hair out today looking at this stuff and getting
nowhere fast - all I'm doing is finding more and more problems.

For example, where a codec has no input/output widgets declared, it used
to be powered up automatically by ASoC as a matter of course.  Things
like UDA134x and other things.  Things like spdif-dit.  That "mysteriously"
stopped happening.

ASoC: dapm: Treat DAI widgets like AIF widgets for power

seems to be the cause, this results in such setups having zero connected
inputs/outputs reported, which causes them to remain powered down - because
what used to happen before was the DAI links would report their powered
state depending on whether they were active, and they are set active
in soc_dapm_stream_event().

Now, the playback widgets for the Codec and CPU DAIs get marked as active.
The playback widget is created as a snd_soc_dapm_dai_in.  Its power
check function is set to dapm_dac_check_power.  Since this widget is
active, it checks for connected outputs via is_connected_output_ep().

We drop through to the second switch statement (why do we have two there?
They're both switching on the same damned variable and it's not like a
widget can change its ID.)  This is where the different behaviour has
appeared - when these were just a simple snd_soc_dapm_dai, we used to
just do the snd_soc_dapm_suspend_check() here, but we don't anymore.
This is a snd_soc_dapm_dai_in type of widget, so we fall through this
switch statement now and start searching the paths.

As these codecs have no paths, this ultimately ends up returning no
connections.  That is why codecs which have no DAPM widgets declared, but
do have PM support via the bias level stuff, have stopped working.

Now, about the spdif-dit, if we're going to have to add "pin" widgets
to it, what is the output of a SPDIF in terms of DAPM widgets?  At a guess,
it's a "codec pin" despite there being no codec and no "pin" on that
codec in reality, and that "pin" is always active.

With that "fixed" (rather, altered to a state where it behaves the
same way as it used to before the above commit) if I set up a DAPM route
between the CPU DAI playback stream, an AIF (for spdif output) and the
spdif-dit playback stream, I see the playback streams marked active and
powered up, but the AIF stream remains powered down:

spdif-dit/dapm/Playback:Playback: On  in 1 out 1
spdif-dit/dapm/Playback: stream Playback active
spdif-dit/dapm/Playback: in  "static" "spdif-playback"
spdif-dit/dapm/bias_level:On

kirkwood-audio.1/dapm/spdif-playback:spdif-playback: Off  in 1 out 0
kirkwood-audio.1/dapm/spdif-playback: in  "static" "dma-playback"
kirkwood-audio.1/dapm/spdif-playback: out "static" "Playback"

kirkwood-audio.1/dapm/dma-playback:dma-playback: On  in 1 out 1
kirkwood-audio.1/dapm/dma-playback: stream dma-playback active
kirkwood-audio.1/dapm/dma-playback: out "static" "spdif-playback"
kirkwood-audio.1/dapm/dma-playback: out "static" "i2s-playback"

It looks to me like the DAPM stuff is - in one plain and simple word -
buggered.

I've no idea what the right fixes are in this area.  It needs someone
like Mark or Liam who supposedly understand this to spend time checking
that it actually operates as they _think_ it should operate, because
at the moment it plainly doesn't.


Re: [PATCH] acpi: video: improve quirk check

2013-08-03 Thread Felipe Contreras
On Sat, Aug 3, 2013 at 4:40 PM, Rafael J. Wysocki  wrote:
> On Saturday, August 03, 2013 03:24:16 PM Felipe Contreras wrote:
>> On Sat, Aug 3, 2013 at 6:34 AM, Rafael J. Wysocki  wrote:
>> > On Saturday, August 03, 2013 04:14:04 PM Aaron Lu wrote:
>>
>> >> Yes, the patch is correct, but I still prefer my own version :-)
>> >> https://github.com/aaronlu/linux/commit/0a3d2c5b59caf80ae5bb1ca1fda0f7bf448b38c9
>> >>
>> >> In case you want to take mine and mine needs refresh, please let me know
>> >> and I can do the re-base, thanks.
>> >
>> > Well, I prefer simpler, unless there's a good reason to use more 
>> > complicated.
>>
>> Note that these are not exclusionary; his patch can be applied on top
>> of mine. I don't think his patch is needed though.
>
> OK
>
> Do we still need to revert commit efaa14c if this patch is applied?

I guess not. At least in this machine changing the backlight works
correctly, whereas in v3.11-rc3 it was all weird, and v3.7-v3.10
didn't work at all. I cannot see how it would affect negatively other
machines.

That being said, the blacklisting is still needed, because 1. the
level is not preserved between boots, and 2. level 0 turns off the
screen, which I personally consider a regression.

At least it boots to level 100 instead of 0.

-- 
Felipe Contreras


Re: [GIT PULL] ACPI and power management fixes for v3.11-rc4

2013-08-03 Thread Rafael J. Wysocki
On Saturday, August 03, 2013 05:06:10 PM Felipe Contreras wrote:
> On Sat, Aug 3, 2013 at 4:59 PM, Rafael J. Wysocki  wrote:

[...]

> > Whatever you are thinking you will achieve this way, it doesn't work.
> 
> It is the reality: v3.7 is broken, v3.8 is broken, v3.9 is broken,
> v3.10 is broken, v3.11 is going to be broken, v3.12 will probably be
> broken too, and perhaps even v3.13.

Be precise and say "backlight control on a number of machines is broken in
those kernels".  Yes, it is.  It needs to be fixed.  Not necessarily your way,
though.

> Who benefits from this?

Clearly, no one.

And who benefits from your "crusade"?

Again, no one.



Re: [GIT PULL] ACPI and power management fixes for v3.11-rc4

2013-08-03 Thread Felipe Contreras
On Sat, Aug 3, 2013 at 4:59 PM, Rafael J. Wysocki  wrote:
> On Saturday, August 03, 2013 03:20:22 PM Felipe Contreras wrote:
>> On Sat, Aug 3, 2013 at 6:54 AM, Rafael J. Wysocki  wrote:
>> > On Friday, August 02, 2013 08:48:09 PM Felipe Contreras wrote:
>>
>> >> Yes, that's fine, either the revert, or the patch I mentioned, or
>> >> something else, but something has to be done, and it was better to do
>> >> it in v3.11-rc4 than in v3.11-rc5, because that change itself can
>> >> cause further problems.
>> >
>> > A revert can be done in -rc5 just fine.  If we don't have a working fix 
>> > this
>> > week, I'll do the revert.
>>
>> I think you are waiting for miracles. But whatever.
>>
>> >> Let's do a though experiment, let's say you are right, and they can
>> >> survive the 6 months it would take you to find the "perfect" solution,
>> >> say in v3.13. What's wrong with having a partial solution in v3.12? If
>> >> the blacklisting doesn't work properly (there's absolutely no evidence
>> >> for that), then you revert it on v3.12.1.
>> >>
>> >> What's wrong with that approach?
>> >
>> > If the blacklisting leads to problems, they may not be reported in the 3.12
>> > time frame, but much later.  For example because people won't realize that
>> > the problems are caused by the blacklisting until much much later.  And 
>> > then
>> > we'll be in a spot where whatever we do will break things for someone.
>>
>> The key word is *may*, you don't *know*. Why do you insist in
>> committing this reification fallacy?
>>
>> This threat is not real, it's theoretical.
>
> I believe it is real.

That is called wishful thinking. Believing so doesn't make it so.

> Igor has told you that already, hasn't he?

He said he "had problems", that doesn't mean much. What kind of
problems? Did he have those problems in v3.10? Does v3.11-rc3 work
correctly? Did he boot without any boot arguments?

"I had problems" is almost meaningless.

>> But let's suppose you are right, and there are issues, and those don't
>> get reported in v3.12. That is actually GOOD, if people don't report
>> issues, it means the issues are not that big, or not even *there*.
>
> You don't even realize how wrong you are.  The issues that aren't visible
> in testing from the start are often much *worse* than those we can see
> immediately, because usually they are much more difficult to fix and they
> cause much more pain overall as a result.

How convenient. In other words; there's absolutely no empirical way to
prove you wrong.

It's like trying to prove that your invisible pink unicorn doesn't exist.

If we don't find evidence of problems, that is evidence that there are
no problems. And even if there are "invisible" "worse" issues, it
doesn't matter, because you will fix them properly in six months,
right?

I'd say, fix the visible (aka. real) issues, don't worry about the
invisible (aka. imaginary) ones. Worry when they are visible.

>> And no, we won't be in a spot where whatever we do will break things,
>> because if the intel backlight driver works correctly, that solution
>> would work for everyone. And if it doesn't, we should stay with what
>> works best.
>>
>> > And we had situations like that in the past, which is the source of my concern.
>> > You obviously don't have that experience, or you won't be so eager to inflict
>> > the blacklisting on everyone.
>> >
>> > Anyway, as you know, but conveniently don't mention, I asked some experienced
>> > people for opinions about that.  If they agree with you, we will add the
>> > blacklist.  If they don't, we won't add it.
>>
>> Again, screw the users. We are stuck with broken backlight for several
>> more months to come. Great.
>
> Whatever you are thinking you will achieve this way, it doesn't work.

It is the reality: v3.7 is broken, v3.8 is broken, v3.9 is broken,
v3.10 is broken, v3.11 is going to be broken, v3.12 will probably be
broken too, and perhaps even v3.13.

Who benefits from this?

-- 
Felipe Contreras
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 08/10] cpufreq: Fix broken usage of governor->owner's refcount

2013-08-03 Thread Rafael J. Wysocki
On Saturday, August 03, 2013 06:57:11 PM Viresh Kumar wrote:
> On 3 August 2013 17:38, Rafael J. Wysocki  wrote:
> > On Saturday, August 03, 2013 05:19:26 PM Viresh Kumar wrote:
> >> Governor's owner refcount usage was broken. We should increment refcount only
> >> when CPUFREQ_GOV_POLICY_INIT event has come and should decrement only if
> >> CPUFREQ_GOV_POLICY_EXIT has come.
> >>
> >> Lets fix it.
> >
> > OK, and what happens if we don't fix it?
> 
> What about this changelog:
> 
> Subject: [PATCH 08/10] cpufreq: Fix broken usage of governor->owner's
>  refcount
> 
> Governor's owner refcount usage was broken. We should increment refcount only
> when CPUFREQ_GOV_POLICY_INIT event has come and should decrement only if
> CPUFREQ_GOV_POLICY_EXIT has come.
> 
> Currently there can be situations where governor is in use but we have allowed
> it to be unloaded. And that may cause in undefined behavior.

"it to be unloaded which may result in undefined behavior."

> 
> Lets fix it.
> 
> Signed-off-by: Viresh Kumar 

Apart from the above looks good.

Thanks,
Rafael


-- 
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.
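
The refcount rule from the changelog above can be sketched in user space.
This is only a toy model under stated assumptions: the enum, the struct and
the function names below are hypothetical stand-ins, not the cpufreq core's
own definitions. In the kernel the reference would presumably be taken with
try_module_get() on the governor's owner and dropped with module_put().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the governor events named in the changelog;
 * the values are illustrative, not the kernel's. */
enum gov_event {
	GOV_POLICY_INIT,
	GOV_START,
	GOV_LIMITS,
	GOV_STOP,
	GOV_POLICY_EXIT,
};

/* Toy governor; refcount plays the role of the owner module's refcount. */
struct toy_governor {
	int refcount;
};

/* Take a reference only on POLICY_INIT, drop it only on POLICY_EXIT. */
static void toy_governor_event(struct toy_governor *gov, enum gov_event ev)
{
	switch (ev) {
	case GOV_POLICY_INIT:
		gov->refcount++;
		break;
	case GOV_POLICY_EXIT:
		assert(gov->refcount > 0);
		gov->refcount--;
		break;
	default:
		/* START, STOP and LIMITS leave the count untouched. */
		break;
	}
}

/* Unloading is safe only while no policy holds a reference. */
static bool toy_governor_can_unload(const struct toy_governor *gov)
{
	return gov->refcount == 0;
}
```

With this pairing, a governor that has seen POLICY_INIT but not yet
POLICY_EXIT stays pinned no matter how many START/STOP cycles happen in
between, which is the situation the changelog says was not covered.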


Re: [PATCH 00/10] CPUFreq: Fixes & Cleanups for 3.12

2013-08-03 Thread Rafael J. Wysocki
On Saturday, August 03, 2013 06:58:35 PM Viresh Kumar wrote:
> On 3 August 2013 17:37, Rafael J. Wysocki  wrote:
> > On Saturday, August 03, 2013 05:19:18 PM Viresh Kumar wrote:
> >> Hi Rafael,
> >>
> >> This patchset tries to fix & cleanup many existing cpufreq core issues. First
> >> four patches try to cleanup basic problems in cpufreq core. Its first patch
> >> was earlier sent separately but now is part of this series.
> >>
> >> Fifth patch was also sent earlier as reply to your patches and was reviewed by
> >> Srivatsa. Sixth patch was picked from Lukasz's patchset on introducing software
> >> "boost" feature in core. It will be used by this patchset.
> >>
> >> And last four are the most significant part of this set. They try to make many
> >> things simple and robust.
> >>
> >> This is rebased of your bleeding-edge branch + two patches from you:
> >> 18a6b03 cpufreq: Avoid double kobject_put() for the same kobject in error code path
> >> d0cde63 cpufreq: Do not hold driver module references for additional policy CPUs
> >> abe513f Merge branch 'acpi-sleep-next' into linux-next
> >> abe513f Merge branch 'acpi-sleep-next' into linux-next
> >>
> >> They are also pushed in my cpufreq-next branch
> >
> > How much testing has that received so far?
> 
> I planned to add this information but forgot at the last moment. It was
> partially tested. As it was mostly developed over the weekend I wasn't able
> to do much of testing. I posted it to get early comments, and testing would
> complete by beginning of next week.

OK, thanks!



Re: [GIT PULL] ACPI and power management fixes for v3.11-rc4

2013-08-03 Thread Rafael J. Wysocki
On Saturday, August 03, 2013 03:20:22 PM Felipe Contreras wrote:
> On Sat, Aug 3, 2013 at 6:54 AM, Rafael J. Wysocki  wrote:
> > On Friday, August 02, 2013 08:48:09 PM Felipe Contreras wrote:
> 
> >> Yes, that's fine, either the revert, or the patch I mentioned, or
> >> something else, but something has to be done, and it was better to do
> >> it in v3.11-rc4 than in v3.11-rc5, because that change itself can
> >> cause further problems.
> >
> > A revert can be done in -rc5 just fine.  If we don't have a working fix this
> > week, I'll do the revert.
> 
> I think you are waiting for miracles. But whatever.
> 
> >> Let's do a thought experiment, let's say you are right, and they can
> >> survive the 6 months it would take you to find the "perfect" solution,
> >> say in v3.13. What's wrong with having a partial solution in v3.12? If
> >> the blacklisting doesn't work properly (there's absolutely no evidence
> >> for that), then you revert it on v3.12.1.
> >>
> >> What's wrong with that approach?
> >
> > If the blacklisting leads to problems, they may not be reported in the 3.12
> > time frame, but much later.  For example because people won't realize that
> > the problems are caused by the blacklisting until much much later.  And then
> > we'll be in a spot where whatever we do will break things for someone.
> 
> The key word is *may*, you don't *know*. Why do you insist on
> committing this reification fallacy?
> 
> This threat is not real, it's theoretical.

I believe it is real.  Igor has told you that already, hasn't he?

> But let's suppose you are right, and there are issues, and those don't
> get reported in v3.12. That is actually GOOD, if people don't report
> issues, it means the issues are not that big, or not even *there*.

You don't even realize how wrong you are.  The issues that aren't visible
in testing from the start are often much *worse* than those we can see
immediately, because usually they are much more difficult to fix and they
cause much more pain overall as a result.

But I'm not sure why I'm still trying to explain things to you, because you
don't understand basic stuff.

> And no, we won't be in a spot where whatever we do will break things,
> because if the intel backlight driver works correctly, that solution
> would work for everyone. And if it doesn't, we should stay with what
> works best.
> 
> > And we had situations like that in the past, which is the source of my concern.
> > You obviously don't have that experience, or you won't be so eager to inflict
> > the blacklisting on everyone.
> >
> > Anyway, as you know, but conveniently don't mention, I asked some experienced
> > people for opinions about that.  If they agree with you, we will add the
> > blacklist.  If they don't, we won't add it.
> 
> Again, screw the users. We are stuck with broken backlight for several
> more months to come. Great.

Whatever you are thinking you will achieve this way, it doesn't work.

You seem to believe that the more you press people, the more they are likely
to do what you want.  Maybe that belief is a result of your previous
experience, but this is not helping you here.

Decisions made under pressure usually lead to wrong choices, so I avoid that
as much as I can.  As a result, the more you're pressing me, the more I'm
concerned about the drawbacks and the less convincing you are to me.

Consider this before you reply.

Rafael


-- 
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.


Re: [PATCH] Staging: olpc_dcon: replace some magic numbers

2013-08-03 Thread Andres Salomon
On Sat, 3 Aug 2013 23:36:15 +0200
Jens Frederich  wrote:

> On Sat, Aug 3, 2013 at 11:16 PM, Andres Salomon 
> wrote:
> > Please Cc Daniel on these.  Cjb and myself are no longer at olpc.
> >
> 
> Do you know what's with Jon Nettleton? He is also on the TODO list?

Let's ask.  Jon, do you still want to be Cc'd on DCON and other OLPC
patches?


Re: [PATCH] Staging: olpc_dcon: replace some magic numbers

2013-08-03 Thread Jens Frederich
On Sat, Aug 3, 2013 at 11:16 PM, Andres Salomon  wrote:
> Please Cc Daniel on these.  Cjb and myself are no longer at olpc.
>

Do you know what's with Jon Nettleton? He is also on the TODO list?


Re: [PATCH 02/10] cpufreq: Re-arrange declarations in cpufreq.h

2013-08-03 Thread Rafael J. Wysocki
On Saturday, August 03, 2013 10:43:45 PM Viresh Kumar wrote:
> On 3 August 2013 17:19, Viresh Kumar  wrote:
> > They are pretty much mixed up. Although generic headers are present but
> > definitions/declarations are present outside them too..
> >
> > This patch just moves stuff up and down to make it look better and consistent.
> >
> > Signed-off-by: Viresh Kumar 
> > ---
> >  include/linux/cpufreq.h | 370 +++-
> >  1 file changed, 177 insertions(+), 193 deletions(-)
> 
> Fixup due to compilation reported by Fengguang's kbuild system:
> [Will post the series again once I receive more comments on it]

OK, thanks.  I'm waiting for the update of the whole series, then.

> diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
> index a6b97e2..d568f39 100644
> --- a/include/linux/cpufreq.h
> +++ b/include/linux/cpufreq.h
> @@ -268,6 +268,19 @@ int cpufreq_unregister_notifier(struct
> notifier_block *nb, unsigned int list);
>  void cpufreq_notify_transition(struct cpufreq_policy *policy,
> struct cpufreq_freqs *freqs, unsigned int state);
> 
> +#else /* CONFIG_CPU_FREQ */
> +static inline int cpufreq_register_notifier(struct notifier_block *nb,
> +   unsigned int list)
> +{
> +   return 0;
> +}
> +static inline int cpufreq_unregister_notifier(struct notifier_block *nb,
> +   unsigned int list)
> +{
> +   return 0;
> +}
> +#endif /* !CONFIG_CPU_FREQ */
> +
>  /**
>   * cpufreq_scale - "old * mult / div" calculation for large values 
> (32-bit-arch
>   * safe)
> @@ -282,32 +295,16 @@ static inline unsigned long
> cpufreq_scale(unsigned long old, u_int div,
> u_int mult)
>  {
>  #if BITS_PER_LONG == 32
> -
> u64 result = ((u64) old) * ((u64) mult);
> do_div(result, div);
> return (unsigned long) result;
> 
>  #elif BITS_PER_LONG == 64
> -
> unsigned long result = old * ((u64) mult);
> result /= div;
> return result;
> -
>  #endif
> -};
> -
> -#else /* CONFIG_CPU_FREQ */
> -static inline int cpufreq_register_notifier(struct notifier_block *nb,
> -   unsigned int list)
> -{
> -   return 0;
>  }
> -static inline int cpufreq_unregister_notifier(struct notifier_block *nb,
> -   unsigned int list)
> -{
> -   return 0;
> -}
> -#endif /* !CONFIG_CPU_FREQ */
> 
>  /*
>   *  CPUFREQ GOVERNORS*
-- 
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.
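
For readers wondering why cpufreq_scale() in the quoted hunk widens to u64
on 32-bit builds before dividing, here is a small user-space sketch. It is
an illustration only: uint32_t and uint64_t stand in for a 32-bit unsigned
long and the kernel's u64, a plain division stands in for do_div(), and the
function names are made up for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Naive "old * mult / div": the 32-bit product can wrap around before
 * the division is applied. */
static uint32_t scale_naive(uint32_t old, uint32_t div, uint32_t mult)
{
	return old * mult / div;
}

/* Widened version, mirroring the 32-bit branch in the quoted diff:
 * multiply in 64 bits, divide, then narrow the result again. */
static uint32_t scale_wide(uint32_t old, uint32_t div, uint32_t mult)
{
	uint64_t result = (uint64_t)old * (uint64_t)mult;

	result /= div;
	return (uint32_t)result;
}

/* 3000000 * 2000 does not fit in 32 bits, so only the widened version
 * gives the expected 6000000. */
static void check_scale(void)
{
	assert(scale_wide(3000000u, 1000u, 2000u) == 6000000u);
	assert(scale_naive(3000000u, 1000u, 2000u) != 6000000u);
}
```

On 64-bit builds unsigned long is already wide enough for such products,
which is presumably why that branch of the quoted code can divide directly.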


Re: [PATCH 01/10] cpufreq: Cleanup header files included in core

2013-08-03 Thread Rafael J. Wysocki
On Saturday, August 03, 2013 10:45:04 PM Viresh Kumar wrote:
> On 3 August 2013 17:19, Viresh Kumar  wrote:
> > This patch intends to cleanup following issues in the header files included in
> > cpufreq core layers:
> > - Include headers in ascending order, so that we don't add same multiple times
> >   by mistake.
> > -  must be included after , so that they override whatever they
> >   need.
> > - Remove unnecessary header files
> > - Don't include files already included by cpufreq.h or cpufreq_governor.h
> >
> > Signed-off-by: Viresh Kumar 
> 
> Fixup due to compilation warning reported by Fengguang's kbuild system:
> [Latest stuff pushed in my cpufreq-next branch]
> 
> diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
> index 70399ea..ccaf025 100644
> --- a/drivers/cpufreq/cpufreq.c
> +++ b/drivers/cpufreq/cpufreq.c
> @@ -19,6 +19,7 @@
> 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 

Can you please repost the complete patch with this folded in?

Rafael



Re: [PATCH] Staging: olpc_dcon: replace some magic numbers

2013-08-03 Thread Jens Frederich
On Sat, Aug 3, 2013 at 11:16 PM, Andres Salomon  wrote:
> Please Cc Daniel on these.  Cjb and myself are no longer at olpc.
>
Sorry, I've forgotten it. I will update the TODO list.


Re: [PATCH] acpi: video: improve quirk check

2013-08-03 Thread Rafael J. Wysocki
On Saturday, August 03, 2013 03:24:16 PM Felipe Contreras wrote:
> On Sat, Aug 3, 2013 at 6:34 AM, Rafael J. Wysocki  wrote:
> > On Saturday, August 03, 2013 04:14:04 PM Aaron Lu wrote:
> 
> >> Yes, the patch is correct, but I still prefer my own version :-)
> >> https://github.com/aaronlu/linux/commit/0a3d2c5b59caf80ae5bb1ca1fda0f7bf448b38c9
> >>
> >> In case you want to take mine and mine needs refresh, please let me know
> >> and I can do the re-base, thanks.
> >
> > Well, I prefer simpler, unless there's a good reason to use more complicated.
> 
> Note that these are not exclusionary; his patch can be applied on top
> of mine. I don't think his patch is needed though.

OK

Do we still need to revert commit efaa14c if this patch is applied?

Rafael



Re: [PATCH] Staging: olpc_dcon: replace some magic numbers

2013-08-03 Thread Andres Salomon
Please Cc Daniel on these.  Cjb and myself are no longer at olpc. 



On Sat,  3 Aug 2013 22:44:35 +0200
Jens Frederich  wrote:

> This patch replaces some magic numbers. I believe it makes
> the driver more readable.
> 
> The magic number 0x26 is the XO system embedded controller
> (EC) command 'DCON power enable/disable'.
> 
> Numbers 0x41 and 0x42 are special memory controller settings
> registers.  The 0x41 initialization bit sequence 0x101 means:
> enable the memory power down function and the special SDRAM clock
> delay that synchronizes SDRAM output and clock signal.
> 
> The 0x42 initialization sequence 0x101 is wrong.  According to
> the specification, bit 8 is reserved and thus not in use.
> I removed it.
> 
> Signed-off-by: Jens Frederich 
> 
> diff --git a/drivers/staging/olpc_dcon/olpc_dcon.c
> b/drivers/staging/olpc_dcon/olpc_dcon.c index 7c460f2..5ca4fa4 100644
> --- a/drivers/staging/olpc_dcon/olpc_dcon.c
> +++ b/drivers/staging/olpc_dcon/olpc_dcon.c
> @@ -90,9 +90,10 @@ static int dcon_hw_init(struct dcon_priv *dcon,
> int is_init) 
>   /* SDRAM setup/hold time */
>   dcon_write(dcon, 0x3a, 0xc040);
> - dcon_write(dcon, 0x41, 0x);
> - dcon_write(dcon, 0x41, 0x0101);
> - dcon_write(dcon, 0x42, 0x0101);
> + dcon_write(dcon, DCON_REG_MEM_OPT_A, 0x);  /* clear
> option bits */
> + dcon_write(dcon, DCON_REG_MEM_OPT_A,
> + MEM_DLL_CLOCK_DELAY |
> MEM_POWER_DOWN);
> + dcon_write(dcon, DCON_REG_MEM_OPT_B, MEM_SOFT_RESET);
>  
>   /* Colour swizzle, AA, no passthrough, backlight */
>   if (is_init) {
> @@ -126,7 +127,7 @@ static int dcon_bus_stabilize(struct dcon_priv
> *dcon, int is_powered_down) power_up:
>   if (is_powered_down) {
>   x = 1;
> - x = olpc_ec_cmd(0x26, (unsigned char *)&x, 1, NULL,
> 0);
> + x = olpc_ec_cmd(EC_DCON_POWER_MODE, (u8 *)&x, 1,
> NULL, 0); if (x) {
>   pr_warn("unable to force dcon to power up:
> %d!\n", x); return x;
> @@ -144,7 +145,7 @@ power_up:
>   pr_err("unable to stabilize dcon's smbus,
> reasserting power and praying.\n");
> BUG_ON(olpc_board_at_least(olpc_board(0xc2))); x = 0;
> - olpc_ec_cmd(0x26, (unsigned char *)&x, 1, NULL, 0);
> + olpc_ec_cmd(EC_DCON_POWER_MODE, (u8 *)&x, 1, NULL,
> 0); msleep(100);
>   is_powered_down = 1;
>   goto power_up;  /* argh, stupid hardware.. */
> @@ -208,7 +209,7 @@ static void dcon_sleep(struct dcon_priv *dcon,
> bool sleep) 
>   if (sleep) {
>   x = 0;
> - x = olpc_ec_cmd(0x26, (unsigned char *)&x, 1, NULL,
> 0);
> + x = olpc_ec_cmd(EC_DCON_POWER_MODE, (u8 *)&x, 1,
> NULL, 0); if (x)
>   pr_warn("unable to force dcon to power down:
> %d!\n", x); else
> diff --git a/drivers/staging/olpc_dcon/olpc_dcon.h
> b/drivers/staging/olpc_dcon/olpc_dcon.h index 997bded..524ee49 100644
> --- a/drivers/staging/olpc_dcon/olpc_dcon.h
> +++ b/drivers/staging/olpc_dcon/olpc_dcon.h
> @@ -22,15 +22,24 @@
>  #define MODE_DEBUG            (1<<14)
>  #define MODE_SELFTEST         (1<<15)
>  
> -#define DCON_REG_HRES         2
> -#define DCON_REG_HTOTAL       3
> -#define DCON_REG_HSYNC_WIDTH  4
> -#define DCON_REG_VRES         5
> -#define DCON_REG_VTOTAL       6
> -#define DCON_REG_VSYNC_WIDTH  7
> -#define DCON_REG_TIMEOUT      8
> -#define DCON_REG_SCAN_INT     9
> -#define DCON_REG_BRIGHT       10
> +#define DCON_REG_HRES         0x2
> +#define DCON_REG_HTOTAL       0x3
> +#define DCON_REG_HSYNC_WIDTH  0x4
> +#define DCON_REG_VRES         0x5
> +#define DCON_REG_VTOTAL       0x6
> +#define DCON_REG_VSYNC_WIDTH  0x7
> +#define DCON_REG_TIMEOUT      0x8
> +#define DCON_REG_SCAN_INT     0x9
> +#define DCON_REG_BRIGHT       0x10
> +#define DCON_REG_MEM_OPT_A    0x41
> +#define DCON_REG_MEM_OPT_B    0x42
> +
> +/* Load Delay Locked Loop (DLL) settings for clock delay */
> +#define MEM_DLL_CLOCK_DELAY   (1<<0)
> +/* Memory controller power down function */
> +#define MEM_POWER_DOWN        (1<<8)
> +/* Memory controller software reset */
> +#define MEM_SOFT_RESET        (1<<0)
>  /* Status values */
>  
> diff --git a/include/linux/olpc-ec.h b/include/linux/olpc-ec.h
> index 5bb6e76..2925df3 100644
> --- a/include/linux/olpc-ec.h
> +++ b/include/linux/olpc-ec.h
> @@ -6,6 +6,7 @@
>  #define EC_WRITE_SCI_MASK             0x1b
>  #define EC_WAKE_UP_WLAN               0x24
>  #define EC_WLAN_LEAVE_RESET           0x25
> +#define EC_DCON_POWER_MODE            0x26
>  #define EC_READ_EB_MODE               0x2a
>  #define EC_SET_SCI_INHIBIT            0x32
>  #define EC_SET_SCI_INHIBIT_RELEASE    0x34


[PATCH] Staging: olpc_dcon: Already completed TODO entry removed

2013-08-03 Thread Jens Frederich
The TODO entry - drop global variables, use a proper olpc_dcon_priv
struct - is already finished. The driver has no global variables.
It uses the private structure 'dcon_priv'.

Signed-off-by: Jens Frederich 

diff --git a/drivers/staging/olpc_dcon/TODO b/drivers/staging/olpc_dcon/TODO
index f378e84..886e46e 100644
--- a/drivers/staging/olpc_dcon/TODO
+++ b/drivers/staging/olpc_dcon/TODO
@@ -3,7 +3,6 @@ TODO:
  share more code
- allow simultaneous XO-1 and XO-1.5 support
- console event notifier support
-   - drop global variables, use a proper olpc_dcon_priv struct
- audit code for unnecessary code; old unsupported prototype
  workarounds, ancient variables (noaa?), etc
- verify sane i2c API usage, update to new stuff if necessary
-- 
1.7.9.5



[PATCH] Staging: olpc_dcon: replace some magic numbers

2013-08-03 Thread Jens Frederich
This patch replaces some magic numbers. I believe it makes
the driver more readable.

The magic number 0x26 is the XO system embedded controller
(EC) command 'DCON power enable/disable'.

Numbers 0x41 and 0x42 are special memory controller settings
registers.  The 0x41 initialization bit sequence 0x101 means:
enable the memory power down function and the special SDRAM clock
delay that synchronizes SDRAM output and clock signal.

The 0x42 initialization sequence 0x101 is wrong.  According to
the specification, bit 8 is reserved and thus not in use.
I removed it.

Signed-off-by: Jens Frederich 

diff --git a/drivers/staging/olpc_dcon/olpc_dcon.c b/drivers/staging/olpc_dcon/olpc_dcon.c
index 7c460f2..5ca4fa4 100644
--- a/drivers/staging/olpc_dcon/olpc_dcon.c
+++ b/drivers/staging/olpc_dcon/olpc_dcon.c
@@ -90,9 +90,10 @@ static int dcon_hw_init(struct dcon_priv *dcon, int is_init)
 
/* SDRAM setup/hold time */
dcon_write(dcon, 0x3a, 0xc040);
-   dcon_write(dcon, 0x41, 0x);
-   dcon_write(dcon, 0x41, 0x0101);
-   dcon_write(dcon, 0x42, 0x0101);
+   dcon_write(dcon, DCON_REG_MEM_OPT_A, 0x);  /* clear option bits */
+   dcon_write(dcon, DCON_REG_MEM_OPT_A,
+   MEM_DLL_CLOCK_DELAY | MEM_POWER_DOWN);
+   dcon_write(dcon, DCON_REG_MEM_OPT_B, MEM_SOFT_RESET);
 
/* Colour swizzle, AA, no passthrough, backlight */
if (is_init) {
@@ -126,7 +127,7 @@ static int dcon_bus_stabilize(struct dcon_priv *dcon, int is_powered_down)
 power_up:
if (is_powered_down) {
x = 1;
-   x = olpc_ec_cmd(0x26, (unsigned char *)&x, 1, NULL, 0);
+   x = olpc_ec_cmd(EC_DCON_POWER_MODE, (u8 *)&x, 1, NULL, 0);
if (x) {
pr_warn("unable to force dcon to power up: %d!\n", x);
return x;
@@ -144,7 +145,7 @@ power_up:
pr_err("unable to stabilize dcon's smbus, reasserting power and praying.\n");
BUG_ON(olpc_board_at_least(olpc_board(0xc2)));
x = 0;
-   olpc_ec_cmd(0x26, (unsigned char *)&x, 1, NULL, 0);
+   olpc_ec_cmd(EC_DCON_POWER_MODE, (u8 *)&x, 1, NULL, 0);
msleep(100);
is_powered_down = 1;
goto power_up;  /* argh, stupid hardware.. */
@@ -208,7 +209,7 @@ static void dcon_sleep(struct dcon_priv *dcon, bool sleep)
 
if (sleep) {
x = 0;
-   x = olpc_ec_cmd(0x26, (unsigned char *)&x, 1, NULL, 0);
+   x = olpc_ec_cmd(EC_DCON_POWER_MODE, (u8 *)&x, 1, NULL, 0);
if (x)
pr_warn("unable to force dcon to power down: %d!\n", x);
else
diff --git a/drivers/staging/olpc_dcon/olpc_dcon.h b/drivers/staging/olpc_dcon/olpc_dcon.h
index 997bded..524ee49 100644
--- a/drivers/staging/olpc_dcon/olpc_dcon.h
+++ b/drivers/staging/olpc_dcon/olpc_dcon.h
@@ -22,15 +22,24 @@
 #define MODE_DEBUG            (1<<14)
 #define MODE_SELFTEST         (1<<15)
 
-#define DCON_REG_HRES         2
-#define DCON_REG_HTOTAL       3
-#define DCON_REG_HSYNC_WIDTH  4
-#define DCON_REG_VRES         5
-#define DCON_REG_VTOTAL       6
-#define DCON_REG_VSYNC_WIDTH  7
-#define DCON_REG_TIMEOUT      8
-#define DCON_REG_SCAN_INT     9
-#define DCON_REG_BRIGHT       10
+#define DCON_REG_HRES         0x2
+#define DCON_REG_HTOTAL       0x3
+#define DCON_REG_HSYNC_WIDTH  0x4
+#define DCON_REG_VRES         0x5
+#define DCON_REG_VTOTAL       0x6
+#define DCON_REG_VSYNC_WIDTH  0x7
+#define DCON_REG_TIMEOUT      0x8
+#define DCON_REG_SCAN_INT     0x9
+#define DCON_REG_BRIGHT       0x10
+#define DCON_REG_MEM_OPT_A    0x41
+#define DCON_REG_MEM_OPT_B    0x42
+
+/* Load Delay Locked Loop (DLL) settings for clock delay */
+#define MEM_DLL_CLOCK_DELAY   (1<<0)
+/* Memory controller power down function */
+#define MEM_POWER_DOWN        (1<<8)
+/* Memory controller software reset */
+#define MEM_SOFT_RESET        (1<<0)
 
 /* Status values */
 
diff --git a/include/linux/olpc-ec.h b/include/linux/olpc-ec.h
index 5bb6e76..2925df3 100644
--- a/include/linux/olpc-ec.h
+++ b/include/linux/olpc-ec.h
@@ -6,6 +6,7 @@
 #define EC_WRITE_SCI_MASK             0x1b
 #define EC_WAKE_UP_WLAN               0x24
 #define EC_WLAN_LEAVE_RESET           0x25
+#define EC_DCON_POWER_MODE            0x26
 #define EC_READ_EB_MODE               0x2a
 #define EC_SET_SCI_INHIBIT            0x32
 #define EC_SET_SCI_INHIBIT_RELEASE    0x34
-- 
1.7.9.5
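
The bit arithmetic behind the changelog can be checked with a few lines of
C. This is a sketch, not part of the patch: the MEM_* values are copied from
the hunk above, while the _OLD/_NEW suffixes and the helper functions are
invented here so that both brightness values can sit side by side. One thing
it makes visible is that the former decimal 10 for DCON_REG_BRIGHT appears
as 0x10 in the patch; 0x10 is 16, so this may be worth double-checking
against the register map (decimal 10 would be 0xa).

```c
#include <assert.h>

/* Bit definitions as given in the patch. */
#define MEM_DLL_CLOCK_DELAY	(1<<0)
#define MEM_POWER_DOWN		(1<<8)
#define MEM_SOFT_RESET		(1<<0)

/* Brightness register index before and after the patch.  The _OLD and
 * _NEW suffixes are made up for this comparison only. */
#define DCON_REG_BRIGHT_OLD	10
#define DCON_REG_BRIGHT_NEW	0x10

/* Value written to DCON_REG_MEM_OPT_A (0x41) after the patch. */
static unsigned int mem_opt_a_init(void)
{
	return MEM_DLL_CLOCK_DELAY | MEM_POWER_DOWN;
}

/* Value written to DCON_REG_MEM_OPT_B (0x42) after the patch. */
static unsigned int mem_opt_b_init(void)
{
	return MEM_SOFT_RESET;
}

static void check_dcon_constants(void)
{
	assert(mem_opt_a_init() == 0x0101);	/* same bits as the old literal */
	assert(mem_opt_b_init() == 0x0001);	/* reserved bit 8 no longer set */
	assert(DCON_REG_BRIGHT_NEW == 16);	/* 0x10 is 16, not 10 */
	assert(DCON_REG_BRIGHT_OLD == 0xa);	/* decimal 10 is 0xa */
}
```

So the write to 0x41 is unchanged in value, the write to 0x42 drops bit 8 as
the changelog describes, and the brightness index is the one define whose
value differs between the removed and the added lines.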



Re: sparc64 WARNING: at mm/mmap.c:2757 exit_mmap+0x13c/0x160()

2013-08-03 Thread David Miller
From: Aaro Koskinen 
Date: Mon, 17 Jun 2013 08:58:39 +0300

> On Mon, Jun 17, 2013 at 08:32:25AM +0300, Aaro Koskinen wrote:
>> On Mon, Jun 17, 2013 at 12:06:00AM +0300, Meelis Roos wrote:
>> > Got this in 3.10-rc6 while testing debian unstable upgrade with aptitude. 
>> > 3.10-rc5 did not exhibit this (nor any other kernel recently tried, 
>> > including most -rc's). Does not seem to be reproducible.
>> 
>> I get this regularly on Ultrasparc during long compilations. It's been
>> there with all recent kernels (probably at least since 3.8). Latest I
>> saw with 3.10-rc5.
> 
> Two examples:

Thanks for the reports, I'm actively looking into this.


Re: [PATCH 1/3] acpi: video: trivial costmetic cleanups

2013-08-03 Thread Felipe Contreras
On Sat, Aug 3, 2013 at 6:38 AM, Rafael J. Wysocki  wrote:
> On Friday, August 02, 2013 08:34:29 PM Felipe Contreras wrote:
>> On Fri, Aug 2, 2013 at 7:07 PM, Rafael J. Wysocki  wrote:
>> > On Friday, August 02, 2013 12:52:18 PM Felipe Contreras wrote:
>> >> On Fri, Aug 2, 2013 at 9:05 AM, Rafael J. Wysocki  wrote:
>> >> > On Thursday, August 01, 2013 11:15:38 PM Felipe Contreras wrote:
>> >> >> On Thu, Aug 1, 2013 at 8:50 PM, Aaron Lu  wrote:
>> >> >> > On 08/02/2013 07:43 AM, Felipe Contreras wrote:
>> >> >> >> Signed-off-by: Felipe Contreras 
>> >> >> >
>> >> >> > Please add change log explaining what you have changed.
>> >> >> > It seems that the patch modify comment style only, some add a space and
>> >> >> > some change spaces to tab, is it the case?
>> >> >>
>> >> >> The commit message already explains what the change is; trivial
>> >> >> cosmetic cleanups. Cosmetic means it's completely superficial.
>> >> >
>> >> > And I have a rule not to apply patches without changelogs.  So either I'll
>> >> > need to write it for you, or can you add one pretty please?
>> >>
>> >> The commit message is right there. Maybe Jiri can apply it then, if
>> >> not, then stay happy with your untidy code.
>> >
>> > First of all, I didn't say I wouldn't apply the patch, did I?
>> >
>> > Second, I asked you *nicely* to add a changelog so that I don't need to write
>> > it for you.
>> >
>> > I don't know what made it difficult to understand.
>> >
>> > Anyway, I ask everyone to write changelogs and nobody has had any problems with
>> > that so far.  I don't see why I should avoid asking you to follow the rules
>> > that everybody else is asked to follow.  If those rules are too difficult for
>> > you to follow, I'm sorry.
>>
>> The patch has a commit message that describes exactly what it does.
>
> No, it doesn't describe it exactly.  You're contradicting facts.
>
>> Unless there is valid feedback I will not send another version.
>>
>> To me, a valid criticism to the commit message would be: "I read X,
>> but I thought it would do Y". For example; "I didn't expect the patch
>> to do white-space cleanups", but I think that's exactly what people
>> expect when they read "trivial costmetic cleanups'.
>
> If what you're saying was correct, then it would be sufficient to use a
> "this patch fixes a bug" commit message for every bug fix, but quite evidently
> that is not the case.

No, it wouldn't be sufficient, take a look at Corbet's list that you
yourself mentioned:

* the original motivation for the work is quickly forgotten

"this patch fixes a bug" doesn't describe the motivation.

* Andrew Morton also famously pushes developers to document the
reasons explaining why a patch was written, including the user-visible
effects of any bugs fixed

The reason for the patch is not documented, nor the user-visible effects.

* Kernel developers do not like having to reverse engineer the intent
of a patch years after the fact.

The intent of the patch is not mentioned.

That is completely different with my patch.

Personally I like to answer these questions: What is the patch doing
(motivation)? What is the current problem? What is the change?
What are the side-effects?

All those are clear with "trivial costmetic cleanups", they are not
with "this patch fixes a bug".

I think you are committing a hasty generalization fallacy. Not all
patches are the same.

-- 
Felipe Contreras


Re: [PATCH] acpi: video: improve quirk check

2013-08-03 Thread Felipe Contreras
On Sat, Aug 3, 2013 at 6:34 AM, Rafael J. Wysocki  wrote:
> On Saturday, August 03, 2013 04:14:04 PM Aaron Lu wrote:

>> Yes, the patch is correct, but I still prefer my own version :-)
>> https://github.com/aaronlu/linux/commit/0a3d2c5b59caf80ae5bb1ca1fda0f7bf448b38c9
>>
>> In case you want to take mine and mine needs refresh, please let me know
>> and I can do the re-base, thanks.
>
> Well, I prefer simpler, unless there's a good reason to use more complicated.

Note that these are not exclusionary; his patch can be applied on top
of mine. I don't think his patch is needed though.

-- 
Felipe Contreras


Re: [GIT PULL] ACPI and power management fixes for v3.11-rc4

2013-08-03 Thread Felipe Contreras
On Sat, Aug 3, 2013 at 11:08 AM, Igor Gnatenko
 wrote:

> I am opposed to this patch. On ThinkPad X230 I had problems with it.
> Felipe, come over to dark side. They have cookies.

And v3.11-rc3 works fine out of the box?

-- 
Felipe Contreras


Re: [GIT PULL] ACPI and power management fixes for v3.11-rc4

2013-08-03 Thread Felipe Contreras
On Sat, Aug 3, 2013 at 6:54 AM, Rafael J. Wysocki  wrote:
> On Friday, August 02, 2013 08:48:09 PM Felipe Contreras wrote:

>> Yes, that's fine, either the revert, or the patch I mentioned, or
>> something else, but something has to be done, and it was better to do
>> it in v3.11-rc4 than in v3.11-rc5, because that change itself can
>> cause further problems.
>
> A revert can be done in -rc5 just fine.  If we don't have a working fix this
> week, I'll do the revert.

I think you are waiting for miracles. But whatever.

>> Let's do a thought experiment, let's say you are right, and they can
>> survive the 6 months it would take you to find the "perfect" solution,
>> say in v3.13. What's wrong with having a partial solution in v3.12? If
>> the blacklisting doesn't work properly (there's absolutely no evidence
>> for that), then you revert it on v3.12.1.
>>
>> What's wrong with that approach?
>
> If the blacklisting leads to problems, they may not be reported in the 3.12
> time frame, but much later.  For example because people won't realize that
> the problems are caused by the blacklisting until much much later.  And then
> we'll be in a spot where whatever we do will break things for someone.

The key word is *may*; you don't *know*. Why do you insist on
committing this reification fallacy?

This threat is not real, it's theoretical.

But let's suppose you are right, and there are issues, and those don't
get reported in v3.12. That is actually GOOD, if people don't report
issues, it means the issues are not that big, or not even *there*.

And no, we won't be in a spot where whatever we do will break things,
because if the intel backlight driver works correctly, that solution
would work for everyone. And if it doesn't, we should stay with what
works best.

> And we had situations like that in the past, which is the source of my
> concern.
> You obviously don't have that experience, or you won't be so eager to inflict
> the blacklisting on everyone.
>
> Anyway, as you know, but conveniently don't mention, I asked some experienced
> people for opinions about that.  If they agree with you, we will add the
> blacklist.  If they don't, we won't add it.

Again, screw the users. We are stuck with broken backlight for several
more months to come. Great.

-- 
Felipe Contreras


Re: [PATCH resend] drop_caches: add some documentation and info message

2013-08-03 Thread KOSAKI Motohiro
>>> You missed the "!".  I'm proposing that setting the new bit 2 will
>>> permit people to prevent the new printk if it is causing them problems.
>>
>> No I don't. I'm sure almost all abuse users think our usage is correct. Then,
>> I can imagine all crazy applications start to use this flag eventually.
> 
> I guess we do not care about those. If somebody wants to shoot his feet
> then we cannot do much about it. The primary motivation was to find out
> those that think this is right and they are willing to change the setup
> once they know this is not the right way to do things.
> 
> I think that giving a way to suppress the warning is a good step. Log
> level might be too coarse and sysctl would be an overkill.

When Dave Hansen originally reported this issue, he explained that a lot of
userland developers misuse /proc/drop_caches because they don't understand
what drop_caches does.
So, if they never understand that, why can we trust them? I have no
idea.
Or, if your motivation differs from Dave's, please let me know.

While the purpose is to catch misuse, I don't think we can trust userland apps.
If "If somebody wants to shoot his feet then we cannot do much about it." is
true, this patch is useless. OK, we would still catch the legitimate users.
But we never wanted to know who the legitimate users are, right?




Re: [RFC v2 4/5] Only set page reserved in the memblock region

2013-08-03 Thread Nathan Zimmer
On Fri, Aug 02, 2013 at 12:44:26PM -0500, Nathan Zimmer wrote:
> Currently each page struct is set as reserved upon initialization.
> This changes to starting with the reserved bit clear and
> then only setting the bit in the reserved region.
> 
> I could restructure a bit to eliminate the performance hit.  But I wanted
> to make sure I am on track first.
> 
> Signed-off-by: Robin Holt 
> Signed-off-by: Nathan Zimmer 
> To: "H. Peter Anvin" 
> To: Ingo Molnar 
> Cc: Linux Kernel 
> Cc: Linux MM 
> Cc: Rob Landley 
> Cc: Mike Travis 
> Cc: Daniel J Blueman 
> Cc: Andrew Morton 
> Cc: Greg KH 
> Cc: Yinghai Lu 
> Cc: Mel Gorman 
> ---
>  include/linux/mm.h |  2 ++
>  mm/nobootmem.c |  3 +++
>  mm/page_alloc.c| 16 
>  3 files changed, 17 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index e0c8528..b264a26 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1322,6 +1322,8 @@ static inline void adjust_managed_page_count(struct page *page, long count)
>   totalram_pages += count;
>  }
>  
> +extern void reserve_bootmem_region(unsigned long start, unsigned long end);
> +
>  /* Free the reserved page into the buddy system, so it gets managed. */
>  static inline void __free_reserved_page(struct page *page)
>  {
> diff --git a/mm/nobootmem.c b/mm/nobootmem.c
> index 2159e68..0840af2 100644
> --- a/mm/nobootmem.c
> +++ b/mm/nobootmem.c
> @@ -117,6 +117,9 @@ static unsigned long __init free_low_memory_core_early(void)
>   phys_addr_t start, end, size;
>   u64 i;
>  
> + for_each_reserved_mem_region(i, &start, &end)
> + reserve_bootmem_region(start, end);
> +
>   for_each_free_mem_range(i, MAX_NUMNODES, &start, &end, NULL)
>   count += __free_memory_core(start, end);
>  
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index df3ec13..382223e 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -697,17 +697,18 @@ static void free_one_page(struct zone *zone, struct page *page, int order,
>   spin_unlock(&zone->lock);
>  }
>  
> -static void __init_single_page(unsigned long pfn, unsigned long zone, int nid)
> +static void __init_single_page(unsigned long pfn, unsigned long zone,
> +int nid, int page_count)
>  {
>   struct page *page = pfn_to_page(pfn);
>   struct zone *z = &NODE_DATA(nid)->node_zones[zone];
>  
>   set_page_links(page, zone, nid, pfn);
>   mminit_verify_page_links(page, zone, nid, pfn);
> - init_page_count(page);
>   page_mapcount_reset(page);
>   page_nid_reset_last(page);
> - SetPageReserved(page);
> + set_page_count(page, page_count);
> + ClearPageReserved(page);
>  
>   /*
>* Mark the block movable so that blocks are reserved for
> @@ -736,6 +737,13 @@ static void __init_single_page(unsigned long pfn, unsigned long zone, int nid)
>  #endif
>  }
>  
> +void reserve_bootmem_region(unsigned long start, unsigned long end)
> +{
> + for (; start < end; start++)
> + if (pfn_valid(start))
> + SetPageReserved(pfn_to_page(start));
> +}
> +
>  static bool free_pages_prepare(struct page *page, unsigned int order)
>  {
>   int i;
> @@ -4010,7 +4018,7 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
>   if (!early_pfn_in_nid(pfn, nid))
>   continue;
>   }
> - __init_single_page(pfn, zone, nid);
> + __init_single_page(pfn, zone, nid, 1);
>   }
>  }
>  
> -- 
> 1.8.2.1
> 
Actually I believe reserve_bootmem_region is wrong.  I am passing in
phys_addr_t and not pfns.

It should be: 
void reserve_bootmem_region(unsigned long start, unsigned long end)
{
	unsigned long start_pfn = PFN_DOWN(start);
	unsigned long end_pfn = PFN_UP(end);

	for (; start_pfn < end_pfn; start_pfn++)
		if (pfn_valid(start_pfn))
			SetPageReserved(pfn_to_page(start_pfn));
}

That also brings the timings back in line with the previous patch set.
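
To make the address-versus-pfn distinction concrete, here is a minimal userspace sketch. It assumes 4 KiB pages (PAGE_SHIFT of 12); the two macros are written to mirror the kernel's PFN_DOWN()/PFN_UP() helpers, and count_reserved_pfns() is a hypothetical helper that exists only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption for this sketch: 4 KiB pages. */
#define PAGE_SHIFT 12
#define PAGE_SIZE  (1ULL << PAGE_SHIFT)

/* Userspace stand-ins modelled on the kernel's pfn helpers. */
#define PFN_UP(x)   (((x) + PAGE_SIZE - 1) >> PAGE_SHIFT)
#define PFN_DOWN(x) ((x) >> PAGE_SHIFT)

/*
 * Hypothetical helper: how many page frames the physical byte range
 * [start, end) touches. The start is rounded down and the end is
 * rounded up, so partially covered pages are included.
 */
uint64_t count_reserved_pfns(uint64_t start, uint64_t end)
{
    uint64_t start_pfn = PFN_DOWN(start);
    uint64_t end_pfn = PFN_UP(end);

    assert(start <= end);
    return end_pfn - start_pfn;
}
```

Treating the raw addresses as pfns would make the loop walk a range PAGE_SIZE times too large, which would be consistent with the slowdown seen before the fix.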

Nate


[GIT] Networking

2013-08-03 Thread David Miller

1) Don't ignore user initiated wireless regulatory settings on cards
   with custom regulatory domains, from Arik Nemtsov.

2) Fix length check of bluetooth information responses, from Jaganath
   Kanakkassery.

3) Fix misuse of PTR_ERR in btusb, from Adam Lee.

4) Handle rfkill properly while iwlwifi devices are offline,
   from Emmanuel Grumbach.

5) Fix r815x devices DMA'ing to stack buffers, from Hayes Wang.

6) Kernel info leak in ATM packet scheduler, from Dan Carpenter.

7) 8139cp doesn't check for DMA mapping errors, from Neil Horman.

8) Fix bridge multicast code to not snoop when no querier exists,
   otherwise multicast traffic is lost.  From Linus Lüssing.

9) Avoid soft lockups in fib6_run_gc(), from Michal Kubecek.

10) Fix races in automatic address assignment on ipv6, which can
    result in incorrect lifetime assignments.  From Jiri Benc.

11) Cure build bustage when CONFIG_NET_LL_RX_POLL is not set and
    rename it CONFIG_NET_RX_BUSY_POLL to eliminate the last reference
    to the original naming of this feature.  From Cong Wang.

12) Fix crash in TIPC when server socket creation fails, from Ying
    Xue.

13) macvlan_changelink() silently succeeds when it shouldn't, from
    Michael S. Tsirkin.

14) HTB packet scheduler can crash due to sign extension, fix from
    Stephen Hemminger.

15) With the cable unplugged, r8169 prints out a message every 10
    seconds, make it netif_dbg() instead of netif_warn().  From
    Peter Wu.

16) Fix memory leak in rtm_to_ifaddr(), from Daniel Borkmann.

17) sis900 gets spurious TX queue timeouts due to mismanagement of
    link carrier state, from Denis Kirjanov.

18) Validate somaxconn sysctl to make sure it fits inside of a
    u16.  From Roman Gushchin.

19) Fix MAC address filtering on qlcnic, from Shahed Shaikh.

Please pull, thanks a lot!

The following changes since commit 06693f305e60202d2795a10bee7fb7da23bc2acc:

  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net (2013-07-31 12:56:18 -0700)

are available in the git repository at:


  git://git.kernel.org/pub/scm/linux/kernel/git/davem/net master

for you to fetch changes up to 4bd8e7385961932d863ea976a67f384c3a8302cb:

  qlcnic: Fix for flash update failure on 83xx adapter (2013-08-03 12:03:04 -0700)


AceLan Kao (2):
  Bluetooth: Add support for Atheros [0cf3:3121]
  Bluetooth: Add support for Atheros [0cf3:e003]

Adam Lee (1):
  Bluetooth: fix wrong use of PTR_ERR() in btusb

Arend van Spriel (1):
  brcmfmac: inform cfg80211 about disconnect when device is unplugged

Arik Nemtsov (1):
  regulatory: use correct regulatory initiator on wiphy register

Avinash Patil (2):
  mwifiex: check for bss_role instead of bss_mode for STA operations
  mwifiex: fix wrong data rates in P2P client

Chun-Yeow Yeoh (1):
  mac80211: prevent the buffering or frame transmission to non-assoc mesh STA

Cong Wang (2):
  net: fix a compile error when CONFIG_NET_LL_RX_POLL is not set
  net: rename CONFIG_NET_LL_RX_POLL to CONFIG_NET_RX_BUSY_POLL

Dan Carpenter (1):
  net_sched: info leak in atm_tc_dump_class()

Daniel Borkmann (1):
  net: rtm_to_ifaddr: free ifa if ifa_cacheinfo processing fails

David S. Miller (1):
  Merge branch 'for-davem' of git://git.kernel.org/.../linville/wireless

David Spinadel (1):
  iwlwifi: mvm: set SSID bits for passive channels

Denis Kirjanov (1):
  sis900: Fix the tx queue timeout issue

Emmanuel Grumbach (3):
  iwlwifi: add DELL SKU for 5150 HMC
  iwlwifi: pcie: reset the NIC before the bring up
  iwlwifi: pcie: clear RFKILL interrupt in AMPG

Felipe Balbi (1):
  net: ethernet: cpsw: drop IRQF_DISABLED

Frederic Danis (1):
  NFC: Fix NCI over SPI build

Geert Uytterhoeven (1):
  ath10k: ATH10K should depend on HAS_DMA

Gustavo Padovan (1):
  Bluetooth: Fix race between hci_register_dev() and hci_dev_open()

Himanshu Madhani (2):
  qlcnic: Free up memory in error path.
  qlcnic: Fix for flash update failure on 83xx adapter

Ilan Peer (1):
  iwlwifi: mvm: Disable managed PS when GO is added

Jack Morgenstein (1):
  net/mlx4_core: VFs must ignore the enable_64b_cqe_eqe module param

Jaganath Kanakkassery (1):
  Bluetooth: Fix invalid length check in l2cap_information_rsp()

Jiri Benc (1):
  ipv6: prevent race between address creation and removal

Jiri Pirko (1):
  ipv6: move peer_addr init into ipv6_add_addr()

Joe Perches (1):
  ndisc: Add missing inline to ndisc_addr_option_pad

Johan Hedberg (2):
  Bluetooth: Fix HCI init for BlueFRITZ! devices
  Bluetooth: Fix calling request callback more than once

Johannes Berg (2):
  iwlwifi: mvm: use only a single GTK in D3
  iwlwifi: mvm: fix flushing not started aggregation sessions

John W. Linville (6):
  Merge branch 'for-linville-current' of git://github.com/kvalo/ath
  Merge branch 'for-john' of git://git.kernel.org

Re: [PATCH V2 1/9] perf tools: add test for reading object code

2013-08-03 Thread Adrian Hunter

On 31/07/2013 8:46 p.m., Arnaldo Carvalho de Melo wrote:
> On Wed, Jul 31, 2013 at 02:28:33PM -0300, Arnaldo Carvalho de Melo wrote:
>> Still investigating, but the attached patch is needed to handle such
>> failure cases:
>>
>> [root@zoo ~]# perf test 21
>> 21: Test object code reading   : FAILED!
>> [root@zoo ~]# perf test -v 21
>
> Lowering the freq to 4kHz gets me to where I think you were at this
> point:
>
> [root@zoo ~]# perf test -v 21
> 21: Test object code reading   :
> --- start ---
> Looking at the vmlinux_path (6 entries long)
> symsrc__init: cannot get elf header.
> Using /lib/modules/3.11.0-rc2+/build/vmlinux for symbols
> Parsing event 'cycles'
> Reading object code for memory address: 0x8101ce7d
> File is: /lib/modules/3.11.0-rc2+/build/vmlinux
> On file address is: 0x8101ce7d
> dso__data_read_offset failed
> ---- end ----
> Test object code reading: FAILED!
> [root@zoo ~]#
>
> I.e. we need the follow on patches to fix this issue, right?

Yes.  It is using an "identity" map, so the memory address and on-file
address are the same, which doesn't work of course.

> I'll merge my changes with your first patch and continue from there.

Please take V3.


Re: [PATCH] Add per-process flag to control thp

2013-08-03 Thread Kees Cook
On Fri, Aug 2, 2013 at 1:34 PM, Alex Thorlton  wrote:
>> What kind of workloads are you talking about?
>
> Our benchmarking team has a list of several of the SPEC OMP benchmarks
> that perform significantly better when THP is disabled. I tried to get
> the list but one of our servers is acting up and I can't get to it
> right now :/
>
>> What's wrong with madvise? Could you elaborate?
>
> The main issue with using madvise is that it's not an option with static
> binaries, but there are also some users who have legacy Fortran code
> that they're not willing/able to change.
>
>> And I think thp_disabled should be reset to 0 on exec.
>
> The main purpose for this getting carried down from the parent process
> is that we'd like to be able to have a userland program set this flag on
> itself, and then spawn off children who will also carry the flag.
> This allows us to set the flag for programs where we're unable to modify
> the code, thus resolving the issues detailed above.

This could be done with LD_PRELOAD for uncontrolled binaries. Though
it does seem sensible to make it act like most personality flags do
(i.e. inherited).
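
For comparison, the per-region madvise() route mentioned earlier in the thread looks roughly like the sketch below. It uses the existing MADV_NOHUGEPAGE advice and has to be compiled into the program itself, which is the limitation raised for static binaries and legacy code. region_disable_thp() is a hypothetical wrapper name, and the per-process flag proposed by the patch is deliberately not shown.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Ask the kernel not to back [addr, addr + len) with transparent huge
 * pages. Returns 0 on success or a negative errno value, for example
 * -EINVAL on a kernel built without THP support.
 */
int region_disable_thp(void *addr, size_t len)
{
    assert(addr != NULL && len > 0);
#ifdef MADV_NOHUGEPAGE
    if (madvise(addr, len, MADV_NOHUGEPAGE) != 0)
        return -errno;
    return 0;
#else
    /* Headers without the advice value: report it as unsupported. */
    return -ENOSYS;
#endif
}
```

Every mapping that should opt out needs its own call, so an inherited flag or an LD_PRELOAD shim is the only way to reach code that cannot be rebuilt.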

-Kees

-- 
Kees Cook
Chrome OS Security


Re: [PATCH v2 2/3] exec: kill "int depth" in search_binary_handler()

2013-08-03 Thread Kees Cook
On Sat, Aug 3, 2013 at 11:55 AM, Oleg Nesterov  wrote:
> On 08/03, Kees Cook wrote:
>>
>> On Thu, Aug 1, 2013 at 12:05 PM, Oleg Nesterov  wrote:
>> > Nobody except search_binary_handler() should touch ->recursion_depth,
>> > "int depth" buys nothing but complicates the code, kill it.
>>
>> I'd like to see a comment added to binfmts.h's recursion_depth field
>> that reminds people that recursion_depth is for
>> search_binary_handler()'s use only, and a binfmt loader shouldn't
>> touch it.
>
> And this comment probably makes sense even without this change

Yeah totally agreed -- I should have added this when I reorganized the
depth handling earlier. :)

>> Besides that, yeah, sensible clean up.
>
> OK, thanks, please see v2. The only change is the comment in .h
>
> --
> Subject: [PATCH 1/1] exec: kill "int depth" in search_binary_handler()
>
> Nobody except search_binary_handler() should touch ->recursion_depth,
> "int depth" buys nothing but complicates the code, kill it.
>
> Probably we should also kill "fn" and the !NULL check, ->load_binary
> should be always defined. And it can not go away after read_unlock()
> or this code is buggy anyway.
>
> v2: add the comment about linux_binprm->recursion_depth
>
> Signed-off-by: Oleg Nesterov 

Acked-by: Kees Cook 

Thanks,

-Kees

> ---
>  fs/exec.c   |9 -
>  include/linux/binfmts.h |2 +-
>  2 files changed, 5 insertions(+), 6 deletions(-)
>
> diff --git a/fs/exec.c b/fs/exec.c
> index a9ae4f2..f32079c 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1370,12 +1370,11 @@ EXPORT_SYMBOL(remove_arg_zero);
>   */
>  int search_binary_handler(struct linux_binprm *bprm)
>  {
> -   unsigned int depth = bprm->recursion_depth;
> -   int try,retval;
> +   int try, retval;
> struct linux_binfmt *fmt;
>
> /* This allows 4 levels of binfmt rewrites before failing hard. */
> -   if (depth > 5)
> +   if (bprm->recursion_depth > 5)
> return -ELOOP;
>
> retval = security_bprm_check(bprm);
> @@ -1396,9 +1395,9 @@ int search_binary_handler(struct linux_binprm *bprm)
> if (!try_module_get(fmt->module))
> continue;
> read_unlock(&binfmt_lock);
> -   bprm->recursion_depth = depth + 1;
> +   bprm->recursion_depth++;
> retval = fn(bprm);
> -   bprm->recursion_depth = depth;
> +   bprm->recursion_depth--;
> if (retval >= 0) {
> put_binfmt(fmt);
> allow_write_access(bprm->file);
> diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
> index 70cf138..e8112ae 100644
> --- a/include/linux/binfmts.h
> +++ b/include/linux/binfmts.h
> @@ -31,7 +31,7 @@ struct linux_binprm {
>  #ifdef __alpha__
> unsigned int taso:1;
>  #endif
> -   unsigned int recursion_depth;
> +   unsigned int recursion_depth; /* only for search_binary_handler() */
> struct file * file;
> struct cred *cred;  /* new credentials */
> int unsafe; /* how unsafe this exec is (mask of LSM_UNSAFE_*) */
> --
> 1.5.5.1
>
>



-- 
Kees Cook
Chrome OS Security


Re: [PATCH 0/5] exec: more cleanups

2013-08-03 Thread Kees Cook
On Fri, Aug 2, 2013 at 12:27 PM, Oleg Nesterov  wrote:
> On top of "[PATCH 0/3] exec: minor cleanups + minor fix" I sent
> yesterday.
>
> Perhaps too many patches for the poor search_binary_handler(),
> but I do not know how to document the changes if I join them.
>
> Oleg.
>
>  fs/exec.c |   82 ++--
>  1 files changed, 36 insertions(+), 46 deletions(-)
>

This all looks really good. Thanks for the cleanups! Besides the one
comment on patch 1, consider the series:

Acked-by: Kees Cook 

Thanks,

-Kees

-- 
Kees Cook
Chrome OS Security


Re: [PATCH 1/5] exec: move allow_write_access/fput to exec_binprm()

2013-08-03 Thread Kees Cook
On Fri, Aug 2, 2013 at 12:27 PM, Oleg Nesterov  wrote:
> When search_binary_handler() succeeds it does allow_write_access()
> and fput(), then it clears bprm->file to ensure the caller will not
> do the same.
>
> We can simply move this code to exec_binprm() which is called only
> once. In fact we could move this to free_bprm() and remove the same
> code in do_execve_common's error path.
>
> Signed-off-by: Oleg Nesterov 
> ---
>  fs/exec.c |9 +
>  1 files changed, 5 insertions(+), 4 deletions(-)
>
> diff --git a/fs/exec.c b/fs/exec.c
> index ad7d624..ef70320 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1400,10 +1400,6 @@ int search_binary_handler(struct linux_binprm *bprm)
> bprm->recursion_depth--;
> if (retval >= 0) {
> put_binfmt(fmt);
> -   allow_write_access(bprm->file);
> -   if (bprm->file)
> -   fput(bprm->file);
> -   bprm->file = NULL;
> return retval;
> }
> read_lock(&binfmt_lock);
> @@ -1455,6 +1451,11 @@ static int exec_binprm(struct linux_binprm *bprm)
> ptrace_event(PTRACE_EVENT_EXEC, old_vpid);
> current->did_exec = 1;
> proc_exec_connector(current);
> +
> +   if (bprm->file) {
> +   allow_write_access(bprm->file);
> +   fput(bprm->file);
> +   }

Why not keep the bprm->file = NULL assignment? Seems reasonable to
keep that just to avoid use-after-free accidents.
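
As a rough userspace illustration of why clearing the pointer helps (all names here are hypothetical models, not the kernel's types): once the reference is dropped and the field is reset, a second release attempt becomes a harmless no-op instead of touching freed memory.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins for struct file and struct linux_binprm. */
struct file_model {
    int refcount;
};

struct binprm_model {
    struct file_model *file;
};

/* Drop one reference; free the object when the last one goes away. */
static void put_file_model(struct file_model *f)
{
    if (--f->refcount == 0)
        free(f);
}

/* Release the file and clear the pointer so later users see NULL. */
void release_binprm_file(struct binprm_model *bprm)
{
    assert(bprm != NULL);
    if (bprm->file) {
        put_file_model(bprm->file);
        bprm->file = NULL;
    }
}
```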

-Kees

> }
>
> return ret;
> --
> 1.5.5.1
>



-- 
Kees Cook
Chrome OS Security


Re: [PATCH] perf tools: Renaming 'time' variable in perf_time_to_tsc due to name shadowing error

2013-08-03 Thread Adrian Hunter

On 2/08/2013 4:33 p.m., Jiri Olsa wrote:

The perf compilation fails with following error:
   ...
   CC arch/x86/util/tsc.o
   arch/x86/util/tsc.c: In function ‘perf_time_to_tsc’:
   arch/x86/util/tsc.c:13:6: error: declaration of ‘time’ shadows a global declaration [-Werror=shadow]
   cc1: all warnings being treated as errors

Renaming the 'time' variable to prevent this.


Did you see that David sent the same patch?

David noted the gcc version, though.  It doesn't happen for gcc 4.7.3.
The commit message should probably reflect that it depends on the gcc
version.


Otherwise:

Acked-by: Adrian Hunter 




Signed-off-by: Jiri Olsa
Cc: Arnaldo Carvalho de Melo
Cc: Corey Ashford
Cc: Ingo Molnar
Cc: Paul Mackerras
Cc: Peter Zijlstra
Cc: Arnaldo Carvalho de Melo
Cc: David Ahern
Cc: Adrian Hunter
---
  tools/perf/arch/x86/util/tsc.c |8 
  1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/tools/perf/arch/x86/util/tsc.c b/tools/perf/arch/x86/util/tsc.c
index f111744..9570c2b 100644
--- a/tools/perf/arch/x86/util/tsc.c
+++ b/tools/perf/arch/x86/util/tsc.c
@@ -10,11 +10,11 @@

  u64 perf_time_to_tsc(u64 ns, struct perf_tsc_conversion *tc)
  {
-   u64 time, quot, rem;
+   u64 t, quot, rem;

-   time = ns - tc->time_zero;
-   quot = time / tc->time_mult;
-   rem  = time % tc->time_mult;
+   t = ns - tc->time_zero;
+   quot = t / tc->time_mult;
+   rem  = t % tc->time_mult;
return (quot << tc->time_shift) +
   (rem << tc->time_shift) / tc->time_mult;
  }




Re: [PATCH 3/3] exec: proc_exec_connector() should be called only once

2013-08-03 Thread Kees Cook
On Thu, Aug 1, 2013 at 12:05 PM, Oleg Nesterov  wrote:
> A separate one-liner with the minor fix.
>
> PROC_EVENT_EXEC reports the "exec" event, but this message
> is sent at least twice if search_binary_handler() is called
> by ->load_binary() recursively, say, load_script().
>
> Move it to exec_binprm(), this is "depth == 0" code too.
>
> Signed-off-by: Oleg Nesterov 

Yeah, looks right. I almost mentioned this while reading the other
cleanup patch, but then you fixed it too. :)

Acked-by: Kees Cook 

Thanks,

-Kees

-- 
Kees Cook
Chrome OS Security


Re: [PATCH 1/3] exec: introduce exec_binprm() for "depth == 0" code

2013-08-03 Thread Kees Cook
On Thu, Aug 1, 2013 at 12:05 PM, Oleg Nesterov  wrote:
> task_pid_nr_ns() and trace/ptrace code in the middle of the
> recursive search_binary_handler() looks confusing and imho
> annoying. We only need this code if "depth == 0", lets add
> a simple helper which calls search_binary_handler() and does
> trace_sched_process_exec() + ptrace_event().
>
> The patch also moves the setting of task->did_exec, we need
> to do this only once.
>
> Note: we can kill either task->did_exec or PF_FORKNOEXEC.
>
> Signed-off-by: Oleg Nesterov 
> ---
>  fs/exec.c |   36 ++--
>  1 files changed, 22 insertions(+), 14 deletions(-)
>
> diff --git a/fs/exec.c b/fs/exec.c
> index 9c73def..a9ae4f2 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1373,7 +1373,6 @@ int search_binary_handler(struct linux_binprm *bprm)
> unsigned int depth = bprm->recursion_depth;
> int try,retval;
> struct linux_binfmt *fmt;
> -   pid_t old_pid, old_vpid;
>
> /* This allows 4 levels of binfmt rewrites before failing hard. */
> if (depth > 5)
> @@ -1387,12 +1386,6 @@ int search_binary_handler(struct linux_binprm *bprm)
> if (retval)
> return retval;
>
> -   /* Need to fetch pid before load_binary changes it */
> -   old_pid = current->pid;
> -   rcu_read_lock();
> -   old_vpid = task_pid_nr_ns(current, task_active_pid_ns(current->parent));
> -   rcu_read_unlock();
> -
> retval = -ENOENT;
> for (try=0; try<2; try++) {
> read_lock(&binfmt_lock);
> @@ -1407,16 +1400,11 @@ int search_binary_handler(struct linux_binprm *bprm)
> retval = fn(bprm);
> bprm->recursion_depth = depth;
> if (retval >= 0) {
> -   if (depth == 0) {
> -   trace_sched_process_exec(current, old_pid, bprm);
> -   ptrace_event(PTRACE_EVENT_EXEC, old_vpid);
> -   }
> put_binfmt(fmt);
> allow_write_access(bprm->file);
> if (bprm->file)
> fput(bprm->file);
> bprm->file = NULL;
> -   current->did_exec = 1;
> proc_exec_connector(current);
> return retval;
> }
> @@ -1450,9 +1438,29 @@ int search_binary_handler(struct linux_binprm *bprm)
> }
> return retval;
>  }
> -
>  EXPORT_SYMBOL(search_binary_handler);
>
> +static int exec_binprm(struct linux_binprm *bprm)
> +{
> +   pid_t old_pid, old_vpid;
> +   int ret;
> +
> +   /* Need to fetch pid before load_binary changes it */
> +   old_pid = current->pid;
> +   rcu_read_lock();
> +   old_vpid = task_pid_nr_ns(current, task_active_pid_ns(current->parent));
> +   rcu_read_unlock();
> +
> +   ret = search_binary_handler(bprm);
> +   if (ret >= 0) {
> +   trace_sched_process_exec(current, old_pid, bprm);
> +   ptrace_event(PTRACE_EVENT_EXEC, old_vpid);
> +   current->did_exec = 1;
> +   }

Cleanup looks good. One idea here, though: this could be made more
pretty by doing:

if (ret < 0)
return ret;

to avoid the indentation for the "expected" code path.

-Kees

> +
> +   return ret;
> +}
> +
>  /*
>   * sys_execve() executes a new program.
>   */
> @@ -1541,7 +1549,7 @@ static int do_execve_common(const char *filename,
> if (retval < 0)
> goto out;
>
> -   retval = search_binary_handler(bprm);
> +   retval = exec_binprm(bprm);
> if (retval < 0)
> goto out;
>
> --
> 1.5.5.1
>



-- 
Kees Cook
Chrome OS Security


[PATCH v2 2/3] exec: kill "int depth" in search_binary_handler()

2013-08-03 Thread Oleg Nesterov
On 08/03, Kees Cook wrote:
>
> On Thu, Aug 1, 2013 at 12:05 PM, Oleg Nesterov  wrote:
> > Nobody except search_binary_handler() should touch ->recursion_depth,
> > "int depth" buys nothing but complicates the code, kill it.
>
> I'd like to see a comment added to binfmts.h's recursion_depth field
> that reminds people that recursion_depth is for
> search_binary_handler()'s use only, and a binfmt loader shouldn't
> touch it.

And this comment probably makes sense even without this change

> Besides that, yeah, sensible clean up.

OK, thanks, please see v2. The only change is the comment in .h

--
Subject: [PATCH 1/1] exec: kill "int depth" in search_binary_handler()

Nobody except search_binary_handler() should touch ->recursion_depth,
"int depth" buys nothing but complicates the code, kill it.

Probably we should also kill "fn" and the !NULL check, ->load_binary
should be always defined. And it can not go away after read_unlock()
or this code is buggy anyway.

v2: add the comment about linux_binprm->recursion_depth

Signed-off-by: Oleg Nesterov 
---
 fs/exec.c   |9 -
 include/linux/binfmts.h |2 +-
 2 files changed, 5 insertions(+), 6 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index a9ae4f2..f32079c 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1370,12 +1370,11 @@ EXPORT_SYMBOL(remove_arg_zero);
  */
 int search_binary_handler(struct linux_binprm *bprm)
 {
-   unsigned int depth = bprm->recursion_depth;
-   int try,retval;
+   int try, retval;
struct linux_binfmt *fmt;
 
/* This allows 4 levels of binfmt rewrites before failing hard. */
-   if (depth > 5)
+   if (bprm->recursion_depth > 5)
return -ELOOP;
 
retval = security_bprm_check(bprm);
@@ -1396,9 +1395,9 @@ int search_binary_handler(struct linux_binprm *bprm)
if (!try_module_get(fmt->module))
continue;
read_unlock(&binfmt_lock);
-   bprm->recursion_depth = depth + 1;
+   bprm->recursion_depth++;
retval = fn(bprm);
-   bprm->recursion_depth = depth;
+   bprm->recursion_depth--;
if (retval >= 0) {
put_binfmt(fmt);
allow_write_access(bprm->file);
diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
index 70cf138..e8112ae 100644
--- a/include/linux/binfmts.h
+++ b/include/linux/binfmts.h
@@ -31,7 +31,7 @@ struct linux_binprm {
 #ifdef __alpha__
unsigned int taso:1;
 #endif
-   unsigned int recursion_depth;
+   unsigned int recursion_depth; /* only for search_binary_handler() */
struct file * file;
struct cred *cred;  /* new credentials */
int unsafe; /* how unsafe this exec is (mask of LSM_UNSAFE_*) */
-- 
1.5.5.1




Re: [PATCH 2/3] exec: kill "int depth" in search_binary_handler()

2013-08-03 Thread Kees Cook
On Thu, Aug 1, 2013 at 12:05 PM, Oleg Nesterov  wrote:
> Nobody except search_binary_handler() should touch ->recursion_depth,
> "int depth" buys nothing but complicates the code, kill it.

I'd like to see a comment added to binfmts.h's recursion_depth field
that reminds people that recursion_depth is for
search_binary_handler()'s use only, and a binfmt loader shouldn't
touch it. Besides that, yeah, sensible clean up.

-Kees

-- 
Kees Cook
Chrome OS Security


Re: [PATCH 001/001] CHAR DRIVERS: a simple device to give daemons a /sys-like interface

2013-08-03 Thread Bob Smith

Greg
   Thanks for your reply.  I'll reply to your comments in reverse order.


Greg Kroah-Hartman wrote:
> And how does this have anything to do with /sys?  I can't see any sysfs
> interaction in the code, or am I missing it?


Yes, you are right.  I'll change the subject and brief descriptions to
something like:
"Proxy, a simple bidirectional character device that almost transparently
proxies opens, reads, writes, and closes from one side of the device to
the other side."

I'll take "/sys" from all descriptions of the device, but I might leave it
in the Documentation/proxy.txt file since a major use case of proxy is to
give user space drivers and daemons the same kind of interface the kernel
enjoys with /sys and /proc.  The similarity is very deliberate on my part
for commands like
echo 1 > /proc/sys/net/ipv4/ip_forward   # procfs
echo 75 > /dev/motors/left/speed # proxy dev




Greg Kroah-Hartman wrote:
> Why not just use the cuse interface instead?  How does this differ from
> that /dev node interaction?

I am a big fan of FUSE and CUSE but they do not support what I need.  CUSE
is OK if _ALL_ interaction is through its interface.  What is lacking is
an ability to open, say, a USB serial port and add its file descriptor to
CUSE's FD_SET.  This is a pretty deep problem since CUSE takes away main()
from the application developer.  A CUSE application looks kind of like this:
main(argc, argv)
{
(check and process command line)
cuse_lowlevel_main(argc, argv, ...)
}
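
The FD_SET point above can be illustrated with a minimal sketch of the kind
of loop an application keeps when it owns main(): it waits on several
unrelated descriptors at once.  The helper name wait_for_readable() and the
descriptors passed to it are hypothetical; only select() and the FD_* macros
are standard POSIX calls.

```c
#include <stddef.h>
#include <sys/select.h>
#include <unistd.h>

/*
 * Wait until any descriptor in fds[0..n-1] is readable and return its
 * index, or -1 on error.  The descriptors may be of any kind (a device
 * node, a serial port, a pipe); which ones are passed in is up to the
 * caller.
 */
int wait_for_readable(const int *fds, int n)
{
	fd_set readable;
	int maxfd = -1;
	int i;

	if (n <= 0)
		return -1;

	FD_ZERO(&readable);
	for (i = 0; i < n; i++) {
		FD_SET(fds[i], &readable);
		if (fds[i] > maxfd)
			maxfd = fds[i];
	}

	/* Block with no timeout until at least one descriptor is ready. */
	if (select(maxfd + 1, &readable, NULL, NULL, NULL) < 0)
		return -1;

	/* Report the first ready descriptor, in array order. */
	for (i = 0; i < n; i++)
		if (FD_ISSET(fds[i], &readable))
			return i;

	return -1;
}
```

A daemon could pass the descriptor of a proxy device node alongside a serial
port descriptor and dispatch on the returned index.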

Another difference is that no language bindings are needed.  There is no
equivalent of libfuse.so.  Since proxy is just a character device and there
are no language bindings, someone could, in the unlikely case it ever made
sense, write a user space device driver in node.js.


thanks
Bob Smith







Re: [PATCH 0/4] ARM: dove: DT updates for v3.11-rc2

2013-08-03 Thread Jason Cooper
On Mon, Jul 29, 2013 at 02:29:02PM +0200, Sebastian Hesselbarth wrote:
> This patch set comprises some DT updates and cleanup patches that
> piled up in the past. The first patch adds a cpu node for the
> Marvell Sheeva PJ4(A) CPU found on Dove SoCs. While touching the
> dtsi, also some nodes are renamed to achieve a consistent naming
> scheme. The second patch adds some common pinmux settings to
> the SoC's pinctrl node and moves the default pinctrl properties to
> the corresponding nodes. The third patch adds a node for the
> IR diode connected to a GPIO pin on SolidRun CuBox. Finally,
> the last patch adds an initial DT file for the Globalscale D2Plug
> which is also based on Marvell Dove SoC.
> 
> The whole patch set is based on v3.11-rc2 with DT part of
> mv643xx_eth and irqchip/clocksource patches applied.
> 
> Although these patches are Marvell Dove only, the whole set is also
> sent to devicetree mailing list to raise attention for potential DT
> binding review.
> 
> Sebastian Hesselbarth (4):
>   ARM: dove: add cpu device tree node
>   ARM: dove: add common pinmux functions to DT
>   ARM: dove: add GPIO IR receiver node to SolidRun CuBox
>   ARM: dove: add initial DT file for Globalscale D2Plug
> 
>  arch/arm/boot/dts/Makefile|1 +
>  arch/arm/boot/dts/dove-cubox.dts  |   30 ++---
>  arch/arm/boot/dts/dove-d2plug.dts |   69 +++
>  arch/arm/boot/dts/dove.dtsi   |  235 
> +
>  4 files changed, 295 insertions(+), 40 deletions(-)
>  create mode 100644 arch/arm/boot/dts/dove-d2plug.dts

Whole series applied to mvebu/boards.

thx,

Jason.


[PATCH] ARM: DTS: AM33XX: Add PMU support

2013-08-03 Thread Alexandre Belloni
ARM Performance Monitor Units are available on the am33xx; add support for
them in the dtsi.

Tested with perf and oprofile on a regular beaglebone.

Signed-off-by: Alexandre Belloni 
---
 arch/arm/boot/dts/am33xx.dtsi | 5 +
 1 file changed, 5 insertions(+)

diff --git a/arch/arm/boot/dts/am33xx.dtsi b/arch/arm/boot/dts/am33xx.dtsi
index 38b446b..cfeac96 100644
--- a/arch/arm/boot/dts/am33xx.dtsi
+++ b/arch/arm/boot/dts/am33xx.dtsi
@@ -53,6 +53,11 @@
};
};
 
+   pmu {
+   compatible = "arm,cortex-a8-pmu";
+   interrupts = <3>;
+   };
+
/*
 * The soc node represents the soc top level view. It is uses for IPs
 * that are not memory mapped in the MPU view or for the MPU itself.
-- 
1.8.1.2



Re: [PATCH 01/10] cpufreq: Cleanup header files included in core

2013-08-03 Thread Viresh Kumar
On 3 August 2013 17:19, Viresh Kumar  wrote:
> This patch intends to cleanup following issues in the header files included in
> cpufreq core layers:
> - Include headers in ascending order, so that we don't add the same header
>   multiple times by mistake.
> -  must be included after , so that they override whatever they
>   need.
> - Remove unnecessary header files
> - Don't include files already included by cpufreq.h or cpufreq_governor.h
>
> Signed-off-by: Viresh Kumar 

Fixup due to compilation warning reported by Fengguang's kbuild system:
[Latest stuff pushed in my cpufreq-next branch]

diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index 70399ea..ccaf025 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -19,6 +19,7 @@

 #include 
 #include 
+#include 
 #include 
 #include 
 #include 


Re: [PATCH 02/10] cpufreq: Re-arrange declarations in cpufreq.h

2013-08-03 Thread Viresh Kumar
On 3 August 2013 17:19, Viresh Kumar  wrote:
> They are pretty much mixed up. Although generic headers are present,
> definitions/declarations are present outside them too.
>
> This patch just moves stuff up and down to make it look better and consistent.
>
> Signed-off-by: Viresh Kumar 
> ---
>  include/linux/cpufreq.h | 370 
> +++-
>  1 file changed, 177 insertions(+), 193 deletions(-)

Fixup due to a compilation issue reported by Fengguang's kbuild system:
[Will post the series again once I receive more comments on it]

diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
index a6b97e2..d568f39 100644
--- a/include/linux/cpufreq.h
+++ b/include/linux/cpufreq.h
@@ -268,6 +268,19 @@ int cpufreq_unregister_notifier(struct
notifier_block *nb, unsigned int list);
 void cpufreq_notify_transition(struct cpufreq_policy *policy,
struct cpufreq_freqs *freqs, unsigned int state);

+#else /* CONFIG_CPU_FREQ */
+static inline int cpufreq_register_notifier(struct notifier_block *nb,
+   unsigned int list)
+{
+   return 0;
+}
+static inline int cpufreq_unregister_notifier(struct notifier_block *nb,
+   unsigned int list)
+{
+   return 0;
+}
+#endif /* !CONFIG_CPU_FREQ */
+
 /**
  * cpufreq_scale - "old * mult / div" calculation for large values (32-bit-arch
  * safe)
@@ -282,32 +295,16 @@ static inline unsigned long
cpufreq_scale(unsigned long old, u_int div,
u_int mult)
 {
 #if BITS_PER_LONG == 32
-
u64 result = ((u64) old) * ((u64) mult);
do_div(result, div);
return (unsigned long) result;

 #elif BITS_PER_LONG == 64
-
unsigned long result = old * ((u64) mult);
result /= div;
return result;
-
 #endif
-};
-
-#else /* CONFIG_CPU_FREQ */
-static inline int cpufreq_register_notifier(struct notifier_block *nb,
-   unsigned int list)
-{
-   return 0;
 }
-static inline int cpufreq_unregister_notifier(struct notifier_block *nb,
-   unsigned int list)
-{
-   return 0;
-}
-#endif /* !CONFIG_CPU_FREQ */

 /*
  *  CPUFREQ GOVERNORS*
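
For readers following the cpufreq_scale() hunk above, here is a minimal
userspace sketch of the same "old * mult / div" idea: widen to 64 bits before
multiplying so the product does not overflow on 32-bit machines.  The name
scale_value() is hypothetical, and plain 64-bit division stands in for the
kernel's do_div().

```c
#include <stdint.h>

/*
 * Compute old * mult / div using a 64-bit intermediate product.
 * The result is truncated toward zero.  The caller must pass a
 * non-zero div.
 */
unsigned long scale_value(unsigned long old, unsigned int div,
			  unsigned int mult)
{
	uint64_t result = (uint64_t)old * (uint64_t)mult;

	result /= div;
	return (unsigned long)result;
}
```

With old = 1000000, mult = 1200000 and div = 800000, the intermediate product
is 1.2e12, which does not fit in 32 bits but is handled correctly here.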


Re: [patch 0/7] improve memcg oom killer robustness v2

2013-08-03 Thread Johannes Weiner
Hi azur,

here is the x86-only rollup of the series for 3.2.

Thanks!
Johannes
---

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 5db0490..314fe53 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -842,30 +842,22 @@ do_sigbus(struct pt_regs *regs, unsigned long error_code, 
unsigned long address,
force_sig_info_fault(SIGBUS, code, address, tsk, fault);
 }
 
-static noinline int
+static noinline void
 mm_fault_error(struct pt_regs *regs, unsigned long error_code,
   unsigned long address, unsigned int fault)
 {
-   /*
-* Pagefault was interrupted by SIGKILL. We have no reason to
-* continue pagefault.
-*/
-   if (fatal_signal_pending(current)) {
-   if (!(fault & VM_FAULT_RETRY))
-   up_read(&current->mm->mmap_sem);
-   if (!(error_code & PF_USER))
-   no_context(regs, error_code, address);
-   return 1;
+   if (fatal_signal_pending(current) && !(error_code & PF_USER)) {
+   up_read(&current->mm->mmap_sem);
+   no_context(regs, error_code, address);
+   return;
}
-   if (!(fault & VM_FAULT_ERROR))
-   return 0;
 
if (fault & VM_FAULT_OOM) {
/* Kernel mode? Handle exceptions or die: */
if (!(error_code & PF_USER)) {
up_read(&current->mm->mmap_sem);
no_context(regs, error_code, address);
-   return 1;
+   return;
}
 
out_of_memory(regs, error_code, address);
@@ -876,7 +868,6 @@ mm_fault_error(struct pt_regs *regs, unsigned long 
error_code,
else
BUG();
}
-   return 1;
 }
 
 static int spurious_fault_check(unsigned long error_code, pte_t *pte)
@@ -1070,6 +1061,7 @@ do_page_fault(struct pt_regs *regs, unsigned long 
error_code)
if (user_mode_vm(regs)) {
local_irq_enable();
error_code |= PF_USER;
+   flags |= FAULT_FLAG_USER;
} else {
if (regs->flags & X86_EFLAGS_IF)
local_irq_enable();
@@ -1167,9 +1159,17 @@ good_area:
 */
fault = handle_mm_fault(mm, vma, address, flags);
 
-   if (unlikely(fault & (VM_FAULT_RETRY|VM_FAULT_ERROR))) {
-   if (mm_fault_error(regs, error_code, address, fault))
-   return;
+   /*
+* If we need to retry but a fatal signal is pending, handle the
+* signal first. We do not need to release the mmap_sem because it
+* would already be released in __lock_page_or_retry in mm/filemap.c.
+*/
+   if (unlikely((fault & VM_FAULT_RETRY) && fatal_signal_pending(current)))
+   return;
+
+   if (unlikely(fault & VM_FAULT_ERROR)) {
+   mm_fault_error(regs, error_code, address, fault);
+   return;
}
 
/*
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index b87068a..b113c0f 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -120,6 +120,48 @@ mem_cgroup_get_reclaim_stat_from_page(struct page *page);
 extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
struct task_struct *p);
 
+/**
+ * mem_cgroup_toggle_oom - toggle the memcg OOM killer for the current task
+ * @new: true to enable, false to disable
+ *
+ * Toggle whether a failed memcg charge should invoke the OOM killer
+ * or just return -ENOMEM.  Returns the previous toggle state.
+ *
+ * NOTE: Any path that enables the OOM killer before charging must
+ *   call mem_cgroup_oom_synchronize() afterward to finalize the
+ *   OOM handling and clean up.
+ */
+static inline bool mem_cgroup_toggle_oom(bool new)
+{
+   bool old;
+
+   old = current->memcg_oom.may_oom;
+   current->memcg_oom.may_oom = new;
+
+   return old;
+}
+
+static inline void mem_cgroup_enable_oom(void)
+{
+   bool old = mem_cgroup_toggle_oom(true);
+
+   WARN_ON(old == true);
+}
+
+static inline void mem_cgroup_disable_oom(void)
+{
+   bool old = mem_cgroup_toggle_oom(false);
+
+   WARN_ON(old == false);
+}
+
+static inline bool task_in_memcg_oom(struct task_struct *p)
+{
+   return p->memcg_oom.in_memcg_oom;
+}
+
+bool mem_cgroup_oom_synchronize(void);
+
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
 extern int do_swap_account;
 #endif
@@ -333,6 +375,29 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct 
task_struct *p)
 {
 }
 
+static inline bool mem_cgroup_toggle_oom(bool new)
+{
+   return false;
+}
+
+static inline void mem_cgroup_enable_oom(void)
+{
+}
+
+static inline void mem_cgroup_disable_oom(void)
+{
+}
+
+static inline bool task_in_memcg_oom(struct task_struct *p)
+{
+   return false;
+}
+
+static inline bool mem_cgroup_oom_synchronize(void)
+{
+   return false;
+}
+

Re: [PATCHv2] Add per-process flag to control thp

2013-08-03 Thread Oleg Nesterov
On 08/02, Alex Thorlton wrote:
>
> This patch implements functionality to allow processes to disable the use of
> transparent hugepages through the prctl syscall.
>
> We've determined that some jobs perform significantly better with thp 
> disabled,
> and we needed a way to control thp on a per-process basis, without relying on
> madvise.

Well, I think the changelog should explain why madvise() is bad.

> @@ -1311,6 +1311,10 @@ static struct task_struct *copy_process(unsigned long 
> clone_flags,
>   p->sequential_io_avg= 0;
>  #endif
>
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> + p->thp_disabled = current->thp_disabled;
> +#endif

Unneeded. It will be copied by dup_task_struct() automagically.

But I simply can't understand why this flag is per-thread. It should be
an mm flag, no?

Oleg



[patch 0/7] improve memcg oom killer robustness v2

2013-08-03 Thread Johannes Weiner
Changes in version 2:
o use user_mode() instead of open coding it on s390 (Heiko Carstens)
o clean up memcg OOM enable/disable toggling (Michal Hocko & KOSAKI
  Motohiro)
o add a separate patch to rework and document OOM locking
o fix a problem with lost wakeups when sleeping on the OOM lock
o fix OOM unlocking & wakeups with userspace OOM handling

The memcg code can trap tasks in the context of the failing allocation
until an OOM situation is resolved.  They can hold all kinds of locks
(fs, mm) at this point, which makes it prone to deadlocking.

This series converts memcg OOM handling into a two step process that
is started in the charge context, but any waiting is done after the
fault stack is fully unwound.

Patches 1-4 prepare architecture handlers to support the new memcg
requirements, but in doing so they also remove old cruft and unify
out-of-memory behavior across architectures.

Patch 5 disables the memcg OOM handling for syscalls, readahead,
kernel faults, because they can gracefully unwind the stack with
-ENOMEM.  OOM handling is restricted to user triggered faults that
have no other option.

Patch 6 reworks memcg's hierarchical OOM locking to make it a little
more obvious what is going on in there: reduce locked regions, rename
locking functions, reorder and document.

Patch 7 implements the two-part OOM handling such that tasks are never
trapped with the full charge stack in an OOM situation.

 arch/alpha/mm/fault.c  |   7 +-
 arch/arc/mm/fault.c|  11 +--
 arch/arm/mm/fault.c|  23 +++--
 arch/arm64/mm/fault.c  |  23 +++--
 arch/avr32/mm/fault.c  |   4 +-
 arch/cris/mm/fault.c   |   6 +-
 arch/frv/mm/fault.c|  10 +-
 arch/hexagon/mm/vm_fault.c |   6 +-
 arch/ia64/mm/fault.c   |   6 +-
 arch/m32r/mm/fault.c   |  10 +-
 arch/m68k/mm/fault.c   |   2 +
 arch/metag/mm/fault.c  |   6 +-
 arch/microblaze/mm/fault.c |   7 +-
 arch/mips/mm/fault.c   |   8 +-
 arch/mn10300/mm/fault.c|   2 +
 arch/openrisc/mm/fault.c   |   1 +
 arch/parisc/mm/fault.c |   7 +-
 arch/powerpc/mm/fault.c|   7 +-
 arch/s390/mm/fault.c   |   2 +
 arch/score/mm/fault.c  |  13 ++-
 arch/sh/mm/fault.c |   9 +-
 arch/sparc/mm/fault_32.c   |  12 ++-
 arch/sparc/mm/fault_64.c   |   8 +-
 arch/tile/mm/fault.c   |  13 +--
 arch/um/kernel/trap.c  |  22 +++--
 arch/unicore32/mm/fault.c  |  22 +++--
 arch/x86/mm/fault.c|  43 -
 arch/xtensa/mm/fault.c |   2 +
 include/linux/memcontrol.h |  65 +
 include/linux/mm.h |   1 +
 include/linux/sched.h  |   7 ++
 mm/filemap.c   |  11 ++-
 mm/memcontrol.c| 229 +
 mm/memory.c|  43 +++--
 mm/oom_kill.c  |   7 +-
 35 files changed, 444 insertions(+), 211 deletions(-)



[patch 3/7] arch: mm: pass userspace fault flag to generic fault handler

2013-08-03 Thread Johannes Weiner
Unlike global OOM handling, memory cgroup code will invoke the OOM
killer in any OOM situation because it has no way of telling faults
occurring in kernel context - which could be handled more gracefully -
from user-triggered faults.

Pass a flag that identifies faults originating in user space from the
architecture-specific fault handlers to generic code so that memcg OOM
handling can be improved.

Signed-off-by: Johannes Weiner 
Reviewed-by: Michal Hocko 
---
 arch/alpha/mm/fault.c  |  7 ---
 arch/arc/mm/fault.c|  6 --
 arch/arm/mm/fault.c|  9 ++---
 arch/arm64/mm/fault.c  |  9 ++---
 arch/avr32/mm/fault.c  |  2 ++
 arch/cris/mm/fault.c   |  6 --
 arch/frv/mm/fault.c| 10 ++
 arch/hexagon/mm/vm_fault.c |  6 --
 arch/ia64/mm/fault.c   |  6 --
 arch/m32r/mm/fault.c   | 10 ++
 arch/m68k/mm/fault.c   |  2 ++
 arch/metag/mm/fault.c  |  6 --
 arch/microblaze/mm/fault.c |  7 +--
 arch/mips/mm/fault.c   |  6 --
 arch/mn10300/mm/fault.c|  2 ++
 arch/openrisc/mm/fault.c   |  1 +
 arch/parisc/mm/fault.c |  7 +--
 arch/powerpc/mm/fault.c|  7 ---
 arch/s390/mm/fault.c   |  2 ++
 arch/score/mm/fault.c  |  7 ++-
 arch/sh/mm/fault.c |  9 ++---
 arch/sparc/mm/fault_32.c   | 12 +---
 arch/sparc/mm/fault_64.c   |  8 +---
 arch/tile/mm/fault.c   |  7 +--
 arch/um/kernel/trap.c  | 20 
 arch/unicore32/mm/fault.c  |  8 ++--
 arch/x86/mm/fault.c|  8 +---
 arch/xtensa/mm/fault.c |  2 ++
 include/linux/mm.h |  1 +
 29 files changed, 132 insertions(+), 61 deletions(-)

diff --git a/arch/alpha/mm/fault.c b/arch/alpha/mm/fault.c
index 0c4132d..98838a0 100644
--- a/arch/alpha/mm/fault.c
+++ b/arch/alpha/mm/fault.c
@@ -89,8 +89,7 @@ do_page_fault(unsigned long address, unsigned long mmcsr,
const struct exception_table_entry *fixup;
int fault, si_code = SEGV_MAPERR;
siginfo_t info;
-   unsigned int flags = (FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE |
- (cause > 0 ? FAULT_FLAG_WRITE : 0));
+   unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
 
/* As of EV6, a load into $31/$f31 is a prefetch, and never faults
   (or is suppressed by the PALcode).  Support that for older CPUs
@@ -115,7 +114,8 @@ do_page_fault(unsigned long address, unsigned long mmcsr,
if (address >= TASK_SIZE)
goto vmalloc_fault;
 #endif
-
+   if (user_mode(regs))
+   flags |= FAULT_FLAG_USER;
 retry:
down_read(&mm->mmap_sem);
vma = find_vma(mm, address);
@@ -142,6 +142,7 @@ retry:
} else {
if (!(vma->vm_flags & VM_WRITE))
goto bad_area;
+   flags |= FAULT_FLAG_WRITE;
}
 
/* If for any reason at all we couldn't handle the fault,
diff --git a/arch/arc/mm/fault.c b/arch/arc/mm/fault.c
index 6b0bb41..d63f3de 100644
--- a/arch/arc/mm/fault.c
+++ b/arch/arc/mm/fault.c
@@ -60,8 +60,7 @@ void do_page_fault(struct pt_regs *regs, unsigned long 
address)
siginfo_t info;
int fault, ret;
int write = regs->ecr_cause & ECR_C_PROTV_STORE;  /* ST/EX */
-   unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE |
-   (write ? FAULT_FLAG_WRITE : 0);
+   unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
 
/*
 * We fault-in kernel-space virtual memory on-demand. The
@@ -89,6 +88,8 @@ void do_page_fault(struct pt_regs *regs, unsigned long 
address)
if (in_atomic() || !mm)
goto no_context;
 
+   if (user_mode(regs))
+   flags |= FAULT_FLAG_USER;
 retry:
down_read(&mm->mmap_sem);
vma = find_vma(mm, address);
@@ -117,6 +118,7 @@ good_area:
if (write) {
if (!(vma->vm_flags & VM_WRITE))
goto bad_area;
+   flags |= FAULT_FLAG_WRITE;
} else {
if (!(vma->vm_flags & (VM_READ | VM_EXEC)))
goto bad_area;
diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c
index 217bcbf..eb8830a 100644
--- a/arch/arm/mm/fault.c
+++ b/arch/arm/mm/fault.c
@@ -261,9 +261,7 @@ do_page_fault(unsigned long addr, unsigned int fsr, struct 
pt_regs *regs)
struct task_struct *tsk;
struct mm_struct *mm;
int fault, sig, code;
-   int write = fsr & FSR_WRITE;
-   unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE |
-   (write ? FAULT_FLAG_WRITE : 0);
+   unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
 
if (notify_page_fault(regs, fsr))
return 0;
@@ -282,6 +280,11 @@ do_page_fault(unsigned long addr, unsigned int fsr, struct 
pt_regs *regs)
if (in_atomic() || !mm)
goto no_
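
The per-architecture hunks above all follow one pattern: start from the
retry/killable flags, add the user flag when the fault came from user mode,
and add the write flag only after the VMA permission check has passed.  A
minimal sketch of that pattern follows; the SKETCH_* values and
build_fault_flags() are hypothetical and are not the kernel's definitions.

```c
/* Hypothetical flag values for this sketch only. */
#define SKETCH_FLAG_WRITE        0x01u
#define SKETCH_FLAG_ALLOW_RETRY  0x02u
#define SKETCH_FLAG_KILLABLE     0x04u
#define SKETCH_FLAG_USER         0x08u

/*
 * Build the flags word the way the converted handlers do.  *bad_area
 * is set to 1 when a write hits a non-writable mapping, mirroring the
 * "goto bad_area" path, and to 0 otherwise.
 */
unsigned int build_fault_flags(int from_user_mode, int is_write,
			       int vma_writable, int *bad_area)
{
	unsigned int flags = SKETCH_FLAG_ALLOW_RETRY | SKETCH_FLAG_KILLABLE;

	*bad_area = 0;

	if (from_user_mode)
		flags |= SKETCH_FLAG_USER;

	if (is_write) {
		if (!vma_writable) {
			*bad_area = 1;
			return flags;
		}
		/* Only set once the access is known to be permitted. */
		flags |= SKETCH_FLAG_WRITE;
	}

	return flags;
}
```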

[patch 5/7] mm: memcg: enable memcg OOM killer only for user faults

2013-08-03 Thread Johannes Weiner
System calls and kernel faults (uaccess, gup) can handle an out of
memory situation gracefully and just return -ENOMEM.

Enable the memcg OOM killer only for user faults, where it's really
the only option available.

Signed-off-by: Johannes Weiner 
---
 include/linux/memcontrol.h | 44 
 include/linux/sched.h  |  3 +++
 mm/filemap.c   | 11 ++-
 mm/memcontrol.c|  2 +-
 mm/memory.c| 40 ++--
 5 files changed, 88 insertions(+), 12 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 7b4d9d7..9c449c1 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -125,6 +125,37 @@ extern void mem_cgroup_print_oom_info(struct mem_cgroup 
*memcg,
 extern void mem_cgroup_replace_page_cache(struct page *oldpage,
struct page *newpage);
 
+/**
+ * mem_cgroup_toggle_oom - toggle the memcg OOM killer for the current task
+ * @new: true to enable, false to disable
+ *
+ * Toggle whether a failed memcg charge should invoke the OOM killer
+ * or just return -ENOMEM.  Returns the previous toggle state.
+ */
+static inline bool mem_cgroup_toggle_oom(bool new)
+{
+   bool old;
+
+   old = current->memcg_oom.may_oom;
+   current->memcg_oom.may_oom = new;
+
+   return old;
+}
+
+static inline void mem_cgroup_enable_oom(void)
+{
+   bool old = mem_cgroup_toggle_oom(true);
+
+   WARN_ON(old == true);
+}
+
+static inline void mem_cgroup_disable_oom(void)
+{
+   bool old = mem_cgroup_toggle_oom(false);
+
+   WARN_ON(old == false);
+}
+
 #ifdef CONFIG_MEMCG_SWAP
 extern int do_swap_account;
 #endif
@@ -348,6 +379,19 @@ static inline void mem_cgroup_end_update_page_stat(struct 
page *page,
 {
 }
 
+static inline bool mem_cgroup_toggle_oom(bool new)
+{
+   return false;
+}
+
+static inline void mem_cgroup_enable_oom(void)
+{
+}
+
+static inline void mem_cgroup_disable_oom(void)
+{
+}
+
 static inline void mem_cgroup_inc_page_stat(struct page *page,
enum mem_cgroup_page_stat_item idx)
 {
diff --git a/include/linux/sched.h b/include/linux/sched.h
index fc09d21..4b3effc 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1398,6 +1398,9 @@ struct task_struct {
unsigned long memsw_nr_pages; /* uncharged mem+swap usage */
} memcg_batch;
unsigned int memcg_kmem_skip_account;
+   struct memcg_oom_info {
+   unsigned int may_oom:1;
+   } memcg_oom;
 #endif
 #ifdef CONFIG_UPROBES
struct uprobe_task *utask;
diff --git a/mm/filemap.c b/mm/filemap.c
index a6981fe..4a73e1a 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1618,6 +1618,7 @@ int filemap_fault(struct vm_area_struct *vma, struct 
vm_fault *vmf)
struct inode *inode = mapping->host;
pgoff_t offset = vmf->pgoff;
struct page *page;
+   bool memcg_oom;
pgoff_t size;
int ret = 0;
 
@@ -1626,7 +1627,11 @@ int filemap_fault(struct vm_area_struct *vma, struct 
vm_fault *vmf)
return VM_FAULT_SIGBUS;
 
/*
-* Do we have something in the page cache already?
+* Do we have something in the page cache already?  Either
+* way, try readahead, but disable the memcg OOM killer for it
+* as readahead is optional and no errors are propagated up
+* the fault stack.  The OOM killer is enabled while trying to
+* instantiate the faulting page individually below.
 */
page = find_get_page(mapping, offset);
if (likely(page) && !(vmf->flags & FAULT_FLAG_TRIED)) {
@@ -1634,10 +1639,14 @@ int filemap_fault(struct vm_area_struct *vma, struct 
vm_fault *vmf)
 * We found the page, so try async readahead before
 * waiting for the lock.
 */
+   memcg_oom = mem_cgroup_toggle_oom(false);
do_async_mmap_readahead(vma, ra, file, page, offset);
+   mem_cgroup_toggle_oom(memcg_oom);
} else if (!page) {
/* No page in the page cache at all */
+   memcg_oom = mem_cgroup_toggle_oom(false);
do_sync_mmap_readahead(vma, ra, file, offset);
+   mem_cgroup_toggle_oom(memcg_oom);
count_vm_event(PGMAJFAULT);
mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
ret = VM_FAULT_MAJOR;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 00a7a66..30ae46a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2614,7 +2614,7 @@ static int mem_cgroup_do_charge(struct mem_cgroup *memcg, 
gfp_t gfp_mask,
return CHARGE_RETRY;
 
/* If we don't need to call oom-killer at el, return immediately */
-   if (!oom_check)
+   if (!oom_check || !current->memcg_oom.may_oom)
return CHARGE_NOMEM;
/* check OOM
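
The filemap_fault() hunks in this patch use the toggle as a save/restore
pair around optional work.  A minimal userspace sketch of that pattern is
below; the global may_oom stands in for current->memcg_oom.may_oom, and
optional_work() and run_optional_work() are hypothetical names.

```c
#include <stdbool.h>

/* Stand-in for the per-task flag; a plain global in this sketch. */
static bool may_oom;

/* Set the flag to new_state and return what it was before. */
bool toggle_oom(bool new_state)
{
	bool old = may_oom;

	may_oom = new_state;
	return old;
}

/* Hypothetical optional step; reports the flag value it ran under. */
static bool optional_work(void)
{
	return may_oom;
}

/*
 * Run the optional step with the flag cleared, then restore whatever
 * state the caller had, as the readahead paths do.
 */
bool run_optional_work(void)
{
	bool saved = toggle_oom(false);
	bool seen = optional_work();

	toggle_oom(saved);
	return seen;
}
```

Restoring the saved value, rather than unconditionally re-enabling, keeps the
helper correct whether or not the caller had the flag set.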

[patch 1/7] arch: mm: remove obsolete init OOM protection

2013-08-03 Thread Johannes Weiner
Back before smart OOM killing, when faulting tasks were killed
directly on allocation failures, the arch-specific fault handlers
needed special protection for the init process.

Now that all fault handlers call into the generic OOM killer (609838c
"mm: invoke oom-killer from remaining unconverted page fault
handlers"), which already provides init protection, the arch-specific
leftovers can be removed.

Signed-off-by: Johannes Weiner 
Reviewed-by: Michal Hocko 
Acked-by: KOSAKI Motohiro 
---
 arch/arc/mm/fault.c   | 5 -
 arch/score/mm/fault.c | 6 --
 arch/tile/mm/fault.c  | 6 --
 3 files changed, 17 deletions(-)

diff --git a/arch/arc/mm/fault.c b/arch/arc/mm/fault.c
index 0fd1f0d..6b0bb41 100644
--- a/arch/arc/mm/fault.c
+++ b/arch/arc/mm/fault.c
@@ -122,7 +122,6 @@ good_area:
goto bad_area;
}
 
-survive:
/*
 * If for any reason at all we couldn't handle the fault,
 * make sure we exit gracefully rather than endlessly redo
@@ -201,10 +200,6 @@ no_context:
die("Oops", regs, address);
 
 out_of_memory:
-   if (is_global_init(tsk)) {
-   yield();
-   goto survive;
-   }
up_read(&mm->mmap_sem);
 
if (user_mode(regs)) {
diff --git a/arch/score/mm/fault.c b/arch/score/mm/fault.c
index 6b18fb0..4b71a62 100644
--- a/arch/score/mm/fault.c
+++ b/arch/score/mm/fault.c
@@ -100,7 +100,6 @@ good_area:
goto bad_area;
}
 
-survive:
/*
* If for any reason at all we couldn't handle the fault,
* make sure we exit gracefully rather than endlessly redo
@@ -167,11 +166,6 @@ no_context:
*/
 out_of_memory:
up_read(&mm->mmap_sem);
-   if (is_global_init(tsk)) {
-   yield();
-   down_read(&mm->mmap_sem);
-   goto survive;
-   }
if (!user_mode(regs))
goto no_context;
pagefault_out_of_memory();
diff --git a/arch/tile/mm/fault.c b/arch/tile/mm/fault.c
index f7f99f9..ac553ee 100644
--- a/arch/tile/mm/fault.c
+++ b/arch/tile/mm/fault.c
@@ -430,7 +430,6 @@ good_area:
goto bad_area;
}
 
- survive:
/*
 * If for any reason at all we couldn't handle the fault,
 * make sure we exit gracefully rather than endlessly redo
@@ -568,11 +567,6 @@ no_context:
  */
 out_of_memory:
up_read(&mm->mmap_sem);
-   if (is_global_init(tsk)) {
-   yield();
-   down_read(&mm->mmap_sem);
-   goto survive;
-   }
if (is_kernel_mode)
goto no_context;
pagefault_out_of_memory();
-- 
1.8.3.2



[patch 6/7] mm: memcg: rework and document OOM waiting and wakeup

2013-08-03 Thread Johannes Weiner
The memcg OOM handler open-codes a sleeping lock for OOM serialization
(trylock, wait, repeat) because the required locking is so specific to
memcg hierarchies.  However, it would be nice if this construct would
be clearly recognizable and not be as obfuscated as it is right now.
Clean up as follows:

1. Remove the return value of mem_cgroup_oom_unlock()

2. Rename mem_cgroup_oom_lock() to mem_cgroup_oom_trylock().

3. Pull the prepare_to_wait() out of the memcg_oom_lock scope.  This
   makes it more obvious that the task has to be on the waitqueue
   before attempting to OOM-trylock the hierarchy, to not miss any
   wakeups before going to sleep.  It just didn't matter until now
   because it was all lumped together into the global memcg_oom_lock
   spinlock section.

4. Pull the mem_cgroup_oom_notify() out of the memcg_oom_lock scope.
   It is protected by the hierarchical OOM-lock.

5. The memcg_oom_lock spinlock is only required to propagate the OOM
   lock in any given hierarchy atomically.  Restrict its scope to
   mem_cgroup_oom_(trylock|unlock).

6. Do not wake up the waitqueue unconditionally at the end of the
   function.  Only the lockholder has to wake up the next in line
   after releasing the lock.

   Note that the lockholder kicks off the OOM-killer, which in turn
   leads to wakeups from the uncharges of the exiting task.  But a
   contender is not guaranteed to see them if it enters the OOM path
   after the OOM kills but before the lockholder releases the lock.
   Thus there has to be an explicit wakeup after releasing the lock.

7. Put the OOM task on the waitqueue before marking the hierarchy as
   under OOM as that is the point where we start to receive wakeups.
   No point in listening before being on the waitqueue.

8. Likewise, unmark the hierarchy before finishing the sleep, for
   symmetry.

Signed-off-by: Johannes Weiner 
Acked-by: Michal Hocko 
---
 mm/memcontrol.c | 85 +++--
 1 file changed, 47 insertions(+), 38 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 30ae46a..3d0c1d3 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2076,15 +2076,18 @@ static int mem_cgroup_soft_reclaim(struct mem_cgroup 
*root_memcg,
return total;
 }
 
+static DEFINE_SPINLOCK(memcg_oom_lock);
+
 /*
  * Check OOM-Killer is already running under our hierarchy.
  * If someone is running, return false.
- * Has to be called with memcg_oom_lock
  */
-static bool mem_cgroup_oom_lock(struct mem_cgroup *memcg)
+static bool mem_cgroup_oom_trylock(struct mem_cgroup *memcg)
 {
struct mem_cgroup *iter, *failed = NULL;
 
+   spin_lock(&memcg_oom_lock);
+
for_each_mem_cgroup_tree(iter, memcg) {
if (iter->oom_lock) {
/*
@@ -2098,33 +2101,33 @@ static bool mem_cgroup_oom_lock(struct mem_cgroup 
*memcg)
iter->oom_lock = true;
}
 
-   if (!failed)
-   return true;
-
-   /*
-* OK, we failed to lock the whole subtree so we have to clean up
-* what we set up to the failing subtree
-*/
-   for_each_mem_cgroup_tree(iter, memcg) {
-   if (iter == failed) {
-   mem_cgroup_iter_break(memcg, iter);
-   break;
+   if (failed) {
+   /*
+* OK, we failed to lock the whole subtree so we have
+* to clean up what we set up to the failing subtree
+*/
+   for_each_mem_cgroup_tree(iter, memcg) {
+   if (iter == failed) {
+   mem_cgroup_iter_break(memcg, iter);
+   break;
+   }
+   iter->oom_lock = false;
}
-   iter->oom_lock = false;
-   }
-   return false;
+   }   
+
+   spin_unlock(&memcg_oom_lock);
+
+   return !failed;
 }
 
-/*
- * Has to be called with memcg_oom_lock
- */
-static int mem_cgroup_oom_unlock(struct mem_cgroup *memcg)
+static void mem_cgroup_oom_unlock(struct mem_cgroup *memcg)
 {
struct mem_cgroup *iter;
 
+   spin_lock(&memcg_oom_lock);
for_each_mem_cgroup_tree(iter, memcg)
iter->oom_lock = false;
-   return 0;
+   spin_unlock(&memcg_oom_lock);
 }
 
 static void mem_cgroup_mark_under_oom(struct mem_cgroup *memcg)
@@ -2148,7 +2151,6 @@ static void mem_cgroup_unmark_under_oom(struct mem_cgroup 
*memcg)
atomic_add_unless(&iter->under_oom, -1, 0);
 }
 
-static DEFINE_SPINLOCK(memcg_oom_lock);
 static DECLARE_WAIT_QUEUE_HEAD(memcg_oom_waitq);
 
 struct oom_wait_info {
@@ -2195,45 +2197,52 @@ static bool mem_cgroup_handle_oom(struct mem_cgroup 
*memcg, gfp_t mask,
  int order)
 {
struct oom_wait_info owait;
-   bool locked, need_to_kill;
+   bool locked;
 
owait.memcg = memcg;
owait.wait.fla
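
The all-or-nothing trylock reworked in this patch can be sketched in
userspace as follows.  This is a single-threaded illustration: struct group
and the flat array are hypothetical stand-ins for the memcg hierarchy walk,
and the memcg_oom_lock spinlock that the patch takes around both loops is
omitted.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one group visited by the hierarchy walk. */
struct group {
	bool oom_lock;
};

/*
 * Try to mark every group in groups[0..n-1] as locked.  If one is
 * already locked, undo the marks set by this call and report failure.
 */
bool groups_trylock(struct group *groups, size_t n)
{
	size_t i, failed = n;

	for (i = 0; i < n; i++) {
		if (groups[i].oom_lock) {
			/* Someone else holds this part of the tree. */
			failed = i;
			break;
		}
		groups[i].oom_lock = true;
	}

	if (failed != n) {
		/* Roll back only the entries this call locked. */
		for (i = 0; i < failed; i++)
			groups[i].oom_lock = false;
		return false;
	}

	return true;
}

/* Release every group; only the successful locker should call this. */
void groups_unlock(struct group *groups, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		groups[i].oom_lock = false;
}
```

The rollback stops at the failing entry so that a lock held by another
contender is never cleared by mistake.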

[patch 2/7] arch: mm: do not invoke OOM killer on kernel fault OOM

2013-08-03 Thread Johannes Weiner
Kernel faults are expected to handle OOM conditions gracefully (gup,
uaccess etc.), so they should never invoke the OOM killer.  Reserve
this for faults triggered in user context when it is the only option.

Most architectures already do this, fix up the remaining few.
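
The reordering applied to each architecture can be pictured with a small
standalone sketch. It is not the kernel code: the sk_* names and flag values
are hypothetical, and only the order of the checks mirrors the patch. The
kernel-mode test runs first, so a kernel fault never reaches the branch that
stands in for pagefault_out_of_memory().

```c
#include <stdbool.h>

/* Hypothetical fault result bits; the real VM_FAULT_* values differ. */
enum { SK_FAULT_OOM = 1 << 0, SK_FAULT_SIGBUS = 1 << 1 };

/* Hypothetical outcomes of the tail of an arch fault handler. */
enum sk_action { SK_NO_CONTEXT, SK_OOM_KILLER, SK_SIGBUS, SK_SIGSEGV };

/*
 * Order of checks after the patch: kernel mode is tested before the
 * OOM bit, so kernel faults always take the no_context path.
 */
static enum sk_action sk_fault_tail(unsigned int fault, bool user_mode)
{
	if (!user_mode)
		return SK_NO_CONTEXT;	/* fixup via exception table, or oops */

	if (fault & SK_FAULT_OOM)
		return SK_OOM_KILLER;	/* stands in for pagefault_out_of_memory() */

	if (fault & SK_FAULT_SIGBUS)
		return SK_SIGBUS;

	return SK_SIGSEGV;
}
```

With the old ordering, the OOM test came before the user_mode() test, so a
kernel fault with the OOM bit set would have reached the OOM killer branch.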

Signed-off-by: Johannes Weiner 
Reviewed-by: Michal Hocko 
Acked-by: KOSAKI Motohiro 
---
 arch/arm/mm/fault.c   | 14 +++---
 arch/arm64/mm/fault.c | 14 +++---
 arch/avr32/mm/fault.c |  2 +-
 arch/mips/mm/fault.c  |  2 ++
 arch/um/kernel/trap.c |  2 ++
 arch/unicore32/mm/fault.c | 14 +++---
 6 files changed, 26 insertions(+), 22 deletions(-)

diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c
index c97f794..217bcbf 100644
--- a/arch/arm/mm/fault.c
+++ b/arch/arm/mm/fault.c
@@ -349,6 +349,13 @@ retry:
if (likely(!(fault & (VM_FAULT_ERROR | VM_FAULT_BADMAP | 
VM_FAULT_BADACCESS
return 0;
 
+   /*
+* If we are in kernel mode at this point, we
+* have no context to handle this fault with.
+*/
+   if (!user_mode(regs))
+   goto no_context;
+
if (fault & VM_FAULT_OOM) {
/*
 * We ran out of memory, call the OOM killer, and return to
@@ -359,13 +366,6 @@ retry:
return 0;
}
 
-   /*
-* If we are in kernel mode at this point, we
-* have no context to handle this fault with.
-*/
-   if (!user_mode(regs))
-   goto no_context;
-
if (fault & VM_FAULT_SIGBUS) {
/*
 * We had some memory, but were unable to
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 0ecac89..dab1cfd 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -294,6 +294,13 @@ retry:
  VM_FAULT_BADACCESS
return 0;
 
+   /*
+* If we are in kernel mode at this point, we have no context to
+* handle this fault with.
+*/
+   if (!user_mode(regs))
+   goto no_context;
+
if (fault & VM_FAULT_OOM) {
/*
 * We ran out of memory, call the OOM killer, and return to
@@ -304,13 +311,6 @@ retry:
return 0;
}
 
-   /*
-* If we are in kernel mode at this point, we have no context to
-* handle this fault with.
-*/
-   if (!user_mode(regs))
-   goto no_context;
-
if (fault & VM_FAULT_SIGBUS) {
/*
 * We had some memory, but were unable to successfully fix up
diff --git a/arch/avr32/mm/fault.c b/arch/avr32/mm/fault.c
index b2f2d2d..2ca27b0 100644
--- a/arch/avr32/mm/fault.c
+++ b/arch/avr32/mm/fault.c
@@ -228,9 +228,9 @@ no_context:
 */
 out_of_memory:
up_read(&mm->mmap_sem);
-   pagefault_out_of_memory();
if (!user_mode(regs))
goto no_context;
+   pagefault_out_of_memory();
return;
 
 do_sigbus:
diff --git a/arch/mips/mm/fault.c b/arch/mips/mm/fault.c
index 85df1cd..94d3a31 100644
--- a/arch/mips/mm/fault.c
+++ b/arch/mips/mm/fault.c
@@ -241,6 +241,8 @@ out_of_memory:
 * (which will retry the fault, or kill us if we got oom-killed).
 */
up_read(&mm->mmap_sem);
+   if (!user_mode(regs))
+   goto no_context;
pagefault_out_of_memory();
return;
 
diff --git a/arch/um/kernel/trap.c b/arch/um/kernel/trap.c
index 089f398..b2f5adf 100644
--- a/arch/um/kernel/trap.c
+++ b/arch/um/kernel/trap.c
@@ -124,6 +124,8 @@ out_of_memory:
 * (which will retry the fault, or kill us if we got oom-killed).
 */
up_read(&mm->mmap_sem);
+   if (!is_user)
+   goto out_nosemaphore;
pagefault_out_of_memory();
return 0;
 }
diff --git a/arch/unicore32/mm/fault.c b/arch/unicore32/mm/fault.c
index f9b5c10..8ed3c45 100644
--- a/arch/unicore32/mm/fault.c
+++ b/arch/unicore32/mm/fault.c
@@ -278,6 +278,13 @@ retry:
   (VM_FAULT_ERROR | VM_FAULT_BADMAP | VM_FAULT_BADACCESS
return 0;
 
+   /*
+* If we are in kernel mode at this point, we
+* have no context to handle this fault with.
+*/
+   if (!user_mode(regs))
+   goto no_context;
+
if (fault & VM_FAULT_OOM) {
/*
 * We ran out of memory, call the OOM killer, and return to
@@ -288,13 +295,6 @@ retry:
return 0;
}
 
-   /*
-* If we are in kernel mode at this point, we
-* have no context to handle this fault with.
-*/
-   if (!user_mode(regs))
-   goto no_context;
-
if (fault & VM_FAULT_SIGBUS) {
/*
 * We had some memory, but were unable to
-- 
1.8.3.2


[patch 7/7] mm: memcg: do not trap chargers with full callstack on OOM

2013-08-03 Thread Johannes Weiner
The memcg OOM handling is incredibly fragile and can deadlock.  When a
task fails to charge memory, it invokes the OOM killer and loops right
there in the charge code until it succeeds.  Similarly, any other
task that enters the charge path at this point will go to a waitqueue
right then and there and sleep until the OOM situation is resolved.
The problem is that these tasks may hold filesystem locks and the
mmap_sem; locks that the selected OOM victim may need to exit.

For example, in one reported case, the task invoking the OOM killer
was about to charge a page cache page during a write(), which holds
the i_mutex.  The OOM killer selected a task that was just entering
truncate() and trying to acquire the i_mutex:

OOM invoking task:
[] mem_cgroup_handle_oom+0x241/0x3b0
[] T.1146+0x5ab/0x5c0
[] mem_cgroup_cache_charge+0xbe/0xe0
[] add_to_page_cache_locked+0x4c/0x140
[] add_to_page_cache_lru+0x22/0x50
[] grab_cache_page_write_begin+0x8b/0xe0
[] ext3_write_begin+0x88/0x270
[] generic_file_buffered_write+0x116/0x290
[] __generic_file_aio_write+0x27c/0x480
[] generic_file_aio_write+0x76/0xf0   # takes ->i_mutex
[] do_sync_write+0xea/0x130
[] vfs_write+0xf3/0x1f0
[] sys_write+0x51/0x90
[] system_call_fastpath+0x18/0x1d
[] 0x

OOM kill victim:
[] do_truncate+0x58/0xa0  # takes i_mutex
[] do_last+0x250/0xa30
[] path_openat+0xd7/0x440
[] do_filp_open+0x49/0xa0
[] do_sys_open+0x106/0x240
[] sys_open+0x20/0x30
[] system_call_fastpath+0x18/0x1d
[] 0x

The OOM handling task will retry the charge indefinitely while the OOM
killed task is not releasing any resources.

A similar scenario can happen when the kernel OOM killer for a memcg
is disabled and a userspace task is in charge of resolving OOM
situations.  In this case, ALL tasks that enter the OOM path will be
made to sleep on the OOM waitqueue and wait for userspace to free
resources or increase the group's limit.  But a userspace OOM handler
is prone to deadlock itself on the locks held by the waiting tasks.
For example, one of the sleeping tasks may be stuck in a brk() call
with the mmap_sem held for writing but the userspace handler, in order
to pick an optimal victim, may need to read files from /proc/,
which tries to acquire the same mmap_sem for reading and deadlocks.

This patch changes the way tasks behave after detecting a memcg OOM
and makes sure nobody loops or sleeps with locks held:

1. When OOMing in a user fault, invoke the OOM killer and restart the
   fault instead of looping on the charge attempt.  This way, the OOM
   victim cannot get stuck on locks the looping task may hold.

2. When OOMing in a user fault but somebody else is handling it
   (either the kernel OOM killer or a userspace handler), don't go to
   sleep in the charge context.  Instead, remember the OOMing memcg in
   the task struct and then fully unwind the page fault stack with
   -ENOMEM.  pagefault_out_of_memory() will then call back into the
   memcg code to check if the -ENOMEM came from the memcg, and then
   either put the task to sleep on the memcg's OOM waitqueue or just
   restart the fault.  The OOM victim can no longer get stuck on any
   lock a sleeping task may hold.
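
The two rules above can be sketched as a toy userspace model. This is only an
illustration under stated assumptions: apart from the in_memcg_oom flag, which
the patch adds to the task struct, every sk_* name and field is hypothetical,
and the counters merely stand in for invoking the OOM killer and for sleeping
on the memcg's OOM waitqueue.

```c
#include <stdbool.h>
#include <stddef.h>

#define SK_ENOMEM 12

/* Hypothetical stand-in for a memcg under OOM. */
struct sk_memcg {
	bool oom_handled_elsewhere;	/* another task or userspace handles it */
	int kills;			/* times the OOM killer was invoked */
	int sleepers;			/* tasks that waited on the OOM waitqueue */
};

/* Hypothetical stand-in for the per-task OOM state. */
struct sk_task {
	bool in_memcg_oom;		/* flag name taken from the patch */
	struct sk_memcg *wait_on_memcg;	/* assumption: memcg to wait on later */
};

/*
 * Charge path, entered with fs locks and mmap_sem possibly held.
 * It never loops and never sleeps: it records the OOM in the task
 * and fails the charge so the fault stack unwinds.
 */
static int sk_charge_failed(struct sk_task *t, struct sk_memcg *m)
{
	t->in_memcg_oom = true;
	if (m->oom_handled_elsewhere)
		t->wait_on_memcg = m;	/* rule 2: defer the sleep */
	else
		m->kills++;		/* rule 1: invoke the killer once */
	return -SK_ENOMEM;
}

/*
 * End of the fault, after the stack is unwound and locks are dropped.
 * Returns true if the -ENOMEM came from the memcg, in which case the
 * caller restarts the fault instead of running the global OOM killer.
 */
static bool sk_oom_synchronize(struct sk_task *t)
{
	if (!t->in_memcg_oom)
		return false;

	if (t->wait_on_memcg)
		t->wait_on_memcg->sleepers++;	/* safe to sleep: no locks held */

	t->in_memcg_oom = false;
	t->wait_on_memcg = NULL;
	return true;
}
```

In this model, sk_oom_synchronize() plays the role that the text assigns to
the callback from pagefault_out_of_memory(): the wait happens only after the
charge context has been left.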

Reported-by: azurIt 
Debugged-by: Michal Hocko 
Signed-off-by: Johannes Weiner 
---
 include/linux/memcontrol.h |  21 +++
 include/linux/sched.h  |   4 ++
 mm/memcontrol.c| 154 +++--
 mm/memory.c|   3 +
 mm/oom_kill.c  |   7 ++-
 5 files changed, 140 insertions(+), 49 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 9c449c1..cb84058 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -131,6 +131,10 @@ extern void mem_cgroup_replace_page_cache(struct page *oldpage,
  *
  * Toggle whether a failed memcg charge should invoke the OOM killer
  * or just return -ENOMEM.  Returns the previous toggle state.
+ *
+ * NOTE: Any path that enables the OOM killer before charging must
+ *   call mem_cgroup_oom_synchronize() afterward to finalize the
+ *   OOM handling and clean up.
  */
 static inline bool mem_cgroup_toggle_oom(bool new)
 {
@@ -156,6 +160,13 @@ static inline void mem_cgroup_disable_oom(void)
WARN_ON(old == false);
 }
 
+static inline bool task_in_memcg_oom(struct task_struct *p)
+{
+   return p->memcg_oom.in_memcg_oom;
+}
+
+bool mem_cgroup_oom_synchronize(void);
+
 #ifdef CONFIG_MEMCG_SWAP
 extern int do_swap_account;
 #endif
@@ -392,6 +403,16 @@ static inline void mem_cgroup_disable_oom(void)
 {
 }
 
+static inline bool task_in_memcg_oom(struct task_struct *p)
+{
+   return false;
+}
+
+static inline bool mem_cgroup_oom_synchronize(void)
+{
+   return false;
+}
+
 static inline void mem_cgroup_inc_page_stat(struct page *page,
enum mem_cgroup_page_stat_item idx)
 {
diff --git a/include/linux/sched.h b/include/l
