Re: [PATCH] acpi: video: improve quirk check
On Sat, Aug 3, 2013 at 8:47 PM, Aaron Lu wrote:
> On Sun, Aug 4, 2013 at 6:20 AM, Felipe Contreras wrote:
>> On Sat, Aug 3, 2013 at 4:40 PM, Rafael J. Wysocki wrote:
>>> Do we still need to revert commit efaa14c if this patch is applied?
>>
>> I guess not. At least in this machine changing the backlight works
>> correctly, whereas in v3.11-rc3 it was all weird, and v3.7-v3.10
>> didn't work at all. I cannot see how it would affect negatively other
>> machines.
>
> That commit makes hotkey emit notifications, and it's not the
> problem of "booting into a black screen", that problem is due to
> broken _BQC.

The broken _BQC has been there for quite some time, hasn't it? Either
way, without efaa14c, changing the backlight doesn't work at all either
way, so there's no black screen, because there cannot be.

> BTW, the efaa14c will also make screen off at level 0 according
> to Felipe, who consider this is a bug. But since it is required to
> let firmware emit notifications on hotkey press, I think user will
> want it.

With or without efaa14c, level 0 makes the screen off, or at least it
would, if the control worked at all.

So efaa14c is one step forward, but two back, my patch removes the two
steps back, but we are still not at the level of what my blacklisting
patch does, for that we would need to fix two issues:

1. Fix the retrieval of the last level at boot
2. Fix level 0 (yes, I consider that a regression)

But we cannot achieve either of those for v3.11, the only possibilities
seem to be either a) revert efaa14c, or b) keep it and apply my patch.
Anything else doesn't seem to be a possible or sensible option, and I
vote for b).

--
Felipe Contreras
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] acpi: video: improve quirk check
On Sat, Aug 3, 2013 at 8:18 PM, Aaron Lu wrote:
> On 08/03/2013 07:34 PM, Rafael J. Wysocki wrote:
>> On Saturday, August 03, 2013 04:14:04 PM Aaron Lu wrote:
>>> On 08/03/2013 07:47 AM, Rafael J. Wysocki wrote:
>>>> On Friday, August 02, 2013 02:37:09 PM Felipe Contreras wrote:
>>>>> If the _BCL package is descending, the first level (br->levels[2]) will
>>>>> be 0, and if the number of levels matches the number of steps, we might
>>>>> confuse a returned level to mean the index.
>>>>>
>>>>> For example:
>>>>>
>>>>> current_level = max_level = 100
>>>>> test_level = 0
>>>>> returned level = 100
>>>>>
>>>>> In this case 100 means the level, not the index, and _BCM failed. But if
>>>>> the _BCL package is descending, the index of level 0 is also 100, so we
>>>>> assume _BQC is indexed, when it's not.
>>>>>
>>>>> This causes all _BQC calls to return bogus values causing weird behavior
>>>>> from the user's perspective. For example: xbacklight -set 10; xbacklight
>>>>> -set 20; would flash to 90% and then slowly down to the desired level
>>>>> (20).
>>>>>
>>>>> The solution is simple; test anything other than the first level (e.g.
>>>>> 1).
>>>>>
>>>>> Signed-off-by: Felipe Contreras
>>>>
>>>> Looks reasonable. Aaron, what do you think?
>>>
>>> Yes, the patch is correct, but I still prefer my own version :-)
>>> https://github.com/aaronlu/linux/commit/0a3d2c5b59caf80ae5bb1ca1fda0f7bf448b38c9
>>>
>>> In case you want to take mine and mine needs refresh, please let me know
>>> and I can do the re-base, thanks.
>>
>> Well, I prefer simpler, unless there's a good reason to use more complicated.
>>
>> Why exactly do you think your version is better?
>
> As explained here:
> https://lkml.org/lkml/2013/8/2/81
> https://lkml.org/lkml/2013/8/2/112
>
> And for the demo broken _BQC, mine patch will disable _BQC while still
> make the backlight work, and this patch here is testing the max
> brightness level and may fail.

Yes, but that problem can *only* happen in such a simplified _BCL,
which is very very unlikely. Still, it would make sense to fix the code
for that case.

However, we can fix the problem first for the real known cases with my
simple one-liner patch that can even be merged for v3.11, and *later*
fix the issue for the synthetic unlikely case.

Personally I think there are better ways to fix the code for the
synthetic case than what your patch does, which will also make _BQC
work. That can be discussed later though, the one-liner is simple, and
it works.

--
Felipe Contreras
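The ambiguity described in the quoted patch can be reproduced with a small user-space model. This is only a sketch under stated assumptions: `looks_indexed()`, `fill_levels()` and the flat `levels` array are hypothetical stand-ins for the driver's check, not the kernel code, and the array omits the two leading entries that the real `br->levels[]` carries.

```c
#include <stddef.h>

/*
 * Hypothetical model of the "is _BQC returning an index?" guess.
 * levels[] holds the brightness levels sorted ascending; reversed != 0
 * means the firmware's _BCL package listed them in descending order, so a
 * firmware index i refers to levels[n - 1 - i].
 */
static int looks_indexed(const int *levels, size_t n, int reversed,
                         int returned, int test_level)
{
    size_t idx;

    if (returned < 0 || (size_t)returned >= n)
        return 0;               /* too large to be an index at all */
    idx = reversed ? n - 1 - (size_t)returned : (size_t)returned;
    return levels[idx] == test_level;
}

/* Fill levels[] with 0..n-1, i.e. one level per step. */
static void fill_levels(int *levels, size_t n)
{
    for (size_t i = 0; i < n; i++)
        levels[i] = (int)i;
}
```

With 101 levels (0..100), a descending _BCL and a failed _BCM that leaves the level at 100, testing with the first level (0) makes the stale value 100 look like a valid index, while testing with the second level (1) does not.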
Re: [PATCH 01/18] mm, hugetlb: protect reserved pages when softofflining requests the pages
On Fri, Aug 2, 2013 at 12:17 AM, Aneesh Kumar K.V wrote:
> Hillf Danton writes:
>
>> On Wed, Jul 31, 2013 at 2:37 PM, Joonsoo Kim wrote:
>>> On Wed, Jul 31, 2013 at 02:21:38PM +0800, Hillf Danton wrote:
>>>> On Wed, Jul 31, 2013 at 12:41 PM, Joonsoo Kim wrote:
>>>>> On Wed, Jul 31, 2013 at 10:49:24AM +0800, Hillf Danton wrote:
>>>>>> On Wed, Jul 31, 2013 at 10:27 AM, Joonsoo Kim wrote:
>>>>>>> On Mon, Jul 29, 2013 at 03:24:46PM +0800, Hillf Danton wrote:
>>>>>>>> On Mon, Jul 29, 2013 at 1:31 PM, Joonsoo Kim wrote:
>>>>>>>>> alloc_huge_page_node() use dequeue_huge_page_node() without
>>>>>>>>> any validation check, so it can steal reserved page
>>>>>>>>> unconditionally.
>>>>>>>>
>>>>>>>> Well, why is it illegal to use reserved page here?
>>>>>>>
>>>>>>> If we use reserved page here, other processes which are promised to use
>>>>>>> enough hugepages cannot get enough hugepages and can die. This is
>>>>>>> unexpected result to them.
>>>>>>>
>>>>>> But, how do you determine that a huge page is requested by a process
>>>>>> that is not allowed to use reserved pages?
>>>>>
>>>>> Reserved page is just one for each address or file offset. If we need to
>>>>> move this page, this means that it already use it's own reserved page, this
>>>>> page is it. So we should not use other reserved page for moving this page.
>>>>>
>>>> Hm, how do you determine "this page" is not buddy?
>>>
>>> If this page comes from the buddy, it doesn't matter. It imply that
>>> this mapping cannot use reserved page pool, because we always allocate
>>> a page from reserved page pool first.
>>>
>> A buddy page also implies, if the mapping can use reserved pages, that no
>> reserved page was available when requested. Now we can try reserved
>> page again.
>
> I didn't quiet get that. My understanding is, the new page we are

Neither did I ;)

> allocating here for soft offline should not be allocated from the
> reserve pool. If we do that we may possibly have allocation failure
> later for a request that had done page reservation. Now to
> avoid that we make sure we have enough free pages outside reserve pool
> so that we can dequeue the huge page. If not we use buddy to allocate
> the hugepage.
>
What if it is a mapping with HPAGE_RESV_OWNER set?
Or can we block owner from using any page available here?
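The rule Aneesh describes (dequeue from the pool only while free pages outside the reservation remain, otherwise go to the buddy allocator) can be written down as a tiny model. This is a sketch, not the hugetlb code: `struct pool_model` and `can_dequeue_unreserved()` are hypothetical names, and the model deliberately ignores the HPAGE_RESV_OWNER case raised in the reply.

```c
#include <stdbool.h>

/* Hypothetical, simplified view of a hugepage pool's counters. */
struct pool_model {
    long free_huge_pages;   /* pages currently sitting in the pool */
    long resv_huge_pages;   /* of those, pages promised to reservations */
};

/*
 * True if a page can be taken from the pool without eating into the
 * reserved part. When this is false, the caller in the model would fall
 * back to allocating a fresh huge page from the buddy allocator.
 */
static bool can_dequeue_unreserved(const struct pool_model *p)
{
    return p->free_huge_pages - p->resv_huge_pages > 0;
}
```

For example, a pool with 4 free pages and 4 reservations must not be used for the soft-offline target page, while a pool with 5 free pages and 4 reservations has exactly one page to spare.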
Re: [RFC/PATCH 0/2] ext4: Transparent Decompression Support
On Sat, 3 August 2013 20:33:16 -0400, Theodore Ts'o wrote:
>
> P.P.S. At least in theory, nothing of what I've described here has to
> be ext4 specific. We could implement this in the VFS layer, at which
> point not only ext4 would benefit, but also btrfs, xfs, f2fs, etc.

Except for an inode bit that needs to be stored in the filesystem,
agreed.

The ugliness I see is in detecting how to treat the filesystem at hand.

Filesystems with mandatory compression (jffs2, ubifs,...):
- Just write the file, nothing to do.

Filesystems with optional compression (logfs, ext2compr,...):
- You may or may not want to chattr between file creation and writing
  the payload.

Filesystems without compression (ext[234], xfs,...):
- Just write the file, nothing can be done.
- Alternatively fall back to a userspace version.

Filesystems with optional uncompression (what is being proposed):
- Write the file in compressed form, close, chattr.

I would like to see the compression side done in the kernel as well.
Then we can chattr right after creat() and, if that fails, either
proceed anyway or go to a userspace fallback. All decisions can be made
early on and we don't have to share the format with lots of userspace.

Sure, we still have to share the format with fsck and similar
filesystem tools. But not with installers.

Jörn

--
One must not confuse what seems improbable and unnatural to us with
what is absolutely impossible.
-- Carl Friedrich Gauß
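The "chattr right after creat(), and fall back if that fails" flow could look roughly like this from an installer's point of view. It is a sketch only: `try_set_compr_flag()` is a hypothetical helper, and it assumes the feature would be driven through the existing `FS_COMPR_FL` inode flag (the bit that `chattr +c` sets), which the thread does not confirm.

```c
#include <sys/ioctl.h>
#include <linux/fs.h>

/*
 * Try to set the "compressed" inode flag on an open file, the way
 * chattr +c does. Returns 0 on success and -1 on failure, in which case
 * errno is left as set by ioctl() and the caller can either proceed
 * uncompressed or switch to a userspace fallback.
 */
static int try_set_compr_flag(int fd)
{
    int flags;

    if (ioctl(fd, FS_IOC_GETFLAGS, &flags) < 0)
        return -1;
    flags |= FS_COMPR_FL;
    if (ioctl(fd, FS_IOC_SETFLAGS, &flags) < 0)
        return -1;
    return 0;
}
```

Because the call happens before any payload is written, the decision between kernel compression, no compression, and a userspace fallback can be made once, early on.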
Re: [git pull] drm build fix
On Sat, Aug 3, 2013 at 6:08 PM, Dave Airlie wrote:
>
> Alex Deucher (1):
>       drm/radeon: fix 64 bit divide in SI spm code

That code is stupid.

You're asking for a 64-by-64 divide, when the divisor is clearly an
"int" (100 and 1000 respectively).

Why is it doing "div64_s64()" instead of the simpler and faster
"div_s64()"? A full 64-by-64 divide is _expensive_ on 32-bit
architecture (up to four divide instructions, each potentially
expensive in its own right), which is the whole reason why we have that
"math64.h" to begin with - to make people aware of it.

Now, our lib/div64.c routines do notice that the upper bits of the
divisor are zero and end up using the simpler 64-by-32 divide
functions, but why explicitly ask for those more complex functions to
begin with?

So the code is likely not performance critical, and hey, our library
routines do the simple optimizations to avoid the trivially excessive
divide instructions, so this "doesn't matter". Except for the annoyance
factor of you using a more complicated function for no reason.

                Linus
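To see why a 64-by-32 divide is the cheaper request, it helps to write one out as schoolbook long division. The functions below are user-space stand-ins with made-up names (`sketch_div_u64_u32()`, `sketch_div_s64()`); they are not the kernel's `div_s64()` implementation, only an illustration of the idea that a 32-bit divisor lets the work be split into two narrow steps. A zero divisor and the INT64_MIN corner case are not handled.

```c
#include <stdint.h>

/* 64-by-32 unsigned divide done as two steps of long division. */
static uint64_t sketch_div_u64_u32(uint64_t n, uint32_t d)
{
    uint32_t hi = (uint32_t)(n >> 32);
    uint32_t lo = (uint32_t)n;
    uint64_t q_hi = hi / d;     /* plain 32-by-32 divide */
    uint64_t rem = hi % d;      /* rem < d, so the next quotient fits in 32 bits */
    uint64_t q_lo = ((rem << 32) | lo) / d;

    return (q_hi << 32) | q_lo;
}

/* Signed wrapper: divide magnitudes, then restore the sign (truncating). */
static int64_t sketch_div_s64(int64_t n, int32_t d)
{
    int negative = (n < 0) != (d < 0);
    uint64_t un = n < 0 ? 0 - (uint64_t)n : (uint64_t)n;
    uint32_t ud = d < 0 ? 0 - (uint32_t)d : (uint32_t)d;
    uint64_t q = sketch_div_u64_u32(un, ud);

    return negative ? -(int64_t)q : (int64_t)q;
}
```

The second step divides a value whose upper half is smaller than the divisor, which is the shape a 32-bit machine's native divide instruction typically handles directly. A full 64-by-64 divide has no such shortcut.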
Re: [RFC/PATCH 0/2] ext4: Transparent Decompression Support
On Fri, Jul 26, 2013 at 09:20:34AM -0400, Jörn Engel wrote:
>
> I don't think the e2compr patches are strictly necessary. They are a
> good option, but not the only one.

Sorry for not chiming in earlier; I've been travelling this past week,
and between that and a bunch of other things I've fallen a bit behind
on my e-mail.

> One trick to simplify the problem is to make Dhaval's compressed files
> strictly read-only. It will require some dance to load the compressed
> content, flip the switch, then uncompress data on the fly and disallow
> writes. Not the most pleasing of interfaces, but yet another option.

Yeah, this is something that I've wanted for a while. (In fact a few
years ago I shopped around this design to some folks who were
associated with Firefox.) MacOS has something rather similar to this.

I haven't had a chance to look at Dhaval's patches yet, but the way
I've been thinking about this is that the compression and building the
table mapping compressed clusters to byte offsets in the file would be
done in userspace. Once the compressed file plus the table is written
to the disk, the userspace program would then close the file
descriptor, and then set the "compressed" bit.

When the bit is set, we flush all of its pages from the page cache, and
the file becomes immutable. At that point, the kernel will handle the
decompression, by implementing readpages() by reading the pages into
the buffer cache, and then decompressing the compressed cluster of
pages into the page cache.

This gives us transparent compression, with a fraction of the
complexity of supporting read/write compression. In addition, since we
don't have to worry rewriting a cluster (and having the modified
compressed cluster taking up more space), the on-disk representation
can be a lot more efficient, since you don't have to use a
stacker-style design.

One of the cool things about this design is that the vast majority of
files on a typical distribution are write-once, and better yet, they
are written by the package manager. So once you teach dpkg, rpm, and
the Android package installer how to write the file in this compressed
format and set the compressed bit, we can get the vast majority of the
benefits of using compressed file with minimal effort.

                                        - Ted

P.S. This is interesting not just for systems with slow HDD's, but
also for cheap, single-channel MMC flash, the kind found in low-end
handset and embedded systems.

P.P.S. At least in theory, nothing of what I've described here has to
be ext4 specific. We could implement this in the VFS layer, at which
point not only ext4 would benefit, but also btrfs, xfs, f2fs, etc.
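The "table mapping compressed clusters to byte offsets" is what lets a read at an arbitrary file offset find the right compressed cluster. A minimal sketch of such a lookup follows. The layout is an assumption made for illustration (one 64-bit start offset per cluster plus a final end marker), and `struct cluster_extent` and `find_cluster()` are hypothetical names; the thread does not specify an on-disk format.

```c
#include <stdint.h>
#include <stddef.h>

/* Where one compressed cluster lives on disk (hypothetical layout). */
struct cluster_extent {
    uint64_t disk_off;  /* byte offset of the compressed data */
    uint64_t comp_len;  /* compressed length in bytes */
};

/*
 * table[i] is the byte offset at which compressed cluster i starts and
 * table[nclusters] marks the end of the last cluster. Returns 0 and fills
 * *out for a valid uncompressed file offset, or -1 if it is past the end.
 */
static int find_cluster(const uint64_t *table, size_t nclusters,
                        uint64_t cluster_size, uint64_t file_off,
                        struct cluster_extent *out)
{
    uint64_t idx = file_off / cluster_size;

    if (idx >= nclusters)
        return -1;
    out->disk_off = table[idx];
    out->comp_len = table[idx + 1] - table[idx];
    return 0;
}
```

A readpages()-style path would then read `comp_len` bytes starting at `disk_off` and decompress them into the pages covering that cluster. Since the file is immutable once the bit is set, the table never has to be rewritten.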
[PATCH 03/23] thp: compile-time and sysfs knob for thp pagecache
From: "Kirill A. Shutemov"

For now, TRANSPARENT_HUGEPAGE_PAGECACHE is only implemented for x86_64.

Radix tree preload overhead can be significant on BASE_SMALL systems, so
let's add dependency on !BASE_SMALL.

/sys/kernel/mm/transparent_hugepage/page_cache is runtime knob for the
feature. It's enabled by default if TRANSPARENT_HUGEPAGE_PAGECACHE is
enabled.

Signed-off-by: Kirill A. Shutemov
---
 Documentation/vm/transhuge.txt | 9 +
 include/linux/huge_mm.h        | 9 +
 mm/Kconfig                     | 12
 mm/huge_memory.c               | 23 +++
 4 files changed, 53 insertions(+)

diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt
index 4a63953..4cc15c4 100644
--- a/Documentation/vm/transhuge.txt
+++ b/Documentation/vm/transhuge.txt
@@ -103,6 +103,15 @@ echo always >/sys/kernel/mm/transparent_hugepage/enabled
 echo madvise >/sys/kernel/mm/transparent_hugepage/enabled
 echo never >/sys/kernel/mm/transparent_hugepage/enabled
 
+If TRANSPARENT_HUGEPAGE_PAGECACHE is enabled kernel will use huge pages in
+page cache if possible. It can be disable and re-enabled via sysfs:
+
+echo 0 >/sys/kernel/mm/transparent_hugepage/page_cache
+echo 1 >/sys/kernel/mm/transparent_hugepage/page_cache
+
+If it's disabled kernel will not add new huge pages to page cache and
+split them on mapping, but already mapped pages will stay intakt.
+
 It's also possible to limit defrag efforts in the VM to generate
 hugepages in case they're not immediately free to madvise regions or
 to never try to defrag memory and simply fallback to regular pages
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 3935428..1534e1e 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -40,6 +40,7 @@ enum transparent_hugepage_flag {
 	TRANSPARENT_HUGEPAGE_DEFRAG_FLAG,
 	TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG,
 	TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG,
+	TRANSPARENT_HUGEPAGE_PAGECACHE,
 	TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG,
 #ifdef CONFIG_DEBUG_VM
 	TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG,
@@ -229,4 +230,12 @@ static inline int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_str
 
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
+static inline bool transparent_hugepage_pagecache(void)
+{
+	if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE))
+		return false;
+	if (!(transparent_hugepage_flags & (1
[PATCH 07/23] mm: trace filemap: dump page order
From: "Kirill A. Shutemov"

Dump page order to trace to be able to distinguish between small page
and huge page in page cache.

Signed-off-by: Kirill A. Shutemov
Acked-by: Dave Hansen
---
 include/trace/events/filemap.h | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/include/trace/events/filemap.h b/include/trace/events/filemap.h
index 0421f49..7e14b13 100644
--- a/include/trace/events/filemap.h
+++ b/include/trace/events/filemap.h
@@ -21,6 +21,7 @@ DECLARE_EVENT_CLASS(mm_filemap_op_page_cache,
 		__field(struct page *, page)
 		__field(unsigned long, i_ino)
 		__field(unsigned long, index)
+		__field(int, order)
 		__field(dev_t, s_dev)
 	),
 
@@ -28,18 +29,20 @@ DECLARE_EVENT_CLASS(mm_filemap_op_page_cache,
 		__entry->page = page;
 		__entry->i_ino = page->mapping->host->i_ino;
 		__entry->index = page->index;
+		__entry->order = compound_order(page);
 		if (page->mapping->host->i_sb)
 			__entry->s_dev = page->mapping->host->i_sb->s_dev;
 		else
 			__entry->s_dev = page->mapping->host->i_rdev;
 	),
 
-	TP_printk("dev %d:%d ino %lx page=%p pfn=%lu ofs=%lu",
+	TP_printk("dev %d:%d ino %lx page=%p pfn=%lu ofs=%lu order=%d",
 		MAJOR(__entry->s_dev), MINOR(__entry->s_dev),
 		__entry->i_ino,
 		__entry->page,
 		page_to_pfn(__entry->page),
-		__entry->index << PAGE_SHIFT)
+		__entry->index << PAGE_SHIFT,
+		__entry->order)
 );
 
 DEFINE_EVENT(mm_filemap_op_page_cache, mm_filemap_delete_from_page_cache,
--
1.8.3.2
[PATCH 05/23] thp: represent file thp pages in meminfo and friends
From: "Kirill A. Shutemov"

The patch adds new zone stat to count file transparent huge pages and
adjust related places.

For now we don't count mapped or dirty file thp pages separately.

The patch depends on patch
 thp: account anon transparent huge pages into NR_ANON_PAGES

Signed-off-by: Kirill A. Shutemov
Acked-by: Dave Hansen
---
 drivers/base/node.c    | 4
 fs/proc/meminfo.c      | 3 +++
 include/linux/mmzone.h | 1 +
 mm/vmstat.c            | 1 +
 4 files changed, 9 insertions(+)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index bc9f43b..de261f5 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -119,6 +119,7 @@ static ssize_t node_read_meminfo(struct device *dev,
 		       "Node %d SUnreclaim: %8lu kB\n"
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 		       "Node %d AnonHugePages: %8lu kB\n"
+		       "Node %d FileHugePages: %8lu kB\n"
 #endif
 			,
 		       nid, K(node_page_state(nid, NR_FILE_DIRTY)),
@@ -140,6 +141,9 @@ static ssize_t node_read_meminfo(struct device *dev,
 		       nid, K(node_page_state(nid, NR_SLAB_UNRECLAIMABLE))
 			, nid,
 			K(node_page_state(nid, NR_ANON_TRANSPARENT_HUGEPAGES) *
+			HPAGE_PMD_NR)
+			, nid,
+			K(node_page_state(nid, NR_FILE_TRANSPARENT_HUGEPAGES) *
 			HPAGE_PMD_NR));
 #else
 		       nid, K(node_page_state(nid, NR_SLAB_UNRECLAIMABLE)));
diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index 59d85d6..a62952c 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -104,6 +104,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
 #endif
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 		"AnonHugePages: %8lu kB\n"
+		"FileHugePages: %8lu kB\n"
 #endif
 		,
 		K(i.totalram),
@@ -158,6 +159,8 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 		,K(global_page_state(NR_ANON_TRANSPARENT_HUGEPAGES) *
 		   HPAGE_PMD_NR)
+		,K(global_page_state(NR_FILE_TRANSPARENT_HUGEPAGES) *
+		   HPAGE_PMD_NR)
 #endif
 		);
 
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 0c41d59..ba81833 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -142,6 +142,7 @@ enum zone_stat_item {
 	NUMA_OTHER,		/* allocation from other node */
 #endif
 	NR_ANON_TRANSPARENT_HUGEPAGES,
+	NR_FILE_TRANSPARENT_HUGEPAGES,
 	NR_FREE_CMA_PAGES,
 	NR_VM_ZONE_STAT_ITEMS };
 
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 87228c5..ffe3fbd 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -739,6 +739,7 @@ const char * const vmstat_text[] = {
 	"numa_other",
 #endif
 	"nr_anon_transparent_hugepages",
+	"nr_file_transparent_hugepages",
 	"nr_free_cma",
 	"nr_dirty_threshold",
 	"nr_dirty_background_threshold",
--
1.8.3.2
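The FileHugePages value printed by this patch is the huge page counter multiplied by HPAGE_PMD_NR and then converted from pages to kB by K(). A worked version of that arithmetic, as a sketch: the constants assume x86_64 with 4 KiB base pages and 2 MiB huge pages, and the `SKETCH_` names and both helpers are made up for the example.

```c
/* Assumed values for x86_64: 4 KiB base pages, 512 base pages per huge page. */
#define SKETCH_PAGE_SHIFT   12
#define SKETCH_HPAGE_PMD_NR 512UL

/* Same idea as the K() helper used in the patch: base pages to kB. */
static unsigned long pages_to_kb(unsigned long pages)
{
    return pages << (SKETCH_PAGE_SHIFT - 10);
}

/* kB reported for a given number of file transparent huge pages. */
static unsigned long file_hugepages_kb(unsigned long nr_huge)
{
    return pages_to_kb(nr_huge * SKETCH_HPAGE_PMD_NR);
}
```

Under these assumptions, three file huge pages are 1536 base pages, which is reported as 6144 kB.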
[PATCH 13/23] thp, mm: allocate huge pages in grab_cache_page_write_begin()
From: "Kirill A. Shutemov"

Try to allocate huge page if flags has AOP_FLAG_TRANSHUGE.

If, for some reason, it's not possible to allocate a huge page at this
position, it returns NULL. Caller should take care of fallback to small
pages.

Signed-off-by: Kirill A. Shutemov
---
 include/linux/fs.h | 1 +
 mm/filemap.c       | 24 ++--
 2 files changed, 23 insertions(+), 2 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index b09ddc0..d5f58b3 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -282,6 +282,7 @@ enum positive_aop_returns {
 #define AOP_FLAG_NOFS			0x0004 /* used by filesystem to direct
 						* helper code (eg buffer layer)
 						* to clear GFP_FS from alloc */
+#define AOP_FLAG_TRANSHUGE		0x0008 /* allocate transhuge page */
 
 /*
  * oh the beauties of C type declarations.
diff --git a/mm/filemap.c b/mm/filemap.c
index 28f4927..b17ebb9 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2313,18 +2313,38 @@ struct page *grab_cache_page_write_begin(struct address_space *mapping,
 	gfp_t gfp_mask;
 	struct page *page;
 	gfp_t gfp_notmask = 0;
+	bool must_use_thp = (flags & AOP_FLAG_TRANSHUGE) &&
+		IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE);
+
 
 	gfp_mask = mapping_gfp_mask(mapping);
+	if (must_use_thp) {
+		BUG_ON(index & HPAGE_CACHE_INDEX_MASK);
+		BUG_ON(!(gfp_mask & __GFP_COMP));
+	}
 	if (mapping_cap_account_dirty(mapping))
 		gfp_mask |= __GFP_WRITE;
 	if (flags & AOP_FLAG_NOFS)
 		gfp_notmask = __GFP_FS;
 repeat:
 	page = find_lock_page(mapping, index);
-	if (page)
+	if (page) {
+		if (must_use_thp && !PageTransHuge(page)) {
+			unlock_page(page);
+			page_cache_release(page);
+			return NULL;
+		}
 		goto found;
+	}
 
-	page = __page_cache_alloc(gfp_mask & ~gfp_notmask);
+	if (must_use_thp) {
+		page = alloc_pages(gfp_mask & ~gfp_notmask, HPAGE_PMD_ORDER);
+		if (page)
+			count_vm_event(THP_WRITE_ALLOC);
+		else
+			count_vm_event(THP_WRITE_ALLOC_FAILED);
+	} else
+		page = __page_cache_alloc(gfp_mask & ~gfp_notmask);
 	if (!page)
 		return NULL;
 	status = add_to_page_cache_lru(page, mapping, index,
--
1.8.3.2
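Because the patch does BUG_ON(index & HPAGE_CACHE_INDEX_MASK), a caller that sets AOP_FLAG_TRANSHUGE has to pass an index aligned to the huge page size. A sketch of the alignment arithmetic a caller might use follows. The value 512 is an assumption (x86_64, 4 KiB pages, 2 MiB huge pages), and the `SKETCH_` macros and helper names are hypothetical, not part of the series.

```c
#include <stdbool.h>

/* Assumed: 512 small page-cache pages per huge page. */
#define SKETCH_HPAGE_CACHE_NR         512UL
#define SKETCH_HPAGE_CACHE_INDEX_MASK (SKETCH_HPAGE_CACHE_NR - 1)

/* True if index is a valid starting index for a huge page. */
static bool huge_index_aligned(unsigned long index)
{
    return (index & SKETCH_HPAGE_CACHE_INDEX_MASK) == 0;
}

/* Index of the huge page that would cover the given small-page index. */
static unsigned long huge_index_round_down(unsigned long index)
{
    return index & ~SKETCH_HPAGE_CACHE_INDEX_MASK;
}
```

A write at small-page index 513 would therefore ask for the huge page starting at index 512, and fall back to small pages if that request returns NULL.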
[PATCH 08/23] block: implement add_bdi_stat()
From: "Kirill A. Shutemov"

We're going to add/remove a number of page cache entries at once. This
patch implements add_bdi_stat() which adjusts bdi stats by arbitrary
amount. It's required for batched page cache manipulations.

Signed-off-by: Kirill A. Shutemov
---
 include/linux/backing-dev.h | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index c388155..7060180 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -166,6 +166,16 @@ static inline void __dec_bdi_stat(struct backing_dev_info *bdi,
 	__add_bdi_stat(bdi, item, -1);
 }
 
+static inline void add_bdi_stat(struct backing_dev_info *bdi,
+		enum bdi_stat_item item, s64 amount)
+{
+	unsigned long flags;
+
+	local_irq_save(flags);
+	__add_bdi_stat(bdi, item, amount);
+	local_irq_restore(flags);
+}
+
 static inline void dec_bdi_stat(struct backing_dev_info *bdi,
 		enum bdi_stat_item item)
 {
--
1.8.3.2
[PATCH 12/23] thp, mm: add event counters for huge page alloc on file write or read
From: "Kirill A. Shutemov"

Existing stats specify source of thp page: fault or collapse. We're
going to allocate a new huge page with write(2) and read(2). It's
neither fault nor collapse.

Let's introduce new events for that.

Signed-off-by: Kirill A. Shutemov
---
 Documentation/vm/transhuge.txt | 7 +++
 include/linux/huge_mm.h        | 5 +
 include/linux/vm_event_item.h  | 4
 mm/vmstat.c                    | 4
 4 files changed, 20 insertions(+)

diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt
index 4cc15c4..a78f738 100644
--- a/Documentation/vm/transhuge.txt
+++ b/Documentation/vm/transhuge.txt
@@ -202,6 +202,10 @@ thp_collapse_alloc is incremented by khugepaged when it has found
 	a range of pages to collapse into one huge page and has
 	successfully allocated a new huge page to store the data.
 
+thp_write_alloc and thp_read_alloc are incremented every time a huge
+	page is successfully allocated to handle write(2) to a file or
+	read(2) from file.
+
 thp_fault_fallback is incremented if a page fault fails to allocate
 	a huge page and instead falls back to using small pages.
 
@@ -209,6 +213,9 @@ thp_collapse_alloc_failed is incremented if khugepaged found a range
 	of pages that should be collapsed into one huge page but failed
 	the allocation.
 
+thp_write_alloc_failed and thp_read_alloc_failed are incremented if
+	huge page allocation failed when tried on write(2) or read(2).
+
 thp_split is incremented every time a huge page is split into base
 	pages. This can happen for a variety of reasons but a common
 	reason is that a huge page is old and is being reclaimed.
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 4dc66c9..9a0a114 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -183,6 +183,11 @@ extern int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vm
 #define HPAGE_PMD_MASK ({ BUILD_BUG(); 0; })
 #define HPAGE_PMD_SIZE ({ BUILD_BUG(); 0; })
 
+#define THP_WRITE_ALLOC		({ BUILD_BUG(); 0; })
+#define THP_WRITE_ALLOC_FAILED	({ BUILD_BUG(); 0; })
+#define THP_READ_ALLOC		({ BUILD_BUG(); 0; })
+#define THP_READ_ALLOC_FAILED	({ BUILD_BUG(); 0; })
+
 #define hpage_nr_pages(x) 1
 
 #define transparent_hugepage_enabled(__vma) 0
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 1855f0a..8e071bb 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -66,6 +66,10 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		THP_FAULT_FALLBACK,
 		THP_COLLAPSE_ALLOC,
 		THP_COLLAPSE_ALLOC_FAILED,
+		THP_WRITE_ALLOC,
+		THP_WRITE_ALLOC_FAILED,
+		THP_READ_ALLOC,
+		THP_READ_ALLOC_FAILED,
 		THP_SPLIT,
 		THP_ZERO_PAGE_ALLOC,
 		THP_ZERO_PAGE_ALLOC_FAILED,
diff --git a/mm/vmstat.c b/mm/vmstat.c
index ffe3fbd..a80ea59 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -815,6 +815,10 @@ const char * const vmstat_text[] = {
 	"thp_fault_fallback",
 	"thp_collapse_alloc",
 	"thp_collapse_alloc_failed",
+	"thp_write_alloc",
+	"thp_write_alloc_failed",
+	"thp_read_alloc",
+	"thp_read_alloc_failed",
 	"thp_split",
 	"thp_zero_page_alloc",
 	"thp_zero_page_alloc_failed",
--
1.8.3.2
[PATCH 06/23] thp, mm: rewrite add_to_page_cache_locked() to support huge pages
From: "Kirill A. Shutemov"

For huge page we add to radix tree HPAGE_CACHE_NR pages at once: head
page for the specified index and HPAGE_CACHE_NR-1 tail pages for
following indexes.

Signed-off-by: Kirill A. Shutemov
Acked-by: Dave Hansen
---
 include/linux/huge_mm.h    | 24 ++
 include/linux/page-flags.h | 33 ++
 mm/filemap.c               | 50 +++---
 3 files changed, 95 insertions(+), 12 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 1534e1e..4dc66c9 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -230,6 +230,20 @@ static inline int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_str
 
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE
+
+#define HPAGE_CACHE_ORDER (HPAGE_SHIFT - PAGE_CACHE_SHIFT)
+#define HPAGE_CACHE_NR (1L << HPAGE_CACHE_ORDER)
+#define HPAGE_CACHE_INDEX_MASK (HPAGE_CACHE_NR - 1)
+
+#else
+
+#define HPAGE_CACHE_ORDER ({ BUILD_BUG(); 0; })
+#define HPAGE_CACHE_NR ({ BUILD_BUG(); 0; })
+#define HPAGE_CACHE_INDEX_MASK ({ BUILD_BUG(); 0; })
+
+#endif
+
 static inline bool transparent_hugepage_pagecache(void)
 {
 	if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE))
@@ -238,4 +252,14 @@ static inline bool transparent_hugepage_pagecache(void)
 		return false;
 	return transparent_hugepage_flags & (1RADIX_TREE_PRELOAD_NR);
+
+	nr = hpagecache_nr_pages(page);
+
+	error = radix_tree_maybe_preload_contig(nr, gfp_mask & ~__GFP_HIGHMEM);
 	if (error) {
 		mem_cgroup_uncharge_cache_page(page);
 		return error;
 	}
 
-	page_cache_get(page);
-	page->mapping = mapping;
-	page->index = offset;
-
 	spin_lock_irq(&mapping->tree_lock);
-	error = radix_tree_insert(&mapping->page_tree, offset, page);
+	page_cache_get(page);
+	for (i = 0; i < nr; i++) {
+		error = radix_tree_insert(&mapping->page_tree,
+				offset + i, page + i);
+		/*
+		 * In the midle of THP we can collide with small page which was
+		 * established before THP page cache is enabled or by other VMA
+		 * with bad alignement (most likely MAP_FIXED).
+		 */
+		if (error)
+			goto err_insert;
+		page[i].index = offset + i;
+		page[i].mapping = mapping;
+	}
 	radix_tree_preload_end();
-	if (unlikely(error))
-		goto err_insert;
-	mapping->nrpages++;
-	__inc_zone_page_state(page, NR_FILE_PAGES);
+	mapping->nrpages += nr;
+	__mod_zone_page_state(page_zone(page), NR_FILE_PAGES, nr);
+	if (PageTransHuge(page))
+		__inc_zone_page_state(page, NR_FILE_TRANSPARENT_HUGEPAGES);
 	spin_unlock_irq(&mapping->tree_lock);
 	trace_mm_filemap_add_to_page_cache(page);
 	return 0;
 err_insert:
-	page->mapping = NULL;
-	/* Leave page->index set: truncation relies upon it */
+	radix_tree_preload_end();
+	if (i != 0)
+		error = -ENOSPC; /* no space for a huge page */
+
+	/* page[i] was not inserted to tree, skip it */
+	i--;
+
+	for (; i >= 0; i--) {
+		/* Leave page->index set: truncation relies upon it */
+		page[i].mapping = NULL;
+		radix_tree_delete(&mapping->page_tree, offset + i);
+	}
 	spin_unlock_irq(&mapping->tree_lock);
 	mem_cgroup_uncharge_cache_page(page);
 	page_cache_release(page);
--
1.8.3.2
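The insert-then-roll-back pattern in this patch (insert nr consecutive entries, and on a collision delete the ones already inserted) can be shown with a flat array standing in for the radix tree. This is a simplified sketch, not the kernel code: `insert_range()` and the `slots` array are hypothetical, there is no locking or preloading, and a collision on the very first entry is mapped to -EEXIST to mirror the patch returning the insert error unchanged in that case.

```c
#include <stddef.h>
#include <errno.h>

/*
 * Insert nr items at consecutive slots starting at offset. If a slot is
 * already occupied (or out of range), undo the partial insert so that
 * either all nr items are present or none are.
 */
static int insert_range(const void **slots, size_t nslots, size_t offset,
                        const void **items, size_t nr)
{
    size_t i;
    int err;

    for (i = 0; i < nr; i++) {
        if (offset + i >= nslots || slots[offset + i] != NULL)
            goto rollback;      /* collision with an existing entry */
        slots[offset + i] = items[i];
    }
    return 0;

rollback:
    /* Failing part-way means there is no room for the whole range. */
    err = i ? -ENOSPC : -EEXIST;
    while (i-- > 0)             /* slot i itself was never written */
        slots[offset + i] = NULL;
    return err;
}
```

The point of the rollback is that a reader of the structure never sees a huge page that is only partially present after the call returns.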
[PATCH 11/23] thp, mm: handle tail pages in page_cache_get_speculative()
From: "Kirill A. Shutemov"

For tail pages we need to take two references:
 - ->_count for its head page;
 - ->_mapcount for the tail page;

To protect against splitting we take the compound lock and re-check that
we still have a tail page before taking the ->_mapcount reference.
If the page was split, we drop the ->_count reference on the head page and
return 0 to tell the caller that it must retry.

Signed-off-by: Kirill A. Shutemov
---
 include/linux/pagemap.h | 26 ++++++++++++++++++++++----
 1 file changed, 22 insertions(+), 4 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 47b5082..d459b38 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -161,6 +161,8 @@ void release_pages(struct page **pages, int nr, int cold);
  */
 static inline int page_cache_get_speculative(struct page *page)
 {
+	struct page *page_head = compound_trans_head(page);
+
 	VM_BUG_ON(in_interrupt());
 
 #ifdef CONFIG_TINY_RCU
@@ -176,11 +178,11 @@ static inline int page_cache_get_speculative(struct page *page)
 	 * disabling preempt, and hence no need for the "speculative get" that
 	 * SMP requires.
 	 */
-	VM_BUG_ON(page_count(page) == 0);
-	atomic_inc(&page->_count);
+	VM_BUG_ON(page_count(page_head) == 0);
+	atomic_inc(&page_head->_count);
 
 #else
-	if (unlikely(!get_page_unless_zero(page))) {
+	if (unlikely(!get_page_unless_zero(page_head))) {
 		/*
 		 * Either the page has been freed, or will be freed.
 		 * In either case, retry here and the caller should
@@ -189,7 +191,23 @@ static inline int page_cache_get_speculative(struct page *page)
 		return 0;
 	}
 #endif
-	VM_BUG_ON(PageTail(page));
+
+	if (unlikely(PageTransTail(page))) {
+		unsigned long flags;
+		int got = 0;
+
+		flags = compound_lock_irqsave(page_head);
+		if (likely(PageTransTail(page))) {
+			atomic_inc(&page->_mapcount);
+			got = 1;
+		}
+		compound_unlock_irqrestore(page_head, flags);
+
+		if (unlikely(!got))
+			atomic_dec(&page_head->_count);
+
+		return got;
+	}
 
 	return 1;
 }
-- 
1.8.3.2
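The take-a-reference, lock, re-check, undo-on-failure pattern described in this patch can be sketched in userspace with C11 atomics. This is only a toy model under stated assumptions: struct toy_page, toy_page_init(), toy_get_unless_zero() and toy_get_speculative() are hypothetical names, a plain atomic_flag spinlock stands in for the compound lock, and mapcount starts at 0 here rather than the kernel's -1.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Toy stand-in for struct page; every name here is hypothetical. */
struct toy_page {
	atomic_int count;	/* models ->_count (used on the head page) */
	atomic_int mapcount;	/* models ->_mapcount (used on tail pages) */
	atomic_bool is_tail;	/* models PageTransTail() */
	atomic_flag lock;	/* models the compound lock of the head */
	struct toy_page *head;	/* head page; a head points to itself */
};

static void toy_page_init(struct toy_page *p, struct toy_page *head,
			  int count, bool is_tail)
{
	atomic_init(&p->count, count);
	atomic_init(&p->mapcount, 0);
	atomic_init(&p->is_tail, is_tail);
	atomic_flag_clear(&p->lock);
	p->head = head;
}

/* Take a reference unless the count has already dropped to zero. */
static bool toy_get_unless_zero(struct toy_page *p)
{
	int old = atomic_load(&p->count);

	while (old != 0) {
		/* On failure 'old' is reloaded and the loop re-checks it. */
		if (atomic_compare_exchange_weak(&p->count, &old, old + 1))
			return true;
	}
	return false;
}

/* Returns 1 on success, 0 if the caller must retry the lookup. */
static int toy_get_speculative(struct toy_page *page)
{
	struct toy_page *head = page->head;
	int got = 0;

	if (!toy_get_unless_zero(head))
		return 0;
	if (!atomic_load(&page->is_tail))
		return 1;

	/* Re-check under the lock: a split may have raced with us. */
	while (atomic_flag_test_and_set(&head->lock))
		;	/* spin */
	if (atomic_load(&page->is_tail)) {
		atomic_fetch_add(&page->mapcount, 1);
		got = 1;
	}
	atomic_flag_clear(&head->lock);

	/* The page was split under us: undo the head reference. */
	if (!got)
		atomic_fetch_sub(&head->count, 1);
	return got;
}
```

The point of the re-check is that the first is_tail test is only a hint; the decision that counts is the one made while the lock is held.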
[PATCH 10/23] thp, mm: warn if we try to use replace_page_cache_page() with THP
From: "Kirill A. Shutemov"

replace_page_cache_page() is only used by FUSE. It's unlikely that we
will support THP in the FUSE page cache any time soon.

Let's postpone the implementation of THP handling in
replace_page_cache_page() until somebody needs it. Return -EINVAL and
WARN_ONCE() for now.

Signed-off-by: Kirill A. Shutemov
---
 mm/filemap.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/mm/filemap.c b/mm/filemap.c
index b75bdf5..28f4927 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -418,6 +418,10 @@ int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
 {
 	int error;
 
+	if (WARN_ONCE(PageTransHuge(old) || PageTransHuge(new),
+		      "unexpected transhuge page\n"))
+		return -EINVAL;
+
 	VM_BUG_ON(!PageLocked(old));
 	VM_BUG_ON(!PageLocked(new));
 	VM_BUG_ON(new->mapping);
-- 
1.8.3.2
[PATCH 15/23] mm, fs: avoid page allocation beyond i_size on read
From: "Kirill A. Shutemov"

I've noticed that we allocate an unneeded page cache page on read beyond
i_size. Simple test case (I checked it on ramfs):

$ touch testfile
$ cat testfile

It triggers the 'no_cached_page' code path in do_generic_file_read().

Looks like it's a regression since commit a32ea1e. Let's fix it.

Signed-off-by: Kirill A. Shutemov
Cc: NeilBrown
---
 mm/filemap.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/mm/filemap.c b/mm/filemap.c
index 066bbff..c31d296 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1163,6 +1163,10 @@ static void do_generic_file_read(struct file *filp, loff_t *ppos,
 		loff_t isize;
 		unsigned long nr, ret;
 
+		isize = i_size_read(inode);
+		if (!isize || index > (isize - 1) >> PAGE_CACHE_SHIFT)
+			goto out;
+
 		cond_resched();
 find_page:
 		page = find_get_page(mapping, index);
-- 
1.8.3.2
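The bound check added above can be tried out in isolation. A minimal sketch, assuming 4 KiB pages; TOY_PAGE_SHIFT and toy_index_beyond_eof() are hypothetical names for illustration, not kernel symbols.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12	/* assumption: 4 KiB pages */

/*
 * True if page 'index' starts at or beyond the end of a file that is
 * 'isize' bytes long, i.e. there is nothing to read from that page.
 * The last valid page index is (isize - 1) >> shift; an empty file has
 * no valid page at all, hence the separate zero check.
 */
static bool toy_index_beyond_eof(uint64_t isize, uint64_t index)
{
	if (isize == 0)
		return true;
	return index > ((isize - 1) >> TOY_PAGE_SHIFT);
}
```

The "isize - 1" matters for sizes that are an exact multiple of the page size: a 4096-byte file has one page (index 0), not two.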
[PATCH 20/23] thp: handle file pages in split_huge_page()
From: "Kirill A. Shutemov"

The base scheme is the same as for anonymous pages, but we walk
mapping->i_mmap rather than anon_vma->rb_root.

When we add a huge page to the page cache we take a reference only on the
head page, but on split we need to take an additional reference on all tail
pages since they are still in the page cache after splitting.

Signed-off-by: Kirill A. Shutemov
---
 mm/huge_memory.c | 89
 1 file changed, 76 insertions(+), 13 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 523946c..d7c6830 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1580,6 +1580,7 @@ static void __split_huge_page_refcount(struct page *page,
 	struct zone *zone = page_zone(page);
 	struct lruvec *lruvec;
 	int tail_count = 0;
+	int initial_tail_refcount;
 
 	/* prevent PageLRU to go away from under us, and freeze lru stats */
 	spin_lock_irq(&zone->lru_lock);
@@ -1589,6 +1590,13 @@ static void __split_huge_page_refcount(struct page *page,
 	/* complete memcg works before add pages to LRU */
 	mem_cgroup_split_huge_fixup(page);
 
+	/*
+	 * When we add a huge page to page cache we take only reference to head
+	 * page, but on split we need to take additional reference to all tail
+	 * pages since they are still in page cache after splitting.
+	 */
+	initial_tail_refcount = PageAnon(page) ? 0 : 1;
+
 	for (i = HPAGE_PMD_NR - 1; i >= 1; i--) {
 		struct page *page_tail = page + i;
 
@@ -1611,8 +1619,9 @@ static void __split_huge_page_refcount(struct page *page,
 		 * atomic_set() here would be safe on all archs (and
 		 * not only on x86), it's safer to use atomic_add().
*/ - atomic_add(page_mapcount(page) + page_mapcount(page_tail) + 1, - &page_tail->_count); + atomic_add(initial_tail_refcount + page_mapcount(page) + + page_mapcount(page_tail) + 1, + &page_tail->_count); /* after clearing PageTail the gup refcount can be released */ smp_mb(); @@ -1651,23 +1660,23 @@ static void __split_huge_page_refcount(struct page *page, */ page_tail->_mapcount = page->_mapcount; - BUG_ON(page_tail->mapping); page_tail->mapping = page->mapping; page_tail->index = page->index + i; page_nid_xchg_last(page_tail, page_nid_last(page)); - BUG_ON(!PageAnon(page_tail)); BUG_ON(!PageUptodate(page_tail)); BUG_ON(!PageDirty(page_tail)); - BUG_ON(!PageSwapBacked(page_tail)); lru_add_page_tail(page, page_tail, lruvec, list); } atomic_sub(tail_count, &page->_count); BUG_ON(atomic_read(&page->_count) <= 0); - __mod_zone_page_state(zone, NR_ANON_TRANSPARENT_HUGEPAGES, -1); + if (PageAnon(page)) + __mod_zone_page_state(zone, NR_ANON_TRANSPARENT_HUGEPAGES, -1); + else + __mod_zone_page_state(zone, NR_FILE_TRANSPARENT_HUGEPAGES, -1); ClearPageCompound(page); compound_unlock(page); @@ -1767,7 +1776,7 @@ static int __split_huge_page_map(struct page *page, } /* must be called with anon_vma->root->rwsem held */ -static void __split_huge_page(struct page *page, +static void __split_anon_huge_page(struct page *page, struct anon_vma *anon_vma, struct list_head *list) { @@ -1791,7 +1800,7 @@ static void __split_huge_page(struct page *page, * and establishes a child pmd before * __split_huge_page_splitting() freezes the parent pmd (so if * we fail to prevent copy_huge_pmd() from running until the -* whole __split_huge_page() is complete), we will still see +* whole __split_anon_huge_page() is complete), we will still see * the newly established pmd of the child later during the * walk, to be able to set it as pmd_trans_splitting too. */ @@ -1822,14 +1831,11 @@ static void __split_huge_page(struct page *page, * from the hugepage. 
* Return 0 if the hugepage is split successfully otherwise return 1. */ -int split_huge_page_to_list(struct page *page, struct list_head *list) +static int split_anon_huge_page(struct page *page, struct list_head *list) { struct anon_vma *anon_vma; int ret = 1; - BUG_ON(is_huge_zero_page(page)); - BUG_ON(!PageAnon(page)); - /* * The caller does not necessarily hold an mmap_sem that would prevent * the anon_vma disappearing so we first we take a reference to it @@ -1847,7 +1853,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list) goto out_unlock; BUG_ON(!PageSwapBacked(pag
[PATCHv5 00/23] Transparent huge page cache: phase 1, everything but mmap()
From: "Kirill A. Shutemov"

This is the second part of my transparent huge page cache work.
It brings thp support for ramfs, but without mmap() -- it will be posted
separately.

Intro
-----

The goal of the project is to prepare kernel infrastructure to handle huge
pages in the page cache.

To prove that the proposed changes are functional we enable the feature
for the simplest file system -- ramfs. ramfs is not that useful by
itself, but it's a good pilot project.

Design overview
---------------

Every huge page is represented in the page cache radix-tree by HPAGE_PMD_NR
(512 on x86-64) entries: one entry for the head page and HPAGE_PMD_NR-1
entries for tail pages.

Radix tree manipulations are implemented in a batched way: we add and remove
a whole huge page at once, under one tree_lock. To make this possible, we
extended the radix-tree interface to be able to pre-allocate enough memory
to insert a number of *contiguous* elements (kudos to Matthew Wilcox).

Huge pages can be added to the page cache in three ways:
 - write(2) to file or page;
 - read(2) from sparse file;
 - fault sparse file.

Potentially, one more way is collapsing small pages, but it's outside the
initial implementation.

For now we still write/read at most PAGE_CACHE_SIZE bytes at a time. There's
some room for speed up later.

Since mmap() isn't targeted by this patchset, we just split the huge page on
page fault.

To minimize memory overhead for small files we set up an fops->release
helper -- simple_thp_release() -- which splits the last page in the file
when the last writer goes away.

truncate_inode_pages_range() drops a whole huge page at once if it's fully
inside the range. If a huge page is only partly in the range we zero out
the part, exactly like we do for partial small pages.

split_huge_page() for file pages works similarly to anon pages, but we
walk mapping->i_mmap rather than anon_vma->rb_root. At the end we call
truncate_inode_pages() to drop small pages beyond i_size, if any.

The locking model around split_huge_page() is rather complicated and I'm
still not confident about it. It looks like we need to serialize over
i_mutex in split_huge_page(), but that breaks the i_mutex->mmap_sem lock
ordering. I don't see how it can be fixed easily. Any ideas are welcome.

Performance indicators will be posted separately.

Please, review.

Kirill A. Shutemov (23):
  radix-tree: implement preload for multiple contiguous elements
  memcg, thp: charge huge cache pages
  thp: compile-time and sysfs knob for thp pagecache
  thp, mm: introduce mapping_can_have_hugepages() predicate
  thp: represent file thp pages in meminfo and friends
  thp, mm: rewrite add_to_page_cache_locked() to support huge pages
  mm: trace filemap: dump page order
  block: implement add_bdi_stat()
  thp, mm: rewrite delete_from_page_cache() to support huge pages
  thp, mm: warn if we try to use replace_page_cache_page() with THP
  thp, mm: handle tail pages in page_cache_get_speculative()
  thp, mm: add event counters for huge page alloc on file write or read
  thp, mm: allocate huge pages in grab_cache_page_write_begin()
  thp, mm: naive support of thp in generic_perform_write
  mm, fs: avoid page allocation beyond i_size on read
  thp, mm: handle transhuge pages in do_generic_file_read()
  thp, libfs: initial thp support
  thp: libfs: introduce simple_thp_release()
  truncate: support huge pages
  thp: handle file pages in split_huge_page()
  thp: wait_split_huge_page(): serialize over i_mmap_mutex too
  thp, mm: split huge page on mmap file page
  ramfs: enable transparent huge page cache

 Documentation/vm/transhuge.txt |  16
 drivers/base/node.c            |   4
 fs/libfs.c                     |  80
 fs/proc/meminfo.c              |   3
 fs/ramfs/file-mmu.c            |   3
 fs/ramfs/inode.c               |   6
 include/linux/backing-dev.h    |  10
 include/linux/fs.h             |  10
 include/linux/huge_mm.h        |  53
 include/linux/mmzone.h         |   1
 include/linux/page-flags.h     |  33
 include/linux/pagemap.h        |  48
 include/linux/radix-tree.h     |  11
 include/linux/vm_event_item.h  |   4
 include/trace/events/filemap.h |   7
 lib/radix-tree.c               |  41
 mm/Kconfig                     |  12
 mm/filemap.c                   | 171
 mm/huge_memory.c               | 116
 mm/memcontrol.c                |   2
 mm/memory.c                    |   4
 mm/truncate.c                  | 108
 mm/vmstat.c                    |   5
 23 files changed, 658 insertions(+), 90 deletions(-)

-- 
1.8.3.2
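The index arithmetic implied by the design overview (one head entry plus HPAGE_PMD_NR-1 tail entries per huge page, and truncation dropping a huge page only when it is fully inside the range) can be sketched as follows. This is an illustration only: the TOY_* macros and toy_* helpers are hypothetical, and 512 is the x86-64 value quoted in the cover letter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_HPAGE_NR	512u	/* HPAGE_PMD_NR on x86-64, per the text */
#define TOY_HPAGE_MASK	((uint64_t)TOY_HPAGE_NR - 1)

/* Page cache index of the head entry for the huge page covering 'index'. */
static uint64_t toy_head_index(uint64_t index)
{
	return index & ~TOY_HPAGE_MASK;
}

/*
 * True if the whole huge page starting at 'head' lies inside the page
 * index range [start, end), with 'end' exclusive. Only then could it be
 * dropped in one go; otherwise part of it has to be zeroed instead.
 */
static bool toy_huge_fully_in_range(uint64_t head, uint64_t start,
				    uint64_t end)
{
	return head >= start && head + TOY_HPAGE_NR <= end;
}
```

Because huge pages are naturally aligned in the index space, finding the head from any tail index is a single mask operation, which is what makes the batched add and remove under one tree_lock straightforward.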
[PATCH 18/23] thp: libfs: introduce simple_thp_release()
From: "Kirill A. Shutemov"

simple_thp_release() is a dummy implementation of fops->release with
transparent huge page support. It's required to minimize the memory
overhead of huge pages for small files.

It checks whether we should split the last page in the file to give
memory back to the system.

We split the page if it meets the following criteria:
 - nobody has the file open for writing;
 - splitting will actually free some memory (at least one small page);
 - it's a huge page ;)

Signed-off-by: Kirill A. Shutemov
---
 fs/libfs.c         | 27 +++++++++++++++++++++++++++
 include/linux/fs.h |  2 ++
 2 files changed, 29 insertions(+)

diff --git a/fs/libfs.c b/fs/libfs.c
index 934778b..c43b055 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -488,6 +488,33 @@ int simple_thp_write_begin(struct file *file, struct address_space *mapping,
 	}
 	return 0;
 }
+
+int simple_thp_release(struct inode *inode, struct file *file)
+{
+	pgoff_t last_index;
+	struct page *page;
+
+	/* check if anybody still writes to file */
+	if (atomic_read(&inode->i_writecount) != !!(file->f_mode & FMODE_WRITE))
+		return 0;
+
+	last_index = i_size_read(inode) >> PAGE_CACHE_SHIFT;
+
+	/* check if splitting the page will free any memory */
+	if ((last_index & HPAGE_CACHE_INDEX_MASK) + 1 == HPAGE_CACHE_NR)
+		return 0;
+
+	page = find_get_page(file->f_mapping,
+			last_index & ~HPAGE_CACHE_INDEX_MASK);
+	if (!page)
+		return 0;
+
+	if (PageTransHuge(page))
+		split_huge_page(page);
+
+	page_cache_release(page);
+	return 0;
+}
 #endif
 
 /*
diff --git a/include/linux/fs.h b/include/linux/fs.h
index c1dbf43..b594f10 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2557,8 +2557,10 @@ extern int simple_write_end(struct file *file, struct address_space *mapping,
 extern int simple_thp_write_begin(struct file *file,
 		struct address_space *mapping, loff_t pos, unsigned len,
 		unsigned flags, struct page **pagep, void **fsdata);
+extern int simple_thp_release(struct inode *inode, struct file *file);
 #else
 #define simple_thp_write_begin simple_write_begin
+#define simple_thp_release NULL
 #endif
 extern struct dentry *simple_lookup(struct inode *, struct dentry *,
 		unsigned int flags);
-- 
1.8.3.2
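The three criteria from the commit message can be written down as one pure function, which makes the arithmetic easier to check by hand. A sketch under stated assumptions: 4 KiB small pages, 512 small pages per huge page, and hypothetical names (toy_should_split_last_page() and the TOY_* macros) that do not exist in the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT	12	/* assumption: 4 KiB small pages */
#define TOY_HPAGE_NR	512u	/* assumption: small pages per huge page */
#define TOY_HPAGE_MASK	((uint64_t)TOY_HPAGE_NR - 1)

/*
 * writecount:   number of writers on the inode, including the file that
 *               is being released if it was opened for writing
 * we_write:     whether the file being released was opened for writing
 * isize:        file size in bytes
 * last_is_huge: whether the page covering the end of file is a huge page
 */
static bool toy_should_split_last_page(int writecount, bool we_write,
				       uint64_t isize, bool last_is_huge)
{
	uint64_t last_index = isize >> TOY_PAGE_SHIFT;

	/* Somebody other than us still writes to the file. */
	if (writecount != (we_write ? 1 : 0))
		return false;

	/* The file ends in the last subpage: a split would free nothing. */
	if ((last_index & TOY_HPAGE_MASK) + 1 == TOY_HPAGE_NR)
		return false;

	return last_is_huge;
}
```

For example, a 4096-byte file backed by one huge page gets split on release, while a file whose size falls into the last small page of the huge page does not, since no subpage would be left unused.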
[PATCH 19/23] truncate: support huge pages
From: "Kirill A. Shutemov" truncate_inode_pages_range() drops whole huge page at once if it's fully inside the range. If a huge page is only partly in the range we zero out the part, exactly like we do for partial small pages. invalidate_mapping_pages() just skips huge pages if they are not fully in the range. Signed-off-by: Kirill A. Shutemov --- mm/truncate.c | 108 +- 1 file changed, 84 insertions(+), 24 deletions(-) diff --git a/mm/truncate.c b/mm/truncate.c index 353b683..fcef7cb 100644 --- a/mm/truncate.c +++ b/mm/truncate.c @@ -205,8 +205,7 @@ void truncate_inode_pages_range(struct address_space *mapping, { pgoff_t start; /* inclusive */ pgoff_t end;/* exclusive */ - unsigned intpartial_start; /* inclusive */ - unsigned intpartial_end;/* exclusive */ + boolpartial_thp_start = false, partial_thp_end = false; struct pagevec pvec; pgoff_t index; int i; @@ -215,15 +214,9 @@ void truncate_inode_pages_range(struct address_space *mapping, if (mapping->nrpages == 0) return; - /* Offsets within partial pages */ - partial_start = lstart & (PAGE_CACHE_SIZE - 1); - partial_end = (lend + 1) & (PAGE_CACHE_SIZE - 1); - /* * 'start' and 'end' always covers the range of pages to be fully -* truncated. Partial pages are covered with 'partial_start' at the -* start of the range and 'partial_end' at the end of the range. -* Note that 'end' is exclusive while 'lend' is inclusive. +* truncated. Note that 'end' is exclusive while 'lend' is inclusive. 
*/ start = (lstart + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT; if (lend == -1) @@ -249,6 +242,23 @@ void truncate_inode_pages_range(struct address_space *mapping, if (index >= end) break; + if (PageTransTailCache(page)) { + /* part of already handled huge page */ + if (!page->mapping) + continue; + /* the range starts in middle of huge page */ + partial_thp_start = true; + start = index & ~HPAGE_CACHE_INDEX_MASK; + continue; + } + /* the range ends on huge page */ + if (PageTransHugeCache(page) && + index == (end & ~HPAGE_CACHE_INDEX_MASK)) { + partial_thp_end = true; + end = index; + break; + } + if (!trylock_page(page)) continue; WARN_ON(page->index != index); @@ -265,34 +275,74 @@ void truncate_inode_pages_range(struct address_space *mapping, index++; } - if (partial_start) { - struct page *page = find_lock_page(mapping, start - 1); + if (partial_thp_start || lstart & ~PAGE_CACHE_MASK) { + pgoff_t off; + struct page *page; + unsigned pstart, pend; + void (*zero_segment)(struct page *page, + unsigned start, unsigned len); +retry_partial_start: + if (partial_thp_start) { + zero_segment = zero_huge_user_segment; + off = (start - 1) & ~HPAGE_CACHE_INDEX_MASK; + pstart = lstart & ~HPAGE_PMD_MASK; + if ((end & ~HPAGE_CACHE_INDEX_MASK) == off) + pend = (lend - 1) & ~HPAGE_PMD_MASK; + else + pend = HPAGE_PMD_SIZE; + } else { + zero_segment = zero_user_segment; + off = start - 1; + pstart = lstart & ~PAGE_CACHE_MASK; + if (start > end) + pend = (lend - 1) & ~PAGE_CACHE_MASK; + else + pend = PAGE_CACHE_SIZE; + } + + page = find_get_page(mapping, off); if (page) { - unsigned int top = PAGE_CACHE_SIZE; - if (start > end) { - /* Truncation within a single page */ - top = partial_end; - partial_end = 0; + /* the last tail page*/ + if (PageTransTailCache(page)) { + partial_thp_start = true; + page_cache_release(page);
[PATCH 21/23] thp: wait_split_huge_page(): serialize over i_mmap_mutex too
From: "Kirill A. Shutemov"

We're going to have huge pages backed by files, so we need to modify
wait_split_huge_page() to support that.

We have two options:
 - check whether the page is anon or not and serialize only over the
   required lock;
 - always serialize over both locks;

The current implementation, in fact, guarantees that *all* pages on the
vma are not splitting, not only the page the pmd points to.

For now I prefer the second option since it's the safest: we provide the
same level of guarantees.

Signed-off-by: Kirill A. Shutemov
---
 include/linux/huge_mm.h | 15 ++++++++++++---
 mm/huge_memory.c        |  4 ++--
 mm/memory.c             |  4 ++--
 3 files changed, 16 insertions(+), 7 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 9a0a114..186f4f2 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -111,11 +111,20 @@ extern void __split_huge_page_pmd(struct vm_area_struct *vma,
 			__split_huge_page_pmd(__vma, __address,		\
 					____pmd);			\
 	}  while (0)
-#define wait_split_huge_page(__anon_vma, __pmd)				\
+#define wait_split_huge_page(__vma, __pmd)				\
 	do {								\
 		pmd_t *____pmd = (__pmd);				\
-		anon_vma_lock_write(__anon_vma);			\
-		anon_vma_unlock_write(__anon_vma);			\
+		struct address_space *__mapping = (__vma)->vm_file ?	\
+				(__vma)->vm_file->f_mapping : NULL;	\
+		struct anon_vma *__anon_vma = (__vma)->anon_vma;	\
+		if (__mapping)						\
+			mutex_lock(&__mapping->i_mmap_mutex);		\
+		if (__anon_vma) {					\
+			anon_vma_lock_write(__anon_vma);		\
+			anon_vma_unlock_write(__anon_vma);		\
+		}							\
+		if (__mapping)						\
+			mutex_unlock(&__mapping->i_mmap_mutex);		\
 		BUG_ON(pmd_trans_splitting(*____pmd) ||			\
 		       pmd_trans_huge(*____pmd));			\
 	} while (0)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d7c6830..9af643d 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -911,7 +911,7 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		spin_unlock(&dst_mm->page_table_lock);
 		pte_free(dst_mm, pgtable);
 
-		wait_split_huge_page(vma->anon_vma, src_pmd); /* src_vma */
+		wait_split_huge_page(vma, src_pmd); /* src_vma */
 		goto out;
 	}
 	src_page = pmd_page(pmd);
@@ -1493,7 +1493,7 @@ int __pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma)
 	if (likely(pmd_trans_huge(*pmd))) {
 		if (unlikely(pmd_trans_splitting(*pmd))) {
 			spin_unlock(&vma->vm_mm->page_table_lock);
-			wait_split_huge_page(vma->anon_vma, pmd);
+			wait_split_huge_page(vma, pmd);
 			return -1;
 		} else {
 			/* Thp mapped by 'pmd' is stable, so we can
diff --git a/mm/memory.c b/mm/memory.c
index 7d35f90..ea74ab1 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -609,7 +609,7 @@ int __pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (new)
 		pte_free(mm, new);
 	if (wait_split_huge_page)
-		wait_split_huge_page(vma->anon_vma, pmd);
+		wait_split_huge_page(vma, pmd);
 	return 0;
 }
 
@@ -1522,7 +1522,7 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
 	if (likely(pmd_trans_huge(*pmd))) {
 		if (unlikely(pmd_trans_splitting(*pmd))) {
 			spin_unlock(&mm->page_table_lock);
-			wait_split_huge_page(vma->anon_vma, pmd);
+			wait_split_huge_page(vma, pmd);
 		} else {
 			page = follow_trans_huge_pmd(vma, address,
 						     pmd, flags);
-- 
1.8.3.2
[PATCH 22/23] thp, mm: split huge page on mmap file page
From: "Kirill A. Shutemov"

We are not ready to mmap file-backed transparent huge pages. Let's split
them on fault attempt.

Later we'll implement mmap() properly and this code path will be used for
fallback cases.

Signed-off-by: Kirill A. Shutemov
---
 mm/filemap.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/filemap.c b/mm/filemap.c
index ed65af5..f7857ef 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1743,6 +1743,8 @@ retry_find:
 			goto no_cached_page;
 	}
 
+	if (PageTransCompound(page))
+		split_huge_page(compound_trans_head(page));
 	if (!lock_page_or_retry(page, vma->vm_mm, vmf->flags)) {
 		page_cache_release(page);
 		return ret | VM_FAULT_RETRY;
-- 
1.8.3.2
[PATCH 16/23] thp, mm: handle transhuge pages in do_generic_file_read()
From: "Kirill A. Shutemov" If a transhuge page is already in page cache (up to date and not readahead) we go usual path: read from relevant subpage (head or tail). If page is not cached (sparse file in ramfs case) and the mapping can have hugepage we try allocate a new one and read it. If a page is not up to date or in readahead, we have to move 'page' to head page of the compound page, since it represents state of whole transhuge page. We will switch back to relevant subpage when page is ready to be read ('page_ok' label). Signed-off-by: Kirill A. Shutemov --- mm/filemap.c | 57 +++-- 1 file changed, 55 insertions(+), 2 deletions(-) diff --git a/mm/filemap.c b/mm/filemap.c index c31d296..ed65af5 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -1175,8 +1175,28 @@ find_page: ra, filp, index, last_index - index); page = find_get_page(mapping, index); - if (unlikely(page == NULL)) - goto no_cached_page; + if (unlikely(page == NULL)) { + if (mapping_can_have_hugepages(mapping)) + goto no_cached_page_thp; + else + goto no_cached_page; + } + } + if (PageTransCompound(page)) { + struct page *head = compound_trans_head(page); + + if (!PageReadahead(head) && PageUptodate(page)) + goto page_ok; + + /* +* Switch 'page' to head page. That's needed to handle +* readahead or make page uptodate. +* It will be switched back to the right tail page at +* the begining 'page_ok'. +*/ + page_cache_get(head); + page_cache_release(page); + page = head; } if (PageReadahead(page)) { page_cache_async_readahead(mapping, @@ -1198,6 +1218,18 @@ find_page: unlock_page(page); } page_ok: + /* Switch back to relevant tail page, if needed */ + if (PageTransCompoundCache(page) && !PageTransTail(page)) { + int off = index & HPAGE_CACHE_INDEX_MASK; + if (off){ + page_cache_get(page + off); + page_cache_release(page); + page += off; + } + } + + VM_BUG_ON(page->index != index); + /* * i_size must be checked after we know the page is Uptodate. 
* @@ -1329,6 +1361,27 @@ readpage_error: page_cache_release(page); goto out; +no_cached_page_thp: + page = alloc_pages(mapping_gfp_mask(mapping) | __GFP_COLD, + HPAGE_PMD_ORDER); + if (!page) { + count_vm_event(THP_READ_ALLOC_FAILED); + goto no_cached_page; + } + count_vm_event(THP_READ_ALLOC); + + error = add_to_page_cache_lru(page, mapping, + index & ~HPAGE_CACHE_INDEX_MASK, GFP_KERNEL); + if (!error) + goto readpage; + + page_cache_release(page); + if (error != -EEXIST && error != -ENOSPC) { + desc->error = error; + goto out; + } + + /* Fallback to small page */ no_cached_page: /* * Ok, it wasn't cached, so we need to create a new -- 1.8.3.2 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
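The error handling after the huge page insertion attempt in the no_cached_page_thp path boils down to a three-way decision. A small sketch of just that decision, assuming the usual negative errno convention; the enum and toy_after_huge_insert() are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <errno.h>

enum toy_read_action {
	TOY_READ_HUGE,		/* huge page is in the cache: go read it */
	TOY_FALLBACK_SMALL,	/* retry the lookup with a small page */
	TOY_FAIL,		/* report the error to the caller */
};

/*
 * 'error' is the result of trying to add a freshly allocated huge page
 * to the page cache. -EEXIST (somebody else populated the range) and
 * -ENOSPC (no room for a whole huge page) are not fatal: a small page
 * may still fit. Anything else is passed on.
 */
static enum toy_read_action toy_after_huge_insert(int error)
{
	if (error == 0)
		return TOY_READ_HUGE;
	if (error == -EEXIST || error == -ENOSPC)
		return TOY_FALLBACK_SMALL;
	return TOY_FAIL;
}
```

Treating these two errors as soft failures is what keeps the huge page path purely opportunistic: the read never fails only because a huge page could not be used.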
[PATCH 17/23] thp, libfs: initial thp support
From: "Kirill A. Shutemov" simple_readpage() and simple_write_end() are modified to handle huge pages. simple_thp_write_begin() is introduced to allocate huge pages on write. Signed-off-by: Kirill A. Shutemov --- fs/libfs.c | 53 + include/linux/fs.h | 7 +++ include/linux/pagemap.h | 8 3 files changed, 64 insertions(+), 4 deletions(-) diff --git a/fs/libfs.c b/fs/libfs.c index 3a3a9b5..934778b 100644 --- a/fs/libfs.c +++ b/fs/libfs.c @@ -364,7 +364,7 @@ EXPORT_SYMBOL(simple_setattr); int simple_readpage(struct file *file, struct page *page) { - clear_highpage(page); + clear_pagecache_page(page); flush_dcache_page(page); SetPageUptodate(page); unlock_page(page); @@ -424,9 +424,14 @@ int simple_write_end(struct file *file, struct address_space *mapping, /* zero the stale part of the page if we did a short copy */ if (copied < len) { - unsigned from = pos & (PAGE_CACHE_SIZE - 1); - - zero_user(page, from + copied, len - copied); + unsigned from; + if (PageTransHuge(page)) { + from = pos & ~HPAGE_PMD_MASK; + zero_huge_user(page, from + copied, len - copied); + } else { + from = pos & ~PAGE_CACHE_MASK; + zero_user(page, from + copied, len - copied); + } } if (!PageUptodate(page)) @@ -445,6 +450,46 @@ int simple_write_end(struct file *file, struct address_space *mapping, return copied; } +#ifdef CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE +int simple_thp_write_begin(struct file *file, struct address_space *mapping, + loff_t pos, unsigned len, unsigned flags, + struct page **pagep, void **fsdata) +{ + struct page *page = NULL; + pgoff_t index; + + index = pos >> PAGE_CACHE_SHIFT; + + if (mapping_can_have_hugepages(mapping)) { + page = grab_cache_page_write_begin(mapping, + index & ~HPAGE_CACHE_INDEX_MASK, + flags | AOP_FLAG_TRANSHUGE); + /* fallback to small page */ + if (!page) { + unsigned long offset; + offset = pos & ~PAGE_CACHE_MASK; + /* adjust the len to not cross small page boundary */ + len = min_t(unsigned long, + len, PAGE_CACHE_SIZE - offset); + } + BUG_ON(page && 
!PageTransHuge(page)); + } + if (!page) + return simple_write_begin(file, mapping, pos, len, flags, + pagep, fsdata); + + *pagep = page; + + if (!PageUptodate(page) && len != HPAGE_PMD_SIZE) { + unsigned from = pos & ~HPAGE_PMD_MASK; + + zero_huge_user_segment(page, 0, from); + zero_huge_user_segment(page, from + len, HPAGE_PMD_SIZE); + } + return 0; +} +#endif + /* * the inodes created here are not hashed. If you use iunique to generate * unique inode values later for this filesystem, then you must take care diff --git a/include/linux/fs.h b/include/linux/fs.h index d5f58b3..c1dbf43 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -2553,6 +2553,13 @@ extern int simple_write_begin(struct file *file, struct address_space *mapping, extern int simple_write_end(struct file *file, struct address_space *mapping, loff_t pos, unsigned len, unsigned copied, struct page *page, void *fsdata); +#ifdef CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE +extern int simple_thp_write_begin(struct file *file, + struct address_space *mapping, loff_t pos, unsigned len, + unsigned flags, struct page **pagep, void **fsdata); +#else +#define simple_thp_write_begin simple_write_begin +#endif extern struct dentry *simple_lookup(struct inode *, struct dentry *, unsigned int flags); extern ssize_t generic_read_dir(struct file *, char __user *, size_t, loff_t *); diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h index d459b38..eb484f2 100644 --- a/include/linux/pagemap.h +++ b/include/linux/pagemap.h @@ -591,4 +591,12 @@ static inline int add_to_page_cache(struct page *page, return error; } +static inline void clear_pagecache_page(struct page *page) +{ + if (PageTransHuge(page)) + zero_huge_user(page, 0, HPAGE_PMD_SIZE); + else + clear_highpage(page); +} + #endif /* _LINUX_PAGEMAP_H */ -- 1.8.3.2 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at 
http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 23/23] ramfs: enable transparent huge page cache
From: "Kirill A. Shutemov"

ramfs is the simplest fs from the page cache point of view. Let's start
enabling the transparent huge page cache here.

ramfs pages are not movable[1] and switching to transhuge pages doesn't
affect that. We need to fix this eventually.

[1] http://lkml.org/lkml/2013/4/2/720

Signed-off-by: Kirill A. Shutemov
---
 fs/ramfs/file-mmu.c | 3 ++-
 fs/ramfs/inode.c    | 6 +++++-
 2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/fs/ramfs/file-mmu.c b/fs/ramfs/file-mmu.c
index 4884ac5..3236e41 100644
--- a/fs/ramfs/file-mmu.c
+++ b/fs/ramfs/file-mmu.c
@@ -32,7 +32,7 @@
 
 const struct address_space_operations ramfs_aops = {
 	.readpage	= simple_readpage,
-	.write_begin	= simple_write_begin,
+	.write_begin	= simple_thp_write_begin,
 	.write_end	= simple_write_end,
 	.set_page_dirty = __set_page_dirty_no_writeback,
 };
@@ -47,6 +47,7 @@ const struct file_operations ramfs_file_operations = {
 	.splice_read	= generic_file_splice_read,
 	.splice_write	= generic_file_splice_write,
 	.llseek		= generic_file_llseek,
+	.release	= simple_thp_release,
 };
 
 const struct inode_operations ramfs_file_inode_operations = {
diff --git a/fs/ramfs/inode.c b/fs/ramfs/inode.c
index 39d1465..5dafdfc 100644
--- a/fs/ramfs/inode.c
+++ b/fs/ramfs/inode.c
@@ -61,7 +61,11 @@ struct inode *ramfs_get_inode(struct super_block *sb,
 		inode_init_owner(inode, dir, mode);
 		inode->i_mapping->a_ops = &ramfs_aops;
 		inode->i_mapping->backing_dev_info = &ramfs_backing_dev_info;
-		mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
+		/*
+		 * TODO: make ramfs pages movable
+		 */
+		mapping_set_gfp_mask(inode->i_mapping,
+				GFP_TRANSHUGE & ~__GFP_MOVABLE);
 		mapping_set_unevictable(inode->i_mapping);
 		inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
 		switch (mode & S_IFMT) {
-- 
1.8.3.2
[PATCH 04/23] thp, mm: introduce mapping_can_have_hugepages() predicate
From: "Kirill A. Shutemov"

Returns true if the mapping can have huge pages. Just check for __GFP_COMP
in the gfp mask of the mapping for now.

Signed-off-by: Kirill A. Shutemov
---
 include/linux/pagemap.h | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index e8ca8cf..47b5082 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -84,6 +84,20 @@ static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask)
 				(__force unsigned long)mask;
 }
 
+static inline bool mapping_can_have_hugepages(struct address_space *m)
+{
+	gfp_t gfp_mask = mapping_gfp_mask(m);
+
+	if (!transparent_hugepage_pagecache())
+		return false;
+
+	/*
+	 * It's up to filesystem what gfp mask to use.
+	 * The only part of GFP_TRANSHUGE which matters for us is __GFP_COMP.
+	 */
+	return !!(gfp_mask & __GFP_COMP);
+}
+
 /*
  * The page cache can done in larger chunks than
  * one page, because it allows for more efficient
-- 
1.8.3.2
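To see why a filesystem can clear other bits of the mask and still qualify, here is a toy version of the predicate. Everything in it is an assumption made for illustration: the TOY_GFP_* bit values are invented, TOY_GFP_TRANSHUGE is a simplified composition rather than the real GFP_TRANSHUGE definition, and the feature knob is passed in as a plain bool.

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned int toy_gfp_t;

/* Hypothetical bit values, chosen only for this sketch. */
#define TOY_GFP_HIGHUSER	0x0002u
#define TOY_GFP_MOVABLE		0x0008u
#define TOY_GFP_COMP		0x4000u
/* Simplified stand-in: the real mask contains more flags than this. */
#define TOY_GFP_TRANSHUGE \
	(TOY_GFP_HIGHUSER | TOY_GFP_MOVABLE | TOY_GFP_COMP)

/*
 * Mirrors the shape of the predicate above: the feature must be enabled
 * and the mapping's gfp mask must ask for compound pages.
 */
static bool toy_mapping_can_have_hugepages(bool feature_enabled,
					   toy_gfp_t gfp_mask)
{
	if (!feature_enabled)
		return false;
	return (gfp_mask & TOY_GFP_COMP) != 0;
}
```

A mask built as TOY_GFP_TRANSHUGE & ~TOY_GFP_MOVABLE still passes the check, because only the compound bit is inspected; a plain TOY_GFP_HIGHUSER mask does not.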
[PATCH 02/23] memcg, thp: charge huge cache pages
From: "Kirill A. Shutemov"

mem_cgroup_cache_charge() has check for PageCompound(). The check
prevents charging huge cache pages.

I don't see a reason why the check is present. Looks like it's just
legacy (introduced in 52d4b9a memcg: allocate all page_cgroup at boot).

Let's just drop it.

Signed-off-by: Kirill A. Shutemov
Cc: Michal Hocko
Cc: KAMEZAWA Hiroyuki
Acked-by: Dave Hansen
---
 mm/memcontrol.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b6cd870..dc50c1a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3921,8 +3921,6 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
 
 	if (mem_cgroup_disabled())
 		return 0;
-	if (PageCompound(page))
-		return 0;
 
 	if (!PageSwapCache(page))
 		ret = mem_cgroup_charge_common(page, mm, gfp_mask, type);
-- 
1.8.3.2
[PATCH 14/23] thp, mm: naive support of thp in generic_perform_write
From: "Kirill A. Shutemov"

For now we still write/read at most PAGE_CACHE_SIZE bytes a time.

This implementation doesn't cover address spaces with backing storage.

Signed-off-by: Kirill A. Shutemov
---
 mm/filemap.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index b17ebb9..066bbff 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2382,6 +2382,7 @@ static ssize_t generic_perform_write(struct file *file,
 		unsigned long bytes;	/* Bytes to write to page */
 		size_t copied;		/* Bytes copied from user */
 		void *fsdata;
+		int subpage_nr = 0;
 
 		offset = (pos & (PAGE_CACHE_SIZE - 1));
 		bytes = min_t(unsigned long, PAGE_CACHE_SIZE - offset,
@@ -2411,8 +2412,14 @@ again:
 		if (mapping_writably_mapped(mapping))
 			flush_dcache_page(page);
 
+		if (PageTransHuge(page)) {
+			off_t huge_offset = pos & ~HPAGE_PMD_MASK;
+			subpage_nr = huge_offset >> PAGE_CACHE_SHIFT;
+		}
+
 		pagefault_disable();
-		copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes);
+		copied = iov_iter_copy_from_user_atomic(page + subpage_nr, i,
+				offset, bytes);
 		pagefault_enable();
 		flush_dcache_page(page);
 
-- 
1.8.3.2
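The subpage arithmetic in the patch can be tried out in user space. The sketch below assumes 4 KiB base pages and 2 MiB huge pages (the x86-64 values); the SKETCH_* names and the function are hypothetical stand-ins for PAGE_CACHE_SHIFT and HPAGE_PMD_MASK, not kernel definitions.

```c
#include <stdint.h>

#define SKETCH_PAGE_SHIFT  12	/* assumed: 4 KiB base pages */
#define SKETCH_HPAGE_SHIFT 21	/* assumed: 2 MiB huge pages */
#define SKETCH_HPAGE_SIZE  (UINT64_C(1) << SKETCH_HPAGE_SHIFT)
#define SKETCH_HPAGE_MASK  (~(SKETCH_HPAGE_SIZE - 1))

/* Index of the small page, inside its huge page, that covers file offset pos. */
unsigned int sketch_subpage_nr(uint64_t pos)
{
	/* Keep only the bits below the huge page boundary. */
	uint64_t huge_offset = pos & ~SKETCH_HPAGE_MASK;

	return (unsigned int)(huge_offset >> SKETCH_PAGE_SHIFT);
}
```

Under these assumptions a file offset of 0x201234 lands in subpage 1 of the second huge page, and the last byte of any huge page lands in subpage 511.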
[PATCH 09/23] thp, mm: rewrite delete_from_page_cache() to support huge pages
From: "Kirill A. Shutemov"

As with add_to_page_cache_locked() we handle HPAGE_CACHE_NR pages a
time.

Signed-off-by: Kirill A. Shutemov
---
 mm/filemap.c | 21 +++++++++++++++------
 1 file changed, 15 insertions(+), 6 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 619e6cb..b75bdf5 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -115,6 +115,7 @@
 void __delete_from_page_cache(struct page *page)
 {
 	struct address_space *mapping = page->mapping;
+	int i, nr;
 
 	trace_mm_filemap_delete_from_page_cache(page);
 	/*
@@ -127,13 +128,21 @@ void __delete_from_page_cache(struct page *page)
 	else
 		cleancache_invalidate_page(mapping, page);
 
-	radix_tree_delete(&mapping->page_tree, page->index);
+	nr = hpagecache_nr_pages(page);
+	for (i = 0; i < nr; i++) {
+		page[i].mapping = NULL;
+		radix_tree_delete(&mapping->page_tree, page->index + i);
+	}
+	/* thp */
+	if (nr > 1)
+		__dec_zone_page_state(page, NR_FILE_TRANSPARENT_HUGEPAGES);
+
 	page->mapping = NULL;
 	/* Leave page->index set: truncation lookup relies upon it */
-	mapping->nrpages--;
-	__dec_zone_page_state(page, NR_FILE_PAGES);
+	mapping->nrpages -= nr;
+	__mod_zone_page_state(page_zone(page), NR_FILE_PAGES, -nr);
 	if (PageSwapBacked(page))
-		__dec_zone_page_state(page, NR_SHMEM);
+		__mod_zone_page_state(page_zone(page), NR_SHMEM, -nr);
 	BUG_ON(page_mapped(page));
 
 	/*
@@ -144,8 +153,8 @@ void __delete_from_page_cache(struct page *page)
 	 * having removed the page entirely.
 	 */
 	if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
-		dec_zone_page_state(page, NR_FILE_DIRTY);
-		dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
+		mod_zone_page_state(page_zone(page), NR_FILE_DIRTY, -nr);
+		add_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE, -nr);
 	}
 }
 
-- 
1.8.3.2
[PATCH 01/23] radix-tree: implement preload for multiple contiguous elements
From: "Kirill A. Shutemov" The radix tree is variable-height, so an insert operation not only has to build the branch to its corresponding item, it also has to build the branch to existing items if the size has to be increased (by radix_tree_extend). The worst case is a zero height tree with just a single item at index 0, and then inserting an item at index ULONG_MAX. This requires 2 new branches of RADIX_TREE_MAX_PATH size to be created, with only the root node shared. Radix tree is usually protected by spin lock. It means we want to pre-allocate required memory before taking the lock. Currently radix_tree_preload() only guarantees enough nodes to insert one element. It's a hard limit. For transparent huge page cache we want to insert HPAGE_PMD_NR (512 on x86-64) entries to address_space at once. This patch introduces radix_tree_preload_count(). It allows to preallocate nodes enough to insert a number of *contiguous* elements. The feature costs about 5KiB per-CPU, details below. Worst case for adding N contiguous items is adding entries at indexes (ULONG_MAX - N) to ULONG_MAX. It requires nodes to insert single worst-case item plus extra nodes if you cross the boundary from one node to the next. Preload uses per-CPU array to store nodes. The total cost of preload is "array size" * sizeof(void*) * NR_CPUS. We want to increase array size to be able to handle 512 entries at once. Size of array depends on system bitness and on RADIX_TREE_MAP_SHIFT. We have three possible RADIX_TREE_MAP_SHIFT: #ifdef __KERNEL__ #define RADIX_TREE_MAP_SHIFT (CONFIG_BASE_SMALL ? 4 : 6) #else #define RADIX_TREE_MAP_SHIFT 3 /* For more stressful testing */ #endif On 64-bit system: For RADIX_TREE_MAP_SHIFT=3, old array size is 43, new is 107. For RADIX_TREE_MAP_SHIFT=4, old array size is 31, new is 63. For RADIX_TREE_MAP_SHIFT=6, old array size is 21, new is 30. On 32-bit system: For RADIX_TREE_MAP_SHIFT=3, old array size is 21, new is 84. 
For RADIX_TREE_MAP_SHIFT=4, old array size is 15, new is 46. For RADIX_TREE_MAP_SHIFT=6, old array size is 11, new is 19. On most machines we will have RADIX_TREE_MAP_SHIFT=6. In this case, on 64-bit system the per-CPU feature overhead is for preload array: (30 - 21) * sizeof(void*) = 72 bytes plus, if the preload array is full (30 - 21) * sizeof(struct radix_tree_node) = 9 * 560 = 5040 bytes total: 5112 bytes on 32-bit system the per-CPU feature overhead is for preload array: (19 - 11) * sizeof(void*) = 32 bytes plus, if the preload array is full (19 - 11) * sizeof(struct radix_tree_node) = 8 * 296 = 2368 bytes total: 2400 bytes Since only THP uses batched preload at the moment, we disable (set max preload to 1) it if !CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE. This can be changed in the future. Signed-off-by: Matthew Wilcox Signed-off-by: Kirill A. Shutemov Acked-by: Dave Hansen --- include/linux/radix-tree.h | 11 +++ lib/radix-tree.c | 41 - 2 files changed, 43 insertions(+), 9 deletions(-) diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h index 4039407..3bf0b3e 100644 --- a/include/linux/radix-tree.h +++ b/include/linux/radix-tree.h @@ -83,6 +83,16 @@ do { \ (root)->rnode = NULL; \ } while (0) +#ifdef CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE +/* + * At the moment only THP uses preload for more then on item for batched + * pagecache manipulations. 
+ */ +#define RADIX_TREE_PRELOAD_NR 512 +#else +#define RADIX_TREE_PRELOAD_NR 1 +#endif + /** * Radix-tree synchronization * @@ -232,6 +242,7 @@ unsigned long radix_tree_prev_hole(struct radix_tree_root *root, unsigned long index, unsigned long max_scan); int radix_tree_preload(gfp_t gfp_mask); int radix_tree_maybe_preload(gfp_t gfp_mask); +int radix_tree_maybe_preload_contig(unsigned size, gfp_t gfp_mask); void radix_tree_init(void); void *radix_tree_tag_set(struct radix_tree_root *root, unsigned long index, unsigned int tag); diff --git a/lib/radix-tree.c b/lib/radix-tree.c index 7811ed3..99ab73c 100644 --- a/lib/radix-tree.c +++ b/lib/radix-tree.c @@ -82,16 +82,24 @@ static struct kmem_cache *radix_tree_node_cachep; * The worst case is a zero height tree with just a single item at index 0, * and then inserting an item at index ULONG_MAX. This requires 2 new branches * of RADIX_TREE_MAX_PATH size to be created, with only the root node shared. + * + * Worst case for adding N contiguous items is adding entries at indexes + * (ULONG_MAX - N) to ULONG_MAX. It requires nodes to insert single worst-case + * item plus extra nodes if you cross the boundary from one node to the next. + * * Hence: */ -#define RADIX_TREE_PRELOAD_SIZE (RADIX_TREE_MAX_PATH * 2 - 1) +#define RADIX_TREE_PRELOAD_MIN (RADIX_TREE_MAX_PATH * 2 - 1) +#define RADIX_TREE
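The per-CPU overhead figures quoted in the changelog above can be re-derived with a few lines of C. This is only a sketch of the arithmetic: the function name is made up, and the array sizes, pointer sizes and node sizes are simply the numbers stated in the changelog, passed in as parameters.

```c
#include <stddef.h>

/*
 * Extra per-CPU bytes when the preload array grows from old_slots to
 * new_slots entries: the additional pointer slots themselves, plus the
 * nodes they point to if the array is completely full.
 */
size_t sketch_preload_overhead(size_t old_slots, size_t new_slots,
			       size_t ptr_size, size_t node_size)
{
	size_t extra = new_slots - old_slots;

	return extra * ptr_size + extra * node_size;
}
```

Calling it with (21, 30, 8, 560) reproduces the 64-bit total of 5112 bytes, and (11, 19, 4, 296) reproduces the 32-bit total of 2400 bytes.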
Re: [PATCH 1/2] ext4: fix handling of nodelalloc parameter
On Fri, Aug 02, 2013 at 02:03:46PM +0200, Piotr Sarna wrote:
> Commit 26092bf ("ext4: use a table-driven handler for mount options")
> introduced buggy handling of nodelalloc parameter in mount command.
>
> After explicitly using delalloc or nodelalloc parameter in mount command,
> MOPT_EXPLICIT flag is set. After that, a test ensures that "data=journal"
> and "delalloc" parameters are not simultaneously activated.
> Unluckily, the mentioned test reports a bug in both situations:
> - "data=journal,delalloc"
> - "data=journal,nodelalloc"
> whereas the second one is perfectly legal and acceptable.
>
> A simple solution to this problem is in setting EXPLICIT_DELALLOC flag
> properly. This patch ensures that EXPLICIT_DELALLOC flag is set only
> if "delalloc" parameter was used, and not set in case of "nodelalloc".

Thanks for this bug report and patch. There is an even simpler way of
fixing this that doesn't involve adding an additional check in the code,
though. Just make the following change to the table entry:

 	{Opt_nodelalloc, EXT4_MOUNT_DELALLOC,
-	 MOPT_EXT4_ONLY | MOPT_CLEAR | MOPT_EXPLICIT},
+	 MOPT_EXT4_ONLY | MOPT_CLEAR},

I'll send out a patch which does this...

						- Ted
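To make the table-driven idea concrete, here is a small user-space sketch of how dropping the "explicit" flag from the nodelalloc entry changes the outcome. All names and flag values (SK_MOPT_*, sk_opts, sk_apply, sk_journal_conflict) are hypothetical and only loosely modelled on the description in this thread; this is not the ext4 code.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical flag values for the sketch. */
#define SK_MOPT_SET      0x1
#define SK_MOPT_CLEAR    0x2
#define SK_MOPT_EXPLICIT 0x4

struct sk_mount_opt {
	const char *name;
	int flags;
};

/* With the suggested change, only "delalloc" carries the EXPLICIT flag. */
static const struct sk_mount_opt sk_opts[] = {
	{ "delalloc",   SK_MOPT_SET | SK_MOPT_EXPLICIT },
	{ "nodelalloc", SK_MOPT_CLEAR },
};

struct sk_state {
	bool delalloc;          /* is delayed allocation enabled? */
	bool explicit_delalloc; /* did the user ask for it by name? */
};

/* Apply one option token; returns false for an unknown option. */
bool sk_apply(struct sk_state *st, const char *token)
{
	for (size_t i = 0; i < sizeof(sk_opts) / sizeof(sk_opts[0]); i++) {
		if (strcmp(sk_opts[i].name, token) != 0)
			continue;
		if (sk_opts[i].flags & SK_MOPT_EXPLICIT)
			st->explicit_delalloc = true;
		if (sk_opts[i].flags & SK_MOPT_SET)
			st->delalloc = true;
		if (sk_opts[i].flags & SK_MOPT_CLEAR)
			st->delalloc = false;
		return true;
	}
	return false;
}

/* In this sketch, data=journal conflicts only with an explicit delalloc. */
bool sk_journal_conflict(const struct sk_state *st)
{
	return st->explicit_delalloc;
}
```

In this model "nodelalloc" clears the feature without marking it as explicitly requested, so a later data=journal check has nothing to complain about, while "delalloc" still trips it.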
Re: [PATCH] acpi: video: improve quirk check
On Sun, Aug 4, 2013 at 6:20 AM, Felipe Contreras wrote: > On Sat, Aug 3, 2013 at 4:40 PM, Rafael J. Wysocki wrote: >> On Saturday, August 03, 2013 03:24:16 PM Felipe Contreras wrote: >>> On Sat, Aug 3, 2013 at 6:34 AM, Rafael J. Wysocki wrote: >>> > On Saturday, August 03, 2013 04:14:04 PM Aaron Lu wrote: >>> >>> >> Yes, the patch is correct, but I still prefer my own version :-) >>> >> https://github.com/aaronlu/linux/commit/0a3d2c5b59caf80ae5bb1ca1fda0f7bf448b38c9 >>> >> >>> >> In case you want to take mine and mine needs refresh, please let me know >>> >> and I can do the re-base, thanks. >>> > >>> > Well, I prefer simpler, unless there's a good reason to use more >>> > complicated. >>> >>> Note that these are not exclusionary; his patch can be applied on top >>> of mine. I don't think his patch is needed though. >> >> OK >> >> Do we still need to revert commit efaa14c if this patch is applied? > > I guess not. At least in this machine changing the backlight works > correctly, whereas in v3.11-rc3 it was all weird, and v3.7-v3.10 > didn't work at all. I cannot see how it would affect negatively other > machines. That commit makes hotkey emit notifications, and it's not the problem of "booting into a black screen", that problem is due to broken _BQC. BTW, the efaa14c will also make screen off at level 0 according to Felipe, who consider this is a bug. But since it is required to let firmware emit notifications on hotkey press, I think user will want it. -Aaron > > That being said, the blacklisting is still needed, because 1. the > level is not preserved between boots, and 2. level 0 turns off the > screen, which I personally consider a regression. > > At least it boots to level 100 instead of 0. > > -- > Felipe Contreras -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/1] EXT4: LazyInit Mount Bug Fix
On Mon, Jul 15, 2013 at 12:14:18PM +0530, Nitin Singla wrote:
> -	sbi->s_itb_per_group = sbi->s_inodes_per_group /
> -					sbi->s_inodes_per_block;
> +	sbi->s_itb_per_group = DIV_ROUND_UP(sbi->s_inodes_per_group,
> +					sbi->s_inodes_per_block);

This would only matter if s_inodes_per_group is not a multiple of
s_inodes_per_block, which is never supposed to happen; mke2fs doesn't
create file systems like this. Ancient Android build systems did do this
before the bug was fixed, but that was a long time ago.

Where did this file system come from? I could apply this patch, but you
should be warned that there may be other bugs hiding here for file
systems like this, both in the kernel and in e2fsprogs.

						- Ted
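A quick user-space sketch shows when the two computations differ. The macro below has the same shape as the kernel's DIV_ROUND_UP but is redefined here under a hypothetical name, and the inode counts used in the usage note are made-up example values.

```c
/* Same shape as the kernel's DIV_ROUND_UP; redefined for user space. */
#define SKETCH_DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/* Inode table blocks per group with plain (truncating) division. */
unsigned long sketch_itb_truncating(unsigned long inodes_per_group,
				    unsigned long inodes_per_block)
{
	return inodes_per_group / inodes_per_block;
}

/* Inode table blocks per group when a partial last block is counted. */
unsigned long sketch_itb_rounded(unsigned long inodes_per_group,
				 unsigned long inodes_per_block)
{
	return SKETCH_DIV_ROUND_UP(inodes_per_group, inodes_per_block);
}
```

For 8192 inodes per group and 16 per block both give 512. For 8190 and 16 the truncating version gives 511 and the rounded one 512, which is the only kind of file system where the patch changes anything.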
[PATCH] regulator: 88pm800: Fix checking whether num_regulator is valid
The code to check whether num_regulator is valid is wrong because it
should iterate all array entries rather than break from the for loop if
pdata->regulators[i] is NULL.

Signed-off-by: Axel Lin
---
 drivers/regulator/88pm800.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/drivers/regulator/88pm800.c b/drivers/regulator/88pm800.c
index c72fe95..58e9b74 100644
--- a/drivers/regulator/88pm800.c
+++ b/drivers/regulator/88pm800.c
@@ -299,10 +299,13 @@ static int pm800_regulator_probe(struct platform_device *pdev)
 			return -ENODEV;
 		}
 	} else if (pdata->num_regulators) {
-		/* Check whether num_regulator is valid. */
 		unsigned int count = 0;
-		for (i = 0; pdata->regulators[i]; i++)
-			count++;
+
+		/* Check whether num_regulator is valid. */
+		for (i = 0; i < ARRAY_SIZE(pdata->regulators); i++) {
+			if (pdata->regulators[i])
+				count++;
+		}
 		if (count != pdata->num_regulators)
 			return -EINVAL;
 	} else {
-- 
1.8.1.2
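The intent described in the changelog, counting every populated slot rather than stopping at the first NULL, can be sketched in user space as follows. The struct, the array size of 4 and all SKETCH_/sketch_ names are assumptions for the example, not the driver's definitions. Note that the loop condition compares the index against the array size.

```c
#include <stddef.h>

#define SKETCH_ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))
#define SKETCH_MAX_REGULATORS 4	/* hypothetical array length */

struct sketch_pdata {
	const void *regulators[SKETCH_MAX_REGULATORS];
	unsigned int num_regulators;
};

/* Count every populated slot, including those after a NULL gap. */
unsigned int sketch_count_regulators(const struct sketch_pdata *pdata)
{
	unsigned int count = 0;

	for (size_t i = 0; i < SKETCH_ARRAY_SIZE(pdata->regulators); i++) {
		if (pdata->regulators[i])
			count++;
	}
	return count;
}
```

With slots 0 and 2 populated and num_regulators set to 2, this count matches, whereas a loop that stops at the first NULL entry would only see one regulator.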
Re: [PATCH] acpi: video: improve quirk check
On 08/03/2013 07:34 PM, Rafael J. Wysocki wrote: > On Saturday, August 03, 2013 04:14:04 PM Aaron Lu wrote: >> On 08/03/2013 07:47 AM, Rafael J. Wysocki wrote: >>> On Friday, August 02, 2013 02:37:09 PM Felipe Contreras wrote: If the _BCL package is descending, the first level (br->levels[2]) will be 0, and if the number of levels matches the number of steps, we might confuse a returned level to mean the index. For example: current_level = max_level = 100 test_level = 0 returned level = 100 In this case 100 means the level, not the index, and _BCM failed. But if the _BCL package is descending, the index of level 0 is also 100, so we assume _BQC is indexed, when it's not. This causes all _BQC calls to return bogus values causing weird behavior from the user's perspective. For example: xbacklight -set 10; xbacklight -set 20; would flash to 90% and then slowly down to the desired level (20). The solution is simple; test anything other than the first level (e.g. 1). Signed-off-by: Felipe Contreras >>> >>> Looks reasonable. >>> >>> Aaron, what do you think? >> >> Yes, the patch is correct, but I still prefer my own version :-) >> https://github.com/aaronlu/linux/commit/0a3d2c5b59caf80ae5bb1ca1fda0f7bf448b38c9 >> >> In case you want to take mine and mine needs refresh, please let me know >> and I can do the re-base, thanks. > > Well, I prefer simpler, unless there's a good reason to use more complicated. > > Why exactly do you think your version is better? As explained here: https://lkml.org/lkml/2013/8/2/81 https://lkml.org/lkml/2013/8/2/112 And for the demo broken _BQC, mine patch will disable _BQC while still make the backlight work, and this patch here is testing the max brightness level and may fail. -Aaron -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
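The ambiguity Felipe describes in the quoted changelog can be reproduced with a toy model. The sketch below is a deliberate simplification and not the kernel's quirk code: it works on the raw descending package (100 down to 0), ignores the two leading _BCL entries, and the function names are hypothetical. It models only the guess "the returned value is an index if levels[returned] equals the level we asked for".

```c
#include <stdbool.h>

/* Fill levels[] with a descending package: count-1, count-2, ..., 0. */
void sketch_fill_descending(int *levels, int count)
{
	for (int i = 0; i < count; i++)
		levels[i] = count - 1 - i;
}

/*
 * The firmware was asked to set test_level and _BQC then reported
 * `returned`. If they differ, guess that `returned` is an index when
 * levels[returned] happens to equal test_level.
 */
bool sketch_guesses_indexed(const int *levels, int count,
			    int test_level, int returned)
{
	if (returned == test_level)
		return false;	/* _BQC looks sane, nothing to guess */
	if (returned < 0 || returned >= count)
		return false;
	return levels[returned] == test_level;
}
```

With 101 descending levels, a failed _BCM that leaves the level at 100 is misread as an index when the test level is 0 (because levels[100] is 0), but not when the test level is 1, which is the case the patch switches to.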
Re: [PATCH 1/2] ARM: remove dmacap,memset from Device tree binding
DT Maintainers, It's been a week with no comment. Shall I assume it's ok to apply this? thx, Jason. On Thu, Jul 25, 2013 at 11:31:04AM -0400, Jason Cooper wrote: > On Tue, Jul 02, 2013 at 12:54:12PM +0200, Sebastian Hesselbarth wrote: > > DMA_MEMSET support has been removed, so update the device tree files > > and corresponding binding documentation for Marvell SoCs. > > > > Signed-off-by: Sebastian Hesselbarth > > --- > > Cc: Russell King > > Cc: Jason Cooper > > Cc: Andrew Lunn > > Cc: Thomas Petazzoni > > Cc: Gregory CLEMENT > > Cc: devicetree-disc...@lists.ozlabs.org > > Cc: linux-kernel@vger.kernel.org > > Cc: linux-arm-ker...@lists.infradead.org > > --- > > Documentation/devicetree/bindings/dma/mv-xor.txt |2 -- > > arch/arm/boot/dts/armada-370.dtsi|2 -- > > arch/arm/boot/dts/armada-xp.dtsi |2 -- > > arch/arm/boot/dts/dove.dtsi |2 -- > > arch/arm/boot/dts/kirkwood.dtsi |2 -- > > arch/arm/boot/dts/orion5x.dtsi |1 - > > 6 files changed, 0 insertions(+), 11 deletions(-) > > Adding the new devicetree ml to the Cc: > > I'm fine with the changes to the dts{i} files, but I think the binding > document should be handled differently. > > thx, > > Jason. 
> > > > > diff --git a/Documentation/devicetree/bindings/dma/mv-xor.txt > > b/Documentation/devicetree/bindings/dma/mv-xor.txt > > index 7c6cb7f..68f7004 100644 > > --- a/Documentation/devicetree/bindings/dma/mv-xor.txt > > +++ b/Documentation/devicetree/bindings/dma/mv-xor.txt > > @@ -14,7 +14,6 @@ properties: > > > > And the following optional properties: > > - dmacap,memcpy to indicate that the XOR channel is capable of memcpy > > operations > > -- dmacap,memset to indicate that the XOR channel is capable of memset > > operations > > - dmacap,xor to indicate that the XOR channel is capable of xor operations > > > > Example: > > @@ -35,6 +34,5 @@ xor@d0060900 { > > interrupts = <52>; > > dmacap,memcpy; > > dmacap,xor; > > - dmacap,memset; > > }; > > }; > > diff --git a/arch/arm/boot/dts/armada-370.dtsi > > b/arch/arm/boot/dts/armada-370.dtsi > > index fa3dfc6..a315ad1 100644 > > --- a/arch/arm/boot/dts/armada-370.dtsi > > +++ b/arch/arm/boot/dts/armada-370.dtsi > > @@ -132,7 +132,6 @@ > > interrupts = <52>; > > dmacap,memcpy; > > dmacap,xor; > > - dmacap,memset; > > }; > > }; > > > > @@ -151,7 +150,6 @@ > > interrupts = <95>; > > dmacap,memcpy; > > dmacap,xor; > > - dmacap,memset; > > }; > > }; > > > > diff --git a/arch/arm/boot/dts/armada-xp.dtsi > > b/arch/arm/boot/dts/armada-xp.dtsi > > index 416eb94..4b3dd56 100644 > > --- a/arch/arm/boot/dts/armada-xp.dtsi > > +++ b/arch/arm/boot/dts/armada-xp.dtsi > > @@ -114,7 +114,6 @@ > > interrupts = <52>; > > dmacap,memcpy; > > dmacap,xor; > > - dmacap,memset; > > }; > > }; > > > > @@ -134,7 +133,6 @@ > > interrupts = <95>; > > dmacap,memcpy; > > dmacap,xor; > > - dmacap,memset; > > }; > > }; > > > > diff --git a/arch/arm/boot/dts/dove.dtsi b/arch/arm/boot/dts/dove.dtsi > > index 6cab468..2cef34f 100644 > > --- a/arch/arm/boot/dts/dove.dtsi > > +++ b/arch/arm/boot/dts/dove.dtsi > > @@ -232,7 +232,6 @@ > > > > channel1 { > > interrupts = <40>; > > - dmacap,memset; > > dmacap,memcpy; > > dmacap,xor; > > }; > > @@ -253,7 
+252,6 @@ > > > > channel1 { > > interrupts = <43>; > > - dmacap,memset; > > dmacap,memcpy; > > dmacap,xor; > > }; > > diff --git a/arch/arm/boot/dts/kirkwood.dtsi > > b/arch/arm/boot/dts/kirkwood.dtsi > > index 9809fc1..078637c 100644 > > --- a/arch/arm/boot/dts/kirkwood.dtsi > > +++ b/arch/arm/boot/dts/kirkwood.dtsi > > @@ -126,7 +126,6 @@ > > interrupts = <6>; > > dmacap,memcpy; > > dmacap,xor; > > - dmacap,memset; > > }; > >
[git pull] drm build fix
Hi Linus,

just a quick fix that a few people have reported; it would be nice to
have in asap.

Dave.

The following changes since commit 72a67a94bcba71a5fddd6b3596a20604d2b5dcd6:

  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net (2013-08-03 15:00:23 -0700)

are available in the git repository at:

  git://people.freedesktop.org/~airlied/linux drm-fixes

for you to fetch changes up to adfb8e51332153016857194b85309150ac560286:

  drm/radeon: fix 64 bit divide in SI spm code (2013-08-04 11:03:14 +1000)

Alex Deucher (1):
      drm/radeon: fix 64 bit divide in SI spm code

 drivers/gpu/drm/radeon/si_dpm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
Re: Cannot hot remove a memory device
On Sat, 2013-08-03 at 03:01 +0200, Rafael J. Wysocki wrote: > On Friday, August 02, 2013 06:04:40 PM Toshi Kani wrote: > > On Sat, 2013-08-03 at 01:43 +0200, Rafael J. Wysocki wrote: > > > On Friday, August 02, 2013 03:46:15 PM Toshi Kani wrote: > > > > On Thu, 2013-08-01 at 23:43 +0200, Rafael J. Wysocki wrote: > > > > > Hi, > > > > > > > > > > Thanks for your report. > > > > > > > > > > On Thursday, August 01, 2013 05:37:21 PM Yasuaki Ishimatsu wrote: > > > > > > By following commit, I cannot hot remove a memory device. > > > > > > > > > > > > ACPI / memhotplug: Bind removable memory blocks to ACPI device nodes > > > > > > commit e2ff39400d81233374e780b133496a2296643d7d > > > > > > > > > > > > Details are follows: > > > > > > When I add a memory device, acpi_memory_enable_device() always fails > > > > > > as follows: > > > > > > > > > > > > ... > > > > > > [ 1271.114116] [ea121c40-ea121c7f] PMD -> > > > > > > [880813c0-880813ff] on node 3 > > > > > > [ 1271.128682] [ea121c80-ea121cbf] PMD -> > > > > > > [88081380-880813bf] on node 3 > > > > > > [ 1271.143298] [ea121cc0-ea121cff] PMD -> > > > > > > [88081300-8808133f] on node 3 > > > > > > [ 1271.157799] [ea121d00-ea121d3f] PMD -> > > > > > > [880812c0-880812ff] on node 3 > > > > > > [ 1271.172341] [ea121d40-ea121d7f] PMD -> > > > > > > [88081280-880812bf] on node 3 > > > > > > [ 1271.186872] [ea121d80-ea121dbf] PMD -> > > > > > > [88081240-8808127f] on node 3 > > > > > > [ 1271.201481] [ea121dc0-ea121dff] PMD -> > > > > > > [88081200-8808123f] on node 3 > > > > > > [ 1271.216041] [ea121e00-ea121e3f] PMD -> > > > > > > [880811c0-880811ff] on node 3 > > > > > > [ 1271.230623] [ea121e40-ea121e7f] PMD -> > > > > > > [88081180-880811bf] on node 3 > > > > > > [ 1271.245148] [ea121e80-ea121ebf] PMD -> > > > > > > [88081140-8808117f] on node 3 > > > > > > [ 1271.259683] [ea121ec0-ea121eff] PMD -> > > > > > > [88081100-8808113f] on node 3 > > > > > > [ 1271.274194] [ea121f00-ea121f3f] PMD -> > > > > > > 
[880810c0-880810ff] on node 3 > > > > > > [ 1271.288764] [ea121f40-ea121f7f] PMD -> > > > > > > [88081080-880810bf] on node 3 > > > > > > > > It appears that each memory object only has 64MB of memory. This is > > > > less than the memory block size, which is 128MB. This means that a > > > > single memory block associates with two ACPI memory device objects. > > > > > > That'd be bad. > > > > > > How did that work before if that indeed is the case? > > > > Well, it looks to me that it has never worked before... > > > > > > > > ... > > > > > > [ 1271.325841] acpi PNP0C80:03: acpi_memory_enable_device() error > > > > > > > > > > Well, the only new way acpi_memory_enable_device() can fail after > > > > > that commit > > > > > is a failure in acpi_bind_memory_blocks(). > > > > > > > > Agreed. > > > > > > > > > This means that either handle is NULL, which I think we can exclude, > > > > > because > > > > > acpi_memory_enable_device() wouldn't be called at all if that were > > > > > the case, or > > > > > there's a more subtle error in acpi_bind_one(). > > > > > > > > > > One that comes to mind is that we may be calling acpi_bind_one() > > > > > twice for the > > > > > same memory region, in which it will trigger -EINVAL from the sanity > > > > > check in > > > > > there. > > > > > > > > I think it fails with -EINVAL at the place with dev_warn(dev, "ACPI > > > > handle is already set\n"). When two ACPI memory objects associate with > > > > a same memory block, the bind procedure of the 2nd ACPI memory object > > > > sees that ACPI_HANDLE(dev) is already set to the 1st ACPI memory object. > > > > > > That sound's plausible, but I wonder how we can fix that? > > > > > > There's no way for a single physical device to have two different ACPI > > > "companions". It looks like the memory blocks should be 64 M each in that > > > case. Or we need to create two child devices for each memory block and > > > associate each of them with an ACPI object. 
That would lead to > > > complications > > > in the user space interface, though. > > > > Right. Even bigger issue is that I do not think __add_pages() and > > __remove_pages() can add / delete a memory chunk that is less than > > 128MB. 128MB is the granularity of them. So, we may just have to fail > > this case gracefully. > > Sigh. > > BTW, why do you think they are 64 M each (it's late and I'm obviously tired)? Oops! Sorry, I had confused the above messages with the one in init_memory_mapping(), which shows a memory range being added, i.e. the size of an ACPI
Re: [PATCH] ACPI: Do not fail acpi_bind_one() if device is already bound correctly
On Sat, 2013-08-03 at 02:47 +0200, Rafael J. Wysocki wrote:
> On Friday, August 02, 2013 04:38:38 PM Toshi Kani wrote:
> > On Fri, 2013-08-02 at 00:33 +0200, Rafael J. Wysocki wrote:
> > > From: Rafael J. Wysocki
> > >
> > > Modify acpi_bind_one() so that it doesn't fail if the device
> > > represented by its first argument has already been bound to the
> > > given ACPI handle (second argument), because that is not a good
> > > enough reason for returning an error code.
> >
> > While it seems reasonable to allow such case, I do not think we will hit
> > this case under the normal scenarios. So, I do not think we need to
> > make this change now unless it actually solves Yasuaki's issue (which I
> > am guessing not).
>
> In theory it should be possible to call acpi_bind_one() twice in a row
> for the same dev and the same handle without failure, that simply is
> logical. The patch may not fix any problems visible now, but returning an
> error code in such a case is simply incorrect.

We changed acpi_bus_device_attach() to not call the handler or driver
again if it is already bound. So, I was under the impression that we
prevent attaching the same device twice. But I may be missing
something...

Thanks,
-Toshi
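The behaviour Rafael argues for, binding the same device to the same handle twice without an error, can be sketched as below. The struct and function are hypothetical stand-ins written for user space; the real acpi_bind_one() does considerably more, and the choice of -EINVAL for the conflicting case is taken from the error discussed elsewhere in this thread.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a device with an ACPI companion handle. */
struct sketch_dev {
	const void *handle;
};

/* Bind dev to handle; binding again to the same handle is not an error. */
int sketch_bind_one(struct sketch_dev *dev, const void *handle)
{
	if (!handle)
		return -EINVAL;

	if (dev->handle) {
		/* Already bound: fine if it is the same handle. */
		return dev->handle == handle ? 0 : -EINVAL;
	}

	dev->handle = handle;
	return 0;
}
```

In this model a second call with the same handle returns 0, while an attempt to bind a different handle to an already bound device still fails and leaves the first binding in place.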
Re: [GIT PULL] ACPI and power management fixes for v3.11-rc4
On Sat, Aug 3, 2013 at 6:27 PM, Igor Gnatenko wrote: > On Sat, 2013-08-03 at 15:21 -0500, Felipe Contreras wrote: >> On Sat, Aug 3, 2013 at 11:08 AM, Igor Gnatenko >> wrote: >> >> > I am opposed to this patch. On ThinkPad X230 I had problems with it. >> > Felipe, come over to dark side. They have cookies. >> >> And v3.11-rc3 works's fine out-of-the box? >> > > rc2 work's fine. Since rc3 Rafael reverted patch and I think backlight > broken again (I don't have time to check rc3 yet). You think? We are talking about this patch [1]. Did you apply that patch on top of rc3 and booted without any arguments? If you did, what problems *exactly* did you have? [1] https://bugzilla.kernel.org/attachment.cgi?id=107084&action=diff&context=patch&collapsed=&headers=1&format=raw -- Felipe Contreras -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Kernel causes excessive delay mounting USB device
* Vern Clark (vmcl...@verizon.net) wrote: > > Plugging in any USB flash stick takes 5-6 minutes before its mounted. > > === > Using the current kernel-3.8.0-28-generic, the USB fails to load in proper > time. > I found this message in syslog: timeout: killing '/sbin/blkid -o udev -p > /dev/sdb'. Right after which, the USB device was mounted. > > Using an earlier kernel-3.2.37-030237-generic, fixed the problem. Did you try going back to one of their previous 3.8.0 versions rather than a big jump like that? > Aug 2 09:39:16 u kernel: [ 2269.914570] sd 11:0:0:0: [sdb] Assuming drive > cache: write through > Aug 2 09:39:47 u kernel: [ 2300.739825] usb 1-8: reset high-speed USB device > number 11 using ehci-pci Curious; I've just started getting something similar on the Ubuntu 3.10.0-6 kernels, and last nights v3.11-rc3 build. It's been suggested perhaps: http://www.spinics.net/lists/stable/msg16022.html is the culprit, but I haven't tried it yet. Dave -- -Open up your eyes, open up your mind, open up your code --- / Dr. David Alan Gilbert| Running GNU/Linux | Happy \ \ gro.gilbert @ treblig.org | | In Hex / \ _|_ http://www.treblig.org |___/ -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: RFC: named anonymous vmas
On Thu, Aug 1, 2013 at 4:36 AM, Rich Felker wrote:
> On Thu, Aug 01, 2013 at 01:29:51AM -0700, Christoph Hellwig wrote:
>> Btw, FreeBSD has an extension to shm_open to create unnamed but fd
>> passable segments. From their man page:
>>
>>   As a FreeBSD extension, the constant SHM_ANON may be used for the path
>>   argument to shm_open(). In this case, an anonymous, unnamed shared
>>   memory object is created. Since the object has no name, it cannot be
>>   removed via a subsequent call to shm_unlink(). Instead, the shared
>>   memory object will be garbage collected when the last reference to the
>>   shared memory object is removed. The shared memory object may be shared
>>   with other processes by sharing the file descriptor via fork(2) or
>>   sendmsg(2). Attempting to open an anonymous shared memory object with
>>   O_RDONLY will fail with EINVAL. All other flags are ignored.
>>
>> To me this sounds like the best way to expose this functionality to the
>> user. Implementing it is another question as shm_open sits in libc,
>> we could either take it and shm_unlink to the kernel, or use O_TMPFILE
>> on tmpfs as the backend.
>
> I'm not sure what the purpose is. shm_open with a long random filename
> and O_EXCL|O_CREAT, followed immediately by shm_unlink, is just as
> good except in the case where you have a malicious user killing the
> process in between these two operations.

In practice, filename length is restricted by NAME_MAX (255 bytes).
Several people don't think that is long enough.

The point is a race-free API.
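For reference, the workaround Rich describes looks roughly like the sketch below. The function name, the name format and the retry count are made up for the example, and rand() is far too weak for a real unpredictable name; the calls themselves (shm_open, shm_unlink, ftruncate) are the standard POSIX ones. On older glibc versions the program may need to be linked with -lrt.

```c
#define _POSIX_C_SOURCE 200809L	/* for shm_open() and friends */

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Create a shared memory object under a throwaway name and unlink it
 * right away, so that only the returned descriptor refers to it.
 * Returns the descriptor, or -1 on failure.
 */
int sketch_anon_shm(size_t size)
{
	char name[64];
	int fd = -1;

	for (int attempt = 0; attempt < 16 && fd < 0; attempt++) {
		/* Weak randomness, for illustration only. */
		snprintf(name, sizeof(name), "/sketch-%ld-%d",
			 (long)getpid(), rand());
		fd = shm_open(name, O_RDWR | O_CREAT | O_EXCL, 0600);
	}
	if (fd < 0)
		return -1;

	/* The window: if the process is killed here, the name stays behind. */
	shm_unlink(name);

	if (ftruncate(fd, (off_t)size) != 0) {
		close(fd);
		return -1;
	}
	return fd;
}
```

The comment before shm_unlink() marks the gap between the two calls that this thread is about; a single call that never creates a name would not have it.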
Re: [GIT PULL] ACPI and power management fixes for v3.11-rc4
On Sat, 2013-08-03 at 15:21 -0500, Felipe Contreras wrote: > On Sat, Aug 3, 2013 at 11:08 AM, Igor Gnatenko > wrote: > > > I am opposed to this patch. On ThinkPad X230 I had problems with it. > > Felipe, come over to dark side. They have cookies. > > And v3.11-rc3 works fine out of the box? > rc2 works fine. In rc3 Rafael reverted the patch, and I think the backlight is broken again (I haven't had time to check rc3 yet). -- Igor Gnatenko Fedora release 19 (Schrödinger’s Cat) Linux 3.10.4-300.fc19.x86_64
Re: [PATCH] Staging: olpc_dcon: replace some magic numbers
On Sat, Aug 03, 2013 at 02:38:45PM -0700, Andres Salomon wrote: > On Sat, 3 Aug 2013 23:36:15 +0200 > Jens Frederich wrote: > > > On Sat, Aug 3, 2013 at 11:16 PM, Andres Salomon > > wrote: > > > Please Cc Daniel on these. Cjb and myself are no longer at olpc. > > > > > > > Do you know what's with Jon Nettleton? He is also on the TODO list? > > Let's ask. Jon, do you still want to be Cc'd on DCON and other OLPC > patches? No one reads the TODO files... Just update MAINTAINERS so that get_maintainer.pl CC's the right people automatically. regards, dan carpenter
Re: [PATCH] Staging: olpc_dcon: replace some magic numbers
On Sat, Aug 03, 2013 at 10:44:35PM +0200, Jens Frederich wrote: > @@ -126,7 +127,7 @@ static int dcon_bus_stabilize(struct dcon_priv *dcon, int > is_powered_down) > power_up: > if (is_powered_down) { > x = 1; > - x = olpc_ec_cmd(0x26, (unsigned char *)&x, 1, NULL, 0); > + x = olpc_ec_cmd(EC_DCON_POWER_MODE, (u8 *)&x, 1, NULL, 0); You didn't introduce this, but using "x" as the inbuf here is messy. It should be a char instead of an int. The code won't work on big endian systems. I know this hardware is only available on little endian systems and that's why it's not a bug. It's just an ugly thing to do. (Since you didn't introduce this, it means your patch is fine and you can ignore this email. I am just commenting in case anyone wants to clean it up). > if (x) { > pr_warn("unable to force dcon to power up: %d!\n", x); > return x; regards, dan carpenter
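The endianness point above can be illustrated in plain userspace C. This is a sketch only: consume_one_byte is a hypothetical stand-in for a command that reads exactly one byte from its input buffer, and none of the names below come from the olpc_dcon driver.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static_assert(sizeof(int) > 1, "the demo needs int to be wider than one byte");

/* Hypothetical stand-in for a command that consumes exactly one input byte. */
static uint8_t consume_one_byte(const uint8_t *inbuf)
{
    return inbuf[0];
}

/* Passes the address of an int: the byte seen depends on host byte order. */
uint8_t send_from_int(void)
{
    int x = 1;
    return consume_one_byte((const uint8_t *)&x);
}

/* Passes the address of a one-byte variable: always sees the value 1. */
uint8_t send_from_u8(void)
{
    uint8_t mode = 1;
    return consume_one_byte(&mode);
}

/* Returns 1 on a little endian host, 0 on a big endian host. */
int host_is_little_endian(void)
{
    const uint16_t probe = 1;
    uint8_t first;

    memcpy(&first, &probe, 1);
    return first == 1;
}
```

On a little endian host both variants hand over the byte 1. On a big endian host send_from_int() would hand over 0, because the first byte of the int is its most significant byte, which is why a one-byte variable is the cleaner choice.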
Re: [ 00/99] 3.10.5-stable review
On Sat, Aug 03, 2013 at 07:02:50AM -0700, Guenter Roeck wrote: > On 08/02/2013 03:07 AM, Greg Kroah-Hartman wrote: > >This is the start of the stable review cycle for the 3.10.5 release. > >There are 99 patches in this series, all will be posted as a response > >to this one. If anyone has any issues with these being applied, please > >let me know. > > > >Responses should be made by Sun Aug 4 10:00:45 UTC 2013. > >Anything received after that time might be too late. > > > >The whole patch series can be found in one patch at: > > kernel.org/pub/linux/kernel/v3.0/stable-review/patch-3.10.5-rc1.gz > >and the diffstat can be found below. > > > Cross build looks good. > > Summary: > Total builds: 64 Total build errors: 3 > > Details: > > http://desktop.roeck-us.net/buildlogs/v3.10/v3.10.4-99-g069bbeb.2013-08-03.02:22:12 Thanks for testing and letting me know. greg k-h
Re: [PATCH] acpi: video: improve quirk check
On Sat, Aug 3, 2013 at 5:38 PM, Rafael J. Wysocki wrote: > On Saturday, August 03, 2013 05:20:33 PM Felipe Contreras wrote: >> On Sat, Aug 3, 2013 at 4:40 PM, Rafael J. Wysocki wrote: >> > On Saturday, August 03, 2013 03:24:16 PM Felipe Contreras wrote: >> >> On Sat, Aug 3, 2013 at 6:34 AM, Rafael J. Wysocki wrote: >> >> > On Saturday, August 03, 2013 04:14:04 PM Aaron Lu wrote: >> >> >> >> >> Yes, the patch is correct, but I still prefer my own version :-) >> >> >> https://github.com/aaronlu/linux/commit/0a3d2c5b59caf80ae5bb1ca1fda0f7bf448b38c9 >> >> >> >> >> >> In case you want to take mine and mine needs refresh, please let me >> >> >> know >> >> >> and I can do the re-base, thanks. >> >> > >> >> > Well, I prefer simpler, unless there's a good reason to use more >> >> > complicated. >> >> >> >> Note that these are not exclusionary; his patch can be applied on top >> >> of mine. I don't think his patch is needed though. >> > >> > OK >> > >> > Do we still need to revert commit efaa14c if this patch is applied? >> >> I guess not. At least in this machine changing the backlight works >> correctly, whereas in v3.11-rc3 it was all weird, and v3.7-v3.10 >> didn't work at all. I cannot see how it would affect negatively other >> machines. >> >> That being said, the blacklisting is still needed, because 1. the >> level is not preserved between boots, and 2. level 0 turns off the >> screen, which I personally consider a regression. >> >> At least it boots to level 100 instead of 0. > > OK > > I'll push this patch to Linus for -rc5 then without the revert of commit > commit efaa14c. That's all I'm going to do for 3.11 in the ACPI video > area at this point. That seems fair. > As far as the blacklisting is concerned, I still have the blacklist of > your Asus machine queued up for 3.12. Since you're claiming that it > doesn't have any side effects on that machine, I think I can apply it. 
> > However, for other machines to be added to that blacklist I need to see > requests from their users with confirmation that there are no visible side > effects there. Good, that's the purpose of bug 60682. https://bugzilla.kernel.org/show_bug.cgi?id=60682 -- Felipe Contreras
Re: Possible mmap() write() problem in SLES11 SP2 kernel
On Thu, 1 Aug 2013, Ulrich Windl wrote: > Hi folks! > > I think I'd let you know (maybe I'm wrong, and the kernel is right): > > I write a C-program that maps a file into an private writable map. Then I > modify the area a bit and use one write to write that area back to a file. > > This worked fine in SLES11 kernel 3.0.74-0.6.10. However with kernel > 3.0.80-0.7 the write() fails with EFAULT if the output file is the same as > the input file. I wonder if you actually did exactly the same on both kernels. > > The strace is amazingly short (I removed the unrelated calls): Providing that was very helpful. > open("xxx", O_RDONLY) = 3 > fstat(3, {st_mode=S_IFREG|0644, st_size=4416, ...}) = 0 > mmap(NULL, 4416, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3, 0) = 0x7f85ac045000 > close(3)= 0 > open("xxx", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3 The crucial point is the above O_TRUNC when you now open the file for writing: that truncates the file to 0-length, which unmaps any pages mapped from it into userspace. Even the privately modified COW pages: that often seems surprising, but it is how mmap versus truncate is specified to work. > write(3, 0x7f85ac045000, 4414) = -1 EFAULT (Bad address) If your program now touched a part of the mapping, it would get SIGBUS, there being no pages of underlying object to page in from. But since you're accessing the area from within a system call, that simply fails with EFAULT. > close(3)= 0 > munmap(0x7f85ac045000, 4414)= 0 > > I want to have your attention if this should work, and you get my attention > if this should not work. It should not work. > Note that the input file is closed before it's opened for write again. As the > output file is typically shorter than the input, I didn't want to use a > non-private mapping and a truncate, just in case you wonder... (I didn't understand your logic there.) 
Hugh
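Given the explanation above, one way to get the intended effect is to copy the privately modified mapping into ordinary heap memory before reopening the file with O_TRUNC, so that write() never reads from pages the truncation has removed. The following is a minimal sketch, not taken from the thread: the helper name rewrite_first_byte and the "modify one byte" edit are assumptions chosen only to keep the example small.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: map a file privately, change its first byte,
 * then rewrite the same file from a heap copy of the mapping.
 * Returns 0 on success, -1 on failure.
 */
int rewrite_first_byte(const char *path, char c)
{
    struct stat st;
    ssize_t written;
    size_t len;
    char *map, *copy;
    int fd;

    assert(path != NULL);

    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    if (fstat(fd, &st) < 0 || st.st_size == 0) {
        close(fd);
        return -1;
    }
    len = (size_t)st.st_size;

    map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    close(fd);
    if (map == MAP_FAILED)
        return -1;

    map[0] = c; /* private copy-on-write page, the file is untouched */

    /* Copy out BEFORE truncating: O_TRUNC unmaps even the COW pages. */
    copy = malloc(len);
    if (copy == NULL) {
        munmap(map, len);
        return -1;
    }
    memcpy(copy, map, len);
    munmap(map, len);

    fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0666);
    if (fd < 0) {
        free(copy);
        return -1;
    }
    written = write(fd, copy, len);
    free(copy);
    if (close(fd) < 0 || written != (ssize_t)len)
        return -1;
    return 0;
}
```

Writing to a temporary file and renaming it over the original would be another option and avoids leaving a truncated file behind if the write fails. The sketch keeps the original open/truncate/write sequence so it lines up with the strace output quoted above.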
Re: [PATCH 001/001] CHAR DRIVERS: a simple device to give daemons a /sys-like interface
On Fri, Aug 02, 2013 at 06:19:19PM -0700, Bob Smith wrote: > This character device can give daemons an interface similar to > the kernel's /sys and /proc interfaces. It is a nice way to > give user space drivers real device nodes in /dev. Other comments about this patch: > From 7ee4391af95b828179cf5627f8b431c3301c5057 Mon Sep 17 00:00:00 2001 > From: Bob Smith > Date: Fri, 2 Aug 2013 16:44:48 -0700 > Subject: [PATCH] PROXY, a driver that gives daemons a /sys like interface > No signed-off-by:, or body of text here that explains what this is and why it should be accepted. > --- > Documentation/proxy.txt | 36 > drivers/char/Kconfig|8 + > drivers/char/Makefile |2 +- > drivers/char/proxy.c| 539 > +++ > 4 files changed, 584 insertions(+), 1 deletion(-) > create mode 100644 Documentation/proxy.txt > create mode 100644 drivers/char/proxy.c > > diff --git a/Documentation/proxy.txt b/Documentation/proxy.txt > new file mode 100644 > index 000..6b9206a > --- /dev/null > +++ b/Documentation/proxy.txt > @@ -0,0 +1,36 @@ > +Proxy Character Devices > + > + > +Proxy is a small character device that connects two user space > +processes. It is intended to give user space daemons a /sys like > +interface for configuration and status. > + > +As an example consider a daemon that controls a stepper motor. The > +daemon would create and open one proxy device to read and write > +configuration (/dev/stepper/config) and another proxy device to > +accept a motor step count (/dev/stepper/count). > +Shell commands to illustrate this example: > + $ stepper_daemon# start the stepper control daemon > + $ # Set config to full steps, clockwise and 400 step/sec > + $ echo "full, cw, 400" > /dev/stepper/config > + $ # Now tell the motor to step 4000 steps > + $ echo 4000 > /dev/stepper/count > + $ sleep 2 > + $ # How many steps remain? > + $ cat /dev/stepper/count > + > + > +Proxy has some unique features that make ideal for providing a > +/sys like interface. It has no internal buffering. 
The means > +the daemon can not write until a client program is listening. > +Both named pipes and pseudo-ttys have internal buffers. So what is wrong with internal buffers? Named pipes have been around for a long time, they should be able to be used much like this, right? > +Proxy will succeed on a write of zero bytes. A zero byte write > +gives the client an EOF. The daemon in the example above would > +use a zero byte write in the last command after it had written the > +number of steps remaining. No other IPC mechanism can close one > +side of a device and leave the other side open. No "direct" IPC, but you can always emulate this just fine with existing IPC mechanisms. > +Proxy works well with select(), an important feature for daemons. > +In contrast, the FUSE filesystem has some issues with select() on > +the client side. What are those issues? Why not just fix them? Adding a new IPC function to the kernel should not be buried down in drivers/char/. We have 10+ different IPC mechanisms already, some simple, some more complex. Are you _sure_ none of the existing ones will work for you? Maybe a simple userspace library that wraps the existing mechanisms would be better (no kernel changes needed, portable to any kernel release, etc.)? thanks, greg k-h
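As one possible illustration of emulating the EOF behaviour with an existing mechanism, a Unix domain stream socket can be half-closed with shutdown(): the peer then reads EOF after the pending data, while the other direction stays usable. The sketch below is an assumption about what such an emulation might look like, not something proposed in the thread. The function name half_close_demo and the step-count strings are made up, and it does not reproduce the "no internal buffering" property the patch describes.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical demo: the "daemon" end sends a reply and signals EOF,
 * then still receives a new request from the "client" end.
 * Returns 0 on success, -1 on failure.
 */
int half_close_demo(char *client_buf, size_t client_len,
                    char *daemon_buf, size_t daemon_len)
{
    int sv[2];
    int ok = -1;
    ssize_t n;
    char extra;

    assert(client_len > 1 && daemon_len > 1);

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;
    /* sv[0] plays the daemon, sv[1] plays the client. */

    /* Daemon: write the steps remaining, then close only its write side. */
    if (write(sv[0], "3200\n", 5) != 5)
        goto out;
    if (shutdown(sv[0], SHUT_WR) < 0)
        goto out;

    /* Client: reads the data, and the next read reports EOF (0 bytes). */
    n = read(sv[1], client_buf, client_len - 1);
    if (n < 0)
        goto out;
    client_buf[n] = '\0';
    if (read(sv[1], &extra, 1) != 0)
        goto out;

    /* The client-to-daemon direction is still open. */
    if (write(sv[1], "4000\n", 5) != 5)
        goto out;
    n = read(sv[0], daemon_buf, daemon_len - 1);
    if (n < 0)
        goto out;
    daemon_buf[n] = '\0';
    ok = 0;
out:
    close(sv[0]);
    close(sv[1]);
    return ok;
}
```

A real daemon would bind a named socket and accept one connection per client command rather than use socketpair(). Both descriptors are kept in one process here only so the behaviour can be checked in a single call.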
Re: [GIT PULL] ACPI and power management fixes for v3.11-rc4
On Sat, Aug 3, 2013 at 5:21 PM, Rafael J. Wysocki wrote: > On Saturday, August 03, 2013 05:06:10 PM Felipe Contreras wrote: >> On Sat, Aug 3, 2013 at 4:59 PM, Rafael J. Wysocki wrote: > > [...] > >> > Whatever you are thinking you will achieve this way, it doesn't work. >> >> It is the reality: v3.7 is broken, v3.8 is broken, v3.9 is broken, >> v3.10 is broken, v3.11 is going to be broken, v3.12 will probably be >> broken too, and perhaps even v3.13. > > Be precise and say "backlight control on a number of machines in broken in > those kernels". Yes, it is. It needs to be fixed. Not necessarily your way, > though. "My" way, can be done *today*. "Your" way can be done whenever it's ready. >> Who benefits from this? > > Clearly, no one. > > And who benefits from your "crusade"? Who benefits from yours? It's all very simple; we ask the owners of these machines if acpi_osi="!Windows 2012" makes things work better, if they do, they go into the blacklist, if not, they don't. That would fix the problems for these machines *today*. If later it turns out there are other issues introduced, they get removed, just like you removed the intel_backlight switch patch. Igor Gnatenko said he had problems, but didn't mention which problem. In bug #51231, tons of people reported that things improved for ThinkPad X230. And when the proper fix is done, everyone will be happy and we can drop the blacklist. Everybody benefits this way. -- Felipe Contreras -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] acpi: video: improve quirk check
On Saturday, August 03, 2013 05:20:33 PM Felipe Contreras wrote: > On Sat, Aug 3, 2013 at 4:40 PM, Rafael J. Wysocki wrote: > > On Saturday, August 03, 2013 03:24:16 PM Felipe Contreras wrote: > >> On Sat, Aug 3, 2013 at 6:34 AM, Rafael J. Wysocki wrote: > >> > On Saturday, August 03, 2013 04:14:04 PM Aaron Lu wrote: > >> > >> >> Yes, the patch is correct, but I still prefer my own version :-) > >> >> https://github.com/aaronlu/linux/commit/0a3d2c5b59caf80ae5bb1ca1fda0f7bf448b38c9 > >> >> > >> >> In case you want to take mine and mine needs refresh, please let me know > >> >> and I can do the re-base, thanks. > >> > > >> > Well, I prefer simpler, unless there's a good reason to use more > >> > complicated. > >> > >> Note that these are not exclusionary; his patch can be applied on top > >> of mine. I don't think his patch is needed though. > > > > OK > > > > Do we still need to revert commit efaa14c if this patch is applied? > > I guess not. At least in this machine changing the backlight works > correctly, whereas in v3.11-rc3 it was all weird, and v3.7-v3.10 > didn't work at all. I cannot see how it would affect negatively other > machines. > > That being said, the blacklisting is still needed, because 1. the > level is not preserved between boots, and 2. level 0 turns off the > screen, which I personally consider a regression. > > At least it boots to level 100 instead of 0. OK I'll push this patch to Linus for -rc5 then without the revert of commit commit efaa14c. That's all I'm going to do for 3.11 in the ACPI video area at this point. As far as the blacklisting is concerned, I still have the blacklist of your Asus machine queued up for 3.12. Since you're claiming that it doesn't have any side effects on that machine, I think I can apply it. However, for other machines to be added to that blacklist I need to see requests from their users with confirmation that there are no visible side effects there. I think this is fair enough. 
Thanks, Rafael
Re: [PATCH v3 2/4] ASoc: kirkwood: merge kirkwood-i2c and kirkwood-dma
On Wed, Jul 31, 2013 at 08:18:06AM +0200, Jean-Francois Moine wrote: > To avoid the declaration of a 'kirkwood-pcm-audio' device in the DT, > this patch merges the kirkwood-i2c and kirkwood-dma drivers into one > module associated with 'kirkwood-i2s'. I suggest holding off on this stuff at the moment. I think Mark and Liam (who now have a whole raft of emails from me today) have some work to do to fix ASoC to work how they're saying it should work, because to me ASoC seems to contradict what they're telling me. To put it another way, it must be buggy. The DAPM stuff seems to be the worst thing since mouldy bread. I'm chasing what seems to be multiple bugs through this stuff, and many of them are not particularly nice. For example - we register multiple copies of DAPM widgets for the same set of declarations if both a CPU DAI and a platform share the same struct device, and thereby end up overwriting some pointers in the DAI structure. It seems that CPU DAIs themselves aren't supposed to have DAPM widgets, but the core creates some... but there's no explicit cleanup of them unlike the other DAPMs, and there's no debugfs for them. The bias levels are stuck at off/standby all the time, which makes any kind of PM with spdif-dit impossible - or even any routing between stream input and output. Basically, I'm tearing my hair out today looking at this stuff and getting nowhere fast - all I'm doing is finding more and more problems. For example, where a codec has no input/output widgets declared, it used to be powered up automatically by ASoC as a matter of course. Things like UDA134x and other things. Things like spdif-dit. That "mysteriously" stopped happening. 
ASoC: dapm: Treat DAI widgets like AIF widgets for power seems to be the cause, this results in such setups having zero connected inputs/outputs reported, which causes them to remain powered down - because what used to happen before was the DAI links would report their powered state depending on whether they were active, and they are set active in soc_dapm_stream_event(). Now, when playback widget for the Codec and CPU DAIs get marked as active. The playback widget is created as a snd_soc_dapm_dai_in. It's power check function is set to dapm_dac_check_power. Since this widget is active, it checks for connected outputs via is_connected_output_ep(). We drop through to the second switch statement (why do we have two there? They're both switching on the same damned variable and its not like a widget can change its ID.) This is where the different behaviour has appeared - when these were just a simple snd_soc_dapm_dai, we used to just do the snd_soc_dapm_suspend_check() here, but we don't anymore. This is a snd_soc_dapm_dai_in type of widget, so we fall through this switch statement now and start searching the paths. As these codecs have no paths, this ultimately ends up returning no connections. Hence why these codecs with no DAPM widgets declared but have PM support via the bias level stuff have stopped working. Now, about the spdif-dit, if we're going to have to add "pin" widgets to it, what the output of a SPDIF in terms of DAPM widgets? At a guess, it's a "codec pin" despite there being no codec and no "pin" on that codec in reality, and that "pin" is always active. 
With that "fixed" (rather, altered to a state where it behaves the same way as it used to before the above commit) if I set up a DAPM route between the CPU DAI playback stream, an AIF (for spdif output) and the spdif-dit playback stream, I see the playback streams marked active and powered up, but the AIF stream remains powered down: spdif-dit/dapm/Playback:Playback: On in 1 out 1 spdif-dit/dapm/Playback: stream Playback active spdif-dit/dapm/Playback: in "static" "spdif-playback" spdif-dit/dapm/bias_level:On kirkwood-audio.1/dapm/spdif-playback:spdif-playback: Off in 1 out 0 kirkwood-audio.1/dapm/spdif-playback: in "static" "dma-playback" kirkwood-audio.1/dapm/spdif-playback: out "static" "Playback" kirkwood-audio.1/dapm/dma-playback:dma-playback: On in 1 out 1 kirkwood-audio.1/dapm/dma-playback: stream dma-playback active kirkwood-audio.1/dapm/dma-playback: out "static" "spdif-playback" kirkwood-audio.1/dapm/dma-playback: out "static" "i2s-playback" It looks to me like the DAPM stuff is - in one plain and simple word - buggered. I've no idea what the right fixes are in this area. It needs someone like Mark or Liam who supposedly understand this to spend time checking that it actually operates as they _think_ it should operate, because at the moment it plainly doesn't. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] acpi: video: improve quirk check
On Sat, Aug 3, 2013 at 4:40 PM, Rafael J. Wysocki wrote: > On Saturday, August 03, 2013 03:24:16 PM Felipe Contreras wrote: >> On Sat, Aug 3, 2013 at 6:34 AM, Rafael J. Wysocki wrote: >> > On Saturday, August 03, 2013 04:14:04 PM Aaron Lu wrote: >> >> >> Yes, the patch is correct, but I still prefer my own version :-) >> >> https://github.com/aaronlu/linux/commit/0a3d2c5b59caf80ae5bb1ca1fda0f7bf448b38c9 >> >> >> >> In case you want to take mine and mine needs refresh, please let me know >> >> and I can do the re-base, thanks. >> > >> > Well, I prefer simpler, unless there's a good reason to use more >> > complicated. >> >> Note that these are not exclusionary; his patch can be applied on top >> of mine. I don't think his patch is needed though. > > OK > > Do we still need to revert commit efaa14c if this patch is applied? I guess not. At least in this machine changing the backlight works correctly, whereas in v3.11-rc3 it was all weird, and v3.7-v3.10 didn't work at all. I cannot see how it would affect negatively other machines. That being said, the blacklisting is still needed, because 1. the level is not preserved between boots, and 2. level 0 turns off the screen, which I personally consider a regression. At least it boots to level 100 instead of 0. -- Felipe Contreras
Re: [GIT PULL] ACPI and power management fixes for v3.11-rc4
On Saturday, August 03, 2013 05:06:10 PM Felipe Contreras wrote: > On Sat, Aug 3, 2013 at 4:59 PM, Rafael J. Wysocki wrote: [...] > > Whatever you are thinking you will achieve this way, it doesn't work. > > It is the reality: v3.7 is broken, v3.8 is broken, v3.9 is broken, > v3.10 is broken, v3.11 is going to be broken, v3.12 will probably be > broken too, and perhaps even v3.13. Be precise and say "backlight control on a number of machines is broken in those kernels". Yes, it is. It needs to be fixed. Not necessarily your way, though. > Who benefits from this? Clearly, no one. And who benefits from your "crusade"? Again, no one.
Re: [GIT PULL] ACPI and power management fixes for v3.11-rc4
On Sat, Aug 3, 2013 at 4:59 PM, Rafael J. Wysocki wrote: > On Saturday, August 03, 2013 03:20:22 PM Felipe Contreras wrote: >> On Sat, Aug 3, 2013 at 6:54 AM, Rafael J. Wysocki wrote: >> > On Friday, August 02, 2013 08:48:09 PM Felipe Contreras wrote: >> >> >> Yes, that's fine, either the revert, or the patch I mentioned, or >> >> something else, but something has to be done, and it was better to do >> >> it in v3.11-rc4 than in v3.11-rc5, because that change itself can >> >> cause further problems. >> > >> > A revert can be done in -rc5 just fine. If we don't have a working fix >> > this >> > week, I'll do the revert. >> >> I think you are waiting for miracles. But whatever. >> >> >> Let's do a though experiment, let's say you are right, and they can >> >> survive the 6 months it would take you to find the "perfect" solution, >> >> say in v3.13. What's wrong with having a partial solution in v3.12? If >> >> the blacklisting doesn't work properly (there's absolutely no evidence >> >> for that), then you revert it on v3.12.1. >> >> >> >> What's wrong with that approach? >> > >> > If the blacklisting leads to problems, they may not be reported in the 3.12 >> > time frame, but much later. For example because people won't realize that >> > the problems are caused by the blacklisting until much much later. And >> > then >> > we'll be in a spot where whatever we do will break things for someone. >> >> The key word is *may*, you don't *know*. Why do you insist in >> committing this reification fallacy? >> >> This threat is not real, it's theoretical. > > I believe it is real. That is called wishful thinking. Believing so doesn't make it so. > Igor has told you that already, hasn't he? He said he "had problems", that doesn't mean much. What kind of problems? Did he have those problems in v3.10? Does v3.11-rc3 work correctly? Did he boot without any boot arguments? "I had problems" is almost meaningless. 
>> But let's suppose you are right, and there are issues, and those don't >> get reported in v3.12. That is actually GOOD, if people don't report >> issues, it means the issues are not that big, or not even *there*. > > You don't even realize how wrong you are. The issues that aren't visible > in testing from the start are often much *worse* than those we can see > immediately, because usually they are much more difficult to fix and they > cause much more pain overall as a result. How convenient. In other words; there's absolutely no empirical way to prove you wrong. It's like trying to prove that your invisible pink unicorn doesn't exist. If we don't find evidence of problems, that is evidence that there are no problems. And even if there are "invisible" "worse" issues, it doesn't matter, because you will fix them properly in six months, right? I'd say, fix the visible (aka. real) issues, don't worry about the invisible (aka. imaginary) ones. Worry when they are visible. >> And no, we won't be in a spot where whatever we do will break things, >> because if the intel backlight driver works correctly, that solution >> would work for everyone. And if it doesn't, we should stay with what >> works best. >> >> > And we had situations like that in the past, which is the source of my >> > concern. >> > You obviously don't have that experience, or you won't be so eager to >> > inflict >> > the blacklisting on everyone. >> > >> > Anyway, as you know, but conveniently don't mention, I asked some >> > experienced >> > people for opinions about that. If they agree with you, we will add the >> > blacklist. If they don't, we won't add it. >> >> Again, screw the users. We are stuck with broken backlight for several >> more months to come. Great. > > Whatever you are thinking you will achieve this way, it doesn't work. 
It is the reality: v3.7 is broken, v3.8 is broken, v3.9 is broken, v3.10 is broken, v3.11 is going to be broken, v3.12 will probably be broken too, and perhaps even v3.13. Who benefits from this? -- Felipe Contreras
Re: [PATCH 08/10] cpufreq: Fix broken usage of governor->owner's refcount
On Saturday, August 03, 2013 06:57:11 PM Viresh Kumar wrote: > On 3 August 2013 17:38, Rafael J. Wysocki wrote: > > On Saturday, August 03, 2013 05:19:26 PM Viresh Kumar wrote: > >> Governor's owner refcount usage was broken. We should increment refcount > >> only > >> when CPUFREQ_GOV_POLICY_INIT event has come and should decrement only if > >> CPUFREQ_GOV_POLICY_EXIT has come. > >> > >> Lets fix it. > > > > OK, and what happens if we don't fix it? > > What about this changelog: > > Subject: [PATCH 08/10] cpufreq: Fix broken usage of governor->owner's > refcount > > Governor's owner refcount usage was broken. We should increment refcount only > when CPUFREQ_GOV_POLICY_INIT event has come and should decrement only if > CPUFREQ_GOV_POLICY_EXIT has come. > > Currently there can be situations where governor is in use but we have allowed > it to be unloaded. And that may cause in undefined behavior. "it to be unloaded which may result in undefined behavior." > > Lets fix it. > > Signed-off-by: Viresh Kumar Apart from the above looks good. Thanks, Rafael -- I speak only for myself. Rafael J. Wysocki, Intel Open Source Technology Center. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 00/10] CPUFreq: Fixes & Cleanups for 3.12
On Saturday, August 03, 2013 06:58:35 PM Viresh Kumar wrote: > On 3 August 2013 17:37, Rafael J. Wysocki wrote: > > On Saturday, August 03, 2013 05:19:18 PM Viresh Kumar wrote: > >> Hi Rafael, > >> > >> This patchset tries to fix & cleanup many existing cpufreq core issues. > >> First > >> four patches tries to cleanup basic problems in cpufreq core. Its first > >> patch > >> was earlier sent separately but now is part of this series. > >> > >> Fifth patch was also sent earlier as reply to your patches and was > >> reviewed by > >> Srivatsa. Sixth patch was picked from Lukasz's patchset on introducing > >> software > >> "boost" feature in core. It will be used by this patchset. > >> > >> And last four are the most significant part of this set. They try to make > >> many > >> things simple and robust. > >> > >> This is rebased of your bleeding-edge branch + two patches from you: > >> 18a6b03 cpufreq: Avoid double kobject_put() for the same kobject in error > >> code path > >> d0cde63 cpufreq: Do not hold driver module references for additional > >> policy CPUs > >> abe513f Merge branch 'acpi-sleep-next' into linux-next > >> > >> They are also pushed in my cpufreq-next branch > > > > How much testing has that received so far? > > I planned to add this information but forgot at the last moment. It > was partially > tested. As it was mostly developed over the weekend I wasn't able to do much > of testing. I posted it to get early comments, and testing would complete by > beginning of next week. OK, thanks! -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [GIT PULL] ACPI and power management fixes for v3.11-rc4
On Saturday, August 03, 2013 03:20:22 PM Felipe Contreras wrote: > On Sat, Aug 3, 2013 at 6:54 AM, Rafael J. Wysocki wrote: > > On Friday, August 02, 2013 08:48:09 PM Felipe Contreras wrote: > > >> Yes, that's fine, either the revert, or the patch I mentioned, or > >> something else, but something has to be done, and it was better to do > >> it in v3.11-rc4 than in v3.11-rc5, because that change itself can > >> cause further problems. > > > > A revert can be done in -rc5 just fine. If we don't have a working fix this > > week, I'll do the revert. > > I think you are waiting for miracles. But whatever. > > >> Let's do a though experiment, let's say you are right, and they can > >> survive the 6 months it would take you to find the "perfect" solution, > >> say in v3.13. What's wrong with having a partial solution in v3.12? If > >> the blacklisting doesn't work properly (there's absolutely no evidence > >> for that), then you revert it on v3.12.1. > >> > >> What's wrong with that approach? > > > > If the blacklisting leads to problems, they may not be reported in the 3.12 > > time frame, but much later. For example because people won't realize that > > the problems are caused by the blacklisting until much much later. And then > > we'll be in a spot where whatever we do will break things for someone. > > The key word is *may*, you don't *know*. Why do you insist in > committing this reification fallacy? > > This threat is not real, it's theoretical. I believe it is real. Igor has told you that already, hasn't he? > But let's suppose you are right, and there are issues, and those don't > get reported in v3.12. That is actually GOOD, if people don't report > issues, it means the issues are not that big, or not even *there*. You don't even realize how wrong you are. 
The issues that aren't visible in testing from the start are often much *worse* than those we can see immediately, because usually they are much more difficult to fix and they cause much more pain overall as a result. But I'm not sure why I'm still trying to explain things to you, because you don't understand basic stuff. > And no, we won't be in a spot where whatever we do will break things, > because if the intel backlight driver works correctly, that solution > would work for everyone. And if it doesn't, we should stay with what > works best. > > > And we had situations like that in the past, which is the source of my > > concern. > > You obviously don't have that experience, or you won't be so eager to > > inflict > > the blacklisting on everyone. > > > > Anyway, as you know, but conveniently don't mention, I asked some > > experienced > > people for opinions about that. If they agree with you, we will add the > > blacklist. If they don't, we won't add it. > > Again, screw the users. We are stuck with broken backlight for several > more months to come. Great. Whatever you are thinking you will achieve this way, it doesn't work. You seem to believe that the more you press people, the more they are likely to do what you want. Maybe that belief is a result of your previous experience, but this is not helping you here. Decisions made under pressure usually lead to wrong choices, so I avoid that as much as I can. As a result, the more you're pressing me, the more I'm concerned about the drawbacks and the less convincing you are to me. Consider this before you reply. Rafael -- I speak only for myself. Rafael J. Wysocki, Intel Open Source Technology Center. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Staging: olpc_dcon: replace some magic numbers
On Sat, 3 Aug 2013 23:36:15 +0200 Jens Frederich wrote:
> On Sat, Aug 3, 2013 at 11:16 PM, Andres Salomon wrote:
> > Please Cc Daniel on these. Cjb and myself are no longer at olpc.
> >
>
> Do you know what's with Jon Nettleton? He is also on the TODO list?

Let's ask. Jon, do you still want to be Cc'd on DCON and other OLPC patches?
Re: [PATCH] Staging: olpc_dcon: replace some magic numbers
On Sat, Aug 3, 2013 at 11:16 PM, Andres Salomon wrote:
> Please Cc Daniel on these. Cjb and myself are no longer at olpc.
>

Do you know what's with Jon Nettleton? He is also on the TODO list?
Re: [PATCH 02/10] cpufreq: Re-arrange declarations in cpufreq.h
On Saturday, August 03, 2013 10:43:45 PM Viresh Kumar wrote: > On 3 August 2013 17:19, Viresh Kumar wrote: > > They are pretty much mixed up. Although generic headers are present but > > definitions/declarations are present outside them too.. > > > > This patch just moves stuff up and down to make it look better and > > consistent. > > > > Signed-off-by: Viresh Kumar > > --- > > include/linux/cpufreq.h | 370 > > +++- > > 1 file changed, 177 insertions(+), 193 deletions(-) > > Fixup due to compilation reported by Fengguang's kbuild system: > [Will post the series again once I receive more comments on it] OK, thanks. I'm waiting for the update of the whole series, then. > diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h > index a6b97e2..d568f39 100644 > --- a/include/linux/cpufreq.h > +++ b/include/linux/cpufreq.h > @@ -268,6 +268,19 @@ int cpufreq_unregister_notifier(struct > notifier_block *nb, unsigned int list); > void cpufreq_notify_transition(struct cpufreq_policy *policy, > struct cpufreq_freqs *freqs, unsigned int state); > > +#else /* CONFIG_CPU_FREQ */ > +static inline int cpufreq_register_notifier(struct notifier_block *nb, > + unsigned int list) > +{ > + return 0; > +} > +static inline int cpufreq_unregister_notifier(struct notifier_block *nb, > + unsigned int list) > +{ > + return 0; > +} > +#endif /* !CONFIG_CPU_FREQ */ > + > /** > * cpufreq_scale - "old * mult / div" calculation for large values > (32-bit-arch > * safe) > @@ -282,32 +295,16 @@ static inline unsigned long > cpufreq_scale(unsigned long old, u_int div, > u_int mult) > { > #if BITS_PER_LONG == 32 > - > u64 result = ((u64) old) * ((u64) mult); > do_div(result, div); > return (unsigned long) result; > > #elif BITS_PER_LONG == 64 > - > unsigned long result = old * ((u64) mult); > result /= div; > return result; > - > #endif > -}; > - > -#else /* CONFIG_CPU_FREQ */ > -static inline int cpufreq_register_notifier(struct notifier_block *nb, > - unsigned int list) > -{ > - return 0; 
> } > -static inline int cpufreq_unregister_notifier(struct notifier_block *nb, > - unsigned int list) > -{ > - return 0; > -} > -#endif /* !CONFIG_CPU_FREQ */ > > /* > * CPUFREQ GOVERNORS* -- I speak only for myself. Rafael J. Wysocki, Intel Open Source Technology Center. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
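The cpufreq_scale() helper touched by the fixup above performs an "old * mult / div" calculation that has to stay safe on 32-bit architectures. A minimal user-space sketch of the same idea follows, assuming fixed-width types from <stdint.h> in place of the kernel's u64 and do_div(); scale_freq() is a hypothetical name used only for illustration, not the kernel function itself.

```c
#include <stdint.h>

/*
 * User-space sketch of the "old * mult / div" calculation.
 * scale_freq() is a hypothetical stand-in for the kernel helper: it widens
 * to 64 bits before multiplying, so the intermediate product cannot
 * overflow a 32-bit value.
 */
static uint32_t scale_freq(uint32_t old, uint32_t div, uint32_t mult)
{
	uint64_t result = (uint64_t)old * mult;	/* 64-bit intermediate product */

	return (uint32_t)(result / div);	/* caller must ensure div != 0 */
}
```

For example, scaling 1000000 by mult 3000000 and div 2000000 gives an intermediate product of 3 * 10^12, which does not fit in 32 bits, while the final quotient does.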
Re: [PATCH 01/10] cpufreq: Cleanup header files included in core
On Saturday, August 03, 2013 10:45:04 PM Viresh Kumar wrote: > On 3 August 2013 17:19, Viresh Kumar wrote: > > This patch intends to cleanup following issues in the header files included > > in > > cpufreq core layers: > > - Include headers in ascending order, so that we don't add same multiple > > times > > by mistake. > > - must be included after , so that they override whatever > > they > > need. > > - Remove unnecessary header files > > - Don't include files already included by cpufreq.h or cpufreq_governor.h > > > > Signed-off-by: Viresh Kumar > > Fixup due to compilation warning reported by Fengguang's kbuild system: > [Latest stuff pushed in my cpufreq-next branch] > > diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c > index 70399ea..ccaf025 100644 > --- a/drivers/cpufreq/cpufreq.c > +++ b/drivers/cpufreq/cpufreq.c > @@ -19,6 +19,7 @@ > > #include > #include > +#include > #include > #include > #include Can you please repost the complete patch with this folded in? Rafael -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Staging: olpc_dcon: replace some magic numbers
On Sat, Aug 3, 2013 at 11:16 PM, Andres Salomon wrote:
> Please Cc Daniel on these. Cjb and myself are no longer at olpc.
>

Sorry, I've forgotten it. I will update the TODO list.
Re: [PATCH] acpi: video: improve quirk check
On Saturday, August 03, 2013 03:24:16 PM Felipe Contreras wrote:
> On Sat, Aug 3, 2013 at 6:34 AM, Rafael J. Wysocki wrote:
> > On Saturday, August 03, 2013 04:14:04 PM Aaron Lu wrote:
>
> >> Yes, the patch is correct, but I still prefer my own version :-)
> >> https://github.com/aaronlu/linux/commit/0a3d2c5b59caf80ae5bb1ca1fda0f7bf448b38c9
> >>
> >> In case you want to take mine and mine needs refresh, please let me know
> >> and I can do the re-base, thanks.
> >
> > Well, I prefer simpler, unless there's a good reason to use more
> > complicated.
>
> Note that these are not exclusionary; his patch can be applied on top
> of mine. I don't think his patch is needed though.

OK

Do we still need to revert commit efaa14c if this patch is applied?

Rafael
Re: [PATCH] Staging: olpc_dcon: replace some magic numbers
Please Cc Daniel on these. Cjb and myself are no longer at olpc. On Sat, 3 Aug 2013 22:44:35 +0200 Jens Frederich wrote: > This patch replace some magic numbers. I believe it makes > the driver more readable. > > The magic number 0x26 is the XO system embedded controller > (EC) command 'DCON power enable/disable'. > > Number 0x41, and 0x42 are special memory controller settings > register. The 0x41 initialize bit sequence 0x101 means: > enable memory power down function and special SDRAM clock > delay for synchronize SDRAM output and clock signal. > > The 0x42 initialize squence 0x101 is wrong. According to > the specification Bit 8 is reserved, thus not in use. > I removed it. > > Signed-off-by: Jens Frederich > > diff --git a/drivers/staging/olpc_dcon/olpc_dcon.c > b/drivers/staging/olpc_dcon/olpc_dcon.c index 7c460f2..5ca4fa4 100644 > --- a/drivers/staging/olpc_dcon/olpc_dcon.c > +++ b/drivers/staging/olpc_dcon/olpc_dcon.c > @@ -90,9 +90,10 @@ static int dcon_hw_init(struct dcon_priv *dcon, > int is_init) > /* SDRAM setup/hold time */ > dcon_write(dcon, 0x3a, 0xc040); > - dcon_write(dcon, 0x41, 0x); > - dcon_write(dcon, 0x41, 0x0101); > - dcon_write(dcon, 0x42, 0x0101); > + dcon_write(dcon, DCON_REG_MEM_OPT_A, 0x); /* clear > option bits */ > + dcon_write(dcon, DCON_REG_MEM_OPT_A, > + MEM_DLL_CLOCK_DELAY | > MEM_POWER_DOWN); > + dcon_write(dcon, DCON_REG_MEM_OPT_B, MEM_SOFT_RESET); > > /* Colour swizzle, AA, no passthrough, backlight */ > if (is_init) { > @@ -126,7 +127,7 @@ static int dcon_bus_stabilize(struct dcon_priv > *dcon, int is_powered_down) power_up: > if (is_powered_down) { > x = 1; > - x = olpc_ec_cmd(0x26, (unsigned char *)&x, 1, NULL, > 0); > + x = olpc_ec_cmd(EC_DCON_POWER_MODE, (u8 *)&x, 1, > NULL, 0); if (x) { > pr_warn("unable to force dcon to power up: > %d!\n", x); return x; > @@ -144,7 +145,7 @@ power_up: > pr_err("unable to stabilize dcon's smbus, > reasserting power and praying.\n"); > BUG_ON(olpc_board_at_least(olpc_board(0xc2))); x = 0; 
> - olpc_ec_cmd(0x26, (unsigned char *)&x, 1, NULL, 0); > + olpc_ec_cmd(EC_DCON_POWER_MODE, (u8 *)&x, 1, NULL, > 0); msleep(100); > is_powered_down = 1; > goto power_up; /* argh, stupid hardware.. */ > @@ -208,7 +209,7 @@ static void dcon_sleep(struct dcon_priv *dcon, > bool sleep) > if (sleep) { > x = 0; > - x = olpc_ec_cmd(0x26, (unsigned char *)&x, 1, NULL, > 0); > + x = olpc_ec_cmd(EC_DCON_POWER_MODE, (u8 *)&x, 1, > NULL, 0); if (x) > pr_warn("unable to force dcon to power down: > %d!\n", x); else > diff --git a/drivers/staging/olpc_dcon/olpc_dcon.h > b/drivers/staging/olpc_dcon/olpc_dcon.h index 997bded..524ee49 100644 > --- a/drivers/staging/olpc_dcon/olpc_dcon.h > +++ b/drivers/staging/olpc_dcon/olpc_dcon.h > @@ -22,15 +22,24 @@ > #define MODE_DEBUG (1<<14) > #define MODE_SELFTEST(1<<15) > > -#define DCON_REG_HRES2 > -#define DCON_REG_HTOTAL 3 > -#define DCON_REG_HSYNC_WIDTH 4 > -#define DCON_REG_VRES5 > -#define DCON_REG_VTOTAL 6 > -#define DCON_REG_VSYNC_WIDTH 7 > -#define DCON_REG_TIMEOUT 8 > -#define DCON_REG_SCAN_INT9 > -#define DCON_REG_BRIGHT 10 > +#define DCON_REG_HRES0x2 > +#define DCON_REG_HTOTAL 0x3 > +#define DCON_REG_HSYNC_WIDTH 0x4 > +#define DCON_REG_VRES0x5 > +#define DCON_REG_VTOTAL 0x6 > +#define DCON_REG_VSYNC_WIDTH 0x7 > +#define DCON_REG_TIMEOUT 0x8 > +#define DCON_REG_SCAN_INT0x9 > +#define DCON_REG_BRIGHT 0x10 > +#define DCON_REG_MEM_OPT_A 0x41 > +#define DCON_REG_MEM_OPT_B 0x42 > + > +/* Load Delay Locked Loop (DLL) settings for clock delay */ > +#define MEM_DLL_CLOCK_DELAY (1<<0) > +/* Memory controller power down function */ > +#define MEM_POWER_DOWN (1<<8) > +/* Memory controller software reset */ > +#define MEM_SOFT_RESET (1<<0) > > /* Status values */ > > diff --git a/include/linux/olpc-ec.h b/include/linux/olpc-ec.h > index 5bb6e76..2925df3 100644 > --- a/include/linux/olpc-ec.h > +++ b/include/linux/olpc-ec.h > @@ -6,6 +6,7 @@ > #define EC_WRITE_SCI_MASK0x1b > #define EC_WAKE_UP_WLAN 0x24 > #define EC_WLAN_LEAVE_RESET 0x25 > 
+#define EC_DCON_POWER_MODE 0x26 > #define EC_READ_EB_MODE 0x2a > #define EC_SET_SCI_INHIBIT 0x32 > #define EC_SET_SCI_INHIBIT_RELEASE 0x34 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] Staging: olpc_dcon: Already completed TODO entry removed
The TODO entry - drop global variables, use a proper olpc_dcon_priv
struct - is already finished. The driver has no global variables. It
uses the private structure 'dcon_priv'.

Signed-off-by: Jens Frederich

diff --git a/drivers/staging/olpc_dcon/TODO b/drivers/staging/olpc_dcon/TODO
index f378e84..886e46e 100644
--- a/drivers/staging/olpc_dcon/TODO
+++ b/drivers/staging/olpc_dcon/TODO
@@ -3,7 +3,6 @@ TODO:
 	  share more code
 	- allow simultaneous XO-1 and XO-1.5 support
 	- console event notifier support
-	- drop global variables, use a proper olpc_dcon_priv struct
 	- audit code for unnecessary code; old unsupported prototype workarounds,
 	  ancient variables (noaa?), etc
 	- verify sane i2c API usage, update to new stuff if necessary
--
1.7.9.5
[PATCH] Staging: olpc_dcon: replace some magic numbers
This patch replace some magic numbers. I believe it makes the driver more readable. The magic number 0x26 is the XO system embedded controller (EC) command 'DCON power enable/disable'. Number 0x41, and 0x42 are special memory controller settings register. The 0x41 initialize bit sequence 0x101 means: enable memory power down function and special SDRAM clock delay for synchronize SDRAM output and clock signal. The 0x42 initialize squence 0x101 is wrong. According to the specification Bit 8 is reserved, thus not in use. I removed it. Signed-off-by: Jens Frederich diff --git a/drivers/staging/olpc_dcon/olpc_dcon.c b/drivers/staging/olpc_dcon/olpc_dcon.c index 7c460f2..5ca4fa4 100644 --- a/drivers/staging/olpc_dcon/olpc_dcon.c +++ b/drivers/staging/olpc_dcon/olpc_dcon.c @@ -90,9 +90,10 @@ static int dcon_hw_init(struct dcon_priv *dcon, int is_init) /* SDRAM setup/hold time */ dcon_write(dcon, 0x3a, 0xc040); - dcon_write(dcon, 0x41, 0x); - dcon_write(dcon, 0x41, 0x0101); - dcon_write(dcon, 0x42, 0x0101); + dcon_write(dcon, DCON_REG_MEM_OPT_A, 0x); /* clear option bits */ + dcon_write(dcon, DCON_REG_MEM_OPT_A, + MEM_DLL_CLOCK_DELAY | MEM_POWER_DOWN); + dcon_write(dcon, DCON_REG_MEM_OPT_B, MEM_SOFT_RESET); /* Colour swizzle, AA, no passthrough, backlight */ if (is_init) { @@ -126,7 +127,7 @@ static int dcon_bus_stabilize(struct dcon_priv *dcon, int is_powered_down) power_up: if (is_powered_down) { x = 1; - x = olpc_ec_cmd(0x26, (unsigned char *)&x, 1, NULL, 0); + x = olpc_ec_cmd(EC_DCON_POWER_MODE, (u8 *)&x, 1, NULL, 0); if (x) { pr_warn("unable to force dcon to power up: %d!\n", x); return x; @@ -144,7 +145,7 @@ power_up: pr_err("unable to stabilize dcon's smbus, reasserting power and praying.\n"); BUG_ON(olpc_board_at_least(olpc_board(0xc2))); x = 0; - olpc_ec_cmd(0x26, (unsigned char *)&x, 1, NULL, 0); + olpc_ec_cmd(EC_DCON_POWER_MODE, (u8 *)&x, 1, NULL, 0); msleep(100); is_powered_down = 1; goto power_up; /* argh, stupid hardware.. 
*/ @@ -208,7 +209,7 @@ static void dcon_sleep(struct dcon_priv *dcon, bool sleep) if (sleep) { x = 0; - x = olpc_ec_cmd(0x26, (unsigned char *)&x, 1, NULL, 0); + x = olpc_ec_cmd(EC_DCON_POWER_MODE, (u8 *)&x, 1, NULL, 0); if (x) pr_warn("unable to force dcon to power down: %d!\n", x); else diff --git a/drivers/staging/olpc_dcon/olpc_dcon.h b/drivers/staging/olpc_dcon/olpc_dcon.h index 997bded..524ee49 100644 --- a/drivers/staging/olpc_dcon/olpc_dcon.h +++ b/drivers/staging/olpc_dcon/olpc_dcon.h @@ -22,15 +22,24 @@ #define MODE_DEBUG (1<<14) #define MODE_SELFTEST (1<<15) -#define DCON_REG_HRES 2 -#define DCON_REG_HTOTAL3 -#define DCON_REG_HSYNC_WIDTH 4 -#define DCON_REG_VRES 5 -#define DCON_REG_VTOTAL6 -#define DCON_REG_VSYNC_WIDTH 7 -#define DCON_REG_TIMEOUT 8 -#define DCON_REG_SCAN_INT 9 -#define DCON_REG_BRIGHT10 +#define DCON_REG_HRES 0x2 +#define DCON_REG_HTOTAL0x3 +#define DCON_REG_HSYNC_WIDTH 0x4 +#define DCON_REG_VRES 0x5 +#define DCON_REG_VTOTAL0x6 +#define DCON_REG_VSYNC_WIDTH 0x7 +#define DCON_REG_TIMEOUT 0x8 +#define DCON_REG_SCAN_INT 0x9 +#define DCON_REG_BRIGHT0x10 +#define DCON_REG_MEM_OPT_A 0x41 +#define DCON_REG_MEM_OPT_B 0x42 + +/* Load Delay Locked Loop (DLL) settings for clock delay */ +#define MEM_DLL_CLOCK_DELAY(1<<0) +/* Memory controller power down function */ +#define MEM_POWER_DOWN (1<<8) +/* Memory controller software reset */ +#define MEM_SOFT_RESET (1<<0) /* Status values */ diff --git a/include/linux/olpc-ec.h b/include/linux/olpc-ec.h index 5bb6e76..2925df3 100644 --- a/include/linux/olpc-ec.h +++ b/include/linux/olpc-ec.h @@ -6,6 +6,7 @@ #define EC_WRITE_SCI_MASK 0x1b #define EC_WAKE_UP_WLAN0x24 #define EC_WLAN_LEAVE_RESET0x25 +#define EC_DCON_POWER_MODE 0x26 #define EC_READ_EB_MODE0x2a #define EC_SET_SCI_INHIBIT 0x32 #define EC_SET_SCI_INHIBIT_RELEASE 0x34 -- 1.7.9.5 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at 
http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
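As a quick check of the register values discussed in the patch above, the following sketch composes the option bits and compares them with the literals they replace. It is plain user-space C with the macro values copied from the patch; the mem_opt_a_init() and mem_opt_b_init() helpers are hypothetical names introduced only for this illustration. Separately, note that decimal 10 and 0x10 are different numbers, so the DCON_REG_BRIGHT hunk appears to change the register index and not just its notation.

```c
#include <stdint.h>

/* Bit definitions copied from the patch. */
#define MEM_DLL_CLOCK_DELAY	(1 << 0)
#define MEM_POWER_DOWN		(1 << 8)
#define MEM_SOFT_RESET		(1 << 0)

/* Hypothetical helper: value written to DCON_REG_MEM_OPT_A (0x41). */
static uint16_t mem_opt_a_init(void)
{
	/* 0x0001 | 0x0100, i.e. the old literal 0x0101 */
	return MEM_DLL_CLOCK_DELAY | MEM_POWER_DOWN;
}

/* Hypothetical helper: value written to DCON_REG_MEM_OPT_B (0x42). */
static uint16_t mem_opt_b_init(void)
{
	/* Only bit 0 is set; the patch drops bit 8, described as reserved. */
	return MEM_SOFT_RESET;
}
```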
Re: sparc64 WARNING: at mm/mmap.c:2757 exit_mmap+0x13c/0x160()
From: Aaro Koskinen
Date: Mon, 17 Jun 2013 08:58:39 +0300

> On Mon, Jun 17, 2013 at 08:32:25AM +0300, Aaro Koskinen wrote:
>> On Mon, Jun 17, 2013 at 12:06:00AM +0300, Meelis Roos wrote:
>> > Got this in 3.10-rc6 while testing debian unstable upgrade with aptitude.
>> > 3.10-rc5 did not exhibit this (nor any other kernel recently tried,
>> > including most -rc's). Does not seem to be reproducible.
>>
>> I get this regularly on Ultrasparc during long compilations. It's been
>> there with all recent kernels (probably at least since 3.8). Latest I
>> saw with 3.10-rc5.
>
> Two examples:

Thanks for the reports, I'm actively looking into this.
Re: [PATCH 1/3] acpi: video: trivial costmetic cleanups
On Sat, Aug 3, 2013 at 6:38 AM, Rafael J. Wysocki wrote: > On Friday, August 02, 2013 08:34:29 PM Felipe Contreras wrote: >> On Fri, Aug 2, 2013 at 7:07 PM, Rafael J. Wysocki wrote: >> > On Friday, August 02, 2013 12:52:18 PM Felipe Contreras wrote: >> >> On Fri, Aug 2, 2013 at 9:05 AM, Rafael J. Wysocki wrote: >> >> > On Thursday, August 01, 2013 11:15:38 PM Felipe Contreras wrote: >> >> >> On Thu, Aug 1, 2013 at 8:50 PM, Aaron Lu wrote: >> >> >> > On 08/02/2013 07:43 AM, Felipe Contreras wrote: >> >> >> >> Signed-off-by: Felipe Contreras >> >> >> > >> >> >> > Please add change log explaining what you have changed. >> >> >> > It seems that the patch modify comment style only, some add a space >> >> >> > and >> >> >> > some change spaces to tab, is it the case? >> >> >> >> >> >> The commit message already explains what the change is; trivial >> >> >> cosmetic cleanups. Cosmetic means it's completely superficial. >> >> > >> >> > And I have a rule not to apply patches without changelogs. So either >> >> > I'll >> >> > need to write it for you, or can you add one pretty please? >> >> >> >> The commit message is right there. Maybe Jiri can apply it then, if >> >> not, then stay happy with your untidy code. >> > >> > First of all, I didn't say I wouldn't apply the patch, did I? >> > >> > Second, I asked you *nicely* to add a changelog so that I don't need to >> > write >> > it for you. >> > >> > I don't know what made it difficult to understand. >> > >> > Anyway, I ask everyone to write changelogs and nobody has had any problems >> > with >> > that so far. I don't see why I should avoid asking you to follow the rules >> > that everybody else is asked to follow. If those rules are too difficult >> > for >> > you to follow, I'm sorry. >> >> The patch has a commit message that describes exactly what it does. > > No, it doesn't describe it exactly. You're contradicting facts. > >> Unless there is valid feedback I will not send another version. 
>>
>> To me, a valid criticism to the commit message would be: "I read X,
>> but I thought it would do Y". For example; "I didn't expect the patch
>> to do white-space cleanups", but I think that's exactly what people
>> expect when they read "trivial costmetic cleanups'.
>
> If what you're saying was correct, then it would be sufficient to use a
> "this patch fixes a bug" commit message for every bug fix, but quite evidently
> that is not the case.

No, it wouldn't be sufficient, take a look at Corbet's list you yourself
mentioned:

* the original motivation for the work is quickly forgotten

"this patch fixes a bug" doesn't describe the motivation.

* Andrew Morton also famously pushes developers to document the reasons
  explaining why a patch was written, including the user-visible effects of
  any bugs fixed

The reason for the patch is not documented, nor the user-visible effects.

* Kernel developers do not like having to reverse engineer the intent of a
  patch years after the fact.

The intent of the patch is not mentioned.

That is completely different with my patch. Personally I like to answer
these questions:

What is the patch doing (motivation)?
What is the current problem?
What is the change?
What are the side-effects?

All those are clear with "trivial costmetic cleanups", they are not with
"this patch fixes a bug".

I think you are committing a hasty generalization fallacy. Not all patches
are the same.

--
Felipe Contreras
Re: [PATCH] acpi: video: improve quirk check
On Sat, Aug 3, 2013 at 6:34 AM, Rafael J. Wysocki wrote:
> On Saturday, August 03, 2013 04:14:04 PM Aaron Lu wrote:

>> Yes, the patch is correct, but I still prefer my own version :-)
>> https://github.com/aaronlu/linux/commit/0a3d2c5b59caf80ae5bb1ca1fda0f7bf448b38c9
>>
>> In case you want to take mine and mine needs refresh, please let me know
>> and I can do the re-base, thanks.
>
> Well, I prefer simpler, unless there's a good reason to use more complicated.

Note that these are not exclusionary; his patch can be applied on top
of mine. I don't think his patch is needed though.

--
Felipe Contreras
Re: [GIT PULL] ACPI and power management fixes for v3.11-rc4
On Sat, Aug 3, 2013 at 11:08 AM, Igor Gnatenko wrote:
> I am opposed to this patch. On ThinkPad X230 I had problems with it.
> Felipe, come over to dark side. They have cookies.

And v3.11-rc3 works fine out of the box?

--
Felipe Contreras
Re: [GIT PULL] ACPI and power management fixes for v3.11-rc4
On Sat, Aug 3, 2013 at 6:54 AM, Rafael J. Wysocki wrote: > On Friday, August 02, 2013 08:48:09 PM Felipe Contreras wrote: >> Yes, that's fine, either the revert, or the patch I mentioned, or >> something else, but something has to be done, and it was better to do >> it in v3.11-rc4 than in v3.11-rc5, because that change itself can >> cause further problems. > > A revert can be done in -rc5 just fine. If we don't have a working fix this > week, I'll do the revert. I think you are waiting for miracles. But whatever. >> Let's do a though experiment, let's say you are right, and they can >> survive the 6 months it would take you to find the "perfect" solution, >> say in v3.13. What's wrong with having a partial solution in v3.12? If >> the blacklisting doesn't work properly (there's absolutely no evidence >> for that), then you revert it on v3.12.1. >> >> What's wrong with that approach? > > If the blacklisting leads to problems, they may not be reported in the 3.12 > time frame, but much later. For example because people won't realize that > the problems are caused by the blacklisting until much much later. And then > we'll be in a spot where whatever we do will break things for someone. The key word is *may*, you don't *know*. Why do you insist in committing this reification fallacy? This threat is not real, it's theoretical. But let's suppose you are right, and there are issues, and those don't get reported in v3.12. That is actually GOOD, if people don't report issues, it means the issues are not that big, or not even *there*. And no, we won't be in a spot where whatever we do will break things, because if the intel backlight driver works correctly, that solution would work for everyone. And if it doesn't, we should stay with what works best. > And we had situations like that in the past, which is the source of my > concern. > You obviously don't have that experience, or you won't be so eager to inflict > the blacklisting on everyone. 
> > Anyway, as you know, but conveniently don't mention, I asked some experienced > people for opinions about that. If they agree with you, we will add the > blacklist. If they don't, we won't add it. Again, screw the users. We are stuck with broken backlight for several more months to come. Great. -- Felipe Contreras -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH resend] drop_caches: add some documentation and info message
>>> You missed the "!". I'm proposing that setting the new bit 2 will
>>> permit people to prevent the new printk if it is causing them problems.
>>
>> No I don't. I'm sure almost all abuse users think our usage is correct. Then,
>> I can imagine all crazy applications start to use this flag eventually.
>
> I guess we do not care about those. If somebody wants to shoot his feet
> then we cannot do much about it. The primary motivation was to find out
> those that think this is right and they are willing to change the setup
> once they know this is not the right way to do things.
>
> I think that giving a way to suppress the warning is a good step. Log
> level might be too coarse and sysctl would be an overkill.

When Dave Hansen originally reported this issue, he explained that a lot of
userland developers misuse /proc/drop_caches because they don't understand
what drop_caches does. So, if they never understand that fact, why can we
trust them? I have no idea.

Or, if you have a different motivation from Dave's, please let me know.

While the purpose is to shoot down misuse, I don't think we can trust
userland apps. If "If somebody wants to shoot his feet then we cannot do
much about it." is true, this patch is useless. OK, we would still catch the
right users. But we never want to know who the right users are, right?
Re: [RFC v2 4/5] Only set page reserved in the memblock region
On Fri, Aug 02, 2013 at 12:44:26PM -0500, Nathan Zimmer wrote: > Currently we when we initialze each page struct is set as reserved upon > initialization. This changes to starting with the reserved bit clear and > then only setting the bit in the reserved region. > > I could restruture a bit to eliminate the perform hit. But I wanted to make > sure I am on track first. > > Signed-off-by: Robin Holt > Signed-off-by: Nathan Zimmer > To: "H. Peter Anvin" > To: Ingo Molnar > Cc: Linux Kernel > Cc: Linux MM > Cc: Rob Landley > Cc: Mike Travis > Cc: Daniel J Blueman > Cc: Andrew Morton > Cc: Greg KH > Cc: Yinghai Lu > Cc: Mel Gorman > --- > include/linux/mm.h | 2 ++ > mm/nobootmem.c | 3 +++ > mm/page_alloc.c| 16 > 3 files changed, 17 insertions(+), 4 deletions(-) > > diff --git a/include/linux/mm.h b/include/linux/mm.h > index e0c8528..b264a26 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -1322,6 +1322,8 @@ static inline void adjust_managed_page_count(struct > page *page, long count) > totalram_pages += count; > } > > +extern void reserve_bootmem_region(unsigned long start, unsigned long end); > + > /* Free the reserved page into the buddy system, so it gets managed. 
*/ > static inline void __free_reserved_page(struct page *page) > { > diff --git a/mm/nobootmem.c b/mm/nobootmem.c > index 2159e68..0840af2 100644 > --- a/mm/nobootmem.c > +++ b/mm/nobootmem.c > @@ -117,6 +117,9 @@ static unsigned long __init > free_low_memory_core_early(void) > phys_addr_t start, end, size; > u64 i; > > + for_each_reserved_mem_region(i, &start, &end) > + reserve_bootmem_region(start, end); > + > for_each_free_mem_range(i, MAX_NUMNODES, &start, &end, NULL) > count += __free_memory_core(start, end); > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index df3ec13..382223e 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -697,17 +697,18 @@ static void free_one_page(struct zone *zone, struct > page *page, int order, > spin_unlock(&zone->lock); > } > > -static void __init_single_page(unsigned long pfn, unsigned long zone, int > nid) > +static void __init_single_page(unsigned long pfn, unsigned long zone, > +int nid, int page_count) > { > struct page *page = pfn_to_page(pfn); > struct zone *z = &NODE_DATA(nid)->node_zones[zone]; > > set_page_links(page, zone, nid, pfn); > mminit_verify_page_links(page, zone, nid, pfn); > - init_page_count(page); > page_mapcount_reset(page); > page_nid_reset_last(page); > - SetPageReserved(page); > + set_page_count(page, page_count); > + ClearPageReserved(page); > > /* >* Mark the block movable so that blocks are reserved for > @@ -736,6 +737,13 @@ static void __init_single_page(unsigned long pfn, > unsigned long zone, int nid) > #endif > } > > +void reserve_bootmem_region(unsigned long start, unsigned long end) > +{ > + for (; start < end; start++) > + if (pfn_valid(start)) > + SetPageReserved(pfn_to_page(start)); > +} > + > static bool free_pages_prepare(struct page *page, unsigned int order) > { > int i; > @@ -4010,7 +4018,7 @@ void __meminit memmap_init_zone(unsigned long size, int > nid, unsigned long zone, > if (!early_pfn_in_nid(pfn, nid)) > continue; > } > - __init_single_page(pfn, zone, nid); > + 
__init_single_page(pfn, zone, nid, 1);
> 	}
> }
>
> --
> 1.8.2.1
>

Actually I believe reserve_bootmem_region is wrong. I am passing in
phys_addr_t and not pfns.

It should be:
void reserve_bootmem_region(unsigned long start, unsigned long end)
{
	unsigned long start_pfn = PFN_DOWN(start);
	unsigned long end_pfn = PFN_UP(end);

	for (; start_pfn < end_pfn; start_pfn++)
		if (pfn_valid(start_pfn))
			SetPageReserved(pfn_to_page(start_pfn));
}

That also brings the timings back in line with the previous patch set.

Nate
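The correction above hinges on converting physical addresses to page frame numbers before walking the region. A minimal user-space sketch of that rounding follows, assuming 4 KiB pages (a shift of 12); the SKETCH_* macros and the lowercase pfn_down(), pfn_up() and pfns_in_region() helpers are hypothetical stand-ins for the kernel's PAGE_SHIFT, PFN_DOWN and PFN_UP, not the kernel definitions themselves.

```c
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12	/* assumption: 4 KiB pages */
#define SKETCH_PAGE_SIZE  (UINT64_C(1) << SKETCH_PAGE_SHIFT)

/* Round a physical address down to the frame that contains it. */
static uint64_t pfn_down(uint64_t addr)
{
	return addr >> SKETCH_PAGE_SHIFT;
}

/* Round a physical address up to the next frame boundary. */
static uint64_t pfn_up(uint64_t addr)
{
	return (addr + SKETCH_PAGE_SIZE - 1) >> SKETCH_PAGE_SHIFT;
}

/* Number of page frames touched by the byte range [start, end). */
static uint64_t pfns_in_region(uint64_t start, uint64_t end)
{
	return pfn_up(end) - pfn_down(start);
}
```

Rounding the start down and the end up means a region that only partly covers its first or last page still marks those pages, which matches the loop bounds in the corrected function.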
[GIT] Networking
1) Don't ignore user initiated wireless regulatory settings on cards
   with custom regulatory domains, from Arik Nemtsov.

2) Fix length check of bluetooth information responses, from Jaganath
   Kanakkassery.

3) Fix misuse of PTR_ERR in btusb, from Adam Lee.

4) Handle rfkill properly while iwlwifi devices are offline, from
   Emmanuel Grumbach.

5) Fix r815x devices DMA'ing to stack buffers, from Hayes Wang.

6) Kernel info leak in ATM packet scheduler, from Dan Carpenter.

7) 8139cp doesn't check for DMA mapping errors, from Neil Horman.

8) Fix bridge multicast code to not snoop when no querier exists,
   otherwise multicast traffic is lost. From Linus Lüssing.

9) Avoid soft lockups in fib6_run_gc(), from Michal Kubecek.

10) Fix races in automatic address assignment on ipv6, which can result
    in incorrect lifetime assignments. From Jiri Benc.

11) Cure build bustage when CONFIG_NET_LL_RX_POLL is not set and
    rename it CONFIG_NET_RX_BUSY_POLL to eliminate the last reference
    to the original naming of this feature. From Cong Wang.

12) Fix crash in TIPC when server socket creation fails, from Ying Xue.

13) macvlan_changelink() silently succeeds when it shouldn't, from
    Michael S. Tsirkin.

14) HTB packet scheduler can crash due to sign extension, fix from
    Stephen Hemminger.

15) With the cable unplugged, r8169 prints out a message every 10
    seconds, make it netif_dbg() instead of netif_warn(). From Peter Wu.

16) Fix memory leak in rtm_to_ifaddr(), from Daniel Borkmann.

17) sis900 gets spurious TX queue timeouts due to mismanagement of
    link carrier state, from Denis Kirjanov.

18) Validate somaxconn sysctl to make sure it fits inside of a u16.
    From Roman Gushchin.

19) Fix MAC address filtering on qlcnic, from Shahed Shaikh.

Please pull, thanks a lot!
The following changes since commit 06693f305e60202d2795a10bee7fb7da23bc2acc:

  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net (2013-07-31 12:56:18 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/davem/net master

for you to fetch changes up to 4bd8e7385961932d863ea976a67f384c3a8302cb:

  qlcnic: Fix for flash update failure on 83xx adapter (2013-08-03 12:03:04 -0700)

AceLan Kao (2):
      Bluetooth: Add support for Atheros [0cf3:3121]
      Bluetooth: Add support for Atheros [0cf3:e003]

Adam Lee (1):
      Bluetooth: fix wrong use of PTR_ERR() in btusb

Arend van Spriel (1):
      brcmfmac: inform cfg80211 about disconnect when device is unplugged

Arik Nemtsov (1):
      regulatory: use correct regulatory initiator on wiphy register

Avinash Patil (2):
      mwifiex: check for bss_role instead of bss_mode for STA operations
      mwifiex: fix wrong data rates in P2P client

Chun-Yeow Yeoh (1):
      mac80211: prevent the buffering or frame transmission to non-assoc mesh STA

Cong Wang (2):
      net: fix a compile error when CONFIG_NET_LL_RX_POLL is not set
      net: rename CONFIG_NET_LL_RX_POLL to CONFIG_NET_RX_BUSY_POLL

Dan Carpenter (1):
      net_sched: info leak in atm_tc_dump_class()

Daniel Borkmann (1):
      net: rtm_to_ifaddr: free ifa if ifa_cacheinfo processing fails

David S. Miller (1):
      Merge branch 'for-davem' of git://git.kernel.org/.../linville/wireless

David Spinadel (1):
      iwlwifi: mvm: set SSID bits for passive channels

Denis Kirjanov (1):
      sis900: Fix the tx queue timeout issue

Emmanuel Grumbach (3):
      iwlwifi: add DELL SKU for 5150 HMC
      iwlwifi: pcie: reset the NIC before the bring up
      iwlwifi: pcie: clear RFKILL interrupt in AMPG

Felipe Balbi (1):
      net: ethernet: cpsw: drop IRQF_DISABLED

Frederic Danis (1):
      NFC: Fix NCI over SPI build

Geert Uytterhoeven (1):
      ath10k: ATH10K should depend on HAS_DMA

Gustavo Padovan (1):
      Bluetooth: Fix race between hci_register_dev() and hci_dev_open()

Himanshu Madhani (2):
      qlcnic: Free up memory in error path.
      qlcnic: Fix for flash update failure on 83xx adapter

Ilan Peer (1):
      iwlwifi: mvm: Disable managed PS when GO is added

Jack Morgenstein (1):
      net/mlx4_core: VFs must ignore the enable_64b_cqe_eqe module param

Jaganath Kanakkassery (1):
      Bluetooth: Fix invalid length check in l2cap_information_rsp()

Jiri Benc (1):
      ipv6: prevent race between address creation and removal

Jiri Pirko (1):
      ipv6: move peer_addr init into ipv6_add_addr()

Joe Perches (1):
      ndisc: Add missing inline to ndisc_addr_option_pad

Johan Hedberg (2):
      Bluetooth: Fix HCI init for BlueFRITZ! devices
      Bluetooth: Fix calling request callback more than once

Johannes Berg (2):
      iwlwifi: mvm: use only a single GTK in D3
      iwlwifi: mvm: fix flushing not started aggregation sessions

John W. Linville (6):
      Merge branch 'for-linville-current' of git://github.com/kvalo/ath
      Merge branch 'for-john' of git://git.kernel.org
Re: [PATCH V2 1/9] perf tools: add test for reading object code
On 31/07/2013 8:46 p.m., Arnaldo Carvalho de Melo wrote:
On Wed, Jul 31, 2013 at 02:28:33PM -0300, Arnaldo Carvalho de Melo wrote:

Still investigating, but the attached patch is needed to handle such
failure cases:

[root@zoo ~]# perf test 21
21: Test object code reading : FAILED!
[root@zoo ~]# perf test -v 21

Lowering the freq to 4kHz gets me to where I think you were at this
point:

[root@zoo ~]# perf test -v 21
21: Test object code reading :
--- start ---
Looking at the vmlinux_path (6 entries long)
symsrc__init: cannot get elf header.
Using /lib/modules/3.11.0-rc2+/build/vmlinux for symbols
Parsing event 'cycles'
Reading object code for memory address: 0x8101ce7d
File is: /lib/modules/3.11.0-rc2+/build/vmlinux
On file address is: 0x8101ce7d
dso__data_read_offset failed
end
Test object code reading: FAILED!
[root@zoo ~]#

I.e. we need the follow on patches to fix this issue, right?

Yes. It is using an "identity" map, so the memory address and the
on-file address are the same, which of course doesn't work.

I'll merge my changes with your first patch and continue from there.

Please take V3.
Re: [PATCH] Add per-process flag to control thp
On Fri, Aug 2, 2013 at 1:34 PM, Alex Thorlton wrote:
>> What kind of workloads are you talking about?
>
> Our benchmarking team has a list of several of the SPEC OMP benchmarks
> that perform significantly better when THP is disabled. I tried to get
> the list but one of our servers is acting up and I can't get to it
> right now :/
>
>> What's wrong with madvise? Could you elaborate?
>
> The main issue with using madvise is that it's not an option with static
> binaries, but there are also some users who have legacy Fortran code
> that they're not willing/able to change.
>
>> And I think thp_disabled should be reset to 0 on exec.
>
> The main purpose for this getting carried down from the parent process
> is that we'd like to be able to have a userland program set this flag on
> itself, and then spawn off children who will also carry the flag.
> This allows us to set the flag for programs where we're unable to modify
> the code, thus resolving the issues detailed above.

This could be done with LD_PRELOAD for uncontrolled binaries. Though it
does seem sensible to make it act like most personality flags do (i.e.
inherited).

-Kees

-- 
Kees Cook
Chrome OS Security
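For reference, the madvise route mentioned in the thread looks roughly like the sketch below. It is an illustration only: map_without_thp() is a made-up helper, not part of the patch under discussion, and it shows why this approach needs source changes in the program that allocates the memory. madvise() with MADV_NOHUGEPAGE is the existing per-range interface; the per-process flag proposed in the patch is not shown because its interface does not appear in this message.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Map len bytes of anonymous memory and ask the kernel not to back the
 * range with transparent huge pages.
 *
 * Returns the mapping, or NULL if mmap() fails. *advice_err receives 0,
 * or the errno reported by madvise(); EINVAL is expected on kernels
 * built without THP support.
 */
static void *map_without_thp(size_t len, int *advice_err)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    if (p == MAP_FAILED)
        return NULL;

    *advice_err = 0;
    if (madvise(p, len, MADV_NOHUGEPAGE) != 0)
        *advice_err = errno;

    return p;
}
```

The advice applies only to the range passed in, so every allocation site has to opt out separately, which is the limitation raised above for static binaries and unmodifiable legacy code.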
Re: [PATCH v2 2/3] exec: kill "int depth" in search_binary_handler()
On Sat, Aug 3, 2013 at 11:55 AM, Oleg Nesterov wrote: > On 08/03, Kees Cook wrote: >> >> On Thu, Aug 1, 2013 at 12:05 PM, Oleg Nesterov wrote: >> > Nobody except search_binary_handler() should touch ->recursion_depth, >> > "int depth" buys nothing but complicates the code, kill it. >> >> I'd like to see a comment added to binfmts.h's recursion_depth field >> that reminds people that recursion_depth is for >> search_binary_handler()'s use only, and a binfmt loader shouldn't >> touch it. > > And this comment probably makes sense even without this change Yeah totally agreed -- I should have added this when I reorganized the depth handling earlier. :) >> Besides that, yeah, sensible clean up. > > OK, thanks, please see v2. The only change is the comment in .h > > -- > Subject: [PATCH 1/1] exec: kill "int depth" in search_binary_handler() > > Nobody except search_binary_handler() should touch ->recursion_depth, > "int depth" buys nothing but complicates the code, kill it. > > Probably we should also kill "fn" and the !NULL check, ->load_binary > should be always defined. And it can not go away after read_unlock() > or this code is buggy anyway. > > v2: add the comment about linux_binprm->recursion_depth > > Signed-off-by: Oleg Nesterov Acked-by: Kees Cook Thanks, -Kees > --- > fs/exec.c |9 - > include/linux/binfmts.h |2 +- > 2 files changed, 5 insertions(+), 6 deletions(-) > > diff --git a/fs/exec.c b/fs/exec.c > index a9ae4f2..f32079c 100644 > --- a/fs/exec.c > +++ b/fs/exec.c > @@ -1370,12 +1370,11 @@ EXPORT_SYMBOL(remove_arg_zero); > */ > int search_binary_handler(struct linux_binprm *bprm) > { > - unsigned int depth = bprm->recursion_depth; > - int try,retval; > + int try, retval; > struct linux_binfmt *fmt; > > /* This allows 4 levels of binfmt rewrites before failing hard. 
*/ > - if (depth > 5) > + if (bprm->recursion_depth > 5) > return -ELOOP; > > retval = security_bprm_check(bprm); > @@ -1396,9 +1395,9 @@ int search_binary_handler(struct linux_binprm *bprm) > if (!try_module_get(fmt->module)) > continue; > read_unlock(&binfmt_lock); > - bprm->recursion_depth = depth + 1; > + bprm->recursion_depth++; > retval = fn(bprm); > - bprm->recursion_depth = depth; > + bprm->recursion_depth--; > if (retval >= 0) { > put_binfmt(fmt); > allow_write_access(bprm->file); > diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h > index 70cf138..e8112ae 100644 > --- a/include/linux/binfmts.h > +++ b/include/linux/binfmts.h > @@ -31,7 +31,7 @@ struct linux_binprm { > #ifdef __alpha__ > unsigned int taso:1; > #endif > - unsigned int recursion_depth; > + unsigned int recursion_depth; /* only for search_binary_handler() */ > struct file * file; > struct cred *cred; /* new credentials */ > int unsafe; /* how unsafe this exec is (mask of > LSM_UNSAFE_*) */ > -- > 1.5.5.1 > > -- Kees Cook Chrome OS Security -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/5] exec: more cleanups
On Fri, Aug 2, 2013 at 12:27 PM, Oleg Nesterov wrote:
> On top of "[PATCH 0/3] exec: minor cleanups + minor fix" I sent
> yesterday.
>
> Perhaps too many patches for the poor search_binary_handler(),
> but I do not know how to document the changes if I join them.
>
> Oleg.
>
>  fs/exec.c |   82 ++--
>  1 files changed, 36 insertions(+), 46 deletions(-)
>

This all looks really good. Thanks for the cleanups! Besides the one
comment on patch 1, consider the series:

Acked-by: Kees Cook

Thanks,

-Kees

-- 
Kees Cook
Chrome OS Security
Re: [PATCH 1/5] exec: move allow_write_access/fput to exec_binprm()
On Fri, Aug 2, 2013 at 12:27 PM, Oleg Nesterov wrote:
> When search_binary_handler() succeeds it does allow_write_access()
> and fput(), then it clears bprm->file to ensure the caller will not
> do the same.
>
> We can simply move this code to exec_binprm() which is called only
> once. In fact we could move this to free_bprm() and remove the same
> code in do_execve_common's error path.
>
> Signed-off-by: Oleg Nesterov
> ---
>  fs/exec.c |    9 +
>  1 files changed, 5 insertions(+), 4 deletions(-)
>
> diff --git a/fs/exec.c b/fs/exec.c
> index ad7d624..ef70320 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1400,10 +1400,6 @@ int search_binary_handler(struct linux_binprm *bprm)
>  			bprm->recursion_depth--;
>  			if (retval >= 0) {
>  				put_binfmt(fmt);
> -				allow_write_access(bprm->file);
> -				if (bprm->file)
> -					fput(bprm->file);
> -				bprm->file = NULL;
>  				return retval;
>  			}
>  			read_lock(&binfmt_lock);
> @@ -1455,6 +1451,11 @@ static int exec_binprm(struct linux_binprm *bprm)
>  		ptrace_event(PTRACE_EVENT_EXEC, old_vpid);
>  		current->did_exec = 1;
>  		proc_exec_connector(current);
> +
> +		if (bprm->file) {
> +			allow_write_access(bprm->file);
> +			fput(bprm->file);
> +		}

Why not keep the bprm->file = NULL assignment? Seems reasonable to keep
that just to avoid use-after-free accidents.

-Kees

>  	}
>
>  	return ret;
> --
> 1.5.5.1
>

-- 
Kees Cook
Chrome OS Security
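The point about keeping the NULL assignment can be shown with a small user-space sketch. The sketch_* types and helpers are hypothetical stand-ins for struct file, struct linux_binprm and fput(); this is not the kernel code, only the pattern: clear the pointer once the reference is dropped, so a second cleanup path cannot release it again.

```c
#include <stddef.h>

/* Hypothetical stand-ins for struct file and struct linux_binprm. */
struct sketch_file {
    int refcount;
};

struct sketch_binprm {
    struct sketch_file *file;
};

/* Drop one reference; stands in for fput(). */
static void sketch_put(struct sketch_file *f)
{
    f->refcount--;
}

/*
 * Release bprm->file at most once. Clearing the pointer afterwards means
 * a later cleanup path sees NULL and skips the release instead of
 * touching a file it no longer owns.
 */
static void sketch_release_file(struct sketch_binprm *bprm)
{
    if (bprm->file) {
        sketch_put(bprm->file);
        bprm->file = NULL;
    }
}
```

Calling sketch_release_file() twice drops the reference only once, which is the kind of accident the retained assignment would guard against.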
Re: [PATCH] perf tools: Renaming 'time' variable in perf_time_to_tsc due to name shadowing error
On 2/08/2013 4:33 p.m., Jiri Olsa wrote:

The perf compilation fails with the following error:

  ...
  CC arch/x86/util/tsc.o
  arch/x86/util/tsc.c: In function ‘perf_time_to_tsc’:
  arch/x86/util/tsc.c:13:6: error: declaration of ‘time’ shadows a global declaration [-Werror=shadow]
  cc1: all warnings being treated as errors

Renaming the 'time' variable to prevent this.

Did you see that David sent the same patch? David also noted the gcc
version: it doesn't happen with gcc 4.7.3. The commit message should
probably reflect that it depends on the gcc version. Otherwise:

Acked-by: Adrian Hunter

Signed-off-by: Jiri Olsa
Cc: Arnaldo Carvalho de Melo
Cc: Corey Ashford
Cc: Ingo Molnar
Cc: Paul Mackerras
Cc: Peter Zijlstra
Cc: Arnaldo Carvalho de Melo
Cc: David Ahern
Cc: Adrian Hunter
---
 tools/perf/arch/x86/util/tsc.c |    8
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/tools/perf/arch/x86/util/tsc.c b/tools/perf/arch/x86/util/tsc.c
index f111744..9570c2b 100644
--- a/tools/perf/arch/x86/util/tsc.c
+++ b/tools/perf/arch/x86/util/tsc.c
@@ -10,11 +10,11 @@
 
 u64 perf_time_to_tsc(u64 ns, struct perf_tsc_conversion *tc)
 {
-	u64 time, quot, rem;
+	u64 t, quot, rem;
 
-	time = ns - tc->time_zero;
-	quot = time / tc->time_mult;
-	rem = time % tc->time_mult;
+	t = ns - tc->time_zero;
+	quot = t / tc->time_mult;
+	rem = t % tc->time_mult;
 	return (quot << tc->time_shift) +
 	       (rem << tc->time_shift) / tc->time_mult;
 }
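As a side note on the arithmetic in perf_time_to_tsc(), here is a user-space sketch of the same computation. The struct and function names are made up for illustration, and the field widths are an assumption. Splitting the value into quotient and remainder appears intended to avoid overflowing 64 bits, which a direct (t << shift) / mult could do for large t.

```c
#include <stdint.h>

/* Hypothetical stand-in for struct perf_tsc_conversion. */
struct sketch_tsc_conversion {
    uint16_t time_shift;
    uint32_t time_mult;
    uint64_t time_zero;
};

/*
 * Convert a perf timestamp in nanoseconds back to a TSC value.
 * The remainder is smaller than time_mult, so shifting it stays well
 * within 64 bits for the small shift values used here.
 */
static uint64_t sketch_time_to_tsc(uint64_t ns,
                                   const struct sketch_tsc_conversion *tc)
{
    uint64_t t = ns - tc->time_zero;
    uint64_t quot = t / tc->time_mult;
    uint64_t rem = t % tc->time_mult;

    return (quot << tc->time_shift) +
           (rem << tc->time_shift) / tc->time_mult;
}
```

For example, with time_shift = 1 and time_mult = 2 the conversion is the identity, and it stays correct for an input of 2^63, where shifting first would wrap to zero.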
Re: [PATCH 3/3] exec: proc_exec_connector() should be called only once
On Thu, Aug 1, 2013 at 12:05 PM, Oleg Nesterov wrote:
> A separate one-liner with the minor fix.
>
> PROC_EVENT_EXEC reports the "exec" event, but this message
> is sent at least twice if search_binary_handler() is called
> by ->load_binary() recursively, say, load_script().
>
> Move it to exec_binprm(), this is "depth == 0" code too.
>
> Signed-off-by: Oleg Nesterov

Yeah, looks right. I almost mentioned this while reading the other
cleanup patch, but then you fixed it too. :)

Acked-by: Kees Cook

Thanks,

-Kees

-- 
Kees Cook
Chrome OS Security
Re: [PATCH 1/3] exec: introduce exec_binprm() for "depth == 0" code
On Thu, Aug 1, 2013 at 12:05 PM, Oleg Nesterov wrote: > task_pid_nr_ns() and trace/ptrace code in the middle of the > recursive search_binary_handler() looks confusing and imho > annoying. We only need this code if "depth == 0", lets add > a simple helper which calls search_binary_handler() and does > trace_sched_process_exec() + ptrace_event(). > > The patch also moves the setting of task->did_exec, we need > to do this only once. > > Note: we can kill either task->did_exec or PF_FORKNOEXEC. > > Signed-off-by: Oleg Nesterov > --- > fs/exec.c | 36 ++-- > 1 files changed, 22 insertions(+), 14 deletions(-) > > diff --git a/fs/exec.c b/fs/exec.c > index 9c73def..a9ae4f2 100644 > --- a/fs/exec.c > +++ b/fs/exec.c > @@ -1373,7 +1373,6 @@ int search_binary_handler(struct linux_binprm *bprm) > unsigned int depth = bprm->recursion_depth; > int try,retval; > struct linux_binfmt *fmt; > - pid_t old_pid, old_vpid; > > /* This allows 4 levels of binfmt rewrites before failing hard. */ > if (depth > 5) > @@ -1387,12 +1386,6 @@ int search_binary_handler(struct linux_binprm *bprm) > if (retval) > return retval; > > - /* Need to fetch pid before load_binary changes it */ > - old_pid = current->pid; > - rcu_read_lock(); > - old_vpid = task_pid_nr_ns(current, > task_active_pid_ns(current->parent)); > - rcu_read_unlock(); > - > retval = -ENOENT; > for (try=0; try<2; try++) { > read_lock(&binfmt_lock); > @@ -1407,16 +1400,11 @@ int search_binary_handler(struct linux_binprm *bprm) > retval = fn(bprm); > bprm->recursion_depth = depth; > if (retval >= 0) { > - if (depth == 0) { > - trace_sched_process_exec(current, > old_pid, bprm); > - ptrace_event(PTRACE_EVENT_EXEC, > old_vpid); > - } > put_binfmt(fmt); > allow_write_access(bprm->file); > if (bprm->file) > fput(bprm->file); > bprm->file = NULL; > - current->did_exec = 1; > proc_exec_connector(current); > return retval; > } > @@ -1450,9 +1438,29 @@ int search_binary_handler(struct linux_binprm *bprm) > } > return retval; > } > - > 
EXPORT_SYMBOL(search_binary_handler); > > +static int exec_binprm(struct linux_binprm *bprm) > +{ > + pid_t old_pid, old_vpid; > + int ret; > + > + /* Need to fetch pid before load_binary changes it */ > + old_pid = current->pid; > + rcu_read_lock(); > + old_vpid = task_pid_nr_ns(current, > task_active_pid_ns(current->parent)); > + rcu_read_unlock(); > + > + ret = search_binary_handler(bprm); > + if (ret >= 0) { > + trace_sched_process_exec(current, old_pid, bprm); > + ptrace_event(PTRACE_EVENT_EXEC, old_vpid); > + current->did_exec = 1; > + } Cleanup looks good. One idea here, though: this could be made more pretty by doing: if (ret < 0) return ret; to avoid the indentation for the "expected" code path. -Kees > + > + return ret; > +} > + > /* > * sys_execve() executes a new program. > */ > @@ -1541,7 +1549,7 @@ static int do_execve_common(const char *filename, > if (retval < 0) > goto out; > > - retval = search_binary_handler(bprm); > + retval = exec_binprm(bprm); > if (retval < 0) > goto out; > > -- > 1.5.5.1 > -- Kees Cook Chrome OS Security -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH v2 2/3] exec: kill "int depth" in search_binary_handler()
On 08/03, Kees Cook wrote: > > On Thu, Aug 1, 2013 at 12:05 PM, Oleg Nesterov wrote: > > Nobody except search_binary_handler() should touch ->recursion_depth, > > "int depth" buys nothing but complicates the code, kill it. > > I'd like to see a comment added to binfmts.h's recursion_depth field > that reminds people that recursion_depth is for > search_binary_handler()'s use only, and a binfmt loader shouldn't > touch it. And this comment probably makes sense even without this change > Besides that, yeah, sensible clean up. OK, thanks, please see v2. The only change is the comment in .h -- Subject: [PATCH 1/1] exec: kill "int depth" in search_binary_handler() Nobody except search_binary_handler() should touch ->recursion_depth, "int depth" buys nothing but complicates the code, kill it. Probably we should also kill "fn" and the !NULL check, ->load_binary should be always defined. And it can not go away after read_unlock() or this code is buggy anyway. v2: add the comment about linux_binprm->recursion_depth Signed-off-by: Oleg Nesterov --- fs/exec.c |9 - include/linux/binfmts.h |2 +- 2 files changed, 5 insertions(+), 6 deletions(-) diff --git a/fs/exec.c b/fs/exec.c index a9ae4f2..f32079c 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -1370,12 +1370,11 @@ EXPORT_SYMBOL(remove_arg_zero); */ int search_binary_handler(struct linux_binprm *bprm) { - unsigned int depth = bprm->recursion_depth; - int try,retval; + int try, retval; struct linux_binfmt *fmt; /* This allows 4 levels of binfmt rewrites before failing hard. 
*/ - if (depth > 5) + if (bprm->recursion_depth > 5) return -ELOOP; retval = security_bprm_check(bprm); @@ -1396,9 +1395,9 @@ int search_binary_handler(struct linux_binprm *bprm) if (!try_module_get(fmt->module)) continue; read_unlock(&binfmt_lock); - bprm->recursion_depth = depth + 1; + bprm->recursion_depth++; retval = fn(bprm); - bprm->recursion_depth = depth; + bprm->recursion_depth--; if (retval >= 0) { put_binfmt(fmt); allow_write_access(bprm->file); diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h index 70cf138..e8112ae 100644 --- a/include/linux/binfmts.h +++ b/include/linux/binfmts.h @@ -31,7 +31,7 @@ struct linux_binprm { #ifdef __alpha__ unsigned int taso:1; #endif - unsigned int recursion_depth; + unsigned int recursion_depth; /* only for search_binary_handler() */ struct file * file; struct cred *cred; /* new credentials */ int unsafe; /* how unsafe this exec is (mask of LSM_UNSAFE_*) */ -- 1.5.5.1 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
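The increment/decrement pattern in this patch can be tried out in a user-space sketch. Everything named sketch_* is hypothetical; the loader here simply re-enters the handler a fixed number of times, the way a script or misc-format loader would re-enter search_binary_handler(). Only the depth check and the paired ++ and -- mirror the patch.

```c
#include <errno.h>

/* Hypothetical stand-in for struct linux_binprm. */
struct sketch_binprm {
    unsigned int recursion_depth;  /* only for sketch_search_handler() */
    int rewrites_left;             /* how many more times the loader re-enters */
};

static int sketch_load(struct sketch_binprm *bprm);

/* Mirrors the depth check and the paired increment/decrement. */
static int sketch_search_handler(struct sketch_binprm *bprm)
{
    int retval;

    if (bprm->recursion_depth > 5)
        return -ELOOP;

    bprm->recursion_depth++;
    retval = sketch_load(bprm);
    bprm->recursion_depth--;

    return retval;
}

/* A loader that rewrites the request and re-enters the handler. */
static int sketch_load(struct sketch_binprm *bprm)
{
    if (bprm->rewrites_left == 0)
        return 0;

    bprm->rewrites_left--;
    return sketch_search_handler(bprm);
}
```

Because every increment is paired with a decrement on the way out, the counter is back at its starting value after the outermost call returns, whether the chain succeeded or hit -ELOOP. That only holds if no loader modifies the field itself, which is what the new comment in binfmts.h documents.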
Re: [PATCH 2/3] exec: kill "int depth" in search_binary_handler()
On Thu, Aug 1, 2013 at 12:05 PM, Oleg Nesterov wrote:
> Nobody except search_binary_handler() should touch ->recursion_depth,
> "int depth" buys nothing but complicates the code, kill it.

I'd like to see a comment added to binfmts.h's recursion_depth field
that reminds people that recursion_depth is for
search_binary_handler()'s use only, and a binfmt loader shouldn't
touch it.

Besides that, yeah, sensible clean up.

-Kees

-- 
Kees Cook
Chrome OS Security
Re: [PATCH 001/001] CHAR DRIVERS: a simple device to give daemons a /sys-like interface
Greg,

Thanks for your reply. I'll reply to your comments in reverse order.

Greg Kroah-Hartman wrote:
> And how does this have anything to do with /sys? I can't see any sysfs
> interaction in the code, or am I missing it?

Yes, you are right. I'll change the subject and brief descriptions to
something like:

  "Proxy, a simple bidirectional character device that almost
  transparently proxies opens, reads, writes, and closes from one side
  of the device to the other side."

I'll remove "/sys" from all descriptions of the device, but I might
leave it in the Documentation/proxy.txt file, since a major use case of
proxy is to give user space drivers and daemons the same kind of
interface the kernel enjoys with /sys and /proc. The similarity is very
deliberate on my part for commands like

    echo 1 > /proc/sys/net/ipv4/ip_forward   # procfs
    echo 75 > /dev/motors/left/speed         # proxy dev

Greg Kroah-Hartman wrote:
> Why not just use the cuse interface instead? How does this differ from
> that /dev node interaction?

I am a big fan of FUSE and CUSE but they do not support what I need.
CUSE is OK if _ALL_ interaction is through its interface. What is
lacking is an ability to open, say, a USB serial port and add its file
descriptor to CUSE's FD_SET. This is a pretty deep problem since CUSE
takes away main() from the application developer. A CUSE application
looks kind of like this:

    main(argc, argv)
    {
        (check and process command line)
        cuse_lowlevel_main(argc, argv, ...)
    }

Another difference is that no language bindings are needed. There is no
equivalent of libfuse.so. Since proxy is just a character device and
there are no language bindings, someone could, in the unlikely case it
ever made sense, write a user space device driver in node.js.

thanks
Bob Smith
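To make the FD_SET point concrete, here is a minimal sketch of the kind of event loop step a daemon can write when it keeps control of main(). It is an assumption-laden illustration: sketch_first_readable() is a made-up helper, and the descriptors could be a proxy device node, a USB serial port, or anything else that select() accepts. Nothing here is taken from the proxy patch itself.

```c
#include <stddef.h>
#include <sys/select.h>
#include <unistd.h>

/*
 * Wait up to timeout_ms for any descriptor in fds[0..n-1] to become
 * readable. Returns the index of the first readable descriptor, or -1
 * on timeout or error.
 */
static int sketch_first_readable(const int *fds, int n, int timeout_ms)
{
    fd_set rset;
    struct timeval tv;
    int i, maxfd = -1;

    FD_ZERO(&rset);
    for (i = 0; i < n; i++) {
        FD_SET(fds[i], &rset);
        if (fds[i] > maxfd)
            maxfd = fds[i];
    }

    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    if (select(maxfd + 1, &rset, NULL, NULL, &tv) <= 0)
        return -1;

    for (i = 0; i < n; i++)
        if (FD_ISSET(fds[i], &rset))
            return i;

    return -1;
}
```

A daemon would call this in a loop with all of its descriptors in one array and dispatch on the returned index. With a framework that owns main() and its own descriptor set, there is no obvious place to add the extra descriptors.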
Re: [PATCH 0/4] ARM: dove: DT updates for v3.11-rc2
On Mon, Jul 29, 2013 at 02:29:02PM +0200, Sebastian Hesselbarth wrote:
> This patch set comprises some DT updates and cleanup patches that
> piled up in the past. The first patch adds a cpu node for the
> Marvell Sheeva PJ4(A) CPU found on Dove SoCs. While touching the
> dtsi, also some nodes are renamed to achieve a consistent naming
> scheme. The second patch adds some common pinmux settings to
> the SoC's pinctrl node and moves the default pinctrl properties to
> the corresponding nodes. The third patch adds a node for the
> IR diode connected to a GPIO pin on SolidRun CuBox. Finally,
> the last patch adds an initial DT file for the Globalscale D2Plug
> which is also based on Marvell Dove SoC.
>
> The whole patch set is based on v3.11-rc2 with DT part of
> mv643xx_eth and irqchip/clocksource patches applied.
>
> Although these patches are Marvell Dove only, the whole set is also
> sent to devicetree mailing list to raise attention for potential DT
> binding review.
>
> Sebastian Hesselbarth (4):
>   ARM: dove: add cpu device tree node
>   ARM: dove: add common pinmux functions to DT
>   ARM: dove: add GPIO IR receiver node to SolidRun CuBox
>   ARM: dove: add initial DT file for Globalscale D2Plug
>
>  arch/arm/boot/dts/Makefile        |    1 +
>  arch/arm/boot/dts/dove-cubox.dts  |   30 ++---
>  arch/arm/boot/dts/dove-d2plug.dts |   69 +++
>  arch/arm/boot/dts/dove.dtsi       |  235 +
>  4 files changed, 295 insertions(+), 40 deletions(-)
>  create mode 100644 arch/arm/boot/dts/dove-d2plug.dts

Whole series applied to mvebu/boards.

thx,

Jason.
[PATCH] ARM: DTS: AM33XX: Add PMU support
ARM Performance Monitor Units are available on the am33xx, add the
support in the dtsi.

Tested with perf and oprofile on a regular beaglebone.

Signed-off-by: Alexandre Belloni
---
 arch/arm/boot/dts/am33xx.dtsi | 5 +
 1 file changed, 5 insertions(+)

diff --git a/arch/arm/boot/dts/am33xx.dtsi b/arch/arm/boot/dts/am33xx.dtsi
index 38b446b..cfeac96 100644
--- a/arch/arm/boot/dts/am33xx.dtsi
+++ b/arch/arm/boot/dts/am33xx.dtsi
@@ -53,6 +53,11 @@
 		};
 	};
 
+	pmu {
+		compatible = "arm,cortex-a8-pmu";
+		interrupts = <3>;
+	};
+
 	/*
 	 * The soc node represents the soc top level view. It is uses for IPs
 	 * that are not memory mapped in the MPU view or for the MPU itself.
-- 
1.8.1.2
Re: [PATCH 01/10] cpufreq: Cleanup header files included in core
On 3 August 2013 17:19, Viresh Kumar wrote:
> This patch intends to cleanup following issues in the header files included in
> cpufreq core layers:
> - Include headers in ascending order, so that we don't add same multiple times
>   by mistake.
> - must be included after , so that they override whatever they
>   need.
> - Remove unnecessary header files
> - Don't include files already included by cpufreq.h or cpufreq_governor.h
>
> Signed-off-by: Viresh Kumar

Fixup due to compilation warning reported by Fengguang's kbuild system:

[Latest stuff pushed in my cpufreq-next branch]

diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index 70399ea..ccaf025 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -19,6 +19,7 @@
 
 #include
 #include
+#include
 #include
 #include
 #include
Re: [PATCH 02/10] cpufreq: Re-arrange declarations in cpufreq.h
On 3 August 2013 17:19, Viresh Kumar wrote: > They are pretty much mixed up. Although generic headers are present but > definitions/declarations are present outside them too.. > > This patch just moves stuff up and down to make it look better and consistent. > > Signed-off-by: Viresh Kumar > --- > include/linux/cpufreq.h | 370 > +++- > 1 file changed, 177 insertions(+), 193 deletions(-) Fixup due to compilation reported by Fengguang's kbuild system: [Will post the series again once I receive more comments on it] diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h index a6b97e2..d568f39 100644 --- a/include/linux/cpufreq.h +++ b/include/linux/cpufreq.h @@ -268,6 +268,19 @@ int cpufreq_unregister_notifier(struct notifier_block *nb, unsigned int list); void cpufreq_notify_transition(struct cpufreq_policy *policy, struct cpufreq_freqs *freqs, unsigned int state); +#else /* CONFIG_CPU_FREQ */ +static inline int cpufreq_register_notifier(struct notifier_block *nb, + unsigned int list) +{ + return 0; +} +static inline int cpufreq_unregister_notifier(struct notifier_block *nb, + unsigned int list) +{ + return 0; +} +#endif /* !CONFIG_CPU_FREQ */ + /** * cpufreq_scale - "old * mult / div" calculation for large values (32-bit-arch * safe) @@ -282,32 +295,16 @@ static inline unsigned long cpufreq_scale(unsigned long old, u_int div, u_int mult) { #if BITS_PER_LONG == 32 - u64 result = ((u64) old) * ((u64) mult); do_div(result, div); return (unsigned long) result; #elif BITS_PER_LONG == 64 - unsigned long result = old * ((u64) mult); result /= div; return result; - #endif -}; - -#else /* CONFIG_CPU_FREQ */ -static inline int cpufreq_register_notifier(struct notifier_block *nb, - unsigned int list) -{ - return 0; } -static inline int cpufreq_unregister_notifier(struct notifier_block *nb, - unsigned int list) -{ - return 0; -} -#endif /* !CONFIG_CPU_FREQ */ /* * CPUFREQ GOVERNORS* -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of 
a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
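For readers unfamiliar with cpufreq_scale(), the helper being moved in the fixup above, the "old * mult / div" calculation can be sketched in user space as follows. The name sketch_scale() is made up, and this single 64-bit version stands in for both BITS_PER_LONG branches; it is not the kernel implementation.

```c
#include <stdint.h>

/*
 * Compute old * mult / div with a 64-bit intermediate, so the
 * multiplication cannot overflow even where unsigned long is 32 bits.
 * The argument order (old, div, mult) follows the kernel helper.
 */
static unsigned long sketch_scale(unsigned long old, unsigned int div,
                                  unsigned int mult)
{
    uint64_t result = (uint64_t)old * (uint64_t)mult;

    result /= div;
    return (unsigned long)result;
}
```

For instance, scaling 800000 by 1200000/1000000 needs an intermediate of 9.6e11, which does not fit in 32 bits but divides back down to 960000.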
Re: [patch 0/7] improve memcg oom killer robustness v2
Hi azur,

here is the x86-only rollup of the series for 3.2.

Thanks!
Johannes
---
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 5db0490..314fe53 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -842,30 +842,22 @@ do_sigbus(struct pt_regs *regs, unsigned long error_code, unsigned long address,
 	force_sig_info_fault(SIGBUS, code, address, tsk, fault);
 }
 
-static noinline int
+static noinline void
 mm_fault_error(struct pt_regs *regs, unsigned long error_code,
 	       unsigned long address, unsigned int fault)
 {
-	/*
-	 * Pagefault was interrupted by SIGKILL. We have no reason to
-	 * continue pagefault.
-	 */
-	if (fatal_signal_pending(current)) {
-		if (!(fault & VM_FAULT_RETRY))
-			up_read(&current->mm->mmap_sem);
-		if (!(error_code & PF_USER))
-			no_context(regs, error_code, address);
-		return 1;
+	if (fatal_signal_pending(current) && !(error_code & PF_USER)) {
+		up_read(&current->mm->mmap_sem);
+		no_context(regs, error_code, address);
+		return;
 	}
-	if (!(fault & VM_FAULT_ERROR))
-		return 0;
 
 	if (fault & VM_FAULT_OOM) {
 		/* Kernel mode? Handle exceptions or die: */
 		if (!(error_code & PF_USER)) {
 			up_read(&current->mm->mmap_sem);
 			no_context(regs, error_code, address);
-			return 1;
+			return;
 		}
 
 		out_of_memory(regs, error_code, address);
@@ -876,7 +868,6 @@ mm_fault_error(struct pt_regs *regs, unsigned long error_code,
 		else
 			BUG();
 	}
-	return 1;
 }
 
 static int spurious_fault_check(unsigned long error_code, pte_t *pte)
@@ -1070,6 +1061,7 @@ do_page_fault(struct pt_regs *regs, unsigned long error_code)
 	if (user_mode_vm(regs)) {
 		local_irq_enable();
 		error_code |= PF_USER;
+		flags |= FAULT_FLAG_USER;
 	} else {
 		if (regs->flags & X86_EFLAGS_IF)
 			local_irq_enable();
@@ -1167,9 +1159,17 @@ good_area:
 	 */
 	fault = handle_mm_fault(mm, vma, address, flags);
 
-	if (unlikely(fault & (VM_FAULT_RETRY|VM_FAULT_ERROR))) {
-		if (mm_fault_error(regs, error_code, address, fault))
-			return;
+	/*
+	 * If we need to retry but a fatal signal is pending, handle the
+	 * signal first. We do not need to release the mmap_sem because it
+	 * would already be released in __lock_page_or_retry in mm/filemap.c.
+	 */
+	if (unlikely((fault & VM_FAULT_RETRY) && fatal_signal_pending(current)))
+		return;
+
+	if (unlikely(fault & VM_FAULT_ERROR)) {
+		mm_fault_error(regs, error_code, address, fault);
+		return;
 	}
 
 	/*
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index b87068a..b113c0f 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -120,6 +120,48 @@ mem_cgroup_get_reclaim_stat_from_page(struct page *page);
 extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
 					struct task_struct *p);
 
+/**
+ * mem_cgroup_toggle_oom - toggle the memcg OOM killer for the current task
+ * @new: true to enable, false to disable
+ *
+ * Toggle whether a failed memcg charge should invoke the OOM killer
+ * or just return -ENOMEM.  Returns the previous toggle state.
+ *
+ * NOTE: Any path that enables the OOM killer before charging must
+ * call mem_cgroup_oom_synchronize() afterward to finalize the
+ * OOM handling and clean up.
+ */
+static inline bool mem_cgroup_toggle_oom(bool new)
+{
+	bool old;
+
+	old = current->memcg_oom.may_oom;
+	current->memcg_oom.may_oom = new;
+
+	return old;
+}
+
+static inline void mem_cgroup_enable_oom(void)
+{
+	bool old = mem_cgroup_toggle_oom(true);
+
+	WARN_ON(old == true);
+}
+
+static inline void mem_cgroup_disable_oom(void)
+{
+	bool old = mem_cgroup_toggle_oom(false);
+
+	WARN_ON(old == false);
+}
+
+static inline bool task_in_memcg_oom(struct task_struct *p)
+{
+	return p->memcg_oom.in_memcg_oom;
+}
+
+bool mem_cgroup_oom_synchronize(void);
+
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
 extern int do_swap_account;
 #endif
@@ -333,6 +375,29 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
 {
 }
 
+static inline bool mem_cgroup_toggle_oom(bool new)
+{
+	return false;
+}
+
+static inline void mem_cgroup_enable_oom(void)
+{
+}
+
+static inline void mem_cgroup_disable_oom(void)
+{
+}
+
+static inline bool task_in_memcg_oom(struct task_struct *p)
+{
+	return false;
+}
+
+static inline bool mem_cgroup_oom_synchronize(void)
+{
+	return false;
+}
+
Re: [PATCHv2] Add per-process flag to control thp
On 08/02, Alex Thorlton wrote:
>
> This patch implements functionality to allow processes to disable the use of
> transparent hugepages through the prctl syscall.
>
> We've determined that some jobs perform significantly better with thp disabled,
> and we needed a way to control thp on a per-process basis, without relying on
> madvise.

Well, I think the changelog should explain why madvise() is bad.

> @@ -1311,6 +1311,10 @@ static struct task_struct *copy_process(unsigned long clone_flags,
> 	p->sequential_io_avg	= 0;
>  #endif
>
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +	p->thp_disabled = current->thp_disabled;
> +#endif

Unneeded. It will be copied by dup_task_struct() automagically.

But I simply can't understand why this flag is per-thread. It should be
mm flag, no?

Oleg
[patch 0/7] improve memcg oom killer robustness v2
Changes in version 2:

o use user_mode() instead of open coding it on s390 (Heiko Carstens)
o clean up memcg OOM enable/disable toggling (Michal Hocko & KOSAKI Motohiro)
o add a separate patch to rework and document OOM locking
o fix a problem with lost wakeups when sleeping on the OOM lock
o fix OOM unlocking & wakeups with userspace OOM handling

The memcg code can trap tasks in the context of the failing allocation
until an OOM situation is resolved. They can hold all kinds of locks
(fs, mm) at this point, which makes it prone to deadlocking.

This series converts memcg OOM handling into a two step process that
is started in the charge context, but any waiting is done after the
fault stack is fully unwound.

Patches 1-4 prepare architecture handlers to support the new memcg
requirements, but in doing so they also remove old cruft and unify
out-of-memory behavior across architectures.

Patch 5 disables the memcg OOM handling for syscalls, readahead,
kernel faults, because they can gracefully unwind the stack with
-ENOMEM. OOM handling is restricted to user triggered faults that
have no other option.

Patch 6 reworks memcg's hierarchical OOM locking to make it a little
more obvious wth is going on in there: reduce locked regions, rename
locking functions, reorder and document.

Patch 7 implements the two-part OOM handling such that tasks are never
trapped with the full charge stack in an OOM situation.
 arch/alpha/mm/fault.c      |   7 +-
 arch/arc/mm/fault.c        |  11 +--
 arch/arm/mm/fault.c        |  23 +++--
 arch/arm64/mm/fault.c      |  23 +++--
 arch/avr32/mm/fault.c      |   4 +-
 arch/cris/mm/fault.c       |   6 +-
 arch/frv/mm/fault.c        |  10 +-
 arch/hexagon/mm/vm_fault.c |   6 +-
 arch/ia64/mm/fault.c       |   6 +-
 arch/m32r/mm/fault.c       |  10 +-
 arch/m68k/mm/fault.c       |   2 +
 arch/metag/mm/fault.c      |   6 +-
 arch/microblaze/mm/fault.c |   7 +-
 arch/mips/mm/fault.c       |   8 +-
 arch/mn10300/mm/fault.c    |   2 +
 arch/openrisc/mm/fault.c   |   1 +
 arch/parisc/mm/fault.c     |   7 +-
 arch/powerpc/mm/fault.c    |   7 +-
 arch/s390/mm/fault.c       |   2 +
 arch/score/mm/fault.c      |  13 ++-
 arch/sh/mm/fault.c         |   9 +-
 arch/sparc/mm/fault_32.c   |  12 ++-
 arch/sparc/mm/fault_64.c   |   8 +-
 arch/tile/mm/fault.c       |  13 +--
 arch/um/kernel/trap.c      |  22 +++--
 arch/unicore32/mm/fault.c  |  22 +++--
 arch/x86/mm/fault.c        |  43 -
 arch/xtensa/mm/fault.c     |   2 +
 include/linux/memcontrol.h |  65 +
 include/linux/mm.h         |   1 +
 include/linux/sched.h      |   7 ++
 mm/filemap.c               |  11 ++-
 mm/memcontrol.c            | 229 +
 mm/memory.c                |  43 +++--
 mm/oom_kill.c              |   7 +-
 35 files changed, 444 insertions(+), 211 deletions(-)
[patch 3/7] arch: mm: pass userspace fault flag to generic fault handler
Unlike global OOM handling, memory cgroup code will invoke the OOM
killer in any OOM situation because it has no way of telling faults
occurring in kernel context - which could be handled more gracefully -
from user-triggered faults.

Pass a flag that identifies faults originating in user space from the
architecture-specific fault handlers to generic code so that memcg OOM
handling can be improved.

Signed-off-by: Johannes Weiner
Reviewed-by: Michal Hocko
---
 arch/alpha/mm/fault.c      |  7 ---
 arch/arc/mm/fault.c        |  6 --
 arch/arm/mm/fault.c        |  9 ++---
 arch/arm64/mm/fault.c      |  9 ++---
 arch/avr32/mm/fault.c      |  2 ++
 arch/cris/mm/fault.c       |  6 --
 arch/frv/mm/fault.c        | 10 ++
 arch/hexagon/mm/vm_fault.c |  6 --
 arch/ia64/mm/fault.c       |  6 --
 arch/m32r/mm/fault.c       | 10 ++
 arch/m68k/mm/fault.c       |  2 ++
 arch/metag/mm/fault.c      |  6 --
 arch/microblaze/mm/fault.c |  7 +--
 arch/mips/mm/fault.c       |  6 --
 arch/mn10300/mm/fault.c    |  2 ++
 arch/openrisc/mm/fault.c   |  1 +
 arch/parisc/mm/fault.c     |  7 +--
 arch/powerpc/mm/fault.c    |  7 ---
 arch/s390/mm/fault.c       |  2 ++
 arch/score/mm/fault.c      |  7 ++-
 arch/sh/mm/fault.c         |  9 ++---
 arch/sparc/mm/fault_32.c   | 12 +---
 arch/sparc/mm/fault_64.c   |  8 +---
 arch/tile/mm/fault.c       |  7 +--
 arch/um/kernel/trap.c      | 20
 arch/unicore32/mm/fault.c  |  8 ++--
 arch/x86/mm/fault.c        |  8 +---
 arch/xtensa/mm/fault.c     |  2 ++
 include/linux/mm.h         |  1 +
 29 files changed, 132 insertions(+), 61 deletions(-)

diff --git a/arch/alpha/mm/fault.c b/arch/alpha/mm/fault.c index 0c4132d..98838a0 100644 --- a/arch/alpha/mm/fault.c +++ b/arch/alpha/mm/fault.c @@ -89,8 +89,7 @@ do_page_fault(unsigned long address, unsigned long mmcsr, const struct exception_table_entry *fixup; int fault, si_code = SEGV_MAPERR; siginfo_t info; - unsigned int flags = (FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE | - (cause > 0 ? FAULT_FLAG_WRITE : 0)); + unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE; /* As of EV6, a load into $31/$f31 is a prefetch, and never faults (or is suppressed by the PALcode). 
Support that for older CPUs @@ -115,7 +114,8 @@ do_page_fault(unsigned long address, unsigned long mmcsr, if (address >= TASK_SIZE) goto vmalloc_fault; #endif - + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; retry: down_read(&mm->mmap_sem); vma = find_vma(mm, address); @@ -142,6 +142,7 @@ retry: } else { if (!(vma->vm_flags & VM_WRITE)) goto bad_area; + flags |= FAULT_FLAG_WRITE; } /* If for any reason at all we couldn't handle the fault, diff --git a/arch/arc/mm/fault.c b/arch/arc/mm/fault.c index 6b0bb41..d63f3de 100644 --- a/arch/arc/mm/fault.c +++ b/arch/arc/mm/fault.c @@ -60,8 +60,7 @@ void do_page_fault(struct pt_regs *regs, unsigned long address) siginfo_t info; int fault, ret; int write = regs->ecr_cause & ECR_C_PROTV_STORE; /* ST/EX */ - unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE | - (write ? FAULT_FLAG_WRITE : 0); + unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE; /* * We fault-in kernel-space virtual memory on-demand. The @@ -89,6 +88,8 @@ void do_page_fault(struct pt_regs *regs, unsigned long address) if (in_atomic() || !mm) goto no_context; + if (user_mode(regs)) + flags |= FAULT_FLAG_USER; retry: down_read(&mm->mmap_sem); vma = find_vma(mm, address); @@ -117,6 +118,7 @@ good_area: if (write) { if (!(vma->vm_flags & VM_WRITE)) goto bad_area; + flags |= FAULT_FLAG_WRITE; } else { if (!(vma->vm_flags & (VM_READ | VM_EXEC))) goto bad_area; diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c index 217bcbf..eb8830a 100644 --- a/arch/arm/mm/fault.c +++ b/arch/arm/mm/fault.c @@ -261,9 +261,7 @@ do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs) struct task_struct *tsk; struct mm_struct *mm; int fault, sig, code; - int write = fsr & FSR_WRITE; - unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE | - (write ? 
FAULT_FLAG_WRITE : 0); + unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE; if (notify_page_fault(regs, fsr)) return 0; @@ -282,6 +280,11 @@ do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs) if (in_atomic() || !mm) goto no_
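
The per-architecture change repeated in the hunks above follows one
pattern, which can be sketched outside the kernel. This is only a
userspace model under stated assumptions: the flag values,
struct fake_regs and build_fault_flags() are made up for illustration,
and only the FAULT_FLAG_* names come from the patch. The real handlers
also set FAULT_FLAG_WRITE only after the vma permission check, which
the model skips.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative values only; the real ones live in include/linux/mm.h. */
#define FAULT_FLAG_WRITE        0x01u
#define FAULT_FLAG_ALLOW_RETRY  0x02u
#define FAULT_FLAG_KILLABLE     0x04u
#define FAULT_FLAG_USER         0x08u

/* Hypothetical stand-in for the architecture's struct pt_regs. */
struct fake_regs {
	bool user_mode;	/* fault was raised while running user code */
	bool is_write;	/* faulting access was a store */
};

/* Build the flags word that would be handed to the generic fault code. */
static unsigned int build_fault_flags(const struct fake_regs *regs)
{
	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;

	assert(regs != NULL);

	/* New in this patch: tell generic code where the fault came from. */
	if (regs->user_mode)
		flags |= FAULT_FLAG_USER;
	if (regs->is_write)
		flags |= FAULT_FLAG_WRITE;

	return flags;
}
```

With this, a fault taken in user mode carries FAULT_FLAG_USER and a
fault taken in kernel mode (uaccess, gup) does not, which is all the
generic code needs to tell the two apart.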
[patch 5/7] mm: memcg: enable memcg OOM killer only for user faults
System calls and kernel faults (uaccess, gup) can handle an out of memory situation gracefully and just return -ENOMEM. Enable the memcg OOM killer only for user faults, where it's really the only option available. Signed-off-by: Johannes Weiner --- include/linux/memcontrol.h | 44 include/linux/sched.h | 3 +++ mm/filemap.c | 11 ++- mm/memcontrol.c| 2 +- mm/memory.c| 40 ++-- 5 files changed, 88 insertions(+), 12 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 7b4d9d7..9c449c1 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -125,6 +125,37 @@ extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg, extern void mem_cgroup_replace_page_cache(struct page *oldpage, struct page *newpage); +/** + * mem_cgroup_toggle_oom - toggle the memcg OOM killer for the current task + * @new: true to enable, false to disable + * + * Toggle whether a failed memcg charge should invoke the OOM killer + * or just return -ENOMEM. Returns the previous toggle state. 
+ */ +static inline bool mem_cgroup_toggle_oom(bool new) +{ + bool old; + + old = current->memcg_oom.may_oom; + current->memcg_oom.may_oom = new; + + return old; +} + +static inline void mem_cgroup_enable_oom(void) +{ + bool old = mem_cgroup_toggle_oom(true); + + WARN_ON(old == true); +} + +static inline void mem_cgroup_disable_oom(void) +{ + bool old = mem_cgroup_toggle_oom(false); + + WARN_ON(old == false); +} + #ifdef CONFIG_MEMCG_SWAP extern int do_swap_account; #endif @@ -348,6 +379,19 @@ static inline void mem_cgroup_end_update_page_stat(struct page *page, { } +static inline bool mem_cgroup_toggle_oom(bool new) +{ + return false; +} + +static inline void mem_cgroup_enable_oom(void) +{ +} + +static inline void mem_cgroup_disable_oom(void) +{ +} + static inline void mem_cgroup_inc_page_stat(struct page *page, enum mem_cgroup_page_stat_item idx) { diff --git a/include/linux/sched.h b/include/linux/sched.h index fc09d21..4b3effc 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1398,6 +1398,9 @@ struct task_struct { unsigned long memsw_nr_pages; /* uncharged mem+swap usage */ } memcg_batch; unsigned int memcg_kmem_skip_account; + struct memcg_oom_info { + unsigned int may_oom:1; + } memcg_oom; #endif #ifdef CONFIG_UPROBES struct uprobe_task *utask; diff --git a/mm/filemap.c b/mm/filemap.c index a6981fe..4a73e1a 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -1618,6 +1618,7 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf) struct inode *inode = mapping->host; pgoff_t offset = vmf->pgoff; struct page *page; + bool memcg_oom; pgoff_t size; int ret = 0; @@ -1626,7 +1627,11 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf) return VM_FAULT_SIGBUS; /* -* Do we have something in the page cache already? +* Do we have something in the page cache already? Either +* way, try readahead, but disable the memcg OOM killer for it +* as readahead is optional and no errors are propagated up +* the fault stack. 
The OOM killer is enabled while trying to +* instantiate the faulting page individually below. */ page = find_get_page(mapping, offset); if (likely(page) && !(vmf->flags & FAULT_FLAG_TRIED)) { @@ -1634,10 +1639,14 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf) * We found the page, so try async readahead before * waiting for the lock. */ + memcg_oom = mem_cgroup_toggle_oom(false); do_async_mmap_readahead(vma, ra, file, page, offset); + mem_cgroup_toggle_oom(memcg_oom); } else if (!page) { /* No page in the page cache at all */ + memcg_oom = mem_cgroup_toggle_oom(false); do_sync_mmap_readahead(vma, ra, file, offset); + mem_cgroup_toggle_oom(memcg_oom); count_vm_event(PGMAJFAULT); mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT); ret = VM_FAULT_MAJOR; diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 00a7a66..30ae46a 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2614,7 +2614,7 @@ static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask, return CHARGE_RETRY; /* If we don't need to call oom-killer at el, return immediately */ - if (!oom_check) + if (!oom_check || !current->memcg_oom.may_oom) return CHARGE_NOMEM; /* check OOM
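
The save/disable/restore use of mem_cgroup_toggle_oom() around
readahead can be modelled in plain C. A minimal sketch, assuming a
hypothetical struct task_model in place of task_struct and a
failed_charge() helper that stands in for the check added to
mem_cgroup_do_charge(); none of these names exist in the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-task state; the patch keeps this bit in task_struct. */
struct task_model {
	bool may_oom;
};

/* Mirrors the toggle in the patch: set the new state, return the old one. */
static bool toggle_oom(struct task_model *t, bool new_state)
{
	bool old;

	assert(t != NULL);
	old = t->may_oom;
	t->may_oom = new_state;
	return old;
}

/* What a charge that hit the limit does, depending on the toggle. */
enum charge_result { CHARGE_NOMEM, CHARGE_OOM_KILL };

static enum charge_result failed_charge(const struct task_model *t)
{
	return t->may_oom ? CHARGE_OOM_KILL : CHARGE_NOMEM;
}

/* Optional work such as readahead: never OOM-kill on its behalf. */
static enum charge_result optional_work(struct task_model *t)
{
	bool saved = toggle_oom(t, false);
	enum charge_result res = failed_charge(t);

	toggle_oom(t, saved);	/* restore whatever the caller had set */
	return res;
}
```

Restoring the saved value rather than unconditionally re-enabling is
what lets the same helper run both from a user fault (OOM enabled) and
from a syscall path (OOM already disabled).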
[patch 1/7] arch: mm: remove obsolete init OOM protection
Back before smart OOM killing, when faulting tasks were killed
directly on allocation failures, the arch-specific fault handlers
needed special protection for the init process.

Now that all fault handlers call into the generic OOM killer (609838c
"mm: invoke oom-killer from remaining unconverted page fault
handlers"), which already provides init protection, the arch-specific
leftovers can be removed.

Signed-off-by: Johannes Weiner
Reviewed-by: Michal Hocko
Acked-by: KOSAKI Motohiro
---
 arch/arc/mm/fault.c   | 5 -
 arch/score/mm/fault.c | 6 --
 arch/tile/mm/fault.c  | 6 --
 3 files changed, 17 deletions(-)

diff --git a/arch/arc/mm/fault.c b/arch/arc/mm/fault.c
index 0fd1f0d..6b0bb41 100644
--- a/arch/arc/mm/fault.c
+++ b/arch/arc/mm/fault.c
@@ -122,7 +122,6 @@ good_area:
 			goto bad_area;
 	}
 
-survive:
 	/*
 	 * If for any reason at all we couldn't handle the fault,
 	 * make sure we exit gracefully rather than endlessly redo
@@ -201,10 +200,6 @@ no_context:
 	die("Oops", regs, address);
 
 out_of_memory:
-	if (is_global_init(tsk)) {
-		yield();
-		goto survive;
-	}
 	up_read(&mm->mmap_sem);
 
 	if (user_mode(regs)) {
diff --git a/arch/score/mm/fault.c b/arch/score/mm/fault.c
index 6b18fb0..4b71a62 100644
--- a/arch/score/mm/fault.c
+++ b/arch/score/mm/fault.c
@@ -100,7 +100,6 @@ good_area:
 			goto bad_area;
 	}
 
-survive:
 	/*
 	 * If for any reason at all we couldn't handle the fault,
 	 * make sure we exit gracefully rather than endlessly redo
@@ -167,11 +166,6 @@ no_context:
 	 */
 out_of_memory:
 	up_read(&mm->mmap_sem);
-	if (is_global_init(tsk)) {
-		yield();
-		down_read(&mm->mmap_sem);
-		goto survive;
-	}
 	if (!user_mode(regs))
 		goto no_context;
 	pagefault_out_of_memory();
diff --git a/arch/tile/mm/fault.c b/arch/tile/mm/fault.c
index f7f99f9..ac553ee 100644
--- a/arch/tile/mm/fault.c
+++ b/arch/tile/mm/fault.c
@@ -430,7 +430,6 @@ good_area:
 			goto bad_area;
 	}
 
- survive:
 	/*
 	 * If for any reason at all we couldn't handle the fault,
 	 * make sure we exit gracefully rather than endlessly redo
@@ -568,11 +567,6 @@ no_context:
 	 */
 out_of_memory:
 	up_read(&mm->mmap_sem);
-	if (is_global_init(tsk)) {
-		yield();
-		down_read(&mm->mmap_sem);
-		goto survive;
-	}
 	if (is_kernel_mode)
 		goto no_context;
 	pagefault_out_of_memory();
-- 
1.8.3.2
[patch 6/7] mm: memcg: rework and document OOM waiting and wakeup
The memcg OOM handler open-codes a sleeping lock for OOM serialization
(trylock, wait, repeat) because the required locking is so specific to
memcg hierarchies. However, it would be nice if this construct would
be clearly recognizable and not be as obfuscated as it is right now.
Clean up as follows:

1. Remove the return value of mem_cgroup_oom_unlock()

2. Rename mem_cgroup_oom_lock() to mem_cgroup_oom_trylock().

3. Pull the prepare_to_wait() out of the memcg_oom_lock scope. This
   makes it more obvious that the task has to be on the waitqueue
   before attempting to OOM-trylock the hierarchy, to not miss any
   wakeups before going to sleep. It just didn't matter until now
   because it was all lumped together into the global memcg_oom_lock
   spinlock section.

4. Pull the mem_cgroup_oom_notify() out of the memcg_oom_lock scope.
   It is protected by the hierarchical OOM-lock.

5. The memcg_oom_lock spinlock is only required to propagate the OOM
   lock in any given hierarchy atomically. Restrict its scope to
   mem_cgroup_oom_(trylock|unlock).

6. Do not wake up the waitqueue unconditionally at the end of the
   function. Only the lockholder has to wake up the next in line
   after releasing the lock.

   Note that the lockholder kicks off the OOM-killer, which in turn
   leads to wakeups from the uncharges of the exiting task. But a
   contender is not guaranteed to see them if it enters the OOM path
   after the OOM kills but before the lockholder releases the lock.
   Thus there has to be an explicit wakeup after releasing the lock.

7. Put the OOM task on the waitqueue before marking the hierarchy as
   under OOM as that is the point where we start to receive wakeups.
   No point in listening before being on the waitqueue.

8. Likewise, unmark the hierarchy before finishing the sleep, for
   symmetry.

Signed-off-by: Johannes Weiner Acked-by: Michal Hocko --- mm/memcontrol.c | 85 +++-- 1 file changed, 47 insertions(+), 38 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 30ae46a..3d0c1d3 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2076,15 +2076,18 @@ static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg, return total; } +static DEFINE_SPINLOCK(memcg_oom_lock); + /* * Check OOM-Killer is already running under our hierarchy. * If someone is running, return false. - * Has to be called with memcg_oom_lock */ -static bool mem_cgroup_oom_lock(struct mem_cgroup *memcg) +static bool mem_cgroup_oom_trylock(struct mem_cgroup *memcg) { struct mem_cgroup *iter, *failed = NULL; + spin_lock(&memcg_oom_lock); + for_each_mem_cgroup_tree(iter, memcg) { if (iter->oom_lock) { /* @@ -2098,33 +2101,33 @@ static bool mem_cgroup_oom_lock(struct mem_cgroup *memcg) iter->oom_lock = true; } - if (!failed) - return true; - - /* -* OK, we failed to lock the whole subtree so we have to clean up -* what we set up to the failing subtree -*/ - for_each_mem_cgroup_tree(iter, memcg) { - if (iter == failed) { - mem_cgroup_iter_break(memcg, iter); - break; + if (failed) { + /* +* OK, we failed to lock the whole subtree so we have +* to clean up what we set up to the failing subtree +*/ + for_each_mem_cgroup_tree(iter, memcg) { + if (iter == failed) { + mem_cgroup_iter_break(memcg, iter); + break; + } + iter->oom_lock = false; } - iter->oom_lock = false; - } - return false; + } + + spin_unlock(&memcg_oom_lock); + + return !failed; } -/* - * Has to be called with memcg_oom_lock - */ -static int mem_cgroup_oom_unlock(struct mem_cgroup *memcg) +static void mem_cgroup_oom_unlock(struct mem_cgroup *memcg) { struct mem_cgroup *iter; + spin_lock(&memcg_oom_lock); for_each_mem_cgroup_tree(iter, memcg) iter->oom_lock = false; - return 0; + spin_unlock(&memcg_oom_lock); } static void mem_cgroup_mark_under_oom(struct mem_cgroup *memcg) @@ -2148,7 +2151,6 @@ static void 
mem_cgroup_unmark_under_oom(struct mem_cgroup *memcg) atomic_add_unless(&iter->under_oom, -1, 0); } -static DEFINE_SPINLOCK(memcg_oom_lock); static DECLARE_WAIT_QUEUE_HEAD(memcg_oom_waitq); struct oom_wait_info { @@ -2195,45 +2197,52 @@ static bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask, int order) { struct oom_wait_info owait; - bool locked, need_to_kill; + bool locked; owait.memcg = memcg; owait.wait.fla
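
The trylock-with-rollback over a hierarchy that this patch cleans up
can be sketched in userspace. This is a single-threaded model under
stated assumptions: struct hierarchy, in_subtree() and the parent[]
array are hypothetical stand-ins for for_each_mem_cgroup_tree(), and
the memcg_oom_lock spinlock that makes the real walk atomic is left
out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_GROUPS 4

/* Hypothetical flattened hierarchy: parent[i] is group i's parent, -1 for the root. */
struct hierarchy {
	int parent[NR_GROUPS];
	bool oom_lock[NR_GROUPS];
};

/* True if node is root itself or one of its descendants. */
static bool in_subtree(const struct hierarchy *h, int node, int root)
{
	while (node != -1) {
		if (node == root)
			return true;
		node = h->parent[node];
	}
	return false;
}

/* Lock every group under root, or lock nothing and report failure. */
static bool oom_trylock(struct hierarchy *h, int root)
{
	int failed = -1;
	int i;

	assert(h != NULL);
	for (i = 0; i < NR_GROUPS; i++) {
		if (!in_subtree(h, i, root))
			continue;
		if (h->oom_lock[i]) {
			/* somebody else already handles an OOM here */
			failed = i;
			break;
		}
		h->oom_lock[i] = true;
	}

	if (failed != -1) {
		/* undo what was set before reaching the locked group */
		for (i = 0; i < failed; i++) {
			if (in_subtree(h, i, root))
				h->oom_lock[i] = false;
		}
	}
	return failed == -1;
}

/* Release every group under root. */
static void oom_unlock(struct hierarchy *h, int root)
{
	int i;

	assert(h != NULL);
	for (i = 0; i < NR_GROUPS; i++) {
		if (in_subtree(h, i, root))
			h->oom_lock[i] = false;
	}
}
```

The rollback only touches groups visited before the failing one. Each
of those was unlocked when the walk reached it, so clearing them
cannot release a lock held by another OOM handler.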
[patch 2/7] arch: mm: do not invoke OOM killer on kernel fault OOM
Kernel faults are expected to handle OOM conditions gracefully (gup, uaccess etc.), so they should never invoke the OOM killer. Reserve this for faults triggered in user context when it is the only option. Most architectures already do this, fix up the remaining few. Signed-off-by: Johannes Weiner Reviewed-by: Michal Hocko Acked-by: KOSAKI Motohiro --- arch/arm/mm/fault.c | 14 +++--- arch/arm64/mm/fault.c | 14 +++--- arch/avr32/mm/fault.c | 2 +- arch/mips/mm/fault.c | 2 ++ arch/um/kernel/trap.c | 2 ++ arch/unicore32/mm/fault.c | 14 +++--- 6 files changed, 26 insertions(+), 22 deletions(-) diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c index c97f794..217bcbf 100644 --- a/arch/arm/mm/fault.c +++ b/arch/arm/mm/fault.c @@ -349,6 +349,13 @@ retry: if (likely(!(fault & (VM_FAULT_ERROR | VM_FAULT_BADMAP | VM_FAULT_BADACCESS return 0; + /* +* If we are in kernel mode at this point, we +* have no context to handle this fault with. +*/ + if (!user_mode(regs)) + goto no_context; + if (fault & VM_FAULT_OOM) { /* * We ran out of memory, call the OOM killer, and return to @@ -359,13 +366,6 @@ retry: return 0; } - /* -* If we are in kernel mode at this point, we -* have no context to handle this fault with. -*/ - if (!user_mode(regs)) - goto no_context; - if (fault & VM_FAULT_SIGBUS) { /* * We had some memory, but were unable to diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c index 0ecac89..dab1cfd 100644 --- a/arch/arm64/mm/fault.c +++ b/arch/arm64/mm/fault.c @@ -294,6 +294,13 @@ retry: VM_FAULT_BADACCESS return 0; + /* +* If we are in kernel mode at this point, we have no context to +* handle this fault with. +*/ + if (!user_mode(regs)) + goto no_context; + if (fault & VM_FAULT_OOM) { /* * We ran out of memory, call the OOM killer, and return to @@ -304,13 +311,6 @@ retry: return 0; } - /* -* If we are in kernel mode at this point, we have no context to -* handle this fault with. 
-*/ - if (!user_mode(regs)) - goto no_context; - if (fault & VM_FAULT_SIGBUS) { /* * We had some memory, but were unable to successfully fix up diff --git a/arch/avr32/mm/fault.c b/arch/avr32/mm/fault.c index b2f2d2d..2ca27b0 100644 --- a/arch/avr32/mm/fault.c +++ b/arch/avr32/mm/fault.c @@ -228,9 +228,9 @@ no_context: */ out_of_memory: up_read(&mm->mmap_sem); - pagefault_out_of_memory(); if (!user_mode(regs)) goto no_context; + pagefault_out_of_memory(); return; do_sigbus: diff --git a/arch/mips/mm/fault.c b/arch/mips/mm/fault.c index 85df1cd..94d3a31 100644 --- a/arch/mips/mm/fault.c +++ b/arch/mips/mm/fault.c @@ -241,6 +241,8 @@ out_of_memory: * (which will retry the fault, or kill us if we got oom-killed). */ up_read(&mm->mmap_sem); + if (!user_mode(regs)) + goto no_context; pagefault_out_of_memory(); return; diff --git a/arch/um/kernel/trap.c b/arch/um/kernel/trap.c index 089f398..b2f5adf 100644 --- a/arch/um/kernel/trap.c +++ b/arch/um/kernel/trap.c @@ -124,6 +124,8 @@ out_of_memory: * (which will retry the fault, or kill us if we got oom-killed). */ up_read(&mm->mmap_sem); + if (!is_user) + goto out_nosemaphore; pagefault_out_of_memory(); return 0; } diff --git a/arch/unicore32/mm/fault.c b/arch/unicore32/mm/fault.c index f9b5c10..8ed3c45 100644 --- a/arch/unicore32/mm/fault.c +++ b/arch/unicore32/mm/fault.c @@ -278,6 +278,13 @@ retry: (VM_FAULT_ERROR | VM_FAULT_BADMAP | VM_FAULT_BADACCESS return 0; + /* +* If we are in kernel mode at this point, we +* have no context to handle this fault with. +*/ + if (!user_mode(regs)) + goto no_context; + if (fault & VM_FAULT_OOM) { /* * We ran out of memory, call the OOM killer, and return to @@ -288,13 +295,6 @@ retry: return 0; } - /* -* If we are in kernel mode at this point, we -* have no context to handle this fault with. 
-*/ - if (!user_mode(regs)) - goto no_context; - if (fault & VM_FAULT_SIGBUS) { /* * We had some memory, but were unable to -- 1.8.3.2
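
The effect of moving the user_mode() check ahead of the VM_FAULT_OOM
branch can be shown with a small decision function. A sketch only: the
enum, the flag values and handle_fault_error() are hypothetical, and
the last branch lumps together whatever other signals the real
handlers send.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative values only; not the kernel's definitions. */
#define VM_FAULT_OOM     0x1u
#define VM_FAULT_SIGBUS  0x2u

enum fault_action {
	ACT_NO_CONTEXT,		/* kernel fault: exception fixup or oops */
	ACT_OOM_KILLER,		/* user fault: pagefault_out_of_memory() */
	ACT_SIGBUS,
	ACT_OTHER_SIGNAL,
};

/* Decide what to do once handle_mm_fault() has reported an error. */
static enum fault_action handle_fault_error(unsigned int fault, bool user_mode)
{
	assert(fault != 0);	/* only reached on error paths */

	/* Checked first, so a kernel-mode OOM never reaches the OOM killer. */
	if (!user_mode)
		return ACT_NO_CONTEXT;

	if (fault & VM_FAULT_OOM)
		return ACT_OOM_KILLER;
	if (fault & VM_FAULT_SIGBUS)
		return ACT_SIGBUS;
	return ACT_OTHER_SIGNAL;
}
```

With the old ordering the VM_FAULT_OOM branch came first, so a kernel
fault that ran out of memory would have invoked the OOM killer before
the kernel-mode check was ever reached.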
[patch 7/7] mm: memcg: do not trap chargers with full callstack on OOM
The memcg OOM handling is incredibly fragile and can deadlock. When a
task fails to charge memory, it invokes the OOM killer and loops right
there in the charge code until it succeeds. Comparably, any other
task that enters the charge path at this point will go to a waitqueue
right then and there and sleep until the OOM situation is resolved.
The problem is that these tasks may hold filesystem locks and the
mmap_sem; locks that the selected OOM victim may need to exit.

For example, in one reported case, the task invoking the OOM killer
was about to charge a page cache page during a write(), which holds
the i_mutex. The OOM killer selected a task that was just entering
truncate() and trying to acquire the i_mutex:

OOM invoking task:
[] mem_cgroup_handle_oom+0x241/0x3b0
[] T.1146+0x5ab/0x5c0
[] mem_cgroup_cache_charge+0xbe/0xe0
[] add_to_page_cache_locked+0x4c/0x140
[] add_to_page_cache_lru+0x22/0x50
[] grab_cache_page_write_begin+0x8b/0xe0
[] ext3_write_begin+0x88/0x270
[] generic_file_buffered_write+0x116/0x290
[] __generic_file_aio_write+0x27c/0x480
[] generic_file_aio_write+0x76/0xf0           # takes ->i_mutex
[] do_sync_write+0xea/0x130
[] vfs_write+0xf3/0x1f0
[] sys_write+0x51/0x90
[] system_call_fastpath+0x18/0x1d
[] 0x

OOM kill victim:
[] do_truncate+0x58/0xa0              # takes i_mutex
[] do_last+0x250/0xa30
[] path_openat+0xd7/0x440
[] do_filp_open+0x49/0xa0
[] do_sys_open+0x106/0x240
[] sys_open+0x20/0x30
[] system_call_fastpath+0x18/0x1d
[] 0x

The OOM handling task will retry the charge indefinitely while the OOM
killed task is not releasing any resources.

A similar scenario can happen when the kernel OOM killer for a memcg
is disabled and a userspace task is in charge of resolving OOM
situations. In this case, ALL tasks that enter the OOM path will be
made to sleep on the OOM waitqueue and wait for userspace to free
resources or increase the group's limit. But a userspace OOM handler
is prone to deadlock itself on the locks held by the waiting tasks.
For example one of the sleeping tasks may be stuck in a brk() call
with the mmap_sem held for writing but the userspace handler, in order
to pick an optimal victim, may need to read files from /proc/, which
tries to acquire the same mmap_sem for reading and deadlocks.

This patch changes the way tasks behave after detecting a memcg OOM
and makes sure nobody loops or sleeps with locks held:

1. When OOMing in a user fault, invoke the OOM killer and restart the
   fault instead of looping on the charge attempt. This way, the OOM
   victim can not get stuck on locks the looping task may hold.

2. When OOMing in a user fault but somebody else is handling it
   (either the kernel OOM killer or a userspace handler), don't go to
   sleep in the charge context. Instead, remember the OOMing memcg in
   the task struct and then fully unwind the page fault stack with
   -ENOMEM. pagefault_out_of_memory() will then call back into the
   memcg code to check if the -ENOMEM came from the memcg, and then
   either put the task to sleep on the memcg's OOM waitqueue or just
   restart the fault. The OOM victim can no longer get stuck on any
   lock a sleeping task may hold.

Reported-by:
Reported-by: azurIt
Debugged-by: Michal Hocko
Signed-off-by: Johannes Weiner
---
 include/linux/memcontrol.h |  21 +++
 include/linux/sched.h      |   4 ++
 mm/memcontrol.c            | 154 +++--
 mm/memory.c                |   3 +
 mm/oom_kill.c              |   7 ++-
 5 files changed, 140 insertions(+), 49 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 9c449c1..cb84058 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -131,6 +131,10 @@ extern void mem_cgroup_replace_page_cache(struct page *oldpage, * * Toggle whether a failed memcg charge should invoke the OOM killer * or just return -ENOMEM. Returns the previous toggle state. + * + * NOTE: Any path that enables the OOM killer before charging must + * call mem_cgroup_oom_synchronize() afterward to finalize the + * OOM handling and clean up. 
*/ static inline bool mem_cgroup_toggle_oom(bool new) { @@ -156,6 +160,13 @@ static inline void mem_cgroup_disable_oom(void) WARN_ON(old == false); } +static inline bool task_in_memcg_oom(struct task_struct *p) +{ + return p->memcg_oom.in_memcg_oom; +} + +bool mem_cgroup_oom_synchronize(void); + #ifdef CONFIG_MEMCG_SWAP extern int do_swap_account; #endif @@ -392,6 +403,16 @@ static inline void mem_cgroup_disable_oom(void) { } +static inline bool task_in_memcg_oom(struct task_struct *p) +{ + return false; +} + +static inline bool mem_cgroup_oom_synchronize(void) +{ + return false; +} + static inline void mem_cgroup_inc_page_stat(struct page *page, enum mem_cgroup_page_stat_item idx) { diff --git a/include/linux/sched.h b/include/l
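
The two-step handling described in the changelog can be modelled in
userspace. This sketch follows only the description above, not the
mm/memcontrol.c hunks, which are not shown here. All structs, fields
and helpers below are hypothetical, and the sleep on the OOM waitqueue
is reduced to a comment.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one memory cgroup. */
struct memcg_model {
	bool oom_locked;	/* somebody else is already handling the OOM */
	int kills;		/* how often this model "ran the OOM killer" */
};

/* Hypothetical model of the per-task OOM state. */
struct task_model {
	bool in_memcg_oom;		/* a charge failed with OOM in a user fault */
	struct memcg_model *wait_on;	/* group to wait for after unwinding */
	int locks_held;			/* locks held by the fault stack */
	int slept_with_locks;		/* would-be deadlocks; has to stay 0 */
};

/* Step 1, charge context: record state and fail, never sleep or loop. */
static int charge_failed(struct task_model *t, struct memcg_model *memcg)
{
	t->in_memcg_oom = true;
	if (memcg->oom_locked) {
		t->wait_on = memcg;	/* wait later, once the stack is unwound */
	} else {
		memcg->kills++;		/* handle it ourselves, then restart */
		t->wait_on = NULL;
	}
	return -ENOMEM;
}

/* Step 2, after unwinding: returns true if the failure was a memcg OOM. */
static bool oom_synchronize(struct task_model *t)
{
	if (!t->in_memcg_oom)
		return false;

	if (t->wait_on) {
		if (t->locks_held)
			t->slept_with_locks++;
		/* the real code would sleep on the memcg's OOM waitqueue here */
		t->wait_on = NULL;
	}
	t->in_memcg_oom = false;
	return true;		/* caller restarts the fault */
}

/* One user fault whose charge hits the limit. */
static bool user_fault(struct task_model *t, struct memcg_model *memcg)
{
	int ret;

	t->locks_held = 1;	/* e.g. mmap_sem taken by the fault handler */
	ret = charge_failed(t, memcg);
	t->locks_held = 0;	/* unwinding with -ENOMEM drops every lock */
	assert(ret == -ENOMEM);

	return oom_synchronize(t);
}
```

The point of the split is visible in user_fault(): by the time
oom_synchronize() may wait, locks_held is already back to zero, so
neither the OOM victim nor a userspace handler can block on a lock
owned by the waiter.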