Re: BUG_ON() in workingset_node_shadows_dec() triggers
On Thu, Oct 06, 2016 at 04:59:20PM -0700, Linus Torvalds wrote:
> We should just switch BUG() over and be done with it. The whole point
> is that since it should never trigger in the first place, the
> semantics of BUG() should never matter.
>
> And if you have some code that depends on the semantics of BUG(), that
> code is buggy crap *by*definition*.

I totally agree with this. If a developer writes BUG() somewhere, it means he doesn't see how it is possible to end up in that situation. Thus we cannot hope that the BUG() call does anything right to handle what the code's author didn't expect to happen. It just means "try to limit the risks, but I don't really know which ones".

Also, we won't make things worse. Where people currently get an oops, they'll get one or more warnings instead. The side effects (lockups, panic, etc.) will be more or less the same, but many of us already don't want to continue after an oops, and despite this our systems work fine, so I don't see why anyone would suffer from such a change. Some developers may even get more details about issues than they could in the past.

Willy
Re: [PATCH 0/2 v2] userns: show current values of user namespace counters
Hello Eric,

What do you think about this series? It should be useful to know the current usage of user counters.

Thanks,
Andrei

On Mon, Aug 15, 2016 at 01:10:20PM -0700, Andrei Vagin wrote:
> Recently Eric added user namespace counters, a feature that allows
> limiting the number of various kernel objects a user can create. These
> limits are set via /proc/sys/user/ sysctls on a per-user-namespace
> basis and are applicable to all users in that namespace.
>
> User namespace counters are not in the upstream tree yet; you can find
> them in Eric's tree:
> https://git.kernel.org/cgit/linux/kernel/git/ebiederm/user-namespace.git/log/?h=for-testing
>
> This patch adds /proc//userns_counts files to provide the current
> usage of user namespace counters.
>
> > cat /proc/813/userns_counts
> user_namespaces 101000 1
> pid_namespaces 101000 1
> ipc_namespaces 101000 4
> net_namespaces 101000 2
> mnt_namespaces 101000 5
> mnt_namespaces 10 1
>
> The meanings of the columns are as follows, from left to right:
>
> Name   Object name
> UID    User ID
> Usage  Current usage
>
> The full documentation is in the second patch.
>
> v2: - describe this file in Documentation/filesystems/proc.txt
>     - move and rename into /proc//userns_counts
>
> Cc: Serge Hallyn
> Cc: Kees Cook
> Cc: "Eric W. Biederman"
> Signed-off-by: Andrei Vagin
>
> Andrei Vagin (1):
>   kernel: show current values of user namespace counters
>
> Kirill Kolyshkin (1):
>   Documentation: describe /proc//userns_counts
>
>  Documentation/filesystems/proc.txt |  30 +++
>  fs/proc/array.c                    |  55
>  fs/proc/base.c                     |   1 +
>  fs/proc/internal.h                 |   1 +
>  include/linux/user_namespace.h     |   8 +++
>  kernel/ucount.c                    | 102 +
>  6 files changed, 197 insertions(+)
>
> --
> 2.5.5
Re: [PATCH 30/54] md/raid5: Delete two error messages for a failed memory allocation
On 10/06/2016 11:30 AM, SF Markus Elfring wrote:
> From: Markus Elfring
> Date: Wed, 5 Oct 2016 09:43:40 +0200
>
> Omit extra messages for a memory allocation failure in this function.
>
> Link: http://events.linuxfoundation.org/sites/events/files/slides/LCJ16-Refactor_Strings-WSang_0.pdf
> Signed-off-by: Markus Elfring
> ---
>  drivers/md/raid5.c | 13 +++--
>  1 file changed, 3 insertions(+), 10 deletions(-)
>
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index d864871..ef180c0 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -6613,12 +6613,9 @@ static struct r5conf *setup_conf(struct mddev *mddev)
>  	memory = conf->min_nr_stripes * (sizeof(struct stripe_head) +
>  		 max_disks * ((sizeof(struct bio) + PAGE_SIZE))) / 1024;
>  	atomic_set(&conf->empty_inactive_list_nr, NR_STRIPE_HASH_LOCKS);
> -	if (grow_stripes(conf, conf->min_nr_stripes)) {
> -		printk(KERN_ERR
> -			"md/raid:%s: couldn't allocate %dkB for buffers\n",
> -			mdname(mddev), memory);
> +	if (grow_stripes(conf, conf->min_nr_stripes))
>  		goto free_conf;
> -	} else
> +	else
>  		printk(KERN_INFO "md/raid:%s: allocated %dkB\n",
>  			mdname(mddev), memory);
>  	/*
> @@ -6640,12 +6637,8 @@ static struct r5conf *setup_conf(struct mddev *mddev)
>  	sprintf(pers_name, "raid%d", mddev->new_level);
>  	conf->thread = md_register_thread(raid5d, mddev, pers_name);
> -	if (!conf->thread) {
> -		printk(KERN_ERR
> -			"md/raid:%s: couldn't allocate thread.\n",
> -			mdname(mddev));
> +	if (!conf->thread)
>  		goto free_conf;
> -	}
>
>  	return conf;
>
>  free_conf:

Actually I prefer having error messages, especially if you have several possible failures all leading to the same return value. Without them, debugging becomes really hard.

Cheers,
Hannes
--
Dr. Hannes Reinecke		Teamlead Storage & Networking
h...@suse.de			+49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)
Re: scripts/coccicheck: Update for a comment?
On Fri, 7 Oct 2016, SF Markus Elfring wrote:
> Hello,
>
> Information from a commit like "docs: sphinxify coccinelle.txt and add it
> to dev-tools" also caught my software development attention.
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/Documentation/coccinelle.txt?id=4b9033a33494ec9154d63e706e9e47f7eb3fd59e
>
> Did other information in a comment in the script "coccicheck" become
> outdated because of such changes to the documentation format?
> https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/tree/scripts/coccicheck?id=c802e87fbe2d4dd58982d01b3c39bc5a781223aa#n4

How about submitting a patch to fix the problem?

julia
scripts/coccicheck: Update for a comment?
Hello,

Information from a commit like "docs: sphinxify coccinelle.txt and add it to dev-tools" also caught my software development attention.
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/Documentation/coccinelle.txt?id=4b9033a33494ec9154d63e706e9e47f7eb3fd59e

Did other information in a comment in the script "coccicheck" become outdated because of such changes to the documentation format?
https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/tree/scripts/coccicheck?id=c802e87fbe2d4dd58982d01b3c39bc5a781223aa#n4

Regards,
Markus
[PATCH 2/4] mm: prevent double decrease of nr_reserved_highatomic
There is a race between page freeing and unreserving highatomic pageblocks:

  CPU 0					CPU 1
  free_hot_cold_page
    mt = get_pfnblock_migratetype
    set_pcppage_migratetype(page, mt)
					unreserve_highatomic_pageblock
					  spin_lock_irqsave(&zone->lock)
					  move_freepages_block
					  set_pageblock_migratetype(page)
					  spin_unlock_irqrestore(&zone->lock)
  free_pcppages_bulk
    __free_one_page(mt) <- mt is stale

Because of this race, a page freed on CPU 0 can end up on a non-highorderatomic free list since the pageblock's type has changed. By that, the unreserve logic of highorderatomic can decrease the reserved count for the same pageblock several times, creating a mismatch between nr_reserved_highatomic and the number of reserved pageblocks. So, this patch verifies whether the pageblock is highatomic and decreases the count only if the pageblock is highatomic.

Signed-off-by: Minchan Kim
---
 mm/page_alloc.c | 24 ++--
 1 file changed, 18 insertions(+), 6 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e7cbb3cc22fa..d110cd640264 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2133,13 +2133,25 @@ static void unreserve_highatomic_pageblock(const struct alloc_context *ac)
 				continue;
 
 			/*
-			 * It should never happen but changes to locking could
-			 * inadvertently allow a per-cpu drain to add pages
-			 * to MIGRATE_HIGHATOMIC while unreserving so be safe
-			 * and watch for underflows.
+			 * In the page freeing path, the migratetype change is
+			 * racy, so we can encounter several free pages in a
+			 * pageblock in this loop although we changed the
+			 * pageblock type from highatomic to ac->migratetype.
+			 * So we should adjust the count only once.
 			 */
-			zone->nr_reserved_highatomic -= min(pageblock_nr_pages,
-				zone->nr_reserved_highatomic);
+			if (get_pageblock_migratetype(page) ==
+						MIGRATE_HIGHATOMIC) {
+				/*
+				 * It should never happen but changes to
+				 * locking could inadvertently allow a per-cpu
+				 * drain to add pages to MIGRATE_HIGHATOMIC
+				 * while unreserving so be safe and watch for
+				 * underflows.
+				 */
+				zone->nr_reserved_highatomic -= min(
+						pageblock_nr_pages,
+						zone->nr_reserved_highatomic);
+			}
 
 			/*
 			 * Convert to ac->migratetype and avoid the normal
-- 
2.7.4
[PATCH 3/4] mm: unreserve highatomic free pages fully before OOM
After fixing the race in the highatomic page count, I still encountered OOM with a lot of free memory reserved as highatomic. One reason in my testing was that we unreserve free pages only if reclaim makes progress; otherwise, we never get a chance to unreserve. Another problem was that it doesn't guarantee unreserving every page of the highatomic pageblocks, because it releases just *one* pageblock, which could hold only a few free pages, so another context can easily steal it; thus the process stuck in direct reclaim can finally encounter OOM although there are free pages which could be unreserved.

This patch changes the logic so that it unreserves pageblocks proportionally to no_progress_loops. IOW, on the first reclaim retry it will try to unreserve one pageblock; on the Nth retry it will try to unreserve N/MAX_RECLAIM_RETRIES of the reserved pageblocks, and finally all reserved pageblocks before the OOM.

Signed-off-by: Minchan Kim
---
 mm/page_alloc.c | 57 -
 1 file changed, 44 insertions(+), 13 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d110cd640264..eeb047bb0e9d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -71,6 +71,12 @@
 #include
 #include "internal.h"
 
+/*
+ * Maximum number of reclaim retries without any progress before the OOM
+ * killer is considered as the only way to move forward.
+ */
+#define MAX_RECLAIM_RETRIES 16
+
 /* prevent >1 _updater_ of zone percpu pageset ->high and ->batch fields */
 static DEFINE_MUTEX(pcp_batch_high_lock);
 #define MIN_PERCPU_PAGELIST_FRACTION	(8)
@@ -2107,7 +2113,8 @@ static void reserve_highatomic_pageblock(struct page *page, struct zone *zone,
  * intense memory pressure but failed atomic allocations should be easier
  * to recover from than an OOM.
  */
-static void unreserve_highatomic_pageblock(const struct alloc_context *ac)
+static int unreserve_highatomic_pageblock(const struct alloc_context *ac,
+					int no_progress_loops)
 {
 	struct zonelist *zonelist = ac->zonelist;
 	unsigned long flags;
@@ -2115,15 +2122,40 @@ static void unreserve_highatomic_pageblock(const struct alloc_context *ac)
 	struct zone *zone;
 	struct page *page;
 	int order;
+	int unreserved_pages = 0;
 
 	for_each_zone_zonelist_nodemask(zone, z, zonelist, ac->high_zoneidx,
 						ac->nodemask) {
-		/* Preserve at least one pageblock */
-		if (zone->nr_reserved_highatomic <= pageblock_nr_pages)
+		unsigned long unreserve_pages_max;
+
+		/*
+		 * Try to preserve at least one pageblock but use it up
+		 * before OOM kill.
+		 */
+		if (no_progress_loops < MAX_RECLAIM_RETRIES &&
+			zone->nr_reserved_highatomic <= pageblock_nr_pages)
 			continue;
 		spin_lock_irqsave(&zone->lock, flags);
-		for (order = 0; order < MAX_ORDER; order++) {
+		if (no_progress_loops < MAX_RECLAIM_RETRIES) {
+			unreserve_pages_max = no_progress_loops *
+					zone->nr_reserved_highatomic /
+					MAX_RECLAIM_RETRIES;
+			unreserve_pages_max = max(unreserve_pages_max,
+					pageblock_nr_pages);
+		} else {
+			/*
+			 * By race with page free functions, !highatomic
+			 * pageblocks can have a free page in the highatomic
+			 * migratetype free list. So if we are about to
+			 * kill some process, unreserve every free page
+			 * in highorderatomic.
+			 */
+			unreserve_pages_max = -1UL;
+		}
+
+		for (order = 0; order < MAX_ORDER &&
+				unreserve_pages_max > 0; order++) {
 			struct free_area *area = &(zone->free_area[order]);
 
 			page = list_first_entry_or_null(
@@ -2151,6 +2183,9 @@ static void unreserve_highatomic_pageblock(const struct alloc_context *ac)
 				zone->nr_reserved_highatomic -= min(
 						pageblock_nr_pages,
 						zone->nr_reserved_highatomic);
+				unreserve_pages_max -= min(pageblock_nr_pages,
+						zone->nr_reserved_highatomic);
+				unreserved_pages += 1 << page_order(page);
 			}
 
 			/*
@@ -2164,11 +2199,11 @@ static void unreserve_highatomic_pageblock(const struct alloc_context *ac)
 			 */
[PATCH 1/4] mm: adjust reserved highatomic count
In the page freeing path, migratetype is racy, so a highorderatomic page can be freed into a non-highorderatomic free list. If that page is allocated, the VM can change the pageblock from highorderatomic to something else. In that case, we should adjust nr_reserved_highatomic; otherwise, the VM cannot reserve any more highorderatomic pageblocks although it hasn't reached the 1% limit, which means high-order atomic allocation failures would become more likely. So, this patch decreases the account as well as the migratetype if it was MIGRATE_HIGHATOMIC.

Signed-off-by: Minchan Kim
---
 mm/page_alloc.c | 44 ++--
 1 file changed, 38 insertions(+), 6 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 55ad0229ebf3..e7cbb3cc22fa 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -282,6 +282,9 @@ EXPORT_SYMBOL(nr_node_ids);
 EXPORT_SYMBOL(nr_online_nodes);
 #endif
 
+static void dec_highatomic_pageblock(struct zone *zone, struct page *page,
+					int migratetype);
+
 int page_group_by_mobility_disabled __read_mostly;
 
 #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
@@ -1935,7 +1938,14 @@ static void change_pageblock_range(struct page *pageblock_page,
 	int nr_pageblocks = 1 << (start_order - pageblock_order);
 
 	while (nr_pageblocks--) {
-		set_pageblock_migratetype(pageblock_page, migratetype);
+		if (get_pageblock_migratetype(pageblock_page) !=
+					MIGRATE_HIGHATOMIC)
+			set_pageblock_migratetype(pageblock_page,
+					migratetype);
+		else
+			dec_highatomic_pageblock(page_zone(pageblock_page),
+					pageblock_page,
+					migratetype);
 		pageblock_page += pageblock_nr_pages;
 	}
 }
@@ -1996,8 +2006,14 @@ static void steal_suitable_fallback(struct zone *zone, struct page *page,
 
 	/* Claim the whole block if over half of it is free */
 	if (pages >= (1 << (pageblock_order-1)) ||
-			page_group_by_mobility_disabled)
-		set_pageblock_migratetype(page, start_type);
+			page_group_by_mobility_disabled) {
+		int mt = get_pageblock_migratetype(page);
+
+		if (mt != MIGRATE_HIGHATOMIC)
+			set_pageblock_migratetype(page, start_type);
+		else
+			dec_highatomic_pageblock(zone, page, start_type);
+	}
 }
 
 /*
@@ -2037,6 +2053,17 @@ int find_suitable_fallback(struct free_area *area, unsigned int order,
 	return -1;
 }
 
+static void dec_highatomic_pageblock(struct zone *zone, struct page *page,
+					int migratetype)
+{
+	if (zone->nr_reserved_highatomic <= pageblock_nr_pages)
+		return;
+
+	zone->nr_reserved_highatomic -= min(pageblock_nr_pages,
+					zone->nr_reserved_highatomic);
+	set_pageblock_migratetype(page, migratetype);
+}
+
 /*
  * Reserve a pageblock for exclusive use of high-order atomic allocations if
  * there are no empty page blocks that contain a page with a suitable order
@@ -2555,9 +2582,14 @@ int __isolate_free_page(struct page *page, unsigned int order)
 		struct page *endpage = page + (1 << order) - 1;
 		for (; page < endpage; page += pageblock_nr_pages) {
 			int mt = get_pageblock_migratetype(page);
-			if (!is_migrate_isolate(mt) && !is_migrate_cma(mt))
-				set_pageblock_migratetype(page,
-							  MIGRATE_MOVABLE);
+			if (!is_migrate_isolate(mt) && !is_migrate_cma(mt)) {
+				if (mt != MIGRATE_HIGHATOMIC)
+					set_pageblock_migratetype(page,
+							MIGRATE_MOVABLE);
+				else
+					dec_highatomic_pageblock(zone, page,
+							MIGRATE_MOVABLE);
+			}
 		}
 	}
-- 
2.7.4
[PATCH 4/4] mm: skip to reserve pageblock crossed zone boundary for HIGHATOMIC
In CONFIG_SPARSEMEM, the VM shares a mem_section's pageblock_flags between two zones if the pageblock crosses zone boundaries. That means a zone lock cannot protect against races in pageblock migratetype changes. It might not have been a problem because migratetype was inherently racy, but with the introduction of CMA that was no longer true, and it has been fixed. (I hope it will be solved by a more general approach, however...) And now it's time for MIGRATE_HIGHATOMIC. More importantly, the HIGHATOMIC migratetype is not a big reserve in the system (i.e., 1%), so let's skip such crippled pageblocks so we can try to reserve the full 1% of free memory.

Debugged-by: Joonsoo Kim
Signed-off-by: Minchan Kim
---
 mm/page_alloc.c | 18 ++
 1 file changed, 18 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index eeb047bb0e9d..d76bb50baf61 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2098,6 +2098,24 @@ static void reserve_highatomic_pageblock(struct page *page, struct zone *zone,
 	mt = get_pageblock_migratetype(page);
 	if (mt != MIGRATE_HIGHATOMIC && !is_migrate_isolate(mt)
 	    && !is_migrate_cma(mt)) {
+		/*
+		 * If the pageblock crosses zone boundaries, we would need
+		 * both zone locks, but we don't want that complexity; the
+		 * highatomic reserve is small, so only reserve sane(?)
+		 * pageblocks that lie fully within this zone.
+		 */
+		unsigned long start_pfn, end_pfn;
+
+		start_pfn = page_to_pfn(page);
+		start_pfn = start_pfn & ~(pageblock_nr_pages - 1);
+
+		if (!zone_spans_pfn(zone, start_pfn))
+			goto out_unlock;
+
+		end_pfn = start_pfn + pageblock_nr_pages - 1;
+		if (!zone_spans_pfn(zone, end_pfn))
+			goto out_unlock;
+
 		zone->nr_reserved_highatomic += pageblock_nr_pages;
 		set_pageblock_migratetype(page, MIGRATE_HIGHATOMIC);
 		move_freepages_block(zone, page, MIGRATE_HIGHATOMIC);
-- 
2.7.4
[PATCH 0/4] use up highorder free pages before OOM
I got an OOM report from the production team with a v4.4 kernel. The system has enough free memory but fails to allocate an order-0 page and finally hits the OOM killer. I could reproduce it easily with my test. Look below: the reason is that the free pages (19M) of the DMA32 zone are reserved for HIGHATOMIC and are not unreserved before the OOM.

balloon invoked oom-killer: gfp_mask=0x24280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), order=0, oom_score_adj=0
balloon cpuset=/ mems_allowed=0
CPU: 1 PID: 8473 Comm: balloon Tainted: G W OE 4.8.0-rc7-00219-g3f74c9559583-dirty #3161
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
 88007f15bbc8 8138eb13 88007f15bd88 88005a72a4c0
 88007f15bc28 811d2d13 88007f15bc08 8146a5ca
 81c8df60 0015 0206
Call Trace:
 [] dump_stack+0x63/0x90
 [] dump_header+0x5c/0x1ce
 [] ? virtballoon_oom_notify+0x2a/0x80
 [] oom_kill_process+0x22e/0x400
 [] out_of_memory+0x1ac/0x210
 [] __alloc_pages_nodemask+0x101e/0x1040
 [] handle_mm_fault+0xa0a/0xbf0
 [] __do_page_fault+0x1dd/0x4d0
 [] trace_do_page_fault+0x43/0x130
 [] do_async_page_fault+0x1a/0xa0
 [] async_page_fault+0x28/0x30
Mem-Info:
 active_anon:383949 inactive_anon:106724 isolated_anon:0
 active_file:15 inactive_file:44 isolated_file:0
 unevictable:0 dirty:0 writeback:24 unstable:0
 slab_reclaimable:2483 slab_unreclaimable:3326
 mapped:0 shmem:0 pagetables:1906 bounce:0
 free:6898 free_pcp:291 free_cma:0
Node 0 active_anon:1535796kB inactive_anon:426896kB active_file:60kB inactive_file:176kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:0kB dirty:0kB writeback:96kB shmem:0kB writeback_tmp:0kB unstable:0kB pages_scanned:1418 all_unreclaimable? no
DMA free:8188kB min:44kB low:56kB high:68kB active_anon:7648kB inactive_anon:0kB active_file:0kB inactive_file:4kB unevictable:0kB writepending:0kB present:15992kB managed:15908kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:20kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
lowmem_reserve[]: 0 1952 1952 1952
DMA32 free:19404kB min:5628kB low:7624kB high:9620kB active_anon:1528148kB inactive_anon:426896kB active_file:60kB inactive_file:420kB unevictable:0kB writepending:96kB present:2080640kB managed:2030092kB mlocked:0kB slab_reclaimable:9932kB slab_unreclaimable:13284kB kernel_stack:2496kB pagetables:7624kB bounce:0kB free_pcp:900kB local_pcp:112kB free_cma:0kB
lowmem_reserve[]: 0 0 0 0
DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 2*4096kB (H) = 8192kB
DMA32: 7*4kB (H) 8*8kB (H) 30*16kB (H) 31*32kB (H) 14*64kB (H) 9*128kB (H) 2*256kB (H) 2*512kB (H) 4*1024kB (H) 5*2048kB (H) 0*4096kB = 19484kB
51131 total pagecache pages
50795 pages in swap cache
Swap cache stats: add 3532405601, delete 3532354806, find 124289150/1822712228
Free swap = 8kB
Total swap = 255996kB
524158 pages RAM
0 pages HighMem/MovableOnly
12658 pages reserved
0 pages cma reserved
0 pages hwpoisoned

During the investigation, I found some problems with highatomic, so this patch set aims to solve them; the final goal is to unreserve all highatomic free pages before the OOM kill.

Patch 1 fixes an accounting bug in several places in the page allocator.
Patch 2 fixes an accounting bug caused by a subtle race between the freeing functions and unreserve_highatomic_pageblock().
Patch 3 changes the unreserve scheme to use up all reserved pages.
Patch 4 fixes an accounting bug caused by a mem_section shared by two zones.
Minchan Kim (4):
  mm: adjust reserved highatomic count
  mm: prevent double decrease of nr_reserved_highatomic
  mm: unreserve highatomic free pages fully before OOM
  mm: skip to reserve pageblock crossed zone boundary for HIGHATOMIC

 mm/page_alloc.c | 143 ++--
 1 file changed, 118 insertions(+), 25 deletions(-)

-- 
2.7.4
[PATCH] hwmon: fix platform_no_drv_owner.cocci warnings
No need to set .owner here. The core will do it.

Generated by: scripts/coccinelle/api/platform_no_drv_owner.cocci

Signed-off-by: Julia Lawall
Signed-off-by: Fengguang Wu
---
tree: https://github.com/0day-ci/linux Chris-Packham/hwmon-Add-tc654-driver/20161007-054116
head: 7b9f81e69fbc7077c55136daefe7546cf88925ae
commit: 7b9f81e69fbc7077c55136daefe7546cf88925ae [1/1] hwmon: Add tc654 driver

 tc654.c | 1 -
 1 file changed, 1 deletion(-)

--- a/drivers/hwmon/tc654.c
+++ b/drivers/hwmon/tc654.c
@@ -517,7 +517,6 @@ MODULE_DEVICE_TABLE(i2c, tc654_id);
 static struct i2c_driver tc654_driver = {
 	.driver = {
 		.name = "tc654",
-		.owner = THIS_MODULE,
 		.of_match_table = of_match_ptr(tc654_dt_match),
 	},
 	.probe = tc654_probe,
Re: [GIT PULL] MD update for 4.9
Mr. Li,

There is another thread in [linux-raid] discussing pre-fetches in the raid-6 AVX2 code. My testing implies that the prefetch distance is too short. In your new AVX512 code, it looks like there are 24 instructions, each with a latency of 1, between the prefetch and the actual memory load. I don't have an AVX512 CPU to try this on, but the prefetch might do better at a bigger distance. If I am not mistaken, it takes a lot longer than 24 clocks to fetch 4 cache lines.

Just a comment while the code is still fluid.

Doug Dumitru
EasyCo LLC

On Thu, Oct 6, 2016 at 5:38 PM, Shaohua Li wrote:
> Hi Linus,
> Please pull the MD update for 4.9. This update includes:
> - new AVX512 instruction based raid6 gen/recovery algorithm
> - a couple of md-cluster related bug fixes
> - fix for a potential deadlock
> - set nonrotational bit for raid arrays with SSD
> - set correct max_hw_sectors for raid5/6, which hopefully can improve
>   performance a little bit
> - other minor fixes
>
> Thanks,
> Shaohua
>
> The following changes since commit 7d1e042314619115153a0f6f06e4552c09a50e13:
>
>   Merge tag 'usercopy-v4.8-rc8' of
>   git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux (2016-09-20 17:11:19 -0700)
>
> are available in the git repository at:
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/shli/md.git tags/md/4.9-rc1
>
> for you to fetch changes up to bb086a89a406b5d877ee616f1490fcc81f8e1b2b:
>
>   md: set rotational bit (2016-10-03 10:20:27 -0700)
>
> Chao Yu (1):
>       raid5: fix to detect failure of register_shrinker
>
> Gayatri Kammela (5):
>       lib/raid6: Add AVX512 optimized gen_syndrome functions
>       lib/raid6: Add AVX512 optimized recovery functions
>       lib/raid6/test/Makefile: Add avx512 gen_syndrome and recovery functions
>       lib/raid6: Add AVX512 optimized xor_syndrome functions
>       raid6/test/test.c: bug fix: Specify aligned(alignment) attributes to
>       the char arrays
>
> Guoqing Jiang (9):
>       md-cluster: call md_kick_rdev_from_array once ack failed
>       md-cluster: use FORCEUNLOCK in lockres_free
>       md-cluster: remove some unnecessary dlm_unlock_sync
>       md: changes for MD_STILL_CLOSED flag
>       md-cluster: clean related infos of cluster
>       md-cluster: protect md_find_rdev_nr_rcu with rcu lock
>       md-cluster: convert the completion to wait queue
>       md-cluster: introduce dlm_lock_sync_interruptible to fix tasks hang
>       md-cluster: make resync lock also could be interruptted
>
> Shaohua Li (5):
>       raid5: allow arbitrary max_hw_sectors
>       md/bitmap: fix wrong cleanup
>       md: fix a potential deadlock
>       raid5: handle register_shrinker failure
>       md: set rotational bit
>
>  arch/x86/Makefile        |   5 +-
>  drivers/md/bitmap.c      |   4 +-
>  drivers/md/md-cluster.c  |  99 ++---
>  drivers/md/md.c          |  44 +++-
>  drivers/md/md.h          |   5 +-
>  drivers/md/raid5.c       |  11 +-
>  include/linux/raid/pq.h  |   4 +
>  lib/raid6/Makefile       |   2 +-
>  lib/raid6/algos.c        |  12 +
>  lib/raid6/avx512.c       | 569 +++
>  lib/raid6/recov_avx512.c | 388
>  lib/raid6/test/Makefile  |   5 +-
>  lib/raid6/test/test.c    |   7 +-
>  lib/raid6/x86.h          |  10 +
>  14 files changed, insertions(+), 54 deletions(-)
>  create mode 100644 lib/raid6/avx512.c
>  create mode 100644 lib/raid6/recov_avx512.c
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html

-- 
Doug Dumitru
EasyCo LLC
[PATCH V5 05/10] dmaengine: qcom_hidma: make pending_tre_count atomic
Getting ready for the MSI interrupts. The pending_tre_count is used in the interrupt handler to make sure all outstanding requests are serviced.

The driver will allocate 11 MSI interrupts. Each MSI interrupt can be assigned to a different CPU. Then we have a race condition for common variables, as they share the same interrupt handler with a different cause bit and can potentially be executed in parallel. Make this variable atomic so that it can be updated from multiple processor contexts.

Signed-off-by: Sinan Kaya
---
 drivers/dma/qcom/hidma.h     |  2 +-
 drivers/dma/qcom/hidma_dbg.c |  3 ++-
 drivers/dma/qcom/hidma_ll.c  | 13 ++---
 3 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/drivers/dma/qcom/hidma.h b/drivers/dma/qcom/hidma.h
index b209942..afaeb9a 100644
--- a/drivers/dma/qcom/hidma.h
+++ b/drivers/dma/qcom/hidma.h
@@ -58,7 +58,7 @@ struct hidma_lldev {
 	void __iomem *evca;			/* Event Channel address */
 	struct hidma_tre **pending_tre_list;	/* Pointers to pending TREs */
-	s32 pending_tre_count;			/* Number of TREs pending */
+	atomic_t pending_tre_count;		/* Number of TREs pending */
 	void *tre_ring;				/* TRE ring */
 	dma_addr_t tre_dma;			/* TRE ring to be shared with HW */
diff --git a/drivers/dma/qcom/hidma_dbg.c b/drivers/dma/qcom/hidma_dbg.c
index 3d83b99..3bdcb80 100644
--- a/drivers/dma/qcom/hidma_dbg.c
+++ b/drivers/dma/qcom/hidma_dbg.c
@@ -74,7 +74,8 @@ static void hidma_ll_devstats(struct seq_file *s, void *llhndl)
 	seq_printf(s, "tre_ring_handle=%pap\n", &lldev->tre_dma);
 	seq_printf(s, "tre_ring_size = 0x%x\n", lldev->tre_ring_size);
 	seq_printf(s, "tre_processed_off = 0x%x\n", lldev->tre_processed_off);
-	seq_printf(s, "pending_tre_count=%d\n", lldev->pending_tre_count);
+	seq_printf(s, "pending_tre_count=%d\n",
+		   atomic_read(&lldev->pending_tre_count));
 	seq_printf(s, "evca=%p\n", lldev->evca);
 	seq_printf(s, "evre_ring=%p\n", lldev->evre_ring);
 	seq_printf(s, "evre_ring_handle=%pap\n", &lldev->evre_dma);
diff --git a/drivers/dma/qcom/hidma_ll.c b/drivers/dma/qcom/hidma_ll.c
index ad20dfb..a4fc941 100644
--- a/drivers/dma/qcom/hidma_ll.c
+++ b/drivers/dma/qcom/hidma_ll.c
@@ -218,10 +218,9 @@ static int hidma_post_completed(struct hidma_lldev *lldev, int tre_iterator,
 	 * Keep track of pending TREs that SW is expecting to receive
 	 * from HW. We got one now. Decrement our counter.
 	 */
-	lldev->pending_tre_count--;
-	if (lldev->pending_tre_count < 0) {
+	if (atomic_dec_return(&lldev->pending_tre_count) < 0) {
 		dev_warn(lldev->dev, "tre count mismatch on completion");
-		lldev->pending_tre_count = 0;
+		atomic_set(&lldev->pending_tre_count, 0);
 	}
 	spin_unlock_irqrestore(&lldev->lock, flags);
@@ -321,7 +320,7 @@ void hidma_cleanup_pending_tre(struct hidma_lldev *lldev, u8 err_info,
 	u32 tre_read_off;

 	tre_iterator = lldev->tre_processed_off;
-	while (lldev->pending_tre_count) {
+	while (atomic_read(&lldev->pending_tre_count)) {
 		if (hidma_post_completed(lldev, tre_iterator, err_info,
 					 err_code))
 			break;
@@ -564,7 +563,7 @@ void hidma_ll_queue_request(struct hidma_lldev *lldev, u32 tre_ch)
 	tre->err_code = 0;
 	tre->err_info = 0;
 	tre->queued = 1;
-	lldev->pending_tre_count++;
+	atomic_inc(&lldev->pending_tre_count);
 	lldev->tre_write_offset = (lldev->tre_write_offset + HIDMA_TRE_SIZE) %
 				  lldev->tre_ring_size;
 	spin_unlock_irqrestore(&lldev->lock, flags);
@@ -670,7 +669,7 @@ int hidma_ll_setup(struct hidma_lldev *lldev)
 	u32 val;
 	u32 nr_tres = lldev->nr_tres;

-	lldev->pending_tre_count = 0;
+	atomic_set(&lldev->pending_tre_count, 0);
 	lldev->tre_processed_off = 0;
 	lldev->evre_processed_off = 0;
 	lldev->tre_write_offset = 0;
@@ -834,7 +833,7 @@ int hidma_ll_uninit(struct hidma_lldev *lldev)
 	tasklet_kill(&lldev->rst_task);
 	memset(lldev->trepool, 0, required_bytes);
 	lldev->trepool = NULL;
-	lldev->pending_tre_count = 0;
+	atomic_set(&lldev->pending_tre_count, 0);
 	lldev->tre_write_offset = 0;

 	rc = hidma_ll_reset(lldev);
-- 
1.9.1
[PATCH V5 08/10] dmaengine: qcom_hidma: protect common data structures
When MSI interrupts are supported, error and transfer interrupts can come from multiple processor contexts. Each error interrupt is an MSI interrupt.

If the channel is disabled by the first error interrupt, the remaining error interrupts will gracefully return in the interrupt handler. If an error is observed while servicing the completions in the success case, the posting of completions is aborted as soon as the channel-disabled state is observed. The error interrupt handler takes it from there and finishes the remaining completions. We don't want success and error messages to be delivered to the client in mixed order.

Signed-off-by: Sinan Kaya
---
 drivers/dma/qcom/hidma_ll.c | 44 +++-
 1 file changed, 11 insertions(+), 33 deletions(-)

diff --git a/drivers/dma/qcom/hidma_ll.c b/drivers/dma/qcom/hidma_ll.c
index 9d78c86..c4e8b64 100644
--- a/drivers/dma/qcom/hidma_ll.c
+++ b/drivers/dma/qcom/hidma_ll.c
@@ -198,13 +198,16 @@ static void hidma_ll_tre_complete(unsigned long arg)
 	}
 }

-static int hidma_post_completed(struct hidma_lldev *lldev, int tre_iterator,
-				u8 err_info, u8 err_code)
+static int hidma_post_completed(struct hidma_lldev *lldev, u8 err_info,
+				u8 err_code)
 {
 	struct hidma_tre *tre;
 	unsigned long flags;
+	u32 tre_iterator;

 	spin_lock_irqsave(&lldev->lock, flags);
+
+	tre_iterator = lldev->tre_processed_off;
 	tre = lldev->pending_tre_list[tre_iterator / HIDMA_TRE_SIZE];
 	if (!tre) {
 		spin_unlock_irqrestore(&lldev->lock, flags);
@@ -223,6 +226,9 @@ static int hidma_post_completed(struct hidma_lldev *lldev, int tre_iterator,
 		atomic_set(&lldev->pending_tre_count, 0);
 	}

+	HIDMA_INCREMENT_ITERATOR(tre_iterator, HIDMA_TRE_SIZE,
+				 lldev->tre_ring_size);
+	lldev->tre_processed_off = tre_iterator;
 	spin_unlock_irqrestore(&lldev->lock, flags);

 	tre->err_info = err_info;
@@ -244,13 +250,11 @@ static int hidma_post_completed(struct hidma_lldev *lldev, int tre_iterator,
 static int hidma_handle_tre_completion(struct hidma_lldev *lldev)
 {
 	u32 evre_ring_size = lldev->evre_ring_size;
-	u32 tre_ring_size = lldev->tre_ring_size;
 	u32 err_info, err_code, evre_write_off;
-	u32 tre_iterator, evre_iterator;
+	u32 evre_iterator;
 	u32 num_completed = 0;

 	evre_write_off = readl_relaxed(lldev->evca + HIDMA_EVCA_WRITE_PTR_REG);
-	tre_iterator = lldev->tre_processed_off;
 	evre_iterator = lldev->evre_processed_off;

 	if ((evre_write_off > evre_ring_size) ||
@@ -273,12 +277,9 @@ static int hidma_handle_tre_completion(struct hidma_lldev *lldev)
 		err_code = (cfg >> HIDMA_EVRE_CODE_BIT_POS) &
 			   HIDMA_EVRE_CODE_MASK;

-		if (hidma_post_completed(lldev, tre_iterator, err_info,
-					 err_code))
+		if (hidma_post_completed(lldev, err_info, err_code))
 			break;

-		HIDMA_INCREMENT_ITERATOR(tre_iterator, HIDMA_TRE_SIZE,
-					 tre_ring_size);
 		HIDMA_INCREMENT_ITERATOR(evre_iterator, HIDMA_EVRE_SIZE,
 					 evre_ring_size);
@@ -295,16 +296,10 @@ static int hidma_handle_tre_completion(struct hidma_lldev *lldev)
 	if (num_completed) {
 		u32 evre_read_off = (lldev->evre_processed_off +
 				     HIDMA_EVRE_SIZE * num_completed);
-		u32 tre_read_off = (lldev->tre_processed_off +
-				    HIDMA_TRE_SIZE * num_completed);
-
 		evre_read_off = evre_read_off % evre_ring_size;
-		tre_read_off = tre_read_off % tre_ring_size;
-
 		writel(evre_read_off, lldev->evca + HIDMA_EVCA_DOORBELL_REG);

 		/* record the last processed tre offset */
-		lldev->tre_processed_off = tre_read_off;
 		lldev->evre_processed_off = evre_read_off;
 	}
@@ -314,27 +309,10 @@ static int hidma_handle_tre_completion(struct hidma_lldev *lldev)
 void hidma_cleanup_pending_tre(struct hidma_lldev *lldev, u8 err_info,
 			       u8 err_code)
 {
-	u32 tre_iterator;
-	u32 tre_ring_size = lldev->tre_ring_size;
-	int num_completed = 0;
-	u32 tre_read_off;
-
-	tre_iterator = lldev->tre_processed_off;
 	while (atomic_read(&lldev->pending_tre_count)) {
-		if (hidma_post_completed(lldev, tre_iterator, err_info,
-					 err_code))
+		if (hidma_post_completed(lldev, err_info, err_code))
 			break;
-		HIDMA_INCREMENT_ITERATOR(tre_iterator, HIDMA_TRE_SIZE,
-
[PATCH V5 07/10] dmaengine: qcom_hidma: add a common API to setup the interrupt
Introduce the hidma_ll_setup_irq function to set up the interrupt type
externally from the OS interface.

Signed-off-by: Sinan Kaya
---
 drivers/dma/qcom/hidma.h    |  2 ++
 drivers/dma/qcom/hidma_ll.c | 27 +++
 2 files changed, 25 insertions(+), 4 deletions(-)

diff --git a/drivers/dma/qcom/hidma.h b/drivers/dma/qcom/hidma.h
index afaeb9a..b74a56e 100644
--- a/drivers/dma/qcom/hidma.h
+++ b/drivers/dma/qcom/hidma.h
@@ -46,6 +46,7 @@ struct hidma_tre {
 };

 struct hidma_lldev {
+	bool msi_support;		/* flag indicating MSI support */
 	bool initialized;		/* initialized flag */
 	u8 trch_state;			/* trch_state of the device */
 	u8 evch_state;			/* evch_state of the device */
@@ -148,6 +149,7 @@ int hidma_ll_disable(struct hidma_lldev *lldev);
 int hidma_ll_enable(struct hidma_lldev *llhndl);
 void hidma_ll_set_transfer_params(struct hidma_lldev *llhndl, u32 tre_ch,
 	dma_addr_t src, dma_addr_t dest, u32 len, u32 flags);
+void hidma_ll_setup_irq(struct hidma_lldev *lldev, bool msi);
 int hidma_ll_setup(struct hidma_lldev *lldev);
 struct hidma_lldev *hidma_ll_init(struct device *dev, u32 max_channels,
 			void __iomem *trca, void __iomem *evca,
diff --git a/drivers/dma/qcom/hidma_ll.c b/drivers/dma/qcom/hidma_ll.c
index 015df4b..9d78c86 100644
--- a/drivers/dma/qcom/hidma_ll.c
+++ b/drivers/dma/qcom/hidma_ll.c
@@ -715,17 +715,36 @@ int hidma_ll_setup(struct hidma_lldev *lldev)
 	writel(HIDMA_EVRE_SIZE * nr_tres,
 	       lldev->evca + HIDMA_EVCA_RING_LEN_REG);

-	/* support IRQ only for now */
+	/* configure interrupts */
+	hidma_ll_setup_irq(lldev, lldev->msi_support);
+
+	rc = hidma_ll_enable(lldev);
+	if (rc)
+		return rc;
+
+	return rc;
+}
+
+void hidma_ll_setup_irq(struct hidma_lldev *lldev, bool msi)
+{
+	u32 val;
+
+	lldev->msi_support = msi;
+
+	/* disable interrupts again after reset */
+	writel(0, lldev->evca + HIDMA_EVCA_IRQ_CLR_REG);
+	writel(0, lldev->evca + HIDMA_EVCA_IRQ_EN_REG);
+
+	/* support IRQ by default */
 	val = readl(lldev->evca + HIDMA_EVCA_INTCTRL_REG);
 	val &= ~0xF;
-	val |= 0x1;
+	if (!lldev->msi_support)
+		val = val | 0x1;
 	writel(val, lldev->evca + HIDMA_EVCA_INTCTRL_REG);

 	/* clear all pending interrupts and enable them */
 	writel(ENABLE_IRQS, lldev->evca + HIDMA_EVCA_IRQ_CLR_REG);
 	writel(ENABLE_IRQS, lldev->evca + HIDMA_EVCA_IRQ_EN_REG);
-
-	return hidma_ll_enable(lldev);
 }

 struct hidma_lldev *hidma_ll_init(struct device *dev, u32 nr_tres,
--
1.9.1
[PATCH V5 06/10] dmaengine: qcom_hidma: bring out interrupt cause
Bring the interrupt cause out to the top level so that MSI interrupts
can be hooked in at a later stage.

Signed-off-by: Sinan Kaya
---
 drivers/dma/qcom/hidma_ll.c | 57 ++---
 1 file changed, 33 insertions(+), 24 deletions(-)

diff --git a/drivers/dma/qcom/hidma_ll.c b/drivers/dma/qcom/hidma_ll.c
index a4fc941..015df4b 100644
--- a/drivers/dma/qcom/hidma_ll.c
+++ b/drivers/dma/qcom/hidma_ll.c
@@ -432,12 +432,24 @@ static void hidma_ll_abort(unsigned long arg)
  * requests traditionally to the destination, this concept does not apply
  * here for this HW.
  */
-irqreturn_t hidma_ll_inthandler(int chirq, void *arg)
+static void hidma_ll_int_handler_internal(struct hidma_lldev *lldev, int cause)
 {
-	struct hidma_lldev *lldev = arg;
-	u32 status;
-	u32 enable;
-	u32 cause;
+	if (cause & HIDMA_ERR_INT_MASK) {
+		dev_err(lldev->dev, "error 0x%x, disabling...\n",
+			cause);
+
+		/* Clear out pending interrupts */
+		writel(cause, lldev->evca + HIDMA_EVCA_IRQ_CLR_REG);
+
+		/* No further submissions. */
+		hidma_ll_disable(lldev);
+
+		/* Driver completes the txn and intimates the client. */
+		hidma_cleanup_pending_tre(lldev, 0xFF,
+					  HIDMA_EVRE_STATUS_ERROR);
+
+		return;
+	}

 	/*
 	 * Fine tuned for this HW...
@@ -446,30 +458,28 @@ irqreturn_t hidma_ll_inthandler(int chirq, void *arg)
 	 * read and write accessors are used for performance reasons due to
 	 * interrupt delivery guarantees. Do not copy this code blindly and
 	 * expect that to work.
+	 *
+	 * Try to consume as many EVREs as possible.
 	 */
+	hidma_handle_tre_completion(lldev);
+
+	/* We consumed TREs or there are pending TREs or EVREs. */
+	writel_relaxed(cause, lldev->evca + HIDMA_EVCA_IRQ_CLR_REG);
+}
+
+irqreturn_t hidma_ll_inthandler(int chirq, void *arg)
+{
+	struct hidma_lldev *lldev = arg;
+	u32 status;
+	u32 enable;
+	u32 cause;
+
 	status = readl_relaxed(lldev->evca + HIDMA_EVCA_IRQ_STAT_REG);
 	enable = readl_relaxed(lldev->evca + HIDMA_EVCA_IRQ_EN_REG);
 	cause = status & enable;

 	while (cause) {
-		if (cause & HIDMA_ERR_INT_MASK) {
-			dev_err(lldev->dev, "error 0x%x, resetting...\n",
-				cause);
-
-			/* Clear out pending interrupts */
-			writel(cause, lldev->evca + HIDMA_EVCA_IRQ_CLR_REG);
-
-			tasklet_schedule(&lldev->rst_task);
-			goto out;
-		}
-
-		/*
-		 * Try to consume as many EVREs as possible.
-		 */
-		hidma_handle_tre_completion(lldev);
-
-		/* We consumed TREs or there are pending TREs or EVREs. */
-		writel_relaxed(cause, lldev->evca + HIDMA_EVCA_IRQ_CLR_REG);
+		hidma_ll_int_handler_internal(lldev, cause);

 		/*
 		 * Another interrupt might have arrived while we are
@@ -480,7 +490,6 @@ irqreturn_t hidma_ll_inthandler(int chirq, void *arg)
 		cause = status & enable;
 	}

-out:
 	return IRQ_HANDLED;
 }
--
1.9.1
[PATCH V5 09/10] dmaengine: qcom_hidma: break completion processing on error
We try to consume as many successful transfers as possible. Now that we
support MSI interrupts, an error interrupt might be observed by another
processor while we are finishing the successful ones. Try to abort the
success processing if this is the case.

Signed-off-by: Sinan Kaya
---
 drivers/dma/qcom/hidma_ll.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/drivers/dma/qcom/hidma_ll.c b/drivers/dma/qcom/hidma_ll.c
index c4e8b64..aa76ec1 100644
--- a/drivers/dma/qcom/hidma_ll.c
+++ b/drivers/dma/qcom/hidma_ll.c
@@ -291,6 +291,13 @@ static int hidma_handle_tre_completion(struct hidma_lldev *lldev)
 		evre_write_off =
 			readl_relaxed(lldev->evca + HIDMA_EVCA_WRITE_PTR_REG);
 		num_completed++;
+
+		/*
+		 * An error interrupt might have arrived while we are
+		 * processing the completed interrupt.
+		 */
+		if (!hidma_ll_isenabled(lldev))
+			break;
 	}

 	if (num_completed) {
--
1.9.1
[PATCH V5 10/10] dmaengine: qcom_hidma: add MSI support for interrupts
The interrupts can now be delivered as platform MSI interrupts on newer
platforms. The code looks for new OF and ACPI strings in order to
enable the functionality.

Signed-off-by: Sinan Kaya
---
 drivers/dma/qcom/hidma.c    | 143 ++--
 drivers/dma/qcom/hidma.h    |   2 +
 drivers/dma/qcom/hidma_ll.c |   8 +++
 3 files changed, 147 insertions(+), 6 deletions(-)

diff --git a/drivers/dma/qcom/hidma.c b/drivers/dma/qcom/hidma.c
index 10a9e3a..7b13213 100644
--- a/drivers/dma/qcom/hidma.c
+++ b/drivers/dma/qcom/hidma.c
@@ -56,6 +56,7 @@
 #include
 #include
 #include
+#include
 #include "../dmaengine.h"
 #include "hidma.h"
@@ -70,6 +71,7 @@
 #define HIDMA_ERR_INFO_SW			0xFF
 #define HIDMA_ERR_CODE_UNEXPECTED_TERMINATE	0x0
 #define HIDMA_NR_DEFAULT_DESC			10
+#define HIDMA_MSI_INTS				11

 static inline struct hidma_dev *to_hidma_dev(struct dma_device *dmadev)
 {
@@ -530,6 +532,15 @@ static irqreturn_t hidma_chirq_handler(int chirq, void *arg)
 	return hidma_ll_inthandler(chirq, lldev);
 }

+static irqreturn_t hidma_chirq_handler_msi(int chirq, void *arg)
+{
+	struct hidma_lldev **lldevp = arg;
+	struct hidma_dev *dmadev = to_hidma_dev_from_lldev(lldevp);
+
+	return hidma_ll_inthandler_msi(chirq, *lldevp,
+				       1 << (chirq - dmadev->msi_virqbase));
+}
+
 static ssize_t hidma_show_values(struct device *dev,
 				 struct device_attribute *attr, char *buf)
 {
@@ -584,6 +595,104 @@ static int hidma_sysfs_init(struct hidma_dev *dev)
 	return device_create_file(dev->ddev.dev, dev->chid_attrs);
 }

+#ifdef CONFIG_GENERIC_MSI_IRQ_DOMAIN
+static void hidma_write_msi_msg(struct msi_desc *desc, struct msi_msg *msg)
+{
+	struct device *dev = msi_desc_to_dev(desc);
+	struct hidma_dev *dmadev = dev_get_drvdata(dev);
+
+	if (!desc->platform.msi_index) {
+		writel(msg->address_lo, dmadev->dev_evca + 0x118);
+		writel(msg->address_hi, dmadev->dev_evca + 0x11C);
+		writel(msg->data, dmadev->dev_evca + 0x120);
+	}
+}
+#endif
+
+static void hidma_free_msis(struct hidma_dev *dmadev)
+{
+#ifdef CONFIG_GENERIC_MSI_IRQ_DOMAIN
+	struct device *dev = dmadev->ddev.dev;
+	struct msi_desc *desc;
+
+	/* free allocated MSI interrupts above */
+	for_each_msi_entry(desc, dev)
+		devm_free_irq(dev, desc->irq, &dmadev->lldev);
+
+	platform_msi_domain_free_irqs(dev);
+#endif
+}
+
+static int hidma_request_msi(struct hidma_dev *dmadev,
+			     struct platform_device *pdev)
+{
+#ifdef CONFIG_GENERIC_MSI_IRQ_DOMAIN
+	int rc;
+	struct msi_desc *desc;
+	struct msi_desc *failed_desc = NULL;
+
+	rc = platform_msi_domain_alloc_irqs(&pdev->dev, HIDMA_MSI_INTS,
+					    hidma_write_msi_msg);
+	if (rc)
+		return rc;
+
+	for_each_msi_entry(desc, &pdev->dev) {
+		if (!desc->platform.msi_index)
+			dmadev->msi_virqbase = desc->irq;
+
+		rc = devm_request_irq(&pdev->dev, desc->irq,
+				      hidma_chirq_handler_msi,
+				      0, "qcom-hidma-msi",
+				      &dmadev->lldev);
+		if (rc) {
+			failed_desc = desc;
+			break;
+		}
+	}
+
+	if (rc) {
+		/* free allocated MSI interrupts above */
+		for_each_msi_entry(desc, &pdev->dev) {
+			if (desc == failed_desc)
+				break;
+			devm_free_irq(&pdev->dev, desc->irq,
+				      &dmadev->lldev);
+		}
+	} else {
+		/* Add callback to free MSIs on teardown */
+		hidma_ll_setup_irq(dmadev->lldev, true);
+	}
+	if (rc)
+		dev_warn(&pdev->dev,
+			 "failed to request MSI irq, falling back to wired IRQ\n");
+	return rc;
+#else
+	return -EINVAL;
+#endif
+}
+
+static bool hidma_msi_capable(struct device *dev)
+{
+	struct acpi_device *adev = ACPI_COMPANION(dev);
+	const char *of_compat;
+	int ret = -EINVAL;
+
+	if (!adev || acpi_disabled) {
+		ret = device_property_read_string(dev, "compatible",
+						  &of_compat);
+		if (ret)
+			return false;
+
+		ret = strcmp(of_compat, "qcom,hidma-1.1");
+	} else {
+#ifdef CONFIG_ACPI
+		ret = strcmp(acpi_device_hid(adev), "QCOM8062");
+#endif
+	}
+	return ret == 0;
+}
+
 static int hidma_probe(struct platform_device *pdev)
 {
 	struct hidma_dev *dmadev;
@@ -593,6 +702,7 @@ static int hidma_probe(struct platform_device *pdev)
 	void __iomem *evca;
[PATCH V5 03/10] of: irq: make of_msi_configure accessible from modules
The of_msi_configure routine is only accessible to built-in kernel
drivers. Export this function so that modules can use it too. This
function is useful for configuring MSI on child device tree nodes on
hierarchical objects.

Acked-by: Rob Herring
Signed-off-by: Sinan Kaya
---
 drivers/of/irq.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/of/irq.c b/drivers/of/irq.c
index a2e68f7..20c09e0 100644
--- a/drivers/of/irq.c
+++ b/drivers/of/irq.c
@@ -767,3 +767,4 @@ void of_msi_configure(struct device *dev, struct device_node *np)
 	dev_set_msi_domain(dev,
 			   of_msi_get_domain(dev, np, DOMAIN_BUS_PLATFORM_MSI));
 }
+EXPORT_SYMBOL_GPL(of_msi_configure);
--
1.9.1
[PATCH V5 02/10] Documentation: DT: qcom_hidma: correct spelling mistakes
Fix the spelling mistakes and a duplicated "and" in the sentences.

Acked-by: Rob Herring
Signed-off-by: Sinan Kaya
---
 Documentation/devicetree/bindings/dma/qcom_hidma_mgmt.txt | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/Documentation/devicetree/bindings/dma/qcom_hidma_mgmt.txt b/Documentation/devicetree/bindings/dma/qcom_hidma_mgmt.txt
index 2c5e4b8..55492c2 100644
--- a/Documentation/devicetree/bindings/dma/qcom_hidma_mgmt.txt
+++ b/Documentation/devicetree/bindings/dma/qcom_hidma_mgmt.txt
@@ -5,13 +5,13 @@ memcpy and memset capabilities. It has been designed for virtualized
 environments.

 Each HIDMA HW instance consists of multiple DMA channels. These channels
-share the same bandwidth. The bandwidth utilization can be parititioned
+share the same bandwidth. The bandwidth utilization can be partitioned
 among channels based on the priority and weight assignments.

 There are only two priority levels and 15 weigh assignments possible.

 Other parameters here determine how much of the system bus this HIDMA
-instance can use like maximum read/write request and and number of bytes to
+instance can use like maximum read/write request and number of bytes to
 read/write in a single burst.

 Main node required properties:
--
1.9.1
[PATCH V5 01/10] Documentation: DT: qcom_hidma: update binding for MSI
Add a new binding for qcom,hidma-1.1 to distinguish HW supporting MSI
interrupts from the older revision.

Signed-off-by: Sinan Kaya
---
 Documentation/devicetree/bindings/dma/qcom_hidma_mgmt.txt | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/Documentation/devicetree/bindings/dma/qcom_hidma_mgmt.txt b/Documentation/devicetree/bindings/dma/qcom_hidma_mgmt.txt
index fd5618b..2c5e4b8 100644
--- a/Documentation/devicetree/bindings/dma/qcom_hidma_mgmt.txt
+++ b/Documentation/devicetree/bindings/dma/qcom_hidma_mgmt.txt
@@ -47,12 +47,18 @@ When the OS is not in control of the management interface (i.e. it's a guest),
 the channel nodes appear on their own, not under a management node.

 Required properties:
-- compatible: must contain "qcom,hidma-1.0"
+- compatible: must contain "qcom,hidma-1.0" for initial HW or
+  "qcom,hidma-1.1" for MSI capable HW.
 - reg: Addresses for the transfer and event channel
 - interrupts: Should contain the event interrupt
 - desc-count: Number of asynchronous requests this channel can handle
 - iommus: required a iommu node

+Optional properties for MSI:
+- msi-parent : See the generic MSI binding described in
+  devicetree/bindings/interrupt-controller/msi.txt for a description of the
+  msi-parent property.
+
 Example:

 Hypervisor OS configuration:
--
1.9.1
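Putting the new properties together, a guest-side channel node using the MSI-capable binding might look like the fragment below. The node name, unit addresses, interrupt specifier, and the `&its_hidma` phandle are all illustrative values, not taken from the patch:

```dts
hidma_24: dma-controller@5c050000 {
	compatible = "qcom,hidma-1.1";
	reg = <0x5c050000 0x1000>,	/* transfer channel (TRCA) */
	      <0x5c0b0000 0x1000>;	/* event channel (EVCA) */
	interrupts = <0 389 0>;		/* wired event interrupt fallback */
	desc-count = <10>;
	iommus = <&system_mmu>;
	msi-parent = <&its_hidma>;	/* optional: enables MSI delivery */
};
```

On firmware that omits `msi-parent` (or still reports "qcom,hidma-1.0"), the driver falls back to the wired event interrupt.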
[PATCH V5 04/10] dmaengine: qcom_hidma: configure DMA and MSI for OF
Configure the DMA bindings for the device-tree-based firmware.

Signed-off-by: Sinan Kaya
---
 drivers/dma/qcom/hidma_mgmt.c | 9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/drivers/dma/qcom/hidma_mgmt.c b/drivers/dma/qcom/hidma_mgmt.c
index 82f36e4..185d29c 100644
--- a/drivers/dma/qcom/hidma_mgmt.c
+++ b/drivers/dma/qcom/hidma_mgmt.c
@@ -375,8 +375,15 @@ static int __init hidma_mgmt_of_populate_channels(struct device_node *np)
 			ret = PTR_ERR(new_pdev);
 			goto out;
 		}
+		of_node_get(child);
+		new_pdev->dev.of_node = child;
 		of_dma_configure(&new_pdev->dev, child);
-
+		/*
+		 * It is assumed that calling of_msi_configure is safe on
+		 * platforms with or without MSI support.
+		 */
+		of_msi_configure(&new_pdev->dev, child);
+		of_node_put(child);
 		kfree(res);
 		res = NULL;
 	}
--
1.9.1
Re: [PATCH V1 05/10] thermal: da9062/61: Thermal junction temperature monitoring driver
Steve,

On Thursday 06 October 2016 02:13 PM, Steve Twiss wrote:

From: Steve Twiss

Add junction temperature monitoring supervisor device driver, compatible
with the DA9062 and DA9061 PMICs.

If the PMIC's internal junction temperature rises above TEMP_WARN (125
degC) an interrupt is issued. This TEMP_WARN level is defined as the
THERMAL_TRIP_HOT trip-wire inside the device driver.

A kernel work queue is configured to repeatedly poll this temperature
trip-wire, between 1 and 10 second intervals (defaulting at 3 seconds).
This first level of temperature supervision is intended for non-invasive
temperature control, where the necessary measures for cooling the system
down are left to the host software. In this case, inside the thermal
notification function da9062_thermal_notify().

Signed-off-by: Steve Twiss
---
This patch applies against linux-next and v4.8

Regards,
Steve Twiss, Dialog Semiconductor Ltd.

 drivers/thermal/Kconfig          |  10 ++
 drivers/thermal/Makefile         |   1 +
 drivers/thermal/da9062-thermal.c | 313 +++
 3 files changed, 324 insertions(+)
 create mode 100644 drivers/thermal/da9062-thermal.c

diff --git a/drivers/thermal/Kconfig b/drivers/thermal/Kconfig
index 2d702ca..da58e54 100644
--- a/drivers/thermal/Kconfig
+++ b/drivers/thermal/Kconfig
@@ -272,6 +272,16 @@ config DB8500_CPUFREQ_COOLING
 	  bound cpufreq cooling device turns active to set CPU frequency low to
 	  cool down the CPU.

+config DA9062_THERMAL
+	tristate "DA9062/DA9061 Dialog Semiconductor thermal driver"
+	depends on MFD_DA9062
+	depends on OF
+	help
+	  Enable this for the Dialog Semiconductor thermal sensor driver.
+	  This will report PMIC junction over-temperature for one thermal trip
+	  zone.
+	  Compatible with the DA9062 and DA9061 PMICs.
+
 config INTEL_POWERCLAMP
 	tristate "Intel PowerClamp idle injection driver"
 	depends on THERMAL

diff --git a/drivers/thermal/Makefile b/drivers/thermal/Makefile
index 10b07c1..0a2b3f2 100644
--- a/drivers/thermal/Makefile
+++ b/drivers/thermal/Makefile
@@ -38,6 +38,7 @@ obj-$(CONFIG_ARMADA_THERMAL)	+= armada_thermal.o
 obj-$(CONFIG_TANGO_THERMAL)	+= tango_thermal.o
 obj-$(CONFIG_IMX_THERMAL)	+= imx_thermal.o
 obj-$(CONFIG_DB8500_CPUFREQ_COOLING)	+= db8500_cpufreq_cooling.o
+obj-$(CONFIG_DA9062_THERMAL)	+= da9062-thermal.o
 obj-$(CONFIG_INTEL_POWERCLAMP)	+= intel_powerclamp.o
 obj-$(CONFIG_X86_PKG_TEMP_THERMAL)	+= x86_pkg_temp_thermal.o
 obj-$(CONFIG_INTEL_SOC_DTS_IOSF_CORE)	+= intel_soc_dts_iosf.o

diff --git a/drivers/thermal/da9062-thermal.c b/drivers/thermal/da9062-thermal.c
new file mode 100644
index 000..feeabf6
--- /dev/null
+++ b/drivers/thermal/da9062-thermal.c
@@ -0,0 +1,313 @@
+/*
+ * Thermal device driver for DA9062 and DA9061
+ * Copyright (C) 2016 Dialog Semiconductor Ltd.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+
+#include
+#include
+
+#define DA9062_DEFAULT_POLLING_MS_PERIOD	3000
+#define DA9062_MAX_POLLING_MS_PERIOD		10000
+#define DA9062_MIN_POLLING_MS_PERIOD		1000
+
+#define DA9062_MILLI_CELSIUS(t)		((t) * 1000)
+
+struct da9062_thermal_config {
+	const char *name;
+};
+
+struct da9062_thermal {
+	struct da9062 *hw;
+	struct delayed_work work;
+	struct thermal_zone_device *zone;
+	enum thermal_device_mode mode;
+	unsigned int polling_period;
+	struct mutex lock;
+	int temperature;
+	int irq;
+	const struct da9062_thermal_config *config;
+	struct device *dev;
+};
+
+static void da9062_thermal_poll_on(struct work_struct *work)
+{
+	struct da9062_thermal *thermal = container_of(work,
+						      struct da9062_thermal,
+						      work.work);
+	unsigned int val;
+	int ret;
+
+	/* clear E_TEMP */
+	ret = regmap_write(thermal->hw->regmap,
+			   DA9062AA_EVENT_B,
+			   DA9062AA_E_TEMP_MASK);
+	if (ret < 0) {
+		dev_err(thermal->dev,
+			"Cannot clear the TJUNC temperature status\n");
+		goto err_enable_irq;
+	}
+
+	/* Now read E_TEMP again: it is acting like a status bit.
+
Re: [PATCH V1 05/10] thermal: da9062/61: Thermal junction temperature monitoring driver
Steve,

On Thursday 06 October 2016 02:13 PM, Steve Twiss wrote:

From: Steve Twiss

Add junction temperature monitoring supervisor device driver, compatible with the DA9062 and DA9061 PMICs. If the PMIC's internal junction temperature rises above TEMP_WARN (125 degC) an interrupt is issued. This TEMP_WARN level is defined as the THERMAL_TRIP_HOT trip-wire inside the device driver.

A kernel work queue is configured to repeatedly poll this temperature trip-wire, between 1 and 10 second intervals (defaulting at 3 seconds). This first level of temperature supervision is intended for non-invasive temperature control, where the necessary measures for cooling the system down are left to the host software. In this case, inside the thermal notification function da9062_thermal_notify().

Signed-off-by: Steve Twiss
---
This patch applies against linux-next and v4.8

Regards,
Steve Twiss, Dialog Semiconductor Ltd.

 drivers/thermal/Kconfig          |  10 ++
 drivers/thermal/Makefile         |   1 +
 drivers/thermal/da9062-thermal.c | 313 +++
 3 files changed, 324 insertions(+)
 create mode 100644 drivers/thermal/da9062-thermal.c

diff --git a/drivers/thermal/Kconfig b/drivers/thermal/Kconfig
index 2d702ca..da58e54 100644
--- a/drivers/thermal/Kconfig
+++ b/drivers/thermal/Kconfig
@@ -272,6 +272,16 @@ config DB8500_CPUFREQ_COOLING
 	  bound cpufreq cooling device turns active to set CPU frequency low to
 	  cool down the CPU.
 
+config DA9062_THERMAL
+	tristate "DA9062/DA9061 Dialog Semiconductor thermal driver"
+	depends on MFD_DA9062
+	depends on OF
+	help
+	  Enable this for the Dialog Semiconductor thermal sensor driver.
+	  This will report PMIC junction over-temperature for one thermal trip
+	  zone.
+	  Compatible with the DA9062 and DA9061 PMICs.
+
 config INTEL_POWERCLAMP
 	tristate "Intel PowerClamp idle injection driver"
 	depends on THERMAL

diff --git a/drivers/thermal/Makefile b/drivers/thermal/Makefile
index 10b07c1..0a2b3f2 100644
--- a/drivers/thermal/Makefile
+++ b/drivers/thermal/Makefile
@@ -38,6 +38,7 @@ obj-$(CONFIG_ARMADA_THERMAL)	+= armada_thermal.o
 obj-$(CONFIG_TANGO_THERMAL)	+= tango_thermal.o
 obj-$(CONFIG_IMX_THERMAL)	+= imx_thermal.o
 obj-$(CONFIG_DB8500_CPUFREQ_COOLING)	+= db8500_cpufreq_cooling.o
+obj-$(CONFIG_DA9062_THERMAL)	+= da9062-thermal.o
 obj-$(CONFIG_INTEL_POWERCLAMP)	+= intel_powerclamp.o
 obj-$(CONFIG_X86_PKG_TEMP_THERMAL)	+= x86_pkg_temp_thermal.o
 obj-$(CONFIG_INTEL_SOC_DTS_IOSF_CORE)	+= intel_soc_dts_iosf.o

diff --git a/drivers/thermal/da9062-thermal.c b/drivers/thermal/da9062-thermal.c
new file mode 100644
index 000..feeabf6
--- /dev/null
+++ b/drivers/thermal/da9062-thermal.c
@@ -0,0 +1,313 @@
+/*
+ * Thermal device driver for DA9062 and DA9061
+ * Copyright (C) 2016 Dialog Semiconductor Ltd.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+
+#include
+#include
+
+#define DA9062_DEFAULT_POLLING_MS_PERIOD	3000
+#define DA9062_MAX_POLLING_MS_PERIOD	10000
+#define DA9062_MIN_POLLING_MS_PERIOD	1000
+
+#define DA9062_MILLI_CELSIUS(t)	((t) * 1000)
+
+struct da9062_thermal_config {
+	const char *name;
+};
+
+struct da9062_thermal {
+	struct da9062 *hw;
+	struct delayed_work work;
+	struct thermal_zone_device *zone;
+	enum thermal_device_mode mode;
+	unsigned int polling_period;
+	struct mutex lock;
+	int temperature;
+	int irq;
+	const struct da9062_thermal_config *config;
+	struct device *dev;
+};
+
+static void da9062_thermal_poll_on(struct work_struct *work)
+{
+	struct da9062_thermal *thermal = container_of(work,
+						      struct da9062_thermal,
+						      work.work);
+	unsigned int val;
+	int ret;
+
+	/* clear E_TEMP */
+	ret = regmap_write(thermal->hw->regmap,
+			   DA9062AA_EVENT_B,
+			   DA9062AA_E_TEMP_MASK);
+	if (ret < 0) {
+		dev_err(thermal->dev,
+			"Cannot clear the TJUNC temperature status\n");
+		goto err_enable_irq;
+	}
+
+	/* Now read E_TEMP again: it is acting like a status bit.
+	 * If over-temperature, then this status will be true.
Re: [PATCH] mm/slab: fix kmemcg cache creation delayed issue
On Thu, Oct 06, 2016 at 09:02:00AM -0700, Doug Smythies wrote:
> It was my (limited) understanding that the subsequent 2 patch set
> superseded this patch. Indeed, the 2 patch set seems to solve
> both the SLAB and SLUB bug reports.

It would mean that patch 1 solves both the SLAB and SLUB bug reports, since patch 2 is only effective for SLUB.

The reason I sent this patch is that although patch 1 fixes the issue that too many kworkers are created, kmem_cache creation/destruction is still slowed by synchronize_sched() and it would delay kmemcg usage accounting. I'm not sure how bad that is, but it's generally better to start accounting as soon as possible. With patch 2 for SLUB and this patch for SLAB, the performance of kmem_cache creation/destruction would recover.

Thanks.

> References:
>
> https://bugzilla.kernel.org/show_bug.cgi?id=172981
> https://bugzilla.kernel.org/show_bug.cgi?id=172991
> https://patchwork.kernel.org/patch/9361853
> https://patchwork.kernel.org/patch/9359271
Re: [RFC PATCH] mm, compaction: allow compaction for GFP_NOFS requests
On 10/04/2016 10:12 AM, Michal Hocko wrote:

From: Michal Hocko

compaction has been disabled for GFP_NOFS and GFP_NOIO requests since the direct compaction was introduced by 56de7263fcf3 ("mm: compaction: direct compact when a high-order allocation fails"). The main reason is that the migration of page cache pages might recurse back to the fs/io layer and we could potentially deadlock. This is overly conservative because all the anonymous memory is migrateable in the GFP_NOFS context just fine. This might be a large portion of the memory in many/most workloads.

Remove the GFP_NOFS restriction and make sure that we skip all fs pages (those with a mapping) while isolating pages to be migrated. We cannot consider clean fs pages because they might need a metadata update, so only isolate pages without any mapping for nofs requests.

The effect of this patch will probably be very limited in many/most workloads because higher order GFP_NOFS requests are quite rare, although different configurations might lead to very different results as GFP_NOFS usage is rather unleashed (e.g. I had a hard time triggering any with my setup). But still there shouldn't be any strong reason to completely back off and do nothing in that context. In the worst case we just skip parts of the block with fs pages. This might still be sufficient to make progress for small orders.

Signed-off-by: Michal Hocko
---
Hi,
I am sending this as an RFC because I am not completely sure this a) is really worth it and b) is 100% correct. I couldn't find any problems when staring into the code but as mentioned in the changelog I wasn't really able to trigger high order GFP_NOFS requests in my setup. Thoughts?
 mm/compaction.c | 15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index badb92bf14b4..07254a73ee32 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -834,6 +834,13 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 				page_count(page) > page_mapcount(page))
 			goto isolate_fail;
 
+		/*
+		 * Only allow to migrate anonymous pages in GFP_NOFS context
+		 * because those do not depend on fs locks.
+		 */
+		if (!(cc->gfp_mask & __GFP_FS) && page_mapping(page))
+			goto isolate_fail;

Unless page can acquire a page_mapping between this check and migration, I don't see a problem with allowing this. But make sure you don't break kcompactd and manual compaction from /proc, as they don't currently set cc->gfp_mask. Looks like until now it was only used to determine direct compactor's migratetype which is irrelevant in those contexts.

+
 		/* If we already hold the lock, we can skip some rechecking */
 		if (!locked) {
 			locked = compact_trylock_irqsave(zone_lru_lock(zone),
@@ -1696,14 +1703,16 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
 		unsigned int alloc_flags, const struct alloc_context *ac,
 		enum compact_priority prio)
 {
-	int may_enter_fs = gfp_mask & __GFP_FS;
 	int may_perform_io = gfp_mask & __GFP_IO;
 	struct zoneref *z;
 	struct zone *zone;
 	enum compact_result rc = COMPACT_SKIPPED;
 
-	/* Check if the GFP flags allow compaction */
-	if (!may_enter_fs || !may_perform_io)
+	/*
+	 * Check if the GFP flags allow compaction - GFP_NOIO is really
+	 * tricky context because the migration might require IO and
+	 */
+	if (!may_perform_io)
 		return COMPACT_SKIPPED;
 
 	trace_mm_compaction_try_to_compact_pages(order, gfp_mask, prio);
Re: Scrolling down broken with "perf top --hierarchy"
On 2016.10.07 at 06:56 +0200, Markus Trippelsdorf wrote: > On 2016.10.07 at 06:32 +0200, Markus Trippelsdorf wrote: > > On 2016.10.07 at 13:22 +0900, Namhyung Kim wrote: > > > On Fri, Oct 07, 2016 at 05:51:18AM +0200, Markus Trippelsdorf wrote: > > > > On 2016.10.07 at 10:17 +0900, Namhyung Kim wrote: > > > > > On Thu, Oct 06, 2016 at 06:33:33PM +0200, Markus Trippelsdorf wrote: > > > > > > Scrolling down is broken when using "perf top --hierarchy". > > > > > > When it starts up everything is OK and one can scroll up and down > > > > > > to all > > > > > > entries. But as further and further new entries get added to the > > > > > > list, > > > > > > scrolling down is blocked (at the position of the last entry that > > > > > > was > > > > > > shown directly after startup). > > > > > > > > > > I think below patch will fix the problem. Please check. > > > > > > > > Yes. It works fine now. Many thanks. > > > > > > Good. Can I add your Tested-by then? > > > > Sure. > > And BTW symbols are currently always cut off at 60 characters in > expanded entries. Hmm, no. Sometimes they are cut off, sometimes they are not. I haven't figured out what triggered this strange behavior. -- Markus
[PATCH] perf top: Fix refreshing hierarchy entries on TUI
Markus reported that 'perf top --hierarchy' cannot scroll down after refresh. This was because the number of entries is not updated when hierarchy is enabled. Unlike the normal report view, hierarchy mode needs to keep its own entry count since it can have non-leaf entries which can expand/collapse.

Reported-and-tested-by: Markus Trippelsdorf
Fixes: f5b763feebe9 ("perf hists browser: Count number of hierarchy entries")
Signed-off-by: Namhyung Kim
---
 tools/perf/ui/browsers/hists.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/tools/perf/ui/browsers/hists.c b/tools/perf/ui/browsers/hists.c
index fb8e42c7507a..47be9299 100644
--- a/tools/perf/ui/browsers/hists.c
+++ b/tools/perf/ui/browsers/hists.c
@@ -601,7 +601,8 @@ int hist_browser__run(struct hist_browser *browser, const char *help)
 			u64 nr_entries;
 			hbt->timer(hbt->arg);
 
-			if (hist_browser__has_filter(browser))
+			if (hist_browser__has_filter(browser) ||
+			    symbol_conf.report_hierarchy)
 				hist_browser__update_nr_entries(browser);
 
 			nr_entries = hist_browser__nr_entries(browser);
-- 
2.9.3
[GIT PULL] drm-vc4-next-2016-10-06
These are fixes that have been on the list for 1-3 weeks that didn't make it into 4.9. I've been running most of them most of the time, some have been merged downstream, and some have also been merged to the Fedora kernel build. This is about as much testing as we ever get on vc4, so I feel pretty good about them.

The branch base is your current -next, because I wanted the merge forward of drm-vc4-fixes to avoid conflicts.

The following changes since commit c2cbc38b9715bd8318062e600668fc30e5a3fbfa:

  drm: virtio: reinstate drm_virtio_set_busid() (2016-10-04 13:10:30 +1000)

are available in the git repository at:

  https://github.com/anholt/linux tags/drm-vc4-next-2016-10-06

for you to fetch changes up to dfccd937deec9283d6ced73e138808e62bec54e8:

  drm/vc4: Add support for double-clocked modes. (2016-10-06 11:58:28 -0700)

This pull request brings in several fixes for drm-next, mostly for HDMI.

Eric Anholt (7):
      drm/vc4: Fix races when the CS reads from render targets.
      drm/vc4: Enable limited range RGB output on HDMI with CEA modes.
      drm/vc4: Fall back to using an EDID probe in the absence of a GPIO.
      drm/vc4: Increase timeout for HDMI_SCHEDULER_CONTROL changes.
      drm/vc4: Fix support for interlaced modes on HDMI.
      drm/vc4: Set up the AVI and SPD infoframes.
      drm/vc4: Add support for double-clocked modes.

Masahiro Yamada (1):
      drm/vc4: cleanup with list_first_entry_or_null()

 drivers/gpu/drm/vc4/vc4_crtc.c      |  64 +-
 drivers/gpu/drm/vc4/vc4_drv.h       |  30 +++--
 drivers/gpu/drm/vc4/vc4_gem.c       |  13 ++
 drivers/gpu/drm/vc4/vc4_hdmi.c      | 231 +---
 drivers/gpu/drm/vc4/vc4_regs.h      |  19 ++-
 drivers/gpu/drm/vc4/vc4_render_cl.c |  21 +++-
 drivers/gpu/drm/vc4/vc4_validate.c  |  17 ++-
 7 files changed, 306 insertions(+), 89 deletions(-)
Re: Scrolling down broken with "perf top --hierarchy"
On 2016.10.07 at 06:32 +0200, Markus Trippelsdorf wrote: > On 2016.10.07 at 13:22 +0900, Namhyung Kim wrote: > > On Fri, Oct 07, 2016 at 05:51:18AM +0200, Markus Trippelsdorf wrote: > > > On 2016.10.07 at 10:17 +0900, Namhyung Kim wrote: > > > > On Thu, Oct 06, 2016 at 06:33:33PM +0200, Markus Trippelsdorf wrote: > > > > > Scrolling down is broken when using "perf top --hierarchy". > > > > > When it starts up everything is OK and one can scroll up and down to > > > > > all > > > > > entries. But as further and further new entries get added to the list, > > > > > scrolling down is blocked (at the position of the last entry that was > > > > > shown directly after startup). > > > > > > > > I think below patch will fix the problem. Please check. > > > > > > Yes. It works fine now. Many thanks. > > > > Good. Can I add your Tested-by then? > > Sure. And BTW symbols are currently always cut off at 60 characters in expanded entries. -- Markus
Re: Scrolling down broken with "perf top --hierarchy"
Cc-ing perf maintainers,

On Fri, Oct 07, 2016 at 06:32:29AM +0200, Markus Trippelsdorf wrote:
> On 2016.10.07 at 13:22 +0900, Namhyung Kim wrote:
> > On Fri, Oct 07, 2016 at 05:51:18AM +0200, Markus Trippelsdorf wrote:
> > > On 2016.10.07 at 10:17 +0900, Namhyung Kim wrote:
> > > > On Thu, Oct 06, 2016 at 06:33:33PM +0200, Markus Trippelsdorf wrote:
> > > > > Scrolling down is broken when using "perf top --hierarchy".
> > > > > When it starts up everything is OK and one can scroll up and down to all
> > > > > entries. But as further and further new entries get added to the list,
> > > > > scrolling down is blocked (at the position of the last entry that was
> > > > > shown directly after startup).
> > > >
> > > > I think below patch will fix the problem. Please check.
> > >
> > > Yes. It works fine now. Many thanks.
> >
> > Good. Can I add your Tested-by then?
>
> Sure.

Ok, I'll send a formal patch with it.

> (And in the long run you should think of making "perf top --hierarchy"
> the default for perf top, because it gives a much better (uncluttered)
> overview of what is going on.)

I think it's a matter of taste. Some people prefer to see the top single function or something (i.e. current behavior) while others prefer to see a higher-level view. But we can think again about the default at least for perf-top.

I worried about changing the default behavior because last time we did it for children mode many people complained about it. But I do think the hierarchy mode is useful for many people.

Hmm.. I thought that it already had a config option to enable hierarchy mode by default, but I cannot find it now.

Thanks,
Namhyung
[PATCH 01/01] drivers:input:byd fix greedy detection of Sentelic FSP by the BYD touchpad driver
From: Christophe TORDEUX

With kernel v4.6 and later, the Sentelic touchpad STL3888_C0 and probably other Sentelic FSP touchpads are detected as a BYD touchpad and lose multitouch features. During the BYD handshake in the byd_detect function, the BYD driver mistakenly interprets a standard PS/2 protocol status request answer from the Sentelic touchpad as a successful handshake with a BYD touchpad. This is clearly a bug of the BYD driver.

Description of the patch: In the byd_detect function, remove the positive detection result based on the standard PS/2 protocol status request answer. Replace it with positive detection based on handshake answers as they can be inferred from the BYD touchpad datasheets found on BYD's website.

Signed-off-by: Christophe TORDEUX
---
Resubmitting this patch because I got no feedback on my first submission. Fixes kernel bug 175421 which is impacting multiple users.
---
 drivers/input/mouse/byd.c | 76 ++-
 1 file changed, 62 insertions(+), 14 deletions(-)

diff --git a/drivers/input/mouse/byd.c b/drivers/input/mouse/byd.c
index b27aa63..b5acca0 100644
--- a/drivers/input/mouse/byd.c
+++ b/drivers/input/mouse/byd.c
@@ -35,6 +35,18 @@
  * BYD pad constants
  */
 
+/* Handshake answer of BTP6034 */
+#define BYD_MODEL_BTP6034	0x00E801
+/* Handshake answer of BTP6740 */
+#define BYD_MODEL_BTP6740	0x001155
+/* Handshake answers of BTP8644, BTP10463 and BTP11484 */
+#define BYD_MODEL_BTP8644	0x011155
+
+/* Handshake SETRES byte of BTP6034 and BTP6740 */
+#define BYD_SHAKE_BYTE_A	0x00
+/* Handshake SETRES byte of BTP8644, BTP10463 and BTP11484 */
+#define BYD_SHAKE_BYTE_B	0x03
+
 /*
  * True device resolution is unknown, however experiments show the
  * resolution is about 111 units/mm.
@@ -434,23 +446,59 @@ static void byd_disconnect(struct psmouse *psmouse)
 	}
 }
 
+u32 byd_try_model(u32 model)
+{
+	size_t i;
+
+	u32 byd_model[] = {
+		BYD_MODEL_BTP6034,
+		BYD_MODEL_BTP6740,
+		BYD_MODEL_BTP8644
+	};
+
+	for (i=0; i < ARRAY_SIZE(byd_model); i++) {
+		if (model == byd_model[i])
+			return model;
+	}
+
+	return 0;
+}
+
 int byd_detect(struct psmouse *psmouse, bool set_properties)
 {
 	struct ps2dev *ps2dev = &psmouse->ps2dev;
-	u8 param[4] = {0x03, 0x00, 0x00, 0x00};
-
-	if (ps2_command(ps2dev, param, PSMOUSE_CMD_SETRES))
-		return -1;
-	if (ps2_command(ps2dev, param, PSMOUSE_CMD_SETRES))
-		return -1;
-	if (ps2_command(ps2dev, param, PSMOUSE_CMD_SETRES))
-		return -1;
-	if (ps2_command(ps2dev, param, PSMOUSE_CMD_SETRES))
-		return -1;
-	if (ps2_command(ps2dev, param, PSMOUSE_CMD_GETINFO))
-		return -1;
-
-	if (param[1] != 0x03 || param[2] != 0x64)
+	size_t i;
+
+	u8 byd_shbyte[] = {
+		BYD_SHAKE_BYTE_A,
+		BYD_SHAKE_BYTE_B
+	};
+
+	bool detect = false;
+	for (i=0; i < ARRAY_SIZE(byd_shbyte); i++) {
+		u32 model;
+		u8 param[4] = {byd_shbyte[i], 0x00, 0x00, 0x00};
+
+		if (ps2_command(ps2dev, param, PSMOUSE_CMD_SETRES))
+			return -1;
+		if (ps2_command(ps2dev, param, PSMOUSE_CMD_SETRES))
+			return -1;
+		if (ps2_command(ps2dev, param, PSMOUSE_CMD_SETRES))
+			return -1;
+		if (ps2_command(ps2dev, param, PSMOUSE_CMD_SETRES))
+			return -1;
+		if (ps2_command(ps2dev, param, PSMOUSE_CMD_GETINFO))
+			return -1;
+
+		model = param[2];
+		model += param[1] << 8;
+		model += param[0] << 16;
+		model = byd_try_model(model);
+		if (model)
+			detect = true;
+	}
+
+	if (!detect)
 		return -ENODEV;
 
 	psmouse_dbg(psmouse, "BYD touchpad detected\n");
[PATCH 01/01] drivers:input:byd fix greedy detection of Sentelic FSP by the BYD touchpad driver
From: Christophe TORDEUX With kernel v4.6 and later, the Sentelic touchpad STL3888_C0 and probably other Sentelic FSP touchpads are detected as a BYD touchpad and lose multitouch features. During the BYD handshake in the byd_detect function, the BYD driver mistakenly interprets a standard PS/2 protocol status request answer from the Sentelic touchpad as a successful handshake with a BYD touchpad. This is clearly a bug of the BYD driver. Description of the patch: In byd_detect function, remove positive detection result based on standard PS/2 protocol status request answer. Replace it with positive detection based on handshake answers as they can be inferred from the BYD touchpad datasheets found on BYD website. Signed-off-by: Christophe TORDEUX --- Resubmitting this patch because I got no feedback on my first submission. Fixes kernel bug 175421 which is impacting multiple users. --- drivers/input/mouse/byd.c | 76 ++- 1 file changed, 62 insertions(+), 14 deletions(-) diff --git a/drivers/input/mouse/byd.c b/drivers/input/mouse/byd.c index b27aa63..b5acca0 100644 --- a/drivers/input/mouse/byd.c +++ b/drivers/input/mouse/byd.c @@ -35,6 +35,18 @@ * BYD pad constants */ +/* Handshake answer of BTP6034 */ +#define BYD_MODEL_BTP6034 0x00E801 +/* Handshake answer of BTP6740 */ +#define BYD_MODEL_BTP6740 0x001155 +/* Handshake answers of BTP8644, BTP10463 and BTP11484 */ +#define BYD_MODEL_BTP8644 0x011155 + +/* Handshake SETRES byte of BTP6034 and BTP6740 */ +#define BYD_SHAKE_BYTE_A 0x00 +/* Handshake SETRES byte of BTP8644, BTP10463 and BTP11484 */ +#define BYD_SHAKE_BYTE_B 0x03 + /* * True device resolution is unknown, however experiments show the * resolution is about 111 units/mm. 
@@ -434,23 +446,59 @@ static void byd_disconnect(struct psmouse *psmouse) } } +u32 byd_try_model(u32 model) +{ + size_t i; + + u32 byd_model[] = { + BYD_MODEL_BTP6034, + BYD_MODEL_BTP6740, + BYD_MODEL_BTP8644 + }; + + for (i=0; i < ARRAY_SIZE(byd_model); i++) { + if (model == byd_model[i]) + return model; + } + + return 0; +} + int byd_detect(struct psmouse *psmouse, bool set_properties) { struct ps2dev *ps2dev = &psmouse->ps2dev; - u8 param[4] = {0x03, 0x00, 0x00, 0x00}; - - if (ps2_command(ps2dev, param, PSMOUSE_CMD_SETRES)) - return -1; - if (ps2_command(ps2dev, param, PSMOUSE_CMD_SETRES)) - return -1; - if (ps2_command(ps2dev, param, PSMOUSE_CMD_SETRES)) - return -1; - if (ps2_command(ps2dev, param, PSMOUSE_CMD_SETRES)) - return -1; - if (ps2_command(ps2dev, param, PSMOUSE_CMD_GETINFO)) - return -1; - - if (param[1] != 0x03 || param[2] != 0x64) + size_t i; + + u8 byd_shbyte[] = { + BYD_SHAKE_BYTE_A, + BYD_SHAKE_BYTE_B + }; + + bool detect = false; + for (i=0; i < ARRAY_SIZE(byd_shbyte); i++) { + u32 model; + u8 param[4] = {byd_shbyte[i], 0x00, 0x00, 0x00}; + + if (ps2_command(ps2dev, param, PSMOUSE_CMD_SETRES)) + return -1; + if (ps2_command(ps2dev, param, PSMOUSE_CMD_SETRES)) + return -1; + if (ps2_command(ps2dev, param, PSMOUSE_CMD_SETRES)) + return -1; + if (ps2_command(ps2dev, param, PSMOUSE_CMD_SETRES)) + return -1; + if (ps2_command(ps2dev, param, PSMOUSE_CMD_GETINFO)) + return -1; + + model = param[2]; + model += param[1] << 8; + model += param[0] << 16; + model = byd_try_model(model); + if (model) + detect = true; + } + + if (!detect) return -ENODEV; psmouse_dbg(psmouse, "BYD touchpad detected\n");
[RESEND PATCH v3] scsi: ufshcd: fix possible unclocked register access
Vendor specific setup_clocks callback may require the clocks managed by ufshcd driver to be ON. So if the vendor specific setup_clocks callback is called while the required clocks are turned off, it could result in unclocked register access. To prevent possible unclocked register access, this change adds one more argument to setup_clocks callback to let it know whether it is called pre/post the clock changes by core driver. Signed-off-by: Subhash Jadavani --- Changes from v2: * Added one more argument to setup_clocks callback, this should address Kiwoong Kim's comments on v2. Changes from v1: * Don't call ufshcd_vops_setup_clocks() again for clock off --- drivers/scsi/ufs/ufs-qcom.c | 10 ++ drivers/scsi/ufs/ufshcd.c | 17 - drivers/scsi/ufs/ufshcd.h | 8 +--- 3 files changed, 19 insertions(+), 16 deletions(-) diff --git a/drivers/scsi/ufs/ufs-qcom.c b/drivers/scsi/ufs/ufs-qcom.c index 3aedf73..3c4f602 100644 --- a/drivers/scsi/ufs/ufs-qcom.c +++ b/drivers/scsi/ufs/ufs-qcom.c @@ -1094,10 +1094,12 @@ static void ufs_qcom_set_caps(struct ufs_hba *hba) * ufs_qcom_setup_clocks - enables/disable clocks * @hba: host controller instance * @on: If true, enable clocks else disable them. + * @status: PRE_CHANGE or POST_CHANGE notify * * Returns 0 on success, non-zero on failure. 
*/ -static int ufs_qcom_setup_clocks(struct ufs_hba *hba, bool on) +static int ufs_qcom_setup_clocks(struct ufs_hba *hba, bool on, +enum ufs_notify_change_status status) { struct ufs_qcom_host *host = ufshcd_get_variant(hba); int err; @@ -,7 +1113,7 @@ static int ufs_qcom_setup_clocks(struct ufs_hba *hba, bool on) if (!host) return 0; - if (on) { + if (on && (status == POST_CHANGE)) { err = ufs_qcom_phy_enable_iface_clk(host->generic_phy); if (err) goto out; @@ -1130,7 +1132,7 @@ static int ufs_qcom_setup_clocks(struct ufs_hba *hba, bool on) if (vote == host->bus_vote.min_bw_vote) ufs_qcom_update_bus_bw_vote(host); - } else { + } else if (!on && (status == PRE_CHANGE)) { /* M-PHY RMMI interface clocks can be turned off */ ufs_qcom_phy_disable_iface_clk(host->generic_phy); @@ -1254,7 +1256,7 @@ static int ufs_qcom_init(struct ufs_hba *hba) ufs_qcom_set_caps(hba); ufs_qcom_advertise_quirks(hba); - ufs_qcom_setup_clocks(hba, true); + ufs_qcom_setup_clocks(hba, true, POST_CHANGE); if (hba->dev->id < MAX_UFS_QCOM_HOSTS) ufs_qcom_hosts[hba->dev->id] = host; diff --git a/drivers/scsi/ufs/ufshcd.c b/drivers/scsi/ufs/ufshcd.c index 05c7456..571a2f6 100644 --- a/drivers/scsi/ufs/ufshcd.c +++ b/drivers/scsi/ufs/ufshcd.c @@ -5389,6 +5389,10 @@ static int __ufshcd_setup_clocks(struct ufs_hba *hba, bool on, if (!head || list_empty(head)) goto out; + ret = ufshcd_vops_setup_clocks(hba, on, PRE_CHANGE); + if (ret) + return ret; + list_for_each_entry(clki, head, list) { if (!IS_ERR_OR_NULL(clki->clk)) { if (skip_ref_clk && !strcmp(clki->name, "ref_clk")) @@ -5410,7 +5414,10 @@ static int __ufshcd_setup_clocks(struct ufs_hba *hba, bool on, } } - ret = ufshcd_vops_setup_clocks(hba, on); + ret = ufshcd_vops_setup_clocks(hba, on, POST_CHANGE); + if (ret) + return ret; + out: if (ret) { list_for_each_entry(clki, head, list) { @@ -5500,8 +5507,6 @@ static void ufshcd_variant_hba_exit(struct ufs_hba *hba) if (!hba->vops) return; - ufshcd_vops_setup_clocks(hba, false); - 
ufshcd_vops_setup_regulators(hba, false); ufshcd_vops_exit(hba); @@ -5905,10 +5910,6 @@ disable_clks: if (ret) goto set_link_active; - ret = ufshcd_vops_setup_clocks(hba, false); - if (ret) - goto vops_resume; - if (!ufshcd_is_link_active(hba)) ufshcd_setup_clocks(hba, false); else @@ -5925,8 +5926,6 @@ disable_clks: ufshcd_hba_vreg_set_lpm(hba); goto out; -vops_resume: - ufshcd_vops_resume(hba, pm_op); set_link_active: ufshcd_vreg_set_hpm(hba); if (ufshcd_is_link_hibern8(hba) && !ufshcd_uic_hibern8_exit(hba)) diff --git a/drivers/scsi/ufs/ufshcd.h b/drivers/scsi/ufs/ufshcd.h index 430bef1..afff7f4 100644 --- a/drivers/scsi/ufs/ufshcd.h +++ b/drivers/scsi/ufs/ufshcd.h @@ -273,7 +273,8 @@ struct ufs_hba_variant_ops { u32 (*get_ufs_hci_version)(struct ufs_hba *); int (*clk_scale_notify)(struct ufs_hba *, bool, enum ufs_notify_change_status); - int (*setup_clocks)(struct ufs_hba *, bool); + int
Re: [PATCH v2] mount: dont execute propagate_umount() many times for same mounts
Andrei Vagin writes: > On Thu, Oct 06, 2016 at 02:46:30PM -0500, Eric W. Biederman wrote: >> Andrei Vagin writes: >> >> > The reason of this optimization is that umount() can hold namespace_sem >> > for a long time, this semaphore is global, so it affects all users. >> > Recently Eric W. Biederman added a per mount namespace limit on the >> > number of mounts. The default number of mounts allowed per mount >> > namespace at 100,000. Currently this value is allowed to construct a tree >> > which requires hours to be umounted. >> >> I am going to take a hard look at this as this problem sounds very >> unfortunate. My memory of going through this code before strongly >> suggests that changing the last list_for_each_entry to >> list_for_each_entry_reverse is going to impact the correctness of this >> change. > > I have read this code again and you are right, list_for_each_entry can't > be changed on list_for_each_entry_reverse here. > > I tested these changes more carefully and find one more issue, so I am > going to send a new patch and would like to get your comments to it. > > Thank you for your time. No problem. A quick question. You have introduced lookup_mnt_cont. Is that a core part of your fix, or do you truly have problematic long hash chains? Simply increasing the hash table size should fix problems with long hash chains (and there are other solutions like rhashtable that may be more appropriate than pre-allocating large hash chains). If it is not long hash chains, introducing lookup_mnt_cont in your patch is a distraction to the core of what is going on. Perhaps I am blind, but if the hash chains are not long I don't see how mount propagation could be more than quadratic in the worst case, as there is only a loop within a loop. Or is the tree walking in propagation_next that bad? Eric
Re: [PATCH] staging: sm750fb: Fix printk() style warning
On Sun, Oct 02, 2016 at 08:13:01PM +0200, Greg KH wrote: > On Sun, Oct 02, 2016 at 11:05:05AM -0700, Edward Lipinsky wrote: > > This patch fixes the checkpatch.pl warning: > > > > WARNING: printk() should include KERN_ facility level > > > > Signed-off-by: Edward Lipinsky> > --- > > drivers/staging/sm750fb/ddk750_help.c | 2 +- > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > diff --git a/drivers/staging/sm750fb/ddk750_help.c > > b/drivers/staging/sm750fb/ddk750_help.c > > index 9637dd3..e72a29c 100644 > > --- a/drivers/staging/sm750fb/ddk750_help.c > > +++ b/drivers/staging/sm750fb/ddk750_help.c > > @@ -11,7 +11,7 @@ void ddk750_set_mmio(void __iomem *addr, unsigned short > > devId, char revId) > > devId750 = devId; > > revId750 = revId; > > if (revId == 0xfe) > > - printk("found sm750le\n"); > > + pr_info("found sm750le\n"); > > Why can't you use dev_info() here? > > thanks, > > greg k-h It should work, but I'm not sure what should change in the header files to do it--esp. to make the dev parameter available in ddk750_help.c. (Only sm750.c uses dev_ style logging now, the rest of the driver still uses pr_*.) Thanks, Ed Lipinsky
Re: [tip:x86/apic] x86/acpi: Introduce persistent storage for cpuid <-> apicid mapping
Hi Yinghai At 10/07/2016 05:20 AM, Yinghai Lu wrote: On Thu, Oct 6, 2016 at 1:06 AM, Dou Liyang wrote: I seem to remember that in x2APIC Spec the x2APIC ID may be at 255 or greater. Good to know. Maybe later when one package have more cores like 30 cores etc. If we do that judgment, it may affect x2APIC's work in some other places. I saw the MADT, the main reason may be that we define 0xff to acpi_id in LAPIC mode. As you said, it was like: [ 42.107902] ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) [ 42.120125] ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) [ 42.132361] ACPI: LAPIC (acpi_id[0xff] lapic_id[0xff] disabled) ... How about doing the acpi_id check when we parse it in acpi_parse_lapic(). 8< --- a/arch/x86/kernel/acpi/boot.c +++ b/arch/x86/kernel/acpi/boot.c @@ -233,6 +233,11 @@ acpi_parse_lapic(struct acpi_subtable_header * header, const unsigned long end) acpi_table_print_madt_entry(header); + if (processor->id >= 255) { + ++disabled_cpus; + return -EINVAL; + } + /* * We need to register disabled CPU as well to permit * counting disabled CPUs. This allows us to size Yes, that should work. But should do the same thing for x2apic in acpi_parse_x2apic should have + if (processor->local_apic_id == -1) { + ++disabled_cpus; + return -EINVAL; + } that is the reason why i want to extend acpi_register_lapic() to take extra disabled_id (one is 0xff and another is 0x) so could save some lines. Yes, I understood. But I think adding an extra disabled_id is not a good way for validating the apic_id. If the disabled_id is not just one id (-1 or 255), may be two or more, even be a range, what should we do for extending our code? Firstly, I am not sure that the "-1" could appear in the MADT, even if the ACPI tables are unreasonable. Secondly, I guess if we need the check, there are some reserved methods in the kernel, such as "default_apic_id_valid", "x2apic_apic_id_valid" and so on. We should extend all of them and use them for the check. 
CC'ed: Rafael and Lv May I ask a question? Is it possible that the "-1/ox" could appear in the MADT which is one of the ACPI tables? Thanks Yinghai
Re: [PATCH 1/2] watchdog: Introduce update_arch_nmi_watchdog
On Thu, Oct 06, 2016 at 03:16:42PM -0700, Babu Moger wrote: > Currently we do not have a way to enable/disable arch specific > watchdog handlers if it was implemented by any of the architectures. > > This patch introduces new function update_arch_nmi_watchdog > which can be used to enable/disable architecture specific NMI > watchdog handlers. Also exposes watchdog_enabled variable outside > so that arch specific nmi watchdogs can use it to implement > enable/disable behaviour. > > Signed-off-by: Babu Moger> --- > include/linux/nmi.h |1 + > kernel/watchdog.c | 16 +--- > 2 files changed, 14 insertions(+), 3 deletions(-) > > diff --git a/include/linux/nmi.h b/include/linux/nmi.h > index 4630eea..01b4830 100644 > --- a/include/linux/nmi.h > +++ b/include/linux/nmi.h > @@ -66,6 +66,7 @@ static inline bool trigger_allbutself_cpu_backtrace(void) > > #ifdef CONFIG_LOCKUP_DETECTOR > u64 hw_nmi_get_sample_period(int watchdog_thresh); > +extern unsigned long watchdog_enabled; The extern declaration is within an #ifdef, but the definition later is always valid. So the extern declaration should be outside the #ifdef to match the actual implementation. Two constants are used to manipulate / read watchdog_enabled: NMI_WATCHDOG_ENABLED and SOFT_WATCHDOG_ENABLED. They should be visible too, so users do not fall into the trap of open-coding the constants (like in patch 2). Sam
Re: Scrolling down broken with "perf top --hierarchy"
On 2016.10.07 at 13:22 +0900, Namhyung Kim wrote: > On Fri, Oct 07, 2016 at 05:51:18AM +0200, Markus Trippelsdorf wrote: > > On 2016.10.07 at 10:17 +0900, Namhyung Kim wrote: > > > On Thu, Oct 06, 2016 at 06:33:33PM +0200, Markus Trippelsdorf wrote: > > > > Scrolling down is broken when using "perf top --hierarchy". > > > > When it starts up everything is OK and one can scroll up and down to all > > > > entries. But as further and further new entries get added to the list, > > > > scrolling down is blocked (at the position of the last entry that was > > > > shown directly after startup). > > > > > > I think below patch will fix the problem. Please check. > > > > Yes. It works fine now. Many thanks. > > Good. Can I add your Tested-by then? Sure. (And in the long run you should think of making "perf top --hierarchy" the default for perf top, because it gives a much better (uncluttered) overview of what is going on.) -- Markus
Re: Scrolling down broken with "perf top --hierarchy"
On Fri, Oct 07, 2016 at 05:51:18AM +0200, Markus Trippelsdorf wrote: > On 2016.10.07 at 10:17 +0900, Namhyung Kim wrote: > > On Thu, Oct 06, 2016 at 06:33:33PM +0200, Markus Trippelsdorf wrote: > > > Scrolling down is broken when using "perf top --hierarchy". > > > When it starts up everything is OK and one can scroll up and down to all > > > entries. But as further and further new entries get added to the list, > > > scrolling down is blocked (at the position of the last entry that was > > > shown directly after startup). > > > > I think below patch will fix the problem. Please check. > > Yes. It works fine now. Many thanks. Good. Can I add your Tested-by then? Thanks, Namhyung
Re: [PATCH] ftrace: Support full glob matching
Hi Masami, On Wed, Oct 05, 2016 at 08:58:15PM +0900, Masami Hiramatsu wrote: > Use glob_match() to support flexible glob wildcards (*,?) > and character classes ([) for ftrace. > Since the full glob matching is slower than the current > partial matching routines(*pat, pat*, *pat*), this leaves > those routines and just add MATCH_GLOB for complex glob > expression. > > e.g. > > [root@localhost tracing]# echo 'sched*group' > set_ftrace_filter > [root@localhost tracing]# cat set_ftrace_filter > sched_free_group > sched_change_group > sched_create_group > sched_online_group > sched_destroy_group > sched_offline_group > [root@localhost tracing]# echo '[Ss]y[Ss]_*' > set_ftrace_filter > [root@localhost tracing]# head set_ftrace_filter > sys_arch_prctl > sys_rt_sigreturn > sys_ioperm > SyS_iopl > sys_modify_ldt > SyS_mmap > SyS_set_thread_area > SyS_get_thread_area > SyS_set_tid_address > sys_fork > > > Signed-off-by: Masami Hiramatsu Nice! Acked-by: Namhyung Kim Thanks, Namhyung > --- > Documentation/trace/events.txt |9 +++-- > Documentation/trace/ftrace.txt |9 +++-- > kernel/trace/Kconfig |2 ++ > kernel/trace/ftrace.c |4 > kernel/trace/trace.c |2 +- > kernel/trace/trace.h |2 ++ > kernel/trace/trace_events_filter.c | 17 - > 7 files changed, 31 insertions(+), 14 deletions(-) > > diff --git a/Documentation/trace/events.txt b/Documentation/trace/events.txt > index 08d74d7..2cc08d4 100644 > --- a/Documentation/trace/events.txt > +++ b/Documentation/trace/events.txt > @@ -189,16 +189,13 @@ And for string fields they are: > > ==, !=, ~ > > -The glob (~) only accepts a wild card character (*) at the start and or > -end of the string. For example: > +The glob (~) accepts a wild card character (*,?) and character classes > +([). 
For example: > >prev_comm ~ "*sh" >prev_comm ~ "sh*" >prev_comm ~ "*sh*" > - > -But does not allow for it to be within the string: > - > - prev_comm ~ "ba*sh" <-- is invalid > + prev_comm ~ "ba*sh" > > 5.2 Setting filters > --- > diff --git a/Documentation/trace/ftrace.txt b/Documentation/trace/ftrace.txt > index a6b3705..b26abc7 100644 > --- a/Documentation/trace/ftrace.txt > +++ b/Documentation/trace/ftrace.txt > @@ -2218,16 +2218,13 @@ hrtimer_interrupt > sys_nanosleep > > > -Perhaps this is not enough. The filters also allow simple wild > -cards. Only the following are currently available > +Perhaps this is not enough. The filters also allow glob(7) matching. > > <match>* - will match functions that begin with <match> > *<match> - will match functions that end with <match> > *<match>* - will match functions that have <match> in it > - > -These are the only wild cards which are supported. > - > - <match1>*<match2> will not work. > + <match1>*<match2> - will match functions that begin with > + <match1> and end with <match2> > > Note: It is better to use quotes to enclose the wild cards, >otherwise the shell may expand the parameters into names > diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig > index ba33267..aa6eb15 100644 > --- a/kernel/trace/Kconfig > +++ b/kernel/trace/Kconfig > @@ -70,6 +70,7 @@ config FTRACE_NMI_ENTER > > config EVENT_TRACING > select CONTEXT_SWITCH_TRACER > +select GLOB > bool > > config CONTEXT_SWITCH_TRACER > @@ -133,6 +134,7 @@ config FUNCTION_TRACER > select KALLSYMS > select GENERIC_TRACER > select CONTEXT_SWITCH_TRACER > +select GLOB > help > Enable the kernel to trace every kernel function. 
This is done > by using a compiler feature to insert a small, 5-byte No-Operation > diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c > index 84752c8..5741184 100644 > --- a/kernel/trace/ftrace.c > +++ b/kernel/trace/ftrace.c > @@ -3493,6 +3493,10 @@ static int ftrace_match(char *str, struct ftrace_glob > *g) > memcmp(str + slen - g->len, g->search, g->len) == 0) > matched = 1; > break; > + case MATCH_GLOB: > + if (glob_match(g->search, str)) > + matched = 1; > + break; > } > > return matched; > diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c > index 37824d9..ae343e7 100644 > --- a/kernel/trace/trace.c > +++ b/kernel/trace/trace.c > @@ -4065,7 +4065,7 @@ static const char readme_msg[] = > "\n available_filter_functions - list of functions that can be > filtered on\n" > " set_ftrace_filter\t- echo function name in here to only trace > these\n" > "\t\t\t functions\n" > - "\t accepts: func_full_name, *func_end, func_begin*, > *func_middle*\n" > + "\t accepts: func_full_name or glob-matching-pattern\n" > "\t modules: Can select a group via module\n" > "\t Format: :mod:\n" > "\t example: echo :mod:ext3 >
Re: [PATCH] tools lib traceevent: Fix kbuffer_read_at_offset()
Hi Steve, On Wed, Oct 05, 2016 at 09:28:01AM -0400, Steven Rostedt wrote: > On Sat, 1 Oct 2016 19:17:00 +0900 > Namhyung Kim wrote: > > > When it's called with an offset less than or equal to the first event, > > it'll return a garbage value since the data is not initialized. > > Well, it can at most be equal to (unless offset is negative) because > kbuffer_load_subbuffer() sets kbuf->curr to zero. Actually kbuffer_load_subbuffer() calls kbuf->next_event(). Inside the function it has a loop updating next valid event. Sometimes, the data starts with TIME_EXTEND with value of 0 and the loop skips it which ended up setting kbuf->curr to 8. :) I'll take a look at it later. > > But that said, it looks like offset == 0 is buggy. > > Acked-by: Steven Rostedt Thanks, Namhyung > > > -- Steve > > > > > Cc: Steven Rostedt > > Signed-off-by: Namhyung Kim > > --- > > tools/lib/traceevent/kbuffer-parse.c | 1 + > > 1 file changed, 1 insertion(+) > > > > diff --git a/tools/lib/traceevent/kbuffer-parse.c > > b/tools/lib/traceevent/kbuffer-parse.c > > index 3bcada3ae05a..65984f1c2974 100644 > > --- a/tools/lib/traceevent/kbuffer-parse.c > > +++ b/tools/lib/traceevent/kbuffer-parse.c > > @@ -622,6 +622,7 @@ void *kbuffer_read_at_offset(struct kbuffer *kbuf, int > > offset, > > > > /* Reset the buffer */ > > kbuffer_load_subbuffer(kbuf, kbuf->subbuffer); > > + data = kbuffer_read_event(kbuf, ts); > > > > while (kbuf->curr < offset) { > > data = kbuffer_next_event(kbuf, ts); >
[PATCH - stable 4.1 backport] block: don't release bdi while request_queue has live references
Hi, This patch was marked for stable v4.2+, but is needed for v4.1 as well. It fixes a regression introduced by: Fixes: 6cd18e711dd8 ("block: destroy bdi before blockdev is unregistered.") This is a backport to 4.1.33 which has been tested and confirmed to work. Bug report at https://bugzilla.kernel.org/show_bug.cgi?id=173031 Please queue for 4.1.y Thanks, NeilBrown From: Tejun Heo Date: Tue, 8 Sep 2015 12:20:22 -0400 Subject: [PATCH] block: don't release bdi while request_queue has live references [ Upstream commit: b02176f30cd30acccd3b633ab7d9aed8b5da52ff ] bdi's are initialized in two steps, bdi_init() and bdi_register(), but destroyed in a single step by bdi_destroy() which, for a bdi embedded in a request_queue, is called during blk_cleanup_queue() which makes the queue invisible and starts the draining of remaining usages. A request_queue's user can access the congestion state of the embedded bdi as long as it holds a reference to the queue. As such, it may access the congested state of a queue which finished blk_cleanup_queue() but hasn't reached blk_release_queue() yet. Because the congested state was embedded in backing_dev_info which in turn is embedded in request_queue, accessing the congested state after bdi_destroy() was called was fine. The bdi was destroyed but the memory region for the congested state remained accessible till the queue got released. a13f35e87140 ("writeback: don't embed root bdi_writeback_congested in bdi_writeback") changed the situation. Now, the root congested state which is expected to be pinned while request_queue remains accessible is separately reference counted and the base ref is put during bdi_destroy(). This means that the root congested state may go away prematurely while the queue is between bdi_destroy() and blk_release_queue(), which was detected by Andrey's KASAN tests.
The root cause of this problem is that bdi doesn't distinguish the two steps of destruction, unregistration and release, and now the root congested state actually requires a separate release step. To fix the issue, this patch separates out bdi_unregister() and bdi_exit() from bdi_destroy(). bdi_unregister() is called from blk_cleanup_queue() and bdi_exit() from blk_release_queue(). bdi_destroy() is now just a simple wrapper calling the two steps back-to-back. While at it, the prototype of bdi_destroy() is moved right below bdi_setup_and_register() so that the counterpart operations are located together. Signed-off-by: Tejun Heo Fixes: a13f35e87140 ("writeback: don't embed root bdi_writeback_congested in bdi_writeback") Fixes: 6cd18e711dd8 ("block: destroy bdi before blockdev is unregistered.") Cc: sta...@vger.kernel.org # v4.2+ Reported-and-tested-by: Andrey Konovalov Reported-and-tested-by: Francesco Dolcini (for 4.1 backport) Link: http://lkml.kernel.org/g/CAAeHK+zUJ74Zn17=rOyxacHU18SgCfC6bsYW=6kcy5gxjbw...@mail.gmail.com Reviewed-by: Jan Kara Reviewed-by: Jeff Moyer Signed-off-by: Jens Axboe Signed-off-by: NeilBrown --- block/blk-core.c| 2 +- block/blk-sysfs.c | 1 + include/linux/backing-dev.h | 5 - mm/backing-dev.c| 17 ++--- 4 files changed, 20 insertions(+), 5 deletions(-) diff --git a/block/blk-core.c b/block/blk-core.c index bbbf36e6066b..edf8d72daa83 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -554,7 +554,7 @@ void blk_cleanup_queue(struct request_queue *q) q->queue_lock = &q->__queue_lock; spin_unlock_irq(lock); - bdi_destroy(&q->backing_dev_info); + bdi_unregister(&q->backing_dev_info); /* @q is and will stay empty, shutdown and put */ blk_put_queue(q); diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c index 2b8fd302f677..c0bb3291859c 100644 --- a/block/blk-sysfs.c +++ b/block/blk-sysfs.c @@ -501,6 +501,7 @@ static void blk_release_queue(struct kobject *kobj) struct request_queue *q = container_of(kobj, struct request_queue, kobj); +
bdi_exit(&q->backing_dev_info); blkcg_exit_queue(q); if (q->elevator) { diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h index d87d8eced064..17d1799f8552 100644 --- a/include/linux/backing-dev.h +++ b/include/linux/backing-dev.h @@ -110,12 +110,15 @@ struct backing_dev_info { struct backing_dev_info *inode_to_bdi(struct inode *inode); int __must_check bdi_init(struct backing_dev_info *bdi); -void bdi_destroy(struct backing_dev_info *bdi); +void bdi_exit(struct backing_dev_info *bdi); __printf(3, 4) int bdi_register(struct backing_dev_info *bdi, struct device *parent, const char *fmt, ...); int bdi_register_dev(struct backing_dev_info *bdi, dev_t dev); +void bdi_unregister(struct backing_dev_info *bdi); +void bdi_destroy(struct backing_dev_info *bdi); + int __must_check bdi_setup_and_register(struct backing_dev_info *, char *); void bdi_start_writeback(struct backing_dev_info *bdi, long nr_pages,
loop mount: kernel BUG at lib/percpu-refcount.c:231
Hi, Below bug happened to me while loop-mounting a file image after stopping a kvm guest. But it only happened once till now.. [ 4761.031686] [ cut here ] [ 4761.075984] kernel BUG at lib/percpu-refcount.c:231! [ 4761.120184] invalid opcode: [#1] SMP [ 4761.164307] Modules linked in: loop(+) macvtap macvlan tun ccm rfcomm fuse snd_hda_codec_hdmi cmac bnep vfat fat kvm_intel kvm irqbypass arc4 i915 rtsx_pci_sdmmc intel_gtt drm_kms_helper iwlmvm syscopyarea sysfillrect sysimgblt fb_sys_fops mac80211 drm snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel snd_hda_codec btusb snd_hwdep iwlwifi snd_hda_core input_leds btrtl snd_seq pcspkr serio_raw btbcm snd_seq_device i2c_i801 btintel cfg80211 bluetooth snd_pcm i2c_smbus rtsx_pci mfd_core e1000e ptp pps_core snd_timer thinkpad_acpi wmi snd soundcore rfkill video nfsd auth_rpcgss nfs_acl lockd grace sunrpc [ 4761.323045] CPU: 1 PID: 25890 Comm: modprobe Not tainted 4.8.0+ #168 [ 4761.377791] Hardware name: LENOVO 20ARS1BJ02/20ARS1BJ02, BIOS GJET86WW (2.36 ) 12/04/2015 [ 4761.433704] task: 986fd1b7d780 task.stack: a85842528000 [ 4761.490120] RIP: 0010:[] [] __percpu_ref_switch_to_percpu+0xf8/0x100 [ 4761.548138] RSP: 0018:a8584252bb38 EFLAGS: 00010246 [ 4761.604673] RAX: RBX: 986fbdca3200 RCX: [ 4761.662416] RDX: 00983288 RSI: 0001 RDI: 986fbdca3958 [ 4761.720473] RBP: a8584252bb80 R08: 0008 R09: 0008 [ 4761.779270] R10: R11: R12: [ 4761.837603] R13: 9870fa22c800 R14: 9870fa22c80c R15: 986fbdca3200 [ 4761.895870] FS: 7fc286eb4640() GS:98711f24() knlGS: [ 4761.954596] CS: 0010 DS: ES: CR0: 80050033 [ 4762.012978] CR2: 555c3a20ee78 CR3: 000212988000 CR4: 001406e0 [ 4762.072454] Stack: [ 4762.131283] 9870f2f37800 9870c8e46000 9870fa22c880 a8584252bbb8 [ 4762.190776] ae2a147c ba169577 986fbdca3200 9870fa22c870 [ 4762.251149] 9870fa22c800 a8584252bb90 ae2b3294 a8584252bbc8 [ 4762.311657] Call Trace: [ 4762.371157] [] ?
kobject_uevent_env+0xfc/0x3b0 [ 4762.431483] [] percpu_ref_switch_to_percpu+0x14/0x20 [ 4762.492093] [] blk_register_queue+0xbe/0x120 [ 4762.552727] [] device_add_disk+0x1c4/0x470 [ 4762.614155] [] loop_add+0x1d9/0x260 [loop] [ 4762.674042] [] loop_init+0x119/0x16c [loop] [ 4762.733949] [] ? 0xc02ff000 [ 4762.793563] [] do_one_initcall+0x4b/0x180 [ 4762.853068] [] ? free_vmap_area_noflush+0x43/0xb0 [ 4762.913665] [] do_init_module+0x55/0x1c4 [ 4762.973400] [] load_module+0x1fc4/0x23e0 [ 4763.033545] [] ? __symbol_put+0x60/0x60 [ 4763.094281] [] SYSC_init_module+0x138/0x150 [ 4763.154985] [] SyS_init_module+0x9/0x10 [ 4763.215577] [] entry_SYSCALL_64_fastpath+0x1e/0xad [ 4763.277044] Code: 00 48 c7 c7 20 c7 a8 ae 48 63 d2 e8 63 ef ff ff 3b 05 81 a9 7d 00 89 c2 7c cd 48 8b 43 08 48 83 e0 fe 48 89 43 08 e9 3c ff ff ff <0f> 0b e8 81 b6 d9 ff 90 55 48 89 e5 41 54 4c 8d 67 d8 53 48 89 [ 4763.342964] RIP [] __percpu_ref_switch_to_percpu+0xf8/0x100 [ 4763.407151] RSP Thanks Dave
[GIT PULL] Please pull powerpc/linux.git powerpc-4.9-1 tag
Hi Linus, Please pull the first batch of powerpc updates for 4.9: The following changes since commit c6935931c1894ff857616ff8549b61236a19148f: Linux 4.8-rc5 (2016-09-04 14:31:46 -0700) are available in the git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git tags/powerpc-4.9-1 for you to fetch changes up to b7b7013cac55d794940bd9cb7b7c55c9dececac4: powerpc/bpf: Add support for bpf constant blinding (2016-10-04 20:33:20 +1100) powerpc updates for 4.9 Highlights: - Major rework of Book3S 64-bit exception vectors (Nicholas Piggin) - Use gas sections for arranging exception vectors et. al. - Large set of TM cleanups and selftests (Cyril Bur) - Enable transactional memory (TM) lazily for userspace (Cyril Bur) - Support for XZ compression in the zImage wrapper (Oliver O'Halloran) - Add support for bpf constant blinding (Naveen N. Rao) - Beginnings of upstream support for PA Semi Nemo motherboards (Darren Stevens) Fixes: - Ensure .mem(init|exit).text are within _stext/_etext (Michael Ellerman) - xmon: Don't use ld on 32-bit (Michael Ellerman) - vdso64: Use double word compare on pointers (Anton Blanchard) - powerpc/nvram: Fix an incorrect partition merge (Pan Xinhui) - powerpc: Fix usage of _PAGE_RO in hugepage (Christophe Leroy) - powerpc/mm: Update FORCE_MAX_ZONEORDER range to allow hugetlb w/4K (Aneesh Kumar K.V) - Fix memory leak in queue_hotplug_event() error path (Andrew Donnellan) - Replay hypervisor maintenance interrupt first (Nicholas Piggin) Cleanups & features: - Sparse fixes/cleanups (Daniel Axtens) - Preserve CFAR value on SLB miss caused by access to bogus address (Paul Mackerras) - Radix MMU fixups for POWER9 (Aneesh Kumar K.V) - Support for setting used_(vsr|vr|spe) in sigreturn path (for CRIU) (Simon Guo) - Optimise syscall entry for virtual, relocatable case (Nicholas Piggin) - Optimise MSR handling in exception handling (Nicholas Piggin) - Support for kexec with Radix MMU (Benjamin Herrenschmidt) - powernv EEH fixes 
(Russell Currey) - Surprise PCI hotplug support for powernv (Gavin Shan) - Endian/sparse fixes for powernv PCI (Gavin Shan) - Defconfig updates (Anton Blanchard) - Various performance optimisations (Anton Blanchard) - Align hot loops of memset() and backwards_memcpy() - During context switch, check before setting mm_cpumask - Remove static branch prediction in atomic{, 64}_add_unless - Only disable HAVE_EFFICIENT_UNALIGNED_ACCESS on POWER7 little endian - Set default CPU type to POWER8 for little endian builds - KVM: PPC: Book3S HV: Migrate pinned pages out of CMA (Balbir Singh) - cxl: Flush PSL cache before resetting the adapter (Frederic Barrat) - cxl: replace loop with for_each_child_of_node(), remove unneeded of_node_put() (Andrew Donnellan) - Fix HV facility unavailable to use correct handler (Nicholas Piggin) - Remove unnecessary syscall trampoline (Nicholas Piggin) - fadump: Fix build break when CONFIG_PROC_VMCORE=n (Michael Ellerman) - Quieten EEH message when no adapters are found (Anton Blanchard) - powernv: Add PHB register dump debugfs handle (Russell Currey) - Use kprobe blacklist for exception handlers & asm functions (Nicholas Piggin) - Document the syscall ABI (Nicholas Piggin) - MAINTAINERS: Update cxl maintainers (Michael Neuling) - powerpc: Remove all usages of NO_IRQ (Michael Ellerman) Minor cleanups: - Andrew Donnellan, Christophe Leroy, Colin Ian King, Cyril Bur, Frederic Barrat, Pan Xinhui, PrasannaKumar Muralidharan, Rui Teng, Simon Guo.
Andrew Donnellan (3): powerpc/pseries: fix memory leak in queue_hotplug_event() error path powerpc/powernv: Fix comment style and spelling cxl: replace loop with for_each_child_of_node(), remove unneeded of_node_put() Aneesh Kumar K.V (6): powerpc/book3s: Add a cpu table entry for different POWER9 revs powerpc/mm/radix: Use different RTS encoding for different POWER9 revs powerpc/mm/radix: Use different pte update sequence for different POWER9 revs powerpc/mm: Update the HID bit when switching from radix to hash powerpc/mm: Update FORCE_MAX_ZONEORDER range to allow hugetlb w/4K powerpc/mm: Add radix flush all with IS=3 Anton Blanchard (11): powerpc/vdso64: Use double word compare on pointers powerpc/64: Align hot loops of memset() and backwards_memcpy() powerpc/configs: Enable VMX crypto powerpc/configs: Bump kernel ring buffer size on 64 bit configs powerpc/configs: Change a few things from built in to modules powerpc/configs: Enable Intel i40e on 64 bit configs powerpc/eeh: Quieten EEH message when no adapters are found powerpc: During context switch, check before setting mm_cpumask powerpc: Remove static branch prediction in atomic{, 64}_add_unless powerpc: Only
Re: [PATCH 4.8 00/10] 4.8.1-stable review
On Thu, Oct 06, 2016 at 11:51:01AM -0700, Guenter Roeck wrote: > On Thu, Oct 06, 2016 at 10:18:23AM +0200, Greg Kroah-Hartman wrote: > > This is the start of the stable review cycle for the 4.8.1 release. > > There are 10 patches in this series, all will be posted as a response > > to this one. If anyone has any issues with these being applied, please > > let me know. > > > > Responses should be made by Sat Oct 8 07:47:33 UTC 2016. > > Anything received after that time might be too late. > > > > Build results: > total: 149 pass: 149 fail: 0 > Qemu test results: > total: 108 pass: 108 fail: 0 > > Details are available at http://kerneltests.org/builders. Great, thanks for testing all of these and letting me know. greg k-h
Re: CONFIG_DEBUG_TEST_DRIVER_REMOVE needs a warning
On Thu, Oct 6, 2016 at 6:53 PM, Laura Abbott wrote: > On a whim, I decided to turn on CONFIG_DEBUG_TEST_DRIVER_REMOVE on > Fedora rawhide since it sounded harmless enough. It spewed warnings > and panicked some systems. Clearly it's doing its job > well of finding drivers that can't handle remove properly and I > underestimated it. I was expecting to maybe find a driver or two. > Can we get stronger Kconfig text indicating that this shouldn't be > turned on lightly? I'll be turning the option off in Fedora but sending > out reports from what was found. It hides behind CONFIG_DEBUG already. Is there a better option that distros won't turn on? Rob
Re: [PATCH 4.7 000/141] 4.7.7-stable review
On Thu, Oct 06, 2016 at 11:54:02AM -0700, Guenter Roeck wrote: > On Thu, Oct 06, 2016 at 10:27:16AM +0200, Greg Kroah-Hartman wrote: > > This is the start of the stable review cycle for the 4.7.7 release. > > There are 141 patches in this series, all will be posted as a response > > to this one. If anyone has any issues with these being applied, please > > let me know. > > > > Responses should be made by Sat Oct 8 07:44:08 UTC 2016. > > Anything received after that time might be too late. > > > > Build results: > total: 149 pass: 148 fail: 1 > Failed builds: > powerpc:ppc6xx_defconfig > > Qemu test results: > total: 108 pass: 108 fail: 0 > > Adding upstream commit c1a23f6d6455 ("scsi: sas: provide stub implementation > for scsi_is_sas_rphy") fixes the build problem. Thanks, this should now be fixed. greg k-h
Re: Change CONFIG_DEVKMEM default value to n
On Fri, Oct 07, 2016 at 10:04:11AM +0800, Dave Young wrote: > Kconfig comment suggests setting it as "n" if in doubt thus move the > default value to 'n'. > > Signed-off-by: Dave Young > Suggested-by: Kees Cook > --- > drivers/char/Kconfig | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > --- linux-x86.orig/drivers/char/Kconfig > +++ linux-x86/drivers/char/Kconfig > @@ -17,7 +17,7 @@ config DEVMEM > > config DEVKMEM > bool "/dev/kmem virtual device support" > - default y > + default n If you remove the "default" line, it defaults to 'n'. And is it really "safe" to default this to n now? thanks, greg k-h
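For reference, Greg's observation holds because a Kconfig bool symbol with no "default" line defaults to n, so the minimal form of the change is to delete the line rather than flip it. A sketch of what the resulting drivers/char/Kconfig entry would look like (help text abbreviated, and the exact final hunk is an assumption):

```kconfig
config DEVKMEM
	bool "/dev/kmem virtual device support"
	# no "default" line: bool symbols without one default to n
	help
	  Say Y here if you want to support the /dev/kmem device. The
	  /dev/kmem device is rarely used, but can be used for certain
	  kind of kernel debugging operations.
	  When in doubt, say "N".
```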
Re: [PATCH 4.8 00/10] 4.8.1-stable review
On Thu, Oct 06, 2016 at 01:56:28PM -0600, Shuah Khan wrote: > On 10/06/2016 02:18 AM, Greg Kroah-Hartman wrote: > > This is the start of the stable review cycle for the 4.8.1 release. > > There are 10 patches in this series, all will be posted as a response > > to this one. If anyone has any issues with these being applied, please > > let me know. > > > > Responses should be made by Sat Oct 8 07:47:33 UTC 2016. > > Anything received after that time might be too late. > > > > The whole patch series can be found in one patch at: > > kernel.org/pub/linux/kernel/v4.x/stable-review/patch-4.8.1-rc1.gz > > or in the git tree and branch at: > > git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git > > linux-4.8.y > > and the diffstat can be found below. > > > > thanks, > > > > greg k-h > > > > Compiled and booted on my test system. No dmesg regressions. Thanks for testing all of these and letting me know. greg k-h
Re: [PATCH] staging: lustre: lprocfs_status.h: fix sparse error: symbol redeclared with different type
On Thu, Oct 06, 2016 at 06:52:07PM +0200, Samuele Baisi wrote: > drivers/staging/lustre/lustre/obdclass/lprocfs_status.c:1554:5: error: > symbol 'lprocfs_wr_root_squash' redeclared with different type (originally > declared at > drivers/staging/lustre/lustre/obdclass/../include/lprocfs_status.h:704) > - incompatible argument 1 (different address spaces) > > drivers/staging/lustre/lustre/obdclass/lprocfs_status.c:1618:5: error: > symbol 'lprocfs_wr_nosquash_nids' redeclared with different type (originally > declared at > drivers/staging/lustre/lustre/obdclass/../include/lprocfs_status.h:706) > - incompatible argument 1 (different address spaces) > > Added __user annotation to the header definitions arguments (which are > indeed userspace buffers). Are they really? Have you tested this? The last time this was looked at, it was a non-trivial problem... And any reason you didn't cc the lustre maintainers with this change? If you think it is correct, please resend it with the testing information and cc: them. thanks, greg k-h
Re: [PATCH 4.7 122/141] scsi: ses: use scsi_is_sas_rphy instead of is_sas_attached
On Thu, Oct 06, 2016 at 09:25:34PM +0800, James Bottomley wrote: > On Thu, 2016-10-06 at 10:29 +0200, Greg Kroah-Hartman wrote: > > 4.7-stable review patch. If anyone has any objections, please let me > > know. > > This doesn't build if SCSI_SAS_ATTRS isn't set without this patch: > > > commit c1a23f6d64552b4480208aa584ec7e9c13d6d9c3 > Author: Johannes Thumshirn > Date: Wed Aug 17 11:46:16 2016 +0200 > > scsi: sas: provide stub implementation for scsi_is_sas_rphy > > Does it? You are right, we have ppc build failures without this, thanks for letting me know. greg k-h
Re: CONFIG_DEBUG_TEST_DRIVER_REMOVE needs a warning
On Thu, Oct 06, 2016 at 04:53:20PM -0700, Laura Abbott wrote: > On a whim, I decided to turn on CONFIG_DEBUG_TEST_DRIVER_REMOVE on > Fedora rawhide since it sounded harmless enough. It spewed warnings > and panicked some systems. Clearly it's doing its job > well of finding drivers that can't handle remove properly and I > underestimated it. Yes, we knew it was going to find bugs, you were brave :) > I was expecting to maybe find a driver or two. > Can we get stronger Kconfig text indicating that this shouldn't be > turned on lightly? I'll be turning the option off in Fedora but sending > out reports from what was found. Care to send a patch with the wording change you would have found better to warn yourself not to do this? thanks, greg k-h
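One possible shape for the wording change Greg asks for, as a purely hypothetical sketch: the symbol name matches drivers/base/Kconfig, but the strengthened text below is illustrative, not an actual patch.

```kconfig
config DEBUG_TEST_DRIVER_REMOVE
	bool "Test driver remove calls during probe (UNSTABLE)"
	depends on DEBUG_KERNEL
	help
	  Say Y here to test driver remove functions: the driver core
	  will call remove immediately after a successful probe, then
	  probe again. Many drivers do not yet survive this; expect
	  warnings and even panics on real hardware. Intended only for
	  driver developers hunting remove-path bugs. Do NOT enable
	  this in a distribution kernel.
```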