3.14 stable regression don't remove from shrink list in select_collect()

2016-01-25 Thread Shawn Bohrer
I recently updated some machines to 3.14.58 and they reliably get soft lockups. Sometimes the soft lockup recovers and sometimes it does not. I've bisected this on the 3.14 stable branch and arrived at: c214cb82cdc744225d85899fc138251527f75fff "don't remove from shrink list in select_collect()"

Re: NFS Freezer and stuck tasks

2015-05-01 Thread Shawn Bohrer
On Fri, May 01, 2015 at 05:10:34PM -0400, Benjamin Coddington wrote: > On Fri, 1 May 2015, Benjamin Coddington wrote: > > > On Wed, 4 Mar 2015, Shawn Bohrer wrote: > > > > > Hello, > > > > > > We're using the Linux cgroup Freezer on some machines

Re: HugePages_Rsvd leak

2015-04-08 Thread Shawn Bohrer
On Wed, Apr 08, 2015 at 02:16:05PM -0700, Mike Kravetz wrote: > On 04/08/2015 09:15 AM, Shawn Bohrer wrote: > >I've noticed on a number of my systems that after shutting down my > >application that uses huge pages that I'm left with some pages still > >in HugePag

Re: HugePages_Rsvd leak

2015-04-08 Thread Shawn Bohrer
On Wed, Apr 08, 2015 at 12:29:03PM -0700, Davidlohr Bueso wrote: > On Wed, 2015-04-08 at 11:15 -0500, Shawn Bohrer wrote: > > AnonHugePages:241664 kB > > HugePages_Total: 512 > > HugePages_Free: 512 > > HugePages_Rsvd: 384 > > HugePages_Surp:

HugePages_Rsvd leak

2015-04-08 Thread Shawn Bohrer
I've noticed on a number of my systems that after shutting down my application that uses huge pages, I'm left with some pages still in HugePages_Rsvd. It is possible that I still have something using huge pages that I'm not aware of, but so far my attempts to find anything using huge pages have

NFS Freezer and stuck tasks

2015-03-04 Thread Shawn Bohrer
Hello, We're using the Linux cgroup Freezer on some machines that use NFS and have run into what appears to be a bug where frozen tasks are blocking running tasks and preventing them from completing. On one of our machines which happens to be running an older 3.10.46 kernel we have frozen some of

Re: [PATCH v3] ib_umem_release should decrement mm->pinned_vm from ib_umem_get

2014-09-15 Thread Shawn Bohrer
On Wed, Sep 03, 2014 at 12:13:57PM -0500, Shawn Bohrer wrote: > From: Shawn Bohrer > > In debugging an application that receives -ENOMEM from ib_reg_mr() I > found that ib_umem_get() can fail because the pinned_vm count has > wrapped causing it to always be larger than the lock

[PATCH v3] ib_umem_release should decrement mm->pinned_vm from ib_umem_get

2014-09-03 Thread Shawn Bohrer
From: Shawn Bohrer In debugging an application that receives -ENOMEM from ib_reg_mr() I found that ib_umem_get() can fail because the pinned_vm count has wrapped causing it to always be larger than the lock limit even with RLIMIT_MEMLOCK set to RLIM_INFINITY. The wrapping of pinned_vm occurs

Re: [PATCH] ib_umem_release should decrement mm->pinned_vm from ib_umem_get

2014-08-28 Thread Shawn Bohrer
On Thu, Aug 28, 2014 at 02:48:19PM +0300, Haggai Eran wrote: > On 26/08/2014 00:07, Shawn Bohrer wrote: > >>>> The following patch fixes the issue by storing the mm_struct of the > >> > > >> > You are doing more than just storing the mm_struct - you ar

[PATCH v2] ib_umem_release should decrement mm->pinned_vm from ib_umem_get

2014-08-28 Thread Shawn Bohrer
From: Shawn Bohrer In debugging an application that receives -ENOMEM from ib_reg_mr() I found that ib_umem_get() can fail because the pinned_vm count has wrapped causing it to always be larger than the lock limit even with RLIMIT_MEMLOCK set to RLIM_INFINITY. The wrapping of pinned_vm occurs

Re: [PATCH] ib_umem_release should decrement mm->pinned_vm from ib_umem_get

2014-08-25 Thread Shawn Bohrer
e to be destroyed, > and the file handle is waiting for the mm to be destroyed. > > The proper solution is to keep a reference to the task_pid (using > get_task_pid), and use this pid to get the task_struct and from it > the mm_struct during the destruction flow. I'll put to

Re: [PATCH] ib_umem_release should decrement mm->pinned_vm from ib_umem_get

2014-08-20 Thread Shawn Bohrer
On Tue, Aug 12, 2014 at 11:27:35AM -0500, Shawn Bohrer wrote: > From: Shawn Bohrer > > In debugging an application that receives -ENOMEM from ib_reg_mr() I > found that ib_umem_get() can fail because the pinned_vm count has > wrapped causing it to always be larger than the lock

[PATCH] ib_umem_release should decrement mm->pinned_vm from ib_umem_get

2014-08-12 Thread Shawn Bohrer
From: Shawn Bohrer In debugging an application that receives -ENOMEM from ib_reg_mr() I found that ib_umem_get() can fail because the pinned_vm count has wrapped causing it to always be larger than the lock limit even with RLIMIT_MEMLOCK set to RLIM_INFINITY. The wrapping of pinned_vm occurs

Re: 3.10.16 cgroup_mutex deadlock

2013-11-20 Thread Shawn Bohrer
On Tue, Nov 19, 2013 at 10:55:18AM +0800, Li Zefan wrote: > > Thanks Tejun and Hugh. Sorry for my late entry in getting around to > > testing this fix. On the surface it sounds correct however I'd like to > > test this on top of 3.10.* since that is what we'll likely be running. > > I've tried to

Re: 3.10.16 cgroup_mutex deadlock

2013-11-18 Thread Shawn Bohrer
On Sun, Nov 17, 2013 at 06:17:17PM -0800, Hugh Dickins wrote: > Sorry for the delay: I was on the point of reporting success last > night, when I tried a debug kernel: and that didn't work so well > (got spinlock bad magic report in pwd_adjust_max_active(), and > tests wouldn't run at all). > > Ev

Re: 3.10.16 cgroup_mutex deadlock

2013-11-14 Thread Shawn Bohrer
On Tue, Nov 12, 2013 at 05:55:04PM +0100, Michal Hocko wrote: > On Tue 12-11-13 09:55:30, Shawn Bohrer wrote: > > On Tue, Nov 12, 2013 at 03:31:47PM +0100, Michal Hocko wrote: > > > On Tue 12-11-13 18:17:20, Li Zefan wrote: > > > > Cc more people > > > >

3.10.16 cgroup_mutex deadlock

2013-11-11 Thread Shawn Bohrer
Hello, This morning I had a machine running 3.10.16 go unresponsive but before we killed it we were able to get the information below. I'm not an expert here but it looks like most of the tasks below are blocking waiting on the cgroup_mutex. You can see that the resource_alloca:16502 task is hol

3.10.16 general protection fault kmem_cache_alloc+0x67/0x170

2013-11-04 Thread Shawn Bohrer
I had a machine crash this weekend running a 3.10.16 kernel that additionally has a few backported networking patches for performance improvements. At this point I can't rule out that the bug is from those patches, and I haven't yet tried to see if I can reproduce the crash. I did happen to ha

[tip:sched/core] sched/rt: Remove redundant nr_cpus_allowed test

2013-10-06 Thread tip-bot for Shawn Bohrer
Commit-ID: 6bfa687c19b7ab8adee03f0d43c197c2945dd869 Gitweb: http://git.kernel.org/tip/6bfa687c19b7ab8adee03f0d43c197c2945dd869 Author: Shawn Bohrer AuthorDate: Fri, 4 Oct 2013 14:24:53 -0500 Committer: Ingo Molnar CommitDate: Sun, 6 Oct 2013 11:28:40 +0200 sched/rt: Remove redundant

[PATCH] sched/rt: Remove redundant nr_cpus_allowed test

2013-10-04 Thread Shawn Bohrer
From: Shawn Bohrer In 76854c7e8f3f4172fef091e78d88b3b751463ac6 "sched: Use rt.nr_cpus_allowed to recover select_task_rq() cycles" an optimization was added to select_task_rq_rt() that immediately returns when p->nr_cpus_allowed == 1 at the beginning of the function. This makes

[PATCH] USB: Fix compilation error when CONFIG_PM disabled

2013-08-26 Thread Shawn Bohrer
nd and ohci_resume are only defined when CONFIG_PM is defined so only use them under CONFIG_PM. Signed-off-by: Shawn Bohrer --- drivers/usb/host/ohci-pci.c |2 ++ 1 files changed, 2 insertions(+), 0 deletions(-) diff --git a/drivers/usb/host/ohci-pci.c b/drivers/usb/host/ohci-pci.c index 0f1d193..06

Re: [PATCH net] net: rename and move busy poll mib counter

2013-08-06 Thread Shawn Bohrer
On Tue, Aug 06, 2013 at 03:14:48AM -0700, Eric Dumazet wrote: > On Tue, 2013-08-06 at 12:52 +0300, Eliezer Tamir wrote: > > Move the low latency mib counter to the ip section. > > Rename it from low latency to busy poll. > > > > Reported-by: Shawn Bohrer >

Re: 3.10-rc4 stalls during mmap writes

2013-06-11 Thread Shawn Bohrer
On Tue, Jun 11, 2013 at 06:53:15AM +1000, Dave Chinner wrote: > On Mon, Jun 10, 2013 at 01:45:59PM -0500, Shawn Bohrer wrote: > > On Sun, Jun 09, 2013 at 01:37:44PM +1000, Dave Chinner wrote: > > > On Fri, Jun 07, 2013 at 02:37:12PM -0500, Shawn Bohrer wrote: > > So to s

Re: 3.10-rc4 stalls during mmap writes

2013-06-10 Thread Shawn Bohrer
On Sun, Jun 09, 2013 at 01:37:44PM +1000, Dave Chinner wrote: > On Fri, Jun 07, 2013 at 02:37:12PM -0500, Shawn Bohrer wrote: > > So I guess my question is does anyone know why I'm now seeing these > > stalls with 3.10? > > Because we made all metadata updates in XFS fu

3.10-rc4 stalls during mmap writes

2013-06-07 Thread Shawn Bohrer
I've started testing the 3.10 kernel (previously I was on 3.4), and I'm encountering some fairly large stalls in my memory mapped writes in the range of .01 to 1s. I've managed to capture two of these stalls so far and both looked like the following: 1) Writing process writes to a new page and block

Re: old version of trace-cmd broken on 3.10 kernel

2013-05-31 Thread Shawn Bohrer
On Fri, May 31, 2013 at 01:05:35PM -0400, Steven Rostedt wrote: > On Fri, 2013-05-31 at 11:50 -0500, Shawn Bohrer wrote: > > Not sure if this is a big deal or not. I've got an old version of > > trace-cmd. It was built from git on 2012-09-12 but sadly I didn't > > s

old version of trace-cmd broken on 3.10 kernel

2013-05-31 Thread Shawn Bohrer
Not sure if this is a big deal or not. I've got an old version of trace-cmd. It was built from git on 2012-09-12 but sadly I didn't stash away the exact commit hash. Anyway this version works fine on a 3.4 kernel but on a 3.10-rc3 kernel it no longer works. I just pulled the latest trace-cmd fro

Re: deadlock on vmap_area_lock

2013-05-02 Thread Shawn Bohrer
On Thu, May 02, 2013 at 08:03:04AM +1000, Dave Chinner wrote: > On Wed, May 01, 2013 at 08:57:38AM -0700, David Rientjes wrote: > > On Wed, 1 May 2013, Shawn Bohrer wrote: > > > > > I've got two compute clusters with around 350 machines each which are > > > r

Re: deadlock on vmap_area_lock

2013-05-01 Thread Shawn Bohrer
On Wed, May 01, 2013 at 08:57:38AM -0700, David Rientjes wrote: > On Wed, 1 May 2013, Shawn Bohrer wrote: > > > I've got two compute clusters with around 350 machines each which are > > running kernels based off of 3.1.9 (Yes I realize this is ancient by > > t

deadlock on vmap_area_lock

2013-05-01 Thread Shawn Bohrer
I've got two compute clusters with around 350 machines each which are running kernels based off of 3.1.9 (Yes I realize this is ancient by today's standards). All of the machines run a 'find' command once an hour on one of the mounted XFS filesystems. Occasionally these find commands get stuck req

Re: 3.7 HDMI channel map regression

2013-02-17 Thread Shawn Bohrer
On Sun, Feb 17, 2013 at 09:34:53AM +0100, Takashi Iwai wrote: > At Sat, 16 Feb 2013 18:22:25 -0600, > Shawn Bohrer wrote: > > > > On Mon, Jan 28, 2013 at 08:52:05PM -0600, Shawn Bohrer wrote: > > > On Mon, Jan 28, 2013 at 09:56:33AM +0100, Takashi Iwai wrote: > >

Re: 3.7 HDMI channel map regression

2013-02-16 Thread Shawn Bohrer
On Mon, Jan 28, 2013 at 08:52:05PM -0600, Shawn Bohrer wrote: > On Mon, Jan 28, 2013 at 09:56:33AM +0100, Takashi Iwai wrote: > > At Sun, 27 Jan 2013 19:18:27 -0600, > > Shawn Bohrer wrote: > > > > > > Hi Takashi, > > > > > > I recently updated

Re: 3.7 HDMI channel map regression

2013-01-28 Thread Shawn Bohrer
On Mon, Jan 28, 2013 at 09:56:33AM +0100, Takashi Iwai wrote: > At Sun, 27 Jan 2013 19:18:27 -0600, > Shawn Bohrer wrote: > > > > Hi Takashi, > > > > I recently updated my HTPC from 3.6.11 to 3.7.2 and this caused my RL > > and FC channels to swap, and my

3.7 HDMI channel map regression

2013-01-27 Thread Shawn Bohrer
Hi Takashi, I recently updated my HTPC from 3.6.11 to 3.7.2 and this caused my RL and FC channels to swap, and my RR and LFE channels to swap for PCM audio. Doing a git bisect identified d45e6889ee69456a4d5b1bbb32252f460cd48fa9 "ALSA: hda - Provide the proper channel mapping for generic HDMI driv

[PATCH] sched_rt: Use root_domain of rt_rq not current processor

2013-01-14 Thread Shawn Bohrer
__disable_runtime when we try reclaim runtime lent to other rt_rq but runtime has been lent to a rt_rq in another rd. Signed-off-by: Shawn Bohrer --- kernel/sched/rt.c |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c index 418feb0..4f02b28

Re: kernel BUG at kernel/sched_rt.c:493!

2013-01-10 Thread Shawn Bohrer
On Thu, Jan 10, 2013 at 05:13:11AM +0100, Mike Galbraith wrote: > On Tue, 2013-01-08 at 09:01 -0600, Shawn Bohrer wrote: > > On Tue, Jan 08, 2013 at 09:36:05AM -0500, Steven Rostedt wrote: > > > > > > > > I've also managed to reproduce this on 3.8.0-rc2

Re: kernel BUG at kernel/sched_rt.c:493!

2013-01-08 Thread Shawn Bohrer
On Tue, Jan 08, 2013 at 09:36:05AM -0500, Steven Rostedt wrote: > > > > I've also managed to reproduce this on 3.8.0-rc2 so it appears the bug > > is still present in the latest kernel. > > Shawn, > > Can you send me your .config file. I've attached the 3.8.0-rc2 config that I used to reproduce

Re: kernel BUG at kernel/sched_rt.c:493!

2013-01-07 Thread Shawn Bohrer
On Mon, Jan 07, 2013 at 11:58:18AM -0600, Shawn Bohrer wrote: > On Sat, Jan 05, 2013 at 11:46:32AM -0600, Shawn Bohrer wrote: > > I've tried reproducing the issue, but so far I've been unsuccessful > > but I believe that is because my RT tasks aren't using enough CP

Re: kernel BUG at kernel/sched_rt.c:493!

2013-01-07 Thread Shawn Bohrer
On Sat, Jan 05, 2013 at 11:46:32AM -0600, Shawn Bohrer wrote: > I've tried reproducing the issue, but so far I've been unsuccessful > but I believe that is because my RT tasks aren't using enough CPU > to cause borrowing from the other runqueues. Normally our RT tasks use

kernel BUG at kernel/sched_rt.c:493!

2013-01-05 Thread Shawn Bohrer
We recently managed to crash 10 of our test machines at the same time. Half of the machines were running a 3.1.9 kernel and half were running 3.4.9. I realize that these are both fairly old kernels but I've skimmed the list of fixes in the 3.4.* stable series and didn't see anything that appeared

Re: mlx4_en_alloc_frag allocation failures

2012-09-28 Thread Shawn Bohrer
On Fri, Sep 28, 2012 at 05:50:08PM +0200, Eric Dumazet wrote: > On Fri, 2012-09-28 at 10:14 -0500, Shawn Bohrer wrote: > > We've got a new application that is receiving UDP multicast data using > > AF_PACKET and writing out the packets in a custom format to disk. The > >

mlx4_en_alloc_frag allocation failures

2012-09-28 Thread Shawn Bohrer
We've got a new application that is receiving UDP multicast data using AF_PACKET and writing out the packets in a custom format to disk. The packet rates are bursty, but it seems to be roughly 100 Mbps on average for 1 minute periods. With this application running all day we get a lot of these me

Re: [PATCH 1/3] CodingStyle updates

2007-09-29 Thread Shawn Bohrer
On Fri, Sep 28, 2007 at 05:32:00PM -0400, Erez Zadok wrote: > 1. Updates chapter 13 (printing kernel messages) to expand on the use of >pr_debug()/pr_info(), what to avoid, and how to hook your debug code with >kernel.h. > > 2. New chapter 19, branch prediction optimizations, discusses the

Re: [PATCH][RFC] 4K stacks default, not a debug thing any more...?

2007-07-17 Thread Shawn Bohrer
On Tue, Jul 17, 2007 at 02:57:45AM +0200, Rene Herman wrote: > True enough. I'm rather wondering though why RHEL is shipping with it if > it's a _real_ problem. Scribbling junk all over kernel memory would be the > kind of thing I'd imagine you'd mightily piss-off enterprise customers with. >