[Kernel-packages] [Bug 1818141] [NEW] linux-azure - Add the same 4.15 InfiniBand configuration settings to the 4.18 kernel

2019-02-28 Thread David Coronel
Public bug reported:

The InfiniBand configuration settings from the 4.15 linux-azure kernel
are not in the 4.18 linux-azure kernel.

This is a request to apply the same InfiniBand settings to the 4.18
linux-azure kernel.

The following settings are only in the 4.15 linux-azure kernel (tested
with 4.15.0-1037 vs 4.18.0-1011):

CONFIG_INFINIBAND_IPOIB=y
CONFIG_INFINIBAND_IPOIB_DEBUG=y
CONFIG_INFINIBAND_USER_MAD=y
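One way to reproduce this comparison is to check the two installed kernel configs against each other. The shell function below is a minimal sketch and is not part of the original report. It assumes both configs are available as /boot/config-<release> files (the usual Ubuntu location). The function name config_only_in is hypothetical.

```shell
# Hypothetical helper: print the CONFIG_ lines set in config file $1
# that do not appear verbatim in config file $2 (missing, unset, or
# set to a different value there).
config_only_in() {
    # -F: fixed strings, -x: whole-line match, -v: keep non-matching lines,
    # -f: read the patterns (every line of $2) from a file.
    grep '^CONFIG_' "$1" | grep -Fxv -f "$2"
}
```

For example: config_only_in /boot/config-4.15.0-1037-azure /boot/config-4.18.0-1011-azure | grep INFINIBAND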

** Affects: linux-azure (Ubuntu)
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux-azure in Ubuntu.
https://bugs.launchpad.net/bugs/1818141

Title:
  linux-azure - Add the same 4.15 InfiniBand configuration settings to
  the 4.18 kernel

Status in linux-azure package in Ubuntu:
  New

Bug description:
  The InfiniBand configuration settings from the 4.15 linux-azure kernel
  are not in the 4.18 linux-azure kernel.

  This is a request to apply the same InfiniBand settings to the 4.18
  linux-azure kernel.

  The following settings are only in the 4.15 linux-azure kernel (tested
  with 4.15.0-1037 vs 4.18.0-1011):

  CONFIG_INFINIBAND_IPOIB=y
  CONFIG_INFINIBAND_IPOIB_DEBUG=y
  CONFIG_INFINIBAND_USER_MAD=y

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux-azure/+bug/1818141/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp


[Kernel-packages] [Bug 1818138] [NEW] Add CONFIG_NO_HZ_FULL=y to linux-azure kernels

2019-02-28 Thread David Coronel
Public bug reported:

This is a request to enable CONFIG_NO_HZ_FULL=y in the linux-azure
kernels (currently 4.15 and 4.18) in order to improve NVMe disk
performance.

The current CONFIG_NO_HZ configuration in linux-azure kernels is the
following (tested on 4.15.0-1037, 4.15.0-1039 and 4.18.0-1011):

CONFIG_NO_HZ=y
CONFIG_NO_HZ_COMMON=y
# CONFIG_NO_HZ_FULL is not set
CONFIG_NO_HZ_IDLE=y
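A single option can be checked the same way on any installed kernel. The function below is a minimal sketch. The name kconfig_get is hypothetical, and it assumes the config file path is passed explicitly, for example /boot/config-$(uname -r). It only reports how the kernel was built.

```shell
# Hypothetical helper: print the value of kernel option $1 in config
# file $2, or "not set" if the option has no NAME=value line.
kconfig_get() {
    # Strip the "NAME=" prefix from the matching line; if nothing was
    # printed, fall back to "not set".
    sed -n "s/^$1=//p" "$2" | grep . || echo "not set"
}
```

For example: kconfig_get CONFIG_NO_HZ_FULL "/boot/config-$(uname -r)"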

** Affects: linux-azure (Ubuntu)
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux-azure in Ubuntu.
https://bugs.launchpad.net/bugs/1818138

Title:
  Add CONFIG_NO_HZ_FULL=y to linux-azure kernels

Status in linux-azure package in Ubuntu:
  New

Bug description:
  This is a request to enable CONFIG_NO_HZ_FULL=y in the linux-azure
  kernels(currently 4.15 and 4.18) in order to increase NVME disks
  performance.

  The current CONFIG_NO_HZ configuration in linux-azure kernels is the
  following (tested on 4.15.0-1037, 4.15.0-1039 and 4.18.0-1011):

  CONFIG_NO_HZ=y
  CONFIG_NO_HZ_COMMON=y
  # CONFIG_NO_HZ_FULL is not set
  CONFIG_NO_HZ_IDLE=y

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux-azure/+bug/1818138/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp


[Kernel-packages] [Bug 1816106] [NEW] 4.15.0-1037 does not see all PCI devices on GPU VMs

2019-02-15 Thread David Coronel
Public bug reported:

Host changes have altered how the PCI GUID is presented to the guest,
and the PCI ID patches in 4.15.0-1037 do not properly handle the new
condition.

Impact:
Instances with multiple GPUs see only one of them.

Workaround:
4.15.0-1036 does not exhibit this behavior.

Additional info:

The responsible commit in 4.15.0-1037 is "PCI: hv: Make sure the bus
domain is really unique".

The immediate action requested is to back out this patch on 4.15.0 (azure):
https://git.launchpad.net/~canonical-kernel/ubuntu/+source/linux-azure/commit/?h=master-next=b9ae54076a78d01659c4d0f0a558cdb4056f0d13

The same commit on 4.18:
https://git.launchpad.net/~canonical-kernel/ubuntu/+source/linux-azure/commit/?h=azure-edge-next=29927dfb7f69bcf2ae7fd1cda10997e646a5189c
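A quick way to see the symptom on an instance is to count the GPUs the guest kernel exposes. The function below is a minimal sketch under stated assumptions. It assumes the GPUs are NVIDIA devices (PCI vendor ID 10de) with a display-class code (03xx). It also assumes the output comes from lspci -n. The name count_nvidia_gpus is hypothetical.

```shell
# Hypothetical helper: count NVIDIA display-class PCI functions in
# `lspci -n` output read from stdin. Assumes the GPUs are NVIDIA
# (vendor ID 10de) and report a PCI class of 03xx (display controller).
count_nvidia_gpus() {
    grep -cE ' 03[0-9a-f]{2}: 10de:'
}
```

For example: lspci -n | count_nvidia_gpus. Based on the report, this would be expected to print 1 on an affected multi-GPU instance running 4.15.0-1037. Under 4.15.0-1036 it should print the full GPU count.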

** Affects: linux-azure (Ubuntu)
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux-azure in Ubuntu.
https://bugs.launchpad.net/bugs/1816106

Title:
  4.15.0-1037 does not see all PCI devices on GPU VMs

Status in linux-azure package in Ubuntu:
  New

Bug description:
  Host changes have altered how the PCI GUID is presented to the guest
  and the patches for PCI IDs in 4.15.0-1037 do not properly handle the
  new condition.

  Impact:
  Instances with multiple GPUs are only seeing one.

  Workaround:
  4.15.0-1036 does not have this behavior.

  Additional info:

  The commit in 4.15.0-1037 responsible is " - PCI: hv: Make sure the
  bus domain is really unique"

  The immediate action requested is to back out this patch on 4.15.0 (azure):
  
https://git.launchpad.net/~canonical-kernel/ubuntu/+source/linux-azure/commit/?h=master-next=b9ae54076a78d01659c4d0f0a558cdb4056f0d13

  The same thing on 4.18:
  
https://git.launchpad.net/~canonical-kernel/ubuntu/+source/linux-azure/commit/?h=azure-edge-next=29927dfb7f69bcf2ae7fd1cda10997e646a5189c

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux-azure/+bug/1816106/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp


[Kernel-packages] [Bug 1802021] Re: [Hyper-V] srcu: Lock srcu_data structure in srcu_gp_start()

2019-02-14 Thread David Coronel
@lazamarius1: Just to clarify, the fix is scheduled to go into the 4.15
kernel in Bionic, which is the same kernel as the Xenial HWE kernel, so
there's no need to add anything to the Affects section. You will see a
new linux-hwe 4.15 kernel in xenial-proposed once it is ready to test.
Thanks!

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux-azure in Ubuntu.
https://bugs.launchpad.net/bugs/1802021

Title:
  [Hyper-V] srcu: Lock srcu_data structure in srcu_gp_start()

Status in linux package in Ubuntu:
  Confirmed
Status in linux-azure package in Ubuntu:
  Confirmed
Status in linux source package in Bionic:
  Confirmed
Status in linux-azure source package in Bionic:
  Fix Committed
Status in linux source package in Cosmic:
  Confirmed
Status in linux-azure source package in Cosmic:
  Fix Committed

Bug description:
  We had a customer seeing traces like the following:

  Stack trace from kern.log:
  2018-10-10T04:43:08.542464+00:00 hbp2ann-2 kernel: INFO: task kworker/u16:0:16678 blocked for more than 120 seconds.
  2018-10-10T04:43:08.542503+00:00 hbp2ann-2 kernel: Not tainted 4.15.0-1023-azure #24~16.04.1-Ubuntu
  2018-10-10T04:43:08.542513+00:00 hbp2ann-2 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  2018-10-10T04:43:08.547366+00:00 hbp2ann-2 kernel: kworker/u16:0 D 0 16678 2 0x8000
  2018-10-10T04:43:08.547386+00:00 hbp2ann-2 kernel: Workqueue: events_unbound fsnotify_mark_destroy_workfn
  2018-10-10T04:43:08.547395+00:00 hbp2ann-2 kernel: Call Trace:
  2018-10-10T04:43:08.547413+00:00 hbp2ann-2 kernel: __schedule+0x3d6/0x8b0
  2018-10-10T04:43:08.547422+00:00 hbp2ann-2 kernel: ? check_preempt_wakeup+0xfb/0x240
  2018-10-10T04:43:08.547431+00:00 hbp2ann-2 kernel: ? sched_clock_local+0x17/0x90
  2018-10-10T04:43:08.547440+00:00 hbp2ann-2 kernel: schedule+0x36/0x80
  2018-10-10T04:43:08.547448+00:00 hbp2ann-2 kernel: schedule_timeout+0x1db/0x370
  2018-10-10T04:43:08.547458+00:00 hbp2ann-2 kernel: ? __enqueue_entity+0x5c/0x60
  2018-10-10T04:43:08.547467+00:00 hbp2ann-2 kernel: ? enqueue_entity+0x112/0x670
  2018-10-10T04:43:08.547477+00:00 hbp2ann-2 kernel: wait_for_completion+0xb4/0x140
  2018-10-10T04:43:08.547486+00:00 hbp2ann-2 kernel: ? wake_up_q+0x70/0x70
  2018-10-10T04:43:08.547510+00:00 hbp2ann-2 kernel: __synchronize_srcu.part.13+0x85/0xb0
  2018-10-10T04:43:08.547535+00:00 hbp2ann-2 kernel: ? trace_raw_output_rcu_utilization+0x50/0x50
  2018-10-10T04:43:08.547560+00:00 hbp2ann-2 kernel: synchronize_srcu+0xd3/0xe0
  2018-10-10T04:43:08.547594+00:00 hbp2ann-2 kernel: ? synchronize_srcu+0xd3/0xe0
  2018-10-10T04:43:08.547604+00:00 hbp2ann-2 kernel: fsnotify_mark_destroy_workfn+0x7c/0xe0
  2018-10-10T04:43:08.547612+00:00 hbp2ann-2 kernel: process_one_work+0x14d/0x410
  2018-10-10T04:43:08.547620+00:00 hbp2ann-2 kernel: worker_thread+0x4b/0x460
  2018-10-10T04:43:08.547628+00:00 hbp2ann-2 kernel: kthread+0x105/0x140
  2018-10-10T04:43:08.547637+00:00 hbp2ann-2 kernel: ? process_one_work+0x410/0x410
  2018-10-10T04:43:08.547645+00:00 hbp2ann-2 kernel: ? kthread_destroy_worker+0x50/0x50
  2018-10-10T04:43:08.547654+00:00 hbp2ann-2 kernel: ? do_syscall_64+0x73/0x130
  2018-10-10T04:43:08.547677+00:00 hbp2ann-2 kernel: ? SyS_exit_group+0x14/0x20
  2018-10-10T04:43:08.547685+00:00 hbp2ann-2 kernel: ret_from_fork+0x35/0x40

  Error Code: INFO: task kworker/u16:0:16678 blocked for more than 120
  seconds.

  We are seeing more issues with fsnotify-related callbacks. These are
  not a soft/hard lockup but seem to significantly degrade the
  responsiveness of systemd (and from there everything else).

  The following upstream commit may fix this issue, but it is in Paul's
  RCU tree and not in linux-next or upstream yet:

  https://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git/commit/?h=dev=1a05c0cd2fee234a10362cc8f66057557cbb291f

  srcu: Lock srcu_data structure in srcu_gp_start()
  The srcu_gp_start() function is called with the srcu_struct structure's
  ->lock held, but not with the srcu_data structure's ->lock.  This is
  problematic because this function accesses and updates the srcu_data
  structure's ->srcu_cblist, which is protected by that lock.  Failing to
  hold this lock can result in corruption of the SRCU callback lists,
  which in turn can result in arbitrarily bad results.

  This commit therefore makes srcu_gp_start() acquire the srcu_data
  structure's ->lock across the calls to rcu_segcblist_advance() and
  rcu_segcblist_accelerate(), thus preventing this corruption.

  Please investigate this issue and evaluate the proposed fix.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1802021/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : 

[Kernel-packages] [Bug 1813211] Re: Allow I/O schedulers to be loaded with modprobe in linux-azure

2019-02-14 Thread David Coronel
[VERIFICATION BIONIC]

I can modprobe the mq-deadline, kyber and bfq schedulers, but loading
the cfq and deadline schedulers with modprobe does not add them to the
list of available schedulers:

# uname -r
4.18.0-1009-azure

# modprobe bfq
# modprobe cfq-iosched
# modprobe mq-deadline
# modprobe kyber-iosched
# modprobe deadline-iosched

# tail /sys/block/sda/queue/scheduler
[none] mq-deadline kyber bfq 
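The scheduler file lists every available scheduler and marks the active one in brackets. That format can be checked from a script instead of by eye. The helpers below are a minimal sketch. The names active_scheduler and has_scheduler are hypothetical, and they take the file's contents as an argument rather than reading /sys directly.

```shell
# Hypothetical helpers for reading a /sys/block/<dev>/queue/scheduler line,
# e.g. "[none] mq-deadline kyber bfq", where the bracketed name is active.

# Print the active (bracketed) scheduler.
active_scheduler() {
    printf '%s\n' "$1" | tr ' ' '\n' | sed -n 's/^\[\(.*\)]$/\1/p'
}

# Succeed if scheduler $2 is listed in line $1, whether active or not.
has_scheduler() {
    printf '%s\n' "$1" | tr -d '[]' | tr ' ' '\n' | grep -qx -- "$2"
}
```

For example: has_scheduler "$(cat /sys/block/sda/queue/scheduler)" deadline || echo "deadline is not available"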

I have these 4.18 kernel packages installed:

# dpkg -l | grep 4.18 | grep linux
ii  linux-azure 4.18.0.1009.9
ii  linux-azure-cloud-tools-4.18.0-1009 4.18.0-1009.9~18.04.1
ii  linux-azure-headers-4.18.0-1009 4.18.0-1009.9~18.04.1
ii  linux-azure-tools-4.18.0-1009   4.18.0-1009.9~18.04.1
ii  linux-cloud-tools-4.18.0-1009-azure 4.18.0-1009.9~18.04.1
ii  linux-cloud-tools-azure 4.18.0.1009.9
ii  linux-headers-4.18.0-1009-azure 4.18.0-1009.9~18.04.1
ii  linux-headers-azure 4.18.0.1009.9
ii  linux-image-4.18.0-1009-azure   4.18.0-1009.9~18.04.1
ii  linux-image-azure   4.18.0.1009.9
ii  linux-modules-4.18.0-1009-azure 4.18.0-1009.9~18.04.1
ii  linux-tools-4.18.0-1009-azure   4.18.0-1009.9~18.04.1
ii  linux-tools-azure   4.18.0.1009.9

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux-azure in Ubuntu.
https://bugs.launchpad.net/bugs/1813211

Title:
  Allow I/O schedulers to be loaded with modprobe in linux-azure

Status in linux-azure package in Ubuntu:
  Invalid
Status in linux-azure source package in Bionic:
  Fix Committed
Status in linux-azure source package in Cosmic:
  Fix Committed

Bug description:
  There was a previous request in bug
  https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1671203 to limit
  the IO scheduler to NOOP in linux-azure. The other schedulers were
  turned off and NOOP was made the default.

  However with new upstream releases, new schedulers where added and
  linux-azure inherited the default config value for them from the
  master kernel. That's why there are some IO schedulers built as
  modules in the extra package (mq-deadline and kyber-iosched). In order
  to use those two modules, one needs to install the -extra package and
  modprobe the modules first:

  # tail /sys/block/sd*/queue/scheduler
  ==> /sys/block/sda/queue/scheduler <==
  [none] 

  ==> /sys/block/sdb/queue/scheduler <==
  [none] 

  # apt install linux-modules-extra-4.15.0-1036-azure

  # modprobe kyber-iosched
  # tail /sys/block/sd*/queue/scheduler
  ==> /sys/block/sda/queue/scheduler <==
  [none] kyber 

  ==> /sys/block/sdb/queue/scheduler <==
  [none] kyber 

  # modprobe mq-deadline
  # tail /sys/block/sd*/queue/scheduler
  ==> /sys/block/sda/queue/scheduler <==
  [none] kyber mq-deadline 

  ==> /sys/block/sdb/queue/scheduler <==
  [none] kyber mq-deadline 

  
  The schedulers cfq and deadline have been completely disabled in LP bug 
#1671203 so they cannot be added with modprobe.

  This is a request to move back the cfq and deadline schedulers along
  with the mq-deadline and kyber-iosched schedulers to the main linux-
  azure package (in other words, not in the -extra package) and allow
  the users to be able to modprobe the schedulers when needed.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux-azure/+bug/1813211/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp


[Kernel-packages] [Bug 1802021] Re: [Hyper-V] srcu: Lock srcu_data structure in srcu_gp_start()

2019-02-04 Thread David Coronel
You can clone the ubuntu-xenial kernel:

git clone git://kernel.ubuntu.com/ubuntu/ubuntu-xenial.git

Then search the log for the commit you're looking for. There are a few
ways to do it; I use:

git log --oneline | grep "Expose SMT control init function"
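The same lookup can be wrapped in a small function. This is a sketch only, and the name has_commit is hypothetical. Unlike piping to grep, git's --fixed-strings --grep options match the text literally against the whole commit message, not only the one-line summary.

```shell
# Hypothetical helper: succeed if any commit message in the git clone
# at $1 contains the literal text $2.
has_commit() {
    # --fixed-strings makes --grep treat $2 as plain text, not a regex.
    [ -n "$(git -C "$1" log --oneline --fixed-strings --grep="$2")" ]
}
```

For example: has_commit ubuntu-xenial "Expose SMT control init function" && echo "commit found"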

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux-azure in Ubuntu.
https://bugs.launchpad.net/bugs/1802021

Title:
  [Hyper-V] srcu: Lock srcu_data structure in srcu_gp_start()

Status in linux-azure package in Ubuntu:
  Confirmed
Status in linux-hwe package in Ubuntu:
  Confirmed
Status in linux-azure source package in Bionic:
  In Progress
Status in linux-hwe source package in Bionic:
  Confirmed
Status in linux-azure source package in Cosmic:
  In Progress
Status in linux-hwe source package in Cosmic:
  Confirmed

Bug description:
  We had a customer seeing traces like the following:

  tack trace from kern.log:
  2018-10-10T04:43:08.542464+00:00 hbp2ann-2 kernel: INFO: task 
kworker/u16:0:16678 blocked for more than 120 seconds.
  2018-10-10T04:43:08.542503+00:00 hbp2ann-2 kernel: Not tainted 
4.15.0-1023-azure #24~16.04.1-Ubuntu
  2018-10-10T04:43:08.542513+00:00 hbp2ann-2 kernel: "echo 0 > 
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
  2018-10-10T04:43:08.547366+00:00 hbp2ann-2 kernel: kworker/u16:0 D 0 16678 2 
0x8000
  2018-10-10T04:43:08.547386+00:00 hbp2ann-2 kernel: Workqueue: events_unbound 
fsnotify_mark_destroy_workfn
  2018-10-10T04:43:08.547395+00:00 hbp2ann-2 kernel: Call Trace:
  2018-10-10T04:43:08.547413+00:00 hbp2ann-2 kernel: __schedule+0x3d6/0x8b0
  2018-10-10T04:43:08.547422+00:00 hbp2ann-2 kernel: ? 
check_preempt_wakeup+0xfb/0x240
  2018-10-10T04:43:08.547431+00:00 hbp2ann-2 kernel: ? 
sched_clock_local+0x17/0x90
  2018-10-10T04:43:08.547440+00:00 hbp2ann-2 kernel: schedule+0x36/0x80
  2018-10-10T04:43:08.547448+00:00 hbp2ann-2 kernel: 
schedule_timeout+0x1db/0x370
  2018-10-10T04:43:08.547458+00:00 hbp2ann-2 kernel: ? 
__enqueue_entity+0x5c/0x60
  2018-10-10T04:43:08.547467+00:00 hbp2ann-2 kernel: ? 
enqueue_entity+0x112/0x670
  2018-10-10T04:43:08.547477+00:00 hbp2ann-2 kernel: 
wait_for_completion+0xb4/0x140
  2018-10-10T04:43:08.547486+00:00 hbp2ann-2 kernel: ? wake_up_q+0x70/0x70
  2018-10-10T04:43:08.547510+00:00 hbp2ann-2 kernel: 
__synchronize_srcu.part.13+0x85/0xb0
  2018-10-10T04:43:08.547535+00:00 hbp2ann-2 kernel: ? 
trace_raw_output_rcu_utilization+0x50/0x50
  2018-10-10T04:43:08.547560+00:00 hbp2ann-2 kernel: synchronize_srcu+0xd3/0xe0
  2018-10-10T04:43:08.547594+00:00 hbp2ann-2 kernel: ? 
synchronize_srcu+0xd3/0xe0
  2018-10-10T04:43:08.547604+00:00 hbp2ann-2 kernel: 
fsnotify_mark_destroy_workfn+0x7c/0xe0
  2018-10-10T04:43:08.547612+00:00 hbp2ann-2 kernel: 
process_one_work+0x14d/0x410
  2018-10-10T04:43:08.547620+00:00 hbp2ann-2 kernel: worker_thread+0x4b/0x460
  2018-10-10T04:43:08.547628+00:00 hbp2ann-2 kernel: kthread+0x105/0x140
  2018-10-10T04:43:08.547637+00:00 hbp2ann-2 kernel: ? 
process_one_work+0x410/0x410
  2018-10-10T04:43:08.547645+00:00 hbp2ann-2 kernel: ? 
kthread_destroy_worker+0x50/0x50
  2018-10-10T04:43:08.547654+00:00 hbp2ann-2 kernel: ? do_syscall_64+0x73/0x130
  2018-10-10T04:43:08.547677+00:00 hbp2ann-2 kernel: ? SyS_exit_group+0x14/0x20
  2018-10-10T04:43:08.547685+00:00 hbp2ann-2 kernel: ret_from_fork+0x35/0x40

  Error Code: INFO: task kworker/u16:0:16678 blocked for more than 120
  seconds.

  We are seeing more issue with fsnotify related callbacks. These are
  not a soft/hard lockup but seem to significantly degrade the
  responsiveness of systemd (and from there everything else).

  The following upstream commit may fix this issue, but it is in Paul's
  RCU tree and not in linux-next or upstream yet:

  https://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-
  rcu.git/commit/?h=dev=1a05c0cd2fee234a10362cc8f66057557cbb291f

  srcu: Lock srcu_data structure in srcu_gp_start()
  The srcu_gp_start() function is called with the srcu_struct structure's
  ->lock held, but not with the srcu_data structure's ->lock.  This is
  problematic because this function accesses and updates the srcu_data
  structure's ->srcu_cblist, which is protected by that lock.  Failing to
  hold this lock can result in corruption of the SRCU callback lists,
  which in turn can result in arbitrarily bad results.

  This commit therefore makes srcu_gp_start() acquire the srcu_data
  structure's ->lock across the calls to rcu_segcblist_advance() and
  rcu_segcblist_accelerate(), thus preventing this corruption.

  Please investigate this issue and evaluate the proposed fix.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux-azure/+bug/1802021/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp


[Kernel-packages] [Bug 1802021] Re: [Hyper-V] srcu: Lock srcu_data structure in srcu_gp_start()

2019-02-04 Thread David Coronel
Hi overlord. An easy way to check the updates included in released
kernels is to look at the "-changes" mailing list for your Ubuntu
release.

In this situation it would be https://lists.ubuntu.com/archives/xenial-changes/

And you can find that new kernel here:

https://lists.ubuntu.com/archives/xenial-changes/2019-February/023480.html

As you can see, this 4.15.0-45 kernel does not include the fix for LP
bug #1802021.
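Ubuntu changelogs reference Launchpad bugs as "LP: #<number>". Checking an announcement for a given bug can therefore be scripted. The function below is a minimal sketch. The name mentions_lp_bug is hypothetical, and it simply searches the text it is given for "#<number>". It does not parse the changelog.

```shell
# Hypothetical helper: read a changelog or "-changes" announcement on
# stdin and succeed if it references Launchpad bug $1 (written as
# "LP: #<number>" in Ubuntu changelogs).
mentions_lp_bug() {
    # Require a non-digit (or end of line) after the number so that
    # bug 1802021 does not match #18020210.
    grep -Eq "#$1([^0-9]|\$)"
}
```

For example: curl -s https://lists.ubuntu.com/archives/xenial-changes/2019-February/023480.html | mentions_lp_bug 1802021 || echo "fix not included"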

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux-azure in Ubuntu.
https://bugs.launchpad.net/bugs/1802021

Title:
  [Hyper-V] srcu: Lock srcu_data structure in srcu_gp_start()

Status in linux-azure package in Ubuntu:
  Confirmed
Status in linux-hwe package in Ubuntu:
  Confirmed
Status in linux-azure source package in Bionic:
  In Progress
Status in linux-hwe source package in Bionic:
  Confirmed
Status in linux-azure source package in Cosmic:
  In Progress
Status in linux-hwe source package in Cosmic:
  Confirmed

Bug description:
  We had a customer seeing traces like the following:

  tack trace from kern.log:
  2018-10-10T04:43:08.542464+00:00 hbp2ann-2 kernel: INFO: task 
kworker/u16:0:16678 blocked for more than 120 seconds.
  2018-10-10T04:43:08.542503+00:00 hbp2ann-2 kernel: Not tainted 
4.15.0-1023-azure #24~16.04.1-Ubuntu
  2018-10-10T04:43:08.542513+00:00 hbp2ann-2 kernel: "echo 0 > 
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
  2018-10-10T04:43:08.547366+00:00 hbp2ann-2 kernel: kworker/u16:0 D 0 16678 2 
0x8000
  2018-10-10T04:43:08.547386+00:00 hbp2ann-2 kernel: Workqueue: events_unbound 
fsnotify_mark_destroy_workfn
  2018-10-10T04:43:08.547395+00:00 hbp2ann-2 kernel: Call Trace:
  2018-10-10T04:43:08.547413+00:00 hbp2ann-2 kernel: __schedule+0x3d6/0x8b0
  2018-10-10T04:43:08.547422+00:00 hbp2ann-2 kernel: ? 
check_preempt_wakeup+0xfb/0x240
  2018-10-10T04:43:08.547431+00:00 hbp2ann-2 kernel: ? 
sched_clock_local+0x17/0x90
  2018-10-10T04:43:08.547440+00:00 hbp2ann-2 kernel: schedule+0x36/0x80
  2018-10-10T04:43:08.547448+00:00 hbp2ann-2 kernel: 
schedule_timeout+0x1db/0x370
  2018-10-10T04:43:08.547458+00:00 hbp2ann-2 kernel: ? 
__enqueue_entity+0x5c/0x60
  2018-10-10T04:43:08.547467+00:00 hbp2ann-2 kernel: ? 
enqueue_entity+0x112/0x670
  2018-10-10T04:43:08.547477+00:00 hbp2ann-2 kernel: 
wait_for_completion+0xb4/0x140
  2018-10-10T04:43:08.547486+00:00 hbp2ann-2 kernel: ? wake_up_q+0x70/0x70
  2018-10-10T04:43:08.547510+00:00 hbp2ann-2 kernel: 
__synchronize_srcu.part.13+0x85/0xb0
  2018-10-10T04:43:08.547535+00:00 hbp2ann-2 kernel: ? 
trace_raw_output_rcu_utilization+0x50/0x50
  2018-10-10T04:43:08.547560+00:00 hbp2ann-2 kernel: synchronize_srcu+0xd3/0xe0
  2018-10-10T04:43:08.547594+00:00 hbp2ann-2 kernel: ? 
synchronize_srcu+0xd3/0xe0
  2018-10-10T04:43:08.547604+00:00 hbp2ann-2 kernel: 
fsnotify_mark_destroy_workfn+0x7c/0xe0
  2018-10-10T04:43:08.547612+00:00 hbp2ann-2 kernel: 
process_one_work+0x14d/0x410
  2018-10-10T04:43:08.547620+00:00 hbp2ann-2 kernel: worker_thread+0x4b/0x460
  2018-10-10T04:43:08.547628+00:00 hbp2ann-2 kernel: kthread+0x105/0x140
  2018-10-10T04:43:08.547637+00:00 hbp2ann-2 kernel: ? 
process_one_work+0x410/0x410
  2018-10-10T04:43:08.547645+00:00 hbp2ann-2 kernel: ? 
kthread_destroy_worker+0x50/0x50
  2018-10-10T04:43:08.547654+00:00 hbp2ann-2 kernel: ? do_syscall_64+0x73/0x130
  2018-10-10T04:43:08.547677+00:00 hbp2ann-2 kernel: ? SyS_exit_group+0x14/0x20
  2018-10-10T04:43:08.547685+00:00 hbp2ann-2 kernel: ret_from_fork+0x35/0x40

  Error Code: INFO: task kworker/u16:0:16678 blocked for more than 120
  seconds.

  We are seeing more issue with fsnotify related callbacks. These are
  not a soft/hard lockup but seem to significantly degrade the
  responsiveness of systemd (and from there everything else).

  The following upstream commit may fix this issue, but it is in Paul's
  RCU tree and not in linux-next or upstream yet:

  https://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-
  rcu.git/commit/?h=dev=1a05c0cd2fee234a10362cc8f66057557cbb291f

  srcu: Lock srcu_data structure in srcu_gp_start()
  The srcu_gp_start() function is called with the srcu_struct structure's
  ->lock held, but not with the srcu_data structure's ->lock.  This is
  problematic because this function accesses and updates the srcu_data
  structure's ->srcu_cblist, which is protected by that lock.  Failing to
  hold this lock can result in corruption of the SRCU callback lists,
  which in turn can result in arbitrarily bad results.

  This commit therefore makes srcu_gp_start() acquire the srcu_data
  structure's ->lock across the calls to rcu_segcblist_advance() and
  rcu_segcblist_accelerate(), thus preventing this corruption.

  Please investigate this issue and evaluate the proposed fix.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux-azure/+bug/1802021/+subscriptions

-- 
Mailing list: 

[Kernel-packages] [Bug 1802021] Re: [Hyper-V] srcu: Lock srcu_data structure in srcu_gp_start()

2019-02-04 Thread David Coronel
@overlord: AFAIK, there is no simple reproducer test case for this
issue. The ideal way to test a bug and fix like this one would be for
each user who reported the issue to run a test kernel with the fix in
their affected environment and, after some time has passed, report back
whether the issue still manifests.
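While a test kernel runs, one way to watch for recurrences is to count hung-task reports like the one in the bug description. The function below is a minimal sketch. The name count_hung_tasks is hypothetical, and it assumes the messages keep the "INFO: task ... blocked for more than N seconds" wording. A count that stays at zero is consistent with the fix working, but does not prove it.

```shell
# Hypothetical helper: count hung-task reports such as
# "INFO: task kworker/u16:0:16678 blocked for more than 120 seconds."
# in kernel log text read from stdin (for example /var/log/kern.log).
count_hung_tasks() {
    grep -cE 'INFO: task [^ ]+ blocked for more than [0-9]+ seconds'
}
```

For example: count_hung_tasks < /var/log/kern.log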

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux-azure in Ubuntu.
https://bugs.launchpad.net/bugs/1802021

Title:
  [Hyper-V] srcu: Lock srcu_data structure in srcu_gp_start()

Status in linux-azure package in Ubuntu:
  Confirmed
Status in linux-hwe package in Ubuntu:
  Confirmed
Status in linux-azure source package in Bionic:
  In Progress
Status in linux-hwe source package in Bionic:
  Confirmed
Status in linux-azure source package in Cosmic:
  In Progress
Status in linux-hwe source package in Cosmic:
  Confirmed

Bug description:
  We had a customer seeing traces like the following:

  tack trace from kern.log:
  2018-10-10T04:43:08.542464+00:00 hbp2ann-2 kernel: INFO: task 
kworker/u16:0:16678 blocked for more than 120 seconds.
  2018-10-10T04:43:08.542503+00:00 hbp2ann-2 kernel: Not tainted 
4.15.0-1023-azure #24~16.04.1-Ubuntu
  2018-10-10T04:43:08.542513+00:00 hbp2ann-2 kernel: "echo 0 > 
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
  2018-10-10T04:43:08.547366+00:00 hbp2ann-2 kernel: kworker/u16:0 D 0 16678 2 
0x8000
  2018-10-10T04:43:08.547386+00:00 hbp2ann-2 kernel: Workqueue: events_unbound 
fsnotify_mark_destroy_workfn
  2018-10-10T04:43:08.547395+00:00 hbp2ann-2 kernel: Call Trace:
  2018-10-10T04:43:08.547413+00:00 hbp2ann-2 kernel: __schedule+0x3d6/0x8b0
  2018-10-10T04:43:08.547422+00:00 hbp2ann-2 kernel: ? 
check_preempt_wakeup+0xfb/0x240
  2018-10-10T04:43:08.547431+00:00 hbp2ann-2 kernel: ? 
sched_clock_local+0x17/0x90
  2018-10-10T04:43:08.547440+00:00 hbp2ann-2 kernel: schedule+0x36/0x80
  2018-10-10T04:43:08.547448+00:00 hbp2ann-2 kernel: 
schedule_timeout+0x1db/0x370
  2018-10-10T04:43:08.547458+00:00 hbp2ann-2 kernel: ? 
__enqueue_entity+0x5c/0x60
  2018-10-10T04:43:08.547467+00:00 hbp2ann-2 kernel: ? 
enqueue_entity+0x112/0x670
  2018-10-10T04:43:08.547477+00:00 hbp2ann-2 kernel: 
wait_for_completion+0xb4/0x140
  2018-10-10T04:43:08.547486+00:00 hbp2ann-2 kernel: ? wake_up_q+0x70/0x70
  2018-10-10T04:43:08.547510+00:00 hbp2ann-2 kernel: 
__synchronize_srcu.part.13+0x85/0xb0
  2018-10-10T04:43:08.547535+00:00 hbp2ann-2 kernel: ? 
trace_raw_output_rcu_utilization+0x50/0x50
  2018-10-10T04:43:08.547560+00:00 hbp2ann-2 kernel: synchronize_srcu+0xd3/0xe0
  2018-10-10T04:43:08.547594+00:00 hbp2ann-2 kernel: ? 
synchronize_srcu+0xd3/0xe0
  2018-10-10T04:43:08.547604+00:00 hbp2ann-2 kernel: 
fsnotify_mark_destroy_workfn+0x7c/0xe0
  2018-10-10T04:43:08.547612+00:00 hbp2ann-2 kernel: 
process_one_work+0x14d/0x410
  2018-10-10T04:43:08.547620+00:00 hbp2ann-2 kernel: worker_thread+0x4b/0x460
  2018-10-10T04:43:08.547628+00:00 hbp2ann-2 kernel: kthread+0x105/0x140
  2018-10-10T04:43:08.547637+00:00 hbp2ann-2 kernel: ? 
process_one_work+0x410/0x410
  2018-10-10T04:43:08.547645+00:00 hbp2ann-2 kernel: ? 
kthread_destroy_worker+0x50/0x50
  2018-10-10T04:43:08.547654+00:00 hbp2ann-2 kernel: ? do_syscall_64+0x73/0x130
  2018-10-10T04:43:08.547677+00:00 hbp2ann-2 kernel: ? SyS_exit_group+0x14/0x20
  2018-10-10T04:43:08.547685+00:00 hbp2ann-2 kernel: ret_from_fork+0x35/0x40

  Error Code: INFO: task kworker/u16:0:16678 blocked for more than 120
  seconds.

  We are seeing more issue with fsnotify related callbacks. These are
  not a soft/hard lockup but seem to significantly degrade the
  responsiveness of systemd (and from there everything else).

  The following upstream commit may fix this issue, but it is in Paul's
  RCU tree and not in linux-next or upstream yet:

  https://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-
  rcu.git/commit/?h=dev=1a05c0cd2fee234a10362cc8f66057557cbb291f

  srcu: Lock srcu_data structure in srcu_gp_start()
  The srcu_gp_start() function is called with the srcu_struct structure's
  ->lock held, but not with the srcu_data structure's ->lock.  This is
  problematic because this function accesses and updates the srcu_data
  structure's ->srcu_cblist, which is protected by that lock.  Failing to
  hold this lock can result in corruption of the SRCU callback lists,
  which in turn can result in arbitrarily bad results.

  This commit therefore makes srcu_gp_start() acquire the srcu_data
  structure's ->lock across the calls to rcu_segcblist_advance() and
  rcu_segcblist_accelerate(), thus preventing this corruption.

  Please investigate this issue and evaluate the proposed fix.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux-azure/+bug/1802021/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to : kernel-packages@lists.launchpad.net

[Kernel-packages] [Bug 1813211] [NEW] Allow I/O schedulers to be loaded with modprobe in linux-azure

2019-01-24 Thread David Coronel
Public bug reported:

There was a previous request in bug
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1671203 to limit
the IO scheduler to NOOP in linux-azure. The other schedulers were
turned off and NOOP was made the default.

However, with new upstream releases, new schedulers were added and
linux-azure inherited their default config values from the master
kernel. That's why some I/O schedulers (mq-deadline and kyber-iosched)
are built as modules in the -extra package. To use those two modules,
one needs to install the -extra package and modprobe the modules first:

# tail /sys/block/sd*/queue/scheduler
==> /sys/block/sda/queue/scheduler <==
[none] 

==> /sys/block/sdb/queue/scheduler <==
[none] 

# apt install linux-modules-extra-4.15.0-1036-azure

# modprobe kyber-iosched
# tail /sys/block/sd*/queue/scheduler
==> /sys/block/sda/queue/scheduler <==
[none] kyber 

==> /sys/block/sdb/queue/scheduler <==
[none] kyber 

# modprobe mq-deadline
# tail /sys/block/sd*/queue/scheduler
==> /sys/block/sda/queue/scheduler <==
[none] kyber mq-deadline 

==> /sys/block/sdb/queue/scheduler <==
[none] kyber mq-deadline 
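The -extra package name in the steps above follows the running kernel release. It can therefore be derived instead of typed by hand. The function below is a minimal sketch. The name extra_modules_pkg is hypothetical, and it assumes the linux-modules-extra-<release> naming shown in the apt command.

```shell
# Hypothetical helper: print the name of the package that ships the
# "extra" kernel modules for kernel release $1 (default: the running
# kernel), following the linux-modules-extra-<release> naming.
extra_modules_pkg() {
    printf 'linux-modules-extra-%s\n' "${1:-$(uname -r)}"
}
```

For example: apt install "$(extra_modules_pkg)" && modprobe kyber-iosched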


The cfq and deadline schedulers were completely disabled in LP bug
#1671203, so they cannot be added with modprobe.

This is a request to bring back the cfq and deadline schedulers and to
ship them, together with the mq-deadline and kyber-iosched schedulers,
in the main linux-azure package (in other words, not in the -extra
package), so that users can modprobe the schedulers when needed.

** Affects: linux-azure (Ubuntu)
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux-azure in Ubuntu.
https://bugs.launchpad.net/bugs/1813211

Title:
  Allow I/O schedulers to be loaded with modprobe in linux-azure

Status in linux-azure package in Ubuntu:
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux-azure/+bug/1813211/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp


[Kernel-packages] [Bug 1794477] Re: Accelerated networking (SR-IOV VF) broken in 18.10 daily

2018-11-22 Thread David Coronel
Hi Marcelo,

Maybe I'm doing something wrong, but I still don't see the Mellanox
devices with this 4.18.0-1005 kernel from the CKT PPA:

ubuntu@lp1794477:~$ sudo add-apt-repository ppa:canonical-kernel-team/ppa
ubuntu@lp1794477:~$ sudo apt install linux-azure-edge
ubuntu@lp1794477:~$ sudo update-grub
ubuntu@lp1794477:~$ sudo reboot

ubuntu@lp1794477:~$ uname -a
Linux lp1794477 4.18.0-1005-azure #5~18.04.1-Ubuntu SMP Thu Nov 22 00:01:08 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

ubuntu@lp1794477:~$ dmesg | grep -i mella
ubuntu@lp1794477:~$ lspci | grep -i mell
ubuntu@lp1794477:~$ lsmod | grep -i mlx

ubuntu@lp1794477:~$ apt-cache policy linux-image-4.18.0-1005-azure
linux-image-4.18.0-1005-azure:
  Installed: 4.18.0-1005.5~18.04.1
  Candidate: 4.18.0-1005.5~18.04.1
  Version table:
 *** 4.18.0-1005.5~18.04.1 500
        500 http://ppa.launchpad.net/canonical-kernel-team/ppa/ubuntu bionic/main amd64 Packages
        100 /var/lib/dpkg/status
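
The bug description separates a module that is available on disk (modinfo finds it) from one that is actually loaded (lsmod lists it). A minimal bash sketch of that two-step check, reading /proc/modules directly, which is the file lsmod formats; the helper names and the optional file argument are hypothetical:

```bash
#!/usr/bin/env bash
# module_available NAME: true if modinfo can find the module on disk.
module_available() {
  modinfo "$1" > /dev/null 2>&1
}

# module_loaded NAME [MODULES_FILE]: true if NAME is currently loaded.
# /proc/modules uses underscores, so dashes in NAME are normalised first.
module_loaded() {
  local name=${1//-/_} modules=${2:-/proc/modules}
  awk -v m="$name" '$1 == m { found = 1 } END { exit !found }' "$modules"
}

# Usage: classify each mlx4 module.
#   for m in mlx4_core mlx4_en mlx4_ib; do
#     if module_loaded "$m"; then
#       echo "$m: loaded"
#     elif module_available "$m"; then
#       echo "$m: available, not loaded"
#     else
#       echo "$m: missing"
#     fi
#   done
```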

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux-azure in Ubuntu.
https://bugs.launchpad.net/bugs/1794477

Title:
  Accelerated networking (SR-IOV VF) broken in 18.10 daily

Status in linux package in Ubuntu:
  Fix Committed
Status in linux-azure package in Ubuntu:
  Fix Committed
Status in linux source package in Cosmic:
  Fix Committed

Bug description:
  While testing the Ubuntu 18.10 daily image from the cloud-images repo on
  Azure, we discovered that accelerated networking wasn’t working inside
  the VM. No VF shows up inside the VM, and lspci didn’t show any Mellanox
  drivers in use. We also tested the daily build on Hyper-V, and there the
  Mellanox VF is functional with the same mlx4 drivers.

  To give more details about this:
  • No Mellanox logs are showing up in dmesg or syslog.
  • modinfo mlx4_core/mlx4_en finds the modules, but lsmod doesn’t show
    them as loaded, although Accelerated Networking is enabled for the
    Azure VM, so loading should happen transparently.
  • modprobe -r mlx4_core && modprobe mlx4_core returns exit code 0, but
    nothing really happens, and no Mellanox messages are logged in
    dmesg/syslog.
  • There are no log entries about the drivers or netvsc/pci-hyperv that
    might relate to this issue.

  Kernel: 4.18.0-7-generic

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1794477/+subscriptions


[Kernel-packages] [Bug 1794477] Re: Accelerated networking (SR-IOV VF) broken in 18.10 daily

2018-11-21 Thread David Coronel
bionic-tree/4.18.0-1004 with all the packages:

$ sudo dpkg -i \
  linux-modules-4.18.0-1004-azure_4.18.0-1004.4~18.04.1LP1794477_amd64.deb \
  linux-modules-extra-4.18.0-1004-azure_4.18.0-1004.4~18.04.1LP1794477_amd64.deb \
  linux-image-unsigned-4.18.0-1004-azure_4.18.0-1004.4~18.04.1LP1794477_amd64.deb \
  linux-azure-headers-4.18.0-1004_4.18.0-1004.4~18.04.1LP1794477_all.deb \
  linux-headers-4.18.0-1004-azure_4.18.0-1004.4~18.04.1LP1794477_amd64.deb \
  linux-azure-edge-tools-4.18.0-1004_4.18.0-1004.4~18.04.1LP1794477_amd64.deb \
  linux-tools-4.18.0-1004-azure_4.18.0-1004.4~18.04.1LP1794477_amd64.deb \
  linux-azure-edge-cloud-tools-4.18.0-1004_4.18.0-1004.4~18.04.1LP1794477_amd64.deb \
  linux-cloud-tools-4.18.0-1004-azure_4.18.0-1004.4~18.04.1LP1794477_amd64.deb

$ dpkg -l | grep 4.18.0
ii  linux-azure-edge-cloud-tools-4.18.0-1004  4.18.0-1004.4~18.04.1LP1794477  amd64
ii  linux-azure-edge-tools-4.18.0-1004        4.18.0-1004.4~18.04.1LP1794477  amd64
ii  linux-azure-headers-4.18.0-1004           4.18.0-1004.4~18.04.1LP1794477  all
ii  linux-cloud-tools-4.18.0-1004-azure       4.18.0-1004.4~18.04.1LP1794477  amd64
ii  linux-headers-4.18.0-1004-azure           4.18.0-1004.4~18.04.1LP1794477  amd64
ii  linux-image-unsigned-4.18.0-1004-azure    4.18.0-1004.4~18.04.1LP1794477  amd64
ii  linux-modules-4.18.0-1004-azure           4.18.0-1004.4~18.04.1LP1794477  amd64
ii  linux-modules-extra-4.18.0-1004-azure     4.18.0-1004.4~18.04.1LP1794477  amd64
ii  linux-tools-4.18.0-1004-azure             4.18.0-1004.4~18.04.1LP1794477  amd64
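
All nine packages above are in the ii state (desired: install, status: installed). After a partially failed dpkg -i, such as the 4.17 attempt in this thread, packages left unconfigured show a different state (for example iU, unpacked). A minimal bash sketch, with a hypothetical helper name, that flags any listed package not in the ii state:

```bash
#!/usr/bin/env bash
# not_fully_installed: read `dpkg -l` lines on stdin, print the name of every
# package whose state is anything other than "ii", and fail if there was one.
# Header lines are skipped because they do not start with a desired-action
# letter (u/i/h/r/p) followed by a status letter.
not_fully_installed() {
  awk '$1 ~ /^[uihrp][ncHUFWti]/ && $1 != "ii" { print $2; bad = 1 }
       END { exit bad }'
}

# Usage:
#   if dpkg -l | grep 4.18.0-1004 | not_fully_installed; then
#     echo "all packages installed"
#   fi
```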

ubuntu@lp1794477:~$ uname -a
Linux lp1794477 4.18.0-1004-azure #4~18.04.1LP1794477 SMP Wed Nov 21 20:57:54 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

ubuntu@lp1794477:~$ dmesg | grep -i mella
[5.937218] mlx4_core: Mellanox ConnectX core driver v4.0-0
[5.971114]  mlx4_ib_add: mlx4_ib: Mellanox ConnectX InfiniBand driver v4.0-0
[6.016903] mlx4_en: Mellanox ConnectX HCA Ethernet driver v4.0-0

ubuntu@lp1794477:~$ lspci | grep -i mell
ae9f:00:02.0 Ethernet controller: Mellanox Technologies MT27500/MT27520 Family [ConnectX-3/ConnectX-3 Pro Virtual Function]

ubuntu@lp1794477:~$ lsmod | grep -i mlx
mlx4_en   114688  0
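
On the working kernels, lspci lists the VF as a Mellanox Virtual Function, while on the broken one the same grep returns nothing. A minimal bash sketch (hypothetical helper name) that extracts the PCI address of each such device from lspci output, so a script can test for an empty result:

```bash
#!/usr/bin/env bash
# mellanox_vf_addresses: read `lspci` output on stdin and print the PCI
# address (first field) of every Mellanox Virtual Function line.
mellanox_vf_addresses() {
  awk 'tolower($0) ~ /mellanox/ && /Virtual Function/ { print $1 }'
}

# Usage:
#   vfs=$(lspci | mellanox_vf_addresses)
#   if [ -n "$vfs" ]; then
#     echo "VF present: $vfs"
#   else
#     echo "no Mellanox VF found"
#   fi
```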

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux-azure in Ubuntu.
https://bugs.launchpad.net/bugs/1794477

Title:
  Accelerated networking (SR-IOV VF) broken in 18.10 daily

Status in linux package in Ubuntu:
  In Progress
Status in linux-azure package in Ubuntu:
  Confirmed
Status in linux source package in Cosmic:
  In Progress


[Kernel-packages] [Bug 1794477] Re: Accelerated networking (SR-IOV VF) broken in 18.10 daily

2018-11-21 Thread David Coronel
bionic-tree/4.18.0-1004:


ubuntu@lp1794477:~$ uname -a
Linux lp1794477 4.18.0-1004-azure #4~18.04.1 SMP Wed Nov 21 20:07:48 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

ubuntu@lp1794477:~$ dmesg | grep -i mella
[6.337786] mlx4_core: Mellanox ConnectX core driver v4.0-0
[6.374260]  mlx4_ib_add: mlx4_ib: Mellanox ConnectX InfiniBand driver v4.0-0
[6.393764] mlx4_en: Mellanox ConnectX HCA Ethernet driver v4.0-0

ubuntu@lp1794477:~$ lspci | grep -i mell
ae9f:00:02.0 Ethernet controller: Mellanox Technologies MT27500/MT27520 Family [ConnectX-3/ConnectX-3 Pro Virtual Function]

ubuntu@lp1794477:~$ lsmod | grep -i mlx
mlx4_en   114688  0


[Kernel-packages] [Bug 1794477] Re: Accelerated networking (SR-IOV VF) broken in 18.10 daily

2018-11-21 Thread David Coronel
4.18.0-1004:

ubuntu@lp1794477:~$ uname -a
Linux lp1794477 4.18.0-1004-azure #5~lp1794477 SMP Wed Nov 21 19:19:16 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

ubuntu@lp1794477:~$ dmesg | grep -i mella
[6.009608] mlx4_core: Mellanox ConnectX core driver v4.0-0
[6.045063]  mlx4_ib_add: mlx4_ib: Mellanox ConnectX InfiniBand driver v4.0-0
[6.085565] mlx4_en: Mellanox ConnectX HCA Ethernet driver v4.0-0

ubuntu@lp1794477:~$ lspci | grep -i mell
ae9f:00:02.0 Ethernet controller: Mellanox Technologies MT27500/MT27520 Family [ConnectX-3/ConnectX-3 Pro Virtual Function]

ubuntu@lp1794477:~$ lsmod | grep -i mlx
mlx4_en   114688  0


[Kernel-packages] [Bug 1794477] Re: Accelerated networking (SR-IOV VF) broken in 18.10 daily

2018-11-21 Thread David Coronel
4.18.0-1003:

ubuntu@lp1794477:~$ uname -a
Linux lp1794477 4.18.0-1003-azure #4~lp1794477 SMP Wed Nov 21 18:32:18 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

ubuntu@lp1794477:~$ dmesg | grep -i mella
[5.701478] mlx4_core: Mellanox ConnectX core driver v4.0-0
[5.732267]  mlx4_ib_add: mlx4_ib: Mellanox ConnectX InfiniBand driver v4.0-0
[5.773481] mlx4_en: Mellanox ConnectX HCA Ethernet driver v4.0-0

ubuntu@lp1794477:~$ lspci | grep -i mell
ae9f:00:02.0 Ethernet controller: Mellanox Technologies MT27500/MT27520 Family [ConnectX-3/ConnectX-3 Pro Virtual Function]

ubuntu@lp1794477:~$ lsmod | grep -i mlx
mlx4_en   114688  0


[Kernel-packages] [Bug 1794477] Re: Accelerated networking (SR-IOV VF) broken in 18.10 daily

2018-11-21 Thread David Coronel
4.18.0-1002:

ubuntu@lp1794477:~$ uname -a
Linux lp1794477 4.18.0-1002-azure #3~lp1794477 SMP Wed Nov 21 16:41:12 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

ubuntu@lp1794477:~$ dmesg | grep -i mella
[5.723350] mlx4_core: Mellanox ConnectX core driver v4.0-0
[5.751281]  mlx4_ib_add: mlx4_ib: Mellanox ConnectX InfiniBand driver v4.0-0
[5.806824] mlx4_en: Mellanox ConnectX HCA Ethernet driver v4.0-0

ubuntu@lp1794477:~$ lspci | grep -i mell
ae9f:00:02.0 Ethernet controller: Mellanox Technologies MT27500/MT27520 Family [ConnectX-3/ConnectX-3 Pro Virtual Function]

ubuntu@lp1794477:~$ lsmod | grep -i mlx
mlx4_en   114688  0


[Kernel-packages] [Bug 1794477] Re: Accelerated networking (SR-IOV VF) broken in 18.10 daily

2018-11-21 Thread David Coronel
4.18.0-1001:

ubuntu@lp1794477:~$ uname -a
Linux lp1794477 4.18.0-1001-azure #2~lp1794477 SMP Wed Nov 21 16:06:45 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

ubuntu@lp1794477:~$ dmesg | grep -i mella
[5.701988] mlx4_core: Mellanox ConnectX core driver v4.0-0
[5.730013]  mlx4_ib_add: mlx4_ib: Mellanox ConnectX InfiniBand driver v4.0-0
[5.780748] mlx4_en: Mellanox ConnectX HCA Ethernet driver v4.0-0

ubuntu@lp1794477:~$ lspci | grep -i mell
ae9f:00:02.0 Ethernet controller: Mellanox Technologies MT27500/MT27520 Family [ConnectX-3/ConnectX-3 Pro Virtual Function]

ubuntu@lp1794477:~$ lsmod | grep -i mlx
mlx4_en   114688  0


[Kernel-packages] [Bug 1794477] Re: Accelerated networking (SR-IOV VF) broken in 18.10 daily

2018-11-21 Thread David Coronel
I can see the Mellanox devices with that 4.17.0-1001-azure kernel.

I installed the new kernel:

sudo dpkg -i \
  linux-modules-4.17.0-1001-azure_4.17.0-1001.2~lp1794477_amd64.deb \
  linux-modules-extra-4.17.0-1001-azure_4.17.0-1001.2~lp1794477_amd64.deb \
  linux-image-unsigned-4.17.0-1001-azure_4.17.0-1001.2~lp1794477_amd64.deb

Fixed the missing dependencies with:

sudo apt --fix-broken install

Updated grub just in case:

sudo update-grub

I rebooted and I see the devices:

ubuntu@lp1794477:~$ uname -a
Linux lp1794477 4.17.0-1001-azure #2~lp1794477 SMP Tue Nov 20 22:40:04 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

ubuntu@lp1794477:~$ dmesg | grep -i mella
[5.534382] mlx4_core: Mellanox ConnectX core driver v4.0-0
[5.564313]  mlx4_ib_add: mlx4_ib: Mellanox ConnectX InfiniBand driver v4.0-0
[5.588721] mlx4_en: Mellanox ConnectX HCA Ethernet driver v4.0-0

ubuntu@lp1794477:~$ lspci | grep -i mell
ae9f:00:02.0 Ethernet controller: Mellanox Technologies MT27500/MT27520 Family [ConnectX-3/ConnectX-3 Pro Virtual Function]

ubuntu@lp1794477:~$ lsmod | grep -i mlx
mlx4_en   114688  0


[Kernel-packages] [Bug 1794477] Re: Accelerated networking (SR-IOV VF) broken in 18.10 daily

2018-11-20 Thread David Coronel
These are the packages I see installed out of the box in the instance:

ubuntu@lp1794477:~/41701001$ dpkg -l | grep 4.15.0-1030.31 | sort

ii  linux-azure-cloud-tools-4.15.0-1030  4.15.0-1030.31  amd64  Linux kernel version specific cloud tools for version 4.15.0-1030
ii  linux-azure-headers-4.15.0-1030      4.15.0-1030.31  all    Header files related to Linux kernel version 4.15.0
ii  linux-azure-tools-4.15.0-1030        4.15.0-1030.31  amd64  Linux kernel version specific tools for version 4.15.0-1030
ii  linux-cloud-tools-4.15.0-1030-azure  4.15.0-1030.31  amd64  Linux kernel version specific cloud tools for version 4.15.0-1030
ii  linux-headers-4.15.0-1030-azure      4.15.0-1030.31  amd64  Linux kernel headers for version 4.15.0 on 64 bit x86 SMP
ii  linux-image-4.15.0-1030-azure        4.15.0-1030.31  amd64  Signed kernel image azure
ii  linux-modules-4.15.0-1030-azure      4.15.0-1030.31  amd64  Linux kernel extra modules for version 4.15.0 on 64 bit x86 SMP
ii  linux-tools-4.15.0-1030-azure        4.15.0-1030.31  amd64  Linux kernel version specific tools for version 4.15.0-1030

And these are the files I downloaded from your link:

ubuntu@lp1794477:~/41701001$ ls -1 | sort

linux-azure-cloud-tools-4.17.0-1001_4.17.0-1001.2~lp1794477_amd64.deb
linux-azure-tools-4.17.0-1001_4.17.0-1001.2~lp1794477_amd64.deb
linux-cloud-tools-4.17.0-1001-azure_4.17.0-1001.2~lp1794477_amd64.deb
linux-headers-4.17.0-1001-azure_4.17.0-1001.2~lp1794477_amd64.deb
linux-image-unsigned-4.17.0-1001-azure_4.17.0-1001.2~lp1794477_amd64.deb
linux-modules-4.17.0-1001-azure_4.17.0-1001.2~lp1794477_amd64.deb
linux-modules-extra-4.17.0-1001-azure_4.17.0-1001.2~lp1794477_amd64.deb
linux-tools-4.17.0-1001-azure_4.17.0-1001.2~lp1794477_amd64.deb
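
Comparing the two lists shows which package names the download lacks relative to the stock install. A minimal bash sketch (helper names hypothetical) that replaces the ABI string in both lists with a placeholder and prints the names present only in the installed set. On the two lists above it should report linux-azure-headers-ABI, the dependency dpkg complains about in the install attempt below, and linux-image-ABI-azure, because the download provides linux-image-unsigned instead:

```bash
#!/usr/bin/env bash
# normalise_abi ABI: replace the literal ABI string (e.g. 4.15.0-1030) in each
# stdin line with the placeholder "ABI" so names from different kernels compare.
normalise_abi() {
  local abi=$1 line
  while IFS= read -r line; do
    printf '%s\n' "${line//"$abi"/ABI}"   # quoted pattern: literal match
  done
}

# missing_from_download OLD_ABI NEW_ABI DPKG_LIST_FILE DEB_LIST_FILE
# DPKG_LIST_FILE holds `dpkg -l` lines for the running kernel; DEB_LIST_FILE
# holds the downloaded .deb file names, one per line.
missing_from_download() {
  local want have
  # Package name is field 2 of each "ii" line of dpkg -l.
  want=$(awk '$1 == "ii" { print $2 }' "$3" | normalise_abi "$1")
  # Package name is everything before the first "_" in a .deb file name.
  have=$(sed 's/_.*//' "$4" | normalise_abi "$2")
  # Keep installed names with no downloaded counterpart, sorted bytewise.
  printf '%s\n' "$want" | grep -vxF -f <(printf '%s\n' "$have") | LC_ALL=C sort
}

# Usage:
#   dpkg -l | grep 4.15.0-1030 > installed.txt
#   ls -1 *.deb > downloaded.txt
#   missing_from_download 4.15.0-1030 4.17.0-1001 installed.txt downloaded.txt
```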


[Kernel-packages] [Bug 1794477] Re: Accelerated networking (SR-IOV VF) broken in 18.10 daily

2018-11-20 Thread David Coronel
Hi Joseph,

Is there a special way I should install these? I tried:

sudo dpkg -i *

But I get:

Selecting previously unselected package linux-azure-cloud-tools-4.17.0-1001.
(Reading database ... 56547 files and directories currently installed.)
Preparing to unpack linux-azure-cloud-tools-4.17.0-1001_4.17.0-1001.2~lp1794477_amd64.deb ...
Unpacking linux-azure-cloud-tools-4.17.0-1001 (4.17.0-1001.2~lp1794477) ...
Selecting previously unselected package linux-azure-tools-4.17.0-1001.
Preparing to unpack linux-azure-tools-4.17.0-1001_4.17.0-1001.2~lp1794477_amd64.deb ...
Unpacking linux-azure-tools-4.17.0-1001 (4.17.0-1001.2~lp1794477) ...
Selecting previously unselected package linux-cloud-tools-4.17.0-1001-azure.
Preparing to unpack linux-cloud-tools-4.17.0-1001-azure_4.17.0-1001.2~lp1794477_amd64.deb ...
Unpacking linux-cloud-tools-4.17.0-1001-azure (4.17.0-1001.2~lp1794477) ...
Selecting previously unselected package linux-headers-4.17.0-1001-azure.
Preparing to unpack linux-headers-4.17.0-1001-azure_4.17.0-1001.2~lp1794477_amd64.deb ...
Unpacking linux-headers-4.17.0-1001-azure (4.17.0-1001.2~lp1794477) ...
Selecting previously unselected package linux-image-unsigned-4.17.0-1001-azure.
Preparing to unpack linux-image-unsigned-4.17.0-1001-azure_4.17.0-1001.2~lp1794477_amd64.deb ...
Unpacking linux-image-unsigned-4.17.0-1001-azure (4.17.0-1001.2~lp1794477) ...
Selecting previously unselected package linux-modules-4.17.0-1001-azure.
Preparing to unpack linux-modules-4.17.0-1001-azure_4.17.0-1001.2~lp1794477_amd64.deb ...
Unpacking linux-modules-4.17.0-1001-azure (4.17.0-1001.2~lp1794477) ...
Selecting previously unselected package linux-modules-extra-4.17.0-1001-azure.
Preparing to unpack linux-modules-extra-4.17.0-1001-azure_4.17.0-1001.2~lp1794477_amd64.deb ...
Unpacking linux-modules-extra-4.17.0-1001-azure (4.17.0-1001.2~lp1794477) ...
Selecting previously unselected package linux-tools-4.17.0-1001-azure.
Preparing to unpack linux-tools-4.17.0-1001-azure_4.17.0-1001.2~lp1794477_amd64.deb ...
Unpacking linux-tools-4.17.0-1001-azure (4.17.0-1001.2~lp1794477) ...
Setting up linux-azure-cloud-tools-4.17.0-1001 (4.17.0-1001.2~lp1794477) ...
dpkg: dependency problems prevent configuration of linux-azure-tools-4.17.0-1001:
 linux-azure-tools-4.17.0-1001 depends on libc6 (>= 2.28); however:
  Version of libc6:amd64 on system is 2.27-3ubuntu1.

dpkg: error processing package linux-azure-tools-4.17.0-1001 (--install):
 dependency problems - leaving unconfigured
Setting up linux-cloud-tools-4.17.0-1001-azure (4.17.0-1001.2~lp1794477) ...
dpkg: dependency problems prevent configuration of linux-headers-4.17.0-1001-azure:
 linux-headers-4.17.0-1001-azure depends on linux-azure-headers-4.17.0-1001; however:
  Package linux-azure-headers-4.17.0-1001 is not installed.

dpkg: error processing package linux-headers-4.17.0-1001-azure (--install):
 dependency problems - leaving unconfigured
Setting up linux-modules-4.17.0-1001-azure (4.17.0-1001.2~lp1794477) ...
dpkg: dependency problems prevent configuration of linux-modules-extra-4.17.0-1001-azure:
 linux-modules-extra-4.17.0-1001-azure depends on crda | wireless-crda; however:
  Package crda is not installed.
  Package wireless-crda is not installed.

dpkg: error processing package linux-modules-extra-4.17.0-1001-azure (--install):
 dependency problems - leaving unconfigured
dpkg: dependency problems prevent configuration of linux-tools-4.17.0-1001-azure:
 linux-tools-4.17.0-1001-azure depends on linux-azure-tools-4.17.0-1001; however:
  Package linux-azure-tools-4.17.0-1001 is not configured yet.

dpkg: error processing package linux-tools-4.17.0-1001-azure (--install):
 dependency problems - leaving unconfigured
Setting up linux-image-unsigned-4.17.0-1001-azure (4.17.0-1001.2~lp1794477) ...
I: /vmlinuz is now a symlink to boot/vmlinuz-4.17.0-1001-azure
I: /initrd.img is now a symlink to boot/initrd.img-4.17.0-1001-azure
Processing triggers for linux-image-unsigned-4.17.0-1001-azure (4.17.0-1001.2~lp1794477) ...
/etc/kernel/postinst.d/initramfs-tools:
update-initramfs: Generating /boot/initrd.img-4.17.0-1001-azure
/etc/kernel/postinst.d/zz-update-grub:
Generating grub configuration file ...
Found linux image: /boot/vmlinuz-4.17.0-1001-azure
Found initrd image: /boot/initrd.img-4.17.0-1001-azure
Found linux image: /boot/vmlinuz-4.15.0-1030-azure
Found initrd image: /boot/initrd.img-4.15.0-1030-azure
done
Errors were encountered while processing:
 linux-azure-tools-4.17.0-1001
 linux-headers-4.17.0-1001-azure
 linux-modules-extra-4.17.0-1001-azure
 linux-tools-4.17.0-1001-azure
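
dpkg -i installs only the files it is given and does not fetch dependencies, so packages with unmet dependencies are left unconfigured and listed at the end of the output. A minimal bash sketch (hypothetical helper name) that extracts that final list from captured dpkg output, so a script can see what still needs attention:

```bash
#!/usr/bin/env bash
# failed_packages: read dpkg output on stdin and print the packages listed
# under "Errors were encountered while processing:", one per line.
failed_packages() {
  awk '/^Errors were encountered while processing:/ { in_list = 1; next }
       in_list && /^ / { sub(/^ +/, ""); print; next }
       in_list { exit }'
}

# Usage (dpkg writes the error summary to stderr, hence 2>&1):
#   sudo dpkg -i ./*.deb 2>&1 | tee dpkg.log
#   failed_packages < dpkg.log
```

Alternatively, apt can install local .deb files when they are given as paths (for example `sudo apt install ./*.deb`) and will try to resolve their dependencies from the configured archives. A requirement such as libc6 (>= 2.28) still cannot be met that way if the archive only carries 2.27.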


[Kernel-packages] [Bug 1794477] Re: Accelerated networking (SR-IOV VF) broken in 18.10 daily

2018-11-20 Thread David Coronel
Hi Chris. I spoke to Joseph and I think we have everything we need to
start the bisect. We'll get started and keep you posted.

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux-azure in Ubuntu.
https://bugs.launchpad.net/bugs/1794477

Title:
  Accelerated networking (SR-IOV VF) broken in 18.10 daily

Status in linux package in Ubuntu:
  In Progress
Status in linux-azure package in Ubuntu:
  Confirmed
Status in linux source package in Cosmic:
  In Progress

Bug description:
  While testing the Ubuntu 18.10 daily image from the cloud-images repo on Azure, we discovered that accelerated networking wasn’t working inside the VM.
  No VF shows up inside the VM, and lspci didn’t show any Mellanox drivers in use.
  We also tested the daily build on Hyper-V, and there the Mellanox VF is functional, with the same mlx4 drivers.

  To give more details about this:
  • No Mellanox logs are showing up in dmesg or syslog.
  • modinfo mlx4_core/mlx4_en finds the module, but lsmod doesn’t show it as loaded, although Accelerated Networking is enabled for the Azure VM, so this should happen transparently.
  • modprobe -r mlx4_core && modprobe mlx4_core returns exit code 0, but nothing really happens, and no Mellanox messages are logged in dmesg/syslog.
  • There are no entries in the logs showing anything about the drivers or netvsc/pci-hyperv that might relate to this issue.

  Kernel: 4.18.0-7-generic

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1794477/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp


[Kernel-packages] [Bug 1794477] Re: Accelerated networking (SR-IOV VF) broken in 18.10 daily

2018-11-20 Thread David Coronel
I can reproduce the issue in Azure with Ubuntu 18.04 and the kernel
linux-azure-edge 4.18.0.1004.5 from bionic-proposed.

I use an instance of type "Standard F4s_v2 (4 vcpus, 8 GB memory)"

I launch the instance and get the 4.15.0-1030-azure kernel:

$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 18.04.1 LTS
Release:        18.04
Codename:       bionic

$ uname -a
Linux davecore-an 4.15.0-1030-azure #31-Ubuntu SMP Tue Oct 30 18:35:53 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

$ dmesg | grep -i mella
[   27.398120] mlx4_core: Mellanox ConnectX core driver v4.0-0
[   27.425228]  mlx4_ib_add: mlx4_ib: Mellanox ConnectX InfiniBand driver v4.0-0
[   27.658916] mlx4_en: Mellanox ConnectX HCA Ethernet driver v4.0-0

$ dpkg -l | grep linux-azure
ii  linux-azure                          4.15.0.1030.30   amd64  Complete Linux kernel for Azure systems.
ii  linux-azure-cloud-tools-4.15.0-1030  4.15.0-1030.31   amd64  Linux kernel version specific cloud tools for version 4.15.0-1030
ii  linux-azure-headers-4.15.0-1030      4.15.0-1030.31   all    Header files related to Linux kernel version 4.15.0
ii  linux-azure-tools-4.15.0-1030        4.15.0-1030.31   amd64  Linux kernel version specific tools for version 4.15.0-1030

$ lsmod  | grep -i mlx
mlx4_en   114688  0

$ lspci
0000:00:00.0 Host bridge: Intel Corporation 440BX/ZX/DX - 82443BX/ZX/DX Host bridge (AGP disabled) (rev 03)
0000:00:07.0 ISA bridge: Intel Corporation 82371AB/EB/MB PIIX4 ISA (rev 01)
0000:00:07.1 IDE interface: Intel Corporation 82371AB/EB/MB PIIX4 IDE (rev 01)
0000:00:07.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 02)
0000:00:08.0 VGA compatible controller: Microsoft Corporation Hyper-V virtual VGA
0001:00:02.0 Ethernet controller: Mellanox Technologies MT27500/MT27520 Family [ConnectX-3/ConnectX-3 Pro Virtual Function]

I then enable bionic-proposed and install the linux-azure-edge kernel
and reboot. I can't see the Mellanox device anymore:

$ dmesg | grep -i mella

$ uname -a
Linux davecore-an 4.18.0-1004-azure #4~18.04.1-Ubuntu SMP Thu Oct 25 14:25:41 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

$ dpkg -l | grep linux-azure
ii  linux-azure                               4.15.0.1030.30          amd64  Complete Linux kernel for Azure systems.
ii  linux-azure-cloud-tools-4.15.0-1030       4.15.0-1030.31          amd64  Linux kernel version specific cloud tools for version 4.15.0-1030
ii  linux-azure-edge                          4.18.0.1004.5           amd64  Complete Linux kernel for Azure systems.
ii  linux-azure-edge-cloud-tools-4.18.0-1004  4.18.0-1004.4~18.04.1   amd64  Linux kernel version specific cloud tools for version 4.18.0-1004
ii  linux-azure-edge-tools-4.18.0-1004        4.18.0-1004.4~18.04.1   amd64  Linux kernel version specific tools for version 4.18.0-1004
ii  linux-azure-headers-4.15.0-1030           4.15.0-1030.31          all    Header files related to Linux kernel version 4.15.0
ii  linux-azure-headers-4.18.0-1004           4.18.0-1004.4~18.04.1   all    Header files related to Linux kernel version 4.18.0
ii  linux-azure-tools-4.15.0-1030             4.15.0-1030.31          amd64  Linux kernel version specific tools for version 4.15.0-1030

$ lspci
00:00.0 Host bridge: Intel Corporation 440BX/ZX/DX - 82443BX/ZX/DX Host bridge (AGP disabled) (rev 03)
00:07.0 ISA bridge: Intel Corporation 82371AB/EB/MB PIIX4 ISA (rev 01)
00:07.1 IDE interface: Intel Corporation 82371AB/EB/MB PIIX4 IDE (rev 01)
00:07.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 02)
00:08.0 VGA compatible controller: Microsoft Corporation Hyper-V virtual VGA
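
The lsmod/dmesg checks above can be scripted. Below is a minimal C sketch, assuming only the standard /proc/modules layout (module name in the first column of each line); it reports whether any module whose name starts with "mlx4_" is loaded. The helper names are hypothetical, and the 64 KiB read buffer is an arbitrary choice that would truncate an unusually long module list.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* True if some line of modules_text (in /proc/modules format: module
 * name first) names a module starting with prefix. A prefix of "mlx4_"
 * therefore matches mlx4_core, mlx4_en and mlx4_ib alike. */
static bool module_loaded(const char *modules_text, const char *prefix)
{
	size_t plen = strlen(prefix);
	const char *line = modules_text;

	assert(plen > 0);
	while (line && *line) {
		if (strncmp(line, prefix, plen) == 0)
			return true;
		line = strchr(line, '\n');
		if (line)
			line++;          /* start of the next line */
	}
	return false;
}

/* Live check: 1 if an mlx4_* module is loaded, 0 if not, -1 if
 * /proc/modules cannot be opened. Reads at most 64 KiB. */
static int mlx4_loaded_here(void)
{
	static char buf[64 * 1024];
	FILE *f = fopen("/proc/modules", "r");

	if (!f)
		return -1;
	size_t n = fread(buf, 1, sizeof(buf) - 1, f);
	fclose(f);
	buf[n] = '\0';
	return module_loaded(buf, "mlx4_") ? 1 : 0;
}
```

Given the lsmod output above (mlx4_en listed), this check would report 1 on the 4.15.0-1030-azure kernel.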

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux-azure in Ubuntu.
https://bugs.launchpad.net/bugs/1794477

Title:
  Accelerated networking (SR-IOV VF) broken in 18.10 daily

Status in linux package in Ubuntu:
  In Progress
Status in linux-azure package in Ubuntu:
  Confirmed
Status in linux source package in Cosmic:
  In Progress

[Kernel-packages] [Bug 1797314] Re: fscache: bad refcounting in fscache_op_complete leads to OOPS

2018-10-30 Thread David Coronel
I have a user who confirms the 4.15.0-39 kernel fixes this issue:

# uname -a
Linux  4.15.0-39-generic #42~16.04.1-Ubuntu SMP Wed Oct 24 17:09:54 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

# cat /proc/fs/fscache/stats 
FS-Cache statistics
Cookies: idx=2 dat=283731 spc=0
Objects: alc=278557 nal=0 avl=278557 ded=220831
ChkAux : non=0 ok=278555 upd=0 obs=0
Pages : mrk=347998933 unc=323161526
Acquire: n=283733 nul=0 noc=0 ok=283733 nbf=0 oom=0
Lookups: n=278557 neg=0 pos=278557 crt=0 tmo=0
Invals : n=0 run=0
Updates: n=0 nul=0 run=0
Relinqs: n=226008 nul=0 wcr=0 rtr=0
AttrChg: n=0 ok=0 nbf=0 oom=0 run=0
Allocs : n=0 ok=0 wt=0 nbf=0 int=0
Allocs : ops=0 owt=0 abt=0
Retrvls: n=960948 ok=960948 wt=24031 nod=0 nbf=0 int=0 oom=0
Retrvls: ops=960948 owt=51 abt=0
Stores : n=0 ok=0 agn=0 nbf=0 oom=0
Stores : ops=0 run=0 pgs=0 rxd=0 olm=0
VmScan : nos=323161526 gon=0 bsy=0 can=0 wt=0
Ops : pend=51 run=960948 enq=379179023 can=0 rej=0
Ops : ini=960948 dfr=15 rel=960948 gc=15
CacheOp: alo=0 luo=0 luc=0 gro=0
CacheOp: inv=0 upo=0 dro=0 pto=0 atc=0 syn=0
CacheOp: rap=0 ras=0 alp=0 als=0 wrp=0 ucp=0 dsp=0
CacheEv: nsp=0 stl=0 rtr=0 cul=0

Marking verification done for Bionic.

** Tags removed: verification-needed-bionic
** Tags added: verification-done-bionic

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1797314

Title:
  fscache: bad refcounting in fscache_op_complete leads to OOPS

Status in linux package in Ubuntu:
  In Progress
Status in linux source package in Xenial:
  Fix Committed
Status in linux source package in Bionic:
  Fix Committed
Status in linux source package in Cosmic:
  In Progress

Bug description:
  SRU Justification
  -----------------

  [Impact]

  A kernel BUG is sometimes observed when using fscache:
  [4740718.880898] FS-Cache:
  [4740718.880920] FS-Cache: Assertion failed
  [4740718.880934] FS-Cache: 0 > 0 is false
  [4740718.881001] ------------[ cut here ]------------
  [4740718.881017] kernel BUG at /usr/src/linux-4.4.0/fs/fscache/operation.c:449!
  [4740718.881040] invalid opcode: 0000 [#1] SMP

  [4740718.892659] Call Trace:
  [4740718.893506]  [] cachefiles_read_copier+0x3a9/0x410 [cachefiles]
  [4740718.894374]  [] fscache_op_work_func+0x22/0x50 [fscache]
  [4740718.895180]  [] process_one_work+0x150/0x3f0
  [4740718.895966]  [] worker_thread+0x11a/0x470
  [4740718.896753]  [] ? __schedule+0x359/0x980
  [4740718.897783]  [] ? rescuer_thread+0x310/0x310
  [4740718.898581]  [] kthread+0xd6/0xf0
  [4740718.899469]  [] ? kthread_park+0x60/0x60
  [4740718.900477]  [] ret_from_fork+0x3f/0x70
  [4740718.901514]  [] ? kthread_park+0x60/0x60

  [Problem]

  In include/linux/fscache-cache.h, fscache_retrieval_complete reads, in
  part:

  atomic_sub(n_pages, &op->n_pages);
  if (atomic_read(&op->n_pages) <= 0)
      fscache_op_complete(&op->op, true);

  The code uses atomic_sub followed by a separate atomic_read. Two
  threads decrementing the page count can race with each other: both
  can see op->n_pages <= 0 at the same time, and both then call
  fscache_op_complete, leading to the OOPS.

  [Fix]
  The fix is trivial: use atomic_sub_return instead of the two separate calls.
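
A userspace C11 sketch of the race and the fix, using <stdatomic.h> in place of the kernel's atomic_t. The struct and function names are hypothetical. atomic_fetch_sub() returns the value before the subtraction, so subtracting n once more yields what atomic_sub_return() would return. The replay function runs the bad interleaving step by step on one thread so that the double completion is deterministic.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical model of struct fscache_retrieval: the outstanding page
 * count plus a counter of how often completion was signalled. */
struct retrieval {
	atomic_int n_pages;     /* pages still outstanding */
	atomic_int completions; /* times op_complete() ran; should end at 1 */
};

static void op_complete(struct retrieval *op)
{
	atomic_fetch_add(&op->completions, 1);
}

/* Fixed pattern: one atomic read-modify-write. (old - n) is the value
 * this caller produced, like the kernel's atomic_sub_return(), so the
 * decision uses this caller's own result rather than a later re-read
 * that other threads may already have changed. */
static void retrieval_complete(struct retrieval *op, int n)
{
	assert(n > 0);
	int remaining = atomic_fetch_sub(&op->n_pages, n) - n;

	if (remaining <= 0)
		op_complete(op);
}

/* Replays on one thread the interleaving that breaks the original
 * "atomic_sub(); atomic_read();" pair: both callers subtract before
 * either reads, so both reads see 0 and the op completes twice. */
static void replay_racy_interleaving(struct retrieval *op)
{
	atomic_fetch_sub(&op->n_pages, 1);   /* thread A: atomic_sub */
	atomic_fetch_sub(&op->n_pages, 1);   /* thread B: atomic_sub */
	if (atomic_load(&op->n_pages) <= 0)  /* thread A: atomic_read */
		op_complete(op);
	if (atomic_load(&op->n_pages) <= 0)  /* thread B: atomic_read */
		op_complete(op);
}
```

The single completion is only guaranteed when the amounts passed in add up to the initial page count; over-subtracting would still complete more than once.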

  [Testcase]
  I believe the user has tested the patch successfully on their fscache/cachefiles setup.

  [Regression Potential]
  Limited to fscache. Small, comprehensible change.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1797314/+subscriptions



[Kernel-packages] [Bug 1796376] Re: Bug #1739107 fix causes linux-cloud-tools-common not to be upgradable with unattended-upgrades on shutdown mode

2018-10-15 Thread David Coronel
@Guillaume: I am Eric's colleague from comment #21. I was able to
reproduce the issue in Azure with your reproducer in comment #5. I'll
show Eric how I reproduced the issue and he'll be able to follow up with
you. Thanks!

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1796376

Title:
  Bug #1739107 fix causes linux-cloud-tools-common not to be upgradable
  with unattended-upgrades on shutdown mode

Status in linux package in Ubuntu:
  Confirmed

Bug description:
  Since the following linux-cloud-tools-common package versions:

  Xenial: 4.4.0-135.161
  Bionic: 4.15.0-34.37

  the systemd service unit file for hv-kvp-daemon has been modified with new dependencies that make the package impossible to upgrade with unattended-upgrades in shutdown mode.
  unattended-upgrades hangs on the linux-cloud-tools-common package during "Preparing to unpack". The server restarts when the unattended-upgrades service timeout expires.

  - Package state after reboot:

  iFR linux-cloud-tools-common  4.15.0-34.37  all  Linux kernel version specific cloud tools for version 4.15.0

  - Unattended-upgrades dpkg logs:

  Log started: 2018-10-05  17:59:04
  (Reading database ... 52043 files and directories currently installed.)
  Preparing to unpack .../linux-cloud-tools-common_4.15.0-36.39_all.deb ...
  Log ended: 2018-10-05  18:00:54

  - Impact:

  The impact is severe: all security updates are blocked until each server is fixed manually with:

  dpkg --configure -a
  apt install --only-upgrade linux-cloud-tools-common

  - Workaround/Fix:

  As a simple, straightforward fix, replacing:

  Before=shutdown.target cloud-init-local.service walinuxagent.service

  with:

  Before=shutdown.target walinuxagent.service

  makes the package upgradable during shutdown.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1796376/+subscriptions



[Kernel-packages] [Bug 1793430] Re: Page leaking in cachefiles_read_backing_file while vmscan is active

2018-10-11 Thread David Coronel
** Tags removed: verification-needed-bionic
** Tags added: verification-done-bionic

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1793430

Title:
  Page leaking in cachefiles_read_backing_file while vmscan is active

Status in linux package in Ubuntu:
  Fix Committed
Status in linux source package in Xenial:
  Fix Committed
Status in linux source package in Bionic:
  Fix Committed
Status in linux source package in Cosmic:
  Fix Committed

Bug description:
  SRU Justification
  -----------------

  [Description]
  In a heavily loaded system where the page cache is nearing memory
  limits and fscache is enabled, fscache can leak pages while trying to
  read pages from the cachefiles backend. This can happen when two
  applications read the same page from a single mount: two threads can
  try to read the backing page at the same time, and one of them then
  finds that a page for the backing file or netfs file is already in the
  radix tree. In that error path, cachefiles does not drop its reference
  on the backing page, so the page is leaked.

  [Fix]
  The fix is straightforward: drop the reference on the backing page
  when the error is encountered.
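
A single-threaded C sketch of the reference-counting rule, assuming a one-slot "page cache" in place of the radix tree and plain int refcounts. All names are hypothetical and this is not the cachefiles code. The reader that loses the insert race must drop the reference on its own page before switching to the page that is already cached.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical model: a page with a plain reference count, and a
 * one-slot cache standing in for the radix tree. */
struct page {
	int refcount;
};

struct cache_slot {
	struct page *page;   /* empty while null */
};

static void get_page(struct page *p) { p->refcount++; }

static void put_page(struct page *p)
{
	assert(p->refcount > 0);
	p->refcount--;       /* reaching 0 means the page could be freed */
}

/* Fails with -EEXIST if another reader already inserted a page. */
static int cache_insert(struct cache_slot *slot, struct page *p)
{
	if (slot->page)
		return -EEXIST;
	slot->page = p;
	return 0;
}

/* newpage arrives holding one reference owned by the caller. */
static struct page *read_backing_page(struct cache_slot *slot, struct page *newpage)
{
	if (cache_insert(slot, newpage) == 0) {
		get_page(newpage);   /* the cache keeps its own reference */
		return newpage;      /* caller keeps the allocation reference */
	}
	/* -EEXIST: another reader won. Drop our reference on the unused
	 * page (the step the report says was missing), use the cached one. */
	put_page(newpage);
	get_page(slot->page);
	return slot->page;
}
```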
  
  [Testing]
  A user has tested the fix for 12+ hours using the following method:
  
  1) mkdir -p /mnt/nfs ; mount -o vers=3,fsc :/export /mnt/nfs
  2) create 1 files of 2.8MB in a NFS mount.
  3) start a thread to simulate heavy VM pressure
     (while true ; do echo 3 > /proc/sys/vm/drop_caches ; sleep 1 ; done)&
  4) start multiple parallel readers of the data set at the same time
     find /mnt/nfs -type f | xargs -P 80 cat > /dev/null &
     find /mnt/nfs -type f | xargs -P 80 cat > /dev/null &
     find /mnt/nfs -type f | xargs -P 80 cat > /dev/null &
     ..
     ..
     find /mnt/nfs -type f | xargs -P 80 cat > /dev/null &
     find /mnt/nfs -type f | xargs -P 80 cat > /dev/null &
  5) finally, check with cat /proc/fs/fscache/stats | grep -i pages ;
     free -h , cat /proc/meminfo and page-types -r -b lru
     to ensure all pages are freed.
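
Step 5 compares fscache counters by eye. A small C helper, assuming only the "key=value" layout visible in /proc/fs/fscache/stats lines such as "Pages : mrk=... unc=...", can extract individual values so that readings taken before and after a run can be compared. The helper name is hypothetical, and which values indicate a leak is left to the tester; it is not asserted here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Finds "key=<number>" in one stats line and stores the number in *out.
 * The key must start a word (line start or after a space) so that, for
 * example, "n" does not match the tail of "run=". */
static bool stats_field(const char *line, const char *key, unsigned long long *out)
{
	assert(line && key && out);
	size_t klen = strlen(key);
	const char *p = line;

	if (klen == 0)
		return false;
	while ((p = strstr(p, key)) != NULL) {
		bool word_start = (p == line || p[-1] == ' ');

		if (word_start && p[klen] == '=')
			return sscanf(p + klen + 1, "%llu", out) == 1;
		p += klen;
	}
	return false;
}
```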

  [Regression Potential]
  Limited to cachefiles.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1793430/+subscriptions



[Kernel-packages] [Bug 1796376] Re: Bug #1739107 fix causes linux-cloud-tools-common not to be upgradable with unattended-upgrades on shutdown mode

2018-10-09 Thread David Coronel
Hi Guillaume, would it be possible to provide step-by-step instructions
to reproduce this issue?

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1796376

Title:
  Bug #1739107 fix causes linux-cloud-tools-common not to be upgradable
  with unattended-upgrades on shutdown mode

Status in linux package in Ubuntu:
  Confirmed

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1796376/+subscriptions



[Kernel-packages] [Bug 1776277] Re: fscache cookie refcount updated incorrectly during fscache object allocation

2018-09-04 Thread David Coronel
I have confirmation that the kernel in -proposed fixes the issue in
xenial.

** Tags removed: verification-needed-xenial
** Tags added: verification-done-xenial

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1776277

Title:
  fscache cookie refcount updated incorrectly during fscache object
  allocation

Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Trusty:
  Fix Committed
Status in linux source package in Xenial:
  Fix Committed
Status in linux source package in Bionic:
  Fix Committed

Bug description:
  == SRU Justification ==

  [Impact]
  Oops during heavy NFS + FSCache + Cachefiles use:

   kernel BUG at /build/linux-Y09MKI/linux-4.4.0/fs/fscache/internal.h:321!
   kernel BUG at /build/linux-Y09MKI/linux-4.4.0/fs/fscache/cookie.c:639!

  [Cause]
   1) Two threads are operating on the same cookie and two objects.
   2a) One thread tries to unmount the filesystem and, in the process, walks
     a huge list of objects, marking them dead and deleting them.
     cookie->usage is also decremented in the following path:
    nfs_fscache_release_super_cookie
     -> __fscache_relinquish_cookie
      -> __fscache_cookie_put
       -> BUG_ON(atomic_read(&cookie->usage) <= 0);

   2b) The second thread tries to look up an object for reading data in the
     following path:

   fscache_alloc_object
    1) cachefiles_alloc_object
       -> fscache_object_init
          -> assigns the cookie, but usage is not bumped.
    2) fscache_attach_object -> fails in cant_attach_object because the
       cookie's backing object or the cookie->parent object is going away
    3) fscache_put_object
       -> cachefiles_put_object
        -> fscache_object_destroy
         -> fscache_cookie_put
          -> BUG_ON(atomic_read(&cookie->usage) <= 0);

  [Fix]
   In fscache_object_init, when the object is first assigned a cookie,
   bump the cookie's usage atomically, and only if its refcount is not
   already zero.
   Remove the assignment in attach_object.
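
The "bump only if the refcount is not zero" step is the classic increment-if-not-zero pattern (the kernel provides atomic_inc_not_zero() for it). Below is a userspace C11 sketch with a compare-and-swap loop; struct cookie, struct object and the helper names are hypothetical models, not fscache code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace models; the field name mirrors cookie->usage. */
struct cookie {
	atomic_int usage;
};

struct object {
	struct cookie *cookie;
};

/* Take a reference only while usage is non-zero, in one atomic step.
 * On failure, compare_exchange reloads old with the current value. */
static bool cookie_get_not_zero(struct cookie *c)
{
	int old = atomic_load(&c->usage);

	do {
		if (old == 0)
			return false;   /* being torn down: never resurrect it */
	} while (!atomic_compare_exchange_weak(&c->usage, &old, old + 1));
	return true;
}

/* Drop a reference; returns true if it was the last one. */
static bool cookie_put(struct cookie *c)
{
	int old = atomic_fetch_sub(&c->usage, 1);

	assert(old > 0);        /* plays the role of BUG_ON(usage <= 0) */
	return old == 1;
}

/* Object init: bind the cookie only if a reference was taken, so the
 * put on a later error path is balanced by this get. */
static bool object_init(struct object *obj, struct cookie *c)
{
	if (!cookie_get_not_zero(c))
		return false;
	obj->cookie = c;
	return true;
}
```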

  [Testcase]
  A user has run ~100 hours of NFS stress tests and not seen this bug recur.

  [Regression Potential]
   - Limited to fscache/cachefiles.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1776277/+subscriptions



[Kernel-packages] [Bug 1776254] Re: CacheFiles: Error: Overlong wait for old active object to go away.

2018-09-04 Thread David Coronel
I have confirmation that the kernel in -proposed fixes the issue in
xenial.

** Tags removed: verification-needed-xenial
** Tags added: verification-done-xenial

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1776254

Title:
  CacheFiles: Error: Overlong wait for old active object to go away.

Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Trusty:
  Fix Committed
Status in linux source package in Xenial:
  Fix Committed
Status in linux source package in Bionic:
  Fix Committed

Bug description:
  == SRU Justification ==

  [Impact]
  Oops during heavy NFS + FSCache + Cachefiles use:

   CacheFiles: Error: Overlong wait for old active object to go away.
   BUG: unable to handle kernel NULL pointer dereference at 0002

   CacheFiles: Error: Object already active
   kernel BUG at fs/cachefiles/namei.c:163!

  [Cause]
  In a heavily loaded system with big files being read and truncated,
  an fscache object for a cookie is being dropped while a new object is
  being looked up. The new object has to wait for the old object to go
  away before it can be moved to the active state.

  [Fix]
   Clear the CACHEFILES_OBJECT_ACTIVE flag for the new object when retrying
   the object lookup.
   Remove the BUG() for the case where the old object is still being dropped,
   and convert it to a WARN().
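
A C11 sketch of the retry loop, assuming a single flag bit standing in for CACHEFILES_OBJECT_ACTIVE and a simulated number of attempts before the old object goes away. The names are hypothetical. Without the atomic_fetch_and() that clears the bit, the second attempt would find the bit already set and take the warning path (a BUG() before the fix).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

#define OBJECT_ACTIVE 0x1u  /* hypothetical stand-in for CACHEFILES_OBJECT_ACTIVE */

struct object {
	atomic_uint flags;
};

/* Sets the active bit; if it was already set, warn and fail instead of
 * crashing (the "convert BUG() to WARN()" half of the fix). */
static bool mark_object_active(struct object *obj)
{
	unsigned int old = atomic_fetch_or(&obj->flags, OBJECT_ACTIVE);

	if (old & OBJECT_ACTIVE) {
		fprintf(stderr, "WARN: object already active\n");
		return false;
	}
	return true;
}

/* Simulated lookup: the old object has gone once attempt reaches
 * old_gone_at. Each failed attempt clears the bit it set (the other
 * half of the fix) so the retry starts from a clean state. */
static bool lookup_object(struct object *obj, int old_gone_at, int max_tries)
{
	assert(max_tries > 0);
	for (int attempt = 0; attempt < max_tries; attempt++) {
		if (!mark_object_active(obj))
			return false;
		if (attempt >= old_gone_at)
			return true;                 /* stays marked active */
		atomic_fetch_and(&obj->flags, ~OBJECT_ACTIVE);  /* retry cleanly */
	}
	return false;
}
```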

  [Testcase]
  A user has run ~100 hours of NFS stress tests and not seen this bug recur.

  [Regression Potential]
   - Limited to fscache/cachefiles.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1776254/+subscriptions



[Kernel-packages] [Bug 1788222] [NEW] Bugfix for handling of shadow doorbell buffer

2018-08-21 Thread David Coronel
Public bug reported:

This is a request to pull the following NVMe bug fix from
https://lore.kernel.org/patchwork/patch/974880/ into the linux-gcp kernel:

===
This patch adds full memory barrier into nvme_dbbuf_update_and_check_event
function to ensure that the shadow doorbell is written before reading
EventIdx from memory. This is a critical bugfix for initial patch that
added support for shadow doorbell into NVMe driver[...].
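
A userspace C11 sketch of the ordering the patch enforces: write the shadow doorbell, issue a full fence, then read EventIdx. The need_event() comparison follows the wrap-around-safe form of the Linux NVMe driver's nvme_dbbuf_need_event(); the struct layout and the other names are assumptions. A full (seq_cst) fence is needed because release/acquire ordering alone does not stop a later load from being satisfied before an earlier store becomes visible.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* True when event_idx lies in [old, new_idx) modulo 2^16, i.e. this
 * update moved past the index the controller asked to be notified at. */
static bool need_event(uint16_t event_idx, uint16_t new_idx, uint16_t old)
{
	return (uint16_t)(new_idx - event_idx - 1) < (uint16_t)(new_idx - old);
}

/* Hypothetical shared page: the driver writes shadow_db, the controller
 * writes event_idx. */
struct dbbuf {
	_Atomic uint32_t shadow_db;
	_Atomic uint32_t event_idx;
};

/* Returns true if the real (MMIO) doorbell must still be written. */
static bool dbbuf_update_and_check_event(struct dbbuf *b, uint16_t value)
{
	uint16_t old = (uint16_t)atomic_load_explicit(&b->shadow_db, memory_order_relaxed);

	atomic_store_explicit(&b->shadow_db, value, memory_order_relaxed);
	/* Full fence: keeps the store above from being reordered after the
	 * load below (StoreLoad); the userspace analogue of the patch's
	 * full memory barrier. */
	atomic_thread_fence(memory_order_seq_cst);
	uint16_t ev = (uint16_t)atomic_load_explicit(&b->event_idx, memory_order_relaxed);

	return need_event(ev, value, old);
}
```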

** Affects: linux-gcp (Ubuntu)
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux-gcp in Ubuntu.
https://bugs.launchpad.net/bugs/1788222

Title:
  Bugfix for handling of shadow doorbell buffer

Status in linux-gcp package in Ubuntu:
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux-gcp/+bug/1788222/+subscriptions



[Kernel-packages] [Bug 1783246] Re: Cephfs + fscache: unable to handle kernel NULL pointer dereference at 0000000000000000 IP: jbd2__journal_start+0x22/0x1f0

2018-08-10 Thread David Coronel
I have confirmation from a user who has done verification for this
kernel. Changing to verification-done-bionic.

** Tags removed: verification-needed-bionic
** Tags added: verification-done-bionic

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1783246

Title:
  Cephfs + fscache: unable to handle kernel NULL pointer dereference at
  0000000000000000 IP: jbd2__journal_start+0x22/0x1f0

Status in linux package in Ubuntu:
  Confirmed
Status in linux source package in Bionic:
  Fix Committed

Bug description:
  SRU Justification
  -----------------

  [Impact]
  Certain sequences of file system operations on a cephfs volume backed
  by fscache with an ext4 store can cause a kernel BUG:

  [ 5818.932770] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
  [ 5818.934354] IP: jbd2__journal_start+0x33/0x1e0
  ...
  [ 5818.962490] Call Trace:
  [ 5818.963055] ? ext4_writepages+0x5d5/0xf40
  [ 5818.963884] __ext4_journal_start_sb+0x6d/0x120
  [ 5818.964994] ext4_writepages+0x5d5/0xf40
  [ 5818.965991] ? __enqueue_entity+0x5c/0x60
  [ 5818.966791] ? check_preempt_wakeup+0x130/0x240
  [ 5818.967679] do_writepages+0x4b/0xe0
  [ 5818.968625] ? ext4_mark_inode_dirty+0x1d0/0x1d0
  [ 5818.969526] ? do_writepages+0x4b/0xe0
  [ 5818.970493] ? ext4_statfs+0x114/0x260
  [ 5818.971267] __filemap_fdatawrite_range+0xc1/0x100
  [ 5818.972425] ? __filemap_fdatawrite_range+0xc1/0x100
  [ 5818.973385] filemap_write_and_wait+0x31/0x90
  [ 5818.974461] ext4_bmap+0x8c/0xe0
  [ 5818.975150] cachefiles_read_or_alloc_pages+0x1bf/0xd90 [cachefiles]
  [ 5818.976718] ? _cond_resched+0x19/0x40
  [ 5818.977482] ? wake_up_bit+0x42/0x50
  [ 5818.978227] ? fscache_run_op.isra.8+0x4c/0x80 [fscache]
  [ 5818.979249] __fscache_read_or_alloc_pages+0x1d3/0x2e0 [fscache]
  [ 5818.980397] ceph_readpages_from_fscache+0x6c/0xe0 [ceph]
  [ 5818.981630] ceph_readpages+0x49/0x100 [ceph]
  [ 5818.982691] __do_page_cache_readahead+0x1c9/0x2c0
  [ 5818.983628] ? __cap_is_valid+0x21/0xb0 [ceph]
  [ 5818.984526] ondemand_readahead+0x11a/0x2a0
  [ 5818.985374] ? ondemand_readahead+0x11a/0x2a0
  [ 5818.986825] page_cache_async_readahead+0x71/0x80
  [ 5818.987751] generic_file_read_iter+0x784/0xbf0
  [ 5818.988663] ? ceph_put_cap_refs+0x1c4/0x330 [ceph]
  [ 5818.989620] ? page_cache_tree_insert+0xe0/0xe0
  [ 5818.990519] ceph_read_iter+0x106/0x820 [ceph]
  [ 5818.991818] new_sync_read+0xe4/0x130
  [ 5818.992588] __vfs_read+0x29/0x40
  [ 5818.993504] vfs_read+0x8e/0x130
  [ 5818.994192] SyS_read+0x55/0xc0
  [ 5818.994870] do_syscall_64+0x73/0x130
  [ 5818.995632] entry_SYSCALL_64_after_hwframe+0x3d/0xa2

  [Fix]
  Cherry-pick 5d988308283ecf062fa88f20ae05c52cce0bcdca from upstream.

  This patch stops cephfs from reusing current->journal for its own
  internal use, so the pointer holds a valid value when ext4 uses it
  via fscache.
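
A C sketch of why sharing that per-task pointer breaks, assuming a simplified lower layer that treats any non-NULL value in the slot as its own open journal handle. struct task, the journal_info field and the helper names are hypothetical models of the idea described above, not the ceph, ext4 or jbd2 code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-task state with one slot that, by convention,
 * belongs to the local filesystem's journalling layer. */
struct task {
	void *journal_info;
};

/* Lower layer (ext4/jbd2 in the report): a non-NULL slot is trusted to
 * be a journal handle this task already holds and is reused as-is.
 * Returns the handle it would use. */
static void *journal_start(struct task *t, void *fresh_handle)
{
	if (t->journal_info)
		return t->journal_info;
	return fresh_handle;
}

/* Old pattern: the network filesystem parks its own read context in
 * the shared slot while calling down through fscache, so the lower
 * layer picks up the wrong object as a journal handle. */
static void *read_sharing_slot(struct task *t, void *read_ctx, void *fresh_handle)
{
	t->journal_info = read_ctx;
	void *used = journal_start(t, fresh_handle);
	t->journal_info = NULL;
	return used;
}

/* Fixed pattern: the read context travels in the filesystem's own
 * state (a plain parameter here), and the shared slot is left alone. */
static void *read_private_ctx(struct task *t, void *read_ctx, void *fresh_handle)
{
	(void)read_ctx;   /* would be consulted by the filesystem's own code */
	return journal_start(t, fresh_handle);
}
```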

  [Testcase]
  A user has been using the following test case:
  ( cat /proc/fs/fscache/stats > ~/test.log; i=0; while true; do
      touch small; echo 3 > /proc/sys/vm/drop_caches & md5sum small; let "i++";
      if ! (( $i % 1000 )); then
          echo "Test iteration $i done" >> ~/test.log; cat /proc/fs/fscache/stats >> ~/test.log;
      fi;
  done ) > ~/nohup.out 2>&1

  (It boils down to "touch file; drop caches; read file")
  Without the patch, this fails very quickly: usually on the first
  iteration, and always within a few. With the patch, the user ran this
  loop for over 60 hours without incident.

  [Regression potential]
  The change is not trivial, but it is limited to cephfs and has been in
  mainline since v4.16, so the risk of regression is well contained.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1783246/+subscriptions



[Kernel-packages] [Bug 1786110] [NEW] SMB3: Fix regression in server reconnect detection

2018-08-08 Thread David Coronel
Public bug reported:

Request to pull this patch from upstream into the next 4.15 kernel
update:

https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=b2adf22fdfba85a6701c481faccdbbb3a418ccfc

This fixes a regression in SMB 2/3 reconnect detection. The fix has been
backported to the stable updates for 4.9.x, 4.14.x, 4.16.x and 4.17.x, but
the EOL kernels 4.10.x, 4.11.x, 4.12.x, 4.13.x and 4.15.x are also affected.

The highest priority is fixing linux-azure, but we recommend
cherry-picking this patch for all affected kernels.

** Affects: linux-azure (Ubuntu)
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux-azure in Ubuntu.
https://bugs.launchpad.net/bugs/1786110

Title:
  SMB3: Fix regression in server reconnect detection

Status in linux-azure package in Ubuntu:
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux-azure/+bug/1786110/+subscriptions



[Kernel-packages] [Bug 1774336] Re: FS-Cache: Assertion failed: FS-Cache: 6 == 5 is false

2018-06-19 Thread David Coronel
I have confirmation from a user who is running NFS stress tests on Xenial
with this kernel and has not seen any issues. Changed the tag to
verification-done-xenial.

** Tags removed: verification-needed-xenial
** Tags added: verification-done-xenial

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1774336

Title:
  FS-Cache: Assertion failed: FS-Cache: 6 == 5 is false

Status in linux package in Ubuntu:
  Confirmed
Status in linux source package in Trusty:
  Fix Committed
Status in linux source package in Xenial:
  Fix Committed
Status in linux source package in Artful:
  Fix Committed
Status in linux source package in Bionic:
  Fix Committed

Bug description:
  == SRU Justification ==

  [Impact]
  Oops during heavy NFS + FSCache use:

  [81738.886634] FS-Cache: 
  [81738.888281] FS-Cache: Assertion failed
  [81738.889461] FS-Cache: 6 == 5 is false
  [81738.890625] ------------[ cut here ]------------
  [81738.891706] kernel BUG at /build/linux-hVVhWi/linux-4.4.0/fs/fscache/operation.c:494!

  6 == 5 represents an operation being DEAD when it was not expected to
  be.

  [Cause]
  There is a race in fscache and cachefiles. 

  One thread is in cachefiles_read_waiter:
   1) object->work_lock is taken.
   2) the operation is added to the to_do list.
   3) the work lock is dropped.
   4) fscache_enqueue_retrieval is called, which takes a reference.

  Another thread is in cachefiles_read_copier:
   1) object->work_lock is taken.
   2) an item is popped off the to_do list.
   3) object->work_lock is dropped.
   4) some processing is done on the item, and fscache_put_retrieval() is called, dropping a reference.

  Now, if this processing in cachefiles_read_copier takes place
  *between* steps 3 and 4 in cachefiles_read_waiter, a reference is
  dropped before it is taken. The object's reference count then hits
  zero, lifecycle events for the object happen too soon, and the
  assertion fails later on.

  (This is simplified and clarified from the original upstream analysis
  for this patch at
  https://www.redhat.com/archives/linux-cachefs/2018-February/msg1.html
  and from a similar patch with a different approach to fixing the bug at
  https://www.redhat.com/archives/linux-cachefs/2017-June/msg2.html)

  [Fix]
  Move fscache_enqueue_retrieval under the lock in cachefiles_read_waiter.
  This means that the object cannot be popped off the to_do list until it
  is in a fully consistent state with the reference taken.
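  The ordering problem and the fix can be replayed as a toy Python model.
  This is a single-threaded sketch, not kernel code: Retrieval, copier()
  and replay() are hypothetical stand-ins for an fscache retrieval
  operation, cachefiles_read_copier and the cachefiles_read_waiter steps,
  and the model assumes one reference is already held when the waiter
  starts. Replaying the copier between steps 3 and 4 frees the operation
  before fscache_enqueue_retrieval takes its reference; taking the
  reference while still holding the lock closes that window.

```python
import threading
from dataclasses import dataclass
from typing import List


@dataclass
class Retrieval:
    """Toy stand-in for an fscache retrieval operation (hypothetical model)."""
    refcount: int = 1             # one reference is already held elsewhere
    freed: bool = False           # set once refcount drops to zero
    use_after_free: bool = False  # set if a reference is taken after freeing

    def get(self) -> None:
        if self.freed:
            self.use_after_free = True
        self.refcount += 1

    def put(self) -> None:
        self.refcount -= 1
        if self.refcount == 0:
            self.freed = True     # lifecycle events would fire here, too soon


def copier(lock: threading.Lock, to_do: List[Retrieval]) -> None:
    """Model of cachefiles_read_copier: pop under the lock, put after it."""
    with lock:
        op = to_do.pop()
    op.put()


def replay(fixed: bool) -> Retrieval:
    """Deterministically replay the interleaving from the [Cause] section."""
    lock = threading.Lock()
    to_do: List[Retrieval] = []
    op = Retrieval()
    with lock:                    # waiter steps 1-3
        to_do.append(op)
        if fixed:
            op.get()              # fix: take the reference before unlocking
    copier(lock, to_do)           # copier gets the lock once the waiter drops it
    if not fixed:
        op.get()                  # step 4: the reference arrives too late
    return op
```

  replay(fixed=False) returns an operation with both freed and
  use_after_free set, while replay(fixed=True) leaves it alive with a
  reference count of 1.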

  [Testcase]
  A user has run ~100 hours of NFS stress tests and not seen this bug recur.

  [Regression Potential]
   - Limited to fscache/cachefiles.
   - The change makes things more conservative (more work is done under
     the lock), which is reassuring.
   - There may be performance impacts, but none have been observed so far.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1774336/+subscriptions



[Kernel-packages] [Bug 1762554] Re: [Hyper-V] IB/mlx5: Respect new UMR capabilities

2018-06-07 Thread David Coronel
@Marcelo, I have a reply from Madhuri:

Created By: Madhuri Kaniganti (portal) (07/06/2018 11:40 AM)
Just these two fixes are needed:

IB/mlx5: Respect new UMR capabilities
https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?h=next-20180409=c8d75a980fab886a9c716567e6b47cc414ad84ee


IB/mlx5: Enable ECN capable bits for UD RoCE v2 QPs
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/drivers/infiniband/hw/mlx5?id=ea8af0d2f2b5b16da4553205ddaf225e0a057e03

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux-azure in Ubuntu.
https://bugs.launchpad.net/bugs/1762554

Title:
  [Hyper-V] IB/mlx5: Respect new UMR capabilities

Status in linux-azure package in Ubuntu:
  Confirmed
Status in linux-azure source package in Bionic:
  Confirmed

Bug description:
  In some firmware configurations, UMR usage from Virtual Functions is
  restricted. This information is published to the driver using new
  capability bits.

  Avoid using UMRs in these cases and use the firmware slow-path flow to
  create mkeys and populate them with the virtual-to-physical address
  translation.

  Older drivers that do not have this patch end up using memory keys
  that are not populated with the virtual-to-physical address translation
  that is done as part of the UMR work.
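  The capability-driven choice described above can be sketched as a toy
  Python model. HcaCaps, umr_restricted, choose_mkey_path() and
  register_memory() are hypothetical names, not the mlx5 driver's fields
  or functions; the point is only that a driver which ignores the
  capability bit takes the UMR path and leaves the translation
  unpopulated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HcaCaps:
    """Hypothetical capability bits reported by firmware to a Virtual Function."""
    umr_restricted: bool  # hypothetical name, not a real mlx5 capability field


def choose_mkey_path(caps: HcaCaps) -> str:
    """Pick how an mkey gets its virtual-to-physical translation."""
    if caps.umr_restricted:
        # Firmware forbids UMR from this VF: create and populate the mkey
        # through the slower firmware command path instead.
        return "firmware-slow-path"
    return "umr"


def register_memory(caps: HcaCaps, respects_caps: bool = True) -> dict:
    """Model a memory registration by a patched or an older driver."""
    path = choose_mkey_path(caps) if respects_caps else "umr"
    # In this model an older driver takes the UMR path regardless of the
    # capability bit, so the translation is never filled in.
    populated = not (path == "umr" and caps.umr_restricted)
    return {"path": path, "translation_populated": populated}
```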

  https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?h=next-20180409=c8d75a980fab886a9c716567e6b47cc414ad84ee

  Also pull this patch:
  IB/mlx5: Enable ECN capable bits for UD RoCE v2 QPs
  https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/drivers/infiniband/hw/mlx5?id=ea8af0d2f2b5b16da4553205ddaf225e0a057e03

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux-azure/+bug/1762554/+subscriptions



[Kernel-packages] [Bug 1756197] Re: Kernel panic BUG_ON in 4.4.0-92-generic fs/ext4/inode.c:1894!

2018-06-01 Thread David Coronel
Marking this bug as Invalid, as the root cause of the issue appears to
have been the user's own module.

** Changed in: linux (Ubuntu)
   Status: Incomplete => Invalid

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1756197

Title:
  Kernel panic BUG_ON in 4.4.0-92-generic fs/ext4/inode.c:1894!

Status in linux package in Ubuntu:
  Invalid

Bug description:
  User reports a kernel panic under load:

  [2503476.606215] kernel BUG at /build/linux-hVVhWi/linux-4.4.0/fs/ext4/inode.c:1894!
  [2503476.606236] invalid opcode:  [#1] SMP 
  [2503476.606249] Modules linked in: veth ipt_MASQUERADE 
nf_nat_masquerade_ipv4 nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo 
iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype 
xt_conntrack nf_nat nf_conntrack br_netfilter bridge stp llc overlay 
xt_multiport iptable_filter ip_tables x_tables nv_peer_mem(OE) cachefiles 
fscache msr rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) configfs ib_ipoib(OE) 
ib_cm(OE) ib_uverbs(OE) ib_umad(OE) mlx5_ib(OE) mlx4_ib(OE) ib_core(OE) 
mlx4_en(OE) mlx4_core(OE) ipmi_ssif mxm_wmi intel_rapl x86_pkg_temp_thermal 
intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul 
ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper 
cryptd joydev input_leds sb_edac edac_core nvidia_uvm(POE) lpc_ich mei_me mei 
shpchp
  [2503476.606483]  8250_fintek ipmi_si acpi_power_meter wmi mac_hid 
ipmi_devintf ipmi_msghandler knem(OE) sunrpc autofs4 nvidia_drm(POE) 
nvidia_modeset(POE) ses enclosure ast ixgbe i2c_algo_bit dca ttm nvidia(POE) 
drm_kms_helper syscopyarea hid_generic vxlan sysfillrect sysimgblt 
ip6_udp_tunnel mlx5_core(OE) usbhid udp_tunnel mlx_compat(OE) megaraid_sas 
fb_sys_fops hid ahci ptp libahci drm pps_core mdio fjes
  [2503476.606608] CPU: 23 PID: 27629 Comm: kworker/u162:3 Tainted: P   OE   4.4.0-92-generic #115-Ubuntu
  [2503476.606632] Hardware name: NVIDIA DGX-1 with V100/DGX-1 with V100, BIOS S2W_3A04 08/29/2017
  [2503476.606659] Workqueue: writeback wb_workfn (flush-8:0)
  [2503476.606675] task: 884b48124600 ti: 887a04a58000 task.ti: 887a04a58000
  [2503476.606695] RIP: 0010:[]  [] ext4_writepage+0x2ec/0x540
  [2503476.606721] RSP: 0018:887a04a5b858  EFLAGS: 00010246
  [2503476.606735] RAX: 0501ef00016d RBX: 1000 RCX: 0038
  [2503476.606754] RDX: 8827720d3490 RSI: 887a04a5bbd8 RDI: ea010f9c1c00
  [2503476.606772] RBP: 887a04a5b8c0 R08: 0001a8c0 R09: 0002
  [2503476.606790] R10: 88807fff8000 R11: 0033 R12: 8827720d3328
  [2503476.606808] R13: 887a04a5bbd8 R14: ea010f9c1c00 R15: ea010f9c1c00
  [2503476.606828] FS:  () GS:887f7ecc() knlGS:
  [2503476.606848] CS:  0010 DS:  ES:  CR0: 80050033
  [2503476.606864] CR2: 7f4fb11430b0 CR3: 02e0a000 CR4: 003406e0
  [2503476.606882] DR0:  DR1:  DR2:
  [2503476.606916] DR3:  DR6: fffe0ff0 DR7: 0400
  [2503476.606935] Stack:
  [2503476.606943]  811ceba3 8118f139 887a04a5b864 811cd450
  [2503476.606969]    0246 63a00f9eaf8e9dde
  [2503476.606991]  8827720d3490 887a04a5bbd8 8827720d3490 ea010f9c1c00
  [2503476.607013] Call Trace:
  [2503476.607025]  [] ? page_mkclean+0x73/0xa0
  [2503476.607043]  [] ? page_referenced_one+0x1a0/0x1a0
  [2503476.607062]  [] __writepage+0x12/0x30
  [2503476.607080]  [] write_cache_pages+0x1ee/0x510
  [2503476.607097]  [] ? ext4_writepage+0x540/0x540
  [2503476.607114]  [] ext4_writepages+0x195/0xd30
  [2503476.607132]  [] ? generic_writepages+0x5b/0x80
  [2503476.607149]  [] do_writepages+0x1e/0x30
  [2503476.607165]  [] __writeback_single_inode+0x45/0x340
  [2503476.607201]  [] writeback_sb_inodes+0x262/0x600
  [2503476.607218]  [] __writeback_inodes_wb+0x8c/0xc0
  [2503476.607236]  [] wb_writeback+0x253/0x310
  [2503476.607251]  [] wb_workfn+0x24d/0x400
  [2503476.607268]  [] process_one_work+0x165/0x480
  [2503476.607285]  [] worker_thread+0x4b/0x4c0
  [2503476.607300]  [] ? process_one_work+0x480/0x480
  [2503476.607318]  [] kthread+0xe5/0x100
  [2503476.607332]  [] ? kthread_create_on_node+0x1e0/0x1e0
  [2503476.607352]  [] ret_from_fork+0x3f/0x70
  [2503476.607368]  [] ? kthread_create_on_node+0x1e0/0x1e0
  [2503476.607385] Code: 74 2b 49 8b 94 24 68 ff ff ff 80 e6 40 74 07 a9 00 00 00 08 74 17 25 00 08 00 00 3d 00 08 00 00 0f 84 aa fd ff ff e8 15 ed 08 00 <0f> 0b 49 8b 84 24 68 ff ff ff f6 c4 08 0f 85 92 fd ff ff e9 79
  [2503476.607525] RIP  [] ext4_writepage+0x2ec/0x540
  [2503476.607543]  RSP 
  [2503476.613569] ---[ end trace 03d35738081084c6 ]---
  [2503476.690872] BUG: unable to handle 

[Kernel-packages] [Bug 1750038] Re: user space process hung in 'D' state waiting for disk io to complete

2018-06-01 Thread David Coronel
I have confirmation from the user who reported this that they tested
the new kernel 4.4.0-128.154 in their test environment with a 24-hour
stress test and could not reproduce the issue. I changed the tag to
verification-done-xenial.

** Tags removed: verification-needed-xenial
** Tags added: verification-done-xenial

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1750038

Title:
  user space process hung in 'D' state waiting for disk io to complete

Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Xenial:
  Fix Committed

Bug description:
  == SRU Justification ==

  [Impact]
  Occasionally an application gets stuck in "D" state on NFS reads/sync
  and close system calls. All subsequent operations on the NFS mounts are
  stuck, and a reboot is required to rectify the situation.

  [Fix]
  Use GFP_NOIO for some allocations in writeback to avoid a deadlock.
  This is upstream in:
  ae97aa524ef4 ("NFS: Use GFP_NOIO for two allocations in writeback")
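  Why GFP_NOIO helps can be sketched with a toy Python model of
  allocation-driven reclaim re-entering writeback. The flag values,
  NfsWriteback and its methods are hypothetical and only loosely mirror
  the kernel's rule that GFP_NOIO still allows reclaim but forbids it
  from starting I/O; the real deadlock involves memory-cgroup reclaim
  waiting on NFS pages and is more involved than this.

```python
from dataclasses import dataclass

# Made-up flag values for this model, not the real kernel constants.
GFP_IO = 0x1
GFP_FS = 0x2
GFP_KERNEL = GFP_IO | GFP_FS
GFP_NOIO = 0  # reclaim may still run, but must not start I/O


class Deadlock(Exception):
    """Raised when reclaim would wait on the writeback that triggered it."""


@dataclass
class NfsWriteback:
    in_progress: bool = False

    def write_dirty_pages(self, gfp: int, memory_tight: bool) -> None:
        if self.in_progress:
            # Reclaim re-entered writeback while the outer pass is still
            # allocating; in the kernel this shows up as a task stuck in "D".
            raise Deadlock("reclaim waits on the allocating writeback")
        self.in_progress = True
        try:
            self.allocate(gfp, memory_tight)
        finally:
            self.in_progress = False

    def allocate(self, gfp: int, memory_tight: bool) -> bytearray:
        if memory_tight and gfp & GFP_IO:
            # Reclaim is allowed to issue I/O, so it tries to flush dirty
            # NFS pages, i.e. it calls back into writeback.
            self.write_dirty_pages(gfp, memory_tight=False)
        return bytearray(64)
```

  With GFP_KERNEL under memory pressure the model raises Deadlock; with
  GFP_NOIO the allocation completes without re-entering writeback.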

  [Testcase]
  See Test scenario in previous description.

  A test kernel with this patch was tested heavily (>100 hours of the test
  suite) without issue.

  [Regression Potential]
  This changes the memory allocation policy used by two allocations in
  NFS writeback, so it could potentially affect NFS.

  However, the patch is already in Artful and Bionic without issue.

  The patch does not apply to Trusty.

  == Previous Description ==

  A user running Ubuntu Xenial reports processes hanging in D state
  waiting for disk I/O.

  Occasionally one of the applications gets into "D" state on NFS
  reads/sync and close system calls. Based on the kernel backtraces, it
  seems to be stuck in a kmalloc allocation during cleanup of dirty NFS
  pages.

  All subsequent operations on the NFS mounts are stuck, and a reboot
  is required to rectify the situation.

  [Test scenario]

  1) Applications run in a Docker environment.
  2) Applications have cgroup limits (--cpu-shares, --memory, -shm-limit).
  3) Python and C++ based applications (torch and caffe).
  4) Applications read big lmdb files and write results to NFS shares.
  5) NFS v3 is used with the hard mount option, and fscache is enabled.
  6) No swap space is configured.

  This causes all other I/O activity on that mount to hang.

  We are running into this issue more frequently and have identified a
  few applications causing this problem.

  As noted in the updated description, the problem seems to happen when
  exercising the stack:

  try_to_free_mem_cgroup_pages+0xba/0x1a0

  We see this with Docker containers that use the cgroup option --memory.

  Whenever there is a deadlock, we see that the hung process has reached
  the maximum cgroup limit multiple times and typically cleans up dirty
  data and caches to bring its usage under the limit.

  This reclaim path runs many times, and eventually we probably hit a
  race and get into a deadlock.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1750038/+subscriptions



[Kernel-packages] [Bug 1750038] Re: user space process hung in 'D' state waiting for disk io to complete

2018-05-28 Thread David Coronel
Created By: David Coronel (28/05/2018 4:28 PM)
Good day Sam, Jamie,

The kernel 4.4.0-128-generic #154 is now ready for testing by Nvidia:

linux | 4.4.0-128.154 | xenial-proposed | source
linux-image-4.4.0-128-generic | 4.4.0-128.154 | xenial-proposed | amd64, arm64, armhf, i386, ppc64el, s390x

You have 5 working days from today to test and report results. If you
don't test this package, the patch will be reverted from the kernel.

You can find the instructions to test from -proposed here:
https://wiki.ubuntu.com/Testing/EnableProposed

Let me know if you have any questions.

Thank you,
David

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1750038

Title:
  user space process hung in 'D' state waiting for disk io to complete

Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Xenial:
  Fix Committed


To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1750038/+subscriptions



[Kernel-packages] [Bug 1744117] [NEW] [SRU][HWE] ALSA backport missing NVIDIA GPU codec IDs to patch table to Ubuntu 16.04 LTS Kernel

2018-01-18 Thread David Coronel
Public bug reported:

Add support for 4/6/8 channel sound in the 16.04 LTS kernel for future
GPUs. This commit should be what is needed:

https://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound.git/commit/?id=74ec118152ea494a25ebb677cbc83a75c982ac5f

ALSA: hda - Add missing NVIDIA GPU codec IDs to patch table

Add codec IDs for several recently released, pending, and historical
NVIDIA GPU audio controllers to the patch table, to allow the correct
patch functions to be selected for them.

** Affects: linux (Ubuntu)
 Importance: Undecided
 Status: New

** Affects: linux (Ubuntu Xenial)
 Importance: Medium
 Assignee: Eric Desrochers (slashd)
 Status: In Progress


** Tags: sts

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1744117

Title:
  [SRU][HWE] ALSA backport missing NVIDIA GPU codec IDs to patch table
  to Ubuntu 16.04 LTS Kernel

Status in linux package in Ubuntu:
  New
Status in linux source package in Xenial:
  In Progress


To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1744117/+subscriptions



[Kernel-packages] [Bug 1739107] [NEW] linux-cloud-tools-common: Ensure hv-kvp-daemon.service starts before walinuxagent.service

2017-12-19 Thread David Coronel
Public bug reported:

This is a request to change the hv-kvp-daemon systemd service, which is
part of the linux-cloud-tools-common package, so that the hv-kvp-daemon
service starts before the walinuxagent service. With the default
dependencies, hv-kvp-daemon waits until the whole system is up before it
can start.

Currently the /lib/systemd/system/hv-kvp-daemon.service file looks like
this:

 
# On Azure/Hyper-V systems start the hv_kvp_daemon 
# 
# author "Andy Whitcroft " 
[Unit] 
Description=Hyper-V KVP Protocol Daemon 
ConditionVirtualization=microsoft 

[Service] 
ExecStart=/usr/sbin/hv_kvp_daemon -n 

[Install] 
WantedBy=multi-user.target 
 

The suggested modification is to make the [Unit] section look like this:

[Unit] 
Description=Hyper-V KVP Protocol Daemon 
ConditionVirtualization=microsoft 
DefaultDependencies=no 
After=systemd-remount-fs.service 
Before=shutdown.target cloud-init-local.service walinuxagent.service 
Conflicts=shutdown.target 
RequiresMountsFor=/var/lib/hyperv
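A minimal Python sketch for inspecting the ordering settings of the
proposed [Unit] section with the standard-library configparser.
unit_ordering() is a hypothetical helper, and configparser only
approximates systemd's unit-file syntax (for example, it rejects the
repeated keys that systemd accepts), so this is a reading aid rather than
a validator; systemd-analyze verify is the tool for checking a real unit
file.

```python
import configparser

PROPOSED_UNIT = """\
[Unit]
Description=Hyper-V KVP Protocol Daemon
ConditionVirtualization=microsoft
DefaultDependencies=no
After=systemd-remount-fs.service
Before=shutdown.target cloud-init-local.service walinuxagent.service
Conflicts=shutdown.target
RequiresMountsFor=/var/lib/hyperv
"""


def unit_ordering(text: str) -> dict:
    """Return the ordering-related settings of a unit's [Unit] section."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep systemd's case-sensitive key names
    parser.read_string(text)
    unit = parser["Unit"]
    return {
        # systemd's default for DefaultDependencies= is "yes"
        "default_deps": unit.get("DefaultDependencies", "yes"),
        "after": unit.get("After", "").split(),
        "before": unit.get("Before", "").split(),
    }
```

For the proposed section, "before" lists cloud-init-local.service and
walinuxagent.service, and DefaultDependencies=no removes the implicit
ordering after basic.target that otherwise delays the daemon.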


The hv-kvp-daemon service is not currently part of the critical-chain: 

$ systemd-analyze critical-chain 
The time after the unit is active or started is printed after the "@" character.
The time the unit takes to start is printed after the "+" character. 

graphical.target @10.809s 
└─multi-user.target @10.723s 
└─ephemeral-disk-warning.service @10.538s +31ms 
└─cloud-config.service @8.249s +2.252s 
└─basic.target @8.044s 
└─sockets.target @8.019s 
└─snapd.socket @7.692s +264ms 
└─sysinit.target @6.719s 
└─cloud-init.service @5.803s +842ms 
└─networking.service @5.137s +612ms 
└─network-pre.target @5.074s 
└─cloud-init-local.service @2.257s +2.783s 
└─systemd-remount-fs.service @1.368s +656ms 
└─systemd-journald.socket @1.218s 
└─-.mount @649ms 
└─system.slice @653ms 
└─-.slice @649ms 

In my test Azure VM, the current startup time is:
$ systemd-analyze 
Startup finished in 10.375s (kernel) + 12.352s (userspace) = 22.728s 


After making the suggested change, the startup time is similar: 

$ systemd-analyze 
Startup finished in 9.759s (kernel) + 11.867s (userspace) = 21.627s 

And the service is now in the critical-chain:

$ systemd-analyze critical-chain 
The time after the unit is active or started is printed after the "@" character.
The time the unit takes to start is printed after the "+" character. 

graphical.target @10.666s 
└─multi-user.target @10.636s 
└─ephemeral-disk-warning.service @10.556s +36ms 
└─cloud-config.service @8.423s +2.095s 
└─basic.target @8.124s 
└─sockets.target @8.101s 
└─lxd.socket @7.677s +326ms 
└─sysinit.target @6.755s 
└─cloud-init.service @5.814s +908ms 
└─networking.service @5.111s +651ms 
└─network-pre.target @5.087s 
└─cloud-init-local.service @2.345s +2.707s 
└─hv-kvp-daemon.service @2.316s 
└─systemd-remount-fs.service @1.253s +680ms 
└─system.slice @1.225s 
└─-.slice @650ms 
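The two critical-chain listings can also be compared mechanically. A
small Python sketch, assuming the "unit @time +duration" line format
shown above; parse_chain() and in_chain() are hypothetical helpers, and
durations printed in minutes (such as "1min 2s") are not handled.

```python
import re
from typing import List, NamedTuple, Optional


class ChainEntry(NamedTuple):
    unit: str
    active_at: float             # seconds, the value after "@"
    start_time: Optional[float]  # seconds, the value after "+", if shown


# Optional tree prefix ("└─" plus indentation), unit name, "@time", "+duration".
LINE_RE = re.compile(
    r"^[\s└─]*(?P<unit>\S+) @(?P<at>[\d.]+m?s)(?: \+(?P<plus>[\d.]+m?s))?\s*$"
)


def to_seconds(token: str) -> float:
    """Convert "908ms" or "2.316s" to seconds."""
    if token.endswith("ms"):
        return float(token[:-2]) / 1000
    return float(token[:-1])


def parse_chain(text: str) -> List[ChainEntry]:
    entries = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match is None:
            continue  # header lines and blank lines
        plus = match.group("plus")
        entries.append(ChainEntry(
            unit=match.group("unit"),
            active_at=to_seconds(match.group("at")),
            start_time=to_seconds(plus) if plus else None,
        ))
    return entries


def in_chain(text: str, unit: str) -> bool:
    return any(entry.unit == unit for entry in parse_chain(text))
```

Run against the second listing, in_chain(text, "hv-kvp-daemon.service")
is true; against the first listing it is false.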

The ConditionVirtualization=microsoft line ensures that this change does
not affect non-Microsoft virtualization environments (e.g. QEMU, KVM,
VMware, Xen).

** Affects: linux (Ubuntu)
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1739107

Title:
  linux-cloud-tools-common: Ensure hv-kvp-daemon.service starts before
  walinuxagent.service

Status in linux package in Ubuntu:
  New


[Kernel-packages] [Bug 1737033] Re: upgrading linux-image package to 4.4.0-103.126 breaks Ceph network file system connection

2017-12-13 Thread David Coronel
I also confirm the kernel 4.4.0-104.127 in xenial-proposed fixes the
issue. I am able to mount my CephFS filesystem normally. Benjamin also
confirmed the kernel works for him in comments #26 and #27. I am
changing the tag to verification-done-xenial. Thanks!

** Tags removed: verification-needed-xenial
** Tags added: verification-done-xenial

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1737033

Title:
  upgrading linux-image package to 4.4.0-103.126 breaks Ceph network
  file system connection

Status in ceph package in Ubuntu:
  Confirmed
Status in linux package in Ubuntu:
  Triaged
Status in ceph source package in Xenial:
  Confirmed
Status in linux source package in Xenial:
  Fix Committed

Bug description:
  After clients have upgraded to 4.4.0-103.126 they can no longer
  connect to the Ceph network.

  Using the Grub menu to boot the previous kernel fixes the issue.

  The error in dmesg is:

  [   46.811897] FS-Cache: Loaded
  [   46.843670] Key type ceph registered
  [   46.844177] libceph: loaded (mon/osd proto 15/24)
  [   46.863107] FS-Cache: Netfs 'ceph' registered for caching
  [   46.863116] ceph: loaded (mds proto 32)
  [   46.884392] libceph: client3354099 fsid 2efbeab1-4903-4c4c-8365-6778afecbcbd
  [   46.886856] libceph: mon0 10.10.2.111:6789 session established
  [   46.897487] ceph: problem parsing mds trace -5
  [   46.897491] ceph: mds parse_reply err -5
  [   46.897492] ceph: mdsc_handle_reply got corrupt reply mds0(tid:1)

  All clients are running ceph client version:
  ii  ceph-fs-common    10.2.9-0ubuntu0.16.04.1

  Server nodes are running 10.2.6 packages as supplied by Ceph.

  All 10.2.* versions are compatible. Using the previous kernel allows the
  connection to work.
  --- 
  ApportVersion: 2.20.1-0ubuntu2.14
  Architecture: amd64
  AudioDevicesInUse:
   USERPID ACCESS COMMAND
   /dev/snd/controlC0:  lightdm1310 F pulseaudio
  DistroRelease: Ubuntu 16.04
  HibernationDevice: RESUME=UUID=9e3c38ba-10bd-4183-b073-68d9d9a30a9b
  IwConfig: Error: [Errno 2] No such file or directory
  Lsusb:
   Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
   Bus 002 Device 002: ID 80ee:0021 VirtualBox USB Tablet
   Bus 002 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
  MachineType: innotek GmbH VirtualBox
  Package: linux (not installed)
  ProcEnviron:
   TERM=xterm-256color
   PATH=(custom, no user)
   LANG=en_US
   SHELL=/bin/bash
  ProcFB: 0 vboxdrmfb
  ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-4.4.0-103-generic root=UUID=5a37e891-beb9-477f-b934-3c05651acf68 ro quiet splash
  ProcVersionSignature: Ubuntu 4.4.0-103.126-generic 4.4.98
  PulseList: Error: command ['pacmd', 'list'] failed with exit code 1: No PulseAudio daemon running, or not running as session daemon.
  RelatedPackageVersions:
   linux-restricted-modules-4.4.0-103-generic N/A
   linux-backports-modules-4.4.0-103-generic  N/A
   linux-firmware 1.157.14
  RfKill:
   
  Tags:  xenial xenial
  Uname: Linux 4.4.0-103-generic x86_64
  UnreportableReason: The report belongs to a package that is not installed.
  UpgradeStatus: No upgrade log present (probably fresh install)
  UserGroups:
   
  _MarkForUpload: False
  dmi.bios.date: 12/01/2006
  dmi.bios.vendor: innotek GmbH
  dmi.bios.version: VirtualBox
  dmi.board.name: VirtualBox
  dmi.board.vendor: Oracle Corporation
  dmi.board.version: 1.2
  dmi.chassis.type: 1
  dmi.chassis.vendor: Oracle Corporation
  dmi.modalias: dmi:bvninnotekGmbH:bvrVirtualBox:bd12/01/2006:svninnotekGmbH:pnVirtualBox:pvr1.2:rvnOracleCorporation:rnVirtualBox:rvr1.2:cvnOracleCorporation:ct1:cvr:
  dmi.product.name: VirtualBox
  dmi.product.version: 1.2
  dmi.sys.vendor: innotek GmbH

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1737033/+subscriptions



[Kernel-packages] [Bug 1728739] Re: Attempt to map rbd image from ceph jewel/luminous hangs

2017-12-08 Thread David Coronel
FYI, this patch seems to have introduced a regression, see
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1737033

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1728739

Title:
  Attempt to map rbd image from ceph jewel/luminous hangs

Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Xenial:
  Fix Released

Bug description:
  [Impact]

  Attempting to map an rbd image using the 4.4 LTS (Xenial) kernel
  client from a Jewel or Luminous Ceph cluster with optimal tunables
  fails due to a feature set mismatch.

  The Jewel release of Ceph introduced a new set of CRUSH tunables.
  Support for these tunables was first added in the 4.5 Linux kernel and
  is thus not available in the 16.04 LTS 4.4 Linux kernel. Attempts to
  map RBD images as block devices fail because the kernel cannot
  understand these new tunables:

  (from kern.log)

  Oct 30 21:19:05 ceph-7 kernel: [  815.674075] Key type ceph registered
  Oct 30 21:19:05 ceph-7 kernel: [  815.676862] libceph: loaded (mon/osd proto 15/24)
  Oct 30 21:19:05 ceph-7 kernel: [  815.678970] rbd: loaded (major 251)
  Oct 30 21:19:05 ceph-7 kernel: [  815.689556] libceph: mon0 10.5.0.19:6789 feature set mismatch, my 106b84a842a42 < server's 40106b84a842a42, missing 400
  Oct 30 21:19:05 ceph-7 kernel: [  815.692897] libceph: mon0 10.5.0.19:6789 missing required protocol features
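  The missing-feature value can be recomputed from the two masks in the
  mismatch line, assuming libceph reports the server's feature bits that
  the client does not support (server & ~mine). A small Python sketch;
  missing_features() and set_bits() are hypothetical helpers. From the
  two values shown, only bit 58 is missing; the shorter "missing" field
  as it appears in this archived log seems to have lost digits.

```python
import re

MISMATCH_RE = re.compile(r"my ([0-9a-f]+) < server's ([0-9a-f]+)")


def missing_features(line: str) -> int:
    """Feature bits the server requires that the client does not advertise."""
    match = MISMATCH_RE.search(line)
    if match is None:
        raise ValueError("not a libceph feature-mismatch line")
    mine, server = (int(value, 16) for value in match.groups())
    return server & ~mine


def set_bits(mask: int) -> list:
    """Indices of the bits set in mask, lowest first."""
    return [bit for bit in range(mask.bit_length()) if mask >> bit & 1]


LINE = ("libceph: mon0 10.5.0.19:6789 feature set mismatch, "
        "my 106b84a842a42 < server's 40106b84a842a42, missing 400")
```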

  Support for the new CRUSH tunables was added in upstream kernel 4.5
  in http://www.spinics.net/lists/ceph-devel/msg28421.html.

  [Test Case]

  1. Deploy a Jewel or Luminous Ceph cluster.
  2. Create an rbd image suitable for the kernel client:
     $ rbd create --pool rbd --image-feature layering --size 1G test
  3. Map the rbd image to the local server:
     $ rbd map --pool rbd test

  [Regression Potential]

  Minimal. The code is limited to the kernel rbd driver, and the new code
  should primarily affect clients connecting to clusters with the new
  tunables options.

  [Additional Info]

  A workaround is to change the crush tunables configured for the Ceph
  cluster to a legacy version (hammer or lower) via:

  $ ceph osd crush tunables hammer

  However, changing the tunables to hammer prevents the cluster from
  taking advantage of newer placement strategies, which reduce the amount
  of data movement throughout the cluster.
  --- 
  AlsaDevices:
   total 0
   crw-rw 1 root audio 116,  1 Oct 31 01:23 seq
   crw-rw 1 root audio 116, 33 Oct 31 01:23 timer
  AplayDevices: Error: [Errno 2] No such file or directory
  ApportVersion: 2.20.1-0ubuntu2.10
  Architecture: amd64
  ArecordDevices: Error: [Errno 2] No such file or directory
  AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
  CRDA: N/A
  DistroRelease: Ubuntu 16.04
  Ec2AMI: ami-0001
  Ec2AMIManifest: FIXME
  Ec2AvailabilityZone: nova
  Ec2InstanceType: m1.small
  Ec2Kernel: unavailable
  Ec2Ramdisk: unavailable
  IwConfig: Error: [Errno 2] No such file or directory
  Lsusb: Bus 001 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
  MachineType: OpenStack Foundation OpenStack Nova
  Package: linux (not installed)
  PciMultimedia:
   
  ProcEnviron:
   TERM=xterm-256color
   PATH=(custom, no user)
   LANG=en_US.UTF-8
   SHELL=/bin/bash
  ProcFB:
   
  ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-4.4.0-98-generic root=UUID=d7006b2f-ace6-464d-8b21-17180b3ed360 ro console=tty1 console=ttyS0
  ProcVersionSignature: Ubuntu 4.4.0-98.121-generic 4.4.90
  RelatedPackageVersions:
   linux-restricted-modules-4.4.0-98-generic N/A
   linux-backports-modules-4.4.0-98-generic  N/A
   linux-firmwareN/A
  RfKill: Error: [Errno 2] No such file or directory
  Tags:  xenial ec2-images
  Uname: Linux 4.4.0-98-generic x86_64
  UpgradeStatus: No upgrade log present (probably fresh install)
  UserGroups:
   
  _MarkForUpload: True
  dmi.bios.date: 04/01/2014
  dmi.bios.vendor: SeaBIOS
  dmi.bios.version: 1.10.1-1ubuntu1~cloud0
  dmi.chassis.type: 1
  dmi.chassis.vendor: QEMU
  dmi.chassis.version: pc-i440fx-zesty
  dmi.modalias: dmi:bvnSeaBIOS:bvr1.10.1-1ubuntu1~cloud0:bd04/01/2014:svnOpenStackFoundation:pnOpenStackNova:pvr15.0.6:cvnQEMU:ct1:cvrpc-i440fx-zesty:
  dmi.product.name: OpenStack Nova
  dmi.product.version: 15.0.6
  dmi.sys.vendor: OpenStack Foundation

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1728739/+subscriptions



[Kernel-packages] [Bug 1737033] Re: upgrading linux-image package to 4.4.0-103.126 breaks Ceph network file system connection

2017-12-08 Thread David Coronel
I tested a kernel without the following commits from
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1728739 and I was
able to mount the CephFS filesystem successfully:

http://kernel.ubuntu.com/git/ubuntu/ubuntu-xenial.git/commit/?id=f14ca6ac3b198f30ee138f02c3c7d380d165736e

http://kernel.ubuntu.com/git/ubuntu/ubuntu-xenial.git/commit/?id=09ade67fc6224440719a94e8c8add556eb036437

http://kernel.ubuntu.com/git/ubuntu/ubuntu-xenial.git/commit/?id=c043f509409e380281403f66c54794efbd5d0f02

http://kernel.ubuntu.com/git/ubuntu/ubuntu-xenial.git/commit/?id=74771960c7794994fd9067cdd8df0725b6ce75f6

http://kernel.ubuntu.com/git/ubuntu/ubuntu-xenial.git/commit/?id=5a03f3043cb617cf2cf5ec9cbd0685d0e86e8b0e

http://kernel.ubuntu.com/git/ubuntu/ubuntu-xenial.git/commit/?id=75ee940e71ed7841958cace371a5c79076a0378c

I spoke to the kernel team; their plan is to revert these commits and
re-upload the kernel (likely next week) to restore functionality.
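A minimal shell sketch of how such a test build could be prepared, assuming a local clone of the ubuntu-xenial tree checked out at the 4.4.0-103.126 release. The helper `commit_ids` is a hypothetical name (not part of any Ubuntu tooling); it extracts the hash from the `id=` parameter of each cgit commit URL listed above so the hashes can be passed to `git revert`. The order in which the commits were actually reverted for the test kernel is not stated here; if they touch the same code, the order may matter.

```shell
# commit_ids: read cgit commit URLs (one per line) on stdin and print the
# hash taken from each URL's "id=" query parameter. Lines without an
# "id=<hex>" parameter produce no output. Hypothetical helper.
commit_ids() {
    sed -n 's/.*[?&]id=\([0-9a-f]\{7,40\}\).*/\1/p'
}

# Example (assumes urls.txt holds the six URLs, each joined onto one line,
# and the current directory is the ubuntu-xenial clone):
#   commit_ids < urls.txt | xargs git revert --no-edit
```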

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1737033

Title:
  upgrading linux-image package to 4.4.0-103.126 breaks Ceph network
  file system connection

Status in ceph package in Ubuntu:
  Confirmed
Status in linux package in Ubuntu:
  Triaged
Status in ceph source package in Xenial:
  Confirmed
Status in linux source package in Xenial:
  Triaged

Bug description:
  After clients have upgraded to 4.4.0-103.126 they can no longer
  connect to the Ceph network.

  Using the Grub menu to boot the previous kernel fixes the issue.

  The error in dmesg is:

  [   46.811897] FS-Cache: Loaded
  [   46.843670] Key type ceph registered
  [   46.844177] libceph: loaded (mon/osd proto 15/24)
  [   46.863107] FS-Cache: Netfs 'ceph' registered for caching
  [   46.863116] ceph: loaded (mds proto 32)
  [   46.884392] libceph: client3354099 fsid 2efbeab1-4903-4c4c-8365-6778afecbcbd
  [   46.886856] libceph: mon0 10.10.2.111:6789 session established
  [   46.897487] ceph: problem parsing mds trace -5
  [   46.897491] ceph: mds parse_reply err -5
  [   46.897492] ceph: mdsc_handle_reply got corrupt reply mds0(tid:1)

  All clients are running ceph client version:
  ii  ceph-fs-common  10.2.9-0ubuntu0.16.04.1

  Server nodes are running 10.2.6 packages as supplied by Ceph.

  All 10.2.* versions are compatible. Using the previous kernel allows the connection to work.
  --- 
  ApportVersion: 2.20.1-0ubuntu2.14
  Architecture: amd64
  AudioDevicesInUse:
   USER        PID ACCESS COMMAND
   /dev/snd/controlC0:  lightdm    1310 F pulseaudio
  DistroRelease: Ubuntu 16.04
  HibernationDevice: RESUME=UUID=9e3c38ba-10bd-4183-b073-68d9d9a30a9b
  IwConfig: Error: [Errno 2] No such file or directory
  Lsusb:
   Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
   Bus 002 Device 002: ID 80ee:0021 VirtualBox USB Tablet
   Bus 002 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
  MachineType: innotek GmbH VirtualBox
  Package: linux (not installed)
  ProcEnviron:
   TERM=xterm-256color
   PATH=(custom, no user)
   LANG=en_US
   SHELL=/bin/bash
  ProcFB: 0 vboxdrmfb
  ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-4.4.0-103-generic root=UUID=5a37e891-beb9-477f-b934-3c05651acf68 ro quiet splash
  ProcVersionSignature: Ubuntu 4.4.0-103.126-generic 4.4.98
  PulseList: Error: command ['pacmd', 'list'] failed with exit code 1: No PulseAudio daemon running, or not running as session daemon.
  RelatedPackageVersions:
   linux-restricted-modules-4.4.0-103-generic N/A
   linux-backports-modules-4.4.0-103-generic  N/A
   linux-firmware 1.157.14
  RfKill:
   
  Tags:  xenial xenial
  Uname: Linux 4.4.0-103-generic x86_64
  UnreportableReason: The report belongs to a package that is not installed.
  UpgradeStatus: No upgrade log present (probably fresh install)
  UserGroups:
   
  _MarkForUpload: False
  dmi.bios.date: 12/01/2006
  dmi.bios.vendor: innotek GmbH
  dmi.bios.version: VirtualBox
  dmi.board.name: VirtualBox
  dmi.board.vendor: Oracle Corporation
  dmi.board.version: 1.2
  dmi.chassis.type: 1
  dmi.chassis.vendor: Oracle Corporation
  dmi.modalias: dmi:bvninnotekGmbH:bvrVirtualBox:bd12/01/2006:svninnotekGmbH:pnVirtualBox:pvr1.2:rvnOracleCorporation:rnVirtualBox:rvr1.2:cvnOracleCorporation:ct1:cvr:
  dmi.product.name: VirtualBox
  dmi.product.version: 1.2
  dmi.sys.vendor: innotek GmbH

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1737033/+subscriptions


[Kernel-packages] [Bug 1737033] Re: upgrading linux-image package to 4.4.0-103.126 breaks Ceph network file system connection

2017-12-08 Thread David Coronel
I forgot to mention that I was still able to reproduce the bug with the
kernel you built with commit ff467fd reverted, available at
http://kernel.ubuntu.com/~jsalisbury/lp1737033

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1737033

Title:
  upgrading linux-image package to 4.4.0-103.126 breaks Ceph network
  file system connection

Status in ceph package in Ubuntu:
  New
Status in linux package in Ubuntu:
  In Progress
Status in ceph source package in Xenial:
  New
Status in linux source package in Xenial:
  In Progress

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1737033/+subscriptions


[Kernel-packages] [Bug 1737033] Re: upgrading linux-image package to 4.4.0-103.126 breaks Ceph network file system connection

2017-12-08 Thread David Coronel
Hi Joseph,

I'm able to reproduce Benjamin's original issue with kernel
4.4.0-103-generic #126-Ubuntu:

# mount -t ceph :6789:/ /mnt/mycephfs -o name=admin,secret=
mount error 5 = Input/output error

I don't get this problem with 4.4.0-101-generic #124-Ubuntu. I'm working
with Jay Vosburgh to test a build of kernel 4.4.0-103.126 without the
patches from
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1728739
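As a quick check for the symptom described in this bug, a sketch like the following could be run after a failed mount. It assumes the kernel log is available via `dmesg` and only matches the two CephFS client messages quoted in the bug report; `ceph_mds_reply_corrupt` is a hypothetical helper name.

```shell
# ceph_mds_reply_corrupt: read kernel log text on stdin and succeed (exit 0)
# if it contains either of the CephFS client errors quoted in this bug.
ceph_mds_reply_corrupt() {
    grep -q -e 'ceph: mds parse_reply err' -e 'mdsc_handle_reply got corrupt reply'
}

# Example (assumption: run right after a failed "mount -t ceph"):
#   dmesg | ceph_mds_reply_corrupt && echo "matches the LP #1737033 symptom"
```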

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1737033

Title:
  upgrading linux-image package to 4.4.0-103.126 breaks Ceph network
  file system connection

Status in ceph package in Ubuntu:
  New
Status in linux package in Ubuntu:
  In Progress
Status in ceph source package in Xenial:
  New
Status in linux source package in Xenial:
  In Progress

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1737033/+subscriptions


[Kernel-packages] [Bug 1737033] Re: upgrading linux-image package to 4.4.0-103.126 breaks Ceph network file system connection

2017-12-08 Thread David Coronel
On the running system, after you roll back to the previous kernel, can
you paste the /etc/fstab entries for the filesystems that failed to mount
and the corresponding lines from mount -v? I assume the issues are with
nfshome and officeshare:

grep -i -e nfshome -e officeshare -e office_share /etc/fstab
mount -v | grep -i -e nfshome -e officeshare -e office_share
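If it helps to see which fstab entries did not get mounted on the 4.4.0-103 kernel, a minimal sketch follows. It assumes the mounted filesystems are read from `/proc/mounts`; `unmounted_fstab_entries` is a hypothetical helper, it skips swap entries, and entries marked `noauto` will also be reported because it does not inspect mount options.

```shell
# unmounted_fstab_entries FSTAB MOUNTS
# Print the mount points listed in FSTAB that do not appear as a mount point
# in MOUNTS (a file in /proc/mounts format). Comment lines, blank lines and
# swap entries are skipped. Hypothetical helper.
unmounted_fstab_entries() {
    awk -v mounts="$2" '
        FILENAME == mounts { mounted[$2] = 1; next }  # first file: current mounts
        /^[ \t]*#/ || NF < 2 { next }                 # skip comments and blank lines
        $2 == "none" || $2 == "swap" { next }         # skip swap entries
        !($2 in mounted) { print $2 }                 # listed in fstab, not mounted
    ' "$2" "$1"
}

# Example:
#   unmounted_fstab_entries /etc/fstab /proc/mounts
```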

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1737033

Title:
  upgrading linux-image package to 4.4.0-103.126 breaks Ceph network
  file system connection

Status in ceph package in Ubuntu:
  New
Status in linux package in Ubuntu:
  In Progress
Status in ceph source package in Xenial:
  New
Status in linux source package in Xenial:
  In Progress

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1737033/+subscriptions
