[Kernel-packages] [Bug 1774366] Re: Fix MCE handling for user access of poisoned device-dax mapping

2019-07-24 Thread Brad Figg
** Tags added: cscc

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1774366

Title:
  Fix MCE handling for user access of poisoned device-dax mapping

Status in intel:
  Fix Released
Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Cosmic:
  Fix Released

Bug description:
  Description:

  Customer reports:

  —

  We've tried the ndctl error injection, and the injection now
  succeeds. But we have a couple of questions related to the
  poisoned block.

  Here are some tests/steps that I did:

  1. I mmap the first 10GB of /dev/dax13.0 to virtual address space and
  inject an error at offset 1GB (block offset should be 1GB/512bytes =
  2097152) which seems fine:

  [root@scaqae09celadm13 wning]# ndctl inject-error --uninject --block=2097152 --count=1 namespace13.0
  Warning: Un-injecting previously injected errors here will
  not cause the kernel to 'forget' its badblock entries. Those
  have to be cleared through the normal process of writing
  the affected blocks

  {
    "dev":"namespace13.0",
    "mode":"dax",
    "size":518967525376,
    "uuid":"0738c8bd-3b3f-4989-9d0e-0e9c6006c810",
    "chardev":"dax13.0",
    "numa_node":0,
    "badblock_count":1,
    "badblocks":[
      { "offset":2097152, "length":1, "dimms":[ "nmem1" ] }
    ]
  }
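
The block number passed to --block in step 1 comes from dividing the byte offset by the 512-byte block size the report assumes. A minimal C sketch of that arithmetic follows; the helper name byte_offset_to_block is made up for illustration and is not part of ndctl.

```c
#include <stdint.h>

/* Hypothetical helper (not an ndctl API): convert a byte offset within the
 * namespace into a block number, given the block size in bytes. */
uint64_t byte_offset_to_block(uint64_t byte_offset, uint64_t block_size)
{
    return byte_offset / block_size;
}
```

With an offset of 1GB (1073741824 bytes) and 512-byte blocks this gives 1073741824 / 512 = 2097152, which matches the "offset" reported in the badblocks list.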

  2. In my test program, I just try to read every address of the first
  10GB. The first time, when I read at offset 1GB, I got the SIGBUS
  error, but in the sinfo struct of the signal handler, the failed
  address is NULL and the signal code is 128, which seems incorrect. But
  then if we run it again, the unit test gets stuck here:

  rt_sigaction(SIGBUS, {0x400dd2, [], SA_RESTORER|SA_SIGINFO, 0x7fb5cf839270}, NULL, 8) = 0

  And here is the output of log messages:

  Apr 3 14:39:23 scaqae09celadm13 kernel: mce: Uncorrected hardware memory error in user-access at 952b20
  Apr 3 14:51:11 scaqae09celadm13 kernel: mce: Uncorrected hardware memory error in user-access at 94eb30
  Apr 3 14:51:11 scaqae09celadm13 kernel: mce_notify_irq: 48 callbacks suppressed
  Apr 3 14:51:11 scaqae09celadm13 kernel: mce: [Hardware Error]: Machine check events logged
  Apr 3 14:51:11 scaqae09celadm13 kernel: Memory failure: 0x94eb300: reserved kernel page still referenced by 1 users
  Apr 3 14:51:11 scaqae09celadm13 kernel: Memory failure: 0x94eb300: recovery action for reserved kernel page: Failed
  Apr 3 14:51:11 scaqae09celadm13 kernel: mce: Memory error not recovered

  The program I use is (it simply does a memcpy and reads directly from
  the target address):

  for (i = 0; i < DAX_MAPPING_SIZE; i++) /* DAX_MAPPING_SIZE is 10G */
          total += peek(buf + i);

  char peek(void *addr)
  {
          char temp[128];

          memcpy(temp, addr, 1);  /* touch the byte via memcpy ... */
          return *(char *)addr;   /* ... and via a direct load */
  }

  May I ask whether we missed any steps in triggering the SIGBUS error?
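
For reference, a handler along the following lines is one way a test program could capture si_code and si_addr when the poisoned byte is read. This is only a sketch under assumptions: the names on_sigbus, install_sigbus_handler, probe_byte and looks_like_poison_report are invented here and are not taken from the customer's program. On Linux, a consumed memory error is normally delivered as SIGBUS with si_code BUS_MCEERR_AR and si_addr set to the faulting address, whereas 128 is the numeric value of SI_KERNEL, a generic kernel-sent signal that carries no address.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <setjmp.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>

/* Fallback in case the libc headers do not expose the constant;
 * 4 is the Linux value of BUS_MCEERR_AR. */
#ifndef BUS_MCEERR_AR
#define BUS_MCEERR_AR 4
#endif

static sigjmp_buf probe_env;                  /* jump target for the handler */
static volatile sig_atomic_t last_si_code;    /* si_code of the last SIGBUS */
static void *volatile last_si_addr;           /* si_addr of the last SIGBUS */

/* Record what the kernel reported, then jump out: returning normally
 * would re-execute the faulting load and raise SIGBUS again. */
void on_sigbus(int sig, siginfo_t *info, void *ucontext)
{
    (void)sig;
    (void)ucontext;
    last_si_code = info->si_code;
    last_si_addr = info->si_addr;
    siglongjmp(probe_env, 1);
}

/* Install the handler with SA_SIGINFO so siginfo_t is populated.
 * Returns 0 on success, -1 on failure (as sigaction does). */
int install_sigbus_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = on_sigbus;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGBUS, &sa, NULL);
}

/* Read one byte. Returns 0 and stores the byte in *out on success,
 * or -1 if the read raised SIGBUS. */
int probe_byte(const volatile char *addr, char *out)
{
    if (sigsetjmp(probe_env, 1) != 0)
        return -1;
    *out = *addr;
    return 0;
}

/* True if the last recorded SIGBUS has the shape expected for a
 * consumed memory error: BUS_MCEERR_AR plus a non-NULL address. */
int looks_like_poison_report(void)
{
    return last_si_code == BUS_MCEERR_AR && last_si_addr != NULL;
}
```

A test would call install_sigbus_handler() once and then probe_byte() on each address of the mapping; when probe_byte() returns -1, looks_like_poison_report() distinguishes the expected report from the NULL address and code 128 described in step 2.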

  Target Kernel: 4.19
  Target Release: 18.10

To manage notifications about this bug go to:
https://bugs.launchpad.net/intel/+bug/1774366/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp


[Kernel-packages] [Bug 1774366] Re: Fix MCE handling for user access of poisoned device-dax mapping

2019-02-14 Thread Andy Whitcroft
This bug was erroneously marked for verification in bionic; verification
is not required and verification-needed-bionic is being removed.

** Tags added: verification-done-bionic

Title:
  Fix MCE handling for user access of poisoned device-dax mapping

Status in intel:
  Fix Released
Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Cosmic:
  Fix Released


[Kernel-packages] [Bug 1774366] Re: Fix MCE handling for user access of poisoned device-dax mapping

2019-02-14 Thread Andy Whitcroft
** Tags removed: verification-needed-bionic

Title:
  Fix MCE handling for user access of poisoned device-dax mapping

Status in intel:
  Fix Released
Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Cosmic:
  Fix Released


[Kernel-packages] [Bug 1774366] Re: Fix MCE handling for user access of poisoned device-dax mapping

2019-02-14 Thread Brad Figg
This bug is awaiting verification that the kernel in -proposed solves
the problem. Please test the kernel and update this bug with the
results. If the problem is solved, change the tag
'verification-needed-bionic' to 'verification-done-bionic'. If the
problem still exists, change the tag 'verification-needed-bionic' to
'verification-failed-bionic'.

If verification is not done by 5 working days from today, this fix will
be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how
to enable and use -proposed. Thank you!


** Tags added: verification-needed-bionic

Title:
  Fix MCE handling for user access of poisoned device-dax mapping

Status in intel:
  Fix Released
Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Cosmic:
  Fix Released


[Kernel-packages] [Bug 1774366] Re: Fix MCE handling for user access of poisoned device-dax mapping

2019-02-14 Thread Andy Whitcroft
** Tags removed: verification-needed-bionic
** Tags added: kernel-fixup-verification-needed-bionic

Title:
  Fix MCE handling for user access of poisoned device-dax mapping

Status in intel:
  Fix Released
Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Cosmic:
  Fix Released


[Kernel-packages] [Bug 1774366] Re: Fix MCE handling for user access of poisoned device-dax mapping

2018-10-25 Thread quanxian
** Changed in: intel
   Status: Triaged => Fix Released

Title:
  Fix MCE handling for user access of poisoned device-dax mapping

Status in intel:
  Fix Released
Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Cosmic:
  Fix Released


[Kernel-packages] [Bug 1774366] Re: Fix MCE handling for user access of poisoned device-dax mapping

2018-10-11 Thread Brad Figg
This bug is awaiting verification that the kernel in -proposed solves
the problem. Please test the kernel and update this bug with the
results. If the problem is solved, change the tag
'verification-needed-bionic' to 'verification-done-bionic'. If the
problem still exists, change the tag 'verification-needed-bionic' to
'verification-failed-bionic'.

If verification is not done by 5 working days from today, this fix will
be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how
to enable and use -proposed. Thank you!


** Tags added: verification-needed-bionic

Title:
  Fix MCE handling for user access of poisoned device-dax mapping

Status in intel:
  Triaged
Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Cosmic:
  Fix Released


[Kernel-packages] [Bug 1774366] Re: Fix MCE handling for user access of poisoned device-dax mapping

2018-10-01 Thread Launchpad Bug Tracker
This bug was fixed in the package linux - 4.18.0-8.9

---
linux (4.18.0-8.9) cosmic; urgency=medium

  * linux: 4.18.0-8.9 -proposed tracker (LP: #1791663)

  * Cosmic update to v4.18.7 stable release (LP: #1791660)
- rcu: Make expedited GPs handle CPU 0 being offline
- net: 6lowpan: fix reserved space for single frames
- net: mac802154: tx: expand tailroom if necessary
- 9p/net: Fix zero-copy path in the 9p virtio transport
- spi: davinci: fix a NULL pointer dereference
- spi: pxa2xx: Add support for Intel Ice Lake
- spi: spi-fsl-dspi: Fix imprecise abort on VF500 during probe
- spi: cadence: Change usleep_range() to udelay(), for atomic context
- mmc: block: Fix unsupported parallel dispatch of requests
- mmc: renesas_sdhi_internal_dmac: mask DMAC interrupts
- mmc: renesas_sdhi_internal_dmac: fix #define RST_RESERVED_BITS
- readahead: stricter check for bdi io_pages
- block: fix infinite loop if the device loses discard capability
- block: blk_init_allocated_queue() set q->fq as NULL in the fail case
- block: really disable runtime-pm for blk-mq
- blkcg: Introduce blkg_root_lookup()
- block: Introduce blk_exit_queue()
- block: Ensure that a request queue is dissociated from the cgroup controller
- apparmor: fix bad debug check in apparmor_secid_to_secctx()
- dma-buf: Move BUG_ON from _add_shared_fence to _add_shared_inplace
- libertas: fix suspend and resume for SDIO connected cards
- media: Revert "[media] tvp5150: fix pad format frame height"
- mailbox: xgene-slimpro: Fix potential NULL pointer dereference
- Replace magic for trusting the secondary keyring with #define
- Fix kexec forbidding kernels signed with keys in the secondary keyring to
  boot
- powerpc/fadump: handle crash memory ranges array index overflow
- powerpc/64s: Fix page table fragment refcount race vs speculative references
- powerpc/pseries: Fix endianness while restoring of r3 in MCE handler.
- powerpc/pkeys: Give all threads control of their key permissions
- powerpc/pkeys: Deny read/write/execute by default
- powerpc/pkeys: key allocation/deallocation must not change pkey registers
- powerpc/pkeys: Save the pkey registers before fork
- powerpc/pkeys: Fix calculation of total pkeys.
- powerpc/pkeys: Preallocate execute-only key
- powerpc/nohash: fix pte_access_permitted()
- powerpc64/ftrace: Include ftrace.h needed for enable/disable calls
- powerpc/powernv/pci: Work around races in PCI bridge enabling
- cxl: Fix wrong comparison in cxl_adapter_context_get()
- IB/mlx5: Honor cnt_set_id_valid flag instead of set_id
- IB/mlx5: Fix leaking stack memory to userspace
- IB/srpt: Fix srpt_cm_req_recv() error path (1/2)
- IB/srpt: Fix srpt_cm_req_recv() error path (2/2)
- IB/srpt: Support HCAs with more than two ports
- overflow.h: Add arithmetic shift helper
- RDMA/mlx5: Fix shift overflow in mlx5_ib_create_wq
- ib_srpt: Fix a use-after-free in srpt_close_ch()
- ib_srpt: Fix a use-after-free in __srpt_close_all_ch()
- RDMA/rxe: Set wqe->status correctly if an unexpected response is received
- 9p: fix multiple NULL-pointer-dereferences
- fs/9p/xattr.c: catch the error of p9_client_clunk when setting xattr failed
- 9p/virtio: fix off-by-one error in sg list bounds check
- net/9p/client.c: version pointer uninitialized
- net/9p/trans_fd.c: fix race-condition by flushing workqueue before the
  kfree()
- dm integrity: change 'suspending' variable from bool to int
- dm thin: stop no_space_timeout worker when switching to write-mode
- dm cache metadata: save in-core policy_hint_size to on-disk superblock
- dm cache metadata: set dirty on all cache blocks after a crash
- dm crypt: don't decrease device limits
- dm writecache: fix a crash due to reading past end of dirty_bitmap
- uart: fix race between uart_put_char() and uart_shutdown()
- Drivers: hv: vmbus: Fix the offer_in_progress in vmbus_process_offer()
- Drivers: hv: vmbus: Reset the channel callback in vmbus_onoffer_rescind()
- iio: sca3000: Fix missing return in switch
- iio: ad9523: Fix displayed phase
- iio: ad9523: Fix return value for ad952x_store()
- extcon: Release locking when sending the notification of connector state
- eventpoll.h: wrap casts in () properly
- vmw_balloon: fix inflation of 64-bit GFNs
- vmw_balloon: do not use 2MB without batching
- vmw_balloon: VMCI_DOORBELL_SET does not check status
- vmw_balloon: fix VMCI use when balloon built into kernel
- rtc: omap: fix resource leak in registration error path
- rtc: omap: fix potential crash on power off
- tracing: Do not call start/stop() functions when tracing_on does not change
- tracing/blktrace: Fix to allow setting same value
- printk/tracing: Do not trace printk_nmi_enter()
- livepatch: 

[Kernel-packages] [Bug 1774366] Re: Fix MCE handling for user access of poisoned device-dax mapping

2018-09-27 Thread Thadeu Lima de Souza Cascardo
c7486104a5ce applied to cosmic master-next.

Title:
  Fix MCE handling for user access of poisoned device-dax mapping

Status in intel:
  Triaged
Status in linux package in Ubuntu:
  Fix Committed
Status in linux source package in Cosmic:
  Fix Committed

Bug description:
  Description:

  Customer reports:

  —

  We've tried the ndctl error injection. Now the error injection is
  successful. But we have a couple of questions related with the
  poisoned block.

  Here are some tests/steps that I did:

  1. I mmap the first 10GB of /dev/dax13.0 to virtual address space and
  inject an error at offset 1GB (block offset should be 1GB/512bytes =
  2097152) which seems fine:

  [root@scaqae09celadm13 wning]# ndctl inject-error --uninject --block=2097152 
--count=1 namespace13.0
  Warning: Un-injecting previously injected errors here will
  not cause the kernel to 'forget' its badblock entries. Those
  have to be cleared through the normal process of writing
  the affected blocks

  {
  "dev":"namespace13.0",
  "mode":"dax",
  "size":518967525376,
  "uuid":"0738c8bd-3b3f-4989-9d0e-0e9c6006c810",
  "chardev":"dax13.0",
  "numa_node":0,
  "badblock_count":1,
  "badblocks":[

  { "offset":2097152, "length":1, "dimms":[ "nmem1" ] }
  ]
  }

  2. In my test program, I simply try to read every address in the first
  10GB. The first time I read offset 1GB, I got a SIGBUS, but in the
  siginfo struct passed to the signal handler the failing address is NULL
  and the signal code is 128, which seems incorrect. If we then run the
  test again, it gets stuck here:

  rt_sigaction(SIGBUS, {0x400dd2, [], SA_RESTORER|SA_SIGINFO, 0x7fb5cf839270}, NULL, 8) = 0
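
For reference, a signal code of 128 is `SI_KERNEL` (0x80), a generic code for kernel-originated signals. Per sigaction(2), a consumed hardware memory error would normally be expected to arrive as `BUS_MCEERR_AR` with `si_addr` set to the faulting address. The following is a minimal sketch, not the customer's program, of a handler that records both fields and recovers from the fault. All function and variable names in it are hypothetical.

```c
#define _GNU_SOURCE
#include <setjmp.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

static sigjmp_buf recover_point;        /* where to resume after a fault */
static volatile sig_atomic_t armed;     /* 1 only while recover_point is valid */
static volatile sig_atomic_t last_code; /* si_code of the last SIGBUS */
static void *volatile last_addr;        /* si_addr of the last SIGBUS */

/* Map a SIGBUS si_code to its symbolic name (Linux values). */
static const char *sigbus_code_name(int code)
{
    switch (code) {
    case BUS_ADRALN:    return "BUS_ADRALN";
    case BUS_ADRERR:    return "BUS_ADRERR";
    case BUS_OBJERR:    return "BUS_OBJERR";
    case BUS_MCEERR_AR: return "BUS_MCEERR_AR"; /* memory error, action required */
    case BUS_MCEERR_AO: return "BUS_MCEERR_AO"; /* memory error, action optional */
    case SI_KERNEL:     return "SI_KERNEL";     /* 0x80 == 128, as in the report */
    default:            return "unknown";
    }
}

/* Record the fault details and jump back into guarded_peek(). */
static void on_sigbus(int sig, siginfo_t *info, void *ctx)
{
    (void)sig;
    (void)ctx;
    if (!armed)
        _exit(1); /* fault outside guarded_peek(): nowhere safe to resume */
    last_code = info->si_code;
    last_addr = info->si_addr;
    siglongjmp(recover_point, 1);
}

/* Install the SIGBUS handler with SA_SIGINFO so siginfo_t is delivered. */
static int install_sigbus_handler(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_sigbus;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGBUS, &sa, NULL);
}

/* Read one byte; return 0 on success, -1 if the read raised SIGBUS. */
static int guarded_peek(const volatile char *p, char *out)
{
    if (sigsetjmp(recover_point, 1) != 0) {
        armed = 0;
        return -1;
    }
    armed = 1;
    *out = *p;
    armed = 0;
    return 0;
}
```

After `guarded_peek()` returns -1, `sigbus_code_name(last_code)` and `last_addr` can be logged from normal (non-handler) context. With the handling this bug asks for, one would expect to see `BUS_MCEERR_AR` and a non-NULL address rather than `SI_KERNEL` and NULL.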

  And here is the output of log messages:

  Apr 3 14:39:23 scaqae09celadm13 kernel: mce: Uncorrected hardware memory error in user-access at 952b20
  Apr 3 14:51:11 scaqae09celadm13 kernel: mce: Uncorrected hardware memory error in user-access at 94eb30
  Apr 3 14:51:11 scaqae09celadm13 kernel: mce_notify_irq: 48 callbacks suppressed
  Apr 3 14:51:11 scaqae09celadm13 kernel: mce: [Hardware Error]: Machine check events logged
  Apr 3 14:51:11 scaqae09celadm13 kernel: Memory failure: 0x94eb300: reserved kernel page still referenced by 1 users
  Apr 3 14:51:11 scaqae09celadm13 kernel: Memory failure: 0x94eb300: recovery action for reserved kernel page: Failed
  Apr 3 14:51:11 scaqae09celadm13 kernel: mce: Memory error not recovered

  The program I use simply does a memcpy and then reads directly from
  the target address:

  char peek(void *addr)
  {
      char temp[128];
      memcpy(temp, addr, 1);   /* read the byte through memcpy */
      return *(char *)addr;    /* then read it directly */
  }

  for (i = 0; i < DAX_MAPPING_SIZE; i++)  // DAX_MAPPING_SIZE is 10G
      total += peek(buf + i);

  May I ask whether we missed any steps in triggering the SIGBUS error?

  Target Kernel: 4.19
  Target Release: 18.10

To manage notifications about this bug go to:
https://bugs.launchpad.net/intel/+bug/1774366/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp


[Kernel-packages] [Bug 1774366] Re: Fix MCE handling for user access of poisoned device-dax mapping

2018-09-12 Thread quanxian
Sorry, there will be more commits available.

c7486104a5ce x86/mce: Fix set_mce_nospec() to avoid #GP fault

Status in intel:
  Triaged
Status in linux package in Ubuntu:
  Fix Committed
Status in linux source package in Cosmic:
  Fix Committed



[Kernel-packages] [Bug 1774366] Re: Fix MCE handling for user access of poisoned device-dax mapping

2018-08-30 Thread Thadeu Lima de Souza Cascardo
** Changed in: linux (Ubuntu Cosmic)
   Status: In Progress => Fix Committed

Status in intel:
  Triaged
Status in linux package in Ubuntu:
  Fix Committed
Status in linux source package in Cosmic:
  Fix Committed



[Kernel-packages] [Bug 1774366] Re: Fix MCE handling for user access of poisoned device-dax mapping

2018-08-30 Thread Thadeu Lima de Souza Cascardo
** Also affects: linux (Ubuntu)
   Importance: Undecided
   Status: New

** Also affects: linux (Ubuntu Cosmic)
   Importance: Undecided
   Status: New

** Changed in: linux (Ubuntu Cosmic)
   Status: New => Confirmed

** Changed in: linux (Ubuntu Cosmic)
   Importance: Undecided => High

** Changed in: linux (Ubuntu Cosmic)
 Assignee: (unassigned) => Thadeu Lima de Souza Cascardo (cascardo)

** Changed in: linux (Ubuntu Cosmic)
   Status: Confirmed => In Progress

Status in intel:
  Triaged
Status in linux package in Ubuntu:
  In Progress
Status in linux source package in Cosmic:
  In Progress
