Re: [PATCH v9 1/2] arch/*: Add CONFIG_ARCH_HAVE_CMPXCHG64

2018-05-14 Thread Michael Ellerman
Hi Bart,

Bart Van Assche  writes:
>
...
> diff --git a/Documentation/features/locking/cmpxchg64/arch-support.txt 
> b/Documentation/features/locking/cmpxchg64/arch-support.txt
> new file mode 100644
> index ..65b3290ce5d5
> --- /dev/null
> +++ b/Documentation/features/locking/cmpxchg64/arch-support.txt
> @@ -0,0 +1,31 @@
> +#
> +# Feature name:          cmpxchg64
> +#         Kconfig:       ARCH_HAVE_CMPXCHG64
> +#         description:   arch supports the cmpxchg64() API
> +#
> +    -----------------------
> +    |         arch |status|
> +    -----------------------
> +    |       alpha: |  ok  |
> +    |         arc: | TODO |
> +    |         arm: |!thumb|
> +    |       arm64: |  ok  |
> +    |         c6x: | TODO |
> +    |       h8300: | TODO |
> +    |     hexagon: | TODO |
> +    |        ia64: |  ok  |
> +    |        m68k: |  ok  |
> +    |  microblaze: | TODO |
> +    |        mips: |64-bit|
> +    |       nios2: | TODO |
> +    |    openrisc: | TODO |
> +    |      parisc: |  ok  |
> +    |     powerpc: |64-bit|

I think that is correct for powerpc; we don't have a 32-bit
implementation and there's no fallback, it seems.

> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
> --- a/arch/powerpc/Kconfig
> +++ b/arch/powerpc/Kconfig
> @@ -150,6 +150,7 @@ config PPC
>   select ARCH_HAS_UBSAN_SANITIZE_ALL
>   select ARCH_HAS_ZONE_DEVICE if PPC_BOOK3S_64
>   select ARCH_HAVE_NMI_SAFE_CMPXCHG
> + select ARCH_HAVE_CMPXCHG64

So shouldn't this be:

+   select ARCH_HAVE_CMPXCHG64  if PPC64

And it should be sorted alphabetically, i.e. above the previous NMI entry.

cheers


Re: [PATCH V5 0/9] nvme: pci: fix & improve timeout handling

2018-05-14 Thread Ming Lei
On Mon, May 14, 2018 at 08:22:11PM +0800, Ming Lei wrote:
> Hi Jianchao,
> 
> On Mon, May 14, 2018 at 06:05:50PM +0800, jianchao.wang wrote:
> > Hi ming
> > 
> > On 05/14/2018 05:38 PM, Ming Lei wrote:
> > >> Here is the deadlock scenario.
> > >>
> > >> nvme_eh_work // EH0
> > >>   -> nvme_reset_dev //hold reset_lock
> > >> -> nvme_setup_io_queues
> > >>   -> nvme_create_io_queues
> > >> -> nvme_create_queue
> > >>   -> set nvmeq->cq_vector
> > >>   -> adapter_alloc_cq
> > >>   -> adapter_alloc_sq
> > >>  irq has not been requested
> > >>  io timeout 
> > >> nvme_eh_work //EH1
> > >>   -> nvme_dev_disable
> > >> -> quiesce the adminq //> 
> > >> here !
> > >> -> nvme_suspend_queue
> > >>   print out warning Trying to 
> > >> free already-free IRQ 133
> > >> -> nvme_cancel_request // 
> > >> complete the timeout admin request
> > >>   -> require reset_lock
> > >>   -> adapter_delete_cq
> > > If the admin IO submitted in adapter_alloc_sq() is timed out,
> > > nvme_dev_disable() in EH1 will complete it which is set as 
> > > REQ_FAILFAST_DRIVER,
> > > then adapter_alloc_sq() should return error, and the whole reset in EH0
> > > should have been terminated immediately.
> > 
> > Please refer to the following segment:
> > 
> > static int nvme_create_queue(struct nvme_queue *nvmeq, int qid)
> > {
> > struct nvme_dev *dev = nvmeq->dev;
> > int result;
> > ...
> > nvmeq->cq_vector = dev->num_vecs == 1 ? 0 : qid;
> > result = adapter_alloc_cq(dev, qid, nvmeq);
> > if (result < 0)
> > goto release_vector;
> > 
> > result = adapter_alloc_sq(dev, qid, nvmeq);   // if timeout and failed 
> > here
> > if (result < 0)
> > goto release_cq;
> > 
> > nvme_init_queue(nvmeq, qid);
> > result = queue_request_irq(nvmeq);
> > if (result < 0)
> > goto release_sq;
> > 
> > return result;
> > 
> >  release_sq:
> > dev->online_queues--;
> > adapter_delete_sq(dev, qid);
> >  release_cq:// we will be here !
> > adapter_delete_cq(dev, qid);// another cq delete admin 
> > command will be sent out.
> >  release_vector:
> > nvmeq->cq_vector = -1;
> > return result;
> > }
> 
> Given that the admin queue has been suspended, all admin IO should have
> been terminated immediately, so could you test the following patch?
> 
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index f509d37b2fb8..44e38be259f2 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -1515,9 +1515,6 @@ static int nvme_suspend_queue(struct nvme_queue *nvmeq)
>   nvmeq->cq_vector = -1;
>   spin_unlock_irq(&nvmeq->q_lock);
>  
> - if (!nvmeq->qid && nvmeq->dev->ctrl.admin_q)
> - blk_mq_quiesce_queue(nvmeq->dev->ctrl.admin_q);
> -
>   pci_free_irq(to_pci_dev(nvmeq->dev->dev), vector, nvmeq);
>  
>   return 0;
> @@ -1741,8 +1738,7 @@ static int nvme_alloc_admin_tags(struct nvme_dev *dev)
>   dev->ctrl.admin_q = NULL;
>   return -ENODEV;
>   }
> - } else
> - blk_mq_unquiesce_queue(dev->ctrl.admin_q);
> + }
>  
>   return 0;
>  }

We still have to quiesce the admin queue before canceling requests, so the
following patch looks better. Please ignore the above patch and try this
one instead to see if it addresses your hang:

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index f509d37b2fb8..c2adc76472a8 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1741,8 +1741,7 @@ static int nvme_alloc_admin_tags(struct nvme_dev *dev)
dev->ctrl.admin_q = NULL;
return -ENODEV;
}
-   } else
-   blk_mq_unquiesce_queue(dev->ctrl.admin_q);
+   }
 
return 0;
 }
@@ -2520,6 +2519,12 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool 
shutdown, bool
 */
if (shutdown)
	nvme_start_queues(&dev->ctrl);
+
+   /*
+* Avoid stalling reset: a timeout may happen during reset, and reset
+* may hang forever if the admin queue is left quiesced
+*/
+   blk_mq_unquiesce_queue(dev->ctrl.admin_q);
	mutex_unlock(&dev->shutdown_lock);
 }
 

-- 
Ming


Re: [PATCH V5 0/9] nvme: pci: fix & improve timeout handling

2018-05-14 Thread Keith Busch
On Tue, May 15, 2018 at 07:47:07AM +0800, Ming Lei wrote:
> > > > [  760.727485] nvme nvme1: EH 0: after recovery -19
> > > > [  760.727488] nvme nvme1: EH: fail controller
> > > 
> > > The above issue (hang in nvme_remove()) is still an old issue, which
> > > happens because the queues are kept quiesced during remove, so could you
> > > please test the following change?
> > > 
> > > diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> > > index 1dec353388be..c78e5a0cde06 100644
> > > --- a/drivers/nvme/host/core.c
> > > +++ b/drivers/nvme/host/core.c
> > > @@ -3254,6 +3254,11 @@ void nvme_remove_namespaces(struct nvme_ctrl *ctrl)
> > >  */
> > > if (ctrl->state == NVME_CTRL_DEAD)
> > > nvme_kill_queues(ctrl);
> > > +   else {
> > > +   if (ctrl->admin_q)
> > > +   blk_mq_unquiesce_queue(ctrl->admin_q);
> > > +   nvme_start_queues(ctrl);
> > > +   }
> > > 
> > > down_write(&ctrl->namespaces_rwsem);
> > > list_splice_init(&ctrl->namespaces, &ns_list);
> > 
> > The above won't actually do anything here since the broken link puts the
> > controller in the DEAD state, so we've killed the queues which also
> > unquiesces them.
> 
> I suggest you double-check whether the controller is set to DEAD
> in nvme_remove(), since no log is dumped when this happens.

Yes, it's dead. pci_device_is_present returns false when the link is
broken.

Also, the logs showed the capacity was set to 0, which only happens when
we kill the namespace queues, which supposedly restarts the queues too.
 
> If the controller is set to DEAD and the queues are killed, all IO should
> have been dispatched to the driver and nvme_queue_rq() will fail it all,
> so there isn't any reason to see the hang in your stack trace log.

Right, that's the idea. It just doesn't appear to be working here.


Re: [PATCH V5 0/9] nvme: pci: fix & improve timeout handling

2018-05-14 Thread Ming Lei
On Mon, May 14, 2018 at 09:18:21AM -0600, Keith Busch wrote:
> Hi Ming,
> 
> On Sat, May 12, 2018 at 08:21:22AM +0800, Ming Lei wrote:
> > > [  760.679960] nvme nvme1: controller is down; will reset: 
> > > CSTS=0x, PCI_STATUS=0x
> > > [  760.701468] nvme nvme1: EH 0: after shutdown, top eh: 1
> > > [  760.727099] pci_raw_set_power_state: 62 callbacks suppressed
> > > [  760.727103] nvme :86:00.0: Refused to change power state, 
> > > currently in D3
> > 
> > EH may not cover this kind of failure, so it fails in the 1st try.
> 
> Indeed, the test is simulating a permanently broken link, so recovery is
> not expected. A success in this case is just completing driver
> unbinding.
>  
> > > [  760.727483] nvme nvme1: EH 0: state 4, eh_done -19, top eh 1
> > > [  760.727485] nvme nvme1: EH 0: after recovery -19
> > > [  760.727488] nvme nvme1: EH: fail controller
> > 
> > The above issue (hang in nvme_remove()) is still an old issue, which
> > happens because the queues are kept quiesced during remove, so could you
> > please test the following change?
> > 
> > diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> > index 1dec353388be..c78e5a0cde06 100644
> > --- a/drivers/nvme/host/core.c
> > +++ b/drivers/nvme/host/core.c
> > @@ -3254,6 +3254,11 @@ void nvme_remove_namespaces(struct nvme_ctrl *ctrl)
> >  */
> > if (ctrl->state == NVME_CTRL_DEAD)
> > nvme_kill_queues(ctrl);
> > +   else {
> > +   if (ctrl->admin_q)
> > +   blk_mq_unquiesce_queue(ctrl->admin_q);
> > +   nvme_start_queues(ctrl);
> > +   }
> > 
> > down_write(&ctrl->namespaces_rwsem);
> > list_splice_init(&ctrl->namespaces, &ns_list);
> 
> The above won't actually do anything here since the broken link puts the
> controller in the DEAD state, so we've killed the queues which also
> unquiesces them.

I suggest you double-check whether the controller is set to DEAD
in nvme_remove(), since no log is dumped when this happens.

If the controller is set to DEAD and the queues are killed, all IO should
have been dispatched to the driver and nvme_queue_rq() will fail it all,
so there isn't any reason to see the hang in your stack trace log.

> 
> > BTW, in my environment it is hard to trigger this failure, so I did not
> > see this issue, but I did verify the nested EH, which can recover from
> > errors during reset.
> 
> It's actually pretty easy to trigger this one. I just modify block/019 to
> remove the check for a hotplug slot then run it on a block device that's
> not hot-pluggable.

I will try this test, and hope I can reproduce it in my environment.

Thanks,
Ming


Re: [PATCHv2 blktests] nvme: Add command timeout injection test

2018-05-14 Thread Keith Busch
On Mon, May 14, 2018 at 02:01:36PM -0700, Omar Sandoval wrote:
> On Mon, May 14, 2018 at 02:42:41PM -0600, Keith Busch wrote:
> > This test will run a background IO process and inject an admin command
> > with a very short timeout that is all but guaranteed to expire without
> > a completion: the async event request.
> > 
> > Signed-off-by: Keith Busch 
> > ---
> > v1 -> v2:
> > 
> >   Changed description since it's not a test for a specific
> >   regression-fixing patch.
> > 
> >   Added fio requirement.
> > 
> >   Missing quotes around device name.
> 
> I tried this on my QEMU VM, which apparently doesn't emulate this
> command:

Ah, qemu mainline isn't spec compliant and doesn't support AEN requests. I
was testing on real hardware.

This older one has support:

  http://git.infradead.org/users/kbusch/qemu-nvme.git/

It is quite old, though.
 
> Can this test use a loopback target like the nvme tests Johannes has
> added recently?

It looks like the in-kernel nvme target supports it. I'll have to give
it a try.


Re: [PATCH v9 1/2] arch/*: Add CONFIG_ARCH_HAVE_CMPXCHG64

2018-05-14 Thread Geert Uytterhoeven
Hi Bart,

On Mon, May 14, 2018 at 10:42 PM, Bart Van Assche
 wrote:
> On Mon, 2018-05-14 at 11:50 -0700, Max Filippov wrote:
>> On Mon, May 14, 2018 at 11:46 AM, Bart Van Assche
>>  wrote:
>> > The next patch in this series introduces a call to cmpxchg64()
>> > in the block layer core for those architectures on which this
>> > functionality is available. Make it possible to test whether
>> > cmpxchg64() is available by introducing CONFIG_ARCH_HAVE_CMPXCHG64.
>> >
>> > ---
>>
>> [...]
>>
>> > diff --git a/arch/xtensa/Kconfig b/arch/xtensa/Kconfig
>> > index c921e8bccdc8..8234278a821d 100644
>> > --- a/arch/xtensa/Kconfig
>> > +++ b/arch/xtensa/Kconfig
>> > @@ -23,6 +23,7 @@ config XTENSA
>> > select HAVE_DMA_CONTIGUOUS
>> > select HAVE_EXIT_THREAD
>> > select HAVE_FUNCTION_TRACER
>> > +   select ARCH_HAVE_CMPXCHG64
>>
>> This breaks alphabetical sorting of Kconfig entries.
>
> Hello Max,
>
> Thanks for the feedback. Do you perhaps know whether keeping names in
> alphabetical order is a requirement for arch/xtensa/Kconfig only or
> whether this is required for all arch/*/Kconfig files?

It's a good practice anyway, as it reduces the probability of merge conflicts.

Gr{oetje,eeting}s,

Geert

-- 
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds


Re: [PATCHv2 blktests] nvme: Add command timeout injection test

2018-05-14 Thread Omar Sandoval
On Mon, May 14, 2018 at 02:42:41PM -0600, Keith Busch wrote:
> This test will run a background IO process and inject an admin command
> with a very short timeout that is all but guaranteed to expire without
> a completion: the async event request.
> 
> Signed-off-by: Keith Busch 
> ---
> v1 -> v2:
> 
>   Changed description since it's not a test for a specific
>   regression-fixing patch.
> 
>   Added fio requirement.
> 
>   Missing quotes around device name.

I tried this on my QEMU VM, which apparently doesn't emulate this
command:

$ sudo ./check nvme/006
nvme/006 => nvme0n1 (test nvme admin command timeout handling with in-flight 
io) [failed]
runtime ...  20.505s
--- tests/nvme/006.out  2018-05-14 13:43:31.150443366 -0700
+++ results/nvme0n1/nvme/006.out.bad2018-05-14 13:57:39.334671448 
-0700
@@ -1,3 +1,3 @@
 Running nvme/006
-passthru: Interrupted system call
+NVMe Status:INVALID_OPCODE(4001) Command Result:
 Test complete

Can this test use a loopback target like the nvme tests Johannes has
added recently?

>  tests/nvme/005 | 42 ++
>  tests/nvme/005.out |  3 +++
>  2 files changed, 45 insertions(+)
>  create mode 100755 tests/nvme/005
>  create mode 100644 tests/nvme/005.out
> 
> diff --git a/tests/nvme/005 b/tests/nvme/005
> new file mode 100755
> index 000..f2fcf19
> --- /dev/null
> +++ b/tests/nvme/005
> @@ -0,0 +1,42 @@
> +#!/bin/bash
> +#
> +# Test for nvme timeout handling with inflight IO.
> +#
> +# Copyright (C) 2018 Keith Busch 
> +#
> +# This program is free software: you can redistribute it and/or modify
> +# it under the terms of the GNU General Public License as published by
> +# the Free Software Foundation, either version 3 of the License, or
> +# (at your option) any later version.
> +#
> +# This program is distributed in the hope that it will be useful,
> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +# GNU General Public License for more details.
> +#
> +# You should have received a copy of the GNU General Public License
> +# along with this program.  If not, see .
> +
> +DESCRIPTION="test nvme admin command timeout handling with in-flight io"
> +
> +QUICK=1
> +
> +requires() {
> + _have_fio && _have_program nvme
> +}
> +
> +test_device() {
> + echo "Running ${TEST_NAME}"
> +
> + # start fio job
> + _run_fio_rand_io --filename="${TEST_DEV}" --time_based --runtime=20 &
> +
> + sleep 5
> +
> + # send nvme admin command 'async event request', which  will surely 
> time out
> + nvme admin-passthru "${TEST_DEV}" -o 0xc --timeout=1
> +
> + wait
> +
> + echo "Test complete"
> +}
> diff --git a/tests/nvme/005.out b/tests/nvme/005.out
> new file mode 100644
> index 000..f4ff837
> --- /dev/null
> +++ b/tests/nvme/005.out
> @@ -0,0 +1,3 @@
> +Running nvme/005
> +passthru: Interrupted system call
> +Test complete
> -- 
> 2.14.3
> 


Re: [PATCH v9 1/2] arch/*: Add CONFIG_ARCH_HAVE_CMPXCHG64

2018-05-14 Thread Bart Van Assche
On Mon, 2018-05-14 at 11:50 -0700, Max Filippov wrote:
> On Mon, May 14, 2018 at 11:46 AM, Bart Van Assche
>  wrote:
> > The next patch in this series introduces a call to cmpxchg64()
> > in the block layer core for those architectures on which this
> > functionality is available. Make it possible to test whether
> > cmpxchg64() is available by introducing CONFIG_ARCH_HAVE_CMPXCHG64.
> > 
> > ---
> 
> [...]
> 
> > diff --git a/arch/xtensa/Kconfig b/arch/xtensa/Kconfig
> > index c921e8bccdc8..8234278a821d 100644
> > --- a/arch/xtensa/Kconfig
> > +++ b/arch/xtensa/Kconfig
> > @@ -23,6 +23,7 @@ config XTENSA
> > select HAVE_DMA_CONTIGUOUS
> > select HAVE_EXIT_THREAD
> > select HAVE_FUNCTION_TRACER
> > +   select ARCH_HAVE_CMPXCHG64
> 
> This breaks alphabetical sorting of Kconfig entries.

Hello Max,

Thanks for the feedback. Do you perhaps know whether keeping names in
alphabetical order is a requirement for arch/xtensa/Kconfig only or
whether this is required for all arch/*/Kconfig files?

Bart.

[PATCHv2 blktests] nvme: Add command timeout injection test

2018-05-14 Thread Keith Busch
This test will run a background IO process and inject an admin command
with a very short timeout that is all but guaranteed to expire without
a completion: the async event request.

Signed-off-by: Keith Busch 
---
v1 -> v2:

  Changed description since it's not a test for a specific
  regression-fixing patch.

  Added fio requirement.

  Missing quotes around device name.

 tests/nvme/005 | 42 ++
 tests/nvme/005.out |  3 +++
 2 files changed, 45 insertions(+)
 create mode 100755 tests/nvme/005
 create mode 100644 tests/nvme/005.out

diff --git a/tests/nvme/005 b/tests/nvme/005
new file mode 100755
index 000..f2fcf19
--- /dev/null
+++ b/tests/nvme/005
@@ -0,0 +1,42 @@
+#!/bin/bash
+#
+# Test for nvme timeout handling with inflight IO.
+#
+# Copyright (C) 2018 Keith Busch 
+#
+# This program is free software: you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation, either version 3 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see .
+
+DESCRIPTION="test nvme admin command timeout handling with in-flight io"
+
+QUICK=1
+
+requires() {
+   _have_fio && _have_program nvme
+}
+
+test_device() {
+   echo "Running ${TEST_NAME}"
+
+   # start fio job
+   _run_fio_rand_io --filename="${TEST_DEV}" --time_based --runtime=20 &
+
+   sleep 5
+
+   # send nvme admin command 'async event request', which  will surely 
time out
+   nvme admin-passthru "${TEST_DEV}" -o 0xc --timeout=1
+
+   wait
+
+   echo "Test complete"
+}
diff --git a/tests/nvme/005.out b/tests/nvme/005.out
new file mode 100644
index 000..f4ff837
--- /dev/null
+++ b/tests/nvme/005.out
@@ -0,0 +1,3 @@
+Running nvme/005
+passthru: Interrupted system call
+Test complete
-- 
2.14.3



Re: [PATCH 0/7] psi: pressure stall information for CPU, memory, and IO

2018-05-14 Thread Christopher Lameter
On Mon, 14 May 2018, Johannes Weiner wrote:

> Since I'm using the same model and infrastructure for memory and IO
> load as well, IMO it makes more sense to present them in a coherent
> interface instead of trying to retrofit and change the loadavg file,
> which might not even be possible.

Well, I keep looking at the loadavg output from numerous tools and then in
my mind I divide by the number of processors, guess whether any of the
threads would be doing I/O, and if I cannot figure that out, groan and run
"vmstat" for a while to figure it out.

Let's have some numbers there that make more sense, please.



Re: [PATCH blktests] nvme: Add command timeout injection test

2018-05-14 Thread Omar Sandoval
On Mon, May 14, 2018 at 02:02:37PM -0600, Keith Busch wrote:
> This test will run a background IO process and inject an admin command
> with a very short timeout that is all but guaranteed to expire without
> a completion: the async event request.

Thanks, a few comments below.

> Signed-off-by: Keith Busch 
> ---
>  tests/nvme/005 | 42 ++
>  tests/nvme/005.out |  3 +++
>  2 files changed, 45 insertions(+)
>  create mode 100755 tests/nvme/005
>  create mode 100644 tests/nvme/005.out
> 
> diff --git a/tests/nvme/005 b/tests/nvme/005
> new file mode 100755
> index 000..3fe9cbe
> --- /dev/null
> +++ b/tests/nvme/005
> @@ -0,0 +1,42 @@
> +#!/bin/bash
> +#
> +# Regression test for timeout handling.

Is there a specific patch that this is testing? If so, could you add:

Regression test for patch "so and so".

> +# Copyright (C) 2018 Keith Busch
> +#
> +# This program is free software: you can redistribute it and/or modify
> +# it under the terms of the GNU General Public License as published by
> +# the Free Software Foundation, either version 3 of the License, or
> +# (at your option) any later version.
> +#
> +# This program is distributed in the hope that it will be useful,
> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +# GNU General Public License for more details.
> +#
> +# You should have received a copy of the GNU General Public License
> +# along with this program.  If not, see .
> +
> +DESCRIPTION="test nvme admin command timeout handling with in-flight io"
> +
> +QUICK=1
> +
> +requires() {
> + _have_program nvme

Also needs && _have_fio.

> +}
> +
> +test_device() {
> + echo "Running ${TEST_NAME}"
> +
> + # start fio job
> + _run_fio_rand_io --filename="${TEST_DEV}" --time_based --runtime=20 &
> +
> + sleep 5
> +
> + # send nvme admin command 'async event request', which  will surely 
> time out
> + nvme admin-passthru ${TEST_DEV} -o 0xc --timeout=1

Missing quotes around ${TEST_DEV}.

> + wait
> +
> + echo "Test complete"
> +}
> diff --git a/tests/nvme/005.out b/tests/nvme/005.out
> new file mode 100644
> index 000..f4ff837
> --- /dev/null
> +++ b/tests/nvme/005.out
> @@ -0,0 +1,3 @@
> +Running nvme/005
> +passthru: Interrupted system call
> +Test complete
> -- 
> 2.14.3
> 


[PATCH blktests] nvme: Add command timeout injection test

2018-05-14 Thread Keith Busch
This test will run a background IO process and inject an admin command
with a very short timeout that is all but guaranteed to expire without
a completion: the async event request.

Signed-off-by: Keith Busch 
---
 tests/nvme/005 | 42 ++
 tests/nvme/005.out |  3 +++
 2 files changed, 45 insertions(+)
 create mode 100755 tests/nvme/005
 create mode 100644 tests/nvme/005.out

diff --git a/tests/nvme/005 b/tests/nvme/005
new file mode 100755
index 000..3fe9cbe
--- /dev/null
+++ b/tests/nvme/005
@@ -0,0 +1,42 @@
+#!/bin/bash
+#
+# Regression test for timeout handling.
+#
+# Copyright (C) 2018 Keith Busch
+#
+# This program is free software: you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation, either version 3 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see .
+
+DESCRIPTION="test nvme admin command timeout handling with in-flight io"
+
+QUICK=1
+
+requires() {
+   _have_program nvme
+}
+
+test_device() {
+   echo "Running ${TEST_NAME}"
+
+   # start fio job
+   _run_fio_rand_io --filename="${TEST_DEV}" --time_based --runtime=20 &
+
+   sleep 5
+
+   # send nvme admin command 'async event request', which  will surely 
time out
+   nvme admin-passthru ${TEST_DEV} -o 0xc --timeout=1
+
+   wait
+
+   echo "Test complete"
+}
diff --git a/tests/nvme/005.out b/tests/nvme/005.out
new file mode 100644
index 000..f4ff837
--- /dev/null
+++ b/tests/nvme/005.out
@@ -0,0 +1,3 @@
+Running nvme/005
+passthru: Interrupted system call
+Test complete
-- 
2.14.3



Re: [PATCH 00/10] Misc block layer patches for bcachefs

2018-05-14 Thread Kent Overstreet
Thanks!

On Mon, May 14, 2018 at 3:24 PM, Jens Axboe  wrote:
> On 5/8/18 7:33 PM, Kent Overstreet wrote:
>>  - Add separately allowed mempools, biosets: bcachefs uses both all over the
>>place
>>
>>  - Bit of utility code - bio_copy_data_iter(), zero_fill_bio_iter()
>>
>>  - bio_list_copy_data(), the bi_next check - defensiveness because of a bug I
>>had fun chasing down at one point
>>
>>  - add some exports, because bcachefs does dio its own way
>>  - show whether fua is supported in sysfs, because I don't know of anything 
>> that
>>exports whether the _block layer_ specifically thinks fua is supported.
>
> Thanks Kent, applied for 4.18 with the update patch 1.
>
> --
> Jens Axboe
>


Re: [PATCH 00/10] Misc block layer patches for bcachefs

2018-05-14 Thread Jens Axboe
On 5/8/18 7:33 PM, Kent Overstreet wrote:
>  - Add separately allowed mempools, biosets: bcachefs uses both all over the
>place
> 
>  - Bit of utility code - bio_copy_data_iter(), zero_fill_bio_iter()
> 
>  - bio_list_copy_data(), the bi_next check - defensiveness because of a bug I
>had fun chasing down at one point
> 
>  - add some exports, because bcachefs does dio its own way
>  - show whether fua is supported in sysfs, because I don't know of anything 
> that
>exports whether the _block layer_ specifically thinks fua is supported.

Thanks Kent, applied for 4.18 with the update patch 1.

-- 
Jens Axboe
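
For context, here is a rough sketch of the kind of sysfs attribute the last
quoted bullet refers to (whether the block layer itself believes FUA is
supported). The names and layout are assumptions based on the description
above, not a quote of the applied patch:

static ssize_t queue_fua_show(struct request_queue *q, char *page)
{
	/* report the block layer's own view of FUA support as 0/1 */
	return sprintf(page, "%u\n",
		       test_bit(QUEUE_FLAG_FUA, &q->queue_flags));
}

static struct queue_sysfs_entry queue_fua_entry = {
	.attr = { .name = "fua", .mode = 0444 },	/* /sys/block/<dev>/queue/fua */
	.show = queue_fua_show,
};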



Re: [PATCH 01/10] mempool: Add mempool_init()/mempool_exit()

2018-05-14 Thread Kent Overstreet
On Fri, May 11, 2018 at 03:11:45PM -0600, Jens Axboe wrote:
> On 5/8/18 7:33 PM, Kent Overstreet wrote:
> > Allows mempools to be embedded in other structs, getting rid of a
> > pointer indirection from allocation fastpaths.
> > 
> > mempool_exit() is safe to call on an uninitialized but zeroed mempool.
> 
> Looks fine to me. I'd like to carry it through the block branch, as some
> of the following cleanups depend on it. Kent, can you post a v2 with
> the destroy -> exit typo fixed up?
> 
> But would be nice to have someone sign off on it...

Done - it's now up in my git repo:
http://evilpiepirate.org/git/bcachefs.git bcachefs-block
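
For context, a minimal sketch of the usage pattern mempool_init() and
mempool_exit() enable per the description above: the pool lives inside the
driver's own struct instead of behind a mempool_t pointer. The
mempool_init_page_pool() helper name follows the existing
mempool_create_page_pool() convention and is an assumption here:

#include <linux/mempool.h>

struct my_ctx {
	mempool_t page_pool;		/* embedded, no pointer indirection */
};

static int my_ctx_init(struct my_ctx *c)
{
	/* keep at least 4 order-0 pages in reserve for forward progress */
	return mempool_init_page_pool(&c->page_pool, 4, 0);
}

static void my_ctx_exit(struct my_ctx *c)
{
	/*
	 * Per the description above, mempool_exit() is safe even if
	 * my_ctx_init() never ran, as long as the struct was zeroed.
	 */
	mempool_exit(&c->page_pool);
}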


Re: [PATCH 0/7] psi: pressure stall information for CPU, memory, and IO

2018-05-14 Thread Johannes Weiner
On Mon, May 14, 2018 at 03:39:33PM +, Christopher Lameter wrote:
> On Mon, 7 May 2018, Johannes Weiner wrote:
> 
> > What to make of this number? If CPU utilization is at 100% and CPU
> > pressure is 0, it means the system is perfectly utilized, with one
> > runnable thread per CPU and nobody waiting. At two or more runnable
> > tasks per CPU, the system is 100% overcommitted and the pressure
> > average will indicate as much. From a utilization perspective this is
> > a great state of course: no CPU cycles are being wasted, even when 50%
> > of the threads were to go idle (and most workloads do vary). From the
> > perspective of the individual job it's not great, however, and they
> > might do better with more resources. Depending on what your priority
> > is, an elevated "some" number may or may not require action.
> 
> This looks awfully similar to loadavg. Problem is that loadavg gets
screwed up by tasks blocked waiting for I/O. Isn't there some way to fix
> loadavg instead?

Counting iowaiting tasks is one thing, but there are a few more things
that make it hard to use for telling the impact of CPU competition:

- It's not normalized to available CPU count. The loadavg in isolation
  doesn't mean anything, and you have to know the number of CPUs and
  any CPU bindings / restrictions in effect, which presents at least
  some difficulty when monitoring a big heterogeneous fleet.

- The way it's sampled makes it impossible to use for latencies. You
  could be mostly idle but periodically have herds of tasks competing
  for the CPU for short, low-latency operations. Even if we changed
  this in the implementation, you're still stuck with the interface
  that has...

- ...a short-term load window of 1m. This is generally fairly coarse
  for something that can be loaded and unloaded as abruptly as the CPU

I'm trying to fix these with a portable way of aggregating multi-cpu
states, as well as tracking the true time spent in a state instead of
sampling it. Plus a smaller short-term window of 10s, but that's
almost irrelevant because I'm exporting the absolute state time clock
so you can calculate your own averages over any time window you want.
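
As a rough sketch of the kind of consumer this enables: sample the absolute
stall clock twice and divide by your own window. The file path and format
below are placeholders for illustration, not the actual interface:

#include <inttypes.h>
#include <stdio.h>
#include <unistd.h>

/* Assumed format: a single line reading "total=<nanoseconds>". */
static uint64_t read_total_stall_ns(const char *path)
{
	uint64_t total = 0;
	FILE *f = fopen(path, "r");

	if (f) {
		fscanf(f, "total=%" SCNu64, &total);
		fclose(f);
	}
	return total;
}

int main(void)
{
	const char *path = "/proc/pressure/cpu";	/* placeholder path */
	unsigned int window = 10;			/* seconds, pick any */
	uint64_t t0 = read_total_stall_ns(path);

	sleep(window);

	uint64_t t1 = read_total_stall_ns(path);

	/* fraction of the window during which tasks were stalled on CPU */
	printf("cpu pressure over %us: %.2f%%\n", window,
	       100.0 * (double)(t1 - t0) / (window * 1e9));
	return 0;
}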

Since I'm using the same model and infrastructure for memory and IO
load as well, IMO it makes more sense to present them in a coherent
interface instead of trying to retrofit and change the loadavg file,
which might not even be possible.


Re: [PATCH v9 1/2] arch/*: Add CONFIG_ARCH_HAVE_CMPXCHG64

2018-05-14 Thread Max Filippov
On Mon, May 14, 2018 at 11:46 AM, Bart Van Assche
 wrote:
> The next patch in this series introduces a call to cmpxchg64()
> in the block layer core for those architectures on which this
> functionality is available. Make it possible to test whether
> cmpxchg64() is available by introducing CONFIG_ARCH_HAVE_CMPXCHG64.
>
> ---

[...]

> diff --git a/arch/xtensa/Kconfig b/arch/xtensa/Kconfig
> index c921e8bccdc8..8234278a821d 100644
> --- a/arch/xtensa/Kconfig
> +++ b/arch/xtensa/Kconfig
> @@ -23,6 +23,7 @@ config XTENSA
> select HAVE_DMA_CONTIGUOUS
> select HAVE_EXIT_THREAD
> select HAVE_FUNCTION_TRACER
> +   select ARCH_HAVE_CMPXCHG64

This breaks alphabetical sorting of Kconfig entries.

> select HAVE_FUTEX_CMPXCHG if !MMU
> select HAVE_HW_BREAKPOINT if PERF_EVENTS
> select HAVE_IRQ_TIME_ACCOUNTING

-- 
Thanks.
-- Max


[PATCH v9 1/2] arch/*: Add CONFIG_ARCH_HAVE_CMPXCHG64

2018-05-14 Thread Bart Van Assche
The next patch in this series introduces a call to cmpxchg64()
in the block layer core for those architectures on which this
functionality is available. Make it possible to test whether
cmpxchg64() is available by introducing CONFIG_ARCH_HAVE_CMPXCHG64.
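
For readers unfamiliar with the primitive, here is a rough sketch of how
generic code can key off the new config symbol, falling back to a lock where
cmpxchg64() is not provided. This is an illustration only; the actual user is
the next patch in the series and may look different:

#include <linux/atomic.h>
#include <linux/spinlock.h>
#include <linux/types.h>

static DEFINE_SPINLOCK(fallback_lock);	/* only used without cmpxchg64() */

/* Atomically update a 64-bit word, with or without cmpxchg64(). */
static u64 update_word(u64 *word, u64 old, u64 new)
{
#ifdef CONFIG_ARCH_HAVE_CMPXCHG64
	return cmpxchg64(word, old, new);
#else
	u64 prev;

	spin_lock(&fallback_lock);
	prev = *word;
	if (prev == old)
		*word = new;
	spin_unlock(&fallback_lock);
	return prev;
#endif
}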

Signed-off-by: Bart Van Assche 
Cc: Catalin Marinas 
Cc: Will Deacon 
Cc: Tony Luck 
Cc: Fenghua Yu 
Cc: Geert Uytterhoeven 
Cc: "James E.J. Bottomley" 
Cc: Helge Deller 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Cc: Martin Schwidefsky 
Cc: Heiko Carstens 
Cc: David S. Miller 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: H. Peter Anvin 
Cc: Chris Zankel 
Cc: Max Filippov 
Cc: Arnd Bergmann 

---
 .../features/locking/cmpxchg64/arch-support.txt| 31 ++
 arch/Kconfig   |  3 +++
 arch/alpha/Kconfig |  1 +
 arch/arm/Kconfig   |  1 +
 arch/arm64/Kconfig |  1 +
 arch/ia64/Kconfig  |  1 +
 arch/m68k/Kconfig  |  1 +
 arch/mips/Kconfig  |  1 +
 arch/parisc/Kconfig|  1 +
 arch/powerpc/Kconfig   |  1 +
 arch/s390/Kconfig  |  1 +
 arch/sparc/Kconfig |  1 +
 arch/x86/Kconfig   |  1 +
 arch/xtensa/Kconfig|  1 +
 14 files changed, 46 insertions(+)
 create mode 100644 Documentation/features/locking/cmpxchg64/arch-support.txt

diff --git a/Documentation/features/locking/cmpxchg64/arch-support.txt 
b/Documentation/features/locking/cmpxchg64/arch-support.txt
new file mode 100644
index ..65b3290ce5d5
--- /dev/null
+++ b/Documentation/features/locking/cmpxchg64/arch-support.txt
@@ -0,0 +1,31 @@
+#
+# Feature name:          cmpxchg64
+#         Kconfig:       ARCH_HAVE_CMPXCHG64
+#         description:   arch supports the cmpxchg64() API
+#
+    -----------------------
+    |         arch |status|
+    -----------------------
+    |       alpha: |  ok  |
+    |         arc: | TODO |
+    |         arm: |!thumb|
+    |       arm64: |  ok  |
+    |         c6x: | TODO |
+    |       h8300: | TODO |
+    |     hexagon: | TODO |
+    |        ia64: |  ok  |
+    |        m68k: |  ok  |
+    |  microblaze: | TODO |
+    |        mips: |64-bit|
+    |       nios2: | TODO |
+    |    openrisc: | TODO |
+    |      parisc: |  ok  |
+    |     powerpc: |64-bit|
+    |        s390: |  ok  |
+    |          sh: | TODO |
+    |       sparc: |  ok  |
+    |          um: | TODO |
+    |   unicore32: | TODO |
+    |         x86: |  ok  |
+    |      xtensa: |  ok  |
+    -----------------------
diff --git a/arch/Kconfig b/arch/Kconfig
index 8e0d665c8d53..bd54eb125b15 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -358,6 +358,9 @@ config HAVE_ALIGNED_STRUCT_PAGE
  on a struct page for better performance. However selecting this
  might increase the size of a struct page by a word.
 
+config ARCH_HAVE_CMPXCHG64
+   bool
+
 config HAVE_CMPXCHG_LOCAL
bool
 
diff --git a/arch/alpha/Kconfig b/arch/alpha/Kconfig
index b2022885ced8..94fc28e30ed2 100644
--- a/arch/alpha/Kconfig
+++ b/arch/alpha/Kconfig
@@ -16,6 +16,7 @@ config ALPHA
select GENERIC_IRQ_SHOW
select ARCH_WANT_IPC_PARSE_VERSION
select ARCH_HAVE_NMI_SAFE_CMPXCHG
+   select ARCH_HAVE_CMPXCHG64
select AUDIT_ARCH
select GENERIC_CLOCKEVENTS
select GENERIC_CPU_VULNERABILITIES
diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index a7f8e7f4b88f..7be06e46d329 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -21,6 +21,7 @@ config ARM
select ARCH_SUPPORTS_ATOMIC_RMW
select ARCH_USE_BUILTIN_BSWAP
select ARCH_USE_CMPXCHG_LOCKREF
+   select ARCH_HAVE_CMPXCHG64 if !THUMB2_KERNEL
select ARCH_WANT_IPC_PARSE_VERSION
select BUILDTIME_EXTABLE_SORT if MMU
select CLONE_BACKWARDS
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index eb2cf4938f6d..998e454a51b5 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -23,6 +23,7 @@ config ARM64
select ARCH_HAS_STRICT_MODULE_RWX
select ARCH_HAS_TICK_BROADCAST if GENERIC_CLOCKEVENTS_BROADCAST
select ARCH_HAVE_NMI_SAFE_CMPXCHG
+   select ARCH_HAVE_CMPXCHG64
select ARCH_INLINE_READ_LOCK if !PREEMPT
select ARCH_INLINE_READ_LOCK_BH if !PREEMPT

[PATCH v9 2/2] blk-mq: Rework blk-mq timeout handling again

2018-05-14 Thread Bart Van Assche
Recently the blk-mq timeout handling code was reworked. See also Tejun
Heo, "[PATCHSET v4] blk-mq: reimplement timeout handling", 08 Jan 2018
(https://www.mail-archive.com/linux-block@vger.kernel.org/msg16985.html).
This patch reworks the blk-mq timeout handling code again. The timeout
handling code is simplified by introducing a state machine per request.
This change avoids that the blk-mq timeout handling code ignores
completions that occur after blk_mq_check_expired() has been called and
before blk_mq_rq_timed_out() has reset rq->aborted_gstate. If a block
driver timeout handler always returns BLK_EH_RESET_TIMER then the result
will be that the request never terminates.

Fix this race as follows:
- Replace the __deadline member of struct request by a new member
  called das that contains the generation number, state and deadline.
  Only 32 bits are used for the deadline field such that all three
  fields occupy only 64 bits. This change reduces the maximum supported
  request timeout value from (2**63/HZ) to (2**31/HZ); with HZ=1000 that is
  still roughly 24.8 days. A rough sketch of such a packed layout follows
  this list.
- Remove all request member variables that became superfluous due to
  this change: gstate, gstate_seq and aborted_gstate_sync.
- Remove the request state information that became superfluous due to
  this patch, namely RQF_MQ_TIMEOUT_EXPIRED.
- Remove the code that became superfluous due to this change, namely
  the RCU lock and unlock statements in blk_mq_complete_request() and
  also the synchronize_rcu() call in the timeout handler.
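
As a rough illustration of the packed layout described in the first bullet
above (field names and bit widths are guesses for illustration, not the
actual patch):

union blk_deadline_and_state {
	struct {
		u32 generation:30;	/* detects reuse across completions */
		u32 state:2;		/* MQ_RQ_IDLE / IN_FLIGHT / COMPLETE */
		u32 deadline;		/* jiffies, truncated to 32 bits */
	};
	u64 das;			/* updated as one 64-bit word */
};

On architectures without cmpxchg64() the notes below describe protecting
updates of this word with a spinlock instead.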

Notes:
- A spinlock is used to protect atomic changes of rq->das on those
  architectures that do not provide a cmpxchg64() implementation.
- Atomic instructions are only used to update the request state if
  a concurrent request state change could be in progress.
- blk_add_timer() has been split into two functions - one for the
  legacy block layer and one for blk-mq.

Signed-off-by: Bart Van Assche 
Cc: Tejun Heo 
Cc: Christoph Hellwig 
Cc: Jianchao Wang 
Cc: Ming Lei 
Cc: Sebastian Ott 
Cc: Sagi Grimberg 
Cc: Israel Rukshin ,
Cc: Max Gurtovoy 
---
 block/blk-core.c   |   6 --
 block/blk-mq-debugfs.c |   1 -
 block/blk-mq.c | 183 ++---
 block/blk-mq.h | 117 +++
 block/blk-timeout.c|  95 ++---
 block/blk.h|  14 ++--
 include/linux/blkdev.h |  47 ++---
 7 files changed, 230 insertions(+), 233 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 341501c5e239..a7301fcda734 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -198,12 +198,6 @@ void blk_rq_init(struct request_queue *q, struct request 
*rq)
rq->internal_tag = -1;
rq->start_time_ns = ktime_get_ns();
rq->part = NULL;
-   seqcount_init(&rq->gstate_seq);
-   u64_stats_init(&rq->aborted_gstate_sync);
-   /*
-* See comment of blk_mq_init_request
-*/
-   WRITE_ONCE(rq->gstate, MQ_RQ_GEN_INC);
 }
 EXPORT_SYMBOL(blk_rq_init);
 
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 3080e18cb859..ffa622366922 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -344,7 +344,6 @@ static const char *const rqf_name[] = {
RQF_NAME(STATS),
RQF_NAME(SPECIAL_PAYLOAD),
RQF_NAME(ZONE_WRITE_LOCKED),
-   RQF_NAME(MQ_TIMEOUT_EXPIRED),
RQF_NAME(MQ_POLL_SLEPT),
 };
 #undef RQF_NAME
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 64630caaf27e..40c9aa085613 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -318,7 +318,7 @@ static struct request *blk_mq_rq_ctx_init(struct 
blk_mq_alloc_data *data,
rq->special = NULL;
/* tag was already set */
rq->extra_len = 0;
-   rq->__deadline = 0;
+   WARN_ON_ONCE(rq->das.state != MQ_RQ_IDLE);
 
	INIT_LIST_HEAD(&rq->timeout_list);
rq->timeout = 0;
@@ -494,7 +494,8 @@ void blk_mq_free_request(struct request *rq)
if (blk_rq_rl(rq))
blk_put_rl(blk_rq_rl(rq));
 
-   blk_mq_rq_update_state(rq, MQ_RQ_IDLE);
+   if (!blk_mq_change_rq_state(rq, blk_mq_rq_state(rq), MQ_RQ_IDLE))
+   WARN_ON_ONCE(true);
if (rq->tag != -1)
blk_mq_put_tag(hctx, hctx->tags, ctx, rq->tag);
if (sched_tag != -1)
@@ -547,8 +548,7 @@ static void __blk_mq_complete_request(struct request *rq)
bool shared = false;
int cpu;
 
-   WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_IN_FLIGHT);
-   blk_mq_rq_update_state(rq, MQ_RQ_COMPLETE);
+   WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_COMPLETE);
 
if (rq->internal_tag != -1)
blk_mq_sched_completed_request(rq);
@@ -593,36 +593,6 @@ static void hctx_lock(struct blk_mq_hw_ctx *hctx, int 
*srcu_idx)
*srcu_idx = srcu_read_lock(hctx->srcu);
 }
 
-static 

[PATCH v9 0/2] blk-mq: Rework blk-mq timeout handling again

2018-05-14 Thread Bart Van Assche
Hello Jens,

This is the ninth incarnation of the blk-mq timeout handling rework. All
previously posted comments have been addressed. Please consider this patch
series for inclusion in the upstream kernel.

Bart.

Changes compared to v8:
- Split into two patches.
- Moved the spin_lock_init() call from blk_mq_rq_ctx_init() into
  blk_mq_init_request().
- Fixed the deadline set by blk_add_timer().
- Surrounded the das_lock member with #ifndef CONFIG_ARCH_HAVE_CMPXCHG64 /
  #endif.

Changes compared to v7:
- Fixed the generation number mechanism. Note: with this patch applied the
  behavior of the block layer does not depend on the generation number.
- Added more 32-bit architectures to the list of architectures on which
  cmpxchg64() should not be used.

Changes compared to v6:
- Used a union instead of bit manipulations to store multiple values into
  a single 64-bit field.
- Reduced the size of the timeout field from 64 to 32 bits.
- Made sure that the block layer still builds with this patch applied
  for the sh and mips architectures.
- Fixed two sparse warnings that were introduced by this patch in the
  WRITE_ONCE() calls.

Changes compared to v5:
- Restored the synchronize_rcu() call between marking a request for timeout
  handling and the actual timeout handling to avoid that timeout handling
  starts while .queue_rq() is still in progress if the timeout is very short.
- Only use cmpxchg() if another context could attempt to change the request
  state concurrently. Use WRITE_ONCE() otherwise.

Changes compared to v4:
- Addressed multiple review comments from Christoph. The most important are
  that atomic_long_cmpxchg() has been changed into cmpxchg() and also that
  there is now a nice and clean split between the legacy and blk-mq versions
  of blk_add_timer().
- Changed the patch name and modified the patch description because there is
  disagreement about whether or not the v4.16 blk-mq core can complete a
  single request twice. Kept the "Cc: stable" tag because of
  https://bugzilla.kernel.org/show_bug.cgi?id=199077.

Changes compared to v3 (see also 
https://www.mail-archive.com/linux-block@vger.kernel.org/msg20073.html):
- Removed the spinlock again that was introduced to protect the request state.
  v4 uses atomic_long_cmpxchg() instead.
- Split __deadline into two variables - one for the legacy block layer and one
  for blk-mq.

Changes compared to v2 
(https://www.mail-archive.com/linux-block@vger.kernel.org/msg18338.html):
- Rebased and retested on top of kernel v4.16.

Changes compared to v1 
(https://www.mail-archive.com/linux-block@vger.kernel.org/msg18089.html):
- Removed the gstate and aborted_gstate members of struct request and used
  the __deadline member to encode both the generation and state information.
Bart Van Assche (2):
  arch/*: Add CONFIG_ARCH_HAVE_CMPXCHG64
  blk-mq: Rework blk-mq timeout handling again

 .../features/locking/cmpxchg64/arch-support.txt|  31 
 arch/Kconfig   |   3 +
 arch/alpha/Kconfig |   1 +
 arch/arm/Kconfig   |   1 +
 arch/arm64/Kconfig |   1 +
 arch/ia64/Kconfig  |   1 +
 arch/m68k/Kconfig  |   1 +
 arch/mips/Kconfig  |   1 +
 arch/parisc/Kconfig|   1 +
 arch/powerpc/Kconfig   |   1 +
 arch/s390/Kconfig  |   1 +
 arch/sparc/Kconfig |   1 +
 arch/x86/Kconfig   |   1 +
 arch/xtensa/Kconfig|   1 +
 block/blk-core.c   |   6 -
 block/blk-mq-debugfs.c |   1 -
 block/blk-mq.c | 183 ++---
 block/blk-mq.h | 117 ++---
 block/blk-timeout.c|  95 ++-
 block/blk.h|  14 +-
 include/linux/blkdev.h |  47 +++---
 21 files changed, 276 insertions(+), 233 deletions(-)
 create mode 100644 Documentation/features/locking/cmpxchg64/arch-support.txt

-- 
2.16.3



Re: [PATCH] blk-mq: Rework blk-mq timeout handling again

2018-05-14 Thread Bart Van Assche
On Mon, 2018-05-14 at 13:15 +0800, jianchao.wang wrote:
> A 32-bit deadline is indeed OK for judging whether a request has timed out,
> but how is the expires value determined for __blk_add_timer -> mod_timer?
> As we know, when a request is started we need to arm a timer for it,
> and the expires value is 'unsigned long'.

This has been addressed in the just posted version 9. Can you have a look?
Although I had tested timeout handling I think this had not caused my tests
to fail because the following code limits request timeout examination to 5s
in the future:

#define BLK_MAX_TIMEOUT (5 * HZ)

unsigned long blk_rq_timeout(unsigned long timeout)
{
unsigned long maxt;

maxt = round_jiffies_up(jiffies + BLK_MAX_TIMEOUT);
if (time_after(timeout, maxt))
timeout = maxt;

return timeout;
}

Thanks,

Bart.




Re: [PATCH BUGFIX] block, bfq: postpone rq preparation to insert or merge

2018-05-14 Thread Paolo Valente


> Il giorno 14 mag 2018, alle ore 19:31, Jens Axboe  ha 
> scritto:
> 
> On 5/14/18 11:16 AM, Paolo Valente wrote:
>> 
>> 
>>> Il giorno 10 mag 2018, alle ore 18:14, Bart Van Assche 
>>>  ha scritto:
>>> 
>>> On Fri, 2018-05-04 at 19:17 +0200, Paolo Valente wrote:
 When invoked for an I/O request rq, [ ... ]
>>> 
>>> Tested-by: Bart Van Assche 
>>> 
>>> 
>>> 
>> 
>> Any decision for this fix, Jens?
> 
> Guess I didn't reply, but I did commit this on Thursday.
> 

Great, thank you!

Paolo

> -- 
> Jens Axboe
> 



Re: [PATCH 0/7] psi: pressure stall information for CPU, memory, and IO

2018-05-14 Thread Bart Van Assche

On 05/14/18 08:39, Christopher Lameter wrote:

On Mon, 7 May 2018, Johannes Weiner wrote:

What to make of this number? If CPU utilization is at 100% and CPU
pressure is 0, it means the system is perfectly utilized, with one
runnable thread per CPU and nobody waiting. At two or more runnable
tasks per CPU, the system is 100% overcommitted and the pressure
average will indicate as much. From a utilization perspective this is
a great state of course: no CPU cycles are being wasted, even when 50%
of the threads were to go idle (and most workloads do vary). From the
perspective of the individual job it's not great, however, and they
might do better with more resources. Depending on what your priority
is, an elevated "some" number may or may not require action.


This looks awfully similar to loadavg. Problem is that loadavg gets
screwed up by tasks blocked waiting for I/O. Isn't there some way to fix
loadavg instead?


The following article explains why it probably made sense in 1993 to 
include TASK_UNINTERRUPTIBLE in loadavg and also why this no longer 
makes sense today:


http://www.brendangregg.com/blog/2017-08-08/linux-load-averages.html

Bart.


Re: [PATCH BUGFIX] block, bfq: postpone rq preparation to insert or merge

2018-05-14 Thread Jens Axboe
On 5/14/18 11:16 AM, Paolo Valente wrote:
> 
> 
>> Il giorno 10 mag 2018, alle ore 18:14, Bart Van Assche 
>>  ha scritto:
>>
>> On Fri, 2018-05-04 at 19:17 +0200, Paolo Valente wrote:
>>> When invoked for an I/O request rq, [ ... ]
>>
>> Tested-by: Bart Van Assche 
>>
>>
>>
> 
> Any decision for this fix, Jens?

Guess I didn't reply, but I did commit this on Thursday.

-- 
Jens Axboe



Re: [PATCH BUGFIX] block, bfq: postpone rq preparation to insert or merge

2018-05-14 Thread Paolo Valente


> Il giorno 10 mag 2018, alle ore 18:14, Bart Van Assche 
>  ha scritto:
> 
> On Fri, 2018-05-04 at 19:17 +0200, Paolo Valente wrote:
>> When invoked for an I/O request rq, [ ... ]
> 
> Tested-by: Bart Van Assche 
> 
> 
> 

Any decision for this fix, Jens?

Thanks,
Paolo

Re: [PATCH 31/33] iomap: add support for sub-pagesize buffered I/O without buffer heads

2018-05-14 Thread Goldwyn Rodrigues


On 05/09/2018 02:48 AM, Christoph Hellwig wrote:
> After already supporting a simple implementation of buffered writes for
> the blocksize == PAGE_SIZE case in the last commit this adds full support
> even for smaller block sizes.   There are three bits of per-block
> information in the buffer_head structure that really matter for the iomap
> read and write path:
> 
>  - uptodate status (BH_uptodate)
>  - marked as currently under read I/O (BH_Async_Read)
>  - marked as currently under write I/O (BH_Async_Write)
> 
> Instead of having new per-block structures this now adds a per-page
> structure called struct iomap_page to track this information in a slightly
> different form:
> 
>  - a bitmap for the per-block uptodate status.  For worst case of a 64k
>page size system this bitmap needs to contain 128 bits.  For the
>typical 4k page size case it only needs 8 bits, although we still
>need a full unsigned long due to the way the atomic bitmap API works.
>  - two atomic_t counters are used to track the outstanding read and write
>counts
> 
> There is quite a bit of boilerplate code as the buffered I/O path uses
> various helper methods, but the actual code is very straight forward.
> 
> In this commit the code can't actually be used yet, as we need to
> switch from the old implementation to the new one together with the
> XFS writeback code.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  fs/iomap.c| 262 +-
>  include/linux/iomap.h |  32 ++
>  2 files changed, 264 insertions(+), 30 deletions(-)
> 
> diff --git a/fs/iomap.c b/fs/iomap.c
> index a3861945504f..4e7ac6aa88ef 100644
> --- a/fs/iomap.c
> +++ b/fs/iomap.c
> @@ -17,6 +17,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -109,6 +110,107 @@ iomap_block_needs_zeroing(struct inode *inode, loff_t 
> pos, struct iomap *iomap)
> return iomap->type != IOMAP_MAPPED || pos > i_size_read(inode);
>  }
>  
> +static struct iomap_page *
> +iomap_page_create(struct inode *inode, struct page *page)
> +{
> + struct iomap_page *iop = to_iomap_page(page);
> +
> + if (iop || i_blocksize(inode) == PAGE_SIZE)
> + return iop;

Why is this an equal comparison operator? Shouldn't this be >= to
include filesystem blocksize greater than PAGE_SIZE?

-- 
Goldwyn
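
For readers skimming the thread, a rough sketch of the per-page structure
described above (one uptodate bit per block plus two in-flight I/O counters).
Field names are guesses based on the quoted description, not the actual
patch:

struct iomap_page {
	atomic_t	read_count;	/* blocks currently under read I/O  */
	atomic_t	write_count;	/* blocks currently under write I/O */
	/*
	 * One bit per block; 512-byte blocks are the smallest supported,
	 * so this is 8 bits for 4k pages and 128 bits for 64k pages.
	 */
	DECLARE_BITMAP(uptodate, PAGE_SIZE / 512);
};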


Re: [PATCH 0/7] psi: pressure stall information for CPU, memory, and IO

2018-05-14 Thread Christopher Lameter
On Mon, 7 May 2018, Johannes Weiner wrote:

> What to make of this number? If CPU utilization is at 100% and CPU
> pressure is 0, it means the system is perfectly utilized, with one
> runnable thread per CPU and nobody waiting. At two or more runnable
> tasks per CPU, the system is 100% overcommitted and the pressure
> average will indicate as much. From a utilization perspective this is
> a great state of course: no CPU cycles are being wasted, even when 50%
> of the threads were to go idle (and most workloads do vary). From the
> perspective of the individual job it's not great, however, and they
> might do better with more resources. Depending on what your priority
> is, an elevated "some" number may or may not require action.

This looks awfully similar to loadavg. Problem is that loadavg gets
screwed up by tasks blocked waiting for I/O. Isn't there some way to fix
loadavg instead?



Re: [PATCH 6/6] block: consistently use GFP_NOIO instead of __GFP_NORECLAIM

2018-05-14 Thread Bart Van Assche
On Wed, 2018-05-09 at 09:54 +0200, Christoph Hellwig wrote:
> Same numerical value (for now at least), but a much better documentation
> of intent.

There is a typo in the subject of this patch: __GFP_NORECLAIM should be
changed into __GFP_RECLAIM. Otherwise:

Reviewed-by: Bart Van Assche 





Re: [PATCH 5/6] block: use GFP_NOIO instead of __GFP_DIRECT_RECLAIM

2018-05-14 Thread Bart Van Assche
On Wed, 2018-05-09 at 09:54 +0200, Christoph Hellwig wrote:
> We just can't do I/O when doing block layer requests allocations,
> so use GFP_NOIO instead of the even more limited __GFP_DIRECT_RECLAIM.

Reviewed-by: Bart Van Assche 




Re: [PATCH 4/6] block: pass an explicit gfp_t to get_request

2018-05-14 Thread Bart Van Assche
On Wed, 2018-05-09 at 09:54 +0200, Christoph Hellwig wrote:
> blk_old_get_request already has it at hand, and in blk_queue_bio, which
> is the fast path, it is constant.

Reviewed-by: Bart Van Assche 




Re: [PATCH 3/6] block: sanitize blk_get_request calling conventions

2018-05-14 Thread Bart Van Assche
On Wed, 2018-05-09 at 09:54 +0200, Christoph Hellwig wrote:
> Switch everyone to blk_get_request_flags, and then rename
> blk_get_request_flags to blk_get_request.

Reviewed-by: Bart Van Assche 





Re: [PATCH 2/6] block: fix __get_request documentation

2018-05-14 Thread Bart Van Assche
On Wed, 2018-05-09 at 09:54 +0200, Christoph Hellwig wrote:
> Signed-off-by: Christoph Hellwig 

Reviewed-by: Bart Van Assche 





Re: [PATCH V5 0/9] nvme: pci: fix & improve timeout handling

2018-05-14 Thread Keith Busch
Hi Ming,

On Sat, May 12, 2018 at 08:21:22AM +0800, Ming Lei wrote:
> > [  760.679960] nvme nvme1: controller is down; will reset: CSTS=0x, 
> > PCI_STATUS=0x
> > [  760.701468] nvme nvme1: EH 0: after shutdown, top eh: 1
> > [  760.727099] pci_raw_set_power_state: 62 callbacks suppressed
> > [  760.727103] nvme :86:00.0: Refused to change power state, currently 
> > in D3
> 
> EH may not cover this kind of failure, so it fails in the 1st try.

Indeed, the test is simulating a permanently broken link, so recovery is
not expected. A success in this case is just completing driver
unbinding.
 
> > [  760.727483] nvme nvme1: EH 0: state 4, eh_done -19, top eh 1
> > [  760.727485] nvme nvme1: EH 0: after recovery -19
> > [  760.727488] nvme nvme1: EH: fail controller
> 
> The above issue (hang in nvme_remove()) is still an old issue, which
> happens because the queues are kept quiesced during remove, so could you
> please test the following change?
> 
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 1dec353388be..c78e5a0cde06 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -3254,6 +3254,11 @@ void nvme_remove_namespaces(struct nvme_ctrl *ctrl)
>  */
> if (ctrl->state == NVME_CTRL_DEAD)
> nvme_kill_queues(ctrl);
> +   else {
> +   if (ctrl->admin_q)
> +   blk_mq_unquiesce_queue(ctrl->admin_q);
> +   nvme_start_queues(ctrl);
> +   }
> 
> down_write(&ctrl->namespaces_rwsem);
> list_splice_init(&ctrl->namespaces, &ns_list);

The above won't actually do anything here since the broken link puts the
controller in the DEAD state, so we've killed the queues which also
unquiesces them.

> BTW, in my environment it is hard to trigger this failure, so I did not
> see this issue, but I did verify the nested EH, which can recover from
> errors during reset.

It's actually pretty easy to trigger this one. I just modify block/019 to
remove the check for a hotplug slot then run it on a block device that's
not hot-pluggable.


Re: [PATCH 1/6] scsi/osd: remove the gfp argument to osd_start_request

2018-05-14 Thread Bart Van Assche
On Wed, 2018-05-09 at 09:54 +0200, Christoph Hellwig wrote:
> Always GFP_KERNEL, and keeping it would cause serious complications for
> the next change.

This patch description is very brief. Shouldn't the description of this patch
mention whether or not any functionality is changed (I think no functionality
has been changed)?

Bart.





Re: fix confusion around GFP_* flags and blk_get_request

2018-05-14 Thread Jens Axboe
On 5/14/18 8:38 AM, Christoph Hellwig wrote:
> Jens, any comments?

Looks good to me.

-- 
Jens Axboe



Re: fix confusion around GFP_* flags and blk_get_request

2018-05-14 Thread Christoph Hellwig
Jens, any comments?

On Wed, May 09, 2018 at 09:54:02AM +0200, Christoph Hellwig wrote:
> Hi all,
> 
> this series sorts out the mess around how we use gfp flags in the
> block layer get_request interface.
> 
> Changes since RFC:
>   - don't switch to GFP_NOIO for allocations in blk_get_request.
> blk_get_request is used by the multipath code in potentially deadlock-prone
> areas, so this will need a separate audit and maybe a flag.
---end quoted text---


Re: [PATCH V5 0/9] nvme: pci: fix & improve timeout handling

2018-05-14 Thread Ming Lei
Hi Jianchao,

On Mon, May 14, 2018 at 06:05:50PM +0800, jianchao.wang wrote:
> Hi ming
> 
> On 05/14/2018 05:38 PM, Ming Lei wrote:
> >> Here is the deadlock scenario.
> >>
> >> nvme_eh_work // EH0
> >>   -> nvme_reset_dev //hold reset_lock
> >> -> nvme_setup_io_queues
> >>   -> nvme_create_io_queues
> >> -> nvme_create_queue
> >>   -> set nvmeq->cq_vector
> >>   -> adapter_alloc_cq
> >>   -> adapter_alloc_sq
> >>  irq has not been requested
> >>  io timeout 
> >> nvme_eh_work //EH1
> >>   -> nvme_dev_disable
> >> -> quiesce the adminq //> here 
> >> !
> >> -> nvme_suspend_queue
> >>   print out warning Trying to free 
> >> already-free IRQ 133
> >> -> nvme_cancel_request // complete 
> >> the timeout admin request
> >>   -> require reset_lock
> >>   -> adapter_delete_cq
> > If the admin IO submitted in adapter_alloc_sq() is timed out,
> > nvme_dev_disable() in EH1 will complete it which is set as 
> > REQ_FAILFAST_DRIVER,
> > then adapter_alloc_sq() should return error, and the whole reset in EH0
> > should have been terminated immediately.
> 
> Please refer to the following segment:
> 
> static int nvme_create_queue(struct nvme_queue *nvmeq, int qid)
> {
>   struct nvme_dev *dev = nvmeq->dev;
>   int result;
> ...
>   nvmeq->cq_vector = dev->num_vecs == 1 ? 0 : qid;
>   result = adapter_alloc_cq(dev, qid, nvmeq);
>   if (result < 0)
>   goto release_vector;
> 
>   result = adapter_alloc_sq(dev, qid, nvmeq);   // if timeout and failed 
> here
>   if (result < 0)
>   goto release_cq;
> 
>   nvme_init_queue(nvmeq, qid);
>   result = queue_request_irq(nvmeq);
>   if (result < 0)
>   goto release_sq;
> 
>   return result;
> 
>  release_sq:
>   dev->online_queues--;
>   adapter_delete_sq(dev, qid);
>  release_cq:// we will be here !
>   adapter_delete_cq(dev, qid);// another cq delete admin 
> command will be sent out.
>  release_vector:
>   nvmeq->cq_vector = -1;
>   return result;
> }

Given that the admin queue has been suspended, all admin IO should have
been terminated immediately, so could you test the following patch?

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index f509d37b2fb8..44e38be259f2 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1515,9 +1515,6 @@ static int nvme_suspend_queue(struct nvme_queue *nvmeq)
nvmeq->cq_vector = -1;
	spin_unlock_irq(&nvmeq->q_lock);
 
-   if (!nvmeq->qid && nvmeq->dev->ctrl.admin_q)
-   blk_mq_quiesce_queue(nvmeq->dev->ctrl.admin_q);
-
pci_free_irq(to_pci_dev(nvmeq->dev->dev), vector, nvmeq);
 
return 0;
@@ -1741,8 +1738,7 @@ static int nvme_alloc_admin_tags(struct nvme_dev *dev)
dev->ctrl.admin_q = NULL;
return -ENODEV;
}
-   } else
-   blk_mq_unquiesce_queue(dev->ctrl.admin_q);
+   }
 
return 0;
 }

> 
> 
> > 
> > I guess the issue should be that nvme_create_io_queues() ignores the 
> > failure.
> > 
> > Could you dump the stack trace of EH0 reset task? So that we may see
> > where EH0 reset kthread hangs.
> 
> root@will-ThinkCentre-M910s:/home/will/Desktop# cat /proc/2273/stack 
> [<0>] blk_execute_rq+0xf7/0x150
> [<0>] __nvme_submit_sync_cmd+0x94/0x110
> [<0>] nvme_submit_sync_cmd+0x1b/0x20
> [<0>] adapter_delete_queue+0xad/0xf0
> [<0>] nvme_reset_dev+0x1b67/0x2450
> [<0>] nvme_eh_work+0x19c/0x4b0
> [<0>] process_one_work+0x3ca/0xaa0
> [<0>] worker_thread+0x89/0x6c0
> [<0>] kthread+0x18d/0x1e0
> [<0>] ret_from_fork+0x24/0x30
> [<0>] 0x

Even without this patch, the above hang can happen in reset context,
so this issue isn't related to the introduced 'reset_lock'.

> root@will-ThinkCentre-M910s:/home/will/Desktop# cat /proc/2275/stack 
> [<0>] nvme_eh_work+0x11a/0x4b0
> [<0>] process_one_work+0x3ca/0xaa0
> [<0>] worker_thread+0x89/0x6c0
> [<0>] kthread+0x18d/0x1e0
> [<0>] ret_from_fork+0x24/0x30
> [<0>] 0x
> 
> > 
> >> -> adapter_delete_queue // submit to the adminq which has been 
> >> quiesced.
> >>   -> nvme_submit_sync_cmd
> >> -> blk_execute_rq
> >>   -> wait_for_completion_io_timeout
> >>   hang_check is true, so there is no hung task warning for 
> >> this context
> >>
> >> EH0 submit cq delete admin command, but it will never be completed or 
> >> timed out, because the admin request queue has
> >> been quiesced, so the reset_lock cannot be released, and EH1 cannot get 
> >> reset_lock and make things forward.

[PATCH blktests] Documentation: document prerequisite scriptlets

2018-05-14 Thread Johannes Thumshirn
The config file is bash and it gets sourced, so all bash magic is
doable in there as well. Document it so others don't have to
re-discover this gem.

Signed-off-by: Johannes Thumshirn 
---
 Documentation/running-tests.md | 12 
 1 file changed, 12 insertions(+)

diff --git a/Documentation/running-tests.md b/Documentation/running-tests.md
index a479d5e94c5e..b477c0679683 100644
--- a/Documentation/running-tests.md
+++ b/Documentation/running-tests.md
@@ -66,3 +66,15 @@ command line option.
 QUICK_RUN=1
 TIMEOUT=30
 ```
+
+### Pre-test setups
+
+Some tests may need special prerequisites, like configfs being
+mounted for NVMe over Fabrics tests. You can add your custom bash
+scriptlets to `config` to get this done, e.g.:
+
+```sh
+if ! grep -q configfs /proc/mounts ; then
+mount -t configfs none /sys/kernel/config
+fi
+```
-- 
2.16.3



Re: [PATCH V5 0/9] nvme: pci: fix & improve timeout handling

2018-05-14 Thread jianchao.wang
Hi ming

On 05/14/2018 05:38 PM, Ming Lei wrote:
>> Here is the deadlock scenario.
>>
>> nvme_eh_work // EH0
>>   -> nvme_reset_dev //hold reset_lock
>> -> nvme_setup_io_queues
>>   -> nvme_create_io_queues
>> -> nvme_create_queue
>>   -> set nvmeq->cq_vector
>>   -> adapter_alloc_cq
>>   -> adapter_alloc_sq
>>  irq has not been requested
>>  io timeout 
>> nvme_eh_work //EH1
>>   -> nvme_dev_disable
>> -> quiesce the adminq //> here !
>> -> nvme_suspend_queue
>>   print out warning Trying to free 
>> already-free IRQ 133
>> -> nvme_cancel_request // complete 
>> the timeout admin request
>>   -> require reset_lock
>>   -> adapter_delete_cq
> If the admin IO submitted in adapter_alloc_sq() is timed out,
> nvme_dev_disable() in EH1 will complete it which is set as 
> REQ_FAILFAST_DRIVER,
> then adapter_alloc_sq() should return error, and the whole reset in EH0
> should have been terminated immediately.

Please refer to the following segment:

static int nvme_create_queue(struct nvme_queue *nvmeq, int qid)
{
struct nvme_dev *dev = nvmeq->dev;
int result;
...
nvmeq->cq_vector = dev->num_vecs == 1 ? 0 : qid;
result = adapter_alloc_cq(dev, qid, nvmeq);
if (result < 0)
goto release_vector;

	result = adapter_alloc_sq(dev, qid, nvmeq);   // if timeout and failed here
if (result < 0)
goto release_cq;

nvme_init_queue(nvmeq, qid);
result = queue_request_irq(nvmeq);
if (result < 0)
goto release_sq;

return result;

 release_sq:
dev->online_queues--;
adapter_delete_sq(dev, qid);
 release_cq:				// we will be here !
	adapter_delete_cq(dev, qid);	// another cq delete admin command will be sent out.
 release_vector:
nvmeq->cq_vector = -1;
return result;
}


> 
> I guess the issue should be that nvme_create_io_queues() ignores the failure.
> 
> Could you dump the stack trace of EH0 reset task? So that we may see
> where EH0 reset kthread hangs.

root@will-ThinkCentre-M910s:/home/will/Desktop# cat /proc/2273/stack 
[<0>] blk_execute_rq+0xf7/0x150
[<0>] __nvme_submit_sync_cmd+0x94/0x110
[<0>] nvme_submit_sync_cmd+0x1b/0x20
[<0>] adapter_delete_queue+0xad/0xf0
[<0>] nvme_reset_dev+0x1b67/0x2450
[<0>] nvme_eh_work+0x19c/0x4b0
[<0>] process_one_work+0x3ca/0xaa0
[<0>] worker_thread+0x89/0x6c0
[<0>] kthread+0x18d/0x1e0
[<0>] ret_from_fork+0x24/0x30
[<0>] 0x
root@will-ThinkCentre-M910s:/home/will/Desktop# cat /proc/2275/stack 
[<0>] nvme_eh_work+0x11a/0x4b0
[<0>] process_one_work+0x3ca/0xaa0
[<0>] worker_thread+0x89/0x6c0
[<0>] kthread+0x18d/0x1e0
[<0>] ret_from_fork+0x24/0x30
[<0>] 0x

> 
>> -> adapter_delete_queue // submit to the adminq which has been 
>> quiesced.
>>   -> nvme_submit_sync_cmd
>> -> blk_execute_rq
>>   -> wait_for_completion_io_timeout
>>   hang_check is true, so there is no hung task warning for 
>> this context
>>
>> EH0 submit cq delete admin command, but it will never be completed or timed 
>> out, because the admin request queue has
>> been quiesced, so the reset_lock cannot be released, and EH1 cannot get 
>> reset_lock and make things forward.
> The nvme_dev_disable() in outer EH(EH1 in above log) will complete all
> admin command, which won't be retried because it is set as
> REQ_FAILFAST_DRIVER, so nvme_cancel_request() will complete it in
> nvme_dev_disable().

This cq delete admin command is sent out after EH1's nvme_dev_disable() has
completed and failed the previously timed-out sq alloc admin command. Please
refer to the code segment above.

Thanks
jianchao


Re: [PATCH V5 0/9] nvme: pci: fix & improve timeout handling

2018-05-14 Thread Ming Lei
On Mon, May 14, 2018 at 04:21:04PM +0800, jianchao.wang wrote:
> Hi ming
> 
> Please refer to my test log and analysis.
> 
> [  229.872622] nvme nvme0: I/O 164 QID 1 timeout, reset controller
> [  229.872649] nvme nvme0: EH 0: before shutdown
> [  229.872683] nvme nvme0: I/O 165 QID 1 timeout, reset controller
> [  229.872700] nvme nvme0: I/O 166 QID 1 timeout, reset controller
> [  229.872716] nvme nvme0: I/O 167 QID 1 timeout, reset controller
> [  229.872733] nvme nvme0: I/O 169 QID 1 timeout, reset controller
> [  229.872750] nvme nvme0: I/O 173 QID 1 timeout, reset controller
> [  229.872766] nvme nvme0: I/O 174 QID 1 timeout, reset controller
> [  229.872783] nvme nvme0: I/O 178 QID 1 timeout, reset controller
> [  229.872800] nvme nvme0: I/O 182 QID 1 timeout, aborting
> [  229.872900] nvme nvme0: I/O 185 QID 1 timeout, aborting
> [  229.872975] nvme nvme0: I/O 190 QID 1 timeout, reset controller
> [  229.872990] nvme nvme0: I/O 191 QID 1 timeout, aborting
> [  229.873021] nvme nvme0: I/O 5 QID 2 timeout, aborting
> [  229.873096] nvme nvme0: I/O 40 QID 2 timeout, aborting
> [  229.874041] nvme nvme0: Abort status: 0x0
> [  229.874064] nvme nvme0: Abort status: 0x0
> [  229.874078] nvme nvme0: Abort status: 0x0
> [  229.874092] nvme nvme0: Abort status: 0x0
> [  229.874106] nvme nvme0: Abort status: 0x0
> [  230.060698] nvme nvme0: EH 0: after shutdown, top eh: 1
> [  290.840592] nvme nvme0: I/O 18 QID 0 timeout, disable controller
> [  290.840746] nvme nvme0: EH 1: before shutdown
> 
> [  290.992650] [ cut here ]
> [  290.992751] Trying to free already-free IRQ 133
> [  290.992783] WARNING: CPU: 6 PID: 227 at 
> /home/will/u04/source_code/linux-stable/kernel/irq/manage.c:1549 
> __free_irq+0xf6/0x420
> [  290.993394] CPU: 6 PID: 227 Comm: kworker/u16:4 Kdump: loaded Not tainted 
> 4.17.0-rc3+ #37
> [  290.993402] Hardware name: LENOVO 10MLS0E339/3106, BIOS M1AKT22A 06/27/2017
> [  290.993418] Workqueue: nvme-reset-wq nvme_eh_work
> [  290.993436] RIP: 0010:__free_irq+0xf6/0x420
> ...
> [  290.993541] Call Trace:
> [  290.993581]  free_irq+0x4c/0xa0
> [  290.993600]  pci_free_irq+0x18/0x30
> [  290.993615]  nvme_dev_disable+0x20b/0x720
> [  290.993745]  nvme_eh_work+0xdd/0x4b0
> [  290.993866]  process_one_work+0x3ca/0xaa0
> [  290.993960]  worker_thread+0x89/0x6c0
> [  290.994017]  kthread+0x18d/0x1e0
> [  290.994061]  ret_from_fork+0x24/0x30
> [  338.912379] INFO: task kworker/u16:4:227 blocked for more than 30 seconds.
> [  338.913348]   Tainted: GW 4.17.0-rc3+ #37
> [  338.913549] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables 
> this message.
> [  338.913809] kworker/u16:4   D0   227  2 0x8000
> [  338.913842] Workqueue: nvme-reset-wq nvme_eh_work
> [  338.913857] Call Trace:
> [  338.914002]  schedule+0x52/0xe0
> [  338.914019]  schedule_preempt_disabled+0x14/0x20
> [  338.914029]  __mutex_lock+0x5b4/0xca0
> [  338.914228]  nvme_eh_work+0x11a/0x4b0
> [  338.914344]  process_one_work+0x3ca/0xaa0
> [  338.914440]  worker_thread+0x89/0x6c0
> [  338.914496]  kthread+0x18d/0x1e0
> [  338.914542]  ret_from_fork+0x24/0x30
> [  338.914648]

The above isn't a new issue either, since nvme_dev_disable() can be
run before the controller is recovered (nvme_setup_io_queues() done)
even without this patchset.

This can happen in several cases, such as the one you listed below, or
when two or more timed-out requests are triggered from different queues.

The issue is that genirq won't work well if the same IRQ is freed
twice.

> 
> Here is the deadlock scenario.
> 
> nvme_eh_work // EH0
>   -> nvme_reset_dev //hold reset_lock
> -> nvme_setup_io_queues
>   -> nvme_create_io_queues
> -> nvme_create_queue
>   -> set nvmeq->cq_vector
>   -> adapter_alloc_cq
>   -> adapter_alloc_sq
>  irq has not been requested
>  io timeout 
> nvme_eh_work //EH1
>   -> nvme_dev_disable
> -> quiesce the adminq //> here !
> -> nvme_suspend_queue
>   print out warning Trying to free 
> already-free IRQ 133
> -> nvme_cancel_request // complete 
> the timeout admin request
>   -> require reset_lock
>   -> adapter_delete_cq

If the admin IO submitted in adapter_alloc_sq() times out,
nvme_dev_disable() in EH1 will complete it, since it is set as REQ_FAILFAST_DRIVER;
then adapter_alloc_sq() should return an error, and the whole reset in EH0
should have been terminated immediately.

I guess the issue should be that nvme_create_io_queues() ignores the failure.

Could you dump the stack trace of EH0 reset task? So that we may see
where EH0 reset kthread hangs.

> -> adapter_delete_queue // submit to the adminq which has been 
> quiesced.

Re: [PATCH 6/7] psi: pressure stall information for CPU, memory, and IO

2018-05-14 Thread Peter Zijlstra
On Thu, May 10, 2018 at 09:41:32AM -0400, Johannes Weiner wrote:
> So there is a reason I'm tracking productivity states per-cpu and not
> globally. Consider the following example periods on two CPUs:
> 
> CPU 0
> Task 1: | EXECUTING  | memstalled |
> Task 2: | runqueued  | EXECUTING  |
> 
> CPU 1
> Task 3: | memstalled | EXECUTING  |
> 
> If we tracked only the global number of stalled tasks, similarly to
> nr_uninterruptible, the number would be elevated throughout the whole
> sampling period, giving a pressure value of 100% for "some stalled".
> And, since there is always something executing, a "full stall" of 0%.

But if you read the comment about SMP IO-wait; see commit:

  e33a9bba85a8 ("sched/core: move IO scheduling accounting from 
io_schedule_timeout() into scheduler")

you'll see that per-cpu accounting has issues too.

Also, note that in your example above you have 1 memstalled task (at any
one time), but _2_ CPUs. So at most you should end up with a 50% value.
There is no way 1 task could consume 2 CPUs worth of time.
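
To spell out that arithmetic, here is a throwaway illustration (my own, not
anything from the series; the variable names are made up):

/* Illustration only: upper bound on the "some stalled" share when at most
 * one task is memstalled at any instant. */
#include <stdio.h>

int main(void)
{
	double window = 1.0;            /* sampling window, seconds              */
	double stalled_task_secs = 1.0; /* one task stalled for the whole window */
	int nr_cpus = 2;

	double share = stalled_task_secs / (nr_cpus * window);
	printf("stalled share <= %.0f%%\n", share * 100.0);	/* prints 50% */
	return 0;
}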

Furthermore, associating a blocked task to any particular CPU is
fundamentally broken and I'll hard NAK anything that relies on it.

> Now consider what happens when the Task 3 sequence is the other way
> around:
> 
> CPU 0
> Task 1: | EXECUTING  | memstalled |
> Task 2: | runqueued  | EXECUTING  |
> 
> CPU 1
> Task 3: | EXECUTING  | memstalled |
> 
> Here the number of stalled tasks is elevated only during half of the
> sampling period, this time giving a pressure reading of 50% for "some"
> (and again 0% for "full").

That entirely depends on your averaging; an exponentially decaying
average would not typically result in 50% for the above case. But I
think we can agree that this results in one 0% and one 100% sample -- we
have two stalled tasks and two CPUs.

> That's a different measurement, but in terms of workload progress, the
> sequences are functionally equivalent. In both scenarios the same
> amount of productive CPU cycles is spent advancing tasks 1, 2 and 3,
> and the same amount of potentially productive CPU time is lost due to
> the contention of memory. We really ought to read the same pressure.

And you do -- subject to the averaging used, as per the above.

The first gives two 50% samples, the second gives 0%, 100%.
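
As a toy illustration (my own sketch, not the PSI code; the 1/2 weight and
the zero starting value are arbitrary assumptions), a fixed-weight decaying
average already treats those two sample sequences differently:

/* Toy EWMA, not PSI code: avg' = avg + w * (sample - avg), starting from 0. */
#include <stdio.h>

static double ewma(const double *samples, int n, double w)
{
	double avg = 0.0;

	for (int i = 0; i < n; i++)
		avg += w * (samples[i] - avg);
	return avg;
}

int main(void)
{
	double seq_a[] = { 50.0, 50.0 };	/* two 50% samples     */
	double seq_b[] = {  0.0, 100.0 };	/* a 0% then a 100%    */

	printf("seq A -> %.1f%%\n", ewma(seq_a, 2, 0.5));	/* 37.5% */
	printf("seq B -> %.1f%%\n", ewma(seq_b, 2, 0.5));	/* 50.0% */
	return 0;
}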

> So what I'm doing is calculating the productivity loss on each CPU in
> a sampling period as if they were independent time slices. It doesn't
> matter how you slice and dice the sequences within each one - if used
> CPU time and lost CPU time have the same proportion, we have the same
> pressure.

I'm still thinking you can do basically the same without the strong CPU
relation.

> To illustrate:
> 
> CPU X
> 1234
> Task 1: | EXECUTING  | memstalled | sleeping   | sleeping   |
> Task 2: | runqueued  | EXECUTING  | sleeping   | sleeping   |
> Task 3: | sleeping   | sleeping   | EXECUTING  | memstalled |
> 
> You can clearly see the 50% of walltime in which *somebody* isn't
> advancing (2 and 4), and the 25% of walltime in which *no* tasks are
> (3). Same amount of work, same memory stalls, same pressure numbers.
> 
> Globalized state tracking would produce those numbers on the single
> CPU (obviously), but once concurrency gets into the mix, it's
> questionable what its results mean. It certainly isn't able to
> reliably detect equivalent slowdowns of individual tasks ("some" is
> all over the place), and in this example wasn't able to capture the
> impact of contention on overall work completion ("full" is 0%).
> 
> * CPU 0: some = 50%, full =  0%
>   CPU 1: some = 50%, full = 50%
> avg: some = 50%, full = 25%

I'm not entirely sure I get your point here; but note that a task
doesn't sleep on a CPU. When it sleeps it is not strictly associated
with a CPU, only when it runs does it have an association.

What is the value of accounting a sleep state to a particular CPU if the
task then wakes up on another? Where did the sleep take place?

All we really can say is that a task slept, and if we can reduce the
reason for its sleeping (IO, reclaim, whatever) then it could've run
sooner. And then you can make predictions, based on the number of CPUs
and global idle time, about how much that could improve things.
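
One way to read that prediction (my own interpretation, not something from
the series): the walltime you could win back is bounded both by how long
work was stalled and by how much idle CPU time was available to absorb it.

/* Interpretation sketch, not PSI code: recoverable walltime is bounded by
 * both the stall time and the idle CPU time in the same window. */
#include <stdio.h>

static double recoverable(double stalled_secs, double idle_cpu_secs)
{
	return stalled_secs < idle_cpu_secs ? stalled_secs : idle_cpu_secs;
}

int main(void)
{
	/* e.g. 1s of stall observed, but only 0.5s of idle CPU time existed */
	printf("could have improved by up to %.1fs\n", recoverable(1.0, 0.5));
	return 0;
}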



Re: [PATCH V5 0/9] nvme: pci: fix & improve timeout handling

2018-05-14 Thread jianchao.wang
Hi ming

Please refer to my test log and analysis.

[  229.872622] nvme nvme0: I/O 164 QID 1 timeout, reset controller
[  229.872649] nvme nvme0: EH 0: before shutdown
[  229.872683] nvme nvme0: I/O 165 QID 1 timeout, reset controller
[  229.872700] nvme nvme0: I/O 166 QID 1 timeout, reset controller
[  229.872716] nvme nvme0: I/O 167 QID 1 timeout, reset controller
[  229.872733] nvme nvme0: I/O 169 QID 1 timeout, reset controller
[  229.872750] nvme nvme0: I/O 173 QID 1 timeout, reset controller
[  229.872766] nvme nvme0: I/O 174 QID 1 timeout, reset controller
[  229.872783] nvme nvme0: I/O 178 QID 1 timeout, reset controller
[  229.872800] nvme nvme0: I/O 182 QID 1 timeout, aborting
[  229.872900] nvme nvme0: I/O 185 QID 1 timeout, aborting
[  229.872975] nvme nvme0: I/O 190 QID 1 timeout, reset controller
[  229.872990] nvme nvme0: I/O 191 QID 1 timeout, aborting
[  229.873021] nvme nvme0: I/O 5 QID 2 timeout, aborting
[  229.873096] nvme nvme0: I/O 40 QID 2 timeout, aborting
[  229.874041] nvme nvme0: Abort status: 0x0
[  229.874064] nvme nvme0: Abort status: 0x0
[  229.874078] nvme nvme0: Abort status: 0x0
[  229.874092] nvme nvme0: Abort status: 0x0
[  229.874106] nvme nvme0: Abort status: 0x0
[  230.060698] nvme nvme0: EH 0: after shutdown, top eh: 1
[  290.840592] nvme nvme0: I/O 18 QID 0 timeout, disable controller
[  290.840746] nvme nvme0: EH 1: before shutdown

[  290.992650] [ cut here ]
[  290.992751] Trying to free already-free IRQ 133
[  290.992783] WARNING: CPU: 6 PID: 227 at 
/home/will/u04/source_code/linux-stable/kernel/irq/manage.c:1549 
__free_irq+0xf6/0x420
[  290.993394] CPU: 6 PID: 227 Comm: kworker/u16:4 Kdump: loaded Not tainted 
4.17.0-rc3+ #37
[  290.993402] Hardware name: LENOVO 10MLS0E339/3106, BIOS M1AKT22A 06/27/2017
[  290.993418] Workqueue: nvme-reset-wq nvme_eh_work
[  290.993436] RIP: 0010:__free_irq+0xf6/0x420
...
[  290.993541] Call Trace:
[  290.993581]  free_irq+0x4c/0xa0
[  290.993600]  pci_free_irq+0x18/0x30
[  290.993615]  nvme_dev_disable+0x20b/0x720
[  290.993745]  nvme_eh_work+0xdd/0x4b0
[  290.993866]  process_one_work+0x3ca/0xaa0
[  290.993960]  worker_thread+0x89/0x6c0
[  290.994017]  kthread+0x18d/0x1e0
[  290.994061]  ret_from_fork+0x24/0x30
[  338.912379] INFO: task kworker/u16:4:227 blocked for more than 30 seconds.
[  338.913348]   Tainted: GW 4.17.0-rc3+ #37
[  338.913549] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
message.
[  338.913809] kworker/u16:4   D0   227  2 0x8000
[  338.913842] Workqueue: nvme-reset-wq nvme_eh_work
[  338.913857] Call Trace:
[  338.914002]  schedule+0x52/0xe0
[  338.914019]  schedule_preempt_disabled+0x14/0x20
[  338.914029]  __mutex_lock+0x5b4/0xca0
[  338.914228]  nvme_eh_work+0x11a/0x4b0
[  338.914344]  process_one_work+0x3ca/0xaa0
[  338.914440]  worker_thread+0x89/0x6c0
[  338.914496]  kthread+0x18d/0x1e0
[  338.914542]  ret_from_fork+0x24/0x30
[  338.914648]

Here is the deadlock scenario.

nvme_eh_work // EH0
  -> nvme_reset_dev // hold reset_lock
    -> nvme_setup_io_queues
      -> nvme_create_io_queues
        -> nvme_create_queue
          -> set nvmeq->cq_vector
          -> adapter_alloc_cq
          -> adapter_alloc_sq
               irq has not been requested
               io timeout
                                     nvme_eh_work // EH1
                                       -> nvme_dev_disable
                                         -> quiesce the adminq //> here !
                                         -> nvme_suspend_queue
                                              print out warning Trying to free already-free IRQ 133
                                         -> nvme_cancel_request // complete the timeout admin request
                                       -> require reset_lock
          -> adapter_delete_cq
            -> adapter_delete_queue // submit to the adminq which has been quiesced.
              -> nvme_submit_sync_cmd
                -> blk_execute_rq
                  -> wait_for_completion_io_timeout
                       hang_check is true, so there is no hung task warning for this context

EH0 submits the cq delete admin command, but it will never be completed or
timed out because the admin request queue has been quiesced, so the
reset_lock cannot be released and EH1 cannot get the reset_lock to move
things forward.
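
To see the shape of this wait cycle outside the kernel, here is a rough
userspace analogy (pthreads, nothing NVMe-specific; the eh0/eh1/reset_lock
names just mirror the scenario above):

/* Userspace analogy only: eh0 holds "reset_lock" and waits for a completion
 * that only the (now quiesced) admin queue could deliver; eh1 has quiesced
 * that queue and then blocks on "reset_lock". Neither thread makes progress. */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t reset_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t adminq_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t adminq_done = PTHREAD_COND_INITIALIZER;
static bool cmd_completed;	/* would be set by the admin queue, never is */

static void *eh0(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&reset_lock);	/* nvme_reset_dev()                */
	pthread_mutex_lock(&adminq_lock);	/* adapter_delete_cq()             */
	while (!cmd_completed)			/* blk_execute_rq(): waits forever */
		pthread_cond_wait(&adminq_done, &adminq_lock);
	pthread_mutex_unlock(&adminq_lock);
	pthread_mutex_unlock(&reset_lock);
	return NULL;
}

static void *eh1(void *arg)
{
	(void)arg;
	/* nvme_dev_disable() has quiesced the admin queue, so nothing will
	 * ever signal adminq_done; now it needs reset_lock, which eh0 holds. */
	pthread_mutex_lock(&reset_lock);
	pthread_mutex_unlock(&reset_lock);
	return NULL;
}

int main(void)
{
	pthread_t t0, t1;

	pthread_create(&t0, NULL, eh0, NULL);	/* let EH0 win reset_lock first */
	sleep(1);
	pthread_create(&t1, NULL, eh1, NULL);
	sleep(2);
	puts("still stuck: eh0 waits for a completion the quiesced queue never delivers,");
	puts("             eh1 waits for reset_lock, which eh0 holds");
	return 0;	/* returning from main tears the stuck threads down */
}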

Thanks
Jianchao


Re: [PATCH] loop: add recursion validation to LOOP_CHANGE_FD

2018-05-14 Thread Dmitry Vyukov
On Wed, May 9, 2018 at 4:02 PM, Theodore Y. Ts'o  wrote:
> On Wed, May 09, 2018 at 10:49:54AM +0200, Dmitry Vyukov wrote:
>> Hi Ted,
>>
>> Did you follow all instructions (commit, config, compiler, etc)?
>> syzbot does not have any special magic, it just executes the described
>> steps and then collects the kernel crash output it sees.
>
> No, I didn't.  Unfortunately, I don't have the time to repro it on
> exactly the config, commit, compiler.  And debugging tons of Syzkaller
> commits is not on my OKR list, and I have lots of P0/P1 bugs and
> projects that are competing for my attention.
>
> I tried repro'ing it using the standard compiler, and using -rc4 as a
> base.  If it doesn't repro there, using my standard kernel config ---
> and it requires root, and in my judgement it's *highly* unlikely to
> happen in real life --- then it becomes a P2 or P3 bug, it's not worth
> my time to build a kernel using exactly the commit, config, and
> compiler that Syzkaller specified.  Someday, you or I or someone else
> will build a tool that builds the kernel in a GCE VM, using the
> specified config and a compiler, and which minimizes the amount of
> human toil needed to do the sort of investigation you seem to want to
> dump on upstream developers.
>
> There's a *reason* why many upstream developers have been asking for
> improvements in the syzkaller tool to reduce toil.  If it's fair for
> you to ask for us to do work, then it's also fair for us to ask you to
> do work.  And if the answer is "sorry, I don't have the time; other
> things are higher priority", that's a fair answer coming from both
> sides.


Hi Ted,

I am working on syzbot/syzkaller all of my time. The repro script is on my
plate: https://github.com/google/syzkaller/issues/563. I will do it.
Not because somebody asks me, but because I am interested in making
the kernel more correct, stable and secure.