Re: [PATCH rdma-next 00/10] Enable relaxed ordering for ULPs

2021-04-05 Thread Keith Busch
On Mon, Apr 05, 2021 at 11:42:31PM +, Chuck Lever III wrote:
> > On Apr 5, 2021, at 4:07 PM, Jason Gunthorpe  wrote:
> > On Mon, Apr 05, 2021 at 03:41:15PM +0200, Christoph Hellwig wrote:
> >> On Mon, Apr 05, 2021 at 08:23:54AM +0300, Leon Romanovsky wrote:
> >>> From: Leon Romanovsky 
> >>> 
>  From Avihai,
> >>> 
> >>> Relaxed Ordering is a PCIe mechanism that relaxes the strict ordering
> >>> imposed on PCI transactions, and thus, can improve performance.
> >>> 
> >>> Until now, relaxed ordering could be set only by user space applications
> >>> for user MRs. The following patch series enables relaxed ordering for the
> >>> kernel ULPs as well. Relaxed ordering is an optional capability, and as
> >>> such, it is ignored by vendors that don't support it.
> >>> 
> >>> The following test results show the performance improvement achieved
> >>> with relaxed ordering. The test was performed on an NVIDIA A100 in order
> >>> to check performance of storage infrastructure over xprtrdma:
> >> 
> >> Isn't the Nvidia A100 a GPU not actually supported by Linux at all?
> >> What does that have to do with storage protocols?
> > 
> > I think it is a typo (or at least it makes no sense to be talking
> > about NFS with a GPU chip). Probably it should be a DGX A100, which is a
> > dual-socket AMD server with a lot of PCIe, and xprtrdma is an NFS-RDMA
> > workload.
> 
> We need to get a better idea what correctness testing has been done,
> and whether positive correctness testing results can be replicated
> on a variety of platforms.
> 
> I have an old Haswell dual-socket system in my lab, but otherwise
> I'm not sure I have a platform that would be interesting for such a
> test.

Not sure if Haswell will be useful for such testing. It looks like many
of those subscribe to 'quirk_relaxedordering_disable'.


Re: [bug report] node: Add memory-side caching attributes

2021-04-01 Thread Keith Busch
On Thu, Apr 01, 2021 at 12:00:39PM +0300, Dan Carpenter wrote:
> Hi Keith,
> 
> I've been trying to figure out ways Smatch can check for device managed
> resources.  Like adding rules that if we call dev_set_name(&foo->bar)
> then it's device managed and if there is a kfree(foo) without calling
> device_put(&foo->bar) then that's a resource leak.
> 
> Of course one of the rules is that if you call device_register(dev) then
> you can't kfree(dev), it has to be released with device_put(dev) and that's
> true even if the register fails.  But this code here feels very
> intentional so maybe there is an exception to the rule?

Thanks for the notice. There was no intentional use of this, so I
think it was a mistake. The proposal in the follow on replies looks much
better.
 
> The patch acc02a109b04: "node: Add memory-side caching attributes"
> from Mar 11, 2019, leads to the following static checker warning:
> 
>   drivers/base/node.c:285 node_init_cache_dev()
>   error: kfree after device_register(): 'dev'
> 
> drivers/base/node.c
>263  static void node_init_cache_dev(struct node *node)
>264  {
>265  struct device *dev;
>266  
>267  dev = kzalloc(sizeof(*dev), GFP_KERNEL);
>268  if (!dev)
>269  return;
>270  
> >271  dev->parent = &node->dev;
>272  dev->release = node_cache_release;
>273  if (dev_set_name(dev, "memory_side_cache"))
>274  goto free_dev;
>275  
>276  if (device_register(dev))
> ^^^
>277  goto free_name;
>278  
>279  pm_runtime_no_callbacks(dev);
>280  node->cache_dev = dev;
>281  return;
>282  free_name:
>283  kfree_const(dev->kobj.name);
>284  free_dev:
>285  kfree(dev);
> ^^
>286  }
> 
> regards,
> dan carpenter


[PATCH] overflow: improve check_shl_overflow comment

2021-04-01 Thread Keith Busch
A 'false' return means the value was safely set, so the comment should
say 'true' for when it is not considered safe.

Cc: Jason Gunthorpe 
Cc: Kees Cook 
Signed-off-by: Keith Busch 
---
 include/linux/overflow.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/overflow.h b/include/linux/overflow.h
index ef74051d5cfed..0f12345c21fb5 100644
--- a/include/linux/overflow.h
+++ b/include/linux/overflow.h
@@ -235,7 +235,7 @@ static inline bool __must_check __must_check_overflow(bool overflow)
  * - 'a << s' sets the sign bit, if any, in '*d'.
  *
  * '*d' will hold the results of the attempted shift, but is not
- * considered "safe for use" if false is returned.
+ * considered "safe for use" if true is returned.
  */
 #define check_shl_overflow(a, s, d) __must_check_overflow(({   \
typeof(a) _a = a;   \
-- 
2.25.4
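The semantics the corrected comment describes can be illustrated with a small userspace re-implementation. This is an unsigned-32-bit-only sketch of the documented contract, not the kernel macro itself (which is type-generic and also flags negative values and sign-bit results): a 'true' return means the shift overflowed and '*d' is not safe to use.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Userspace sketch of check_shl_overflow()'s documented semantics for a
 * uint32_t: store 'a << s' in '*d' and return true when the result
 * overflowed, in which case '*d' is not considered "safe for use". */
static bool check_shl_overflow_u32(uint32_t a, unsigned int s, uint32_t *d)
{
	bool overflow = s >= 32 || (s > 0 && (a >> (32 - s)) != 0);

	/* Guard the shift itself: 'a << s' with s >= 32 is undefined in C. */
	*d = (s >= 32) ? 0 : (uint32_t)(a << s);
	return overflow;	/* true means *d must not be trusted */
}
```

Callers should therefore treat a 'false' return as "value safely set" and a 'true' return as an error, matching the fixed comment.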



Re: [PATCH v2 2/3] nvme: Remove superflues else in nvme_ctrl_loss_tmo_store

2021-04-01 Thread Keith Busch
For the subject, s/superflues/superfluous


Re: [PATCH v2] nvme-tcp: Check if request has started before processing it

2021-03-30 Thread Keith Busch
On Tue, Mar 30, 2021 at 10:34:25AM -0700, Sagi Grimberg wrote:
> 
> > > It is, but in this situation, the controller is sending a second
> > > completion that results in a use-after-free, which makes the
> > > transport irrelevant. Unless there is some other flow (which is
> > > unclear
> > > to me) that causes this which is a bug that needs to be fixed rather
> > > than hidden with a safeguard.
> > > 
> > 
> > The kernel should not crash regardless of any network traffic that is
> > sent to the system.  It should not be possible to either intentionally
> > or mistakenly construct packets that will deny service in this way.
> 
> This is not specific to nvme-tcp. I can build an rdma or pci controller
> that can trigger the same crash... I saw a similar patch from Hannes
> implemented in the scsi level, and not the individual scsi transports..

If scsi wants this too, this could be made generic at the blk-mq level.
We just need to make something like blk_mq_tag_to_rq(), but return NULL
if the request isn't started.
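A hedged sketch of what that generic helper could look like, built on the existing blk_mq_tag_to_rq() and blk_mq_request_started() blk-mq APIs (the wrapper name here is hypothetical):

```c
/* Hypothetical blk-mq helper: resolve a tag to a request only if the
 * request has actually been started by the block layer; otherwise the
 * completion is stale or bogus and must not be processed. */
static inline struct request *
blk_mq_tag_to_started_rq(struct blk_mq_tags *tags, unsigned int tag)
{
	struct request *rq = blk_mq_tag_to_rq(tags, tag);

	if (rq && !blk_mq_request_started(rq))
		return NULL;	/* stale or spoofed completion */
	return rq;
}
```

Transports (nvme-tcp, scsi LLDs, etc.) could then treat a NULL return as a protocol error instead of dereferencing an uninitialized request.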
 
> I would also mention, that a crash is not even the scariest issue that
> we can see here, because if the request happened to be reused we are
> in the silent data corruption realm...

If this does happen, I think we have to come up with some way to
mitigate it. We're not utilizing the full 16 bits of the command_id, so
maybe we can append something like a generation sequence number that can
be checked for validity.
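The generation-sequence idea can be sketched in isolation: pack a small counter into the otherwise-unused upper bits of the 16-bit command_id, and reject completions whose generation doesn't match the one recorded at submission. The 4/12 bit split below is an assumption for illustration, not a spec'd layout:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical command_id layout: upper 4 bits carry a per-tag generation
 * counter, lower 12 bits carry the blk-mq tag. A completion whose
 * generation doesn't match the expected one is treated as invalid rather
 * than silently completing a reused request. */
#define CID_GEN_BITS	4
#define CID_TAG_BITS	(16 - CID_GEN_BITS)
#define CID_TAG_MASK	((1u << CID_TAG_BITS) - 1)
#define CID_GEN_MASK	((1u << CID_GEN_BITS) - 1)

static uint16_t cid_encode(uint16_t genctr, uint16_t tag)
{
	return (uint16_t)((genctr << CID_TAG_BITS) | (tag & CID_TAG_MASK));
}

static uint16_t cid_tag(uint16_t cid)
{
	return cid & CID_TAG_MASK;
}

static bool cid_valid(uint16_t cid, uint16_t expected_genctr)
{
	return (cid >> CID_TAG_BITS) == (expected_genctr & CID_GEN_MASK);
}
```

The cost is shrinking the usable tag space (here to 4096 tags), which is still far more than typical queue depths.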


Re: [PATCH] nvme: disallow passthru cmd from targeting a nsid != nsid of the block dev

2021-03-25 Thread Keith Busch
On Thu, Mar 25, 2021 at 09:48:37AM +, Niklas Cassel wrote:
> From: Niklas Cassel 
> 
> When a passthru command targets a specific namespace, the ns parameter to
> nvme_user_cmd()/nvme_user_cmd64() is set. However, there is currently no
> validation that the nsid specified in the passthru command targets the
> namespace/nsid represented by the block device that the ioctl was
> performed on.
> 
> Add a check that validates that the nsid in the passthru command matches
> that of the supplied namespace.
> 
> Signed-off-by: Niklas Cassel 
> ---
> Currently, if doing NVME_IOCTL_IO_CMD on the controller char device,
> if and only if there is a single namespace in the ctrl->namespaces list,
> nvme_dev_user_cmd() will call nvme_user_cmd() with ns parameter set.
> While it might be good that we validate the nsid even in this case,
> perhaps we want to allow a nsid value in the passthru command to be
> 0x0 and/or the NSID broadcast value? (Only when NVME_IOCTL_IO_CMD was
> done on the controller char device though.)

There are no IO commands accepting a 0 NSID, so rejecting those from the
driver should be okay.

FLUSH is the only IO command that takes a broadcast NSID. I suspect at
least some entities were unfortunately sending broadcast flush through
this interface, so it's possible we'll hear of breakage, but I'd agree
your patch is still the right thing to do.

>  drivers/nvme/host/core.c | 4 
>  1 file changed, 4 insertions(+)
> 
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 40215a0246e4..e4591a4c68a8 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -1632,6 +1632,8 @@ static int nvme_user_cmd(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
>   return -EFAULT;
>   if (cmd.flags)
>   return -EINVAL;
> + if (ns && cmd.nsid != ns->head->ns_id)
> + return -EINVAL;
>  
>   memset(&c, 0, sizeof(c));
>   c.common.opcode = cmd.opcode;
> @@ -1676,6 +1678,8 @@ static int nvme_user_cmd64(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
>   return -EFAULT;
>   if (cmd.flags)
>   return -EINVAL;
> + if (ns && cmd.nsid != ns->head->ns_id)
> + return -EINVAL;
>  
>   memset(&c, 0, sizeof(c));
>   c.common.opcode = cmd.opcode;
> -- 
> 2.30.2


Re: [PATCH 1/1] nvme-pci: add the DISABLE_WRITE_ZEROES quirk for a Samsung PM1725a

2021-03-10 Thread Keith Busch
On Wed, Mar 10, 2021 at 02:41:10PM +0100, Christoph Hellwig wrote:
> On Wed, Mar 10, 2021 at 02:21:56PM +0100, Christoph Hellwig wrote:
> > Can you try this patch instead?
> > 
> > http://lists.infradead.org/pipermail/linux-nvme/2021-February/023183.html
> 
> Actually, please try the patch below instead, it looks like our existing
> logic messes up the units:

This seems like a good opportunity to incorporate TP4040 for non-MDTS
command limits. I sent a proposal here

  http://lists.infradead.org/pipermail/linux-nvme/2021-March/023522.html

And it defaults to your suggestion if the controller doesn't implement
the capability.


Re: [PATCH] nvme-tcp: Check if request has started before processing it

2021-03-02 Thread Keith Busch
On Tue, Mar 02, 2021 at 08:18:40AM +0100, Hannes Reinecke wrote:
> On 3/1/21 9:59 PM, Keith Busch wrote:
> > On Mon, Mar 01, 2021 at 05:53:25PM +0100, Hannes Reinecke wrote:
> >> On 3/1/21 5:05 PM, Keith Busch wrote:
> >>> On Mon, Mar 01, 2021 at 02:55:30PM +0100, Hannes Reinecke wrote:
> >>>> On 3/1/21 2:26 PM, Daniel Wagner wrote:
> >>>>> On Sat, Feb 27, 2021 at 02:19:01AM +0900, Keith Busch wrote:
> >>>>>> Crashing is bad, silent data corruption is worse. Is there truly no
> >>>>>> defense against that? If not, why should anyone rely on this?
> >>>>>
> >>>>> If we receive a response for which we don't have a started request, we
> >>>>> know that something is wrong. Couldn't we just reset the connection
> >>>>> in this case? We don't have to pretend nothing has happened and
> >>>>> continue normally. This would avoid a host crash and would not create
> >>>>> (more) data corruption. Or am I just too naive?
> >>>>>
> >>>> This is actually a sensible solution.
> >>>> Please send a patch for that.
> >>>
> >>> Is a bad frame a problem that can be resolved with a reset?
> >>>
> >>> Even if so, the reset doesn't indicate to the user if previous commands
> >>> completed with bad data, so it still seems unreliable.
> >>>
> >> We need to distinguish two cases here.
> >> The first is us receiving a frame with an invalid tag, leading to a crash.
> >> This can be easily resolved by issuing a reset, as clearly the command was
> >> garbage and we need to invoke error handling (which is reset).
> >>
> >> The other case is us receiving a frame with a _duplicate_ tag, i.e. a tag
> >> which is _currently_ valid. This is a case which will fail _even now_, as
> >> we have simply no way of detecting this.
> >>
> >> So what again do we miss by fixing the first case?
> >> Apart from a system which does _not_ crash?
> > 
> > I'm just saying each case is a symptom of the same problem. The only
> > difference from observing one vs the other is a race with the host's
> > dispatch. And since you're proposing this patch, it sounds like this
> > condition does happen on tcp compared to other transports where we don't
> > observe it. I just thought the implication that data corruption happens
> > is alarming.
> > 
> Oh yes, it is.
> But sadly TCP inherently suffers from this, as literally anyone can
> spoof frames on the network.
> Other transports like RDMA or FC do not suffer to that extend as
> spoofing frames there is far more elaborate, and not really possible
> without dedicated hardware equipment.
> 
> That's why there is header and data digest; that will protect you
> against accidental frame corruption (as this case clearly is; the
> remainder of the frame is filled with zeroes).
> It would not protect you against deliberate frame corruption; that's why
> there is TPAR 8010 (TLS encryption for NVMe-TCP).
> 
> Be it as it may, none of these methods are in use here, and none of
> these methods can be made mandatory. So we need to deal with the case at
> hand.
> 
> And in my opinion crashing is the _worst_ options of all.
> Tear the connection down, reset the thing, whatever.
> 
> But do not crash.
> Customers tend to have a very dim view on crashing machines, and have a
> very limited capacity for being susceptible to our reasoning in these cases.

I was pushing the data corruption angle because fixing that should
address both cases. If there's really nothing you can do about
corruption, your approach here makes sense, and I defer to Sagi and
Christoph for inclusion.

I still wouldn't trust my data to storage behaving this way, though. :)


Re: [PATCH] nvme-tcp: Check if request has started before processing it

2021-03-01 Thread Keith Busch
On Mon, Mar 01, 2021 at 05:53:25PM +0100, Hannes Reinecke wrote:
> On 3/1/21 5:05 PM, Keith Busch wrote:
> > On Mon, Mar 01, 2021 at 02:55:30PM +0100, Hannes Reinecke wrote:
> > > On 3/1/21 2:26 PM, Daniel Wagner wrote:
> > > > On Sat, Feb 27, 2021 at 02:19:01AM +0900, Keith Busch wrote:
> > > > > Crashing is bad, silent data corruption is worse. Is there truly no
> > > > > defense against that? If not, why should anyone rely on this?
> > > > 
> > > > If we receive a response for which we don't have a started request, we
> > > > know that something is wrong. Couldn't we just reset the connection
> > > > in this case? We don't have to pretend nothing has happened and
> > > > continue normally. This would avoid a host crash and would not create
> > > > (more) data corruption. Or am I just too naive?
> > > > 
> > > This is actually a sensible solution.
> > > Please send a patch for that.
> > 
> > Is a bad frame a problem that can be resolved with a reset?
> > 
> > Even if so, the reset doesn't indicate to the user if previous commands
> > completed with bad data, so it still seems unreliable.
> > 
> We need to distinguish two cases here.
> The first is us receiving a frame with an invalid tag, leading to a crash.
> This can be easily resolved by issuing a reset, as clearly the command was
> garbage and we need to invoke error handling (which is reset).
> 
> The other case is us receiving a frame with a _duplicate_ tag, i.e. a tag
> which is _currently_ valid. This is a case which will fail _even now_, as we
> have simply no way of detecting this.
> 
> So what again do we miss by fixing the first case?
> Apart from a system which does _not_ crash?

I'm just saying each case is a symptom of the same problem. The only
difference from observing one vs the other is a race with the host's
dispatch. And since you're proposing this patch, it sounds like this
condition does happen on tcp compared to other transports where we don't
observe it. I just thought the implication that data corruption happens
is alarming.


Re: [PATCH] nvme-tcp: Check if request has started before processing it

2021-03-01 Thread Keith Busch
On Mon, Mar 01, 2021 at 02:55:30PM +0100, Hannes Reinecke wrote:
> On 3/1/21 2:26 PM, Daniel Wagner wrote:
> > On Sat, Feb 27, 2021 at 02:19:01AM +0900, Keith Busch wrote:
> >> Crashing is bad, silent data corruption is worse. Is there truly no
> >> defense against that? If not, why should anyone rely on this?
> > 
> > If we receive a response for which we don't have a started request, we
> > know that something is wrong. Couldn't we just reset the connection
> > in this case? We don't have to pretend nothing has happened and
> > continue normally. This would avoid a host crash and would not create
> > (more) data corruption. Or am I just too naive?
> > 
> This is actually a sensible solution.
> Please send a patch for that.

Is a bad frame a problem that can be resolved with a reset?

Even if so, the reset doesn't indicate to the user if previous commands
completed with bad data, so it still seems unreliable.


Re: [PATCH] nvme-tcp: Check if request has started before processing it

2021-02-26 Thread Keith Busch
On Fri, Feb 26, 2021 at 05:42:46PM +0100, Hannes Reinecke wrote:
> On 2/26/21 5:13 PM, Keith Busch wrote:
> > 
> > That's just addressing a symptom. You can't fully verify the request is
> > valid this way because the host could have started the same command ID
> > the very moment before the code checks it, incorrectly completing an
> > in-flight command and getting data corruption.
> > 
> Oh, I am fully aware.
> 
> Bad frames are just that, bad frames.
> We can only fully validate that when digests are enabled, but I gather that
> controllers sending out bad frames wouldn't want to enable digests, either.
> So relying on that is possibly not an option.
> 
> So really what I'm trying to avoid is the host crashing on a bad frame.
> That kind of thing always resonates badly with customers.
> And tripping over an uninitialized command is just too stupid IMO.

Crashing is bad, silent data corruption is worse. Is there truly no
defense against that? If not, why should anyone rely on this?


Re: [PATCH] nvme-tcp: Check if request has started before processing it

2021-02-26 Thread Keith Busch
On Fri, Feb 26, 2021 at 01:54:00PM +0100, Hannes Reinecke wrote:
> On 2/26/21 1:35 PM, Daniel Wagner wrote:
> > On Mon, Feb 15, 2021 at 01:29:45PM -0800, Sagi Grimberg wrote:
> > > Well, I think we should probably figure out why that is happening first.
> > 
> > I got my hands on a tcpdump trace. I've trimmed it to this:
> > 
> > 
> > No. Time      Source         Destination    Protocol  Length  Info
> >  1  0.00      10.228.194.30  10.228.38.214  NVMe      138     NVMe Write
> >  2  0.000285  10.228.38.214  10.228.194.30  NVMe/TCP  90      Ready To Transfer
> >  3  0.000591  10.228.194.30  10.228.38.214  NVMe      4186    NVMe Write: Data
> >  4  0.000673  10.228.38.214  10.228.194.30  TCP       66      4420 → 58535 [ACK] Seq=25 Ack=4193 Win=241 Len=0 TSval=2655324576 TSecr=1497295579
> >  5  0.002140  10.228.38.214  10.228.194.30  NVMe      90      NVMe Write: Response
> >  6  0.002511  10.228.194.30  10.228.38.175  NVMe      138     NVMe Write
> >  7  0.002812  10.228.38.175  10.228.194.30  NVMe/TCP  90      Ready To Transfer
> >  8  0.003006  10.228.194.30  10.228.38.175  NVMe      4186    NVMe Write: Data
> >  9  0.003098  10.228.38.175  10.228.194.30  TCP       66      4420 → 51241 [ACK] Seq=25 Ack=4193 Win=241 Len=0 TSval=2183410196 TSecr=3601034207
> > 10  0.004420  10.228.38.175  10.228.194.30  NVMe      90      NVMe Write: Response
> > 11  0.004890  10.228.38.214  10.228.194.30  NVMe/TCP  90
> > 12  0.004969  10.228.38.175  10.228.194.30  NVMe/TCP  90
> > 
> > 
> > The last few seconds contain only normal writes and suddenly the host
> > receives two invalid packets. From what I see the host doesn't misbehave
> > at all. I wonder if it would be possible to detect the invalid packet by
> > looking at the PDU header only. If this were possible we could
> > discard it early and not try to use the invalid command id.
> > 
> > Here the last two packets with details:
> > 
> > 
> > No. Time      Source         Destination    Protocol  Length  Info
> > 11  0.004890  10.228.38.214  10.228.194.30  NVMe/TCP  90
> > 
> > Frame 11: 90 bytes on wire (720 bits), 90 bytes captured (720 bits)
> >  Encapsulation type: Ethernet (1)
> >  Arrival Time: Feb 23, 2021 18:16:08.780574000 CET
> >  [Time shift for this packet: 0.0 seconds]
> >  Epoch Time: 1614100568.780574000 seconds
> >  [Time delta from previous captured frame: 0.00047 seconds]
> >  [Time delta from previous displayed frame: 0.00047 seconds]
> >  [Time since reference or first frame: 0.00489 seconds]
> >  Frame Number: 11
> >  Frame Length: 90 bytes (720 bits)
> >  Capture Length: 90 bytes (720 bits)
> >  [Frame is marked: False]
> >  [Frame is ignored: False]
> >  [Protocols in frame: eth:ethertype:ip:tcp:nvme-tcp]
> >  [Coloring Rule Name: TCP]
> >  [Coloring Rule String: tcp]
> > Ethernet II, Src: IntelCor_41:16:c0 (b4:96:91:41:16:c0), Dst: Cisco_9f:f5:a8 (00:00:0c:9f:f5:a8)
> >  Destination: Cisco_9f:f5:a8 (00:00:0c:9f:f5:a8)
> >   Address: Cisco_9f:f5:a8 (00:00:0c:9f:f5:a8)
> >   ..0. = LG bit: Globally unique address (factory default)
> >   ...0 = IG bit: Individual address (unicast)
> >  Source: IntelCor_41:16:c0 (b4:96:91:41:16:c0)
> >   Address: IntelCor_41:16:c0 (b4:96:91:41:16:c0)
> >   ..0. = LG bit: Globally unique address (factory default)
> >   ...0 = IG bit: Individual address (unicast)
> >  Type: IPv4 (0x0800)
> > Internet Protocol Version 4, Src: 10.228.38.214, Dst: 10.228.194.30
> >  0100 = Version: 4
> >  0101 = Header Length: 20 bytes (5)
> >  Differentiated Services Field: 0x00 (DSCP: CS0, ECN: Not-ECT)
> >   00.. = Differentiated Services Codepoint: Default (0)
> >   ..00 = Explicit Congestion Notification: Not ECN-Capable Transport (0)
> >  Total Length: 76
> >  Identification: 0x (0)
> >  Flags: 0x40, Don't fragment
> >   0... = Reserved bit: Not set
> >   .1.. = Don't fragment: Set
> >   ..0. = More fragments: Not set
> >  Fragment Offset: 0
> >  Time to Live: 64
> >  Protocol: TCP (6)
> >  Header Checksum: 0x [validation disabled]
> >  [Header checksum status: Unverified]
> >  Source Address: 10.228.38.214
> >  Destination Address: 10.228.194.30
> > Transmission Control Protocol, Src Port: 4420, Dst Port: 46909, Seq: 1,

Re: [PATCH] nvme-core: Switch to using the new API kobj_to_dev()

2021-02-22 Thread Keith Busch
On Sat, Feb 20, 2021 at 05:10:18PM +0800, Yang Li wrote:
> fixed the following coccicheck:
> ./drivers/nvme/host/core.c:3440:60-61: WARNING opportunity for
> kobj_to_dev()
> ./drivers/nvme/host/core.c:3679:60-61: WARNING opportunity for
> kobj_to_dev()
> 
> Reported-by: Abaci Robot 
> Signed-off-by: Yang Li 

This was rejected the last time it posted 6 months ago:

http://lists.infradead.org/pipermail/linux-nvme/2020-September/019463.html


Re: [RFC PATCH v5 0/4] add simple copy support

2021-02-20 Thread Keith Busch
On Sat, Feb 20, 2021 at 06:01:56PM +, David Laight wrote:
> From: SelvaKumar S
> > Sent: 19 February 2021 12:45
> > 
> > This patchset tries to add support for TP4065a ("Simple Copy Command"),
> > v2020.05.04 ("Ratified")
> > 
> > The Specification can be found in following link.
> > https://nvmexpress.org/wp-content/uploads/NVM-Express-1.4-Ratified-TPs-1.zip
> > 
> > Simple copy command is a copy offloading operation and is  used to copy
> > multiple contiguous ranges (source_ranges) of LBA's to a single destination
> > LBA within the device reducing traffic between host and device.
> 
> Sounds to me like the real reason is that the copy just ends up changing
> some indirect block pointers rather than having to actually copy the data.

I guess an implementation could do that, but I think that's missing the
point of the command. The intention is to copy the data to a new
location on the media for host managed garbage collection. 


Re: [PATCH] nvme-tcp: Check if request has started before processing it

2021-02-12 Thread Keith Busch
On Fri, Feb 12, 2021 at 12:58:27PM -0800, Sagi Grimberg wrote:
> > blk_mq_tag_to_rq() will always return a request if the command_id is
> > in the valid range. Check if the request has been started. If we
> > blindly process the request we might double complete a request which
> > can be fatal.
> 
> How did you get to this one? did the controller send a completion for
> a completed/bogus request?

If that is the case, then that must mean it's possible the driver could
have started the command id just before the bogus completion check. Data
corruption, right?


Re: [PATCH] nvme: Add 48-bit DMA address quirk

2021-02-03 Thread Keith Busch
On Wed, Feb 03, 2021 at 12:22:31PM +0100, Filippo Sironi wrote:
> 
> On 2/3/21 12:15 PM, Christoph Hellwig wrote:
> > 
> > On Wed, Feb 03, 2021 at 12:12:31PM +0100, Filippo Sironi wrote:
> > > I don't disagree on the first part of your sentence, this is a big
> > > oversight.
> > 
> > But it is not what your commit log suggests.
> 
> I can definitely rephrase the commit.
> 
> > > On the other hand, those controllers are out there and are in use by a lot
> > > of customers.  We can keep relying on luck, hoping that customers don't 
> > > run
> > > into troubles or we can merge a few lines of code :)
> > 
> > Your patch does not just quirk a few controllers out there, but all
> > current and future controllers with an Amazon vendor ID.  We could
> > probably talk about quirking an existing vendor ID or two as long as
> > this doesn't happen for future hardware.
> 
> I know that the hardware team is working on this but I don't know the
> timelines and there are a few upcoming controllers - of which I don't know
> the device ids yet - that have the same issue.
> 
> To avoid issues, it is easier to apply the quirk to all Amazon NVMe
> controllers for now till the new lines of controllers with the fix comes
> out.  At that point, we'll be able to restrict the application to the known
> bad controllers.

Just set the quirk for the known device IDs and append more as needed
(which should hopefully never happen). Sure, your future broken devices
may not work with the first kernel that introduced the quirk, but this
is how the quirks should be documented in the code.
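A sketch of what pinning the quirk to specific devices looks like in the driver's PCI ID table. The vendor ID 0x1d0f is Amazon's, but the device ID and quirk flag name below are placeholders for illustration, not the actual entries:

```c
static const struct pci_device_id nvme_id_table[] = {
	/* ... existing entries ... */
	/* Quirk only the devices known to be limited to 48-bit DMA
	 * addressing; append new IDs here as broken parts are found. */
	{ PCI_DEVICE(0x1d0f, 0x0061),	/* hypothetical device ID */
		.driver_data = NVME_QUIRK_DMA_ADDRESS_BITS_48, },
	{ 0, }
};
```

This way the quirk is self-documenting in the code and never silently applies to future, fixed hardware.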


Re: [PATCH V2 3/3] Adding device_dma_parameters->offset_preserve_mask to NVMe driver.

2021-02-01 Thread Keith Busch
On Mon, Feb 01, 2021 at 10:30:17AM -0800, Jianxiong Gao wrote:
> @@ -868,12 +871,24 @@ static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
>   if (!iod->nents)
>   goto out_free_sg;
>  
> + offset_ret = dma_set_min_align_mask(dev->dev, NVME_CTRL_PAGE_SIZE - 1);
> + if (offset_ret) {
> + dev_warn(dev->dev, "dma_set_min_align_mask failed to set offset\n");
> + goto out_free_sg;
> + }
> +
>   if (is_pci_p2pdma_page(sg_page(iod->sg)))
>   nr_mapped = pci_p2pdma_map_sg_attrs(dev->dev, iod->sg,
>   iod->nents, rq_dma_dir(req), DMA_ATTR_NO_WARN);
>   else
>   nr_mapped = dma_map_sg_attrs(dev->dev, iod->sg, iod->nents,
>rq_dma_dir(req), DMA_ATTR_NO_WARN);
> +
> + offset_ret = dma_set_min_align_mask(dev->dev, 0);
> + if (offset_ret) {
> + dev_warn(dev->dev, "dma_set_min_align_mask failed to reset offset\n");
> + goto out_free_sg;
> + }
>   if (!nr_mapped)
>   goto out_free_sg;

Why is this setting being done and undone on each IO? Wouldn't it be
more efficient to set it once during device initialization?

And more importantly, this isn't thread safe: one CPU may be setting the
device's dma alignment mask to 0 while another CPU is expecting it to be
NVME_CTRL_PAGE_SIZE - 1.
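A sketch of the alternative suggested above: configure the mask once during device initialization rather than toggling it in the I/O path, which also removes the cross-CPU race. The helper name is hypothetical and the signature follows the int-returning dma_set_min_align_mask() from the series under review:

```c
/* Sketch: set the DMA minimum alignment mask once at controller probe.
 * NVMe PRPs require mappings to preserve the offset within an
 * NVME_CTRL_PAGE_SIZE granule, so configure that up front instead of
 * setting and clearing it around every dma_map_sg_attrs() call. */
static int nvme_setup_dma_align(struct nvme_dev *dev)
{
	return dma_set_min_align_mask(dev->dev, NVME_CTRL_PAGE_SIZE - 1);
}
```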


Re: [PATCH 2/3] Add swiotlb offset preserving mapping when dma_dma_parameters->page_offset_mask is non zero.

2021-01-28 Thread Keith Busch
On Thu, Jan 28, 2021 at 12:15:28PM -0500, Konrad Rzeszutek Wilk wrote:
> On Wed, Jan 27, 2021 at 04:38:28PM -0800, Jianxiong Gao wrote:
> > For devices that need to preserve address offset on mapping through
> > swiotlb, this patch adds offset preserving based on page_offset_mask
> > and keeps the offset if the mask is non zero. This is needed for
> > device drivers like NVMe.
> 
> 
> 
> Didn't you send this patch like a month ago and someone pointed
> out that the right fix would be in the NVMe driver?
> 
> Is there an issue with fixing the NVMe driver?

You got it backwards. The initial "fix" used a flag specific to the nvme
driver, and it was pointed out that it should just be the generic
behaviour.


Re: [RFC PATCH v3 1/2] block: add simple copy support

2020-12-11 Thread Keith Busch
On Fri, Dec 11, 2020 at 07:21:38PM +0530, SelvaKumar S wrote:
> +int blk_copy_emulate(struct block_device *bdev, struct blk_copy_payload *payload,
> + gfp_t gfp_mask)
> +{
> + struct request_queue *q = bdev_get_queue(bdev);
> + struct bio *bio;
> + void *buf = NULL;
> + int i, nr_srcs, max_range_len, ret, cur_dest, cur_size;
> +
> + nr_srcs = payload->copy_range;
> + max_range_len = q->limits.max_copy_range_sectors << SECTOR_SHIFT;

The default value for this limit is 0, and this is the function for when
the device doesn't support copy. Are we expecting drivers to set this
value to something else for that case?

> + cur_dest = payload->dest;
> + buf = kvmalloc(max_range_len, GFP_ATOMIC);
> + if (!buf)
> + return -ENOMEM;
> +
> + for (i = 0; i < nr_srcs; i++) {
> + bio = bio_alloc(gfp_mask, 1);
> + bio->bi_iter.bi_sector = payload->range[i].src;
> + bio->bi_opf = REQ_OP_READ;
> + bio_set_dev(bio, bdev);
> +
> + cur_size = payload->range[i].len << SECTOR_SHIFT;
> + ret = bio_add_page(bio, virt_to_page(buf), cur_size,
> +offset_in_page(payload));

'buf' is vmalloc'ed, so we don't necessarily have contiguous pages. I
think you need to allocate the bio with bio_map_kern() or something like
that instead with that kind of memory.
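One hedged way to handle a possibly vmalloc-backed buffer, if bio_map_kern() isn't used, is to add it to the bio page by page, translating each page with vmalloc_to_page() instead of virt_to_page(). The helper name is hypothetical, and the bio would need to be allocated with enough vecs (e.g. bio_alloc(gfp, DIV_ROUND_UP(len, PAGE_SIZE))):

```c
/* Sketch: add a kvmalloc()'d buffer to a bio one page at a time, since
 * vmalloc memory is virtually but not physically contiguous. */
static int blk_copy_add_buf(struct bio *bio, void *buf, unsigned int len)
{
	unsigned int off, bytes;

	for (off = 0; off < len; off += bytes) {
		struct page *page = is_vmalloc_addr(buf + off) ?
				vmalloc_to_page(buf + off) :
				virt_to_page(buf + off);

		bytes = min_t(unsigned int, PAGE_SIZE, len - off);
		if (bio_add_page(bio, page, bytes,
				 offset_in_page(buf + off)) != bytes)
			return -ENOMEM;
	}
	return 0;
}
```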

> + if (ret != cur_size) {
> + ret = -ENOMEM;
> + goto out;
> + }
> +
> + ret = submit_bio_wait(bio);
> + bio_put(bio);
> + if (ret)
> + goto out;
> +
> + bio = bio_alloc(gfp_mask, 1);
> + bio_set_dev(bio, bdev);
> + bio->bi_opf = REQ_OP_WRITE;
> + bio->bi_iter.bi_sector = cur_dest;
> + ret = bio_add_page(bio, virt_to_page(buf), cur_size,
> +offset_in_page(payload));
> + if (ret != cur_size) {
> + ret = -ENOMEM;
> + goto out;
> + }
> +
> + ret = submit_bio_wait(bio);
> + bio_put(bio);
> + if (ret)
> + goto out;
> +
> + cur_dest += payload->range[i].len;
> + }

I think this would be a faster implementation if the reads were
asynchronous with a payload buffer allocated specific to that read, and
the callback can enqueue the write part. This would allow you to
accumulate all the read data and write it in a single call. 


Re: [PATCH] nvme: hwmon: fix crash on device teardown

2020-12-10 Thread Keith Busch
On Wed, Dec 09, 2020 at 06:32:27PM -0300, Enzo Matsumiya wrote:
> +void nvme_hwmon_exit(struct nvme_ctrl *ctrl)
> +{
> + hwmon_device_unregister(ctrl->dev);
> +}

The hwmon registration uses the devm_ version, so don't we need to use
the devm_hwmon_device_unregister() here?


Re: [RFC PATCH v2 0/2] add simple copy support

2020-12-04 Thread Keith Busch
On Fri, Dec 04, 2020 at 11:25:12AM +, Damien Le Moal wrote:
> On 2020/12/04 20:02, SelvaKumar S wrote:
> > This patchset tries to add support for TP4065a ("Simple Copy Command"),
> > v2020.05.04 ("Ratified")
> > 
> > The Specification can be found in following link.
> > https://nvmexpress.org/wp-content/uploads/NVM-Express-1.4-Ratified-TPs-1.zip
> > 
> > This is an RFC. Looking forward for any feedbacks or other alternate
> > designs for plumbing simple copy to IO stack.
> > 
> > Simple copy command is a copy offloading operation and is  used to copy
> > multiple contiguous ranges (source_ranges) of LBA's to a single destination
> > LBA within the device reducing traffic between host and device.
> > 
> > This implementation accepts destination, no of sources and arrays of
> > source ranges from application and attach it as payload to the bio and
> > submits to the device.
> > 
> > Following limits are added to queue limits and are exposed in sysfs
> > to userspace
> > - *max_copy_sectors* limits the sum of all source_range length
> > - *max_copy_nr_ranges* limits the number of source ranges
> > - *max_copy_range_sectors* limit the maximum number of sectors
> > that can constitute a single source range.
> 
> Same comment as before. I think this is a good start, but for this to be 
> really
> useful to users and kernel components alike, this really needs copy emulation
> for drives that do not have a native copy feature, similarly to what write 
> zeros
> handling for instance: if the drive does not have a copy command (simple copy
> for NVMe or XCOPY for scsi), then the block layer should issue read/write
> commands to seamlessly execute the copy. Otherwise, this will only serve a 
> small
> niche for users and will not be optimal for FS and DM drivers that could be
> simplified with a generic block layer copy functionality.
> 
> This is my 10 cents though, others may differ about this.

Yes, I agree that copy emulation support should be included with the
hardware enabled solution.
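To make the fallback concrete, here is a minimal userspace sketch (names and layout are mine, not the proposed kernel code) of what copy emulation amounts to: when the device lacks a native copy command, walk the source ranges in order and move each one to a contiguous destination, much as a read/write loop in the block layer would.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct copy_range {
	size_t src_off;	/* byte offset of the source range */
	size_t len;	/* byte length of the source range */
};

/*
 * Illustrative sketch only: emulate a "simple copy" by moving each
 * source range, in order, to a contiguous destination offset.
 */
static size_t emulate_simple_copy(unsigned char *buf, size_t dst_off,
				  const struct copy_range *ranges, int nr)
{
	size_t copied = 0;
	int i;

	for (i = 0; i < nr; i++) {
		memmove(buf + dst_off + copied,
			buf + ranges[i].src_off, ranges[i].len);
		copied += ranges[i].len;
	}
	return copied;	/* total bytes placed at the destination */
}
```

A real implementation would of course issue bios rather than memmove(), and would have to respect the same queue limits either way.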


Re: [RFC PATCH 2/2] nvme: add simple copy support

2020-12-01 Thread Keith Busch
On Tue, Dec 01, 2020 at 11:09:49AM +0530, SelvaKumar S wrote:
> +static void nvme_config_copy(struct gendisk *disk, struct nvme_ns *ns,
> +struct nvme_id_ns *id)
> +{
> + struct nvme_ctrl *ctrl = ns->ctrl;
> + struct request_queue *queue = disk->queue;
> +
> + if (!(ctrl->oncs & NVME_CTRL_ONCS_COPY)) {
> + queue->limits.max_copy_sectors = 0;
> + blk_queue_flag_clear(QUEUE_FLAG_COPY, queue);
> + return;
> + }
> +
> + /* setting copy limits */
> + ns->mcl = le64_to_cpu(id->mcl);
> + ns->mssrl = le32_to_cpu(id->mssrl);
> + ns->msrc = id->msrc;

These are not used anywhere outside this function, so there's no need to
add members to the struct.

> + if (blk_queue_flag_test_and_set(QUEUE_FLAG_COPY, queue))
> + return;

The queue limits are not necessarily the same each time we're called to
update the disk info, so this return shouldn't be here.

> +
> + queue->limits.max_copy_sectors = ns->mcl * (1 << (ns->lba_shift - 9));
> + queue->limits.max_copy_range_sectors = ns->mssrl *
> + (1 << (ns->lba_shift - 9));
> + queue->limits.max_copy_nr_ranges = ns->msrc + 1;
> +}
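For readers unfamiliar with the `1 << (lba_shift - 9)` factor in the hunk above: queue limits are kept in 512-byte block-layer sectors, so a count of device LBAs is scaled by the number of 512-byte units per LBA. A standalone sketch (function name is mine, not the driver's):

```c
#include <assert.h>

/*
 * Standalone illustration, not driver code: scale a count of device
 * LBAs into 512-byte sectors. lba_shift is log2 of the LBA size, so
 * shifting by (lba_shift - 9) multiplies by 2^(lba_shift - 9), the
 * number of 512-byte units per LBA.
 */
static unsigned long long lbas_to_sectors(unsigned long long nlb,
					  unsigned int lba_shift)
{
	return nlb << (lba_shift - 9);
}
```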

<>

> @@ -2045,6 +2133,7 @@ static void nvme_update_disk_info(struct gendisk *disk,
>   set_capacity_and_notify(disk, capacity);
>  
>   nvme_config_discard(disk, ns);
> + nvme_config_copy(disk, ns, id);
>   nvme_config_write_zeroes(disk, ns);
>  
>   if (id->nsattr & NVME_NS_ATTR_RO)
> @@ -3014,6 +3103,7 @@ int nvme_init_identify(struct nvme_ctrl *ctrl)
>   ctrl->oaes = le32_to_cpu(id->oaes);
>   ctrl->wctemp = le16_to_cpu(id->wctemp);
>   ctrl->cctemp = le16_to_cpu(id->cctemp);
> + ctrl->ocfs = le32_to_cpu(id->ocfs);

ocfs is not used anywhere.


Re: [PATCH v2] nvme: Cache DMA descriptors to prevent corruption.

2020-11-20 Thread Keith Busch
On Fri, Nov 20, 2020 at 09:02:43AM +0100, Christoph Hellwig wrote:
> On Thu, Nov 19, 2020 at 05:27:37PM -0800, Tom Roeder wrote:
> > This patch changes the NVMe PCI implementation to cache host_mem_descs
> > in non-DMA memory instead of depending on descriptors stored in DMA
> > memory. This change is needed under the malicious-hypervisor threat
> > model assumed by the AMD SEV and Intel TDX architectures, which encrypt
> > guest memory to make it unreadable. Some versions of these architectures
> > also make it cryptographically hard to modify guest memory without
> > detection.
> 
> I don't think this is a useful threat model, and I've not seen a
> discussion on lkml where we had any discussion on this kind of threat
> model either.
> 
> Before you start sending patches that regress optimizations in various
> drivers (and there will be lots with this model) we need to have a
> broader discussion first.
> 
> And HMB support, which is for low-end consumer devices that are usually
> not directly assigned to VMs aren't a good starting point for this.

Yeah, while doing this for HMB isn't really a performance concern, this
method for chaining SGL/PRP lists would be.

And perhaps more importantly, the proposed mitigation only lets the
guest silently carry on from such an attack while the device is surely
corrupting something. I think we'd rather free the wrong address since
that may at least eventually raise an error.


Re: [PATCH] nvme: Cache DMA descriptors to prevent corruption.

2020-11-19 Thread Keith Busch
On Thu, Nov 19, 2020 at 10:59:19AM -0800, Tom Roeder wrote:
> This patch changes the NVMe PCI implementation to cache host_mem_descs
> in non-DMA memory instead of depending on descriptors stored in DMA
> memory. This change is needed under the malicious-hypervisor threat
> model assumed by the AMD SEV and Intel TDX architectures, which encrypt
> guest memory to make it unreadable. Some versions of these architectures
> also make it cryptographically hard to modify guest memory without
> detection.
> 
> On these architectures, Linux generally leaves DMA memory unencrypted so
> that devices can still communicate directly with the kernel: DMA memory
> remains readable to and modifiable by devices. This means that this
> memory is also accessible to a hypervisor.
> 
> However, this means that a malicious hypervisor could modify the addr or
> size fields of descriptors and cause the NVMe driver to call
> dma_free_attrs on arbitrary addresses or on the right addresses but with
> the wrong size. To prevent this attack, this commit changes the code to
> cache those descriptors in non-DMA memory and to use the cached values
> when freeing the memory they describe.
 
If the hypervisor does that, then the device may use the wrong
addresses, too. I guess you can't do anything about that from the
driver, though.

> + /* Cache the host_mem_descs in non-DMA memory so a malicious hypervisor
> +  * can't change them.
> +  */
> + struct nvme_host_mem_buf_desc *host_mem_descs_cache;
>   void **host_mem_desc_bufs;

This is never seen by an nvme device, so no need for an nvme specific
type here. You can use arch native types.
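A userspace sketch of the caching idea being proposed, using plain native types as suggested (struct and function names here are hypothetical): snapshot the descriptors into private memory at setup time, so the free path trusts the cached values even if the DMA-visible copies are later tampered with.

```c
#include <assert.h>
#include <string.h>

/* Host-native layout; nothing here is ever seen by the device. */
struct hmb_desc_shadow {
	unsigned long long addr;
	unsigned int size;
};

/*
 * Sketch of the mitigation: copy the DMA-visible descriptors into
 * private (non-DMA) storage so later teardown can use trusted values.
 */
static void cache_hmb_descs(struct hmb_desc_shadow *shadow,
			    const struct hmb_desc_shadow *dma_visible, int n)
{
	memcpy(shadow, dma_visible, n * sizeof(*shadow));
}
```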


Re: [PATCH 0/2] nvme-pic: improve max I/O queue handling

2020-11-12 Thread Keith Busch
On Thu, Nov 12, 2020 at 04:45:35PM +0100, Niklas Schnelle wrote:
> You got to get something wrong, I hope in this case it's just the subject
> of the cover letter :D

I suppose the change logs could be worded a little better :)

> Thanks for the review, I appreciate it. Might be getting ahead of
> myself but I'm curious who would take this change through their
> tree if accepted?

The linux-nvme tree is located here:

  http://git.infradead.org/nvme.git

Christoph is currently handling patch commits there.


Re: [PATCH 0/2] nvme-pic: improve max I/O queue handling

2020-11-12 Thread Keith Busch
On Thu, Nov 12, 2020 at 09:23:00AM +0100, Niklas Schnelle wrote:
> while searching for a bug around zPCI + NVMe IRQ handling on a distro
> kernel, I got confused around handling of the maximum number
> of I/O queues in the NVMe driver.
> I think I groked it in the end but would like to propose the following
> improvements, that said I'm quite new to this code.
> I tested both patches on s390x (with a debug config) and x86_64 so
> with both data center and consumer NVMes.
> For the second patch, since I don't own a device with the quirk, I tried
> always returning 1 from nvme_max_io_queues() and confirmed that on my
> Evo 970 Pro this resulted in about half the performance in a fio test
> but did not otherwise break things. I couldn't find a reason why
> allocating only the I/O queues we actually use would be problematic in
> the code either but I might have missed something of course.

I don't think you missed anything, and the series looks like a
reasonable cleanup. I suspect the code was left over from a time when we
didn't allocate the possible queues up-front.

Reviewed-by: Keith Busch 


Re: [RFC PATCH 15/15] nvme-pci: Allow mmaping the CMB in userspace

2020-11-09 Thread Keith Busch
On Fri, Nov 06, 2020 at 10:00:36AM -0700, Logan Gunthorpe wrote:
> Allow userspace to obtain CMB memory by mmaping the controller's
> char device. The mmap call allocates and returns a hunk of CMB memory,
> (the offset is ignored) so userspace does not have control over the
> address within the CMB.
> 
> A VMA allocated in this way will only be usable by drivers that set
> FOLL_PCI_P2PDMA when calling GUP. And inter-device support will be
> checked the first time the pages are mapped for DMA.
> 
> Currently this is only supported by O_DIRECT to a PCI NVMe device
> or through the NVMe passthrough IOCTL.

Rather than make this be specific to nvme, could pci p2pdma create an
mmap'able file for any resource registered with it?


Re: [PATCH V3 1/1] nvme: Add quirk for LiteON CL1 devices running FW 220TQ,22001

2020-10-29 Thread Keith Busch
On Thu, Oct 29, 2020 at 11:33:06AM +0900, Keith Busch wrote:
> On Thu, Oct 29, 2020 at 02:20:27AM +, Gloria Tsai wrote:
> > Corrected the description of this bug that SSD will not do GC after 
> > receiving shutdown cmd.
> > Do GC before shutdown -> delete IO Q -> shutdown from host -> breakup GC -> 
> > D3hot -> enter PS4 -> have a chance swap block -> use wrong pointer on 
> > device SRAM -> over program
> 
> What do you mean by "wrong pointer"? At the place in the sequence you're
> referring to, the PCI BME is disabled: you can't access *any* host RAM,
> so there's no "correct" pointer either.

Re-reading your message, I do see you said "device" rather than "host",
so my response may not be relevant.


Re: [PATCH V3 1/1] nvme: Add quirk for LiteON CL1 devices running FW 220TQ,22001

2020-10-28 Thread Keith Busch
On Thu, Oct 29, 2020 at 02:20:27AM +, Gloria Tsai wrote:
> Corrected the description of this bug that SSD will not do GC after receiving 
> shutdown cmd.
> Do GC before shutdown -> delete IO Q -> shutdown from host -> breakup GC -> 
> D3hot -> enter PS4 -> have a chance swap block -> use wrong pointer on device 
> SRAM -> over program

What do you mean by "wrong pointer"? At the place in the sequence you're
referring to, the PCI BME is disabled: you can't access *any* host RAM,
so there's no "correct" pointer either.


Re: [PATCH 1/1] nvme: Add quirk for LiteON CL1 devices running FW 220TQ,22001

2020-10-27 Thread Keith Busch
On Wed, Oct 28, 2020 at 12:54:38AM +0900, Jongpil Jung wrote:
> suspend.
> 
> When NVMe device receive D3hot from host, NVMe firmware will do
> garbage collection. While NVMe device do Garbage collection,
> firmware has chance to going incorrect address.
> In that case, NVMe storage device goes to no device available state.
> Finally, host can't access the device any more.
> 
> Quirk devices will not use simple suspend even if HMB is enabled.
> In case of poweroff scenario, NVMe receive "PME turn off".
> So garbage collection will not be happening.
> 
> Liteon(SSSTC) will fix the issue, that's why quirk apply on specific
> vendor id and firmware version.

This is a concerning quirk. We use the simple suspend when HMB is
enabled because at least some platforms disable device DMA access in the
runtime suspend state. Many devices continue to access their HMB while
in low power, so we can't let both conditions occur concurrently.
Unless you know for sure this device doesn't access host memory in
low-power, there will be at least some platform combinations where this
quirk will fail.


Re: [PATCH v2] nvmet: fix uninitialized work for zero kato

2020-10-14 Thread Keith Busch
On Wed, Oct 14, 2020 at 11:36:50AM +0800, zhenwei pi wrote:
> Fixes:
> Don't run keep alive work with zero kato.

"Fixes" tags need to have a git commit id followed by the commit
subject. I can't find any commit with that subject, though.
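For reference, the expected form is a 12-character abbreviated commit id plus the quoted subject (the id and subject below are placeholders, not the actual commit):

```
Fixes: 123456789abc ("subsystem: original patch subject")
```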


Re: [PATCH] [v2] nvme: replace meaningless judgement by checking whether req is null

2020-09-22 Thread Keith Busch
The commit subject is too long. We should really try to keep these to
50 characters or less.

  nvme-pci: fix NULL req in completion handler

Otherwise, looks fine.

Reviewed-by: Keith Busch 


Re: [PATCH] nvme: replace meaningless judgement by checking whether req is null

2020-09-21 Thread Keith Busch
On Mon, Sep 21, 2020 at 03:49:09PM +, Tianxianting wrote:
> HI Keith,
> Thanks for your comments,
> I will submit a new patch of version 2 for the further reviewing,  v2 patch 
> will contains:
> 1, retain existing judgement and dev_warn;

No no, go ahead and remove the existing check just as you've done. That
check becomes redundant with the safer one you're adding, and we don't
want redundant checks in the fast path. My only suggestion is to use
the same dev_warn() in your new check.

> 2, add the check whether req is null(already did in this patch)
> 3, simplify and make the changelog succinct according to you said " This is 
> what I'm thinking:".
> Is it right?
> Should I retain the nvme_irq crash log in changelog, mention the difference 
> between nvmeq->q_depth and tagset queue_depth? 

The tagset's queue_depth is a valid point to mention as well. The
driver's current indirect check is not necessarily in sync with the
actual tagset.
 
> Thanks
> 
> -Original Message-
> From: Keith Busch [mailto:kbu...@kernel.org] 
> Sent: Monday, September 21, 2020 11:08 PM
> To: tianxianting (RD) 
> Cc: ax...@fb.com; h...@lst.de; s...@grimberg.me; 
> linux-n...@lists.infradead.org; linux-kernel@vger.kernel.org
> Subject: Re: [PATCH] nvme: replace meaningless judgement by checking whether 
> req is null
> 
> On Mon, Sep 21, 2020 at 10:10:52AM +0800, Xianting Tian wrote:
> > @@ -940,13 +940,6 @@ static inline void nvme_handle_cqe(struct nvme_queue 
> > *nvmeq, u16 idx)
> > struct nvme_completion *cqe = &nvmeq->cqes[idx];
> > struct request *req;
> >  
> > -   if (unlikely(cqe->command_id >= nvmeq->q_depth)) {
> > -   dev_warn(nvmeq->dev->ctrl.device,
> > -   "invalid id %d completed on queue %d\n",
> > -   cqe->command_id, le16_to_cpu(cqe->sq_id));
> > -   return;
> > -   }
> > -
> > /*
> >  * AEN requests are special as they don't time out and can
> >  * survive any kind of queue freeze and often don't respond to @@ 
> > -960,6 +953,13 @@ static inline void nvme_handle_cqe(struct nvme_queue 
> > *nvmeq, u16 idx)
> > }
> >  
> > req = blk_mq_tag_to_rq(nvme_queue_tagset(nvmeq), cqe->command_id);
> > +   if (unlikely(!req)) {
> > +   dev_warn(nvmeq->dev->ctrl.device,
> > +   "req is null for tag %d completed on queue %d\n",
> > +   cqe->command_id, le16_to_cpu(cqe->sq_id));
> > +   return;
> > +   }
> 
> This is making sense now, though I think we should retain the existing
> dev_warn() since it's still accurate and provides continuity for people who 
> are used to looking for these sorts of messages.
> 
> Your changelog is a bit much though. I think we can say it a bit more 
> succinctly. This is what I'm thinking:
> 
>   The driver registers interrupts for queues before initializing the
>   tagset because it uses the number of successful request_irq() calls
>   to configure the tagset parameters. This allows a race condition with
>   the current tag validity check if the controller happens to produce
>   an interrupt with a corrupted CQE before the tagset is initialized.
> 
>   Replace the driver's indirect tag check with the one already provided
>   by the block layer.


Re: [PATCH] nvme: replace meaningless judgement by checking whether req is null

2020-09-21 Thread Keith Busch
On Mon, Sep 21, 2020 at 10:10:52AM +0800, Xianting Tian wrote:
> @@ -940,13 +940,6 @@ static inline void nvme_handle_cqe(struct nvme_queue 
> *nvmeq, u16 idx)
>   struct nvme_completion *cqe = &nvmeq->cqes[idx];
>   struct request *req;
>  
> - if (unlikely(cqe->command_id >= nvmeq->q_depth)) {
> - dev_warn(nvmeq->dev->ctrl.device,
> - "invalid id %d completed on queue %d\n",
> - cqe->command_id, le16_to_cpu(cqe->sq_id));
> - return;
> - }
> -
>   /*
>* AEN requests are special as they don't time out and can
>* survive any kind of queue freeze and often don't respond to
> @@ -960,6 +953,13 @@ static inline void nvme_handle_cqe(struct nvme_queue 
> *nvmeq, u16 idx)
>   }
>  
>   req = blk_mq_tag_to_rq(nvme_queue_tagset(nvmeq), cqe->command_id);
> + if (unlikely(!req)) {
> + dev_warn(nvmeq->dev->ctrl.device,
> + "req is null for tag %d completed on queue %d\n",
> + cqe->command_id, le16_to_cpu(cqe->sq_id));
> + return;
> + }

This is making sense now, though I think we should retain the existing
dev_warn() since it's still accurate and provides continuity for people
who are used to looking for these sorts of messages.

Your changelog is a bit much though. I think we can say it a bit more
succinctly. This is what I'm thinking:

  The driver registers interrupts for queues before initializing the
  tagset because it uses the number of successful request_irq() calls
  to configure the tagset parameters. This allows a race condition with
  the current tag validity check if the controller happens to produce
  an interrupt with a corrupted CQE before the tagset is initialized.

  Replace the driver's indirect tag check with the one already provided
  by the block layer.
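A minimal userspace model of why the block layer's lookup subsumes the old depth check (this is a toy, not the real blk_mq_tag_to_rq()): an out-of-range tag and an in-range tag with no associated request both come back NULL, so one check covers both cases.

```c
#include <assert.h>
#include <stddef.h>

#define NR_TAGS 8

static void *rqs[NR_TAGS];	/* NULL until a request is started */

/*
 * Toy model: return the request for a tag, or NULL if the tag is
 * out of range or no request occupies that slot.
 */
static void *tag_to_rq(unsigned int tag)
{
	return tag < NR_TAGS ? rqs[tag] : NULL;
}
```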


Re: [PATCH] [v2] nvme: use correct upper limit for tag in nvme_handle_cqe()

2020-09-18 Thread Keith Busch
On Fri, Sep 18, 2020 at 06:44:20PM +0800, Xianting Tian wrote:
> @@ -940,7 +940,9 @@ static inline void nvme_handle_cqe(struct nvme_queue 
> *nvmeq, u16 idx)
>   struct nvme_completion *cqe = &nvmeq->cqes[idx];
>   struct request *req;
>  
> - if (unlikely(cqe->command_id >= nvmeq->q_depth)) {
> + if (unlikely(cqe->command_id >=
> + nvmeq->qid ? nvmeq->dev->tagset.queue_depth :
> + nvmeq->dev->admin_tagset.queue_depth)) {

Both of these values are set before blk_mq_alloc_tag_set(), so you still
have a race. The interrupt handler probably just shouldn't be registered
with the queue before the tagset is initialized since there can't be any
work for the handler to do before that happens anyway.

The controller is definitely broken, though, and will lead to
unavoidable corruption if it's really behaving this way.


Re: [PATCH] nvme: fix NULL pointer dereference

2020-09-18 Thread Keith Busch
On Thu, Sep 17, 2020 at 11:32:12PM -0400, Tong Zhang wrote:
> Please correct me if I am wrong.
> After a bit more digging I found out that it is indeed command_id got
> corrupted is causing this problem. Although the tag and command_id
> range is checked like you said, the elements in rqs cannot be
> guaranteed to be not NULL. thus although the range check is passed,
> blk_mq_tag_to_rq() can still return NULL. 

I think you're describing a sequence problem in initialization. We
shouldn't have interrupts wired up to uninitialized tagsets.

A more appropriate sequence would setup request_irq() after the tagset
is ready. It makes handling a failed irq setup a bit weird for io
queues, though.

> It is clear that the current sanitization is not enough and there's
> more implication about this -- when all rqs got populated, a corrupted
> command_id may silently corrupt other data not belonging to the
> current command.

The block layer doesn't do anything with requests that haven't been
started, so if your controller completes non-existent commands, then
nothing particular will happen with the rqs.

If the request had been started and the controller provides a corrupted
completion, then the fault lies entirely with the controller and you
should raise that issue with your vendor. There's no way the driver can
distinguish a genuine completion from a corrupted one.


Re: [PATCH] nvme: fix NULL pointer dereference

2020-09-17 Thread Keith Busch
On Thu, Sep 17, 2020 at 12:56:59PM -0400, Tong Zhang wrote:
> The command_id in CQE is writable by NVMe controller, driver should
> check its sanity before using it.

We already do that.


Re: [PATCH] nvme: fix doulbe irq free

2020-09-17 Thread Keith Busch
On Thu, Sep 17, 2020 at 11:22:54AM -0400, Tong Zhang wrote:
> On Thu, Sep 17, 2020 at 4:30 AM Christoph Hellwig  wrote:
> >
> > On Wed, Sep 16, 2020 at 11:37:00AM -0400, Tong Zhang wrote:
> > > the irq might already been released before reset work can run
> >
> > If it is we have a problem with the state machine, as we shouldn't
> > have two resets running at the same time.  Can you share your reproducer?
> >
> 
> Hi Christoph,
> Thanks for replying.
> 
> It is possible that pci_free_irq() can be called twice for the same IRQ,
> the first call can be
> 
> [  124.413113] nvme nvme0: I/O 6 QID 0 timeout, disable controller
> [  124.414317] Workqueue: kblockd blk_mq_timeout_work
> [  124.414317] Call Trace:
> [  124.414317]  dump_stack+0x7d/0xb0
> [  124.414317]  nvme_suspend_queue.cold+0x11/0x58
> [  124.414317]  nvme_dev_disable+0x116/0x7b0
> [  124.414317]  nvme_timeout+0x309/0x320
> [  124.414317]  ? nvme_shutdown+0x40/0x40
> [  124.414317]  ? load_balance+0xe76/0x1450
> [  124.414317]  ? deref_stack_reg+0x40/0x40
> [  124.414317]  ? sched_clock_local+0x99/0xc0
> [  124.414317]  blk_mq_check_expired+0x313/0x340
> [  124.414317]  ? blk_mq_freeze_queue_wait+0x180/0x180
> [  124.414317]  ? _find_next_bit.constprop.0+0x3e/0xf0
> [  124.414317]  blk_mq_queue_tag_busy_iter+0x2e9/0x4a0
> [  124.414317]  ? blk_mq_freeze_queue_wait+0x180/0x180
> [  124.414317]  ? blk_mq_all_tag_iter+0x10/0x10
> [  124.414317]  ? _raw_write_lock_irqsave+0xd0/0xd0
> [  124.414317]  ? run_rebalance_domains+0x80/0x80
> [  124.414317]  ? blk_mq_freeze_queue_wait+0x180/0x180
> [  124.414317]  ? debug_object_deactivate.part.0+0x1c4/0x210
> [  124.414317]  blk_mq_timeout_work+0xaa/0x1f0
> [  124.414317]  ? __blk_mq_end_request+0x1f0/0x1f0
> [  124.414317]  ? __schedule+0x581/0xc40
> [  124.414317]  ? read_word_at_a_time+0xe/0x20
> [  124.414317]  ? strscpy+0xbf/0x190
> [  124.414317]  process_one_work+0x4ad/0x7e0
> [  124.414317]  worker_thread+0x73/0x690
> [  124.414317]  ? process_one_work+0x7e0/0x7e0
> [  124.414317]  kthread+0x199/0x1f0
> [  124.414317]  ? kthread_create_on_node+0xd0/0xd0
> [  124.414317]  ret_from_fork+0x22/0x30
> 
> And the second call can be
> [  149.763974] nvme nvme0: Failed to read smart log (error -4)
> [  149.774307] Trying to free already-free IRQ 10
> [  149.774385] Call Trace:
> [  149.774385]  pci_free_irq+0x13/0x20
> [  149.774385]  nvme_reset_work.cold+0x1be/0xa78
> [  149.774385]  ? nvme_probe+0x8c0/0x8c0
> [  149.774385]  ? check_preempt_curr+0x9d/0xf0
> [  149.774385]  ? ttwu_do_wakeup.isra.0+0x176/0x190
> [  149.774385]  ? try_to_wake_up+0x37c/0x900
> [  149.774385]  ? migration_cpu_stop+0x1e0/0x1e0
> [  149.774385]  ? __schedule+0x581/0xc40
> [  149.774385]  ? read_word_at_a_time+0xe/0x20
> [  149.774385]  ? strscpy+0xbf/0x190
> [  149.774385]  process_one_work+0x4ad/0x7e0
> [  149.774385]  worker_thread+0x73/0x690
> [  149.774385]  ? process_one_work+0x7e0/0x7e0
> [  149.774385]  kthread+0x199/0x1f0
> [  149.774385]  ? kthread_create_on_node+0xd0/0xd0
> [  149.774385]  ret_from_fork+0x22/0x30
> 
> and clearly something is not working as expected.

And the problem is that nvme_hwmon_init() doesn't return an error, so it
returns to continue initialization on a disabled controller. That's the
bug that needs to be fixed.


Re: [PATCH] nvme: fix NULL pointer dereference

2020-09-16 Thread Keith Busch
On Wed, Sep 16, 2020 at 11:36:49AM -0400, Tong Zhang wrote:
> @@ -960,6 +960,8 @@ static inline void nvme_handle_cqe(struct nvme_queue 
> *nvmeq, u16 idx)
>   }
>  
>   req = blk_mq_tag_to_rq(nvme_queue_tagset(nvmeq), cqe->command_id);
> + if (!req)
> + return;

As I mentioned before, blk_mq_tag_to_rq() returns NULL if the tag
exceeds the depth. We already verify the tag prior to calling this
function, so what's the real root cause for how we're winding up with
NULL here? I'm only asking this because it sounds like there's a bug
somewhere else and this change is masking over it.


>   trace_nvme_sq(req, cqe->sq_head, nvmeq->sq_tail);
>   if (!nvme_try_complete_req(req, cqe->status, cqe->result))
>   nvme_pci_complete_rq(req);


Re: [trivial PATCH] treewide: Convert switch/case fallthrough; to break;

2020-09-09 Thread Keith Busch
On Wed, Sep 09, 2020 at 01:06:39PM -0700, Joe Perches wrote:
> diff --git a/crypto/tcrypt.c b/crypto/tcrypt.c
> index eea0f453cfb6..8aac5bc60f4c 100644
> --- a/crypto/tcrypt.c
> +++ b/crypto/tcrypt.c
> @@ -2464,7 +2464,7 @@ static int do_test(const char *alg, u32 type, u32 mask, 
> int m, u32 num_mb)
>   test_hash_speed("streebog512", sec,
>   generic_hash_speed_template);
>   if (mode > 300 && mode < 400) break;
> - fallthrough;
> + break;
>   case 399:
>   break;

Just imho, this change makes the preceding 'if' look even more
pointless. Maybe the fallthrough was a deliberate choice? Not that my
opinion matters here as I don't know this module, but it looked a bit
odd to me.


Re: [PATCH] PCI/ASPM: Enable ASPM for links under VMD domain

2020-09-02 Thread Keith Busch
On Wed, Sep 02, 2020 at 01:48:19PM -0600, David Fugate wrote:
> Over the years, I've been forwarded numerous emails from VMD customers 
> praising it's ability to prevent Linux kernel panics upon hot-removals
> and inserts of U.2 NVMe drives.

The same nvme and pcie hotplug drivers are used with or without VMD
enabled. Where are the bug reports for linux kernel panics? We have a
very real interest in fixing such issues.


Re: [PATCH] [v2] nvme-pci: check req to prevent crash in nvme_handle_cqe()

2020-09-01 Thread Keith Busch
On Mon, Aug 31, 2020 at 06:55:53PM +0800, Xianting Tian wrote:
> As blk_mq_tag_to_rq() may return null, so it should be check whether it is
> null before using it to prevent a crash.

It may return NULL if the command id exceeds the number of tags. We
already have a check for a valid command id value, so something is not
adding up here if we're still getting NULL.

>   req = blk_mq_tag_to_rq(nvme_queue_tagset(nvmeq), cqe->command_id);
> + if (unlikely(!req)) {
> + dev_warn(nvmeq->dev->ctrl.device,
> + "req is null(tag:%d) on queue %d\n",
> + cqe->command_id, le16_to_cpu(cqe->sq_id));
> + return;
> + }


Re: [PATCH AUTOSEL 5.8 11/42] nvme: skip noiob for zoned devices

2020-08-31 Thread Keith Busch
On Mon, Aug 31, 2020 at 11:29:03AM -0400, Sasha Levin wrote:
> From: Keith Busch 
> 
> [ Upstream commit c41ad98bebb8f4f0335b3c50dbb7583a6149dce4 ]
> 
> Zoned block devices reuse the chunk_sectors queue limit to define zone
> boundaries. If a such a device happens to also report an optimal
> boundary, do not use that to define the chunk_sectors as that may
> intermittently interfere with io splitting and zone size queries.
> 
> Signed-off-by: Keith Busch 
> Signed-off-by: Sagi Grimberg 
> Signed-off-by: Jens Axboe 
> Signed-off-by: Sasha Levin 

You can safely drop this from stable: nvme zoned devices were only introduced
to linux in 5.9.


Re: [PATCH v2] nvme-pci: cancel nvme device request before disabling

2020-08-28 Thread Keith Busch
On Fri, Aug 28, 2020 at 10:17:08AM -0400, Tong Zhang wrote:
> This patch addresses an irq free warning and null pointer dereference
> error problem when nvme devices got timeout error during initialization.
> This problem happens when nvme_timeout() function is called while
> nvme_reset_work() is still in execution. This patch fixed the problem by
> setting flag of the problematic request to NVME_REQ_CANCELLED before
> calling nvme_dev_disable() to make sure __nvme_submit_sync_cmd() returns
> an error code and let nvme_submit_sync_cmd() fail gracefully.
> The following is console output.

Thanks, this looks good to me.

Reviewed-by: Keith Busch 


Re: [PATCH] nvme-pci: cancel nvme device request before disabling

2020-08-27 Thread Keith Busch
On Fri, Aug 14, 2020 at 12:11:56PM -0400, Tong Zhang wrote:
> On Fri, Aug 14, 2020 at 11:42 AM Keith Busch  wrote:
> > > > On Fri, Aug 14, 2020 at 03:14:31AM -0400, Tong Zhang wrote:
> > > > > diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> > > > > index ba725ae47305..c4f1ce0ee1e3 100644
> > > > > --- a/drivers/nvme/host/pci.c
> > > > > +++ b/drivers/nvme/host/pci.c
> > > > > @@ -1249,8 +1249,8 @@ static enum blk_eh_timer_return 
> > > > > nvme_timeout(struct request *req, bool reserved)
> > > > >   dev_warn_ratelimited(dev->ctrl.device,
> > > > >"I/O %d QID %d timeout, disable controller\n",
> > > > >req->tag, nvmeq->qid);
> > > > > - nvme_dev_disable(dev, true);
> > > > >   nvme_req(req)->flags |= NVME_REQ_CANCELLED;
> > > > > + nvme_dev_disable(dev, true);
> > > > >   return BLK_EH_DONE;
> 
> > anymore. The driver is not reporting   non-response back for all
> > cancelled requests, and that is probably not what we should be doing.
> 
> OK, thanks for the explanation. I think the bottom line here is to let the
> probe function know and stop proceeding when there's an error.
> I also don't see an obvious reason to set NVME_REQ_CANCELLED
> after nvme_dev_disable(dev, true).

The flag was set after disabling when it didn't happen to matter: the
block layer had a complicated timeout scheme that didn't actually
complete the request until the timeout handler returned, so setting the
flag where it is was 'ok'. That's clearly not the case anymore, so yes, I
think we do need your patch.

There is one case you are missing, though:

---
@@ -1267,10 +1267,10 @@ static enum blk_eh_timer_return nvme_timeout(struct 
request *req, bool reserved)
dev_warn(dev->ctrl.device,
 "I/O %d QID %d timeout, reset controller\n",
 req->tag, nvmeq->qid);
+   nvme_req(req)->flags |= NVME_REQ_CANCELLED;
nvme_dev_disable(dev, false);
nvme_reset_ctrl(&dev->ctrl);
 
-   nvme_req(req)->flags |= NVME_REQ_CANCELLED;
return BLK_EH_DONE;
}
--


Re: [RFC][PATCH 0/9] [v3] Migrate Pages in lieu of discard

2020-08-24 Thread Keith Busch
On Tue, Aug 18, 2020 at 11:41:22AM -0700, Dave Hansen wrote:
> Dave Hansen (5):
>   mm/numa: node demotion data structure and lookup
>   mm/vmscan: Attempt to migrate page in lieu of discard
>   mm/numa: automatically generate node migration order
>   mm/vmscan: never demote for memcg reclaim
>   mm/numa: new reclaim mode to enable reclaim-based migration
> 
> Keith Busch (2):
>   mm/migrate: Defer allocating new page until needed
>   mm/vmscan: Consider anonymous pages without swap
> 
> Yang Shi (1):
>   mm/vmscan: add page demotion counter
> 
>  Documentation/admin-guide/sysctl/vm.rst |9
>  include/linux/migrate.h |6
>  include/linux/node.h|9
>  include/linux/vm_event_item.h   |2
>  include/trace/events/migrate.h  |3
>  mm/debug.c  |1
>  mm/internal.h   |1
>  mm/migrate.c|  400 ++--
>  mm/page_alloc.c |2
>  mm/vmscan.c |   88 ++-
>  mm/vmstat.c |2
>  11 files changed, 439 insertions(+), 84 deletions(-)

The commit summary and diff stat appear to be from an earlier revision
of the series. No big deal, just pointing that out. BTW, thank you for
continuing to enhance this capability! :) I'd missed out on the earlier
discussion from some bad filters that have should now be fixed
(hopefully), so I've some catching up to do.


Re: [PATCH] nvmet-fc: Fix a missed _irqsave version of spin_lock in 'nvmet_fc_fod_op_done()'

2020-08-24 Thread Keith Busch
On Mon, Aug 24, 2020 at 01:00:11PM -0700, Sagi Grimberg wrote:
> > The way 'spin_lock()' and 'spin_lock_irqsave()' are used is not consistent
> > in this function.
> > 
> > Use 'spin_lock_irqsave()' also here, as there is no guarantee that
> > interruptions are disabled at that point, according to surrounding code.
> > 
> > Fixes: a97ec51b37ef ("nvmet_fc: Rework target side abort handling")
> > Signed-off-by: Christophe JAILLET 
> > ---
> > Not tested, only based on what looks logical to me according to
> > surrounding code
> > ---
> >   drivers/nvme/target/fc.c | 4 ++--
> >   1 file changed, 2 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/nvme/target/fc.c b/drivers/nvme/target/fc.c
> > index 55bafd56166a..e6861cc10e7d 100644
> > --- a/drivers/nvme/target/fc.c
> > +++ b/drivers/nvme/target/fc.c
> > @@ -2342,9 +2342,9 @@ nvmet_fc_fod_op_done(struct nvmet_fc_fcp_iod *fod)
> > return;
> > if (fcpreq->fcp_error ||
> > fcpreq->transferred_length != fcpreq->transfer_length) {
> > -   spin_lock(&fod->flock);
> > +   spin_lock_irqsave(&fod->flock, flags);
> > fod->abort = true;
> > -   spin_unlock(&fod->flock);
> > +   spin_unlock_irqrestore(&fod->flock, flags);
> > nvmet_req_complete(>req, NVME_SC_INTERNAL);
> > return;
> > 
> 
> James, can I get a reviewed-by from you on this?

afaics, the lock just serializes single writes, in which case
WRITE/READ_ONCE() can handle that without a lock, right?
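In userspace terms (an analogy, not a patch): WRITE_ONCE()/READ_ONCE() on a simple flag behave much like relaxed C11 atomics, so a spinlock whose only job is to guard a single store can often be dropped.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

static atomic_bool fod_abort;	/* stand-in for fod->abort */

/*
 * Userspace analogy only: a lone flag store needs no lock; a relaxed
 * atomic store (like the kernel's WRITE_ONCE()) is sufficient.
 */
static void set_abort(void)
{
	atomic_store_explicit(&fod_abort, true, memory_order_relaxed);
}

static bool abort_pending(void)
{
	return atomic_load_explicit(&fod_abort, memory_order_relaxed);
}
```

This only holds when nothing else needs to be updated atomically together with the flag, which is the situation described above.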


Re: [PATCH 2/2] nvme: add emulation for zone-append

2020-08-19 Thread Keith Busch
On Wed, Aug 19, 2020 at 05:43:29PM -0600, David Fugate wrote:
> There were queries? My key takeaways were a maintainer NAK followed by
> instructions to make the Intel drive align with the driver by
> implementing NOIOB. While I disagree with the rejection as it appeared
> to be based entirely on politics, I can accept it as the quirk wasn't
> in the spec.

For the record, the suggestion provided, which you agreed to look into,
most broadly enables your hardware on Linux and was entirely to your
benefit. Not quite as dramatic as a political conspiracy.

You later responded with a technical argument against that suggestion;
however, your reason didn't add up, and that's where you left the
thread.
 
> It's not fair to make this same "your drive should align with the
> driver" demand of Samsung because we *are* talking about a spec'ed
> feature here. Technical critques of their patches and real performance
> degrades observed are fair game and objective; "your company did
> the nastiest possible move violating the normal NVMe procedures to make
> it optional" is not.

Sure, but you're cherry picking comments from the discussion. The
performance impact exists, and it's generally not acceptable from a
maintenance point to duplicate significant code without at least trying
to provide a common solution. 


Re: [PATCH 2/2] nvme: add emulation for zone-append

2020-08-19 Thread Keith Busch
On Wed, Aug 19, 2020 at 03:54:20PM -0600, David Fugate wrote:
> On Wed, 2020-08-19 at 13:25 -0600, Jens Axboe wrote:
> > It's not required, the driver will function quite fine without it. If
> > you
> > want to use ZNS it's required. 
> 
> The NVMe spec does not require Zone Append for ZNS; a *vendor-neutral*
> Linux driver should not either. 

The spec was developed over the course of years with your employer's
involvement, and the software enabling efforts occurred in parallel. The
"optional" part was made that way at the final hour, so please align
your expectations accordingly.
 
> Agreed, but this standard needs to be applied equally to everyone.
> E.g., harmless contributions such as 
> https://lore.kernel.org/linux-nvme/20200611054156.gb3...@lst.de/ get
> rejected yet clear spec violations from maintainers are accepted?
> type of behavior encourages forking, vendor-specific drivers, etc.
> which is somewhere I hope none of us want to go.

You're the one who left that thread dangling. You offered to have your
firmware accommodate the Intel sponsored feature that makes your patch
unnecessary in the first place. Your follow up made no sense and you
have not responded to the queries about it.


Re: [PATCH 2/2] nvme: add emulation for zone-append

2020-08-19 Thread Keith Busch
On Wed, Aug 19, 2020 at 01:11:58PM -0600, David Fugate wrote:
> Intel does not support making *optional* NVMe spec features *required*
> by the NVMe driver. 

This is inaccurate. As another example, the spec optionally allows a
zone size to be a power of 2, but linux requires a power of 2 if you
want to use it with this driver.

> Provided there's no glaring technical issues

There are many. Some may be improved through a serious review process,
but the mess it makes out of the fast path is measurably harmful to
devices that don't subscribe to this. That issue is not so easily
remedied.

Since this patch is a copy of the scsi implementation, the reasonable
thing is take this fight to the generic block layer for a common
solution. We're not taking this in the nvme driver.


Re: [PATCH 2/2] nvme: add emulation for zone-append

2020-08-18 Thread Keith Busch
On Tue, Aug 18, 2020 at 07:29:12PM +0200, Javier Gonzalez wrote:
> On 18.08.2020 09:58, Keith Busch wrote:
> > On Tue, Aug 18, 2020 at 11:50:33AM +0200, Javier Gonzalez wrote:
> > > a number of customers are requiring the use of normal writes, which we
> > > want to support.
> > 
> > A device that supports append is completely usable for those customers,
> > too. There's no need to create divergence in this driver.
> 
> Not really. You know as well as I do that some features are disabled for
> a particular SSD model on customer requirements. Generic models
> implementing append can submit both I/Os, but those that remove append
> are left out.

You are only supporting my point: if your device supports append, you
get to work in every ZNS use case, otherwise you only get to work in a
subset.


Re: [PATCH 2/2] nvme: add emulation for zone-append

2020-08-18 Thread Keith Busch
On Tue, Aug 18, 2020 at 11:50:33AM +0200, Javier Gonzalez wrote:
> a number of customers are requiring the use of normal writes, which we
> want to support.

A device that supports append is completely usable for those customers,
too. There's no need to create divergence in this driver.


Re: v5.9-rc1 commit reliably breaks pci nvme detection

2020-08-17 Thread Keith Busch
On Mon, Aug 17, 2020 at 03:50:11PM +0200, Ahmed S. Darwish wrote:
> Hello,
> 
> Below v5.9-rc1 commit reliably breaks my boot on a Thinkpad e480
> laptop. PCI nvme detection fails, and the kernel is no longer able
> to find the rootfs / parse "root=".
> 
> Bisecting v5.8=>v5.9-rc1 blames that commit. Reverting it *reliably*
> fixes the problem and makes me able to boot v5.9-rc1.

The fix is staged in the nvme tree here:

  
http://git.infradead.org/nvme.git/commit/286155561ecd13b6c85a78eaf2880d3baea03b9e


Re: [PATCH] nvme-pci: Use u32 for nvme_dev.q_depth and nvme_queue.q_depth

2020-08-14 Thread Keith Busch
On Fri, Aug 14, 2020 at 11:34:25PM +0800, John Garry wrote:
> Fix by making it a u32.
> 
> Also use u32 for nvme_dev.q_depth, as we assign this value from
> nvme_dev.q_depth, and nvme_dev.q_depth will possibly hold 65536 - this
> avoids the same crash as above.
> 
> Fixes: 61f3b8963097 ("nvme-pci: use unsigned for io queue depth")
> Signed-off-by: John Garry 

Looks good to me.

Reviewed-by: Keith Busch 


Re: [PATCH] nvme-pci: cancel nvme device request before disabling

2020-08-14 Thread Keith Busch
On Fri, Aug 14, 2020 at 11:37:20AM -0400, Tong Zhang wrote:
> On Fri, Aug 14, 2020 at 11:04 AM Keith Busch  wrote:
> >
> > On Fri, Aug 14, 2020 at 03:14:31AM -0400, Tong Zhang wrote:
> > > diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> > > index ba725ae47305..c4f1ce0ee1e3 100644
> > > --- a/drivers/nvme/host/pci.c
> > > +++ b/drivers/nvme/host/pci.c
> > > @@ -1249,8 +1249,8 @@ static enum blk_eh_timer_return nvme_timeout(struct 
> > > request *req, bool reserved)
> > >   dev_warn_ratelimited(dev->ctrl.device,
> > >"I/O %d QID %d timeout, disable controller\n",
> > >req->tag, nvmeq->qid);
> > > - nvme_dev_disable(dev, true);
> > >   nvme_req(req)->flags |= NVME_REQ_CANCELLED;
> > > + nvme_dev_disable(dev, true);
> > >   return BLK_EH_DONE;
> >
> > Shouldn't this flag have been set in nvme_cancel_request()?
> 
> nvme_cancel_request() is not setting this flag to cancelled and this is 
> causing

Right, I see that it doesn't, but I'm saying that it should. We used to
do something like that, and I'm struggling to recall why we're not
anymore. The driver is not reporting a non-response back for all
cancelled requests, and that is probably not what we should be doing.


Re: [PATCH] nvme-pci: cancel nvme device request before disabling

2020-08-14 Thread Keith Busch
On Fri, Aug 14, 2020 at 03:14:31AM -0400, Tong Zhang wrote:
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index ba725ae47305..c4f1ce0ee1e3 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -1249,8 +1249,8 @@ static enum blk_eh_timer_return nvme_timeout(struct 
> request *req, bool reserved)
>   dev_warn_ratelimited(dev->ctrl.device,
>"I/O %d QID %d timeout, disable controller\n",
>req->tag, nvmeq->qid);
> - nvme_dev_disable(dev, true);
>   nvme_req(req)->flags |= NVME_REQ_CANCELLED;
> + nvme_dev_disable(dev, true);
>   return BLK_EH_DONE;

Shouldn't this flag have been set in nvme_cancel_request()?  It's not
like the timeout out command is the only command to have been cancelled
by this action, nor is it guaranteed that getting here will mean the
request was in fact cancelled. The controller could still provide a real
completion.


Re: [PATCH v2] nvme: Use spin_lock_irq() when taking the ctrl->lock

2020-08-12 Thread Keith Busch
There's an unrelated whitespace change in nvme_init_identify().
Otherwise, looks fine.

Reviewed-by: Keith Busch 


Re: [RESEND PATCH] nvme: Use spin_lock_irqsave() when taking the ctrl->lock

2020-08-12 Thread Keith Busch
On Wed, Aug 12, 2020 at 03:01:19PM -0600, Logan Gunthorpe wrote:
> @@ -2971,15 +2971,16 @@ int nvme_get_log(struct nvme_ctrl *ctrl, u32 nsid, u8 
> log_page, u8 lsp, u8 csi,
>  static struct nvme_cel *nvme_find_cel(struct nvme_ctrl *ctrl, u8 csi)
>  {
>   struct nvme_cel *cel, *ret = NULL;
> + unsigned long flags;
> 
> - spin_lock(&ctrl->lock);
> + spin_lock_irqsave(&ctrl->lock, flags);
>   list_for_each_entry(cel, &ctrl->cels, entry) {
>   if (cel->csi == csi) {
>   ret = cel;
>   break;
>   }
>   }
> - spin_unlock(&ctrl->lock);
> + spin_unlock_irqrestore(&ctrl->lock, flags);
> 
>   return ret;
>  }
> @@ -2988,6 +2989,7 @@ static int nvme_get_effects_log(struct nvme_ctrl *ctrl, 
> u8 csi,
>   struct nvme_effects_log **log)
>  {
>   struct nvme_cel *cel = nvme_find_cel(ctrl, csi);
> + unsigned long flags;
>   int ret;
> 
>   if (cel)
> @@ -3006,9 +3008,9 @@ static int nvme_get_effects_log(struct nvme_ctrl *ctrl, 
> u8 csi,
> 
>   cel->csi = csi;
> 
> - spin_lock(&ctrl->lock);
> + spin_lock_irqsave(&ctrl->lock, flags);
>   list_add_tail(&cel->entry, &ctrl->cels);
> - spin_unlock(&ctrl->lock);
> + spin_unlock_irqrestore(&ctrl->lock, flags);
>  out:
>   *log = &cel->log;
>   return 0;
> 

Neither of these is ever called from an interrupt-disabled context,
correct? If so, you can just use spin_lock_irq() without saving the
current irq state.


Re: Regression in nvme driver

2020-07-29 Thread Keith Busch
On Wed, Jul 29, 2020 at 07:29:08PM +, Lach wrote:
> Hello
> 
> I caught a regression in the nvme driver, which shows itself on some
> controllers (In my case, at 126h:2263)

Fix is staged for the next 5.8 pull;

  
https://git.kernel.dk/cgit/linux-block/commit/?h=block-5.8=5bedd3afee8eb01ccd256f0cd2cc0fa6f841417a



Re: [PATCH v16 0/9] nvmet: add target passthru commands support

2020-07-24 Thread Keith Busch
On Fri, Jul 24, 2020 at 11:25:11AM -0600, Logan Gunthorpe wrote:
> This is v16 of the passthru patchset which make a bunch of cleanup as
> suggested by Christoph.

Thanks, looks great. Just the comment on 6/9, which probably isn't super
important anyway.

Reviewed-by: Keith Busch 


Re: [PATCH v16 6/9] nvmet-passthru: Add passthru code to process commands

2020-07-24 Thread Keith Busch
On Fri, Jul 24, 2020 at 11:25:17AM -0600, Logan Gunthorpe wrote:
> + /*
> +  * The passthru NVMe driver may have a limit on the number of segments
> +  * which depends on the host's memory fragementation. To solve this,
> +  * ensure mdts is limitted to the pages equal to the number of

  limited

> + /* don't support fuse commands */
> + id->fuses = 0;

If a host were to set a fuse, the target should return an Invalid Field
error. Just to future-proof, rejecting commands with any flags set
(other than SGL, which you handled in patch 1/9) is probably what should
happen, like:

> +u16 nvmet_parse_passthru_io_cmd(struct nvmet_req *req)
> +{

	if (req->cmd->common.flags & ~NVME_CMD_SGL_ALL)
		return NVME_SC_INVALID_FIELD;

Or maybe we could obviate the need for 1/9 with something like:

	req->cmd->common.flags &= ~NVME_CMD_SGL_ALL;
	if (req->cmd->common.flags)
		return NVME_SC_INVALID_FIELD;


Re: [PATCH v15 7/9] nvmet-passthru: Add passthru code to process commands

2020-07-20 Thread Keith Busch
On Mon, Jul 20, 2020 at 04:28:26PM -0700, Sagi Grimberg wrote:
> On 7/20/20 4:17 PM, Keith Busch wrote:
> > On Mon, Jul 20, 2020 at 05:01:19PM -0600, Logan Gunthorpe wrote:
> > > On 2020-07-20 4:35 p.m., Sagi Grimberg wrote:
> > > 
> > > > passthru commands are in essence REQ_OP_DRV_IN/REQ_OP_DRV_OUT, which
> > > > means that the driver shouldn't need the ns at all. So if you have a
> > > > dedicated request queue (mapped to the I/O tagset), you don't need the
> > > > ns->queue and we can lose the ns lookup altogether.
> > 
> > We still need a request_queue to dispatch the command. I guess you could
> > make a generic one for the controller that isn't tied to a namespace,
> > but we lose the fair shared tag allocation.
> 
> What do you mean fair shared tag allocation?

See hctx_may_queue().


Re: [PATCH v15 7/9] nvmet-passthru: Add passthru code to process commands

2020-07-20 Thread Keith Busch
On Mon, Jul 20, 2020 at 05:01:19PM -0600, Logan Gunthorpe wrote:
> On 2020-07-20 4:35 p.m., Sagi Grimberg wrote:
>
> > passthru commands are in essence REQ_OP_DRV_IN/REQ_OP_DRV_OUT, which
> > means that the driver shouldn't need the ns at all. So if you have a
> > dedicated request queue (mapped to the I/O tagset), you don't need the
> > ns->queue and we can lose the ns lookup altogether.

We still need a request_queue to dispatch the command. I guess you could
make a generic one for the controller that isn't tied to a namespace,
but we lose the fair shared tag allocation.
 
> Thanks, that helps clarify things a bit, but which xarray were you
> talking about

That was something that was replacing the list lookup:

  http://lists.infradead.org/pipermail/linux-nvme/2020-July/018242.html


Re: [PATCH 5/5] nvme-pci: Use standard block status macro

2020-07-07 Thread Keith Busch
On Fri, Jul 03, 2020 at 10:49:24AM +0800, Baolin Wang wrote:
>  static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
> @@ -844,7 +844,7 @@ static blk_status_t nvme_map_metadata(struct nvme_dev 
> *dev, struct request *req,
>   if (dma_mapping_error(dev->dev, iod->meta_dma))
>   return BLK_STS_IOERR;
>   cmnd->rw.metadata = cpu_to_le64(iod->meta_dma);
> - return 0;
> + return BLK_STS_OK;
>  }

This is fine, though it takes knowing that this value is 0 for the
subsequent 'if (!ret)' check to make sense. Maybe those should change to
'if (ret != BLK_STS_OK)' so the check uses the same symbol as the
return, and will always work in the unlikely event that the defines
are reordered.


Re: [PATCH] nvme: validate cntlid's only for nvme >= 1.1.0

2020-06-30 Thread Keith Busch
On Tue, Jun 30, 2020 at 04:01:45PM +0200, Maximilian Heyne wrote:
> On 6/30/20 3:36 PM, Christoph Hellwig wrote:
> > And actually - 1.0 did not have the concept of a subsystem.  So having
> > a duplicate serial number for a 1.0 controller actually is a pretty
> > nasty bug.  Can you point me to this broken controller?  Do you think
> > the OEM could fix it up to report a proper version number and controller
> > ID?
> > 
> 
> I meant that the VF NVMe controllers will all land in the same subsystem from
> the kernel's point of view, because, as you said, there was no idea of 
> different
> subsystems in the 1.0 spec.

Each controller should have landed in its own subsystem in this case
rather than the same subsystem.

> It's an older in-house controller. Seems to set the same serial number for all
> VF's. Should the firmware set unique serials for the VF's instead?

Yes, the driver shouldn't be finding duplicate serial numbers.


Re: [RFC PATCH] nvme-pci: Move the sg table allocation/free into init/exit_request

2020-06-28 Thread Keith Busch
On Sun, Jun 28, 2020 at 06:34:46PM +0800, Baolin Wang wrote:
> Move the sg table allocation and free into the init_request() and
> exit_request(), instead of allocating sg table when queuing requests,
> which can benefit the IO performance.

If you want to pre-allocate something per-request, you can add the size
to the tagset's cmd_size.

But this is adding almost 4k per request. Considering how many requests
we try to allocate, that's a bit too large to count on being available
or sequestered for this driver.


Re: [PATCH 2/3] nvme-pci: Add controller memory buffer supported macro

2020-06-23 Thread Keith Busch
On Tue, Jun 23, 2020 at 06:27:51PM +0200, Christoph Hellwig wrote:
> On Tue, Jun 23, 2020 at 09:24:33PM +0800, Baolin Wang wrote:
> > Introduce a new capability macro to indicate if the controller
> > supports the memory buffer or not, instead of reading the
> > NVME_REG_CMBSZ register.
> 
> This is a complex issue.  The CMBS bit was only added in NVMe 1.4 as
> a backwards incompatible change, as the CMB addressing scheme can lead
> to data corruption.  The CMBS was added as part of the horribe hack
> that also involves the CBA field, which we'll need to see before
> using it to work around the addressing issue.  At the same time we
> should also continue supporting the legacy pre-1.4 CMB with a warning
> (and may reject it if we know we run in a VM).

Well, a CMB from an emulated controller (like qemu's) can be used within
a VM. It's only if you direct assign a PCI function that CMB usage
breaks.


Re: [PATCH 1/3] nvme: Add Arbitration Burst support

2020-06-23 Thread Keith Busch
On Tue, Jun 23, 2020 at 09:24:32PM +0800, Baolin Wang wrote:
> +void nvme_set_arbitration_burst(struct nvme_ctrl *ctrl)
> +{
> + u32 result;
> + int status;
> +
> + if (!ctrl->rab)
> + return;
> +
> + /*
> +  * The Arbitration Burst setting indicates the maximum number of
> +  * commands that the controller may launch at one time from a
> +  * particular Submission Queue. It is recommended that host software
> +  * configure the Arbitration Burst setting as close to the recommended
> +  * value by the controller as possible.
> +  */
> + status = nvme_set_features(ctrl, NVME_FEAT_ARBITRATION, ctrl->rab,

Since 'rab' is an 8-bit field, but the feature's AB value is only 3
bits, we should add a validity check.

> +}
> +EXPORT_SYMBOL_GPL(nvme_set_arbitration_burst);

I don't see any particular reason to export this function just for the
pci transport to use. Just make it a local static function and call it
from nvme_init_identify().


Re: [PATCH 1/3] nvme: Add Arbitration Burst support

2020-06-23 Thread Keith Busch
On Wed, Jun 24, 2020 at 09:34:08AM +0800, Baolin Wang wrote:
> OK, I understood your concern. Now we will select the RR arbitration as
> default
> in nvme_enable_ctrl(), but for some cases, we will not set the arbitration 
> burst
> values from userspace, and we still want to use the defaut arbitration burst 
> that
> recommended by the controller, taking into consideration any latency 
> requirements.
> 
> So I'd like to add a parameter to decide if we can use the default arbitration
> burst like below, what do you think? Thanks.

I wouldn't call this the 'default' arbitration since it usually is not
the default. This is the 'recommended' arbitration value.
 
> static bool use_default_arb;
> module_param(use_default_arb, bool, 0444);
> MODULE_PARM_DESC(use_default_arb, "use default arbitration burst that
> recommended by the controller");

"use controller's recommended arbitration burst"


Re: [PATCH 1/3] nvme: Add Arbitration Burst support

2020-06-23 Thread Keith Busch
On Tue, Jun 23, 2020 at 10:39:01AM -0700, Sagi Grimberg wrote:
> 
> > >  From the NVMe spec, "In order to make efficient use of the non-volatile
> > > memory, it is often advantageous to execute multiple commands from a
> > > Submission Queue in parallel. For Submission Queues that are using
> > > weighted round robin with urgent priority class or round robin
> > > arbitration, host software may configure an Arbitration Burst setting".
> > > Thus add Arbitration Burst setting support.
> > 
> > But if the user changed it to something else that better matches how
> > they want to use queues, the driver is just going to undo that setting
> > on the next reset.
> 
> Where do we do priority class arbitration? no one sets it

Not the priority class, we don't use WRR in this driver. The RR
arbitration can be set from user space and saved across controller
resets, like:

  # nvme set-feature -f 1 -v 3 --save

This patch would undo the saved feature value.


Re: [PATCH 1/3] nvme: Add Arbitration Burst support

2020-06-23 Thread Keith Busch
On Tue, Jun 23, 2020 at 09:24:32PM +0800, Baolin Wang wrote:
> From the NVMe spec, "In order to make efficient use of the non-volatile
> memory, it is often advantageous to execute multiple commands from a
> Submission Queue in parallel. For Submission Queues that are using
> weighted round robin with urgent priority class or round robin
> arbitration, host software may configure an Arbitration Burst setting".
> Thus add Arbitration Burst setting support.

But if the user changed it to something else that better matches how
they want to use queues, the driver is just going to undo that setting
on the next reset.


Re: [PATCH 1/1] nvme-pci: avoid race between nvme_reap_pending_cqes() and nvme_poll()

2020-05-26 Thread Keith Busch
Fixes: fa46c6fb5d61 ("nvme/pci: move cqe check after device shutdown")


Re: [PATCH 1/1] nvme-pci: avoid race between nvme_reap_pending_cqes() and nvme_poll()

2020-05-26 Thread Keith Busch
Looks good to me.

Reviewed-by: Keith Busch 


Re: [PATCH 0/2] Add support for StorageD3Enable _DSD property

2020-04-29 Thread Keith Busch
On Wed, Apr 29, 2020 at 05:20:09AM +, Williams, Dan J wrote:
> On Tue, 2020-04-28 at 08:27 -0700, David E. Box wrote:
> > On Tue, 2020-04-28 at 16:22 +0200, Christoph Hellwig wrote:
> > > On Tue, Apr 28, 2020 at 07:09:59AM -0700, David E. Box wrote:
> > > > > I'm not sure who came up with the idea to put this into ACPI,
> > > > > but
> > > > > it
> > > > > belongs into NVMe.  Please talk to the NVMe technical working
> > > > > group
> > > > > instead of trying to overrules them in an unrelated group that
> > > > > doesn't
> > > > > apply to all of PCIe.
> > > > 
> > > > Agreed that this is not ideal since it does not apply to all of
> > > > PCIe.
> > > > But as the property already exists on shipping systems, we need
> > > > to
> > > > be
> > > > able to read it in the NVMe driver and the patch is consitent
> > > > with
> > > > the
> > > > way properties under PCI ports are read.
> > > 
> > > The point is that it is not the BIOSes job do decide how Linux does
> > > power management.  For example D3 has really horrible entry and
> > > exit
> > > latencies in many cases, and will lead to higher power usage.
> > 
> > The platform can know which pm policies will save the most power. But
> > since the solution doesn't apply to all PCIe devices (despite BIOS
> > specifying it that way) I'll withdraw this patch. Thanks.
> 
> Wait, why withdraw? In this case the platform is unfortunately
> preventing the standard driver from making a proper determination. So
> while I agree that it's not the BIOSes job, when the platform actively
> prevents proper operation due to some ill conceived non-standard
> platform property what is Linux left to do on these systems?
> 
> The *patch* is not trying to overrule NVME, and the best I can say is
> that the Intel Linux team was not in the loop when this was being
> decided between the platform BIOS implementation and whoever thought
> they could just publish random ACPI properties that impacted NVME
> operation [1].
> 
> So now David is trying to get these platforms unbroken because they are
> already shipping with this b0rkage.

Rather than quirking all these cases, which I get the feeling there
are many more than we've currently got in our quirk list, perhaps it'd
be simpler to default to the simple suspend. AFAIK, the simple suspend
works for all platforms, though it may not realize the best power savings
and/or exit latency.


Re: [PATCH v2] nvme-pci: Shutdown when removing dead controller

2019-10-07 Thread Keith Busch
On Mon, Oct 07, 2019 at 01:50:11PM -0400, Tyler Ramer wrote:
> Shutdown the controller when nvme_remove_dead_controller is
> reached.
> 
> If nvme_remove_dead_controller() is called, the controller won't
> be coming back online, so we should shut it down rather than just
> disabling.
> 
> Remove nvme_kill_queues() as nvme_dev_remove() will take care of
> unquiescing queues.


We do still need to kill the queues, though. The shutdown == true just flushes
all pending requests. Killing queues does that too, but it also sets the
request_queue to dying, which will terminate syncing any dirty pages.
 
> ---
> 
> Changes since v1:
> * Clean up commit message
> * Remove nvme_kill_queues()
> ---
>  drivers/nvme/host/pci.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index c0808f9eb8ab..68d5fb880d80 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -2509,8 +2509,7 @@ static void nvme_pci_free_ctrl(struct nvme_ctrl *ctrl)
>  static void nvme_remove_dead_ctrl(struct nvme_dev *dev)
>  {
>   nvme_get_ctrl(&dev->ctrl);
> - nvme_dev_disable(dev, false);
> - nvme_kill_queues(&dev->ctrl);
> + nvme_dev_disable(dev, true);
>   if (!queue_work(nvme_wq, &dev->remove_work))
>   nvme_put_ctrl(&dev->ctrl);
>  }
> -- 
> 2.23.0
> 


Re: [PATCH] nvme-pci: Shutdown when removing dead controller

2019-10-07 Thread Keith Busch
On Mon, Oct 07, 2019 at 11:13:12AM -0400, Tyler Ramer wrote:
> > Setting the shutdown to true is
> > usually just to get the queues flushed, but the nvme_kill_queues() that
> > we call accomplishes the same thing.
> 
> The intention of this patch was to clean up another location where
> nvme_dev_disable()
> is called with shutdown == false, but the device is being removed due
> to a failure
> condition, so it should be shutdown.
> 
> Perhaps though, given nvme_kill_queues() provides a subset of the
> functionality of
> nvme_dev_disable() with shutdown == true, we can just use
> nvme_dev_disable() and
> remove nvme_kill_queues()?
> 
> This will make nvme_remove_dead_ctrl() more in line with nvme_remove(),
> nvme_shutdown(), etc.

It's fine to use the shutdown == true in this path as well, but I just wanted
to understand what problem it was fixing. It doesn't sound like your scenario
is going to end up setting CC.SHN, so the only thing the shutdown should
accomplish is flushing pending IO, but we already call kill_queues() right
after the nvme_dev_disable(), so that part should be okay.


Re: [PATCH] nvme-pci: Shutdown when removing dead controller

2019-10-06 Thread Keith Busch
On Fri, Oct 04, 2019 at 11:36:42AM -0400, Tyler Ramer wrote:
> Here's a failure we had which represents the issue the patch is
> intended to solve:
> 
> Aug 26 15:00:56 testhost kernel: nvme nvme4: async event result 00010300
> Aug 26 15:01:27 testhost kernel: nvme nvme4: controller is down; will
> reset: CSTS=0x3, PCI_STATUS=0x10
> Aug 26 15:02:10 testhost kernel: nvme nvme4: Device not ready; aborting reset
> Aug 26 15:02:10 testhost kernel: nvme nvme4: Removing after probe
> failure status: -19
> 
> The CSTS warnings comes from nvme_timeout, and is printed by
> nvme_warn_reset. A reset then occurs
> Controller state should be NVME_CTRL_RESETTING
> 
> Now, in nvme_reset_work, controller is never marked "CONNECTING"  at:
> 
>  if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_CONNECTING))
> 
> because several lines above, we can determine that
> nvme_pci_configure_admin_queues returns
> a bad result, which triggers a goto out_unlock and prints "removing
> after probe failure status: -19"
> 
> Because state is never changed to NVME_CTRL_CONNECTING or
> NVME_CTRL_DELETING, the
> logic added in 
> https://github.com/torvalds/linux/commit/2036f7263d70e67d70a67899a468588cb7356bc9
> should not apply. 

Nor should it, because there are no IO in flight at this point, there
can't be any timeout work to check the state.

> We can further validate that dev->ctrl.state ==
> NVME_CTRL_RESETTING thanks to
> the WARN_ON in nvme_reset_work.

I'm not sure I see what this is fixing. Setting the shutdown to true is
usually just to get the queues flushed, but the nvme_kill_queues() that
we call accomplishes the same thing.
 
> On Thu, Oct 3, 2019 at 3:13 PM Tyler Ramer  wrote:
> >
> > Always shutdown the controller when nvme_remove_dead_controller is
> > reached.
> >
> > It's possible for nvme_remove_dead_controller to be called as part of a
> > failed reset, when there is a bad NVME_CSTS. The controller won't
> > be coming back online, so we should shut it down rather than just
> > disabling.
> >
> > Signed-off-by: Tyler Ramer 
> > ---
> >  drivers/nvme/host/pci.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> > index c0808f9eb8ab..c3f5ba22c625 100644
> > --- a/drivers/nvme/host/pci.c
> > +++ b/drivers/nvme/host/pci.c
> > @@ -2509,7 +2509,7 @@ static void nvme_pci_free_ctrl(struct nvme_ctrl *ctrl)
> >  static void nvme_remove_dead_ctrl(struct nvme_dev *dev)
> >  {
> > nvme_get_ctrl(&dev->ctrl);
> > -   nvme_dev_disable(dev, false);
> > +   nvme_dev_disable(dev, true);
> > nvme_kill_queues(&dev->ctrl);
> > if (!queue_work(nvme_wq, &dev->remove_work))
> > nvme_put_ctrl(&dev->ctrl);
> > --
> > 2.23.0
> >


Re: [PATCH v3] nvme: allow 64-bit results in passthru commands

2019-09-24 Thread Keith Busch
On Tue, Sep 24, 2019 at 11:05:36AM -0700, Sagi Grimberg wrote:
> Looks fine to me,
> 
> Reviewed-by: Sagi Grimberg 
> 
> Keith, Christoph?

Looks good to me, too.

Reviewed-by: Keith Busch 


Re: [PATCH] nvme: fix an error code in nvme_init_subsystem()

2019-09-23 Thread Keith Busch
On Mon, Sep 23, 2019 at 05:18:36PM +0300, Dan Carpenter wrote:
> "ret" should be a negative error code here, but it's either success or
> possibly uninitialized.
> 
> Fixes: 32fd90c40768 ("nvme: change locking for the per-subsystem controller 
> list")
> Signed-off-by: Dan Carpenter 

Thanks, patch looks good.

Reviewed-by: Keith Busch 


Re: [PATCH v2] nvme-pci: Save PCI state before putting drive into deepest state

2019-09-20 Thread Keith Busch
On Wed, Sep 18, 2019 at 01:15:55PM -0500, Mario Limonciello wrote:
> The action of saving the PCI state will cause numerous PCI configuration
> space reads which depending upon the vendor implementation may cause
> the drive to exit the deepest NVMe state.
> 
> In these cases ASPM will typically resolve the PCIe link state and APST
> may resolve the NVMe power state.  However it has also been observed
> that this register access after quiesced will cause PC10 failure
> on some device combinations.
> 
> To resolve this, move the PCI state saving to before SetFeatures has been
> called.  This has been proven to resolve the issue across a 5000 sample
> test on previously failing disk/system combinations.
> 
> Signed-off-by: Mario Limonciello 

This looks good. It clashes with something I posted yesterday, but
I'll rebase after this one.

Reviewed-by: Keith Busch 


Re: NVMe Poll CQ on timeout

2019-09-19 Thread Keith Busch
On Thu, Sep 19, 2019 at 01:47:50PM +, Bharat Kumar Gogada wrote:
> Hi All,
> 
> We are testing NVMe cards on ARM64 platform, the card uses MSI-X interrupts.
> We are hitting following case in drivers/nvme/host/pci.c
> /*
>  * Did we miss an interrupt?
>  */
> if (__nvme_poll(nvmeq, req->tag)) {
> dev_warn(dev->ctrl.device,
>  "I/O %d QID %d timeout, completion polled\n",
>  req->tag, nvmeq->qid);
> return BLK_EH_DONE;
> }
> 
> Can anyone tell when nvme_timeout gets invoked?

Timeout is invoked when the driver didn't see a completion to a
submitted command.

> In what cases we see this interrupt miss ?

That usually happens for one of two reasons:

 1. The device didn't send any MSIx message for a CQE

 2. The device sent the MSIx message before posting the CQE

I've also seen h/w errata where the MSIx and CQE are re-ordered, which
can also lead to this.

A hardware trace would provide the most detailed view of what's
happening. You might be able to infer if you carefully account for
commands sent, interrupts received, and spurious interrupts detected.

> We are seeing this issue only for reads with following fio command 
> fio --name=randwrite --ioengine=libaio --iodepth=1 --rw=randread --bs=128k 
> --direct=0 \
> --size=128M --numjobs=3 --group_reporting --filename=/dev/nvme0n1
> 
> We are not seeing issue with --rw=randwrite for same size.
> 
> Please let us know what can cause this issue. 


Re: [PATCH] nvme-pci: Save PCI state before putting drive into deepest state

2019-09-18 Thread Keith Busch
On Wed, Sep 11, 2019 at 06:42:33PM -0500, Mario Limonciello wrote:
> ---
>  drivers/nvme/host/pci.c | 13 +++--
>  1 file changed, 7 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 732d5b6..9b3fed4 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -2894,6 +2894,13 @@ static int nvme_suspend(struct device *dev)
>   if (ret < 0)
>   goto unfreeze;
>  
> + /*
> +  * A saved state prevents pci pm from generically controlling the
> +  * device's power. If we're using protocol specific settings, we don't
> +  * want pci interfering.
> +  */
> + pci_save_state(pdev);
> +
>   ret = nvme_set_power_state(ctrl, ctrl->npss);
>   if (ret < 0)
>   goto unfreeze;
> @@ -2908,12 +2915,6 @@ static int nvme_suspend(struct device *dev)
>   ret = 0;
>   goto unfreeze;

We would need to clear the saved state here, though. You can also
in fact remove the unfreeze label and goto.

>   }
> - /*
> -  * A saved state prevents pci pm from generically controlling the
> -  * device's power. If we're using protocol specific settings, we don't
> -  * want pci interfering.
> -  */
> - pci_save_state(pdev);
>  unfreeze:
>   nvme_unfreeze(ctrl);
>   return ret;
> -- 


Re: [PATCH 0/2] nvme: Add kernel API for admin command

2019-09-18 Thread Keith Busch
On Wed, Sep 18, 2019 at 03:26:11PM +0200, Christoph Hellwig wrote:
> Even if we had a use case for that the bounce buffer is just too ugly
> to live.  And I'm really sick and tired of Intel wasting our time for
> their out of tree monster given that they haven't even tried helping
> to improve the in-kernel write caching layers.

Right, I don't have any stake in that out-of-tree caching either. I'm
more interested in getting Linux to generically support as many NVMe
spec features as possible, and extended formats have been in-spec
since the beginning of NVMe.

And yes, that bouncing is really nasty, but it's really only needed for
PRP, so maybe we just skip that transfer mode and support extended
metadata only when the controller supports SGLs. We just need a
special SGL setup routine to weave the data and metadata.


Re: [PATCH] nvme-pci: Save PCI state before putting drive into deepest state

2019-09-17 Thread Keith Busch
On Wed, Sep 11, 2019 at 06:42:33PM -0500, Mario Limonciello wrote:
> The action of saving the PCI state will cause numerous PCI configuration
> space reads which depending upon the vendor implementation may cause
> the drive to exit the deepest NVMe state.
> 
> In these cases ASPM will typically resolve the PCIe link state and APST
> may resolve the NVMe power state.  However it has also been observed
> that this register access after quiesced will cause PC10 failure
> on some device combinations.
> 
> To resolve this, move the PCI state saving to before SetFeatures has been
> called.  This has been proven to resolve the issue across a 5000 sample
> test on previously failing disk/system combinations.
>
> Signed-off-by: Mario Limonciello 
> ---
>  drivers/nvme/host/pci.c | 13 +++--
>  1 file changed, 7 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 732d5b6..9b3fed4 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -2894,6 +2894,13 @@ static int nvme_suspend(struct device *dev)
>   if (ret < 0)
>   goto unfreeze;
>  
> + /*
> +  * A saved state prevents pci pm from generically controlling the
> +  * device's power. If we're using protocol specific settings, we don't
> +  * want pci interfering.
> +  */
> + pci_save_state(pdev);
> +
>   ret = nvme_set_power_state(ctrl, ctrl->npss);
>   if (ret < 0)
>   goto unfreeze;
> @@ -2908,12 +2915,6 @@ static int nvme_suspend(struct device *dev)
>   ret = 0;
>   goto unfreeze;
>   }
> - /*
> -  * A saved state prevents pci pm from generically controlling the
> -  * device's power. If we're using protocol specific settings, we don't
> -  * want pci interfering.
> -  */
> - pci_save_state(pdev);
>  unfreeze:
>   nvme_unfreeze(ctrl);
>   return ret;

In the event that something else fails after the point you've saved
the state, we need to fall back to the behavior for when the driver
doesn't save the state, right?


Re: [PATCH 0/2] nvme: Add kernel API for admin command

2019-09-17 Thread Keith Busch
On Mon, Sep 16, 2019 at 12:13:24PM +, Baldyga, Robert wrote:
> Ok, fair enough. We want to keep things hidden behind certain layers,
> and that's definitely a good thing. But there is a problem with these
> layers - they do not expose all the features. For example AFAIK there
> is no clear way to use 512+8 format via block layer (nor even a way 
> to get know that bdev is formatted to particular lbaf). It's impossible
> to use it without layering violations, which nobody sees as a perfect
> solution, but it is the only one that works.

I think your quickest path to supporting such a format is to consider
each part separately, then bounce and interleave/unmix the data +
metadata at another layer that understands how the data needs to be laid
out in host memory. Something like this RFC here:

  http://lists.infradead.org/pipermail/linux-nvme/2018-February/015844.html

It appears connecting to infradead lists is a bit flaky at the moment,
so not sure if you'll be able to read the above link right now.

Anyway, the RFC would have needed a bit of work to be considered, like
using a mempool for the purpose, but it does generically make such
formats usable through the block stack, so maybe try fleshing out that
idea.


Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism

2019-09-06 Thread Keith Busch
On Sat, Sep 07, 2019 at 06:19:21AM +0800, Ming Lei wrote:
> On Fri, Sep 06, 2019 at 05:50:49PM +, Long Li wrote:
> > >Subject: Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
> > >
> > >Why are all 8 nvmes sharing the same CPU for interrupt handling?
> > >Shouldn't matrix_find_best_cpu_managed() handle selecting the least used
> > >CPU from the cpumask for the effective interrupt handling?
> > 
> > The tests run on 10 NVMe disks on a system of 80 CPUs. Each NVMe disk has 
> > 32 hardware queues.
> 
> Then there are total 320 NVMe MSI/X vectors, and 80 CPUs, so irq matrix
> can't avoid effective CPUs overlapping at all.

Sure, but it's at most half; meanwhile the CPU that's dispatching
requests would naturally be throttled by the other half, whose
completions are interrupting that CPU, no?


Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism

2019-09-06 Thread Keith Busch
On Fri, Sep 06, 2019 at 11:30:57AM -0700, Sagi Grimberg wrote:
> 
> > 
> > Ok, so the real problem is per-cpu bounded tasks.
> > 
> > I share Thomas opinion about a NAPI like approach.
> 
> We already have that, its irq_poll, but it seems that for this
> use-case, we get lower performance for some reason. I'm not
> entirely sure why that is, maybe its because we need to mask interrupts
> because we don't have an "arm" register in nvme like network devices
> have?

For MSI, that's the INTMS/INTMC NVMe registers. MSI-x, though, has to
disarm it in its table entry, and the Linux implementation will do a
posted read in that path, which is a bit too expensive.


Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism

2019-09-06 Thread Keith Busch
On Fri, Sep 06, 2019 at 09:48:21AM +0800, Ming Lei wrote:
> When one IRQ flood happens on one CPU:
> 
> 1) softirq handling on this CPU can't make progress
> 
> 2) kernel thread bound to this CPU can't make progress
> 
> For example, network may require softirq to xmit packets, or another irq
> thread for handling keyboards/mice or whatever, or rcu_sched may depend
> on that CPU for making progress, then the irq flood stalls the whole
> system.
> 
> > 
> > AFAIU, there are fast medium where the responses to requests are faster
> > than the time to process them, right?
> 
> Usually medium may not be faster than CPU, now we are talking about
> interrupts, which can be originated from lots of devices concurrently,
> for example, in Long Li'test, there are 8 NVMe drives involved.

Why are all 8 nvmes sharing the same CPU for interrupt handling?
Shouldn't matrix_find_best_cpu_managed() handle selecting the least used
CPU from the cpumask for the effective interrupt handling?


[PATCHv2] nvme: Assign subsy instance from first ctrl

2019-09-05 Thread Keith Busch
The namespace disk names must be unique for the lifetime of the
subsystem. This was accomplished by using their parent subsystems'
instances which were allocated independently from the controllers
connected to that subsystem. This allowed name prefixes assigned to
namespaces to match a controller from an unrelated subsystem, and has
created confusion among users examining device nodes.

Ensure a namespace's subsystem instance never clashes with a controller
instance of another subsystem by transferring the instance ownership
to the parent subsystem from the first controller discovered in that
subsystem.

Reviewed-by: Logan Gunthorpe 
Signed-off-by: Keith Busch 
---
v1 -> v2:

  Changelog: reduce sensationalism, fix spelling 

 drivers/nvme/host/core.c | 21 ++---
 1 file changed, 10 insertions(+), 11 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 14c0bfb55615..8a8279ece5ee 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -81,7 +81,6 @@ EXPORT_SYMBOL_GPL(nvme_reset_wq);
 struct workqueue_struct *nvme_delete_wq;
 EXPORT_SYMBOL_GPL(nvme_delete_wq);
 
-static DEFINE_IDA(nvme_subsystems_ida);
 static LIST_HEAD(nvme_subsystems);
 static DEFINE_MUTEX(nvme_subsystems_lock);
 
@@ -2344,7 +2343,8 @@ static void nvme_release_subsystem(struct device *dev)
struct nvme_subsystem *subsys =
container_of(dev, struct nvme_subsystem, dev);
 
-   ida_simple_remove(&nvme_subsystems_ida, subsys->instance);
+   if (subsys->instance >= 0)
+   ida_simple_remove(&nvme_instance_ida, subsys->instance);
kfree(subsys);
 }
 
@@ -2473,12 +2473,8 @@ static int nvme_init_subsystem(struct nvme_ctrl *ctrl, 
struct nvme_id_ctrl *id)
subsys = kzalloc(sizeof(*subsys), GFP_KERNEL);
if (!subsys)
return -ENOMEM;
-   ret = ida_simple_get(&nvme_subsystems_ida, 0, 0, GFP_KERNEL);
-   if (ret < 0) {
-   kfree(subsys);
-   return ret;
-   }
-   subsys->instance = ret;
+
+   subsys->instance = -1;
mutex_init(&subsys->lock);
kref_init(&subsys->ref);
INIT_LIST_HEAD(&subsys->ctrls);
@@ -2497,7 +2493,7 @@ static int nvme_init_subsystem(struct nvme_ctrl *ctrl, 
struct nvme_id_ctrl *id)
subsys->dev.class = nvme_subsys_class;
subsys->dev.release = nvme_release_subsystem;
subsys->dev.groups = nvme_subsys_attrs_groups;
-   dev_set_name(&subsys->dev, "nvme-subsys%d", subsys->instance);
+   dev_set_name(&subsys->dev, "nvme-subsys%d", ctrl->instance);
device_initialize(&subsys->dev);
 
mutex_lock(&nvme_subsystems_lock);
@@ -2528,6 +2524,8 @@ static int nvme_init_subsystem(struct nvme_ctrl *ctrl, 
struct nvme_id_ctrl *id)
goto out_put_subsystem;
}
 
+   if (!found)
+   subsys->instance = ctrl->instance;
ctrl->subsys = subsys;
list_add_tail(&ctrl->subsys_entry, &subsys->ctrls);
mutex_unlock(&nvme_subsystems_lock);
@@ -3803,7 +3801,9 @@ static void nvme_free_ctrl(struct device *dev)
container_of(dev, struct nvme_ctrl, ctrl_device);
struct nvme_subsystem *subsys = ctrl->subsys;
 
-   ida_simple_remove(&nvme_instance_ida, ctrl->instance);
+   if (subsys && ctrl->instance != subsys->instance)
+   ida_simple_remove(&nvme_instance_ida, ctrl->instance);
+
kfree(ctrl->effects);
nvme_mpath_uninit(ctrl);
__free_page(ctrl->discard_page);
@@ -4085,7 +4085,6 @@ static int __init nvme_core_init(void)
 
 static void __exit nvme_core_exit(void)
 {
-   ida_destroy(&nvme_subsystems_ida);
class_destroy(nvme_subsys_class);
class_destroy(nvme_class);
unregister_chrdev_region(nvme_chr_devt, NVME_MINORS);
-- 
2.14.5



[PATCH] nvme: Restore device naming sanity

2019-09-04 Thread Keith Busch
The namespace names must be unique for the lifetime of the subsystem.
This was accomplished by using their parent subsystems' instances,
which were independent of the controllers connected to that subsystem.

The consequence of that naming scheme is that name prefixes given to
namespaces may match a controller from an unrelated subsystem. This has
understandably invited confusion when examining device nodes.

Ensure the namespace's subsystem instance never clashes with a
controller instance of another subsystem by transferring the instance
ownership to parent subsystem from the first controller discovered in
that subsystem.

Reviewed-by: Logan Gunthorpe 
Signed-off-by: Keith Busch 
---
 drivers/nvme/host/core.c | 21 ++---
 1 file changed, 10 insertions(+), 11 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 14c0bfb55615..8a8279ece5ee 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -81,7 +81,6 @@ EXPORT_SYMBOL_GPL(nvme_reset_wq);
 struct workqueue_struct *nvme_delete_wq;
 EXPORT_SYMBOL_GPL(nvme_delete_wq);
 
-static DEFINE_IDA(nvme_subsystems_ida);
 static LIST_HEAD(nvme_subsystems);
 static DEFINE_MUTEX(nvme_subsystems_lock);
 
@@ -2344,7 +2343,8 @@ static void nvme_release_subsystem(struct device *dev)
struct nvme_subsystem *subsys =
container_of(dev, struct nvme_subsystem, dev);
 
-   ida_simple_remove(&nvme_subsystems_ida, subsys->instance);
+   if (subsys->instance >= 0)
+   ida_simple_remove(&nvme_instance_ida, subsys->instance);
kfree(subsys);
 }
 
@@ -2473,12 +2473,8 @@ static int nvme_init_subsystem(struct nvme_ctrl *ctrl, 
struct nvme_id_ctrl *id)
subsys = kzalloc(sizeof(*subsys), GFP_KERNEL);
if (!subsys)
return -ENOMEM;
-   ret = ida_simple_get(&nvme_subsystems_ida, 0, 0, GFP_KERNEL);
-   if (ret < 0) {
-   kfree(subsys);
-   return ret;
-   }
-   subsys->instance = ret;
+
+   subsys->instance = -1;
mutex_init(&subsys->lock);
kref_init(&subsys->ref);
INIT_LIST_HEAD(&subsys->ctrls);
@@ -2497,7 +2493,7 @@ static int nvme_init_subsystem(struct nvme_ctrl *ctrl, 
struct nvme_id_ctrl *id)
subsys->dev.class = nvme_subsys_class;
subsys->dev.release = nvme_release_subsystem;
subsys->dev.groups = nvme_subsys_attrs_groups;
-   dev_set_name(&subsys->dev, "nvme-subsys%d", subsys->instance);
+   dev_set_name(&subsys->dev, "nvme-subsys%d", ctrl->instance);
device_initialize(&subsys->dev);
 
mutex_lock(&nvme_subsystems_lock);
@@ -2528,6 +2524,8 @@ static int nvme_init_subsystem(struct nvme_ctrl *ctrl, 
struct nvme_id_ctrl *id)
goto out_put_subsystem;
}
 
+   if (!found)
+   subsys->instance = ctrl->instance;
ctrl->subsys = subsys;
list_add_tail(&ctrl->subsys_entry, &subsys->ctrls);
mutex_unlock(&nvme_subsystems_lock);
@@ -3803,7 +3801,9 @@ static void nvme_free_ctrl(struct device *dev)
container_of(dev, struct nvme_ctrl, ctrl_device);
struct nvme_subsystem *subsys = ctrl->subsys;
 
-   ida_simple_remove(&nvme_instance_ida, ctrl->instance);
+   if (subsys && ctrl->instance != subsys->instance)
+   ida_simple_remove(&nvme_instance_ida, ctrl->instance);
+
kfree(ctrl->effects);
nvme_mpath_uninit(ctrl);
__free_page(ctrl->discard_page);
@@ -4085,7 +4085,6 @@ static int __init nvme_core_init(void)
 
 static void __exit nvme_core_exit(void)
 {
-   ida_destroy(&nvme_subsystems_ida);
class_destroy(nvme_subsys_class);
class_destroy(nvme_class);
unregister_chrdev_region(nvme_chr_devt, NVME_MINORS);
-- 
2.14.5



Re: [PATCH] nvme-core: Fix subsystem instance mismatches

2019-09-04 Thread Keith Busch
On Wed, Sep 04, 2019 at 11:01:22AM -0600, Logan Gunthorpe wrote:
> Oh, yes that's simpler than the struct/kref method and looks like it
> will accomplish the same thing. I did some brief testing with it and it
> seems to work for me (though I don't have any subsystems with multiple
> controllers). If you want to make a patch out of it you can add my
>
> Reviewed-by: Logan Gunthorpe 

Thanks! I'll make it a proper patch and send shortly.

For testing multi-controller subsystems, I haven't got proper hardware
either, so I really like the nvme loop target. Here's a very simple json
defining a two namespace subsystem backed by two real nvme devices:

loop.json:
---
{
  "ports": [
{
  "addr": {
"adrfam": "",
"traddr": "",
"treq": "not specified",
"trsvcid": "",
"trtype": "loop"
  },
  "portid": 1,
  "referrals": [],
  "subsystems": [
"testnqn"
  ]
}
  ],
  "subsystems": [
{
  "attr": {
"allow_any_host": "1"
  },
  "namespaces": [
{
  "device": {
"nguid": "ef90689c-6c46-d44c-89c1-4067801309a8",
"path": "/dev/nvme0n1"
  },
  "enable": 1,
  "nsid": 1
},
{
  "device": {
"nguid": "ef90689c-6c46-d44c-89c1-4067801309a9",
"path": "/dev/nvme1n1"
  },
  "enable": 1,
  "nsid": 2
}
  ],
  "nqn": "testnqn"
}
  ]
}
--

Configure the target:

  # nvmetcli restore loop.json

Connect to it twice:

  # nvme connect -n testnqn -t loop
  # nvme connect -n testnqn -t loop

List the result:

  # nvme list -v
  NVM Express Subsystems

  SubsystemSubsystem-NQN
Controllers
   

 
  nvme-subsys0 
nqn.2014.08.org.nvmexpress:8086108ePHLE7200015N6P4BGN-17335943:ICDPC5ED2ORA6.4T 
 nvme0
  nvme-subsys1 
nqn.2014.08.org.nvmexpress:8086108ePHLE7200015N6P4BGN-27335943:ICDPC5ED2ORA6.4T 
 nvme1
  nvme-subsys2 testnqn  
nvme2, nvme3

  NVM Express Controllers

  Device   SN   MN   FR 
  TxPort AddressSubsystemNamespaces
     
 -- --  
  nvme0PHLE7200015N6P4BGN-1 7335943:ICDPC5ED2ORA6.4T 
QDV1RD07 pcie   :88:00.0   nvme-subsys0 nvme0n1
  nvme1PHLE7200015N6P4BGN-2 7335943:ICDPC5ED2ORA6.4T 
QDV1RD03 pcie   :89:00.0   nvme-subsys1 nvme1n1
  nvme29eb72cbeecc6fdb0 Linux
5.3.0-rc loop  nvme-subsys2 nvme2n1, nvme2n2
  nvme39eb72cbeecc6fdb0 Linux
5.3.0-rc loop  nvme-subsys2 nvme2n1, nvme2n2

  NVM Express Namespaces

  Device   NSID Usage  Format   Controllers
    --  

  nvme0n1  1  3.20  TB /   3.20  TB512   B +  0 B   nvme0
  nvme1n1  1  3.20  TB /   3.20  TB512   B +  0 B   nvme1
  nvme2n1  1  3.20  TB /   3.20  TB512   B +  0 B   nvme2, nvme3
  nvme2n2  2  3.20  TB /   3.20  TB512   B +  0 B   nvme2, nvme3




Re: [PATCH] nvme-core: Fix subsystem instance mismatches

2019-09-04 Thread Keith Busch
On Wed, Sep 04, 2019 at 10:07:12AM -0600, Logan Gunthorpe wrote:
> Yes, I agree, we can't solve the mismatch problem in the general case:
> with sequences of hot plug events there will always be a case that
> mismatches. I just think we can do better in the simple common default case.

This may be something where udev can help us. I might be able to find
some time to look at that, but not today.
 
> > Can we just ensure there is never a matching controller then? This
> > patch will accomplish that and simpler than wrapping the instance in a
> > refcount'ed object:
> > 
> > http://lists.infradead.org/pipermail/linux-nvme/2019-May/024142.html
> 
> I don't really like that idea. It reduces the confusion caused by
> mismatching numbers, but causes the controller to never match the
> namespace, which is also confusing but in a different way.
> 
> I like the nvme_instance idea. It's not going to be perfect but it has
> some nice properties: the subsystem will try to match the controller's
> instance whenever possible, but in cases where it doesn't, the instance
> number of the subsystem will never be the same as an existing controller.
> 
> I'll see if I can work up a quick patch set and see what people think.

How about this: have the subsystem copy the controller's instance, and
have nvme_free_ctrl() not release it if its subsystem's instance matches?

---
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 14c0bfb55615..8a8279ece5ee 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -81,7 +81,6 @@ EXPORT_SYMBOL_GPL(nvme_reset_wq);
 struct workqueue_struct *nvme_delete_wq;
 EXPORT_SYMBOL_GPL(nvme_delete_wq);
 
-static DEFINE_IDA(nvme_subsystems_ida);
 static LIST_HEAD(nvme_subsystems);
 static DEFINE_MUTEX(nvme_subsystems_lock);
 
@@ -2344,7 +2343,8 @@ static void nvme_release_subsystem(struct device *dev)
struct nvme_subsystem *subsys =
container_of(dev, struct nvme_subsystem, dev);
 
-   ida_simple_remove(&nvme_subsystems_ida, subsys->instance);
+   if (subsys->instance >= 0)
+   ida_simple_remove(&nvme_instance_ida, subsys->instance);
kfree(subsys);
 }
 
@@ -2473,12 +2473,8 @@ static int nvme_init_subsystem(struct nvme_ctrl *ctrl, 
struct nvme_id_ctrl *id)
subsys = kzalloc(sizeof(*subsys), GFP_KERNEL);
if (!subsys)
return -ENOMEM;
-   ret = ida_simple_get(&nvme_subsystems_ida, 0, 0, GFP_KERNEL);
-   if (ret < 0) {
-   kfree(subsys);
-   return ret;
-   }
-   subsys->instance = ret;
+
+   subsys->instance = -1;
mutex_init(&subsys->lock);
kref_init(&subsys->ref);
INIT_LIST_HEAD(&subsys->ctrls);
@@ -2497,7 +2493,7 @@ static int nvme_init_subsystem(struct nvme_ctrl *ctrl, 
struct nvme_id_ctrl *id)
subsys->dev.class = nvme_subsys_class;
subsys->dev.release = nvme_release_subsystem;
subsys->dev.groups = nvme_subsys_attrs_groups;
-   dev_set_name(&subsys->dev, "nvme-subsys%d", subsys->instance);
+   dev_set_name(&subsys->dev, "nvme-subsys%d", ctrl->instance);
device_initialize(&subsys->dev);
 
mutex_lock(&nvme_subsystems_lock);
@@ -2528,6 +2524,8 @@ static int nvme_init_subsystem(struct nvme_ctrl *ctrl, 
struct nvme_id_ctrl *id)
goto out_put_subsystem;
}
 
+   if (!found)
+   subsys->instance = ctrl->instance;
ctrl->subsys = subsys;
list_add_tail(&ctrl->subsys_entry, &subsys->ctrls);
mutex_unlock(&nvme_subsystems_lock);
@@ -3803,7 +3801,9 @@ static void nvme_free_ctrl(struct device *dev)
container_of(dev, struct nvme_ctrl, ctrl_device);
struct nvme_subsystem *subsys = ctrl->subsys;
 
-   ida_simple_remove(&nvme_instance_ida, ctrl->instance);
+   if (subsys && ctrl->instance != subsys->instance)
+   ida_simple_remove(&nvme_instance_ida, ctrl->instance);
+
kfree(ctrl->effects);
nvme_mpath_uninit(ctrl);
__free_page(ctrl->discard_page);
@@ -4085,7 +4085,6 @@ static int __init nvme_core_init(void)
 
 static void __exit nvme_core_exit(void)
 {
-   ida_destroy(&nvme_subsystems_ida);
class_destroy(nvme_subsys_class);
class_destroy(nvme_class);
unregister_chrdev_region(nvme_chr_devt, NVME_MINORS);
--


Re: [PATCH] nvme-core: Fix subsystem instance mismatches

2019-09-04 Thread Keith Busch
On Wed, Sep 04, 2019 at 05:42:15PM +0200, Christoph Hellwig wrote:
> On Wed, Sep 04, 2019 at 08:44:27AM -0600, Keith Busch wrote:
> > Let me step through an example:
> > 
> >   Ctrl A gets instance 0.
> > 
> >   Its subsystem gets the same instance, and takes ref count on it:
> >   all namespaces in this subsystem will use '0'.
> > 
> >   Ctrl B gets instance 1, and it's in the same subsystem as Ctrl A so
> >   no new subsytem is allocated.
> > 
> >   Ctrl A is disconnected, dropping its ref on instance 0, but the
> >   subsystem still has its refcount, making it unavailable.
> > 
> >   Ctrl A is reconnected, and allocates instance 2 because 0 is still in
> >   use.
> > 
> > Now all the namespaces in this subsystem are prefixed with nvme0, but no
> > controller exists with the same prefix. We still have inevitable naming
> > mismatch, right?
> 
> I think th major confusion was that we can use the same handle for
> and unrelated subsystem vs controller, and that would avoid it.
>
> I don't see how we can avoid the controller is entirely different
> from namespace problem ever.

Can we just ensure there is never a matching controller then? This
patch will accomplish that, and is simpler than wrapping the instance
in a refcount'ed object:

http://lists.infradead.org/pipermail/linux-nvme/2019-May/024142.html


Re: [PATCH] nvme-core: Fix subsystem instance mismatches

2019-09-04 Thread Keith Busch
On Wed, Sep 04, 2019 at 08:05:58AM +0200, Christoph Hellwig wrote:
> On Tue, Sep 03, 2019 at 10:46:20AM -0600, Keith Busch wrote:
> > Could we possibly make /dev/nvmeX be a subsystem handle without causing
> > trouble for anyone? This would essentially be the same thing as today
> > for non-CMIC controllers with a device-per-controller and only affects
> > the CMIC ones.
> 
> A per-subsyste character device doesn't make sense, as a lot of admin
> command require a specific controller.

Yeah, I was hoping to provide something special for CMIC controllers
so you can do path-specific admin commands, but that looks sure to
break user space.

> If this really is an isue for people we'll just need to refcount the
> handle allocation.  That is:
> 
>  - nvme_init_ctrl allocates a new nvme_instance or so object, which
>does the ida_simple_get.
>  - we allocate a new subsystem that reuses the handle and grabs
>a reference in nvme_init_subsystem, then if we find an existing
>subsystem we drop that reference again.
>  - last free of a ctrl or subsystem also drops a reference, with
>the final free releasing the ida

Let me step through an example:

  Ctrl A gets instance 0.

  Its subsystem gets the same instance, and takes ref count on it:
  all namespaces in this subsystem will use '0'.

  Ctrl B gets instance 1, and it's in the same subsystem as Ctrl A so
  no new subsytem is allocated.

  Ctrl A is disconnected, dropping its ref on instance 0, but the
  subsystem still has its refcount, making it unavailable.

  Ctrl A is reconnected, and allocates instance 2 because 0 is still in
  use.

Now all the namespaces in this subsystem are prefixed with nvme0, but no
controller exists with the same prefix. We still have inevitable naming
mismatch, right?


Re: [PATCH] nvme-core: Fix subsystem instance mismatches

2019-09-03 Thread Keith Busch
On Tue, Sep 03, 2019 at 10:08:01AM -0600, Logan Gunthorpe wrote:
> On 2019-08-31 9:29 a.m., Keith Busch wrote:
> > On Fri, Aug 30, 2019 at 06:01:39PM -0600, Logan Gunthorpe wrote:
> >> To fix this, assign the subsystem's instance based on the instance
> >> number of the controller's instance that first created it. There should
> >> always be fewer subsystems than controllers so the should not be a need
> >> to create extra subsystems that overlap existing controllers.
> > 
> > The subsystem's lifetime is not tied to the controller's. When the
> > controller is removed and releases its instance, the next controller
> > to take that available instance will create naming collisions with the
> > subsystem still using it.
> > 
> 
> Hmm, yes, ok.
> 
> So perhaps we can just make the subsystem prefer the ctrl's instance
> when allocating the ID? Then at least, in the common case, the
> controller numbers will match the subsystem numbers. Only when there's
> random hot-plugs would the numbers get out of sync.

I really don't know about a patch that works only on static
configurations. Connects and disconnects do happen on live systems,
so the numerals will inevitably get out of sync.

Could we possibly make /dev/nvmeX be a subsystem handle without causing
trouble for anyone? This would essentially be the same thing as today
for non-CMIC controllers with a device-per-controller and only affects
the CMIC ones.


Re: [PATCH] nvme-core: Fix subsystem instance mismatches

2019-08-31 Thread Keith Busch
On Fri, Aug 30, 2019 at 06:01:39PM -0600, Logan Gunthorpe wrote:
> To fix this, assign the subsystem's instance based on the instance
> number of the controller's instance that first created it. There should
> always be fewer subsystems than controllers so the should not be a need
> to create extra subsystems that overlap existing controllers.

The subsystem's lifetime is not tied to the controller's. When the
controller is removed and releases its instance, the next controller
to take that available instance will create naming collisions with the
subsystem still using it.


Re: [PATCH 3/4] nvme: pci: pass IRQF_RESCURE_THREAD to request_threaded_irq

2019-08-27 Thread Keith Busch
On Tue, Aug 27, 2019 at 08:34:21AM -0600, Keith Busch wrote:
> I think you should probably just have pci_irq_get_affinity() take a flags
> argument, or make a new function like __pci_irq_get_affinity() that
> pci_irq_get_affinity() can call with a default flag.

Sorry, copied the wrong function name; I meant pci_request_irq(),
not pci_irq_get_affinity().


Re: [PATCH 4/4] genirq: use irq's affinity for threaded irq with IRQF_RESCUE_THREAD

2019-08-27 Thread Keith Busch
On Tue, Aug 27, 2019 at 04:53:44PM +0800, Ming Lei wrote:
> In case of IRQF_RESCUE_THREAD, the threaded handler is only used to
> handle interrupt when IRQ flood comes, use irq's affinity for this thread
> so that scheduler may select other not too busy CPUs for handling the
> interrupt.
> 
> Cc: Long Li 
> Cc: Ingo Molnar ,
> Cc: Peter Zijlstra 
> Cc: Keith Busch 
> Cc: Jens Axboe 
> Cc: Christoph Hellwig 
> Cc: Sagi Grimberg 
> Cc: John Garry 
> Cc: Thomas Gleixner 
> Cc: Hannes Reinecke 
> Cc: linux-n...@lists.infradead.org
> Cc: linux-s...@vger.kernel.org
> Signed-off-by: Ming Lei 
> ---
>  kernel/irq/manage.c | 13 -
>  1 file changed, 12 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
> index 1566abbf50e8..03bc041348b7 100644
> --- a/kernel/irq/manage.c
> +++ b/kernel/irq/manage.c
> @@ -968,7 +968,18 @@ irq_thread_check_affinity(struct irq_desc *desc, struct 
> irqaction *action)
>   if (cpumask_available(desc->irq_common_data.affinity)) {
>   const struct cpumask *m;
>  
> - m = irq_data_get_effective_affinity_mask(&desc->irq_data);
> + /*
> +  * Managed IRQ's affinity is setup gracefull on MUNA locality,

s/MUNA/NUMA

> +  * also if IRQF_RESCUE_THREAD is set, interrupt flood has been
> +  * triggered, so ask scheduler to run the thread on CPUs
> +  * specified by this interrupt's affinity.
> +  */
> + if ((action->flags & IRQF_RESCUE_THREAD) &&
> + irqd_affinity_is_managed(&desc->irq_data))
> + m = desc->irq_common_data.affinity;
> + else
> + m = irq_data_get_effective_affinity_mask(
> + &desc->irq_data);
>   cpumask_copy(mask, m);
>   } else {
>   valid = false;
> -- 


Re: [PATCH 3/4] nvme: pci: pass IRQF_RESCURE_THREAD to request_threaded_irq

2019-08-27 Thread Keith Busch
On Tue, Aug 27, 2019 at 05:09:27PM +0800, Ming Lei wrote:
> On Tue, Aug 27, 2019 at 11:06:20AM +0200, Johannes Thumshirn wrote:
> > On 27/08/2019 10:53, Ming Lei wrote:
> > [...]
> > > + char *devname;
> > > + const struct cpumask *mask;
> > > + unsigned long irqflags = IRQF_SHARED;
> > > + int vector = pci_irq_vector(pdev, nvmeq->cq_vector);
> > > +
> > > + devname = kasprintf(GFP_KERNEL, "nvme%dq%d", nr, nvmeq->qid);
> > > + if (!devname)
> > > + return -ENOMEM;
> > > +
> > > + mask = pci_irq_get_affinity(pdev, nvmeq->cq_vector);
> > > + if (mask && cpumask_weight(mask) > 1)
> > > + irqflags |= IRQF_RESCUE_THREAD;
> > > +
> > > + return request_threaded_irq(vector, nvme_irq, NULL, irqflags,
> > > + devname, nvmeq);
> > 
> > This will leak 'devname' which gets allocated by kasprintf() a few lines
> > above.
> 
> It won't, please see pci_free_irq() in which free_irq() returns the
> 'devname' passed in.

Only if request_threaded_irq() doesn't return an error.

I think you should probably just have pci_irq_get_affinity() take a flags
argument, or make a new function like __pci_irq_get_affinity() that
pci_irq_get_affinity() can call with a default flag.


Re: [PATCH v2 0/3] PCI: Add PCI_ERROR_RESPONSE, check for errors

2019-08-23 Thread Keith Busch
On Thu, Aug 22, 2019 at 03:05:48PM -0500, Bjorn Helgaas wrote:
> From: Bjorn Helgaas 
> 
> Reads from a PCI device may fail if the device has been turned off (put
> into D3cold), removed, or if some other error occurs.  The PCI host bridge
> typically fabricates ~0 data to complete the CPU's read.
> 
> We check for that in a few places, but not in a consistent way.  This
> series adds a PCI_ERROR_RESPONSE definition to make the checks more
> consistent and easier to find.  Note that ~0 may indicate a PCI error, but
> it may also be valid read data, so you need more information (such as
> knowing that a register can never contain ~0) before concluding that it's
> an error.
> 
> This series also adds a new check for PCI_ERROR_RESPONSE in the power
> management code because that code frequently encounters devices in D3cold,
> where we previously misinterpreted ~0 data.  It also uses pci_power_name()
> to print D-state names more consistently.
> 
> Rafael, I didn't add your Reviewed-by to "PCI / PM: Return error when
> changing power state from D3cold" because I made small changes to try to
> make the messages more consistent, and I didn't want to presume they'd be
> OK with you.
> 
> Changes since v1:
>   - Add Rafael's Reviewed-By to the first two patches
>   - Drop "PCI / PM: Check for error when reading PME status" because Rafael
> pointed out that some devices can signal PME even when in D3cold, so
> this would require additional rework
>   - Drop "PCI / PM: Check for error when reading Power State" because
> Rafael thinks it's mostly redundant
> 
> Bjorn Helgaas (3):
>   PCI: Add PCI_ERROR_RESPONSE definition
>   PCI / PM: Decode D3cold power state correctly
>   PCI / PM: Return error when changing power state from D3cold

Series looks good to me.

Reviewed-by: Keith Busch 

