Re: [revert commit 9f079dda1433] [mainline] [6.8.0-rc3] [NVME] OOPS kernel crash while booting

2024-02-06 Thread Keith Busch
On Tue, Feb 06, 2024 at 10:05:20PM +0530, Tasmiya Nalatwad wrote:
> Greetings,
> 
> [revert commit 9f079dda1433] [mainline] [6.8.0-rc3] [NVME] OOPS kernel crash
> while booting to kernel
> 
> Reverting below commit fixes the problem
> 
> commit 9f079dda14339ee87d864306a9dc8c6b4e4da40b
>     nvme: allow passthru cmd error logging

Thanks for the report. Let's take a shot at fixing it before considering
a revert. I copied you on the patch proposal.


Re: [PATCH 2/2] nvme-pci: use blk_mq_max_nr_hw_queues() to calculate io queues

2023-07-10 Thread Keith Busch
On Mon, Jul 10, 2023 at 05:14:15PM +0800, Ming Lei wrote:
> On Mon, Jul 10, 2023 at 08:41:09AM +0200, Christoph Hellwig wrote:
> > On Sat, Jul 08, 2023 at 10:02:59AM +0800, Ming Lei wrote:
> > > Take blk-mq's knowledge into account for calculating io queues.
> > > 
> > > Fix wrong queue mapping in case of kdump kernel.
> > > 
> > > On arm and ppc64, 'maxcpus=1' is passed to kdump command line, see
> > > `Documentation/admin-guide/kdump/kdump.rst`, so num_possible_cpus()
> > > still returns all CPUs.
> > 
> > That's simply broken.  Please fix the arch code to make sure
> > it does not return a bogus num_possible_cpus value for these
> 
> That is documented in Documentation/admin-guide/kdump/kdump.rst.
> 
> On arm and ppc64, 'maxcpus=1' is passed for kdump kernel, and "maxcpu=1"
> simply keep one of CPU cores as online, and others as offline.
> 
> So Cc our arch(arm & ppc64) & kdump guys wrt. passing 'maxcpus=1' for
> kdump kernel.
> 
> > setups, otherwise you'll have to paper over it in all kind of
> > drivers.
> 
> The issue is only triggered for drivers which use managed irq &
> multiple hw queues.

Is the problem that the managed interrupt sets the effective irq
affinity to an offline CPU? You mentioned observed timeouts; are you
seeing the "completion polled" nvme message?


Re: [PATCH v4 1/4] PCI: Introduce pci_dev_for_each_resource()

2023-03-10 Thread Keith Busch
On Fri, Mar 10, 2023 at 07:14:13PM +0200, Andy Shevchenko wrote:
> +#define __pci_dev_for_each_resource(dev, res, __i, vartype)  \
> + for (vartype __i = 0;   \
> + res = &(dev)->resource[__i], __i < PCI_NUM_RESOURCES;   \
> + __i++)

...

> +#define pci_dev_for_each_resource_p(dev, res)
> \
> + __pci_dev_for_each_resource(dev, res, i, unsigned int)

It looks dangerous to have a macro declare a variable when starting a new
scope. How do you know the name 'i' won't clash with something defined above?


Re: [PATCH 02/13] nvme-multipath: add error handling support for add_disk()

2021-10-15 Thread Keith Busch
On Fri, Oct 15, 2021 at 04:52:08PM -0700, Luis Chamberlain wrote:
> We never checked for errors on add_disk() as this function
> returned void. Now that this is fixed, use the shiny new
> error handling.
> 
> Since we now can tell for sure when a disk was added, move
> setting the bit NVME_NSHEAD_DISK_LIVE only when we did
> add the disk successfully.
> 
> Nothing to do here as the cleanup is done elsewhere. We take
> care and use test_and_set_bit() because it is protects against
> two nvme paths simultaneously calling device_add_disk() on the
> same namespace head.

Looks good, thank you.

Reviewed-by: Keith Busch 


Re: [trivial PATCH] treewide: Convert switch/case fallthrough; to break;

2020-09-09 Thread Keith Busch
On Wed, Sep 09, 2020 at 01:06:39PM -0700, Joe Perches wrote:
> diff --git a/crypto/tcrypt.c b/crypto/tcrypt.c
> index eea0f453cfb6..8aac5bc60f4c 100644
> --- a/crypto/tcrypt.c
> +++ b/crypto/tcrypt.c
> @@ -2464,7 +2464,7 @@ static int do_test(const char *alg, u32 type, u32 mask, 
> int m, u32 num_mb)
>   test_hash_speed("streebog512", sec,
>   generic_hash_speed_template);
>   if (mode > 300 && mode < 400) break;
> - fallthrough;
> + break;
>   case 399:
>   break;

Just imho, this change makes the preceding 'if' look even more
pointless. Maybe the fallthrough was a deliberate choice? Not that my
opinion matters here as I don't know this module, but it looked a bit
odd to me.


Re: [PATCH] PCI/AER: only insert one element into kfifo

2018-12-12 Thread Keith Busch
On Wed, Dec 12, 2018 at 04:32:30PM +0800, Yanjiang Jin wrote:
> 'commit ecae65e133f2 ("PCI/AER: Use kfifo_in_spinlocked() to
> insert locked elements")' replace kfifo_put() with kfifo_in_spinlocked().
> 
> But as "kfifo_in(fifo, buf, n)" describes:
> " * @n: number of elements to be added".
> 
> We want to insert only one element into kfifo, not "sizeof(entry) = 16".
> Without this patch, we would get 15 uninitialized elements.
> 
> Signed-off-by: Yanjiang Jin 

My bad. I had trouble testing the GHES path for this.  Thanks for the fix.

Reviewed-by: Keith Busch 

> ---
>  drivers/pci/pcie/aer.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index a90a919..fed29de 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -1064,7 +1064,7 @@ void aer_recover_queue(int domain, unsigned int bus, 
> unsigned int devfn,
> .regs   = aer_regs,
> };
> 
> -   if (kfifo_in_spinlocked(_recover_ring, , sizeof(entry),
> +   if (kfifo_in_spinlocked(_recover_ring, , 1,
>  _recover_ring_lock))
> schedule_work(_recover_work);
> else
> --
> 1.8.3.1


Re: [PATCH 0/2] PCI/AER: Consistently use _OSC to determine who owns AER

2018-11-20 Thread Keith Busch
On Tue, Nov 20, 2018 at 04:02:21PM -0500, Sinan Kaya wrote:
> On 11/20/2018 3:44 PM, alex_gagn...@dellteam.com wrote:
> > I'd prefer "sure" instead of "think". "I think it breaks some system I'm
> > not telling you about" doesn't help much in figuring out how not to
> > break said system(s).:)
> 
> Sorry, I thought I mentioned why it would break but let me repeat.
> 
> The systems I have seen rely on the HEST table presence as an indicator
> to the OS that firmware first is enabled. If you go look at the _OSC bits
> on such systems, it still says OS owns the AER service.
> 
> The assumption here is that HEST table has precedence over the _OSC bits.
> That's what needs to be clarified in the UEFI forum.
> 
> If this code is to go in and ignore the HEST table presence, then firmware
> will think that it owns AER service and OS will think that it owns AER
> service too.

How does that work? If the OS takes control, it sets up MSIs that FW don't
react to, and disables system errors through PCIe Root Control. Aren't
those sys errs the mechanism FW knows it has something to do, which
means the OS can effectively fence it off?


Re: [PATCH 0/2] PCI/AER: Consistently use _OSC to determine who owns AER

2018-11-19 Thread Keith Busch
On Mon, Nov 19, 2018 at 12:56:56PM -0500, Sinan Kaya wrote:
> On 11/19/2018 12:41 PM, Keith Busch wrote:
> > > Still, breaking existing systems that rely on HEST table is not cool.
> > > I'd rather have users specify "pcie_ports=native" to skip FF rather than
> > > having broken systems by default to be honest.
> > The pcie_ports=native work-around ignores FF to potentially unknown
> > results, though.
> > 
> 
> How about being able to enable/disable FF in BIOS?
>
> We can't really turn off firmware first in the kernel without asking help
> from the firmware.

The _OSC method this patch utilizes is the ACPI spec defined way for
the kernel to wrest control from firmware. BIOS specific menu settings
shouldn't be our only recourse when we have a spec authority defining
generic OS interfaces to accomplish the same thing.

Unless there is a disagreement on the _OSC interpreation, we'd have to
accept that platforms breaking from this patch are non-compliant.

> Like you said, it causes unpredictable results.
>
> There will be two competing software trying to touch the same registers.


Re: [PATCH 0/2] PCI/AER: Consistently use _OSC to determine who owns AER

2018-11-19 Thread Keith Busch
On Mon, Nov 19, 2018 at 12:42:25PM -0500, Sinan Kaya wrote:
> On 11/19/2018 12:32 PM, Sinan Kaya wrote:
> > > 
> > > But we're not using HEST as a fine grain control. We disable native AER
> > > handling if *any* device has FF set in HEST, and that just forces people
> > > to use pcie_ports=native to get around that.
> > > 
> > 
> > I don't see *any* in the code.  aer_hest_parse() does the HEST table 
> > parsing.
> > It switches to firmware first mode if global flag in HEST is set. Otherwise
> > for each BDF in device, hest_match_pci() is used to do a cross-matching 
> > against
> > HEST table contents.
> > 
> > Am I missing something?
> 
> I see. I think you are talking about aer_firmware_first, right?
> 
> aer_set_firmware_first() and  pcie_aer_get_firmware_first() seem to do the 
> right
> thing.

Right, but what difference does it make if device specific AER checks do
the right thing if pcie_aer_init() doesn't even register it's port driver?
 
> aer_firmware_first is probably getting set because events are all routed to a
> single root port and aer_acpi_firmware_first() is used to decide whether AER
> should be initialized or not.
> 
> I think I understand what is going on now.
> 
> Still, breaking existing systems that rely on HEST table is not cool.
> I'd rather have users specify "pcie_ports=native" to skip FF rather than
> having broken systems by default to be honest.

The pcie_ports=native work-around ignores FF to potentially unknown
results, though.


Re: [PATCH 0/2] PCI/AER: Consistently use _OSC to determine who owns AER

2018-11-19 Thread Keith Busch
On Mon, Nov 19, 2018 at 12:32:42PM -0500, Sinan Kaya wrote:
> On 11/19/2018 11:53 AM, Keith Busch wrote:
> > On Mon, Nov 19, 2018 at 11:53:05AM -0500, Tyler Baicar wrote:
> > > On Thu, Nov 15, 2018 at 8:49 PM Sinan Kaya  wrote:
> > > > 
> > > > On 11/15/2018 3:16 PM, Alexandru Gagniuc wrote:
> > > > > I've asked around a few people at Dell and they unanimously agree that
> > > > > _OSC is the correct way to determine ownership of AER. In linux, we
> > > > > use the result of _OSC to enable AER services, but we use HEST to
> > > > > determine AER ownership. That's inconsistent. This series drops the
> > > > > use of HEST in favor of _OSC.
> > > > > 
> > > > > [1]https://lkml.org/lkml/2018/11/15/62
> > > > 
> > > > This change breaks the existing systems that rely on the HEST table
> > > > telling the operating system about firmware first presence.
> > > > 
> > > > Besides, HEST table has much more granularity about which PCI component
> > > > needs firmware such as global/device/switch.
> > > > 
> > > > You should probably circulate these ideas for wider consumption in UEFI
> > > > forum as UEFI owns the HEST table definition.
> > > 
> > > I agree with Sinan, this will break existing systems, and the granularity 
> > > of the
> > > HEST definition is more useful than the single bit in _OSC.
> > 
> > But we're not using HEST as a fine grain control. We disable native AER
> > handling if *any* device has FF set in HEST, and that just forces people
> > to use pcie_ports=native to get around that.
> > 
> 
> I don't see *any* in the code.  aer_hest_parse() does the HEST table parsing.
> It switches to firmware first mode if global flag in HEST is set. Otherwise
> for each BDF in device, hest_match_pci() is used to do a cross-matching 
> against
> HEST table contents.
> 
> Am I missing something?

You might be. :)

static int aer_hest_parse(struct acpi_hest_header *hest_hdr, void *data)
{

/*
 * If no specific device is supplied, determine whether
 * FIRMWARE_FIRST is set for *any* PCIe device.
 */
if (!info->pci_dev) {
info->firmware_first |= ff;
return 0;
}



Re: [PATCH 0/2] PCI/AER: Consistently use _OSC to determine who owns AER

2018-11-19 Thread Keith Busch
On Mon, Nov 19, 2018 at 11:53:05AM -0500, Tyler Baicar wrote:
> On Thu, Nov 15, 2018 at 8:49 PM Sinan Kaya  wrote:
> >
> > On 11/15/2018 3:16 PM, Alexandru Gagniuc wrote:
> > > I've asked around a few people at Dell and they unanimously agree that
> > > _OSC is the correct way to determine ownership of AER. In linux, we
> > > use the result of _OSC to enable AER services, but we use HEST to
> > > determine AER ownership. That's inconsistent. This series drops the
> > > use of HEST in favor of _OSC.
> > >
> > > [1]https://lkml.org/lkml/2018/11/15/62
> >
> > This change breaks the existing systems that rely on the HEST table
> > telling the operating system about firmware first presence.
> >
> > Besides, HEST table has much more granularity about which PCI component
> > needs firmware such as global/device/switch.
> >
> > You should probably circulate these ideas for wider consumption in UEFI
> > forum as UEFI owns the HEST table definition.
> 
> I agree with Sinan, this will break existing systems, and the granularity of 
> the
> HEST definition is more useful than the single bit in _OSC.

But we're not using HEST as a fine grain control. We disable native AER
handling if *any* device has FF set in HEST, and that just forces people
to use pcie_ports=native to get around that.


Re: [PATCH 2/2] PCI/AER: Determine AER ownership based on _OSC instead of HEST

2018-11-15 Thread Keith Busch
On Thu, Nov 15, 2018 at 05:16:03PM -0600, Alexandru Gagniuc wrote:
>  static void aer_set_firmware_first(struct pci_dev *pci_dev)
>  {
> - int rc;
> - struct aer_hest_parse_info info = {
> - .pci_dev= pci_dev,
> - .firmware_first = 0,
> - };
> + struct pci_host_bridge *host = pci_find_host_bridge(pci_dev->bus);
>  
> - rc = apei_hest_parse(aer_hest_parse, );
> -
> - if (rc)
> - pci_dev->__aer_firmware_first = 0;
> - else
> - pci_dev->__aer_firmware_first = info.firmware_first;
> + pci_dev->__aer_firmware_first = !host->native_aer;
>   pci_dev->__aer_firmware_first_valid = 1;
>  }

I think we can clean this up even more by removing the setter and the
__aer_firmware_first fields, and have the pcie_aer_get_firmware_first()
go directly to the host bride->native_aer.


Re: [PATCH v2] PCI/MSI: Don't touch MSI bits when the PCI device is disconnected

2018-11-14 Thread Keith Busch
On Wed, Nov 14, 2018 at 08:52:10PM +, alex_gagn...@dellteam.com wrote:
> But it does in portdrv_core.c:
> 
>   if (dev->aer_cap && pci_aer_available() &&
>   (pcie_ports_native || host->native_aer)) {
>   services |= PCIE_PORT_SERVICE_AER;
> 
> That flag later creates a pcie device that allows aerdrv to attach to.

Oh, right. I saw negotiate_os_control() just uses a stack variable for
the _OSC response, but if I had looked one level deeper, I'd see it
cached in a different structure.


Re: [PATCH v2] PCI/MSI: Don't touch MSI bits when the PCI device is disconnected

2018-11-14 Thread Keith Busch
On Wed, Nov 14, 2018 at 07:22:04PM +, alex_gagn...@dellteam.com wrote:
> On 11/14/2018 12:00 AM, Bjorn Helgaas wrote:
> > Just to make sure we're on the same page, can you point me to this
> > rule?  I do see that OSPM must request control of AER using _OSC
> > before it touches the AER registers.  What I don't see is the
> > connection between firmware-first and the AER registers.
> 
> ACPI 6.2 - 6.2.11.3, Table 6-197:
> 
> PCI Express Advanced Error Reporting control:
>   * The firmware sets this bit to 1 to grant control over PCI Express 
> Advanced Error Reporting. If firmware allows the OS control of this 
> feature, then in the context of the _OSC method it must ensure that 
> error messages are routed to device interrupts as described in the PCI 
> Express Base Specification[...]
> 
> Now I'm confused too:
>   * HEST -> __aer_firmware_first
>   This is used for touching/not touching AER bits
>   * _OSC -> bridge->native_aer
>   Used to enable/not enable AER portdrv service
> Maybe Keith knows better why we're doing it this way. From ACPI text, it 
> doesn't seem that control of AER would be tied to HEST entries, although 
> in practice, it is.

I'm not sure, that predates me.  HEST does have a FIRMWARE_FIRST flag, but
spec does not say anymore on relation to _OSC control or AER capability.
Nothing in PCIe spec either.

I also don't know why Linux disables the AER driver if only one
device has a FIRMWARE_FIRST HEST. Shouldn't that just be a per-device
decision?

> > The closest I can find is the "Enabled" field in the HEST PCIe
> > AER structures (ACPI v6.2, sec 18.3.2.4, .5, .6), where it says:
> > 
> >If the field value is 1, indicates this error source is
> >to be enabled.
> > 
> >If the field value is 0, indicates that the error source
> >is not to be enabled.
> > 
> >If FIRMWARE_FIRST is set in the flags field, the Enabled
> >field is ignored by the OSPM.
> > 
> > AFAICT, Linux completely ignores the Enabled field in these
> > structures.
> 
> I don't think ignoring the field is a problem:
>   * With FFS, OS should ignore it.
>   * Without FFS, we have control, and we get to make the decisions anyway.
> In the latter case we decide whether to use AER, independent of the crap 
> in ACPI. I'm not even sure why "Enabled" matters in native AER handling. 
> Probably one of the check-boxes in "Binary table designer's handbook"?

And why doesn't Linux do anything with _OSC response other than logging
it? If OS control wasn't granted, shouldn't that take priority over HEST?


Re: [PATCH v2] PCI/MSI: Don't touch MSI bits when the PCI device is disconnected

2018-11-13 Thread Keith Busch
On Tue, Nov 13, 2018 at 10:39:15PM +, alex_gagn...@dellteam.com wrote:
> On 11/12/2018 11:02 PM, Bjorn Helgaas wrote:
> > The whole issue of firmware-first, the mechanism by which firmware
> > gets control, the System Error enables in Root Port Root Control
> > registers, etc., is very murky to me.  Jon has a sort of similar issue
> > with VMD where he needs to leave System Errors enabled instead of
> > disabling them as we currently do.
> 
> Well, OS gets control via _OSC method, and based on that it should 
> touch/not touch the AER bits. The bits that get set/cleared come from 
> _HPX method, and there's a more about the FFS described in ACPI spec. It 
> seems that if platform, wants to enable VMD, it should pass the correct 
> bits via _HPX. I'm curious to know in what new twisted way FFS doesn't 
> work as intended.

When VMD is enabled, the platform sees a VMD endpoint. It doesn't see
any of the root ports on that domain, so ACPI can't provide policies for
them nor AER registers for the platform to consider controlling.


Re: [PATCH v2] PCI/MSI: Don't touch MSI bits when the PCI device is disconnected

2018-11-09 Thread Keith Busch
On Fri, Nov 09, 2018 at 03:32:57AM -0800, Greg Kroah-Hartman wrote:
> On Fri, Nov 09, 2018 at 08:29:53AM +0100, Lukas Wunner wrote:
> > On Thu, Nov 08, 2018 at 02:01:17PM -0800, Greg Kroah-Hartman wrote:
> > > On Thu, Nov 08, 2018 at 02:09:17PM -0600, Bjorn Helgaas wrote:
> > > > I'm having second thoughts about this.  One thing I'm uncomfortable
> > > > with is that sprinkling pci_dev_is_disconnected() around feels ad hoc
> > > 
> > > I think my stance always has been that this call is not good at all
> > > because once you call it you never really know if it is still true as
> > > the device could have been removed right afterward.
> > > 
> > > So almost any code that relies on it is broken, there is no locking and
> > > it can and will race and you will loose.
> > 
> > Hm, to be honest if that's your impression I think you must have missed a
> > large portion of the discussion we've been having over the past 2 years.
> > 
> > Please consider reading this LWN article, particularly the "Surprise
> > removal" section, to get up to speed:
> > 
> > https://lwn.net/Articles/767885/
> > 
> > You seem to be assuming that all we care about is the *return value* of
> > an mmio read.  However a transaction to a surprise removed device has
> > side effects beyond returning all ones, such as a Completion Timeout
> > which, with thousands of transactions in flight, added up to many seconds
> > to handle removal of an NVMe array and occasionally caused MCEs.
> 
> Again, I still claim this is broken hardware/firmware :)

Indeed it is, but I don't want to abandon people with hardware in hand
if we can make it work despite being broken. Perfection is the enemy of
good. :)
 
> > It is not an option to just blindly carry out device accesses even though
> > it is known the device is gone, Completion Timeouts be damned.
> 
> I don't disagree with you at all, and your other email is great with
> summarizing the issues here.
> 
> What I do object to is somehow relying on that function call as knowing
> that the device really is present or not.  It's a good hint, yes, but
> driver authors still have to be able to handle the bad data coming back
> from when the call races with the device being removed.

The function has always been a private interface. It is not available
for drivers to rely on.

The only thing we're trying to accomplish is not start a transaction
if software knows it will not succeed. There are certainly times when
a transaction will fail that software does not forsee, but we're not
suggesting the intent handles that either.


Re: [PATCH v2] PCI/MSI: Don't touch MSI bits when the PCI device is disconnected

2018-11-08 Thread Keith Busch
On Thu, Nov 08, 2018 at 02:42:55PM -0800, Greg Kroah-Hartman wrote:
> On Thu, Nov 08, 2018 at 03:32:58PM -0700, Keith Busch wrote:
> > On Thu, Nov 08, 2018 at 02:01:17PM -0800, Greg Kroah-Hartman wrote:
> > > On Thu, Nov 08, 2018 at 02:09:17PM -0600, Bjorn Helgaas wrote:
> > > > I'm having second thoughts about this.  One thing I'm uncomfortable
> > > > with is that sprinkling pci_dev_is_disconnected() around feels ad hoc
> > > > instead of systematic, in the sense that I don't know how we convince
> > > > ourselves that this (and only this) is the correct place to put it.
> > > 
> > > I think my stance always has been that this call is not good at all
> > > because once you call it you never really know if it is still true as
> > > the device could have been removed right afterward.
> > > 
> > > So almost any code that relies on it is broken, there is no locking and
> > > it can and will race and you will loose.
> > 
> > AIUI, we're not trying to create code to rely on this. This more about
> > reducing reliance on hardware. If the software misses the race once and
> > accesses disconnected device memory, that's usually not a big deal to
> > let hardware sort it out, but the point is not to push our luck.
> 
> Then why even care about this call at all?  If you need to really know
> if the read worked, you have to check the value.  If the value is FF
> then you have a huge hint that the hardware is now gone.  And you can
> rely on it being gone, you can never rely on making the call to the
> function to check if the hardware is there to be still valid any point
> in time after the call returns.
> 
> > Surprise hot remove is empirically more reliable the less we interact
> > with hardware and firmware. That shouldn't be necessary, but is just an
> > unfortunate reality.
> 
> You are not "interacting", you are reading/writing to the hardware, as
> you have to do so.  So I really don't understand what you are talking
> about here, sorry.

We're reading hardware memory, yes, but the hardware isn't there.
Something obviously needs to return FF, so we are indirectly interacting
with whatever mechanism handles that. Sometimes that mechanism doesn't
handle it gracefully and instead of having FF to consider, you have a
machine check rebooting your system.


Re: [PATCH v2] PCI/MSI: Don't touch MSI bits when the PCI device is disconnected

2018-11-08 Thread Keith Busch
On Thu, Nov 08, 2018 at 02:01:17PM -0800, Greg Kroah-Hartman wrote:
> On Thu, Nov 08, 2018 at 02:09:17PM -0600, Bjorn Helgaas wrote:
> > I'm having second thoughts about this.  One thing I'm uncomfortable
> > with is that sprinkling pci_dev_is_disconnected() around feels ad hoc
> > instead of systematic, in the sense that I don't know how we convince
> > ourselves that this (and only this) is the correct place to put it.
> 
> I think my stance always has been that this call is not good at all
> because once you call it you never really know if it is still true as
> the device could have been removed right afterward.
> 
> So almost any code that relies on it is broken, there is no locking and
> it can and will race and you will loose.

AIUI, we're not trying to create code to rely on this. This more about
reducing reliance on hardware. If the software misses the race once and
accesses disconnected device memory, that's usually not a big deal to
let hardware sort it out, but the point is not to push our luck.

Surprise hot remove is empirically more reliable the less we interact
with hardware and firmware. That shouldn't be necessary, but is just an
unfortunate reality.


Re: [PATCH v2] PCI/MSI: Don't touch MSI bits when the PCI device is disconnected

2018-11-08 Thread Keith Busch
On Thu, Nov 08, 2018 at 02:09:17PM -0600, Bjorn Helgaas wrote:
> 
> I'm having second thoughts about this.  One thing I'm uncomfortable
> with is that sprinkling pci_dev_is_disconnected() around feels ad hoc
> instead of systematic, in the sense that I don't know how we convince
> ourselves that this (and only this) is the correct place to put it.

You know how the kernel provides ZERO_PAGE, wouldn't it be cool if we
also had a ONES_PAGE and could remap all virtual addresses from a memory
mapped device to that page on an ungraceful disconnect? I do not know
how to accomplish that, so might just be crazy talk... But if it is
possible, that would be a pretty nifty way to solve this problem.


Re: [PATCH 1/1 v4] drivers/nvme: default to 4k device page size

2015-11-13 Thread Keith Busch
On Thu, Nov 12, 2015 at 11:37:54PM -0800, Christoph Hellwig wrote:
> Jens, Keith: any chance to get this to Linux for 4.4 (and -stable)?

I agreed, looks good to me.

Acked-by: Keith Busch <keith.bu...@intel.com>
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH 1/1 v3] drivers/nvme: default to 4k device page size

2015-11-03 Thread Keith Busch
On Tue, Nov 03, 2015 at 05:18:24AM -0800, Christoph Hellwig wrote:
> On Fri, Oct 30, 2015 at 02:35:11PM -0700, Nishanth Aravamudan wrote:
> > diff --git a/drivers/block/nvme-core.c b/drivers/block/nvme-core.c
> > index ccc0c1f93daa..a9a5285bdb39 100644
> > --- a/drivers/block/nvme-core.c
> > +++ b/drivers/block/nvme-core.c
> > @@ -1717,7 +1717,12 @@ static int nvme_configure_admin_queue(struct 
> > nvme_dev *dev)
> > u32 aqa;
> > u64 cap = readq(>bar->cap);
> > struct nvme_queue *nvmeq;
> > -   unsigned page_shift = PAGE_SHIFT;
> > +   /*
> > +* default to a 4K page size, with the intention to update this
> > +* path in the future to accomodate architectures with differing
> > +* kernel and IO page sizes.
> > +*/
> > +   unsigned page_shift = 12;
> > unsigned dev_page_min = NVME_CAP_MPSMIN(cap) + 12;
> > unsigned dev_page_max = NVME_CAP_MPSMAX(cap) + 12;
> 
> Looks good as a start.  Note that all the MPSMIN/MAX checking could
> be removed as NVMe devices must support 4k pages.

MAX can go, and while it's probably the case that all devices support 4k,
it's not a spec requirement, so we should keep the dev_page_min check.
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH 1/1 v3] drivers/nvme: default to 4k device page size

2015-10-30 Thread Keith Busch
On Fri, Oct 30, 2015 at 02:35:11PM -0700, Nishanth Aravamudan wrote:
> Given that it's 4K just about everywhere by default (and sort of
> implicitly expected to be, I guess), I think I'd prefer we default to
> 4K. That should mitigate the performance impact (I'll ask our IO team to
> do some runs, but since this impacts functionality on some hardware, I
> don't think it's too relevant for now). Unless there are NVMe devcies
> with a MPSMAX < 4K? 

Right, I assumed MPSMIN was always 4k for the same reason you mentioned,
but you can hard code it like you've done in your patch.

The spec defines MPSMAX such that it's impossible to find a device
with MPSMAX < 4k.
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev