from:"Dan Williams"

Re: [PATCH net-next v9 1/2] net: Add a WWAN subsystem

2021-04-07 Thread Dan Williams

On Mon, 2021-04-05 at 11:52 +0200, Loic Poulain wrote:
> This change introduces initial support for a WWAN framework. Given
> the
> complexity and heterogeneity of existing WWAN hardwares and
> interfaces,
> there is no strict definition of what a WWAN device is and how it
> should
> be represented. It's often a collection of multiple devices that
> perform
> the global WWAN feature (netdev, tty, chardev, etc).

Great to see the continued work on this.

Were you intending to follow-up with functionality to group netdevs
with the control ports?  From my quick look at v9 here it only deals
with MHI control ports (diag, QMI, AT, etc) which is great, but not the
full story.

I think that was a big part of the discussion around Johannes' earlier
series since it's often protocol-specific to tie a particular netdev
with a given control port, but that's something that's really necessary
for a good abstract userspace.

Thoughts here? I'd love to see that functionality too.

Thanks!
Dan

> One usual way to expose modem controls and configuration is via high
> level protocols such as the well known AT command protocol, MBIM or
> QMI. The USB modems started to expose them as character devices, and
> user daemons such as ModemManager learnt to use them.
> 
> This initial version adds the concept of WWAN port, which is a
> logical
> pipe to a modem control protocol. The protocols are rawly exposed to
> user via character device, allowing straigthforward support in
> existing
> tools (ModemManager, ofono...). The WWAN core takes care of the
> generic
> part, including character device management, and relies on port
> driver
> operations to receive/submit protocol data.
> 
> Since the different devices exposing protocols for a same WWAN
> hardware
> do not necessarily know about each others (e.g. two different USB
> interfaces, PCI/MHI channel devices...) and can be created/removed in
> different orders, the WWAN core ensures that all WAN ports
> contributing
> to the 'whole' WWAN feature are grouped under the same virtual WWAN
> device, relying on the provided parent device (e.g. mhi controller,
> USB device). It's a 'trick' I copied from Johannes's earlier WWAN
> subsystem proposal.
> 
> This initial version is purposely minimalist, it's essentially moving
> the generic part of the previously proposed mhi_wwan_ctrl driver
> inside
> a common WWAN framework, but the implementation is open and flexible
> enough to allow extension for further drivers.
> 
> Signed-off-by: Loic Poulain 
> ---
>  v2: not part of the series
>  v3: not part of the series
>  v4: Introduce WWAN framework/subsystem
>  v5: Specify WWAN_CORE module name in Kconfig
>  v6: - Move to unified wwan_core.c file
>  - Make wwan_port a device
>  - Get rid of useless various dev lists (use wwan class)
>  - Get rid of useless wwan dev usage counter and mutex
>  - do not expose wwan_create_device, it's indirectly called via
> create_port
>  - Increase minor count to the whole available minor range
>  - private_data as wwan_create_port parameter
>  v7: Fix change log (mixed up 1/2 and 2/2)
>  v8: - Move fops implementation in wwan_core
> (open/read/write/poll/release)
>  - Create wwan_port_ops
>  - Add wwan_port_rx, wwan_port_txoff and wwan_port_txon functions
>  - Fix code comments
>  - skb based TX/RX
>  v9: - Move wwan_port definition to wwan_core.c (private)
>  - Fix checkpatch + cocci(ERR_CAST) issues
>  - Reword commit message
> 
>  drivers/net/Kconfig  |   2 +
>  drivers/net/Makefile |   1 +
>  drivers/net/wwan/Kconfig |  23 ++
>  drivers/net/wwan/Makefile    |   7 +
>  drivers/net/wwan/wwan_core.c | 552
> +++
>  include/linux/wwan.h | 111 +
>  6 files changed, 696 insertions(+)
>  create mode 100644 drivers/net/wwan/Kconfig
>  create mode 100644 drivers/net/wwan/Makefile
>  create mode 100644 drivers/net/wwan/wwan_core.c
>  create mode 100644 include/linux/wwan.h
> 
> diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
> index 5895905..74dc8e24 100644
> --- a/drivers/net/Kconfig
> +++ b/drivers/net/Kconfig
> @@ -502,6 +502,8 @@ source "drivers/net/wan/Kconfig"
>  
>  source "drivers/net/ieee802154/Kconfig"
>  
> +source "drivers/net/wwan/Kconfig"
> +
>  config XEN_NETDEV_FRONTEND
> tristate "Xen network device frontend driver"
> depends on XEN
> diff --git a/drivers/net/Makefile b/drivers/net/Makefile
> index 040e20b..7ffd2d0 100644
> --- a/drivers/net/Makefile
> +++ b/drivers/net/Makefile
> @@ -68,6 +68,7 @@ obj-$(CONFIG_SUNGEM_PHY) += sungem_phy.o
>  obj-$(CONFIG_WAN) += wan/
>  obj-$(CONFIG_WLAN) += wireless/
>  obj-$(CONFIG_IEEE802154) += ieee802154/
> +obj-$(CONFIG_WWAN) += wwan/
>  
>  obj-$(CONFIG_VMXNET3) += vmxnet3/
>  obj-$(CONFIG_XEN_NETDEV_FRONTEND) += xen-netfront.o
> diff --git a/drivers/net/wwan/Kconfig b/drivers/net/wwan/Kconfig
> new file mode 100644
> index 000..fc3f3a1
> --- /dev/null
> +++

Re: [PATCH v10 00/20] dlb: introduce DLB device driver

2021-03-12 Thread Dan Williams

On Fri, Mar 12, 2021 at 1:55 PM Chen, Mike Ximing
 wrote:
>
>
>
> > -Original Message-----
> > From: Dan Williams 
> > Sent: Friday, March 12, 2021 2:18 AM
> > To: Greg KH 
> > Cc: Chen, Mike Ximing ; Netdev 
> > ; David Miller
> > ; Jakub Kicinski ; Arnd Bergmann 
> > ; Pierre-
> > Louis Bossart 
> > Subject: Re: [PATCH v10 00/20] dlb: introduce DLB device driver
> >
> > On Wed, Mar 10, 2021 at 1:02 AM Greg KH  wrote:
> > >
> > > On Wed, Feb 10, 2021 at 11:54:03AM -0600, Mike Ximing Chen wrote:
> > > > Intel DLB is an accelerator for the event-driven programming model of
> > > > DPDK's Event Device Library[2]. The library is used in packet processing
> > > > pipelines that arrange for multi-core scalability, dynamic 
> > > > load-balancing,
> > > > and variety of packet distribution and synchronization schemes
> > >
> > > The more that I look at this driver, the more I think this is a "run
> > > around" the networking stack.  Why are you all adding kernel code to
> > > support DPDK which is an out-of-kernel networking stack?  We can't
> > > support that at all.
> > >
> > > Why not just use the normal networking functionality instead of this
> > > custom char-device-node-monstrosity?
> >
> > Hey Greg,
> >
> > I've come to find out that this driver does not bypass kernel
> > networking, and the kernel functionality I thought it bypassed, IPC /
> > Scheduling, is not even in the picture in the non-accelerated case. So
> > given you and I are both confused by this submission that tells me
> > that the problem space needs to be clarified and assumptions need to
> > be enumerated.
> >
> > > What is missing from todays kernel networking code that requires this
> > > run-around?
> >
> > Yes, first and foremost Mike, what are the kernel infrastructure gaps
> > and pain points that led up to this proposal?
>
> Hi Greg/Dan,
>
> Sorry for the confusion. The cover letter and document did not articulate
> clearly the problem being solved by DLB. We will update the document in
> the next revision.

I'm not sure this answers Greg question about what is missing from
today's kernel implementation?

> In a brief description, Intel DLB is an accelerator that replaces shared 
> memory
> queuing systems. Large modern server-class CPUs,  with local caches
> for each core, tend to incur costly cache misses, cross core snoops
> and contentions.  The impact becomes noticeable at high (messages/sec)
> rates, such as are seen in high throughput packet processing and HPC
> applications. DLB is used in high rate pipelines that require a variety of 
> packet
> distribution & synchronization schemes.  It can be leveraged to accelerate
> user space libraries, such as DPDK eventdev. It could show similar benefits in
> frameworks such as PADATA in the Kernel - if the messaging rate is 
> sufficiently
> high.

Where is PADATA limited by distribution and synchronization overhead?
It's meant for parallelizable work that has minimal communication
between the work units, ordering is about it's only synchronization
overhead, not messaging. It's used for ipsec crypto and page init.
Even potential future bulk work usages that might benefit from PADATA
like like md-raid, ksm, or kcopyd do not have any messaging overhead.

> As can be seen in the following diagram,  DLB operations come into the
> picture only after packets are received by Rx core from the networking
> devices. WCs are the worker cores which process packets distributed by DLB.
> (In case the diagram gets mis-formatted,  please see attached file).
>
>
>   WC1  WC4
>  +-+   ++   +---+  /  \  +---+  /  \  +---+   ++   +-+
>  |NIC  |   |Rx  |   |DLB| /\ |DLB| /\ |DLB|   |Tx  |   |NIC  |
>  |Ports|---|Core|---|   |-WC2|   |-WC5|   |---|Core|---|Ports|
>  +-+   -+   +---+ \/ +---+ \/ +---+   ++   --+
>\  / \  /
>   WC3  WC6
>
> At its heart DLB consists of resources than can be assigned to
> VDEVs/applications in a flexible manner, such as ports, queues, credits to use
> queues, sequence numbers, etc.

All of those objects are managed in userspace today in the unaccelerated case?

> We support up to 16/32 VF/VDEVs (depending
> on version) with SRIOV and SIOV. Role of the kernel driver includes VDEV
> Composition (vdcm module), functional level reset, live migration, error
> handling, power management, and etc..

Need some more specificity here. What about those features requires
the kernel to get involved with a DLB2 specific ABI to manage ports,
queues, credits, sequence numbers, etc...?

Re: [PATCH v10 00/20] dlb: introduce DLB device driver

2021-03-11 Thread Dan Williams

On Wed, Mar 10, 2021 at 1:02 AM Greg KH  wrote:
>
> On Wed, Feb 10, 2021 at 11:54:03AM -0600, Mike Ximing Chen wrote:
> > Intel DLB is an accelerator for the event-driven programming model of
> > DPDK's Event Device Library[2]. The library is used in packet processing
> > pipelines that arrange for multi-core scalability, dynamic load-balancing,
> > and variety of packet distribution and synchronization schemes
>
> The more that I look at this driver, the more I think this is a "run
> around" the networking stack.  Why are you all adding kernel code to
> support DPDK which is an out-of-kernel networking stack?  We can't
> support that at all.
>
> Why not just use the normal networking functionality instead of this
> custom char-device-node-monstrosity?

Hey Greg,

I've come to find out that this driver does not bypass kernel
networking, and the kernel functionality I thought it bypassed, IPC /
Scheduling, is not even in the picture in the non-accelerated case. So
given you and I are both confused by this submission that tells me
that the problem space needs to be clarified and assumptions need to
be enumerated.

> What is missing from todays kernel networking code that requires this
> run-around?

Yes, first and foremost Mike, what are the kernel infrastructure gaps
and pain points that led up to this proposal?

Re: [PATCH v10 14/20] dlb: add start domain ioctl

2021-03-10 Thread Dan Williams

On Wed, Mar 10, 2021 at 12:14 AM Greg KH  wrote:
>
> On Wed, Mar 10, 2021 at 02:45:10AM +, Chen, Mike Ximing wrote:
> >
> >
> > > -Original Message-
> > > From: Greg KH 
> > > On Wed, Feb 10, 2021 at 11:54:17AM -0600, Mike Ximing Chen wrote:
> > > >
> > > > diff --git a/drivers/misc/dlb/dlb_ioctl.c b/drivers/misc/dlb/dlb_ioctl.c
> > > > index 6a311b969643..9b05344f03c8 100644
> > > > --- a/drivers/misc/dlb/dlb_ioctl.c
> > > > +++ b/drivers/misc/dlb/dlb_ioctl.c
> > > > @@ -51,6 +51,7 @@
> > > DLB_DOMAIN_IOCTL_CALLBACK_TEMPLATE(create_ldb_queue)
> > > >  DLB_DOMAIN_IOCTL_CALLBACK_TEMPLATE(create_dir_queue)
> > > >  DLB_DOMAIN_IOCTL_CALLBACK_TEMPLATE(get_ldb_queue_depth)
> > > >  DLB_DOMAIN_IOCTL_CALLBACK_TEMPLATE(get_dir_queue_depth)
> > > > +DLB_DOMAIN_IOCTL_CALLBACK_TEMPLATE(start_domain)
> > > >
> > > > --- a/drivers/misc/dlb/dlb_pf_ops.c
> > > > +++ b/drivers/misc/dlb/dlb_pf_ops.c
> > > > @@ -160,6 +160,14 @@ dlb_pf_create_dir_port(struct dlb_hw *hw, u32 id,
> > > >  resp, false, 0);
> > > >  }
> > > >
> > > > +static int
> > > > +dlb_pf_start_domain(struct dlb_hw *hw, u32 id,
> > > > + struct dlb_start_domain_args *args,
> > > > + struct dlb_cmd_response *resp)
> > > > +{
> > > > + return dlb_hw_start_domain(hw, id, args, resp, false, 0);
> > > > +}
> > > > +
> > > >  static int dlb_pf_get_num_resources(struct dlb_hw *hw,
> > > >   struct dlb_get_num_resources_args *args)
> > > >  {
> > > > @@ -232,6 +240,7 @@ struct dlb_device_ops dlb_pf_ops = {
> > > >   .create_dir_queue = dlb_pf_create_dir_queue,
> > > >   .create_ldb_port = dlb_pf_create_ldb_port,
> > > >   .create_dir_port = dlb_pf_create_dir_port,
> > > > + .start_domain = dlb_pf_start_domain,
> > >
> > > Why do you have a "callback" when you only ever call one function?  Why
> > > is that needed at all?
> > >
> > In our next submission, we are going to add virtual function (VF) support. 
> > The
> > callbacks for VFs are different from those for PF which is what we support 
> > in this
> > submission. We can defer the introduction of  the callback structure to 
> > when we
> > add the VF support. But since we have many callback functions, that approach
> > will generate many changes in then "existing" code. We thought that putting
> > the callback structure in place now would make the job of adding VF support 
> > easier.
> > Is it OK?
>
> No, do not add additional complexity when it is not needed.  It causes
> much more review work and I and no one else have any idea that
> "something might be coming in the future", so please do not make our
> lives harder.
>
> Make it simple, and work, now.  You can always add additional changes
> later, if it is ever needed.
>

Good points Greg, the internal reviews missed this, let me take
another once over before v11.

Re: [PATCH v10 01/20] dlb: add skeleton for DLB driver

2021-03-10 Thread Dan Williams

On Mon, Mar 8, 2021 at 8:53 PM Chen, Mike Ximing
 wrote:
>
>
> > -Original Message-
> > From: Greg KH 
> > On Mon, Mar 08, 2021 at 08:00:00PM +, Chen, Mike Ximing wrote:
> > >
> > > Hi Greg,
> > >
> > > While waiting for the feedback from the networking maintainers, I am
> > > wondering if you have any other comments/suggestions that I  should 
> > > address
> > > in parallel.
> >
> > It's in my "to-review" queue, which is huge at the moment.  But the
> > networking developers review will determine how this should go forward,
> > so I'll just wait for them to get to it.
> >
>
> I see the status of the submission (to netdev)  is now marked as "Not 
> Applicable"
> at netdev's patchwork site
> https://patchwork.kernel.org/project/netdevbpf/list/?series=&submitter=197673&state=*&q=&archive=both&delegate=
>
> Looks like that the patch set is considered as not being networking related. 
> (?)

That's just to clean up their queue of things that need to be pulled
into the net tree. This driver is being requested to just be acked-by
netdev before Greg considers it for drivers/misc/.

Re: [PATCH 07/22] RDMA/irdma: Register an auxiliary driver and implement private channel OPs

2021-02-01 Thread Dan Williams

On Mon, Feb 1, 2021 at 4:40 PM Saleem, Shiraz  wrote:
>
> > Subject: Re: [PATCH 07/22] RDMA/irdma: Register an auxiliary driver and
> > implement private channel OPs
> >
> > On Sat, Jan 30, 2021 at 01:19:36AM +, Saleem, Shiraz wrote:
> > > > Subject: Re: [PATCH 07/22] RDMA/irdma: Register an auxiliary driver
> > > > and implement private channel OPs
> > > >
> > > > On Wed, Jan 27, 2021 at 07:16:41PM -0400, Jason Gunthorpe wrote:
> > > > > On Wed, Jan 27, 2021 at 10:17:56PM +, Saleem, Shiraz wrote:
> > > > >
> > > > > > Even with another core PCI driver, there still needs to be
> > > > > > private communication channel between the aux rdma driver and
> > > > > > this PCI driver to pass things like QoS updates.
> > > > >
> > > > > Data pushed from the core driver to its aux drivers should either
> > > > > be done through new callbacks in a struct device_driver or by
> > > > > having a notifier chain scheme from the core driver.
> > > >
> > > > Right, and internal to driver/core device_lock will protect from
> > > > parallel probe/remove and PCI flows.
> > > >
> > >
> > > OK. We will hold the device_lock while issuing the .ops callbacks from 
> > > core
> > driver.
> > > This should solve our synchronization issue.
> > >
> > > There have been a few discussions in this thread. And I would like to
> > > be clear on what to do.
> > >
> > > So we will,
> > >
> > > 1. Remove .open/.close, .peer_register/.peer_unregister 2. Protect ops
> > > callbacks issued from core driver to the aux driver with device_lock
> >
> > A notifier chain is probably better, honestly.
> >
> > Especially since you don't want to split the netdev side, a notifier chain 
> > can be
> > used by both cases equally.
> >
>
> The device_lock seems to be a simple solution to this synchronization problem.
> May I ask what makes the notifier scheme better to solve this?
>

Only loosely following the arguments here, but one of the requirements
of the driver-op scheme is that the notifying agent needs to know the
target device. With the notifier-chain approach the target device
becomes anonymous to the notifier agent.

Re: [PATCH net-next 3/3] net: mhi: Add mbim proto

2021-02-01 Thread Dan Williams

On Mon, 2021-02-01 at 19:27 +0100, Loic Poulain wrote:
> On Mon, 1 Feb 2021 at 19:17, Dan Williams  wrote:
> > On Fri, 2021-01-29 at 18:21 -0800, Jakub Kicinski wrote:
> > > On Wed, 27 Jan 2021 18:01:17 +0100 Loic Poulain wrote:
> > > > MBIM has initially been specified by USB-IF for transporting
> > > > data
> > > > (IP)
> > > > between a modem and a host over USB. However some modern modems
> > > > also
> > > > support MBIM over PCIe (via MHI). In the same way as
> > > > QMAP(rmnet),
> > > > it
> > > > allows to aggregate IP packets and to perform context
> > > > multiplexing.
> > > > 
> > > > This change adds minimal MBIM support to MHI, allowing to
> > > > support
> > > > MBIM
> > > > only modems. MBIM being based on USB NCM, it reuses some
> > > > helpers
> > > > from
> > > > the USB stack, but the cdc-mbim driver is too USB coupled to be
> > > > reused.
> > > > 
> > > > At some point it would be interesting to move on a factorized
> > > > solution,
> > > > having a generic MBIM network lib or dedicated MBIM netlink
> > > > virtual
> > > > interface support.
> > 
> > What would a kernel-side MBIM netlink interface do?  Just data-
> > plane
> > stuff (like channel setup to create new netdevs), or are you
> > thinking
> > about control-plane stuff like APN definition, radio scans, etc?
> 
> Just the data-plane (mbim encoding/decoding/muxing).

Ah yes :) If so, then fully agree.

But is that really specific to MBIM? eg, same kinds of things happen
for QMI. Johannes referred to a more generic WWAN framework that we had
discussed 1.5+ years ago to address these issues. Might be worth
restarting that, perhaps simplifying, and figuring out the minimal set
of generic bits needed to describe/add/delete a data channel for WWAN
control protocols.
Dan

Re: [PATCH net-next 3/3] net: mhi: Add mbim proto

2021-02-01 Thread Dan Williams

On Fri, 2021-01-29 at 18:21 -0800, Jakub Kicinski wrote:
> On Wed, 27 Jan 2021 18:01:17 +0100 Loic Poulain wrote:
> > MBIM has initially been specified by USB-IF for transporting data
> > (IP)
> > between a modem and a host over USB. However some modern modems
> > also
> > support MBIM over PCIe (via MHI). In the same way as QMAP(rmnet),
> > it
> > allows to aggregate IP packets and to perform context multiplexing.
> > 
> > This change adds minimal MBIM support to MHI, allowing to support
> > MBIM
> > only modems. MBIM being based on USB NCM, it reuses some helpers
> > from
> > the USB stack, but the cdc-mbim driver is too USB coupled to be
> > reused.
> > 
> > At some point it would be interesting to move on a factorized
> > solution,
> > having a generic MBIM network lib or dedicated MBIM netlink virtual
> > interface support.

What would a kernel-side MBIM netlink interface do?  Just data-plane
stuff (like channel setup to create new netdevs), or are you thinking
about control-plane stuff like APN definition, radio scans, etc?

Dan

> > This code has been highly inspired from the mhi_mbim downstream
> > driver
> > (Carl Yin ).
> > 
> > Signed-off-by: Loic Poulain 
> 
> Does the existing MBIM over USB NCM also show up as a netdev?
> 
> Let's CC Dan and Bjorn on MBIM-related code, they may have opinions.
>

Re: [PATCH net-next v2] net: qmi_wwan: Add pass through mode

2021-01-25 Thread Dan Williams

On Mon, 2021-01-25 at 17:00 +0100, Bjørn Mork wrote:
> Dan Williams  writes:
> > On Mon, 2021-01-25 at 00:33 -0700, Subash Abhinov Kasiviswanathan
> > wrote:
> > > Pass through mode is to allow packets in MAP format to be passed
> > > on to the stack. rmnet driver can be used to process and
> > > demultiplex
> > > these packets.
> > > 
> > > Pass through mode can be enabled when the device is in raw ip
> > > mode
> > > only.
> > > Conversely, raw ip mode cannot be disabled when pass through mode
> > > is
> > > enabled.
> > > 
> > > Userspace can use pass through mode in conjunction with rmnet
> > > driver
> > > through the following steps-
> > > 
> > > 1. Enable raw ip mode on qmi_wwan device
> > > 2. Enable pass through mode on qmi_wwan device
> > > 3. Create a rmnet device with qmi_wwan device as real device
> > > using
> > > netlink
> > 
> > This option is module-wide, right?
> > 
> > eg, if there are multiple qmi_wwan-driven devices on the system,
> > *all*
> > of them must use MAP + passthrough, or none of them can, right?
> 
> No, this is a per-netdev setting, just like the raw-ip flag.

Thanks for setting me straight. Glad to hear it.
Dan

Re: [PATCH net-next v2] net: qmi_wwan: Add pass through mode

2021-01-25 Thread Dan Williams

On Mon, 2021-01-25 at 00:33 -0700, Subash Abhinov Kasiviswanathan
wrote:
> Pass through mode is to allow packets in MAP format to be passed
> on to the stack. rmnet driver can be used to process and demultiplex
> these packets.
> 
> Pass through mode can be enabled when the device is in raw ip mode
> only.
> Conversely, raw ip mode cannot be disabled when pass through mode is
> enabled.
> 
> Userspace can use pass through mode in conjunction with rmnet driver
> through the following steps-
> 
> 1. Enable raw ip mode on qmi_wwan device
> 2. Enable pass through mode on qmi_wwan device
> 3. Create a rmnet device with qmi_wwan device as real device using
> netlink

This option is module-wide, right?

eg, if there are multiple qmi_wwan-driven devices on the system, *all*
of them must use MAP + passthrough, or none of them can, right?

There are users that run 2+ devices on the same system, and different
cards. I'm not sure I would assume that all can/would run in the same
mode, unfortunately.

Dan

> Signed-off-by: Subash Abhinov Kasiviswanathan <
> subas...@codeaurora.org>
> ---
> v1->v2: Update commit text and fix the following comments from Bjorn-
> Remove locking as no netdev state change is requried since all the
> configuration is already done in raw_ip_store.
> Check the inverse relationship between raw_ip mode and pass_through
> mode.
> pass_through_mode just sets/resets the flag now.
> raw_ip check is not needed when queueing pass_through mode packets as
> that is enforced already during the mode configuration.
> 
>  drivers/net/usb/qmi_wwan.c | 58
> ++
>  1 file changed, 58 insertions(+)
> 
> diff --git a/drivers/net/usb/qmi_wwan.c b/drivers/net/usb/qmi_wwan.c
> index 7ea113f5..e58a80a 100644
> --- a/drivers/net/usb/qmi_wwan.c
> +++ b/drivers/net/usb/qmi_wwan.c
> @@ -57,6 +57,7 @@ struct qmi_wwan_state {
>  enum qmi_wwan_flags {
>   QMI_WWAN_FLAG_RAWIP = 1 << 0,
>   QMI_WWAN_FLAG_MUX = 1 << 1,
> + QMI_WWAN_FLAG_PASS_THROUGH = 1 << 2,
>  };
>  
>  enum qmi_wwan_quirks {
> @@ -326,6 +327,13 @@ static ssize_t raw_ip_store(struct device
> *d,  struct device_attribute *attr, co
>   if (enable == (info->flags & QMI_WWAN_FLAG_RAWIP))
>   return len;
>  
> + /* ip mode cannot be cleared when pass through mode is set */
> + if (!enable && (info->flags & QMI_WWAN_FLAG_PASS_THROUGH)) {
> + netdev_err(dev->net,
> +"Cannot clear ip mode on pass through
> device\n");
> + return -EINVAL;
> + }
> +
>   if (!rtnl_trylock())
>   return restart_syscall();
>  
> @@ -456,14 +464,59 @@ static ssize_t del_mux_store(struct device
> *d,  struct device_attribute *attr, c
>   return ret;
>  }
>  
> +static ssize_t pass_through_show(struct device *d,
> +  struct device_attribute *attr, char
> *buf)
> +{
> + struct usbnet *dev = netdev_priv(to_net_dev(d));
> + struct qmi_wwan_state *info;
> +
> + info = (void *)&dev->data;
> + return sprintf(buf, "%c\n",
> +info->flags & QMI_WWAN_FLAG_PASS_THROUGH ? 'Y' :
> 'N');
> +}
> +
> +static ssize_t pass_through_store(struct device *d,
> +   struct device_attribute *attr,
> +   const char *buf, size_t len)
> +{
> + struct usbnet *dev = netdev_priv(to_net_dev(d));
> + struct qmi_wwan_state *info;
> + bool enable;
> +
> + if (strtobool(buf, &enable))
> + return -EINVAL;
> +
> + info = (void *)&dev->data;
> +
> + /* no change? */
> + if (enable == (info->flags & QMI_WWAN_FLAG_PASS_THROUGH))
> + return len;
> +
> + /* pass through mode can be set for raw ip devices only */
> + if (!(info->flags & QMI_WWAN_FLAG_RAWIP)) {
> + netdev_err(dev->net,
> +"Cannot set pass through mode on non ip
> device\n");
> + return -EINVAL;
> + }
> +
> + if (enable)
> + info->flags |= QMI_WWAN_FLAG_PASS_THROUGH;
> + else
> + info->flags &= ~QMI_WWAN_FLAG_PASS_THROUGH;
> +
> + return len;
> +}
> +
>  static DEVICE_ATTR_RW(raw_ip);
>  static DEVICE_ATTR_RW(add_mux);
>  static DEVICE_ATTR_RW(del_mux);
> +static DEVICE_ATTR_RW(pass_through);
>  
>  static struct attribute *qmi_wwan_sysfs_attrs[] = {
>   &dev_attr_raw_ip.attr,
>   &dev_attr_add_mux.attr,
>   &dev_attr_del_mux.attr,
> + &dev_attr_pass_through.attr,
>   NULL,
>  };
>  
> @@ -510,6 +563,11 @@ static int qmi_wwan_rx_fixup(struct usbnet *dev,
> struct sk_buff *skb)
>   if (info->flags & QMI_WWAN_FLAG_MUX)
>   return qmimux_rx_fixup(dev, skb);
>  
> + if (info->flags & QMI_WWAN_FLAG_PASS_THROUGH) {
> + skb->protocol = htons(ETH_P_MAP);
> + return (netif_rx(skb) == NET_RX_SUCCESS);
> + }
> +
>   switch (skb->data[0] & 0xf0) {
>   case 0x40:
>   proto = htons(ET

Re: [PATCH 17/18] net: iosm: readme file

2021-01-20 Thread Dan Williams

On Wed, 2021-01-20 at 15:32 -0800, Jakub Kicinski wrote:
> On Wed, 20 Jan 2021 20:34:51 +0100 Andrew Lunn wrote:
> > On Sun, Jan 17, 2021 at 06:26:54PM +0100, Bjørn Mork wrote:
> > > I was young and stupid. Now I'm not that young anymore ;-)  
> > 
> > We all make mistakes, when we don't have the knowledge there are
> > other
> > ways. That is partially what code review is about.
> > 
> > > Never ever imagined that this would be replicated in another
> > > driver,
> > > though.  That doesn't really make much sense.  We have learned by
> > > now,
> > > haven't we?  This subject has been discussed a few times in the
> > > past,
> > > and Johannes summary is my understanding as well:
> > > "I don't think anyone likes that"  
> > 
> > So there seems to be agreement there. But what is not clear, is
> > anybody willing to do the work to fix this, and is there enough
> > ROI.
> > 
> > Do we expect more devices like this? Will 6G, 7G modems look very
> > different? 
> 
> Didn't Intel sell its 5G stuff off to Apple?

Yes, but they kept the ability to continue with 3G/4G hardware and
other stuff.

> > Be real network devices and not need any of this odd stuff?
> > Or will they be just be incrementally better but mostly the same?
> > 
> > I went into the review thinking it was an Ethernet driver, and kept
> > having WTF moments. Now i know it is not an Ethernet driver, i can
> > say
> > it is not my domain, i don't know the field well enough to say if
> > all
> > these hacks are acceptable or not.
> > 
> > It probably needs David and Jakub to set the direction to be
> > followed.
> 
> AFAIU all those cellar modems are relatively slow and FW-heavy, so
> the
> ideal solution IMO is not even a common kernel interface but actually
> a common device interface, like NVMe (or virtio for lack of better
> examples).

That was supposed to be MBIM, but unfortunately those involved didn't
iterate and MBIM got stuck. I don't think we'll see a standard as long
as some vendors are dominant and see no need for it.

Dan

Re: [resend/standalone PATCH v4] Add auxiliary bus support

2021-01-04 Thread Dan Williams

On Mon, Jan 4, 2021 at 5:53 PM Jason Gunthorpe  wrote:
>
> On Mon, Jan 04, 2021 at 04:51:51PM -0800, Dan Williams wrote:
> > On Mon, Jan 4, 2021 at 4:14 PM Jason Gunthorpe  wrote:
> > >
> > > On Mon, Jan 04, 2021 at 09:19:30PM +, Mark Brown wrote:
> > >
> > >
> > > > > Regardless of the shortcut to make everything a struct
> > > > > platform_device, I think it was a mistake to put OF devices on
> > > > > platform_bus. Those should have remained on some of_bus even if they
> > > >
> > > > Like I keep saying the same thing applies to all non-enumerable buses -
> > > > exactly the same considerations exist for all the other buses like I2C
> > > > (including the ACPI naming issue you mention below), and for that matter
> > > > with enumerable buses which can have firmware info.
> > >
> > > And most busses do already have their own bus type. ACPI, I2C, PCI,
> > > etc. It is just a few that have been squished into platform, notably
> > > OF.
> > >
> >
> > I'll note that ACPI is an outlier that places devices on 2 buses,
> > where new acpi_driver instances are discouraged [1] in favor of
> > platform_drivers. ACPI scan handlers are awkwardly integrated into the
> > Linux device model.
> >
> > So while I agree with sentiment that an "ACPI bus" should
> > theoretically stand on its own there is legacy to unwind.
> >
> > I only bring that up to keep the focus on how to organize drivers
> > going forward, because trying to map some of these arguments backwards
> > runs into difficulties.
> >
> > [1]: 
> > http://lore.kernel.org/r/cajz5v0j_rek3agddw7flvmw_7kneccg2u_hukgjzqelcy8s...@mail.gmail.com
>
> Well, this is the exact kind of thing I think we are talking about
> here..
>
> > > It should be split up based on the unique naming scheme and any bus
> > > specific API elements - like raw access to ACPI or OF data or what
> > > have you for other FW bus types.
> >
> > I agree that the pendulum may have swung too far towards "reuse
> > existing bus_type", and auxiliary-bus unwinds some of that, but does
> > the bus_type really want to be an indirection for driver apis outside
> > of bus-specific operations?
>
> If the bus is the "enumeration entity" and we define that things like
> name, resources, gpio's, regulators, etc are a generic part of what is
> enumerated, then it makes sense that the bus would have methods
> to handle those things too.
>
> In other words, the only way to learn what GPIO 'resource' is to ask
> the enumeration mechnism that is providing the bus. If the enumeration
> and bus are 1:1 then you can use a function pointer on the bus type
> instead of open coding a dispatch based on an indirect indication.
>

I get that, but I'm fearing a gigantic bus_ops structure that has
narrow helpers like ->gpio_count() that mean nothing to the many other
clients of the bus. Maybe I'm overestimating the pressure there will
be to widen the ops structure at the bus level.

Re: [resend/standalone PATCH v4] Add auxiliary bus support

2021-01-04 Thread Dan Williams

On Mon, Jan 4, 2021 at 4:14 PM Jason Gunthorpe  wrote:
>
> On Mon, Jan 04, 2021 at 09:19:30PM +, Mark Brown wrote:
>
>
> > > Regardless of the shortcut to make everything a struct
> > > platform_device, I think it was a mistake to put OF devices on
> > > platform_bus. Those should have remained on some of_bus even if they
> >
> > Like I keep saying the same thing applies to all non-enumerable buses -
> > exactly the same considerations exist for all the other buses like I2C
> > (including the ACPI naming issue you mention below), and for that matter
> > with enumerable buses which can have firmware info.
>
> And most busses do already have their own bus type. ACPI, I2C, PCI,
> etc. It is just a few that have been squished into platform, notably
> OF.
>

I'll note that ACPI is an outlier that places devices on 2 buses,
where new acpi_driver instances are discouraged [1] in favor of
platform_drivers. ACPI scan handlers are awkwardly integrated into the
Linux device model.

So while I agree with sentiment that an "ACPI bus" should
theoretically stand on its own there is legacy to unwind.

I only bring that up to keep the focus on how to organize drivers
going forward, because trying to map some of these arguments backwards
runs into difficulties.

[1]: 
http://lore.kernel.org/r/cajz5v0j_rek3agddw7flvmw_7kneccg2u_hukgjzqelcy8s...@mail.gmail.com

> > > are represented by struct platform_device and fiddling in the core
> > > done to make that work OK.
> >
> > What exactly is the fiddling in the core here, I'm a bit unclear?
>
> I'm not sure, but I bet there is a small fall out to making bus_type
> not 1:1 with the struct device type.. Would have to attempt it to see
>
> > > This feels like a good conference topic someday..
> >
> > We should have this discussion *before* we get too far along with trying
> > to implement things, we should at least have some idea where we want to
> > head there.
>
> Well, auxillary bus is clearly following the original bus model
> intention with a dedicated bus type with a controlled naming
> scheme. The debate here seems to be "what about platform bus" and
> "what to do with mfd"?
>
> > Those APIs all take a struct device for lookup so it's the same call for
> > looking things up regardless of the bus the device is on or what
> > firmware the system is using - where there are firmware specific lookup
> > functions they're generally historical and shouldn't be used for new
> > code.  It's generally something in the form
> >
> >   api_type *api_get(struct device *dev, const char *name);
>
> Well, that is a nice improvement since a few years back when I last
> worked on this stuff.
>
> But now it begs the question, why not push harder to make 'struct
> device' the generic universal access point and add some resource_get()
> API along these lines so even a platform_device * isn't needed?
>
> Then the path seems much clearer, add a multi-bus-type device_driver
> that has a probe(struct device *) and uses the 'universal api_get()'
> style interface to find the generic 'resources'.
>
> The actual bus types and bus structs can then be split properly
> without the boilerplate that caused them all to be merged to platform,
> even PCI could be substantially merged like this.
>
> Bonus points to replace the open coded method disptach:
>
> int gpiod_count(struct device *dev, const char *con_id)
> {
> int count = -ENOENT;
>
> if (IS_ENABLED(CONFIG_OF) && dev && dev->of_node)
> count = of_gpio_get_count(dev, con_id);
> else if (IS_ENABLED(CONFIG_ACPI) && dev && ACPI_HANDLE(dev))
> count = acpi_gpio_count(dev, con_id);
>
> if (count < 0)
> count = platform_gpio_count(dev, con_id);
>
> With an actual bus specific virtual function:
>
> return dev->bus->gpio_count(dev);
>
> > ...and then do the same thing for every other bus with firmware
> > bindings.  If it's about the firmware interfaces it really isn't a
> > platform bus specific thing.  It's not clear to me if that's what it is
> > though or if this is just some tangent.
>
> It should be split up based on the unique naming scheme and any bus
> specific API elements - like raw access to ACPI or OF data or what
> have you for other FW bus types.

I agree that the pendulum may have swung too far towards "reuse
existing bus_type", and auxiliary-bus unwinds some of that, but does
the bus_type really want to be an indirection for driver apis outside
of bus-specific operations?

Re: [resend/standalone PATCH v4] Add auxiliary bus support

2020-12-18 Thread Dan Williams

On Fri, Dec 18, 2020 at 1:17 PM Alexandre Belloni
 wrote:
>
> On 18/12/2020 16:58:56-0400, Jason Gunthorpe wrote:
> > On Fri, Dec 18, 2020 at 08:32:11PM +, Mark Brown wrote:
> >
> > > > So, I strongly suspect, MFD should create mfd devices on a MFD bus
> > > > type.
> > >
> > > Historically people did try to create custom bus types, as I have
> > > pointed out before there was then pushback that these were duplicating
> > > the platform bus so everything uses platform bus.
> >
> > Yes, I vaugely remember..
> >
> > I don't know what to say, it seems Greg doesn't share this view of
> > platform devices as a universal device.
> >
> > Reading between the lines, I suppose things would have been happier
> > with some kind of inheritance scheme where platform device remained as
> > only instantiated directly in board files, while drivers could bind to
> > OF/DT/ACPI/FPGA/etc device instantiations with minimal duplication &
> > boilerplate.
> >
> > And maybe that is exactly what we have today with platform devices,
> > though the name is now unfortunate.
> >
> > > I can't tell the difference between what it's doing and what SOF is
> > > doing, the code I've seen is just looking at the system it's running
> > > on and registering a fixed set of client devices.  It looks slightly
> > > different because it's registering a device at a time with some wrapper
> > > functions involved but that's what the code actually does.
> >
> > SOF's aux bus usage in general seems weird to me, but if you think
> > it fits the mfd scheme of primarily describing HW to partition vs
> > describing a SW API then maybe it should use mfd.
> >
> > The only problem with mfd as far as SOF is concerned was Greg was not
> > happy when he saw PCI stuff in the MFD subsystem.
> >
>
> But then again, what about non-enumerable devices on the PCI device? I
> feel this would exactly fit MFD. This is a collection of IPs that exist
> as standalone but in this case are grouped in a single device.
>
> Note that I then have another issue because the kernel doesn't support
> irq controllers on PCI and this is exactly what my SoC has. But for now,
> I can just duplicate the irqchip driver in the MFD driver.
>
> > This whole thing started when Intel first proposed to directly create
> > platform_device's in their ethernet driver and Greg had a quite strong
> > NAK to that.
>
> Let me point to drivers/net/ethernet/cadence/macb_pci.c which is a
> fairly recent example. It does exactly that and I'm not sure you could
> do it otherwise while still not having to duplicate most of macb_probe.
>

This still feels an orthogonal example to the problem auxiliary-bus is
solving. If a platform-device and a pci-device surface an IP with a
shared programming model that's an argument for a shared library, like
libata to house the commonality. In contrast auxiliary-bus is a
software model for software-defined sub-functionality to be wrapped in
a driver model. It assumes a parent-device / parent-driver hierarchy
that platform-bus and pci-bus do not imply.

Re: [resend/standalone PATCH v4] Add auxiliary bus support

2020-12-17 Thread Dan Williams

On Thu, Dec 17, 2020 at 1:20 PM Alexandre Belloni
 wrote:
>
> Hello,
>
> On 05/12/2020 16:51:36+0100, Greg KH wrote:
> > > To me, the documentation was written, and reviewed, more from the
> > > perspective of "why not open code a custom bus instead". So I can see
> > > after the fact how that is a bit too much theory and justification and
> > > not enough practical application. Before the fact though this was a
> > > bold mechanism to propose and it was not clear that everyone was
> > > grokking the "why" and the tradeoffs.
> >
> > Understood, I guess I read this from the "of course you should do this,
> > now how do I use it?" point of view.  Which still needs to be addressed
> > I feel.
> >
> > > I also think it was a bit early to identify consistent design patterns
> > > across the implementations and codify those. I expect this to evolve
> > > convenience macros just like other parts of the driver-core gained
> > > over time. Now that it is in though, another pass through the
> > > documentation to pull in more examples seems warranted.
> >
> > A real, working, example would be great to have, so that people can know
> > how to use this.  Trying to dig through the sound or IB patches to view
> > how it is being used is not a trivial thing to do, which is why
> > reviewing this took so much work.  Having a simple example test module,
> > that creates a number of devices on a bus, ideally tied into the ktest
> > framework, would be great.  I'll attach below a .c file that I used for
> > some basic local testing to verify some of this working, but it does not
> > implement a aux bus driver, which needs to be also tested.
> >
>
> There is something I don't get from the documentation and it is what is
> this introducing that couldn't already be done using platform drivers
> and platform devices?

There is room for documentation improvement here. I realize reading it
back now that much of the justification for "why not platform bus?"
happened on the list, but only a small mention made it into the
document. It turns out that platform-bus has some special integrations
and hacks with platform-firmware implementations. For example, the
ACPI companion magic and specific platform firmware integrations in
platform_match(). It's also an awkward bus name to use because these
devices do not belong to the platform. The platform bus is for devices
that do not have an enumeration mechanism besides board files or
firmware descriptions.

So while many of the auxiliary device use cases might be able to be
squeezed into a platform-bus scheme it further overloads what is
already a wide responsibility.

In comparison, the auxiliary-bus is tailored to the "sub-function of a
parent device/driver" use case. It lets the host driver be the root of
a namespace of sub-functionality in a standard template way.

> We already have a bunch of drivers in tree that have to share a state
> and register other drivers from other subsystems for the same device.
> How is the auxiliary bus different?

There's also custom subsystem buses that do this. Why not other
alternatives? They didn't capture the simultaneous mindshare of RDMA,
SOF, and NETDEV developers. Personally my plans for using
auxiliary-bus do not map cleanly to anything else in the tree. I want
to use it for attaching an NPEM driver (Native PCIE Enclosure
Management) to any PCI device driver that opts-in, but it would be
overkill to go create an "npem" bus for this.

Re: [PATCH v17 3/3] bus: mhi: Add userspace client interface driver

2020-12-11 Thread Dan Williams

On Fri, 2020-12-11 at 08:44 +0100, Greg KH wrote:
> On Thu, Dec 10, 2020 at 11:04:11PM -0800, Hemant Kumar wrote:
> > This MHI client driver allows userspace clients to transfer
> > raw data between MHI device and host using standard file
> > operations.
> > Driver instantiates UCI device object which is associated to device
> > file node. UCI device object instantiates UCI channel object when
> > device
> > file node is opened. UCI channel object is used to manage MHI
> > channels
> > by calling MHI core APIs for read and write operations. MHI
> > channels
> > are started as part of device open(). MHI channels remain in start
> > state until last release() is called on UCI device file node.
> > Device
> > file node is created with format
> > 
> > /dev/
> > 
> > Currently it supports QMI channel. libqmi is userspace MHI client
> > which
> > communicates to a QMI service using QMI channel. libqmi is a glib-
> > based
> > library for talking to WWAN modems and devices which speaks QMI
> > protocol.
> > For more information about libqmi please refer
> > https://www.freedesktop.org/wiki/Software/libqmi/
> 
> This says _what_ this is doing, but not _why_.
> 
> Why do you want to circumvent the normal user/kernel apis for this
> type
> of device and move the normal network handling logic out to
> userspace?
> What does that help with?  What does the current in-kernel api lack
> that
> this userspace interface is going to solve, and why can't the in-
> kernel
> api solve it instead?
> 
> You are pushing a common user/kernel api out of the kernel here, to
> become very device-specific, with no apparent justification as to why
> this is happening.
> 
> Also, because you are going around the existing network api, I will
> need
> the networking maintainers to ack this type of patch.

Just to re-iterate: QMI ~= AT commands ~= MBIM (not quite, but same
level)

We already do QMI-over-USB, or AT-over-CDC-ACM. This is QMI-over-MHI.

It's not networking data plane. It's WWAN device configuration.

There are no current kernel APIs for this, and I really don't think we
want there to be. The API surface is *huge* and we definitely don't
want that in-kernel.

Dan

Re: [resend/standalone PATCH v4] Add auxiliary bus support

2020-12-05 Thread Dan Williams

On Sat, Dec 5, 2020 at 4:24 PM David Ahern  wrote:
>
> On 12/2/20 5:54 PM, Dan Williams wrote:
> > diff --git a/drivers/base/Kconfig b/drivers/base/Kconfig
> > index 8d7001712062..040be48ce046 100644
> > --- a/drivers/base/Kconfig
> > +++ b/drivers/base/Kconfig
> > @@ -1,6 +1,9 @@
> >  # SPDX-License-Identifier: GPL-2.0
> >  menu "Generic Driver Options"
> >
> > +config AUXILIARY_BUS
> > + bool
> > +
> >  config UEVENT_HELPER
> >   bool "Support for uevent helper"
> >   help
>
> Missing a description and without it does not appear in menuconfig or in
> the config file.
>
> Could use a blurb in the help as well.

It doesn't have a description or help because it is a select-only
symbol, but a comment to that effect and a pointer to the
documentation would help.

Re: [resend/standalone PATCH v4] Add auxiliary bus support

2020-12-04 Thread Dan Williams

On Fri, Dec 4, 2020 at 3:41 AM Greg KH  wrote:
>
> On Wed, Dec 02, 2020 at 04:54:24PM -0800, Dan Williams wrote:
> > From: Dave Ertman 
> >
> > Add support for the Auxiliary Bus, auxiliary_device and auxiliary_driver.
> > It enables drivers to create an auxiliary_device and bind an
> > auxiliary_driver to it.
> >
> > The bus supports probe/remove shutdown and suspend/resume callbacks.
> > Each auxiliary_device has a unique string based id; driver binds to
> > an auxiliary_device based on this id through the bus.
> >
> > Co-developed-by: Kiran Patil 
> > Co-developed-by: Ranjani Sridharan 
> > Co-developed-by: Fred Oh 
> > Co-developed-by: Leon Romanovsky 
> > Signed-off-by: Kiran Patil 
> > Signed-off-by: Ranjani Sridharan 
> > Signed-off-by: Fred Oh 
> > Signed-off-by: Leon Romanovsky 
> > Signed-off-by: Dave Ertman 
> > Reviewed-by: Pierre-Louis Bossart 
> > Reviewed-by: Shiraz Saleem 
> > Reviewed-by: Parav Pandit 
> > Reviewed-by: Dan Williams 
> > Reviewed-by: Martin Habets 
> > Link: 
> > https://lore.kernel.org/r/20201113161859.1775473-2-david.m.ert...@intel.com
> > Signed-off-by: Dan Williams 
> > ---
> > This patch is "To:" the maintainers that have a pending backlog of
> > driver updates dependent on this facility, and "Cc:" Greg. Greg, I
> > understand you have asked for more time to fully review this and apply
> > it to driver-core.git, likely for v5.12, but please consider Acking it
> > for v5.11 instead. It looks good to me and several other stakeholders.
> > Namely, stakeholders that have pressure building up behind this facility
> > in particular Mellanox RDMA, but also SOF, Intel Ethernet, and later on
> > Compute Express Link.
> >
> > I will take the blame for the 2 months of silence that made this awkward
> > to take through driver-core.git, but at the same time I do not want to
> > see that communication mistake inconvenience other parties that
> > reasonably thought this was shaping up to land in v5.11.
> >
> > I am willing to host this version at:
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/djbw/linux 
> > tags/auxiliary-bus-for-5.11
> >
> > ...for all the independent drivers to have a common commit baseline. It
> > is not there yet pending Greg's Ack.
> >
> > For example implementations incorporating this patch, see Dave Ertman's
> > SOF series:
> >
> > https://lore.kernel.org/r/20201113161859.1775473-2-david.m.ert...@intel.com
> >
> > ...and Leon's mlx5 series:
> >
> > http://lore.kernel.org/r/20201026111849.1035786-1-l...@kernel.org
> >
> > PS: Greg I know I promised some review on newcomer patches to help with
> > your queue, unfortunately Intel-internal review is keeping my plate
> > full. Again, I do not want other stakeholder to be waiting on me to
> > resolve that backlog.
>
> Ok, I spent some hours today playing around with this.  I wrote up a
> small test-patch for this (how did anyone test this thing???) and while
> it feels awkward in places, and it feels like there is still way too
> much "boilerplate" code that a user has to write and manage, I don't
> have the time myself to fix it up right now.
>
> So I'll go apply this to my tree, and provide a tag for everyone else to
> be able to pull from for their different development trees so they can
> work on.
>
> I do have 3 follow-on patches that I will send to the list in response
> to this message that I will be applying on top of this patch.  They do
> some minor code formatting changes, as well as change the return type of
> the remove function to make it more future-proof.  That last change will
> require users of this code to change their implementations, but it will
> be obvious what to do as you will get a build warning.
>
> Note, I'm still not comfortable with a few things here.  The
> documentation feels odd, and didn't really help me out in writing any
> test code, which doesn't seem right.  Also the use of strings and '.' as
> part of the api feels awkward, and messy, and of course, totally
> undocumented.
>
> But, as the use of '.' is undocumented, that means we can change it in
> the future!  Because no driver or device name should ever be a user api
> reliant thing, if we come up with a better way to do all of this in the
> future, that shouldn't be a problem to change existing users over to
> this.  So this is a warning to everyone, you CAN NOT depend on the sysfs
> name of a device or bus name for any tool.  If so, your userspace tool

Re: [resend/standalone PATCH v4] Add auxiliary bus support

2020-12-03 Thread Dan Williams

On Thu, Dec 3, 2020 at 6:34 PM Jason Gunthorpe  wrote:
>
> On Thu, Dec 03, 2020 at 04:06:24PM +0100, Greg KH wrote:
>
> > > ...for all the independent drivers to have a common commit baseline. It
> > > is not there yet pending Greg's Ack.
> >
> > I have been trying to carve out some time to review this.  At my initial
> > glance, I still have objections, so please, give me a few more days to
> > get this done...
>
> There are still several more days till the merge window, but I am
> going to ask Leon to get the mlx5 series, and this version of the
> auxbus patch it depends on, into linux-next with the intention to
> forward it to Linus if there are no substantive comments.
>
> Regardless of fault or reason this whole 1.5 year odyssey seems to have
> brought misery to everyone involved and it really is time to move on.
>
> Leon and his team did a good deed 6 weeks ago to quickly turn around
> and build another user example. For their efforts they have been
> rewarded with major merge conflicts and alot of delayed work due to
> the invasive nature of the mlx5 changes. To continue to push this out
> is disrespectful to him and his team's efforts.
>
> A major part of my time as RDMA maintainer has been to bring things
> away from vendor trees and into a common opensource community.  Intel
> shipping a large out of tree RDMA driver and abandoning their intree
> driver is really harmful. This auxbus is a substantial blocker to them
> normalizing their operations, thus I view it as important to
> resolve. Even after this it is going to take a long time and alot of
> effort to review their new RDMA driver.

When you have 3 independent driver teams (mlx5, i40e, sof) across 2
companies (NVIDIA and Intel), and multiple subsystem maintainers with
a positive track record of upstream engagement all agreeing on a piece
of infrastructure, I struggle to imagine a stronger case for merging.
I did get word of a fixup needed in the shutdown code, I'll get that
folded. Then, barring a concrete objection, I'll look to publish a
commit that others can pull in to start building soak time in -next
this time tomorrow.

[resend/standalone PATCH v4] Add auxiliary bus support

2020-12-02 Thread Dan Williams

From: Dave Ertman 

Add support for the Auxiliary Bus, auxiliary_device and auxiliary_driver.
It enables drivers to create an auxiliary_device and bind an
auxiliary_driver to it.

The bus supports probe/remove shutdown and suspend/resume callbacks.
Each auxiliary_device has a unique string based id; driver binds to
an auxiliary_device based on this id through the bus.

Co-developed-by: Kiran Patil 
Co-developed-by: Ranjani Sridharan 
Co-developed-by: Fred Oh 
Co-developed-by: Leon Romanovsky 
Signed-off-by: Kiran Patil 
Signed-off-by: Ranjani Sridharan 
Signed-off-by: Fred Oh 
Signed-off-by: Leon Romanovsky 
Signed-off-by: Dave Ertman 
Reviewed-by: Pierre-Louis Bossart 
Reviewed-by: Shiraz Saleem 
Reviewed-by: Parav Pandit 
Reviewed-by: Dan Williams 
Reviewed-by: Martin Habets 
Link: 
https://lore.kernel.org/r/20201113161859.1775473-2-david.m.ert...@intel.com
Signed-off-by: Dan Williams 
---
This patch is "To:" the maintainers that have a pending backlog of
driver updates dependent on this facility, and "Cc:" Greg. Greg, I
understand you have asked for more time to fully review this and apply
it to driver-core.git, likely for v5.12, but please consider Acking it
for v5.11 instead. It looks good to me and several other stakeholders.
Namely, stakeholders that have pressure building up behind this facility
in particular Mellanox RDMA, but also SOF, Intel Ethernet, and later on
Compute Express Link.

I will take the blame for the 2 months of silence that made this awkward
to take through driver-core.git, but at the same time I do not want to
see that communication mistake inconvenience other parties that
reasonably thought this was shaping up to land in v5.11.

I am willing to host this version at:

git://git.kernel.org/pub/scm/linux/kernel/git/djbw/linux 
tags/auxiliary-bus-for-5.11

...for all the independent drivers to have a common commit baseline. It
is not there yet pending Greg's Ack.

For example implementations incorporating this patch, see Dave Ertman's
SOF series:

https://lore.kernel.org/r/20201113161859.1775473-2-david.m.ert...@intel.com

...and Leon's mlx5 series:

http://lore.kernel.org/r/20201026111849.1035786-1-l...@kernel.org

PS: Greg I know I promised some review on newcomer patches to help with
your queue, unfortunately Intel-internal review is keeping my plate
full. Again, I do not want other stakeholder to be waiting on me to
resolve that backlog.

 Documentation/driver-api/auxiliary_bus.rst |  234 
 Documentation/driver-api/index.rst |1 
 drivers/base/Kconfig   |3 
 drivers/base/Makefile  |1 
 drivers/base/auxiliary.c   |  268 
 include/linux/auxiliary_bus.h  |   78 
 include/linux/mod_devicetable.h|8 +
 scripts/mod/devicetable-offsets.c  |3 
 scripts/mod/file2alias.c   |8 +
 9 files changed, 604 insertions(+)
 create mode 100644 Documentation/driver-api/auxiliary_bus.rst
 create mode 100644 drivers/base/auxiliary.c
 create mode 100644 include/linux/auxiliary_bus.h

diff --git a/Documentation/driver-api/auxiliary_bus.rst 
b/Documentation/driver-api/auxiliary_bus.rst
new file mode 100644
index ..5dd7804631ef
--- /dev/null
+++ b/Documentation/driver-api/auxiliary_bus.rst
@@ -0,0 +1,234 @@
+.. SPDX-License-Identifier: GPL-2.0-only
+
+=
+Auxiliary Bus
+=
+
+In some subsystems, the functionality of the core device (PCI/ACPI/other) is
+too complex for a single device to be managed by a monolithic driver
+(e.g. Sound Open Firmware), multiple devices might implement a common
+intersection of functionality (e.g. NICs + RDMA), or a driver may want to
+export an interface for another subsystem to drive (e.g. SIOV Physical Function
+export Virtual Function management).  A split of the functinoality into child-
+devices representing sub-domains of functionality makes it possible to
+compartmentalize, layer, and distribute domain-specific concerns via a Linux
+device-driver model.
+
+An example for this kind of requirement is the audio subsystem where a single
+IP is handling multiple entities such as HDMI, Soundwire, local devices such as
+mics/speakers etc. The split for the core's functionality can be arbitrary or
+be defined by the DSP firmware topology and include hooks for test/debug. This
+allows for the audio core device to be minimal and focused on hardware-specific
+control and communication.
+
+Each auxiliary_device represents a part of its parent functionality. The
+generic behavior can be extended and specialized as needed by encapsulating an
+auxiliary_device within other domain-specific structures and the use of .ops
+callbacks. Devices on the auxiliary bus do not share any structures and the use
+of a communication channel with the parent is domain-specific.
+
+Note that ops are intended as a way to augment instan

Re: [PATCH v4 01/10] Add auxiliary bus support

2020-11-17 Thread Dan Williams

On Mon, Nov 16, 2020 at 11:02 PM Greg KH  wrote:
>
> On Tue, Nov 17, 2020 at 07:30:00AM +0200, Leon Romanovsky wrote:
> > On Fri, Nov 13, 2020 at 08:18:50AM -0800, Dave Ertman wrote:
> > > Add support for the Auxiliary Bus, auxiliary_device and auxiliary_driver.
> > > It enables drivers to create an auxiliary_device and bind an
> > > auxiliary_driver to it.
> > >
> > > The bus supports probe/remove shutdown and suspend/resume callbacks.
> > > Each auxiliary_device has a unique string based id; driver binds to
> > > an auxiliary_device based on this id through the bus.
> > >
> > > Co-developed-by: Kiran Patil 
> > > Signed-off-by: Kiran Patil 
> > > Co-developed-by: Ranjani Sridharan 
> > > Signed-off-by: Ranjani Sridharan 
> > > Co-developed-by: Fred Oh 
> > > Signed-off-by: Fred Oh 
> > > Co-developed-by: Leon Romanovsky 
> > > Signed-off-by: Leon Romanovsky 
> > > Reviewed-by: Pierre-Louis Bossart 
> > > Reviewed-by: Shiraz Saleem 
> > > Reviewed-by: Parav Pandit 
> > > Reviewed-by: Dan Williams 
> > > Signed-off-by: Dave Ertman 
> > > ---
> >
> > Greg,
> >
> > This horse was beaten to death, can we please progress with this patch?
> > Create special topic branch or ack so I'll prepare this branch.
> >
> > We are in -rc4 now and we (Mellanox) can't hold our submissions anymore.
> > My mlx5_core probe patches [1] were too intrusive and they are ready to
> > be merged, Parav's patches got positive review as well [2] and will be
> > taken next.
> >
> > We delayed and have in our internal queues the patches for VDPA, eswitch
> > and followup for mlx5_core probe rework, but trapped due to this AUX bus
> > patch.
>
> There are no deadlines for kernel patches here, sorry.  Give me some
> time to properly review this, core kernel changes should not be rushed.
>
> Also, if you really want to blame someone for the delay, look at the
> patch submitters not the reviewers, as they are the ones that took a
> very long time with this over the lifecycle of this patchset, not me.  I
> have provided many "instant" reviews of this patchset, and then months
> went by between updates from them.

Please stop this finger pointing. It was already noted that the team,
out of abundance of caution / deference to the process, decided not to
push the patches while I was out on family leave. It's cruel to hold
that against them, and if anyone is to blame it's me for not
clarifying it was ok to proceed while I was out.

Re: [PATCH net-next] net-loopback: allow lo dev initial state to be controlled

2020-11-12 Thread Dan Williams

On Thu, 2020-11-12 at 17:08 +0100, Andrew Lunn wrote:
> On Wed, Nov 11, 2020 at 12:43:08PM -0800, Jian Yang wrote:
> > From: Mahesh Bandewar 
> > 
> > Traditionally loopback devices comes up with initial state as DOWN
> > for
> > any new network-namespace. This would mean that anyone needing this
> > device (which is mostly true except sandboxes where networking in
> > not
> > needed at all), would have to bring this UP by issuing something
> > like
> > 'ip link set lo up' which can be avoided if the initial state can
> > be set
> > as UP.
> 
> How useful is lo if it is up, but has no IP address? I don't think
> this change adds the IP addresses does it? So you still need
> something
> inside your netns to add the IP addresses? Which seems to make this
> change pointless?

lo gets addresses automatically these days, no?

$ ip netns add blue
$ ip netns exec blue ip addr
1: lo:  mtu 65536 qdisc noop state DOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
$ ip netns exec blue ip link set lo up
$ ip netns exec blue ip addr
1: lo:  mtu 65536 qdisc noqueue state UNKNOWN group 
default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
   valid_lft forever preferred_lft forever
inet6 ::1/128 scope host 
   valid_lft forever preferred_lft forever

Dan

Re: [PATCH v3 01/10] Add auxiliary bus support

2020-11-05 Thread Dan Williams

On Thu, Nov 5, 2020 at 11:28 AM Ertman, David M
 wrote:
[..]
> > > Each auxiliary_device represents a part of its parent
> > > +functionality. The generic behavior can be extended and specialized as
> > needed
> > > +by encapsulating an auxiliary_device within other domain-specific
> > structures and
> > > +the use of .ops callbacks. Devices on the auxiliary bus do not share any
> > > +structures and the use of a communication channel with the parent is
> > > +domain-specific.
> >
> > Should there be any guidance here on when to use ops and when to just
> > export functions from parent driver to child. EXPORT_SYMBOL_NS() seems
> > a perfect fit for publishing shared routines between parent and child.
> >
>
> I would leave this up the driver writers to determine what is best for them.

I think there is a pathological case that can be avoided with a
statement like the following:

"Note that ops are intended as a way to augment instance behavior
within a class of auxiliary devices, it is not the mechanism for
exporting common infrastructure from the parent. Consider
EXPORT_SYMBOL_NS() to convey infrastructure from the parent module to
the auxiliary module(s)."

As for your other dispositions of the feedback, looks good to me.

Re: [PATCH v3 01/10] Add auxiliary bus support

2020-11-05 Thread Dan Williams

On Thu, Nov 5, 2020 at 11:30 AM Leon Romanovsky  wrote:
>
> On Thu, Nov 05, 2020 at 09:12:51AM -0800, Dan Williams wrote:
> > On Thu, Nov 5, 2020 at 1:47 AM Leon Romanovsky  wrote:
> > >
> > > On Thu, Nov 05, 2020 at 01:19:09AM -0800, Dan Williams wrote:
> > > > Some doc fixups, and minor code feedback. Otherwise looks good to me.
> > > >
> > > > On Thu, Oct 22, 2020 at 5:35 PM Dave Ertman  
> > > > wrote:
> > > > >
> > >
> > > <...>
> > >
> > > > >
> > > > > +config AUXILIARY_BUS
> > > > > +   bool
> > > >
> > > > tristate? Unless you need non-exported symbols, might as well let this
> > > > be a module.
> > >
> > > I asked it to be "bool", because bus as a module is an invitation for
> > > a disaster. For example if I compile-in mlx5 which is based on this bus,
> > > and won't add auxiliary_bus as a module to initramfs, the system won't 
> > > boot.
> >
> > Something is broken if module dependencies don't arrange for
> > auxiliary_bus.ko to be added to the initramfs automatically, but yes,
> > it is another degree of freedom for something to go wrong if you build
> > the initramfs by hand.
>
> And this is something that I would like to avoid for now.

Fair enough.

>
> >
> > >
> > > <...>
> > >
> > > >
> > > > Per above SPDX is v2 only, so...
> > >
> > > Isn't it default for the Linux kernel?
> >
> > SPDX eliminated the need to guess a default, and MODULE_LICENSE("GPL")
> > implies the "or later" language. The only default assumption is that
> > the license is GPL v2 compatible, those possibilities are myriad, but
> > v2-only is the first preference.
>
> I mean that plain GPL == GPL v2 in the kernel.

You are right, I was wrong.

Re: [PATCH v3 01/10] Add auxiliary bus support

2020-11-05 Thread Dan Williams

On Thu, Nov 5, 2020 at 11:40 AM Parav Pandit  wrote:
>
>
>
> > From: Ertman, David M 
> > Sent: Friday, November 6, 2020 12:58 AM
> > Subject: RE: [PATCH v3 01/10] Add auxiliary bus support
> >
> > > -Original Message-
> > > From: Dan Williams 
> > > Sent: Thursday, November 5, 2020 1:19 AM
> > >
>
> [..]
> > > > +
> > > > +Another use case is for the PCI device to be split out into
> > > > +multiple sub functions.  For each sub function an auxiliary_device
> > > > +will be created.  A PCI sub function driver will bind to such
> > > > +devices that will create its own one or more class devices.  A PCI
> > > > +sub function auxiliary device will likely be contained in a struct
> > > > +with additional attributes such as user defined sub function number
> > > > +and optional attributes such as resources and a link to
> > > the
> > > > +parent device.  These attributes could be used by systemd/udev; and
> > > hence should
> > > > +be initialized before a driver binds to an auxiliary_device.
> > >
> > > This does not read like an explicit example like the previous 2. Did
> > > you have something specific in mind?
> > >
> >
> > This was added by request of Parav.
> >
> This example describes the mlx5 PCI subfunction use case.
> I didn't follow your question about 'explicit example'.
> What part is missing to identify it as explicit example?

Specifically listing "mlx5" so if someone reading this document thinks
to themselves "hey mlx5 sounds like my use case" they can go grep for
that.

Re: [PATCH v3 01/10] Add auxiliary bus support

2020-11-05 Thread Dan Williams

On Thu, Nov 5, 2020 at 12:21 PM Greg KH  wrote:
>
> On Thu, Nov 05, 2020 at 09:12:51AM -0800, Dan Williams wrote:
> > > >
> > > > Per above SPDX is v2 only, so...
> > >
> > > Isn't it default for the Linux kernel?
> >
> > SPDX eliminated the need to guess a default, and MODULE_LICENSE("GPL")
> > implies the "or later" language. The only default assumption is that
> > the license is GPL v2 compatible, those possibilities are myriad, but
> > v2-only is the first preference.
>
> No, MODULE_LICENSE("GPL") does not imply "or later" at all.  Please see
> include/linux/module.h, it means "GPL version 2".
>

Oh, I did, and stopped before getting to:

  "only/or later" distinction is completely irrelevant and does neither
 *replace the proper license identifiers in the corresponding source file

...sorry for the noise.

Re: [PATCH v3 01/10] Add auxiliary bus support

2020-11-05 Thread Dan Williams

On Thu, Nov 5, 2020 at 1:47 AM Leon Romanovsky  wrote:
>
> On Thu, Nov 05, 2020 at 01:19:09AM -0800, Dan Williams wrote:
> > Some doc fixups, and minor code feedback. Otherwise looks good to me.
> >
> > On Thu, Oct 22, 2020 at 5:35 PM Dave Ertman  
> > wrote:
> > >
>
> <...>
>
> > >
> > > +config AUXILIARY_BUS
> > > +   bool
> >
> > tristate? Unless you need non-exported symbols, might as well let this
> > be a module.
>
> I asked it to be "bool", because bus as a module is an invitation for
> a disaster. For example if I compile-in mlx5 which is based on this bus,
> and won't add auxiliary_bus as a module to initramfs, the system won't boot.

Something is broken if module dependencies don't arrange for
auxiliary_bus.ko to be added to the initramfs automatically, but yes,
it is another degree of freedom for something to go wrong if you build
the initramfs by hand.

>
> <...>
>
> >
> > Per above SPDX is v2 only, so...
>
> Isn't it default for the Linux kernel?

SPDX eliminated the need to guess a default, and MODULE_LICENSE("GPL")
implies the "or later" language. The only default assumption is that
the license is GPL v2 compatible, those possibilities are myriad, but
v2-only is the first preference.

Re: [PATCH v3 01/10] Add auxiliary bus support

2020-11-05 Thread Dan Williams

Some doc fixups, and minor code feedback. Otherwise looks good to me.

On Thu, Oct 22, 2020 at 5:35 PM Dave Ertman  wrote:
>
> Add support for the Auxiliary Bus, auxiliary_device and auxiliary_driver.
> It enables drivers to create an auxiliary_device and bind an
> auxiliary_driver to it.
>
> The bus supports probe/remove shutdown and suspend/resume callbacks.
> Each auxiliary_device has a unique string based id; driver binds to
> an auxiliary_device based on this id through the bus.
>
> Co-developed-by: Kiran Patil 
> Signed-off-by: Kiran Patil 
> Co-developed-by: Ranjani Sridharan 
> Signed-off-by: Ranjani Sridharan 
> Co-developed-by: Fred Oh 
> Signed-off-by: Fred Oh 
> Co-developed-by: Leon Romanovsky 
> Signed-off-by: Leon Romanovsky 
> Reviewed-by: Pierre-Louis Bossart 
> Reviewed-by: Shiraz Saleem 
> Reviewed-by: Parav Pandit 
> Reviewed-by: Dan Williams 
> Signed-off-by: Dave Ertman 
> ---
>  Documentation/driver-api/auxiliary_bus.rst | 228 ++
>  Documentation/driver-api/index.rst |   1 +
>  drivers/base/Kconfig   |   3 +
>  drivers/base/Makefile  |   1 +
>  drivers/base/auxiliary.c   | 267 +
>  include/linux/auxiliary_bus.h  |  78 ++
>  include/linux/mod_devicetable.h|   8 +
>  scripts/mod/devicetable-offsets.c  |   3 +
>  scripts/mod/file2alias.c   |   8 +
>  9 files changed, 597 insertions(+)
>  create mode 100644 Documentation/driver-api/auxiliary_bus.rst
>  create mode 100644 drivers/base/auxiliary.c
>  create mode 100644 include/linux/auxiliary_bus.h
>
> diff --git a/Documentation/driver-api/auxiliary_bus.rst 
> b/Documentation/driver-api/auxiliary_bus.rst
> new file mode 100644
> index ..500f29692c81
> --- /dev/null
> +++ b/Documentation/driver-api/auxiliary_bus.rst
> @@ -0,0 +1,228 @@
> +.. SPDX-License-Identifier: GPL-2.0-only
> +
> +=
> +Auxiliary Bus
> +=
> +
> +In some subsystems, the functionality of the core device (PCI/ACPI/other) is
> +too complex for a single device to be managed as a monolithic block or a 
> part of
> +the functionality needs to be exposed to a different subsystem.

I think there are three identified use cases for this now, so call out
those examples for others to compare if aux bus is a fit for their use
case.

"In some subsystems, the functionality of the core device
(PCI/ACPI/other) is too complex to be handled by a monolithic driver
(e.g. Sound Open Firmware), multiple devices might implement a common
intersection of functionality (e.g. NICs + RDMA), or a driver may want
to export an interface for another subsystem to drive (e.g. SIOV
Physical Function export Virtual Function management)."

> + Splitting the
> +functionality into smaller orthogonal devices would make it easier to manage
> +data, power management and domain-specific interaction with the hardware.

Passive voice and I don't understand what is meant by the word "orthogonal"

"A split of the functionality into child-devices representing
sub-domains of functionality makes it possible to compartmentalize,
layer, and distribute domain-specific concerns via a Linux
device-driver model"

> A key
> +requirement for such a split is that there is no dependency on a physical 
> bus,
> +device, register accesses or regmap support.
> These individual devices split from
> +the core cannot live on the platform bus as they are not physical devices 
> that
> +are controlled by DT/ACPI. The same argument applies for not using MFD in 
> this
> +scenario as MFD relies on individual function devices being physical devices.

These last few sentences are confusing the justification of the
existence of auxiliary bus, not the explanation of when why and how to
use it.

The "When Should the Auxiliary Bus Be Used" should mention where
Auxiliary bus fits relative to the platform-bus and MFD, perhaps in an
explicit "When not to use Auxiliary Bus" section to direct people to
platform-bus and MFD when those are a better fit.

> +
> +An example for this kind of requirement is the audio subsystem where a single
> +IP is handling multiple entities such as HDMI, Soundwire, local devices such 
> as
> +mics/speakers etc. The split for the core's functionality can be arbitrary or
> +be defined by the DSP firmware topology and include hooks for test/debug. 
> This
> +allows for the audio core device to be minimal and focused on 
> hardware-specific
> +control and communication.
> +
> +The auxiliary bus is intended to be minimal, generic and avoid 
> domain-specific
> +assumptions.

As this file will be sitting in Documentation/ it will be too late

Re: [PATCH mlx5-next v1 06/11] vdpa/mlx5: Connect mlx5_vdpa to auxiliary bus

2020-11-04 Thread Dan Williams

On Wed, Nov 4, 2020 at 11:32 PM gregkh  wrote:
>
> On Wed, Nov 04, 2020 at 03:21:23PM -0800, Dan Williams wrote:
> > On Tue, Nov 3, 2020 at 7:45 AM Jason Gunthorpe  wrote:
> > [..]
> > > > +MODULE_DEVICE_TABLE(auxiliary, mlx5v_id_table);
> > > > +
> > > > +static struct auxiliary_driver mlx5v_driver = {
> > > > + .name = "vnet",
> > > > + .probe = mlx5v_probe,
> > > > + .remove = mlx5v_remove,
> > > > + .id_table = mlx5v_id_table,
> > > > +};
> > >
> > > It is hard to see from the diff, but when this patch is applied the
> > > vdpa module looks like I imagined things would look with the auxiliary
> > > bus. It is very similar in structure to a PCI driver with the probe()
> > > function cleanly registering with its subsystem. This is what I'd like
> > > to see from the new Intel RDMA driver.
> > >
> > > Greg, I think this patch is the best clean usage example.
> > >
> > > I've looked over this series and it has the right idea and
> > > parts. There is definitely more that can be done to improve mlx5 in
> > > this area, but this series is well scoped and cleans a good part of
> > > it.
> >
> > Greg?
> >
> > I know you alluded to going your own way if the auxiliary bus patches
> > did not shape up soon, but it seems they have and the stakeholders
> > have reached this consensus point.
> >
> > Were there any additional changes you wanted to see happen? I'll go
> > give the final set another once over, but David has been diligently
> > fixing up all the declared major issues so I expect to find at most
> > minor incremental fixups.
>
> This is in my to-review pile, along with a load of other stuff at the
> moment:
> $ ~/bin/mdfrm -c ~/mail/todo/
> 1709 messages in /home/gregkh/mail/todo/
>
> So give me a chance.  There is no rush on my side for this given the
> huge delays that have happened here on the authorship side many times in
> the past :)

Sure, I was more looking to confirm that it's worth continuing to
polish this set given your mention of possibly going a different
direction.

> If you can review it, or anyone else, that is always most appreciated.

Thanks, will do.

Re: [PATCH mlx5-next v1 06/11] vdpa/mlx5: Connect mlx5_vdpa to auxiliary bus

2020-11-04 Thread Dan Williams

On Tue, Nov 3, 2020 at 7:45 AM Jason Gunthorpe  wrote:
[..]
> > +MODULE_DEVICE_TABLE(auxiliary, mlx5v_id_table);
> > +
> > +static struct auxiliary_driver mlx5v_driver = {
> > + .name = "vnet",
> > + .probe = mlx5v_probe,
> > + .remove = mlx5v_remove,
> > + .id_table = mlx5v_id_table,
> > +};
>
> It is hard to see from the diff, but when this patch is applied the
> vdpa module looks like I imagined things would look with the auxiliary
> bus. It is very similar in structure to a PCI driver with the probe()
> function cleanly registering with its subsystem. This is what I'd like
> to see from the new Intel RDMA driver.
>
> Greg, I think this patch is the best clean usage example.
>
> I've looked over this series and it has the right idea and
> parts. There is definitely more that can be done to improve mlx5 in
> this area, but this series is well scoped and cleans a good part of
> it.

Greg?

I know you alluded to going your own way if the auxiliary bus patches
did not shape up soon, but it seems they have and the stakeholders
have reached this consensus point.

Were there any additional changes you wanted to see happen? I'll go
give the final set another once over, but David has been diligently
fixing up all the declared major issues so I expect to find at most
minor incremental fixups.

Re: [PATCH v9 3/4] docs: Add documentation for userspace client interface

2020-10-26 Thread Dan Williams

On Mon, 2020-10-26 at 07:38 -0600, Jeffrey Hugo wrote:
> On 10/25/2020 3:46 PM, Jakub Kicinski wrote:
> > On Fri, 23 Oct 2020 16:17:54 -0700 Hemant Kumar wrote:
> > > +UCI driver enables userspace clients to communicate to external
> > > MHI devices
> > > +like modem and WLAN. UCI driver probe creates standard character
> > > device file
> > > +nodes for userspace clients to perform open, read, write, poll
> > > and release file
> > > +operations.
> > 
> > What's the user space that talks to this?
> > 
> 
> Multiple.
> 
> Each channel has a different purpose.  There it is expected that a 
> different userspace application would be using it.
> 
> Hemant implemented the loopback channel, which is a simple channel
> that 
> just sends you back anything you send it.  Typically this is consumed
> by 
> a test application.
> 
> Diag is a typical channel to be consumed by userspace.  This is
> consumed 
> by various applications that talk to the remote device for
> diagnostic 
> information (logs and such).

QMI too?
Dan

> Sahara is another common channel that is usually used for the
> multistage 
> firmware loading process.
>

Re: [RFC] treewide: cleanup unreachable breaks

2020-10-17 Thread Dan Williams

On Sat, Oct 17, 2020 at 9:10 AM  wrote:
>
> From: Tom Rix 
>
> This is a upcoming change to clean up a new warning treewide.
> I am wondering if the change could be one mega patch (see below) or
> normal patch per file about 100 patches or somewhere half way by collecting
> early acks.
>
> clang has a number of useful, new warnings see
> https://clang.llvm.org/docs/DiagnosticsReference.html
>
> This change cleans up -Wunreachable-code-break
> https://clang.llvm.org/docs/DiagnosticsReference.html#wunreachable-code-break
> for 266 of 485 warnings in this week's linux-next, allyesconfig on x86_64.
>
> The method of fixing was to look for warnings where the preceding statement
> was a simple statement and by inspection made the subsequent break unneeded.
> In order of frequency these look like
>
> return and break
>
> switch (c->x86_vendor) {
> case X86_VENDOR_INTEL:
> intel_p5_mcheck_init(c);
> return 1;
> -   break;
>
> goto and break
>
> default:
> operation = 0; /* make gcc happy */
> goto fail_response;
> -   break;
>
> break and break
> case COLOR_SPACE_SRGB:
> /* by pass */
> REG_SET(OUTPUT_CSC_CONTROL, 0,
> OUTPUT_CSC_GRPH_MODE, 0);
> break;
> -   break;
>
> The exception to the simple statement, is a switch case with a block
> and the end of block is a return
>
> struct obj_buffer *buff = r->ptr;
> return scnprintf(str, PRIV_STR_SIZE,
> "size=%u\naddr=0x%X\n", buff->size,
> buff->addr);
> }
> -   break;
>
> Not considered obvious and excluded, breaks after
> multi level switches
> complicated if-else if-else blocks
> panic() or similar calls
>
> And there is an odd addition of a 'fallthrough' in drivers/tty/nozomi.c
[..]
> diff --git a/drivers/nvdimm/claim.c b/drivers/nvdimm/claim.c
> index 5a7c80053c62..2f250874b1a4 100644
> --- a/drivers/nvdimm/claim.c
> +++ b/drivers/nvdimm/claim.c
> @@ -200,11 +200,10 @@ ssize_t nd_namespace_store(struct device *dev,
> }
> break;
> default:
> len = -EBUSY;
> goto out_attach;
> -   break;
> }

Acked-by: Dan Williams

Re: [PATCH RFC PKS/PMEM 33/58] fs/cramfs: Utilize new kmap_thread()

2020-10-13 Thread Dan Williams

On Tue, Oct 13, 2020 at 12:37 PM Matthew Wilcox  wrote:
>
> On Tue, Oct 13, 2020 at 11:44:29AM -0700, Dan Williams wrote:
> > On Fri, Oct 9, 2020 at 12:52 PM  wrote:
> > >
> > > From: Ira Weiny 
> > >
> > > The kmap() calls in this FS are localized to a single thread.  To avoid
> > > the over head of global PKRS updates use the new kmap_thread() call.
> > >
> > > Cc: Nicolas Pitre 
> > > Signed-off-by: Ira Weiny 
> > > ---
> > >  fs/cramfs/inode.c | 10 +-
> > >  1 file changed, 5 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/fs/cramfs/inode.c b/fs/cramfs/inode.c
> > > index 912308600d39..003c014a42ed 100644
> > > --- a/fs/cramfs/inode.c
> > > +++ b/fs/cramfs/inode.c
> > > @@ -247,8 +247,8 @@ static void *cramfs_blkdev_read(struct super_block 
> > > *sb, unsigned int offset,
> > > struct page *page = pages[i];
> > >
> > > if (page) {
> > > -   memcpy(data, kmap(page), PAGE_SIZE);
> > > -   kunmap(page);
> > > +   memcpy(data, kmap_thread(page), PAGE_SIZE);
> > > +   kunmap_thread(page);
> >
> > Why does this need a sleepable kmap? This looks like a textbook
> > kmap_atomic() use case.
>
> There's a lot of code of this form.  Could we perhaps have:
>
> static inline void copy_to_highpage(struct page *to, void *vfrom, unsigned 
> int size)
> {
> char *vto = kmap_atomic(to);
>
> memcpy(vto, vfrom, size);
> kunmap_atomic(vto);
> }
>
> in linux/highmem.h ?

Nice, yes, that could also replace the local ones in lib/iov_iter.c
(memcpy_{to,from}_page())

Re: [PATCH RFC PKS/PMEM 33/58] fs/cramfs: Utilize new kmap_thread()

2020-10-13 Thread Dan Williams

On Fri, Oct 9, 2020 at 12:52 PM  wrote:
>
> From: Ira Weiny 
>
> The kmap() calls in this FS are localized to a single thread.  To avoid
> the over head of global PKRS updates use the new kmap_thread() call.
>
> Cc: Nicolas Pitre 
> Signed-off-by: Ira Weiny 
> ---
>  fs/cramfs/inode.c | 10 +-
>  1 file changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/fs/cramfs/inode.c b/fs/cramfs/inode.c
> index 912308600d39..003c014a42ed 100644
> --- a/fs/cramfs/inode.c
> +++ b/fs/cramfs/inode.c
> @@ -247,8 +247,8 @@ static void *cramfs_blkdev_read(struct super_block *sb, 
> unsigned int offset,
> struct page *page = pages[i];
>
> if (page) {
> -   memcpy(data, kmap(page), PAGE_SIZE);
> -   kunmap(page);
> +   memcpy(data, kmap_thread(page), PAGE_SIZE);
> +   kunmap_thread(page);

Why does this need a sleepable kmap? This looks like a textbook
kmap_atomic() use case.

Re: [PATCH] net: Add mhi-net driver

2020-10-12 Thread Dan Williams

On Fri, 2020-10-09 at 22:33 +0200, Loic Poulain wrote:
> This patch adds a new network driver implementing MHI transport for
> network packets. Packets can be in any format, though QMAP (rmnet)
> is the usual protocol (flow control + PDN mux).
> 
> It support two MHI devices, IP_HW0 which is, the path to the IPA
> (IP accelerator) on qcom modem, And IP_SW0 which is the software
> driven IP path (to modem CPU).
> 
> Signed-off-by: Loic Poulain 
> ---
>  drivers/net/Kconfig   |   7 ++
>  drivers/net/Makefile  |   1 +
>  drivers/net/mhi_net.c | 281
> ++
>  3 files changed, 289 insertions(+)
>  create mode 100644 drivers/net/mhi_net.c
> 
> diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
> index 1368d1d..11a6357 100644
> --- a/drivers/net/Kconfig
> +++ b/drivers/net/Kconfig
> @@ -426,6 +426,13 @@ config VSOCKMON
> mostly intended for developers or support to debug vsock
> issues. If
> unsure, say N.
>  
> +config MHI_NET
> + tristate "MHI network driver"
> + depends on MHI_BUS
> + help
> +   This is the network driver for MHI.  It can be used with
> +   QCOM based WWAN modems (like SDX55).  Say Y or M.
> +
>  endif # NET_CORE
>  
>  config SUNGEM_PHY
> diff --git a/drivers/net/Makefile b/drivers/net/Makefile
> index 94b6080..8312037 100644
> --- a/drivers/net/Makefile
> +++ b/drivers/net/Makefile
> @@ -34,6 +34,7 @@ obj-$(CONFIG_GTP) += gtp.o
>  obj-$(CONFIG_NLMON) += nlmon.o
>  obj-$(CONFIG_NET_VRF) += vrf.o
>  obj-$(CONFIG_VSOCKMON) += vsockmon.o
> +obj-$(CONFIG_MHI_NET) += mhi_net.o
>  
>  #
>  # Networking Drivers
> diff --git a/drivers/net/mhi_net.c b/drivers/net/mhi_net.c
> new file mode 100644
> index 000..f66248e
> --- /dev/null
> +++ b/drivers/net/mhi_net.c
> @@ -0,0 +1,281 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +/* MHI Network driver - Network over MHI
> + *
> + * Copyright (C) 2020 Linaro Ltd 
> + */
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#define MIN_MTU  ETH_MIN_MTU
> +#define MAX_MTU  0x
> +#define DEFAULT_MTU  8192
> +
> +struct mhi_net_stats {
> + u64 rx_packets;
> + u64 rx_bytes;
> + u64 rx_errors;
> + u64 rx_dropped;
> + u64 tx_packets;
> + u64 tx_bytes;
> + u64 tx_errors;
> + u64 tx_dropped;
> + atomic_t rx_queued;
> +};
> +
> +struct mhi_net_dev {
> + struct mhi_device *mdev;
> + struct net_device *ndev;
> + struct delayed_work rx_refill;
> + struct mhi_net_stats stats;
> +};
> +
> +static int mhi_ndo_open(struct net_device *ndev)
> +{
> + struct mhi_net_dev *mhi_netdev = netdev_priv(ndev);
> +
> + /* Feed the rx buffer pool */
> + schedule_delayed_work(&mhi_netdev->rx_refill, 0);
> +
> + /* Carrier is established via out-of-band channel (e.g. qmi) */
> + netif_carrier_on(ndev);
> +
> + netif_start_queue(ndev);
> +
> + return 0;
> +}
> +
> +static int mhi_ndo_stop(struct net_device *ndev)
> +{
> + struct mhi_net_dev *mhi_netdev = netdev_priv(ndev);
> +
> + netif_stop_queue(ndev);
> + netif_carrier_off(ndev);
> + cancel_delayed_work_sync(&mhi_netdev->rx_refill);
> +
> + return 0;
> +}
> +
> +static int mhi_ndo_xmit(struct sk_buff *skb, struct net_device
> *ndev)
> +{
> + struct mhi_net_dev *mhi_netdev = netdev_priv(ndev);
> + struct mhi_device *mdev = mhi_netdev->mdev;
> + int err;
> +
> + /* Only support for single buffer transfer for now */
> + err = skb_linearize(skb);
> + if (unlikely(err)) {
> + kfree_skb(skb);
> + mhi_netdev->stats.tx_dropped++;
> + return NETDEV_TX_OK;
> + }
> +
> + skb_tx_timestamp(skb);
> +
> + /* mhi_queue_skb is not thread-safe, but xmit is serialized by
> the
> +  * network core. Once MHI core will be thread save, migrate to
> +  * NETIF_F_LLTX support.
> +  */
> + err = mhi_queue_skb(mdev, DMA_TO_DEVICE, skb, skb->len,
> MHI_EOT);
> + if (err) {
> + netdev_err(ndev, "mhi_queue_skb err %d\n", err);
> + netif_stop_queue(ndev);
> + return (err == -ENOMEM) ? NETDEV_TX_BUSY : err;
> + }
> +
> + return NETDEV_TX_OK;
> +}
> +
> +static void mhi_ndo_get_stats64(struct net_device *ndev,
> + struct rtnl_link_stats64 *stats)
> +{
> + struct mhi_net_dev *mhi_netdev = netdev_priv(ndev);
> +
> + stats->rx_packets = mhi_netdev->stats.rx_packets;
> + stats->rx_bytes = mhi_netdev->stats.rx_bytes;
> + stats->rx_errors = mhi_netdev->stats.rx_errors;
> + stats->rx_dropped = mhi_netdev->stats.rx_dropped;
> + stats->tx_packets = mhi_netdev->stats.tx_packets;
> + stats->tx_bytes = mhi_netdev->stats.tx_bytes;
> + stats->tx_errors = mhi_netdev->stats.tx_errors;
> + stats->tx_dropped = mhi_netdev->stats.tx_dropped;
> +}
> +
> +static const struct net_device_ops mhi_netdev_ops = {
> + .ndo_open

Re: [PATCH v2 1/6] Add ancillary bus support

2020-10-09 Thread Dan Williams

On Fri, Oct 9, 2020 at 7:27 AM Pierre-Louis Bossart
 wrote:
>
>
>
>  +
>  +   ancildrv->driver.owner = owner;
>  +   ancildrv->driver.bus = &ancillary_bus_type;
>  +   ancildrv->driver.probe = ancillary_probe_driver;
>  +   ancildrv->driver.remove = ancillary_remove_driver;
>  +   ancildrv->driver.shutdown = ancillary_shutdown_driver;
>  +
> >>>
> >>> I think that this part is wrong, probe/remove/shutdown functions should
> >>> come from ancillary_bus_type.
> >>
> >>  From checking other usage cases, this is the model that is used for 
> >> probe, remove,
> >> and shutdown in drivers.  Here is the example from Greybus.
> >>
> >> int greybus_register_driver(struct greybus_driver *driver, struct module 
> >> *owner,
> >>  const char *mod_name)
> >> {
> >>  int retval;
> >>
> >>  if (greybus_disabled())
> >>  return -ENODEV;
> >>
> >>  driver->driver.bus = &greybus_bus_type;
> >>  driver->driver.name = driver->name;
> >>  driver->driver.probe = greybus_probe;
> >>  driver->driver.remove = greybus_remove;
> >>  driver->driver.owner = owner;
> >>  driver->driver.mod_name = mod_name;
> >>
> >>
> >>> You are overwriting private device_driver
> >>> callbacks that makes impossible to make container_of of ancillary_driver
> >>> to chain operations.
> >>>
> >>
> >> I am sorry, you lost me here.  you cannot perform container_of on the 
> >> callbacks
> >> because they are pointers, but if you are referring to going from 
> >> device_driver
> >> to the auxiliary_driver, that is what happens in auxiliary_probe_driver in 
> >> the
> >> very beginning.
> >>
> >> static int auxiliary_probe_driver(struct device *dev)
> >> 145 {
> >> 146 struct auxiliary_driver *auxdrv = 
> >> to_auxiliary_drv(dev->driver);
> >> 147 struct auxiliary_device *auxdev = to_auxiliary_dev(dev);
> >>
> >> Did I miss your meaning?
> >
> > I think you're misunderstanding the cases when the
> > bus_type.{probe,remove} is used vs the driver.{probe,remove}
> > callbacks. The bus_type callbacks are to implement a pattern where the
> > 'probe' and 'remove' method are typed to the bus device type. For
> > example 'struct pci_dev *' instead of raw 'struct device *'. See this
> > conversion of dax bus as an example of going from raw 'struct device
> > *' typed probe/remove to dax-device typed probe/remove:
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?id=75797273189d
>
> Thanks Dan for the reference, very useful. This doesn't look like a a
> big change to implement, just wondering about the benefits and
> drawbacks, if any? I am a bit confused here.
>
> First, was the initial pattern wrong as Leon asserted it? Such code
> exists in multiple examples in the kernel and there's nothing preventing
> the use of container_of that I can think of. Put differently, if this
> code was wrong then there are other existing buses that need to be updated.
>
> Second, what additional functionality does this move from driver to
> bus_type provide? The commit reference just states 'In preparation for
> introducing seed devices the dax-bus core needs to be able to intercept
> ->probe() and ->remove() operations", but that doesn't really help me
> figure out what 'intercept' means. Would you mind elaborating?
>
> And last, the existing probe function does calls dev_pm_domain_attach():
>
> static int ancillary_probe_driver(struct device *dev)
> {
> struct ancillary_driver *ancildrv = to_ancillary_drv(dev->driver);
> struct ancillary_device *ancildev = to_ancillary_dev(dev);
> int ret;
>
> ret = dev_pm_domain_attach(dev, true);
>
> So the need to access the raw device still exists. Is this still legit
> if the probe() is moved to the bus_type structure?

Sure, of course.

>
> I have no objection to this change if it preserves the same
> functionality and possibly extends it, just wanted to better understand
> the reasons for the change and in which cases the bus probe() makes more
> sense than a driver probe().
>
> Thanks for enlightening the rest of us!

tl;dr: The ops set by the device driver should never be overwritten by
the bus, the bus can only wrap them in its own ops.

The reason to use the bus_type is because the bus type is the only
agent that knows both how to convert a raw 'struct device *' to the
bus's native type, and how to convert a raw 'struct device_driver *'
to the bus's native driver type. The driver core does:

if (dev->bus->probe) {
ret = dev->bus->probe(dev);
} else if (drv->probe) {
ret = drv->probe(dev);
}

...so that the bus has the first priority for probing a device /
wrapping the native driver ops. The bus ->probe, in addition to
optionally performing some bus specific pre-work, lets the bus upcast
the device to bus-native type.

The bus also knows the types of drivers that will

Re: [PATCH v2 1/6] Add ancillary bus support

2020-10-08 Thread Dan Williams

On Thu, Oct 8, 2020 at 3:04 PM Ertman, David M  wrote:
>
> > -Original Message-
> > From: Leon Romanovsky 
> > Sent: Tuesday, October 6, 2020 10:23 AM
> > To: Ertman, David M 
> > Cc: alsa-de...@alsa-project.org; ti...@suse.de; broo...@kernel.org; linux-
> > r...@vger.kernel.org; j...@nvidia.com; dledf...@redhat.com;
> > netdev@vger.kernel.org; da...@davemloft.net; k...@kernel.org;
> > gre...@linuxfoundation.org; ranjani.sridha...@linux.intel.com; pierre-
> > louis.boss...@linux.intel.com; fred...@linux.intel.com;
> > pa...@mellanox.com; Saleem, Shiraz ; Williams,
> > Dan J ; Patil, Kiran 
> > Subject: Re: [PATCH v2 1/6] Add ancillary bus support
> >
> > On Mon, Oct 05, 2020 at 11:24:41AM -0700, Dave Ertman wrote:
> > > Add support for the Ancillary Bus, ancillary_device and ancillary_driver.
> > > It enables drivers to create an ancillary_device and bind an
> > > ancillary_driver to it.
> > >
> > > The bus supports probe/remove shutdown and suspend/resume callbacks.
> > > Each ancillary_device has a unique string based id; driver binds to
> > > an ancillary_device based on this id through the bus.
> > >
> > > Co-developed-by: Kiran Patil 
> > > Signed-off-by: Kiran Patil 
> > > Co-developed-by: Ranjani Sridharan 
> > > Signed-off-by: Ranjani Sridharan 
> > > Co-developed-by: Fred Oh 
> > > Signed-off-by: Fred Oh 
> > > Reviewed-by: Pierre-Louis Bossart 
> > > Reviewed-by: Shiraz Saleem 
> > > Reviewed-by: Parav Pandit 
> > > Reviewed-by: Dan Williams 
> > > Signed-off-by: Dave Ertman 
> > > ---
> >
> > <...>
> >
> > > +/**
> > > + * __ancillary_driver_register - register a driver for ancillary bus 
> > > devices
> > > + * @ancildrv: ancillary_driver structure
> > > + * @owner: owning module/driver
> > > + */
> > > +int __ancillary_driver_register(struct ancillary_driver *ancildrv, struct
> > module *owner)
> > > +{
> > > +   if (WARN_ON(!ancildrv->probe) || WARN_ON(!ancildrv->remove)
> > ||
> > > +   WARN_ON(!ancildrv->shutdown) || WARN_ON(!ancildrv-
> > >id_table))
> > > +   return -EINVAL;
> >
> > In our driver ->shutdown is empty, it will be best if ancillary bus will
> > do "if (->remove) ..->remove()" pattern.
> >
>
> Yes, looking it over, only the probe needs to mandatory.  I will change the 
> others to the
> conditional model, and adjust the WARN_ONs.
>
>
> > > +
> > > +   ancildrv->driver.owner = owner;
> > > +   ancildrv->driver.bus = &ancillary_bus_type;
> > > +   ancildrv->driver.probe = ancillary_probe_driver;
> > > +   ancildrv->driver.remove = ancillary_remove_driver;
> > > +   ancildrv->driver.shutdown = ancillary_shutdown_driver;
> > > +
> >
> > I think that this part is wrong, probe/remove/shutdown functions should
> > come from ancillary_bus_type.
>
> From checking other usage cases, this is the model that is used for probe, 
> remove,
> and shutdown in drivers.  Here is the example from Greybus.
>
> int greybus_register_driver(struct greybus_driver *driver, struct module 
> *owner,
> const char *mod_name)
> {
> int retval;
>
> if (greybus_disabled())
> return -ENODEV;
>
> driver->driver.bus = &greybus_bus_type;
> driver->driver.name = driver->name;
> driver->driver.probe = greybus_probe;
> driver->driver.remove = greybus_remove;
> driver->driver.owner = owner;
> driver->driver.mod_name = mod_name;
>
>
> > You are overwriting private device_driver
> > callbacks that makes impossible to make container_of of ancillary_driver
> > to chain operations.
> >
>
> I am sorry, you lost me here.  you cannot perform container_of on the 
> callbacks
> because they are pointers, but if you are referring to going from 
> device_driver
> to the auxiliary_driver, that is what happens in auxiliary_probe_driver in the
> very beginning.
>
> static int auxiliary_probe_driver(struct device *dev)
> 145 {
> 146 struct auxiliary_driver *auxdrv = to_auxiliary_drv(dev->driver);
> 147 struct auxiliary_device *auxdev = to_auxiliary_dev(dev);
>
> Did I miss your meaning?

I think you're misunderstanding the cases when the
bus_type.{probe,remove} is used vs the driver.{probe,remove}
callbacks. The bus_type callbacks are to implement a pattern where the
'probe' and 'remove' method are typed to the bus device type. For
example 'struct pci_dev *' instead of raw 'struct device *'. See this
conversion of dax bus as an example of going from raw 'struct device
*' typed probe/remove to dax-device typed probe/remove:

https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?id=75797273189d

Re: [PATCH v2 1/6] Add ancillary bus support

2020-10-08 Thread Dan Williams

On Thu, Oct 8, 2020 at 1:00 AM Leon Romanovsky  wrote:
>
> On Thu, Oct 08, 2020 at 12:38:00AM -0700, Dan Williams wrote:
> > On Thu, Oct 8, 2020 at 12:01 AM Leon Romanovsky  wrote:
> > [..]
> > > All stated above is my opinion, it can be different from yours.
> >
> > Yes, but we need to converge to move this forward. Jason was involved
> > in the current organization for registration, Greg was angling for
> > this to be core functionality. I have use cases outside of RDMA and
> > netdev. Parav was ok with the current organization. The SOF folks
> > already have a proposed incorporation of it. The argument I am hearing
> > is that "this registration api seems hard for driver writers" when we
> > have several driver writers who have already taken a look and can make
> > it work. If you want to follow on with a simpler wrappers for your use
> > case, great, but I do not yet see anyone concurring with your opinion
> > that the current organization is irretrievably broken or too obscure
> > to use.
>
> Can it be that I'm first one to use this bus for very large driver (>120K LOC)
> that has 5 different ->probe() flows?
>
> For example, this 
> https://lore.kernel.org/linux-rdma/20201006172317.GN1874917@unreal/
> hints to me that this bus wasn't used with anything complex as it was 
> initially intended.

I missed that. Yes, I agree that's broken.

>
> And regarding registration, I said many times that init()/add() scheme is ok, 
> the inability
> to call to uninit() after add() failure is not ok from my point of view.

Ok, I got to the wrong conclusion about your position.

Re: [PATCH v2 1/6] Add ancillary bus support

2020-10-08 Thread Dan Williams

On Thu, Oct 8, 2020 at 12:01 AM Leon Romanovsky  wrote:
[..]
> All stated above is my opinion, it can be different from yours.

Yes, but we need to converge to move this forward. Jason was involved
in the current organization for registration, Greg was angling for
this to be core functionality. I have use cases outside of RDMA and
netdev. Parav was ok with the current organization. The SOF folks
already have a proposed incorporation of it. The argument I am hearing
is that "this registration api seems hard for driver writers" when we
have several driver writers who have already taken a look and can make
it work. If you want to follow on with a simpler wrappers for your use
case, great, but I do not yet see anyone concurring with your opinion
that the current organization is irretrievably broken or too obscure
to use.

Re: [PATCH v2 1/6] Add ancillary bus support

2020-10-07 Thread Dan Williams

On Wed, Oct 7, 2020 at 10:21 PM Leon Romanovsky  wrote:
>
> On Wed, Oct 07, 2020 at 08:46:45PM +, Ertman, David M wrote:
> > > -Original Message-
> > > From: Parav Pandit 
> > > Sent: Wednesday, October 7, 2020 1:17 PM
> > > To: Leon Romanovsky ; Ertman, David M
> > > 
> > > Cc: Pierre-Louis Bossart ; alsa-
> > > de...@alsa-project.org; pa...@mellanox.com; ti...@suse.de;
> > > netdev@vger.kernel.org; ranjani.sridha...@linux.intel.com;
> > > fred...@linux.intel.com; linux-r...@vger.kernel.org;
> > > dledf...@redhat.com; broo...@kernel.org; Jason Gunthorpe
> > > ; gre...@linuxfoundation.org; k...@kernel.org; Williams,
> > > Dan J ; Saleem, Shiraz
> > > ; da...@davemloft.net; Patil, Kiran
> > > 
> > > Subject: RE: [PATCH v2 1/6] Add ancillary bus support
> > >
> > >
> > > > From: Leon Romanovsky 
> > > > Sent: Thursday, October 8, 2020 12:56 AM
> > > >
> > > > > > This API is partially obscures low level driver-core code and needs
> > > > > > to provide clear and proper abstractions without need to remember
> > > > > > about put_device. There is already _add() interface why don't you do
> > > > > > put_device() in it?
> > > > > >
> > > > >
> > > > > The pushback Pierre is referring to was during our mid-tier internal
> > > > > review.  It was primarily a concern of Parav as I recall, so he can 
> > > > > speak to
> > > his
> > > > reasoning.
> > > > >
> > > > > What we originally had was a single API call
> > > > > (ancillary_device_register) that started with a call to
> > > > > device_initialize(), and every error path out of the function 
> > > > > performed a
> > > > put_device().
> > > > >
> > > > > Is this the model you have in mind?
> > > >
> > > > I don't like this flow:
> > > > ancillary_device_initialize()
> > > > if (ancillary_ancillary_device_add()) {
> > > >   put_device()
> > > >   ancillary_device_unregister()
> > > Calling device_unregister() is incorrect, because add() wasn't successful.
> > > Only put_device() or a wrapper ancillary_device_put() is necessary.
> > >
> > > >   return err;
> > > > }
> > > >
> > > > And prefer this flow:
> > > > ancillary_device_initialize()
> > > > if (ancillary_device_add()) {
> > > >   ancillary_device_unregister()
> > > This is incorrect and a clear deviation from the current core APIs that 
> > > adds the
> > > confusion.
> > >
> > > >   return err;
> > > > }
> > > >
> > > > In this way, the ancillary users won't need to do non-intuitive 
> > > > put_device();
> > >
> > > Below is most simple, intuitive and matching with core APIs for name and
> > > design pattern wise.
> > > init()
> > > {
> > > err = ancillary_device_initialize();
> > > if (err)
> > > return ret;
> > >
> > > err = ancillary_device_add();
> > > if (ret)
> > > goto err_unwind;
> > >
> > > err = some_foo();
> > > if (err)
> > > goto err_foo;
> > > return 0;
> > >
> > > err_foo:
> > > ancillary_device_del(adev);
> > > err_unwind:
> > > ancillary_device_put(adev->dev);
> > > return err;
> > > }
> > >
> > > cleanup()
> > > {
> > > ancillary_device_de(adev);
> > > ancillary_device_put(adev);
> > > /* It is common to have a one wrapper for this as
> > > ancillary_device_unregister().
> > >  * This will match with core device_unregister() that has precise
> > > documentation.
> > >  * but given fact that init() code need proper error unwinding, like
> > > above,
> > >  * it make sense to have two APIs, and no need to export another
> > > symbol for unregister().
> > >  * This pattern is very easy to audit and code.
> > >  */
> > > }
> >
> > I like this flow +1
> >
> > But ... since the init() function is performing both device_init and
> > device_add - it should probably be called ancillary_device_register,
> > and we are back to a single exported API for both register and
> > unregister.
> >
> > At that point, do we need wrappers on the primitives init, add, del,
> > and put?
>
> Let me summarize.
> 1. You are not providing driver/core API but simplification and obfuscation
> of basic primitives and structures. This is new layer. There is no room for
> a claim that we must to follow internal API.

Yes, this a driver core api, Greg even questioned why it was in
drivers/bus instead of drivers/base which I think makes sense.

> 2. API should be symmetric. If you call to _register()/_add(), you will need
> to call to _unregister()/_del(). Please don't add obscure _put().

It's not obscure it's a long standing semantic for how to properly
handle device_add() failures. Especially in this case where there is
no way to have something like a common auxiliary_device_alloc() that
will work for everyone the only other option is require all device
destruction to go through the provided release method (put_device())
after a device_add() failure.

> 3. You can't "ask" from users to call internal calls (put_device) over 
> internal
> fields in ancillary_device.

Sure it can. platform_device

Re: [PATCH v2 1/6] Add ancillary bus support

2020-10-07 Thread Dan Williams

On Wed, Oct 7, 2020 at 6:37 AM Leon Romanovsky  wrote:
>
> On Wed, Oct 07, 2020 at 01:09:55PM +, Saleem, Shiraz wrote:
> > > Subject: Re: [PATCH v2 1/6] Add ancillary bus support
> > >
> > > On Tue, Oct 6, 2020 at 12:21 PM Leon Romanovsky  wrote:
> > > >
> > > > On Tue, Oct 06, 2020 at 05:41:00PM +, Saleem, Shiraz wrote:
> > > > > > Subject: Re: [PATCH v2 1/6] Add ancillary bus support
> > > > > >
> > > > > > On Tue, Oct 06, 2020 at 05:09:09PM +, Parav Pandit wrote:
> > > > > > >
> > > > > > > > From: Leon Romanovsky 
> > > > > > > > Sent: Tuesday, October 6, 2020 10:33 PM
> > > > > > > >
> > > > > > > > On Tue, Oct 06, 2020 at 10:18:07AM -0500, Pierre-Louis Bossart 
> > > > > > > > wrote:
> > > > > > > > > Thanks for the review Leon.
> > > > > > > > >
> > > > > > > > > > > Add support for the Ancillary Bus, ancillary_device and
> > > ancillary_driver.
> > > > > > > > > > > It enables drivers to create an ancillary_device and
> > > > > > > > > > > bind an ancillary_driver to it.
> > > > > > > > > >
> > > > > > > > > > I was under impression that this name is going to be 
> > > > > > > > > > changed.
> > > > > > > > >
> > > > > > > > > It's part of the opens stated in the cover letter.
> > > > > > > >
> > > > > > > > ok, so what are the variants?
> > > > > > > > system bus (sysbus), sbsystem bus (subbus), crossbus ?
> > > > > > > Since the intended use of this bus is to
> > > > > > > (a) create sub devices that represent 'functional separation'
> > > > > > > and
> > > > > > > (b) second use case for subfunctions from a pci device,
> > > > > > >
> > > > > > > I proposed below names in v1 of this patchset.
> > > > > >
> > > > > > > (a) subdev_bus
> > > > > >
> > > > > > It sounds good, just can we avoid "_" in the name and call it 
> > > > > > subdev?
> > > > > >
> > > > >
> > > > > What is wrong with naming the bus 'ancillary bus'? I feel it's a 
> > > > > fitting name.
> > > > > An ancillary software bus for ancillary devices carved off a parent 
> > > > > device
> > > registered on a primary bus.
> > > >
> > > > Greg summarized it very well, every internal conversation about this
> > > > patch with my colleagues (non-english speakers) starts with the 
> > > > question:
> > > > "What does ancillary mean?"
> > > > https://lore.kernel.org/alsa-devel/20201001071403.gc31...@kroah.com/
> > > >
> > > > "For non-native english speakers this is going to be rough, given that
> > > > I as a native english speaker had to go look up the word in a
> > > > dictionary to fully understand what you are trying to do with that
> > > > name."
> > >
> > > I suggested "auxiliary" in another splintered thread on this question.
> > > In terms of what the kernel is already using:
> > >
> > > $ git grep auxiliary | wc -l
> > > 507
> > > $ git grep ancillary | wc -l
> > > 153
> > >
> > > Empirically, "auxiliary" is more common and closely matches the intended 
> > > function
> > > of these devices relative to their parent device.
> >
> > auxiliary bus is a befitting name as well.
>
> Let's share all options and decide later.
> I don't want to find us bikeshedding about it.

Too late we are deep into bikeshedding at this point... it continued
over here [1] for a bit, but let's try to bring the discussion back to
this thread.

[1]: 
http://lore.kernel.org/r/10048d4d-038c-c2b7-2ed7-fd4ca87d1...@linux.intel.com

Re: [PATCH v2 1/6] Add ancillary bus support

2020-10-06 Thread Dan Williams

On Tue, Oct 6, 2020 at 12:21 PM Leon Romanovsky  wrote:
>
> On Tue, Oct 06, 2020 at 05:41:00PM +, Saleem, Shiraz wrote:
> > > Subject: Re: [PATCH v2 1/6] Add ancillary bus support
> > >
> > > On Tue, Oct 06, 2020 at 05:09:09PM +, Parav Pandit wrote:
> > > >
> > > > > From: Leon Romanovsky 
> > > > > Sent: Tuesday, October 6, 2020 10:33 PM
> > > > >
> > > > > On Tue, Oct 06, 2020 at 10:18:07AM -0500, Pierre-Louis Bossart wrote:
> > > > > > Thanks for the review Leon.
> > > > > >
> > > > > > > > Add support for the Ancillary Bus, ancillary_device and 
> > > > > > > > ancillary_driver.
> > > > > > > > It enables drivers to create an ancillary_device and bind an
> > > > > > > > ancillary_driver to it.
> > > > > > >
> > > > > > > I was under impression that this name is going to be changed.
> > > > > >
> > > > > > It's part of the opens stated in the cover letter.
> > > > >
> > > > > ok, so what are the variants?
> > > > > system bus (sysbus), sbsystem bus (subbus), crossbus ?
> > > > Since the intended use of this bus is to
> > > > (a) create sub devices that represent 'functional separation' and
> > > > (b) second use case for subfunctions from a pci device,
> > > >
> > > > I proposed below names in v1 of this patchset.
> > > >
> > > > (a) subdev_bus
> > >
> > > It sounds good, just can we avoid "_" in the name and call it subdev?
> > >
> >
> > What is wrong with naming the bus 'ancillary bus'? I feel it's a fitting 
> > name.
> > An ancillary software bus for ancillary devices carved off a parent device 
> > registered on a primary bus.
>
> Greg summarized it very well, every internal conversation about this patch
> with my colleagues (non-english speakers) starts with the question:
> "What does ancillary mean?"
> https://lore.kernel.org/alsa-devel/20201001071403.gc31...@kroah.com/
>
> "For non-native english speakers this is going to be rough,
> given that I as a native english speaker had to go look up
> the word in a dictionary to fully understand what you are
> trying to do with that name."

I suggested "auxiliary" in another splintered thread on this question.
In terms of what the kernel is already using:

$ git grep auxiliary | wc -l
507
$ git grep ancillary | wc -l
153

Empirically, "auxiliary" is more common and closely matches the
intended function of these devices relative to their parent device.

Re: [RFC] Status of orinoco_usb

2020-10-06 Thread Dan Williams

On Tue, 2020-10-06 at 08:40 -0400, Jes Sorensen wrote:
> On 10/6/20 3:45 AM, Arend Van Spriel wrote:
> > + Jes
> > 
> > On 10/5/2020 4:12 PM, Kalle Valo wrote:
> > > Greg Kroah-Hartman  writes:
> > > 
> > > > On Fri, Oct 02, 2020 at 01:53:58PM +0200, Sebastian Andrzej
> > > > Siewior
> > > > wrote:
> > > > > On 2020-10-02 13:37:25 [+0200], Greg Kroah-Hartman wrote:
> > > > > > > Is it possible to end up here in softirq context or is
> > > > > > > this a relic?
> > > > > > 
> > > > > > I think it's a relic of where USB host controllers
> > > > > > completed their
> > > > > > urbs
> > > > > > in hard-irq mode.  The BH/tasklet change is a pretty recent
> > > > > > change.
> > > > > 
> > > > > But the BH thingy for HCDs went in v3.12 for EHCI. XHCI was
> > > > > v5.5. My
> > > > > guess would be that people using orinoco USB are on EHCI :)
> > > > 
> > > > USB 3 systems run XHCI, which has a USB 2 controller in it, so
> > > > these
> > > > types of things might not have been noticed yet.  Who knows :)
> > > > 
> > > > > > > Should it be removed?
> > > > > > 
> > > > > > We can move it out to drivers/staging/ and then drop it to
> > > > > > see if
> > > > > > anyone
> > > > > > complains that they have the device and is willing to test
> > > > > > any
> > > > > > changes.
> > > > > 
> > > > > Not sure moving is easy since it depends on other files in
> > > > > that folder.
> > > > > USB is one interface next to PCI for instance. Unless you
> > > > > meant to move
> > > > > the whole driver including all interfaces.
> > > > > I was suggesting to remove the USB bits.
> > > > 
> > > > I forgot this was tied into other code, sorry.  I don't know
> > > > what to
> > > > suggest other than maybe try to fix it up the best that you
> > > > can, and
> > > > let's see if anyone notices...
> > > 
> > > That's what I would suggest as well.
> > > 
> > > These drivers for ancient hardware are tricky. Even if there
> > > doesn't
> > > seem to be any users on the driver sometimes people pop up
> > > reporting
> > > that it's still usable. We had that recently with one another
> > > wireless
> > > driver (forgot the name already).
> > 
> > Quite a while ago I shipped an orinoco dongle to Jes Sorensen which
> > he
> > wanted to use for some intern project if I recall correctly. Guess
> > that
> > idea did not fly yet.
> 
> I had an outreachy intern who worked on some of it, so I shipped all
> my
> Orinoco hardware to her. We never made as much progress as I had
> hoped,
> and I haven't had time to work on it since.

If anyone wants orinoco_usb, I think I still have one or two. I may
also have original orinoco hardware (PCMCIA) if anyone wants it.

Dan

Re: [net-next v4 10/12] ASoC: SOF: Introduce descriptors for SOF client

2020-07-06 Thread Dan Williams

On Thu, Jul 2, 2020 at 6:44 AM Ranjani Sridharan
 wrote:
[..]
> > > Hi Jason,
> > >
> > > We're addressing the naming in the next version as well. We've had
> > > several people reject the name virtual bus and we've narrowed in on
> > > "ancillary bus" for the new name suggesting that we have the core
> > > device that is attached to the primary bus and one or more sub-
> > > devices
> > > that are attached to the ancillary bus. Please let us know what you
> > > think of it.
> >
> > I'm thinking that the primary person who keeps asking you to create
> > this
> > "virtual bus" was not upset about that name, nor consulted, so why
> > are
> > you changing this?  :(
> >
> > Right now this feels like the old technique of "keep throwing crap at
> > a
> > maintainer until they get so sick of it that they do the work
> > themselves..."
>
> Hi Greg,
>
> It wasnt our intention to frustrate you with the name change but in the
> last exchange you had specifically asked for signed-off-by's from other
> Intel developers. In that process, one of the recent feedback from some
> of them was about the name being misleading and confusing.
>
> If you feel strongly about the keeping name "virtual bus", please let
> us know and we can circle back with them again.

Hey Greg,

Feel free to blame me for the naming thrash it was part of my internal
review feedback trying to crispen the definition of this facility. I
was expecting the next revision to come with the internal reviewed-by
and an explanation of all the items that were changed during that
review.

Ranjani, is the next rev ready to go out with the review items
identified? Let's just proceed with the current direction of the
review tags that Greg asked for, name changes and all, and iterate the
next details on the list with the new patches in hand.

Re: qmi_wwan not using netif_carrier_*()

2020-06-18 Thread Dan Williams

On Thu, 2020-06-18 at 12:08 +0200, Tanjeff-Nicolai Moos wrote:
> 
> On Wed, 17 Jun 2020 19:24:34 +0200
> Andrew Lunn  wrote:
> 
> > On Wed, Jun 17, 2020 at 11:59:33AM -0500, Dan Williams wrote:
> > > On Wed, 2020-06-17 at 18:48 +0200, Andrew Lunn wrote:
> > > > On Wed, Jun 17, 2020 at 03:21:53PM +0200, Tanjeff-Nicolai Moos
> > > > wrote:
> > > > > Hi netdevs,
> > > > > 
> > > > > Kernel version:
> > > > > 
> > > > >   I'm working with kernel 4.14.137 (OpenWRT project). But I
> > > > > looked
> > > > > at
> > > > >   the source of kernel 5.7 and found the same situation.
> > > > > 
> > > > > Problem:
> > > > > 
> > > > >   I'm using the qmi_wwan driver for a Sierra Wireless EM7455
> > > > > LTE
> > > > >   modem. This driver does not use
> > > > >   netif_carrier_on()/netif_carrier_off() to update its link
> > > > > status.
> > > > >   This confuses ledtrig_netdev which uses netif_carrier_ok()
> > > > > to
> > > > > obtain
> > > > >   the link status.
> > > > > 
> > > > > My solution:
> > > > > 
> > > > >   As a solution (or workaround?) I would try:
> > > > > 
> > > > >   1) In drivers/net/usb/qmi_wwan.c, lines 904/913: Add the
> > > > > flag
> > > > >  FLAG_LINK_INTR.
> > > > > 
> > > > >   2) In drivers/net/usb/usbnet.c, functions usbnet_open() and
> > > > >  usbnet_stop(): Add a call to netif_carrier_*(),
> > > > >  but only if FLAG_LINK_INTR is set.
> > > > > 
> > > > > Question:
> > > > > 
> > > > >   Is this the intended way to use FLAG_LINK_INTR and
> > > > > netif_carrier_*()?
> > > > >   Or is there another recommended way to obtain the link
> > > > > status of
> > > > >   network devices (I could change ledtrig_netdev)?
> > > > 
> > > > Hi Tanjeff
> > > > 
> > > > With Ethernet, having a carrier means there is a link partner,
> > > > the
> > > > layer 2 of the OSI 7 layer stack model is working. If the
> > > > interface
> > > > is
> > > > not open()ed, it clearly should not have carrier. However, just
> > > > because it is open, does not mean it has carrier. The cable
> > > > could be
> > > > unplugged, etc.
> > > > 
> > > > This is an LTE modem. What does carrier mean here? I don't know
> > > > if it
> > > > is well defined, but i would guess it is connected to a base
> > > > station
> > > > which is offering service. I'm assuming you are interested in
> > > > data
> > > > here, not wanting to make a 911/999/112/$EMERGENCY_SERVICE call
> > > > which
> > > > in theory all base stations should accept.
> > > > 
> > > > Is there a way to get this state information from the hardware?
> > > > That
> > > > would be the correct way to set the carrier.
> > > 
> > > There isn't. All the setup that would result in IFF_LOWER_UP (eg
> > > ability to pass packets to the cellular network) happens over
> > > channels
> > > *other* than the ethernet one. eg CDC-WDM, CDC-ACM, CDC-MBIM, AT
> > > commands, QMI commands, MBIM commands, etc.
> > > 
> > > Something in userspace handles the actual IP-level connection
> > > setup and
> > > once that's done, only then do you really have IFF_LOWER_UP. One
> > > way to
> > > solve this could be to require userspace connection managers to
> > > manage
> > > the carrier state of the device, which is possible for some
> > > drivers
> > > already IIRC.
> > 
> > So Tanjeff, what is you real use case here? I assume you want to
> > control an LED so it is on when the LTE modem is connected? Could
> > you
> > export the LED to user space and have a dhclient-enter/exit script
> > change
> > the state of the LED?
> The LED should show whether the link is up (data transfer is
> possible),
> and it should blink when data is being transferred. This
> functionality
> is provided by the ledtrig_netdev driver. The blinking works already,
> but the indicated link state is wrong, because the netif_car

Re: qmi_wwan not using netif_carrier_*()

2020-06-17 Thread Dan Williams

On Wed, 2020-06-17 at 19:24 +0200, Andrew Lunn wrote:
> On Wed, Jun 17, 2020 at 11:59:33AM -0500, Dan Williams wrote:
> > On Wed, 2020-06-17 at 18:48 +0200, Andrew Lunn wrote:
> > > On Wed, Jun 17, 2020 at 03:21:53PM +0200, Tanjeff-Nicolai Moos
> > > wrote:
> > > > Hi netdevs,
> > > > 
> > > > Kernel version:
> > > > 
> > > >   I'm working with kernel 4.14.137 (OpenWRT project). But I
> > > > looked
> > > > at
> > > >   the source of kernel 5.7 and found the same situation.
> > > > 
> > > > Problem:
> > > > 
> > > >   I'm using the qmi_wwan driver for a Sierra Wireless EM7455
> > > > LTE
> > > >   modem. This driver does not use
> > > >   netif_carrier_on()/netif_carrier_off() to update its link
> > > > status.
> > > >   This confuses ledtrig_netdev which uses netif_carrier_ok() to
> > > > obtain
> > > >   the link status.
> > > > 
> > > > My solution:
> > > > 
> > > >   As a solution (or workaround?) I would try:
> > > > 
> > > >   1) In drivers/net/usb/qmi_wwan.c, lines 904/913: Add the flag
> > > >  FLAG_LINK_INTR.
> > > > 
> > > >   2) In drivers/net/usb/usbnet.c, functions usbnet_open() and
> > > >  usbnet_stop(): Add a call to netif_carrier_*(),
> > > >  but only if FLAG_LINK_INTR is set.
> > > > 
> > > > Question:
> > > > 
> > > >   Is this the intended way to use FLAG_LINK_INTR and
> > > > netif_carrier_*()?
> > > >   Or is there another recommended way to obtain the link status
> > > > of
> > > >   network devices (I could change ledtrig_netdev)?
> > > 
> > > Hi Tanjeff
> > > 
> > > With Ethernet, having a carrier means there is a link partner,
> > > the
> > > layer 2 of the OSI 7 layer stack model is working. If the
> > > interface
> > > is
> > > not open()ed, it clearly should not have carrier. However, just
> > > because it is open, does not mean it has carrier. The cable could
> > > be
> > > unplugged, etc.
> > > 
> > > This is an LTE modem. What does carrier mean here? I don't know
> > > if it
> > > is well defined, but i would guess it is connected to a base
> > > station
> > > which is offering service. I'm assuming you are interested in
> > > data
> > > here, not wanting to make a 911/999/112/$EMERGENCY_SERVICE call
> > > which
> > > in theory all base stations should accept.
> > > 
> > > Is there a way to get this state information from the hardware?
> > > That
> > > would be the correct way to set the carrier.
> > 
> > There isn't. All the setup that would result in IFF_LOWER_UP (eg
> > ability to pass packets to the cellular network) happens over
> > channels
> > *other* than the ethernet one. eg CDC-WDM, CDC-ACM, CDC-MBIM, AT
> > commands, QMI commands, MBIM commands, etc.
> > 
> > Something in userspace handles the actual IP-level connection setup
> > and
> > once that's done, only then do you really have IFF_LOWER_UP. One
> > way to
> > solve this could be to require userspace connection managers to
> > manage
> > the carrier state of the device, which is possible for some drivers
> > already IIRC.
> 
> So Tanjeff, what is you real use case here? I assume you want to
> control an LED so it is on when the LTE modem is connected? Could you
> export the LED to user space and have a dhclient-enter/exit script
> change
> the state of the LED?

Most of these devices do not use DHCP either, but get the IP through
the control channel (QMI, MBIM, AT commands, etc).  I'd just watch for
an IP address on the interface instead.

Dan

Re: qmi_wwan not using netif_carrier_*()

2020-06-17 Thread Dan Williams

On Wed, 2020-06-17 at 18:48 +0200, Andrew Lunn wrote:
> On Wed, Jun 17, 2020 at 03:21:53PM +0200, Tanjeff-Nicolai Moos wrote:
> > Hi netdevs,
> > 
> > Kernel version:
> > 
> >   I'm working with kernel 4.14.137 (OpenWRT project). But I looked
> > at
> >   the source of kernel 5.7 and found the same situation.
> > 
> > Problem:
> > 
> >   I'm using the qmi_wwan driver for a Sierra Wireless EM7455 LTE
> >   modem. This driver does not use
> >   netif_carrier_on()/netif_carrier_off() to update its link status.
> >   This confuses ledtrig_netdev which uses netif_carrier_ok() to
> > obtain
> >   the link status.
> > 
> > My solution:
> > 
> >   As a solution (or workaround?) I would try:
> > 
> >   1) In drivers/net/usb/qmi_wwan.c, lines 904/913: Add the flag
> >  FLAG_LINK_INTR.
> > 
> >   2) In drivers/net/usb/usbnet.c, functions usbnet_open() and
> >  usbnet_stop(): Add a call to netif_carrier_*(),
> >  but only if FLAG_LINK_INTR is set.
> > 
> > Question:
> > 
> >   Is this the intended way to use FLAG_LINK_INTR and
> > netif_carrier_*()?
> >   Or is there another recommended way to obtain the link status of
> >   network devices (I could change ledtrig_netdev)?
> 
> Hi Tanjeff
> 
> With Ethernet, having a carrier means there is a link partner, the
> layer 2 of the OSI 7 layer stack model is working. If the interface
> is
> not open()ed, it clearly should not have carrier. However, just
> because it is open, does not mean it has carrier. The cable could be
> unplugged, etc.
> 
> This is an LTE modem. What does carrier mean here? I don't know if it
> is well defined, but i would guess it is connected to a base station
> which is offering service. I'm assuming you are interested in data
> here, not wanting to make a 911/999/112/$EMERGENCY_SERVICE call which
> in theory all base stations should accept.
> 
> Is there a way to get this state information from the hardware? That
> would be the correct way to set the carrier.

There isn't. All the setup that would result in IFF_LOWER_UP (eg
ability to pass packets to the cellular network) happens over channels
*other* than the ethernet one. eg CDC-WDM, CDC-ACM, CDC-MBIM, AT
commands, QMI commands, MBIM commands, etc.

Something in userspace handles the actual IP-level connection setup and
once that's done, only then do you really have IFF_LOWER_UP. One way to
solve this could be to require userspace connection managers to manage
the carrier state of the device, which is possible for some drivers
already IIRC.

Dan

Re: [PATCH ethtool v1] netlink: add master/slave configuration support

2020-06-09 Thread Dan Williams

On Tue, Jun 9, 2020 at 12:57 PM David Miller  wrote:
>
> From: "Williams, Dan J" 
> Date: Tue, 9 Jun 2020 19:30:50 +
>
> > On Tue, 2020-06-09 at 11:36 -0700, David Miller wrote:
> >> From: Stephen Hemminger 
> >> Date: Tue, 9 Jun 2020 10:19:35 -0700
> >>
> >> > Yes, words do matter and convey a lot of implied connotation and
> >> > meaning.
> >>
> >> What is your long term plan?  Will you change all of the UAPI for
> >> bonding for example?
> >
> > The long term plan in my view includes talking with standards bodies to
> > move new content to, for example, master/subordinate. In other words,
> > practical forward steps, not retroactively changing interfaces.
>
> When that knowledge is established legitimately in standards and
> transferred into common knowledge of these technologies, yes then
> please come present your proposals.

Our hands are not completely tied by the specifications, as a
community we have a non-zero level of influence over standards bodies,
even direct participation in some. So we could do something stronger
than passively wait for standards to catch up. For example, deprecate
our consideration of future specifications that include this language
and set a cut off date.

I understand the confusion that arises from using terminology
differently from the specification, but at the same time when
specification language actively hurts kernel code maintainability we
change it. For example, when I did the first iteration of what
eventually became libnvdimm Ingo rightly reacted to the naming being
too ACPI specification centric and wanting more kernel-centric naming.

If the common kernel name for what was formerly called a "slave"
device is a "subordinate" device then the confusion is lessened, only
one common kernel translation to consider.

> But right now using different words will create confusion.
>
> I also find master/subordinate an interesting proposal, what if master
> is a triggering term?  Why only slave?

"slave" has a direct connection to human suffering deployed at global scale.

One way I think about this is consider we have our first ever Linux
Plumbers on the African continent, and you had a choice about giving a
talk about how the git master branch operates, or a talk about slave
devices which one would you feel more immediately comfortable leading?
Any hesitation you would feel freely using the word slave with a
predominantly black audience is a similar speed bump a contributor
might feel needing to consume that term to get their job done.

Explaining "no, not that connotation of slave" does not scale as much
as transitioning to another term.

> I know people feel something needs to change, but do that moving
> forward for the technologies themselves.

This is the start of a process that the kernel community can take an
active role to lead, we have it within our control to not wait for the
lowest common denominator to catch up.

> Not how we implement support
> for a technology which is established already.
>
> Plant the seed, don't chop the tree down.

I appreciate the engagement.

Re: [RFC] longer netdev names proposal

2019-06-27 Thread Dan Williams

On Thu, 2019-06-27 at 12:20 -0700, Stephen Hemminger wrote:
> On Thu, 27 Jun 2019 20:39:48 +0200
> Michal Kubecek  wrote:
> 
> > > $ ip li set dev enp3s0 alias "Onboard Ethernet"
> > > # ip link show "Onboard Ethernet"
> > > Device "Onboard Ethernet" does not exist.
> > > 
> > > So it does not really appear to be an alias, it is a label. To be
> > > truly useful, it needs to be more than a label, it needs to be a
> > > real
> > > alias which you can use.  
> > 
> > That's exactly what I meant: to be really useful, one should be
> > able to
> > use the alias(es) for setting device options, for adding routes, in
> > netfilter rules etc.
> > 
> > Michal
> 
> The kernel doesn't enforce uniqueness of alias.

Can we even enforce unique aliases/labels? Given that the kernel hasn't
enforced that in the past there's a good possibility of breaking stuff
if it started. (unfortunately)

Dan

> Also current kernel RTM_GETLINK doesn't do filter by alias (easily
> fixed).
> 
> If it did, then handling it in iproute would be something like:
> 
> diff --git a/lib/ll_map.c b/lib/ll_map.c
> index e0ed54bf77c9..c798ba542224 100644
> --- a/lib/ll_map.c
> +++ b/lib/ll_map.c
> @@ -26,15 +26,18 @@
>  struct ll_cache {
>   struct hlist_node idx_hash;
>   struct hlist_node name_hash;
> + struct hlist_node alias_hash;
>   unsignedflags;
>   unsignedindex;
>   unsigned short  type;
> - charname[];
> + char*alias;
> + charname[IFNAMSIZ];
>  };
>  
>  #define IDXMAP_SIZE  1024
>  static struct hlist_head idx_head[IDXMAP_SIZE];
>  static struct hlist_head name_head[IDXMAP_SIZE];
> +static struct hlist_head alias_head[IDXMAP_SIZE];
>  
>  static struct ll_cache *ll_get_by_index(unsigned index)
>  {
> @@ -77,10 +80,26 @@ static struct ll_cache *ll_get_by_name(const char
> *name)
>   return NULL;
>  }
>  
> +static struct ll_cache *ll_get_by_alias(const char *alias)
> +{
> + struct hlist_node *n;
> + unsigned h = namehash(alias) & (IDXMAP_SIZE - 1);
> +
> + hlist_for_each(n, &alias_head[h]) {
> + struct ll_cache *im
> + = container_of(n, struct ll_cache, alias_hash);
> +
> + if (strcmp(im->alias, alias) == 0)
> + return im;
> + }
> +
> + return NULL;
> +}
> +
>  int ll_remember_index(struct nlmsghdr *n, void *arg)
>  {
>   unsigned int h;
> - const char *ifname;
> + const char *ifname, *ifalias;
>   struct ifinfomsg *ifi = NLMSG_DATA(n);
>   struct ll_cache *im;
>   struct rtattr *tb[IFLA_MAX+1];
> @@ -96,6 +115,10 @@ int ll_remember_index(struct nlmsghdr *n, void
> *arg)
>   if (im) {
>   hlist_del(&im->name_hash);
>   hlist_del(&im->idx_hash);
> + if (im->alias) {
> + hlist_del(&im->alias_hash);
> + free(im->alias);
> + }
>   free(im);
>   }
>   return 0;
> @@ -106,6 +129,8 @@ int ll_remember_index(struct nlmsghdr *n, void
> *arg)
>   if (ifname == NULL)
>   return 0;
>  
> + ifalias = tb[IFLA_IFALIAS] ? rta_getattr_str(tb[IFLA_IFALIAS])
> : NULL;
> +
>   if (im) {
>   /* change to existing entry */
>   if (strcmp(im->name, ifname) != 0) {
> @@ -114,6 +139,14 @@ int ll_remember_index(struct nlmsghdr *n, void
> *arg)
>   hlist_add_head(&im->name_hash, &name_head[h]);
>   }
>  
> + if (im->alias) {
> + hlist_del(&im->alias_hash);
> + if (ifalias) {
> + h = namehash(ifalias) & (IDXMAP_SIZE -
> 1);
> + hlist_add_head(&im->alias_hash,
> &alias_head[h]);
> + }
> + }
> +
>   im->flags = ifi->ifi_flags;
>   return 0;
>   }
> @@ -132,6 +165,12 @@ int ll_remember_index(struct nlmsghdr *n, void
> *arg)
>   h = namehash(ifname) & (IDXMAP_SIZE - 1);
>   hlist_add_head(&im->name_hash, &name_head[h]);
>  
> + if (ifalias) {
> + im->alias = strdup(ifalias);
> + h = namehash(ifalias) & (IDXMAP_SIZE - 1);
> + hlist_add_head(&im->alias_hash, &alias_head[h]);
> + }   
> + 
>   return 0;
>  }
>  
> @@ -152,7 +191,7 @@ static unsigned int ll_idx_a2n(const char *name)
>   return idx;
>  }
>  
> -static int ll_link_get(const char *name, int index)
> +static int ll_link_get(const char *name, const char *alias, int
> index)
>  {
>   struct {
>   struct nlmsghdr n;
> @@ -176,6 +215,9 @@ static int ll_link_get(const char *name, int
> index)
>   if (name)
>   addattr_l(&req.n, sizeof(req), IFLA_IFNAME, name,
> strlen(name) + 1);
> + if (alias)
> + addattr_l(&req.n, sizeof(req), IFLA_IF

Re: [RFC] longer netdev names proposal

2019-06-27 Thread Dan Williams

On Thu, 2019-06-27 at 08:29 -0700, Stephen Hemminger wrote:
> On Thu, 27 Jun 2019 11:43:27 +0200
> Jiri Pirko  wrote:
> 
> > Hi all.
> > 
> > In the past, there was repeatedly discussed the IFNAMSIZ (16) limit
> > for
> > netdevice name length. Now when we have PF and VF representors
> > with port names like "pfXvfY", it became quite common to hit this
> > limit:
> > 0123456789012345
> > enp131s0f1npf0vf6
> > enp131s0f1npf0vf22
> > 
> > Since IFLA_NAME is just a string, I though it might be possible to
> > use
> > it to carry longer names as it is. However, the userspace tools,
> > like
> > iproute2, are doing checks before print out. So for example in
> > output of
> > "ip addr" when IFLA_NAME is longer than IFNAMSIZE, the netdevice is
> > completely avoided.
> > 
> > So here is a proposal that might work:
> > 1) Add a new attribute IFLA_NAME_EXT that could carry names longer
> > than
> >IFNAMSIZE, say 64 bytes. The max size should be only defined in
> > kernel,
> >user should be prepared for any string size.
> > 2) Add a file in sysfs that would indicate that NAME_EXT is
> > supported by
> >the kernel.
> > 3) Udev is going to look for the sysfs indication file. In case
> > when
> >kernel supports long names, it will do rename to longer name,
> > setting
> >IFLA_NAME_EXT. If not, it does what it does now - fail.
> > 4) There are two cases that can happen during rename:
> >A) The name is shorter than IFNAMSIZ
> >   -> both IFLA_NAME and IFLA_NAME_EXT would contain the same
> > string:  
> >  original IFLA_NAME = eth0
> >  original IFLA_NAME_EXT = eth0
> >  renamed  IFLA_NAME = enp5s0f1npf0vf1
> >  renamed  IFLA_NAME_EXT = enp5s0f1npf0vf1
> >B) The name is longer tha IFNAMSIZ
> >   -> IFLA_NAME would contain the original one, IFLA_NAME_EXT
> > would   
> >  contain the new one:
> >  original IFLA_NAME = eth0
> >  original IFLA_NAME_EXT = eth0
> >  renamed  IFLA_NAME = eth0
> >  renamed  IFLA_NAME_EXT = enp131s0f1npf0vf22

It makes me a bit uncomfortable to allow IFLA_NAME and IFLA_NAME_EXT to
be completely different. That sounds like a big source of confusion and
debugging problems in production.

Dan

> > This would allow the old tools to work with "eth0" and the new
> > tools would work with "enp131s0f1npf0vf22". In sysfs, there would
> > be symlink from one name to another.
> >   
> > Also, there might be a warning added to kernel if someone works
> > with IFLA_NAME that the userspace tool should be upgraded.
> > 
> > Eventually, only IFLA_NAME_EXT is going to be used by everyone.
> > 
> > I'm aware there are other places where similar new attribute
> > would have to be introduced too (ip rule for example).
> > I'm not saying this is a simple work.
> > 
> > Question is what to do with the ioctl api (get ifindex etc). I
> > would
> > probably leave it as is and push tools to use rtnetlink instead.
> > 
> > Any ideas why this would not work? Any ideas how to solve this
> > differently?
> > 
> > Thanks!
> > 
> > Jiri
> >  
> 
> I looked into this in the past, but then rejected it because
> there are so many tools that use names, not just iproute2.
> Plus long names are very user unfriendly.

Re: cellular modem driver APIs

2019-04-04 Thread Dan Williams

On Thu, 2019-04-04 at 11:00 +0200, Johannes Berg wrote:
> Hi,
> 
> > Thanks a lot for doing this!  Being responsible for most of the
> > issues
> > you point out, I can only say that you have my full support if you
> > want
> > to change any of it.
> 
> :-)
> 
> > My pathetic excuses below are just meant to clarify why things are
> > the
> > way they are.  They are not a defense for status quo ;-)
> 
> Thanks!
> 
> > > Here's the current things we seem to be doing:
> > > 
> > >   (1) Channels are created/encoded as VLANs (cdc_mbim)
> > > 
> > >   This is ... strange at best, it requires creating fake
> > > ethernet
> > >   headers on the frames, just to be able to have a VLAN tag.
> > > If you
> > >   could rely on VLAN acceleration it wouldn't be _so_ bad,
> > > but of
> > >   course you can't, so you have to detect an in-band VLAN tag
> > > and
> > >   decode/remove it, before taking the VLAN ID into the
> > > virtual
> > >   channel number.
> > 
> > No, the driver sets NETIF_F_HW_VLAN_CTAG_TX. There is no in-band
> > VLAN
> > tag for any normal use.  The tag is taken directly from skb
> > metadata and
> > mapped to the appropriate MBIM session ID.
> 
> Right, I saw this.
> 
> > But this failed when cooking raw frames with an in-line tag using
> > packet
> > sockets, so I added a fallback to in-line tags for that use case.
> 
> But this still means that the fallback for in-line has to be
> supported,
> so you can't really fully rely on VLAN acceleration. Maybe my wording
> here was incomplete, but I was aware of this.
> 
> Nevertheless, it means to replicate this in another driver you don't
> just need the VLAN acceleration handling, but also the fallback, so
> it's
> a bunch of extra code.
> 
> > >   Creating channels is hooked on VLAN operations, which is
> > > about the
> > >   only thing that makes sense here?
> > 
> > Well, that was why I did this, to avoid requiring som new set of
> > userspace tools to manage these links.  I looked for some existing
> > tools
> > for adding virtual netdevs, and I thought I could make VLANs fit
> > the
> > scheme. 
> 
> Right.
> 
> > In hindsight, I should have created a new netlink based API for
> > cellular
> > modem virtual links instead.  But I don't think it ever struck me
> > as a
> > choice I had at the time.  I just wasn't experienced enough to
> > realize
> > how the Linux kernel APIs are developed ;-)
> 
> :-)
> And likely really it wasn't all as fleshed out as today with the
> plethora of virtual links supported. This seems fairly old.
> 
> > >   (2) Channels are created using sysfs (qmi_wwan)
> > > 
> > >   This feels almost worse - channels are created using sysfs
> > > and
> > >   just *bam* new netdev shows up, no networking APIs are used
> > > to
> > >   create them at all, and I suppose you can't even query the
> > > channel
> > >   ID for each netdev if you rename them or so. Actually,
> > > maybe you
> > >   can in sysfs, not sure I understand the code fully.
> > 
> > This time I was, and I tried to learn from the MBIM mistake. So I
> > asked
> > the users (ModemManager developers++), proposing a netlink API as a
> > possible solution:
> > 
> > https://lists.freedesktop.org/archives/libqmi-devel/2017-January/001900.html
> > 
> > The options I presented were those I saw at the time: VLANs like
> > cdc_mbim, a new netlink API, or sysfs.  There wasn't much feedback,
> > but
> > sysfs "won".  So this was a decision made by the users of the API,
> > FWIW.
> 
> Fair point. Dan pointed out that no (default) userspace actually
> exists
> to do this though, and users kinda of have to do it manually - he
> says
> modem manager and libmbim all just use the default channel today. So
> not
> sure they really went on to become users of this ;-)

To be clear, ModemManager doesn't (yet) make use of multiple IP
channels. But libmbim supports it with 'mbimcli --connect="session-
id=4,apn=X"' and then you'd add VLAN 4 onto the mbim netdev and
theoretically things would work :)  Bjorn would have the details
though.

libmbim really doesn't care about the extra netdevs or channels itself
since it doesn't care about the data plane (nor does it need to at this
time).

Dan

> > >   (3) Channels are created using a new link type (rmnet)
> > > 
> > >   To me this sort of feels the most natural, but this
> > > particular
> > >   implementation has at least two issues:
> > > 
> > >   (a) The implementation is basically driver-specific now,
> > > the link
> > >   type is called 'rmnet' etc.
> > >   (b) The bridge enslave thing there is awful.
> > 
> > This driver showed up right after the sysfs based implementation in
> > qmi_wwan.  Too bad we didn't know about this work then.  I  don't
> > think
> > anyone would have been interested in the qmi_wwan sysfs thing if we
> > had
> > known about the plans for this driver.  But what's done is done.
> 
> Sure.
> 
> > > It seems to me that there really is

Re: [PATCH net] qmi_wwan: Add support for Quectel EG12/EM12

2019-03-05 Thread Dan Williams

On Sat, 2019-03-02 at 22:19 +0100, Kristian Evensen wrote:
> On Sat, Mar 2, 2019 at 4:00 PM Bjørn Mork  wrote:
> > I am not too happy about the double device id matching with extra
> > magic
> > applied. outside the device id table.  It does not scale, and it
> > does
> > not play well at all with dynamic IDs.  But I guess it's acceptable
> > for
> > two devices when it was ok for one, so
> > 
> > Acked-by: Bjørn Mork 
> > 
> > But I assume this isn't the last device we've seen with this issue.
> > Someone has already mention Quectel EM20. Maybe convert this code
> > to a
> > quirk for the next device and leave all the device specific magic
> > in the
> > device id table?  Could even add a macro for the class match +
> > quirk
> > thing.
> 
> (Appologies to Bjørn for receiving multiple copies of this email - I
> forgot that the Gmail app sends HTML-encoded emails)
> 
> Thanks for your feedback. My reasoning was the same. For one or two
> devices, the current approach is (probably) ok. Any more, and a
> different approach is needed.
> 
> Looking through the documentation I have, it seems like most
> Quectel-modems support the command used to change the
> USB-configuration. Thus, I wouldn’t be surprised if all or most
> existing Quectel-devices should have this quirk applied. I have a
> couple of older modules, so I will look at how they behave with
> different configurations. I will then either add the quirk soon or
> when for example EM20 is available (unless someone else does it
> earlier :)).

Yeah, this should really be a new quirk from here on.

Dan

Re: [RFC 2/6] ixgbe: use extack for xdp errors

2019-02-28 Thread Dan Williams

On Thu, 2019-02-28 at 10:38 -0800, Stephen Hemminger wrote:
> From: Stephen Hemminger 
> 
> Give a reason for returning error for bpf setup.
> 
> Signed-off-by: Stephen Hemminger 
> ---
>  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 25 +--
> 
>  1 file changed, 18 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> index a4e7584a50cb..9a81123074ca 100644
> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> @@ -10224,18 +10224,23 @@ ixgbe_features_check(struct sk_buff *skb,
> struct net_device *dev,
>   return features;
>  }
>  
> -static int ixgbe_xdp_setup(struct net_device *dev, struct bpf_prog
> *prog)
> +static int ixgbe_xdp_setup(struct net_device *dev, struct bpf_prog
> *prog,
> +struct netlink_ext_ack *extack)
>  {
>   int i, frame_size = dev->mtu + ETH_HLEN + ETH_FCS_LEN +
> VLAN_HLEN;
>   struct ixgbe_adapter *adapter = netdev_priv(dev);
>   struct bpf_prog *old_prog;
>   bool need_reset;
>  
> - if (adapter->flags & IXGBE_FLAG_SRIOV_ENABLED)
> + if (adapter->flags & IXGBE_FLAG_SRIOV_ENABLED) {
> + NL_SET_ERR_MSG(extack, "XDP not support with SRIOV
> enabled");

Should be "supported" I assume?

Dan

>   return -EINVAL;
> + }
>  
> - if (adapter->flags & IXGBE_FLAG_DCB_ENABLED)
> + if (adapter->flags & IXGBE_FLAG_DCB_ENABLED) {
> + NL_SET_ERR_MSG(extack, "XDP not supported with DCB
> enabled");
>   return -EINVAL;
> + }
>  
>   /* verify ixgbe ring attributes are sufficient for XDP */
>   for (i = 0; i < adapter->num_rx_queues; i++) {
> @@ -10244,12 +10249,17 @@ static int ixgbe_xdp_setup(struct
> net_device *dev, struct bpf_prog *prog)
>   if (ring_is_rsc_enabled(ring))
>   return -EINVAL;
>  
> - if (frame_size > ixgbe_rx_bufsz(ring))
> + if (frame_size > ixgbe_rx_bufsz(ring)) {
> + NL_SET_ERR_MSG(extack,
> +"XDP does not support multiple
> buffers");
>   return -EINVAL;
> + }
>   }
>  
> - if (nr_cpu_ids > MAX_XDP_QUEUES)
> + if (nr_cpu_ids > MAX_XDP_QUEUES) {
> + NL_SET_ERR_MSG(extack, "number of cpus >
> MAX_XDP_QUEUES");
>   return -ENOMEM;
> + }
>  
>   old_prog = xchg(&adapter->xdp_prog, prog);
>   need_reset = (!!prog != !!old_prog);
> @@ -10260,7 +10270,7 @@ static int ixgbe_xdp_setup(struct net_device
> *dev, struct bpf_prog *prog)
>  
>   if (err) {
>   rcu_assign_pointer(adapter->xdp_prog,
> old_prog);
> - return -EINVAL;
> + return err;
>   }
>   } else {
>   for (i = 0; i < adapter->num_rx_queues; i++)
> @@ -10288,7 +10298,7 @@ static int ixgbe_xdp(struct net_device *dev,
> struct netdev_bpf *xdp)
>  
>   switch (xdp->command) {
>   case XDP_SETUP_PROG:
> - return ixgbe_xdp_setup(dev, xdp->prog);
> + return ixgbe_xdp_setup(dev, xdp->prog, xdp->extack);
>   case XDP_QUERY_PROG:
>   xdp->prog_id = adapter->xdp_prog ?
>   adapter->xdp_prog->aux->id : 0;
> @@ -10298,6 +10308,7 @@ static int ixgbe_xdp(struct net_device *dev,
> struct netdev_bpf *xdp)
>   xdp->xsk.queue_id);
>  
>   default:
> + NL_SET_ERR_MSG(xdp->extack, "Unknown XDP command");
>   return -EINVAL;
>   }
>  }

Re: [PATCH 2/3] mm/gup: Introduce get_user_pages_fast_longterm()

2019-02-11 Thread Dan Williams

On Mon, Feb 11, 2019 at 2:07 PM Jason Gunthorpe  wrote:
>
> On Mon, Feb 11, 2019 at 01:52:38PM -0800, Ira Weiny wrote:
> > On Mon, Feb 11, 2019 at 01:39:12PM -0800, John Hubbard wrote:
> > > On 2/11/19 1:26 PM, Ira Weiny wrote:
> > > > On Mon, Feb 11, 2019 at 01:13:56PM -0800, John Hubbard wrote:
> > > >> On 2/11/19 12:39 PM, Jason Gunthorpe wrote:
> > > >>> On Mon, Feb 11, 2019 at 12:16:42PM -0800, ira.we...@intel.com wrote:
> > >  From: Ira Weiny 
> > > >> [...]
> > > >> It seems to me that the longterm vs. short-term is of questionable 
> > > >> value.
> > > >
> > > > This is exactly why I did not post this before.  I've been waiting our 
> > > > other
> > > > discussions on how GUP pins are going to be handled to play out.  But 
> > > > with the
> > > > netdev thread today[1] it seems like we need to make sure we have a 
> > > > "safe" fast
> > > > variant for a while.  Introducing FOLL_LONGTERM seemed like the 
> > > > cleanest way to
> > > > do that even if we will not need the distinction in the future...  :-(
> > >
> > > Yes, I agree. Below...
> > >
> > > > [...]
> > > > This is also why I did not change the get_user_pages_longterm because 
> > > > we could
> > > > be ripping this all out by the end of the year...  (I hope. :-)
> > > >
> > > > So while this does "pollute" the GUP family of calls I'm hoping it is 
> > > > not
> > > > forever.
> > > >
> > > > Ira
> > > >
> > > > [1] https://lkml.org/lkml/2019/2/11/1789
> > > >
> > >
> > > Yes, and to be clear, I think your patchset here is fine. It is easy to 
> > > find
> > > the FOLL_LONGTERM callers if and when we want to change anything. I just 
> > > think
> > > also it's appopriate to go a bit further, and use FOLL_LONGTERM all by 
> > > itself.
> > >
> > > That's because in either design outcome, it's better that way:
> > >
> > > is just right. The gup API already has _fast and non-fast variants, and 
> > > once
> > > you get past a couple, you end up with a multiplication of names that 
> > > really
> > > work better as flags. We're there.
> > >
> > > the _longterm API variants.
> >
> > Fair enough.   But to do that correctly I think we will need to convert
> > get_user_pages_fast() to use flags as well.  I have a version of this series
> > which includes a patch does this, but the patch touched a lot of subsystems 
> > and
> > a couple of different architectures...[1]
>
> I think this should be done anyhow, it is trouble the two basically
> identical interfaces have different signatures. This already caused a
> bug in vfio..
>
> I also wonder if someone should think about making fast into a flag
> too..
>
> But I'm not sure when fast should be used vs when it shouldn't :(

Effectively fast should always be used just in case the user cares
about performance. It's just that it may fail and need to fall back to
requiring the vma.

Personally I thought RDMA memory registration is a one-time / upfront
slow path so that non-fast-GUP is tolerable.

The workloads that *need* it are O_DIRECT users that can't tolerate a
vma lookup on every I/O.

Re: [PATCH 2/3] mm/gup: Introduce get_user_pages_fast_longterm()

2019-02-11 Thread Dan Williams

On Mon, Feb 11, 2019 at 1:39 PM John Hubbard  wrote:
>
> On 2/11/19 1:26 PM, Ira Weiny wrote:
> > On Mon, Feb 11, 2019 at 01:13:56PM -0800, John Hubbard wrote:
> >> On 2/11/19 12:39 PM, Jason Gunthorpe wrote:
> >>> On Mon, Feb 11, 2019 at 12:16:42PM -0800, ira.we...@intel.com wrote:
>  From: Ira Weiny 
> >> [...]
> >> It seems to me that the longterm vs. short-term is of questionable value.
> >
> > This is exactly why I did not post this before.  I've been waiting our other
> > discussions on how GUP pins are going to be handled to play out.  But with 
> > the
> > netdev thread today[1] it seems like we need to make sure we have a "safe" 
> > fast
> > variant for a while.  Introducing FOLL_LONGTERM seemed like the cleanest 
> > way to
> > do that even if we will not need the distinction in the future...  :-(
>
> Yes, I agree. Below...
>
> > [...]
> > This is also why I did not change the get_user_pages_longterm because we 
> > could
> > be ripping this all out by the end of the year...  (I hope. :-)
> >
> > So while this does "pollute" the GUP family of calls I'm hoping it is not
> > forever.
> >
> > Ira
> >
> > [1] https://lkml.org/lkml/2019/2/11/1789
> >
>
> Yes, and to be clear, I think your patchset here is fine. It is easy to find
> the FOLL_LONGTERM callers if and when we want to change anything. I just think
> also it's appopriate to go a bit further, and use FOLL_LONGTERM all by itself.
>
> That's because in either design outcome, it's better that way:
>
> -- If we keep the concept of "I'm a long-term gup call site", then 
> FOLL_LONGTERM
> is just right. The gup API already has _fast and non-fast variants, and once
> you get past a couple, you end up with a multiplication of names that really
> work better as flags. We're there.
>
> -- If we drop the concept, then you've already done part of the work, by 
> removing
> the _longterm API variants.
>

A problem I now see with the _longterm name is that it hides its true
intent. It's really a "dax can't use page cache tricks to make it seem
like this page is ok to access indefinitely, if the system needs this
page back your pin would prevent the forward progress of the system
state.". If the discussion results in a need to have an explicit file
state (immutable or lease) then we'll continue to need a gup pin type
distinction. If the discussion resolves to one of the silent options
(fail truncate, lie about truncate) then FOLL_LONGTERM might be able
to die at that point.

Re: [PATCH 1/2] xsk: do not use mmap_sem

2019-02-11 Thread Dan Williams

[ add Ira ]

On Mon, Feb 11, 2019 at 7:33 AM Daniel Borkmann  wrote:
>
> [ +Dan ]
>
> On 02/07/2019 08:43 AM, Björn Töpel wrote:
> > Den tors 7 feb. 2019 kl 06:38 skrev Davidlohr Bueso :
> >>
> >> Holding mmap_sem exclusively for a gup() is an overkill.
> >> Lets replace the call for gup_fast() and let the mm take
> >> it if necessary.
> >>
> >> Cc: David S. Miller 
> >> Cc: Bjorn Topel 
> >> Cc: Magnus Karlsson 
> >> CC: netdev@vger.kernel.org
> >> Signed-off-by: Davidlohr Bueso 
> >> ---
> >>  net/xdp/xdp_umem.c | 6 ++
> >>  1 file changed, 2 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
> >> index 5ab236c5c9a5..25e1e76654a8 100644
> >> --- a/net/xdp/xdp_umem.c
> >> +++ b/net/xdp/xdp_umem.c
> >> @@ -265,10 +265,8 @@ static int xdp_umem_pin_pages(struct xdp_umem *umem)
> >> if (!umem->pgs)
> >> return -ENOMEM;
> >>
> >> -   down_write(¤t->mm->mmap_sem);
> >> -   npgs = get_user_pages(umem->address, umem->npgs,
> >> - gup_flags, &umem->pgs[0], NULL);
> >> -   up_write(¤t->mm->mmap_sem);
> >> +   npgs = get_user_pages_fast(umem->address, umem->npgs,
> >> +  gup_flags, &umem->pgs[0]);
> >>
> >
> > Thanks for the patch!
> >
> > The lifetime of the pinning is similar to RDMA umem mapping, so isn't
> > gup_longterm preferred?
>
> Seems reasonable from reading what gup_longterm seems to do. Davidlohr
> or Dan, any thoughts on the above?

Yes, any gup where the lifetime of the corresponding put_page() is
under direct control of userspace should be using the _longterm flavor
to coordinate be careful in the presence of dax. In the dax case there
is no page cache indirection to coordinate truncate vs page pins. Ira
is looking to add a "fast + longterm" flavor for cases like this.

Re: [RFC PATCH V3 0/5] Hi:

2019-01-07 Thread Dan Williams

On Mon, Jan 7, 2019 at 2:25 PM Michael S. Tsirkin  wrote:
>
> On Mon, Jan 07, 2019 at 01:39:15PM -0800, Dan Williams wrote:
> > On Mon, Jan 7, 2019 at 6:11 AM Michael S. Tsirkin  wrote:
> > >
> > > On Sun, Jan 06, 2019 at 11:15:20PM -0800, Dan Williams wrote:
> > > > On Sun, Jan 6, 2019 at 8:17 PM Michael S. Tsirkin  
> > > > wrote:
> > > > >
> > > > > On Mon, Jan 07, 2019 at 11:53:41AM +0800, Jason Wang wrote:
> > > > > >
> > > > > > On 2019/1/7 上午11:28, Michael S. Tsirkin wrote:
> > > > > > > On Mon, Jan 07, 2019 at 10:19:03AM +0800, Jason Wang wrote:
> > > > > > > > On 2019/1/3 上午4:47, Michael S. Tsirkin wrote:
> > > > > > > > > On Sat, Dec 29, 2018 at 08:46:51PM +0800, Jason Wang wrote:
> > > > > > > > > > This series tries to access virtqueue metadata through 
> > > > > > > > > > kernel virtual
> > > > > > > > > > address instead of copy_user() friends since they had too 
> > > > > > > > > > much
> > > > > > > > > > overheads like checks, spec barriers or even hardware 
> > > > > > > > > > feature
> > > > > > > > > > toggling.
> > > > > > > > > Will review, thanks!
> > > > > > > > > One questions that comes to mind is whether it's all about 
> > > > > > > > > bypassing
> > > > > > > > > stac/clac.  Could you please include a performance comparison 
> > > > > > > > > with
> > > > > > > > > nosmap?
> > > > > > > > >
> > > > > > > > On machine without SMAP (Sandy Bridge):
> > > > > > > >
> > > > > > > > Before: 4.8Mpps
> > > > > > > >
> > > > > > > > After: 5.2Mpps
> > > > > > > OK so would you say it's really unsafe versus safe accesses?
> > > > > > > Or would you say it's just a better written code?
> > > > > >
> > > > > >
> > > > > > It's the effect of removing speculation barrier.
> > > > >
> > > > >
> > > > > You mean __uaccess_begin_nospec introduced by
> > > > > commit 304ec1b050310548db33063e567123fae8fd0301
> > > > > ?
> > > > >
> > > > > So fundamentally we do access_ok checks when supplying
> > > > > the memory table to the kernel thread, and we should
> > > > > do the spec barrier there.
> > > > >
> > > > > Then we can just create and use a variant of uaccess macros that does
> > > > > not include the barrier?
> > > > >
> > > > > Or, how about moving the barrier into access_ok?
> > > > > This way repeated accesses with a single access_ok get a bit faster.
> > > > > CC Dan Williams on this idea.
> > > >
> > > > It would be interesting to see how expensive re-doing the address
> > > > limit check is compared to the speculation barrier. I.e. just switch
> > > > vhost_get_user() to use get_user() rather than __get_user(). That will
> > > > sanitize the pointer in the speculative path without a barrier.
> > >
> > > Hmm it's way cheaper even though IIRC it's measureable.
> > > Jason, would you like to try?
> > > Although frankly __get_user being slower than get_user feels very wrong.
> > > Not yet sure what to do exactly but would you agree?
> >
> > Agree. __get_user() being faster than get_user() defeats the whole
> > point of converting code paths to the access_ok() + __get_user()
> > pattern.
>
> Did you mean the reverse?

Hmm, no... I'll rephrase: __get_user() should have lower overhead than
get_user().

Re: [RFC PATCH V3 0/5] Hi:

2019-01-07 Thread Dan Williams

On Mon, Jan 7, 2019 at 6:11 AM Michael S. Tsirkin  wrote:
>
> On Sun, Jan 06, 2019 at 11:15:20PM -0800, Dan Williams wrote:
> > On Sun, Jan 6, 2019 at 8:17 PM Michael S. Tsirkin  wrote:
> > >
> > > On Mon, Jan 07, 2019 at 11:53:41AM +0800, Jason Wang wrote:
> > > >
> > > > On 2019/1/7 上午11:28, Michael S. Tsirkin wrote:
> > > > > On Mon, Jan 07, 2019 at 10:19:03AM +0800, Jason Wang wrote:
> > > > > > On 2019/1/3 上午4:47, Michael S. Tsirkin wrote:
> > > > > > > On Sat, Dec 29, 2018 at 08:46:51PM +0800, Jason Wang wrote:
> > > > > > > > This series tries to access virtqueue metadata through kernel 
> > > > > > > > virtual
> > > > > > > > address instead of copy_user() friends since they had too much
> > > > > > > > overheads like checks, spec barriers or even hardware feature
> > > > > > > > toggling.
> > > > > > > Will review, thanks!
> > > > > > > One questions that comes to mind is whether it's all about 
> > > > > > > bypassing
> > > > > > > stac/clac.  Could you please include a performance comparison with
> > > > > > > nosmap?
> > > > > > >
> > > > > > On machine without SMAP (Sandy Bridge):
> > > > > >
> > > > > > Before: 4.8Mpps
> > > > > >
> > > > > > After: 5.2Mpps
> > > > > OK so would you say it's really unsafe versus safe accesses?
> > > > > Or would you say it's just a better written code?
> > > >
> > > >
> > > > It's the effect of removing speculation barrier.
> > >
> > >
> > > You mean __uaccess_begin_nospec introduced by
> > > commit 304ec1b050310548db33063e567123fae8fd0301
> > > ?
> > >
> > > So fundamentally we do access_ok checks when supplying
> > > the memory table to the kernel thread, and we should
> > > do the spec barrier there.
> > >
> > > Then we can just create and use a variant of uaccess macros that does
> > > not include the barrier?
> > >
> > > Or, how about moving the barrier into access_ok?
> > > This way repeated accesses with a single access_ok get a bit faster.
> > > CC Dan Williams on this idea.
> >
> > It would be interesting to see how expensive re-doing the address
> > limit check is compared to the speculation barrier. I.e. just switch
> > vhost_get_user() to use get_user() rather than __get_user(). That will
> > sanitize the pointer in the speculative path without a barrier.
>
> Hmm it's way cheaper even though IIRC it's measureable.
> Jason, would you like to try?
> Although frankly __get_user being slower than get_user feels very wrong.
> Not yet sure what to do exactly but would you agree?

Agree. __get_user() being faster than get_user() defeats the whole
point of converting code paths to the access_ok() + __get_user()
pattern.

>
>
> > I recall we had a convert access_ok() discussion with this result here:
> >
> > https://lkml.org/lkml/2018/1/17/929
>
> Sorry let me try to clarify. IIUC speculating access_ok once
> is harmless. As Linus said the problem is with "_subsequent_
> accesses that can then be used to perturb the cache".
>
> Thus:
>
> 1. if (!access_ok)
> 2.  return
> 3.  get_user
> 4. if (!access_ok)
> 5.  return
> 6.  get_user
>
> Your proposal that Linus nacked was to effectively add a barrier after
> lines 2 and 5 (also using the array_index_nospec trick for speed),
> right? Unfortunately that needs a big API change.
>
> I am asking about adding barrier_nospec within access_ok.
> Thus effectively before lines 1 and 4.
> access_ok will be slower but after all the point of access_ok is
> to then access the same memory multiple times.

If the barrier is before lines 1 and 4 then it offers no real
protection as far I can see. It will start speculating again on the
user controlled pointer value after the barrier resolves.

> So we should be making __get_user faster and access_ok slower ...

I agree, but then the barrier always needs to be after the access_ok()
check unconditionally called in the return path from access_ok(). At
that point it's back to the implementation that Linus nak'd, or I'm
missing some other detail.

...but maybe if it is limited to a new access_ok_nospec() then the
concern is addressed? Then rename current __get_user() to
__get_user_nospec() and make a new __get_user() that is back to being
optimal.

Re: [RFC PATCH V3 0/5] Hi:

2019-01-06 Thread Dan Williams

On Sun, Jan 6, 2019 at 8:17 PM Michael S. Tsirkin  wrote:
>
> On Mon, Jan 07, 2019 at 11:53:41AM +0800, Jason Wang wrote:
> >
> > On 2019/1/7 上午11:28, Michael S. Tsirkin wrote:
> > > On Mon, Jan 07, 2019 at 10:19:03AM +0800, Jason Wang wrote:
> > > > On 2019/1/3 上午4:47, Michael S. Tsirkin wrote:
> > > > > On Sat, Dec 29, 2018 at 08:46:51PM +0800, Jason Wang wrote:
> > > > > > This series tries to access virtqueue metadata through kernel 
> > > > > > virtual
> > > > > > address instead of copy_user() friends since they had too much
> > > > > > overheads like checks, spec barriers or even hardware feature
> > > > > > toggling.
> > > > > Will review, thanks!
> > > > > One questions that comes to mind is whether it's all about bypassing
> > > > > stac/clac.  Could you please include a performance comparison with
> > > > > nosmap?
> > > > >
> > > > On machine without SMAP (Sandy Bridge):
> > > >
> > > > Before: 4.8Mpps
> > > >
> > > > After: 5.2Mpps
> > > OK so would you say it's really unsafe versus safe accesses?
> > > Or would you say it's just a better written code?
> >
> >
> > It's the effect of removing speculation barrier.
>
>
> You mean __uaccess_begin_nospec introduced by
> commit 304ec1b050310548db33063e567123fae8fd0301
> ?
>
> So fundamentally we do access_ok checks when supplying
> the memory table to the kernel thread, and we should
> do the spec barrier there.
>
> Then we can just create and use a variant of uaccess macros that does
> not include the barrier?
>
> Or, how about moving the barrier into access_ok?
> This way repeated accesses with a single access_ok get a bit faster.
> CC Dan Williams on this idea.

It would be interesting to see how expensive re-doing the address
limit check is compared to the speculation barrier. I.e. just switch
vhost_get_user() to use get_user() rather than __get_user(). That will
sanitize the pointer in the speculative path without a barrier.

I recall we had a convert access_ok() discussion with this result here:

https://lkml.org/lkml/2018/1/17/929

...but it sounds like you are proposing a smaller scope fixup for the
vhost use case? Something like barrier_nospec() in the success path
for all vhost access_ok() checks and then a get_user() variant that
disables the barrier.

Re: [PATCH net] qmi_wwan: Support dynamic config on Quectel EP06

2018-09-10 Thread Dan Williams

On Sat, 2018-09-08 at 16:12 +0200, Bjørn Mork wrote:
> Kristian Evensen  writes:
> 
> > Quectel EP06 (and EM06/EG06) supports dynamic configuration of USB
> > interfaces, without the device changing VID/PID or configuration
> > number.
> > When the configuration is updated and interfaces are added/removed,
> > the
> > interface numbers change. This means that the current code for
> > matching
> > EP06 does not work.
> 
> That's annoying, but hardly surprising. They obviously try to make
> life
> as hard as possible for the drivers.  I wonder what the Windows
> drivers
> do here, if there are any?  Or are these modules only used in
> embedded
> Linux devices?
> 
> > This patch removes the current EP06 interface number match, and
> > replaces
> > it with a match on class, subclass and protocol. Unfortunately,
> > matching
> > on those three alone is not enough, as the diag interface exports
> > the
> > same values as QMI. The other serial interfaces + adb export
> > different
> > values and do not match.
> > 
> > The diag interface only has two endpoints, while the QMI interface
> > has
> > three. I have therefore added a check for number of interfaces, and
> > we
> > ignore the interface if the number of endpoints equals two.
> 
> Could this break if more/other functions are enabled?  Are you sure
> there can't be any other type of serial function with 3 endpoints and
> ff/ff/ff?  Well, I guess no one knows for sure... And this is more
> than
> good enough until it breaks. Thanks for solving the puzzle.  Looks
> good
> to me.

I'm sure they could add another serial interface with ff/ff/ff and 3
endpoints, but I don't know what else we can do to make this device
work correctly.  I double-checked the permutations & logic and it makes
sense to me.

Acked-by: Dan Williams 

Dan

Re: Qualcomm rmnet driver and qmi_wwan

2018-06-05 Thread Dan Williams

On Tue, 2018-06-05 at 11:38 +0200, Daniele Palmas wrote:
> Hi,
> 
> 2018-02-21 20:47 GMT+01:00 Subash Abhinov Kasiviswanathan
> :
> > On 2018-02-21 04:38, Daniele Palmas wrote:
> > > 
> > > Hello,
> > > 
> > > in rmnet kernel documentation I read:
> > > 
> > > "This driver can be used to register onto any physical network
> > > device in
> > > IP mode. Physical transports include USB, HSIC, PCIe and IP
> > > accelerator."
> > > 
> > > Does this mean that it can be used in association with the
> > > qmi_wwan
> > > driver?
> > > 
> > > If yes, can someone give me an hint on the steps to follow?
> > > 
> > > If not, does anyone know if it is possible to modify qmi_wwan in
> > > order
> > > to take advantage of the features provided by the rmnet driver?
> > > 
> > > In this case hint on the changes for modifying qmi_wwan are
> > > welcome.
> > > 
> > > Thanks in advance,
> > > Daniele
> > 
> > 
> > Hi
> > 
> > I havent used qmi_wwan so the following comment is based on code
> > inspection.
> > qmimux_register_device() is creating qmimux devices with usb net
> > device as
> > real_dev. The Multiplexing and aggregation header (qmimux_hdr) is
> > stripped
> > off
> > in qmimux_rx_fixup() and the packet is passed on to stack.
> > 
> > You could instead create rmnet devices with the usb netdevice as
> > real dev.
> > The packets from the usb net driver can be queued to network stack
> > directly
> > as rmnet driver will setup a RX handler. rmnet driver will process
> > the
> > packets
> > further and then queue to network stack.
> > 
> 
> in kernel documentation I read that rmnet user space configuration is
> done through librmnetctl available at
> 
> https://source.codeaurora.org/quic/la/platform/vendor/qcom-opensource
> /dataservices/tree/rmnetctl
> 
> However it seems to me that this is a bit outdated (e.g. it does not
> properly build since it is looking for kernel header
> linux/rmnet_data.h that, as far as I understand, is no more present).
> 
> Is there available a more recent version of the tool?

I'd expect that somebody (Subash?) would add support for the
rmnet/qmimux options to iproute2 via 'ip link' like exists for most
other device types.

Dan

> Thanks,
> Daniele
> 
> > --
> > Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
> > a Linux Foundation Collaborative Project

Re: linux-next on x60: network manager often complains "network is disabled" after resume

2018-04-17 Thread Dan Williams

On Sun, 2018-04-15 at 18:16 +0200, Pavel Machek wrote:
> On Mon 2018-03-26 10:33:55, Dan Williams wrote:
> > On Sun, 2018-03-25 at 08:19 +0200, Pavel Machek wrote:
> > > > > > Ok, what does 'nmcli dev' and 'nmcli radio' show?
> > > > > 
> > > > > Broken state.
> > > > > 
> > > > > pavel@amd:~$ nmcli dev
> > > > > DEVICE  TYPE  STATECONNECTION
> > > > > eth1ethernet  unavailable  --
> > > > > lo  loopback  unmanaged--
> > > > > wlan0   wifi  unmanaged--
> > > > 
> > > > If the state is "unmanaged" on resume, that would indicate a
> > > > problem
> > > > with sleep/wake and likely not a kernel network device issue.
> > > > 
> > > > We should probably move this discussion to the NM lists to
> > > > debug
> > > > further.  Before you suspend, run "nmcli gen log level trace"
> > > > to
> > > > turn
> > > > on full debug logging, then reproduce the issue, and send a
> > > > pointer
> > > > to
> > > > those logs (scrubbed for anything you consider sensitive) to
> > > > the NM
> > > > mailing list.
> > > 
> > > Hmm :-)
> > > 
> > > root@amd:/data/pavel# nmcli gen log level trace
> > > Error: Unknown log level 'trace'
> > 
> > What NM version?  'trace' is pretty old (since 1.0 from December
> > 2014)
> > so unless you're using a really, really old version of Debian I'd
> > expect you'd have it.  Anyway, debug would do.
> 
> Hmm.
> 
> pavel@duo:~$ /usr/sbin/NetworkManager --version
> You must be root to run NetworkManager!
> pavel@duo:~$ sudo /usr/sbin/NetworkManager --version
> 0.9.10.0
> 
> So I set the log level, but I still don't see much in the log:
> 
> Apr 14 18:14:29 duo dbus[3009]: [system] Successfully activated
> service 'org.freedesktop.nm_dispatcher'
> Apr 14 18:14:29 duo nm-dispatcher: Dispatching action 'down' for
> wlan1
> Apr 14 18:14:29 duo systemd[1]: Started Network Manager Script
> Dispatcher Service.
> Apr 14 18:14:29 duo systemd-sleep[6853]: Suspending system...

I think if systemd really is handling the suspend/resume, you'll need
to make sure NM has systemd suspend/resume handling enabled as well. 
NM 0.9.10 has build-time support for both:

[--with-suspend-resume=upower|systemd]

while later versions also support ConsoleKit2 and elogind.

Any idea what your NM was compiled with?  Though honestly I'm not sure
how all the suspend/resume stuff is supposed to work these days on
different distros that provide the choice between systemd/upower/pm-
utils/CK2/etc.

Dan

> Apr 14 21:27:53 duo systemd[1]: systemd-journald.service watchdog
> timeout (limit 1min)!
> pavel@duo:~$ date
> Sun Apr 15 12:26:32 CEST 2018
> pavel@duo:~$
> 
> Is it possible that time handling accross suspend changed in v4.17?
> 
> I get some weird effects. With display backlight...
> 
> > > Where do I get the logs? I don't see much in the syslog...
> > > And.. It seems that it is "every other suspend". One resume
> > > results
> > > in
> > > broken network, one in working one, one in broken one...
> > 
> > Does your distro use pm-utils, upower, or systemd for
> > suspend/resume
> > handling?
> 
> upower, I guess:
> 
> pavel@duo:/data/l/linux$ ps aux | grep upower
> root  3820  0.0  0.1  42848  7984 ?Ssl  Apr14   0:01
> /usr/lib/upower/upowerd
> 
>   
> Pavel

Re: linux-next on x60: network manager often complains "network is disabled" after resume

2018-03-26 Thread Dan Williams

On Sun, 2018-03-25 at 08:19 +0200, Pavel Machek wrote:
> > > > Ok, what does 'nmcli dev' and 'nmcli radio' show?
> > > 
> > > Broken state.
> > > 
> > > pavel@amd:~$ nmcli dev
> > > DEVICE  TYPE  STATECONNECTION
> > > eth1ethernet  unavailable  --
> > > lo  loopback  unmanaged--
> > > wlan0   wifi  unmanaged--
> > 
> > If the state is "unmanaged" on resume, that would indicate a
> > problem
> > with sleep/wake and likely not a kernel network device issue.
> > 
> > We should probably move this discussion to the NM lists to debug
> > further.  Before you suspend, run "nmcli gen log level trace" to
> > turn
> > on full debug logging, then reproduce the issue, and send a pointer
> > to
> > those logs (scrubbed for anything you consider sensitive) to the NM
> > mailing list.
> 
> Hmm :-)
> 
> root@amd:/data/pavel# nmcli gen log level trace
> Error: Unknown log level 'trace'

What NM version?  'trace' is pretty old (since 1.0 from December 2014)
so unless you're using a really, really old version of Debian I'd
expect you'd have it.  Anyway, debug would do.

> root@amd:/data/pavel# nmcli gen log level help
> Error: Unknown log level 'help'

nmcli gen help

> root@amd:/data/pavel# nmcli gen log level
> Error: value for 'level' argument is required.
> root@amd:/data/pavel# nmcli gen log level debug

This should be OK.

> root@amd:/data/pavel# cat /var/log/sys/log

It routes it to whatever the syslog 'daemon' facility logs to (however
that's configured on your system).  Usually /var/log/messages or
/var/log/daemon.log or sometimes your distro configures it to
/var/log/NetworkManager.log.

Or if you're using a systemd-based distro, it would probably be in the
systemd journal so "journalctl -b -u NetworkManager"

> Where do I get the logs? I don't see much in the syslog...

> And.. It seems that it is "every other suspend". One resume results
> in
> broken network, one in working one, one in broken one...

Does your distro use pm-utils, upower, or systemd for suspend/resume
handling?

Dan

Re: linux-next on x60: network manager often complains "network is disabled" after resume

2018-03-20 Thread Dan Williams

On Tue, 2018-03-20 at 09:03 +0100, Pavel Machek wrote:
> On Mon 2018-03-19 12:45:49, Dan Williams wrote:
> > On Mon, 2018-03-19 at 18:33 +0100, Pavel Machek wrote:
> > > On Mon 2018-03-19 10:40:08, Dan Williams wrote:
> > > > On Mon, 2018-03-19 at 10:21 +0100, Pavel Machek wrote:
> > > > > On Mon 2018-03-19 05:17:45, Woody Suwalski wrote:
> > > > > > Pavel Machek wrote:
> > > > > > > Hi!
> > > > > > > 
> > > > > > > With recent linux-next, after resume networkmanager often
> > > > > > > claims
> > > > > > > that
> > > > > > > "network is disabled". Sometimes suspend/resume clears
> > > > > > > that.
> > > > > > > 
> > > > > > > Any ideas? Does it work for you?
> > > > > > >   
> > > > > > >   
> > > > > > > Pavel
> > > > > > 
> > > > > > Tried the 4.16-rc6 with nm 1.4.4 - I do not see the issue.
> > > > > 
> > > > > Thanks for testing... but yes, 4.16 should be ok. If not
> > > > > fixed,
> > > > > problem will appear in 4.17-rc1.
> > > > 
> > > > Where does the complaint occur?  In the GUI, or with nmcli, or
> > > > somewhere else?  Also, what's the output of "nmcli dev" after
> > > > resume?
> > > 
> > > In the GUI. I click in place where I'd select access point, and
> > > menu
> > > does not show up, telling me that "network is disabled".
> > 
> > Ok, what does 'nmcli dev' and 'nmcli radio' show?
> 
> Broken state.
> 
> pavel@amd:~$ nmcli dev
> DEVICE  TYPE  STATECONNECTION
> eth1ethernet  unavailable  --
> lo  loopback  unmanaged--
> wlan0   wifi  unmanaged--

If the state is "unmanaged" on resume, that would indicate a problem
with sleep/wake and likely not a kernel network device issue.

We should probably move this discussion to the NM lists to debug
further.  Before you suspend, run "nmcli gen log level trace" to turn
on full debug logging, then reproduce the issue, and send a pointer to
those logs (scrubbed for anything you consider sensitive) to the NM
mailing list.

Dan

> pavel@amd:~$ nmcli radio
> WIFI-HW  WIFI WWAN-HW  WWAN
> enabled  enabled  enabled  enabled
> pavel@amd:~$ uname -a
> Linux amd 4.16.0-rc5-next-20180314+ #31 SMP Thu Mar 15 14:27:19 CET
> 2018 i686 GNU/Linux
> 
> Let me suspend/resume. I was lucky and got it into working state:
> 
> pavel@amd:~$ nmcli dev
> DEVICE  TYPE  STATECONNECTION
> wlan0   wifi  connectedopenwireless.org
> eth1ethernet  unavailable  --
> lo  loopback  unmanaged--
> pavel@amd:~$ nmcli radio
> WIFI-HW  WIFI WWAN-HW  WWAN
> enabled  enabled  enabled  enabled
> paveamd:~$
> 
> Best regards,
>   
> Pavel
>

Re: linux-next on x60: network manager often complains "network is disabled" after resume

2018-03-19 Thread Dan Williams

On Mon, 2018-03-19 at 18:33 +0100, Pavel Machek wrote:
> On Mon 2018-03-19 10:40:08, Dan Williams wrote:
> > On Mon, 2018-03-19 at 10:21 +0100, Pavel Machek wrote:
> > > On Mon 2018-03-19 05:17:45, Woody Suwalski wrote:
> > > > Pavel Machek wrote:
> > > > > Hi!
> > > > > 
> > > > > With recent linux-next, after resume networkmanager often
> > > > > claims
> > > > > that
> > > > > "network is disabled". Sometimes suspend/resume clears that.
> > > > > 
> > > > > Any ideas? Does it work for you?
> > > > >   
> > > > >   
> > > > > Pavel
> > > > 
> > > > Tried the 4.16-rc6 with nm 1.4.4 - I do not see the issue.
> > > 
> > > Thanks for testing... but yes, 4.16 should be ok. If not fixed,
> > > problem will appear in 4.17-rc1.
> > 
> > Where does the complaint occur?  In the GUI, or with nmcli, or
> > somewhere else?  Also, what's the output of "nmcli dev" after
> > resume?
> 
> In the GUI. I click in place where I'd select access point, and menu
> does not show up, telling me that "network is disabled".

Ok, what does 'nmcli dev' and 'nmcli radio' show?

Dan

Re: linux-next on x60: network manager often complains "network is disabled" after resume

2018-03-19 Thread Dan Williams

On Mon, 2018-03-19 at 10:21 +0100, Pavel Machek wrote:
> On Mon 2018-03-19 05:17:45, Woody Suwalski wrote:
> > Pavel Machek wrote:
> > > Hi!
> > > 
> > > With recent linux-next, after resume networkmanager often claims
> > > that
> > > "network is disabled". Sometimes suspend/resume clears that.
> > > 
> > > Any ideas? Does it work for you?
> > >   
> > > Pavel
> > 
> > Tried the 4.16-rc6 with nm 1.4.4 - I do not see the issue.
> 
> Thanks for testing... but yes, 4.16 should be ok. If not fixed,
> problem will appear in 4.17-rc1.

Where does the complaint occur?  In the GUI, or with nmcli, or
somewhere else?  Also, what's the output of "nmcli dev" after resume?

Dan

[PATCH] mpls, nospec: Sanitize array index in mpls_label_ok()

2018-02-07 Thread Dan Williams

mpls_label_ok() validates that the 'platform_label' array index from a
userspace netlink message payload is valid. Under speculation the
mpls_label_ok() result may not resolve in the CPU pipeline until after
the index is used to access an array element. Sanitize the index to zero
to prevent userspace-controlled arbitrary out-of-bounds speculation, a
precursor for a speculative execution side channel vulnerability.

Cc: 
Cc: "David S. Miller" 
Cc: Eric W. Biederman 
Signed-off-by: Dan Williams 
---
 net/mpls/af_mpls.c |   24 ++--
 1 file changed, 14 insertions(+), 10 deletions(-)

diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c
index 8ca9915befc8..aae3565c3a92 100644
--- a/net/mpls/af_mpls.c
+++ b/net/mpls/af_mpls.c
@@ -8,6 +8,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -935,24 +936,27 @@ static int mpls_nh_build_multi(struct mpls_route_config 
*cfg,
return err;
 }
 
-static bool mpls_label_ok(struct net *net, unsigned int index,
+static bool mpls_label_ok(struct net *net, unsigned int *index,
  struct netlink_ext_ack *extack)
 {
+   bool is_ok = true;
+
/* Reserved labels may not be set */
-   if (index < MPLS_LABEL_FIRST_UNRESERVED) {
+   if (*index < MPLS_LABEL_FIRST_UNRESERVED) {
NL_SET_ERR_MSG(extack,
   "Invalid label - must be 
MPLS_LABEL_FIRST_UNRESERVED or higher");
-   return false;
+   is_ok = false;
}
 
/* The full 20 bit range may not be supported. */
-   if (index >= net->mpls.platform_labels) {
+   if (is_ok && *index >= net->mpls.platform_labels) {
NL_SET_ERR_MSG(extack,
   "Label >= configured maximum in 
platform_labels");
-   return false;
+   is_ok = false;
}
 
-   return true;
+   *index = array_index_nospec(*index, net->mpls.platform_labels);
+   return is_ok;
 }
 
 static int mpls_route_add(struct mpls_route_config *cfg,
@@ -975,7 +979,7 @@ static int mpls_route_add(struct mpls_route_config *cfg,
index = find_free_label(net);
}
 
-   if (!mpls_label_ok(net, index, extack))
+   if (!mpls_label_ok(net, &index, extack))
goto errout;
 
/* Append makes no sense with mpls */
@@ -1052,7 +1056,7 @@ static int mpls_route_del(struct mpls_route_config *cfg,
 
index = cfg->rc_label;
 
-   if (!mpls_label_ok(net, index, extack))
+   if (!mpls_label_ok(net, &index, extack))
goto errout;
 
mpls_route_update(net, index, NULL, &cfg->rc_nlinfo);
@@ -1810,7 +1814,7 @@ static int rtm_to_route_config(struct sk_buff *skb,
goto errout;
 
if (!mpls_label_ok(cfg->rc_nlinfo.nl_net,
-  cfg->rc_label, extack))
+  &cfg->rc_label, extack))
goto errout;
break;
}
@@ -2137,7 +2141,7 @@ static int mpls_getroute(struct sk_buff *in_skb, struct 
nlmsghdr *in_nlh,
goto errout;
}
 
-   if (!mpls_label_ok(net, in_label, extack)) {
+   if (!mpls_label_ok(net, &in_label, extack)) {
err = -EINVAL;
goto errout;
}

Re: [PATCH v2 00/19] prevent bounds-check bypass via speculative execution

2018-01-18 Thread Dan Williams

On Thu, Jan 18, 2018 at 5:18 AM, Will Deacon  wrote:
> Hi Dan, Linus,
>
> On Thu, Jan 11, 2018 at 05:41:08PM -0800, Dan Williams wrote:
>> On Thu, Jan 11, 2018 at 5:19 PM, Linus Torvalds
>>  wrote:
>> > On Thu, Jan 11, 2018 at 4:46 PM, Dan Williams  
>> > wrote:
>> >>
>> >> This series incorporates Mark Rutland's latest ARM changes and adds
>> >> the x86 specific implementation of 'ifence_array_ptr'. That ifence
>> >> based approach is provided as an opt-in fallback, but the default
>> >> mitigation, '__array_ptr', uses a 'mask' approach that removes
>> >> conditional branches instructions, and otherwise aims to redirect
>> >> speculation to use a NULL pointer rather than a user controlled value.
>> >
>> > Do you have any performance numbers and perhaps example code
>> > generation? Is this noticeable? Are there any microbenchmarks showing
>> > the difference between lfence use and the masking model?
>>
>> I don't have performance numbers, but here's a sample code generation
>> from __fcheck_files, where the 'and; lea; and' sequence is portion of
>> array_ptr() after the mask generation with 'sbb'.
>>
>> fdp = array_ptr(fdt->fd, fd, fdt->max_fds);
>>  8e7:   8b 02   mov(%rdx),%eax
>>  8e9:   48 39 c7cmp%rax,%rdi
>>  8ec:   48 19 c9sbb%rcx,%rcx
>>  8ef:   48 8b 42 08 mov0x8(%rdx),%rax
>>  8f3:   48 89 femov%rdi,%rsi
>>  8f6:   48 21 ceand%rcx,%rsi
>>  8f9:   48 8d 04 f0 lea(%rax,%rsi,8),%rax
>>  8fd:   48 21 c8and%rcx,%rax
>>
>>
>> > Having both seems good for testing, but wouldn't we want to pick one in 
>> > the end?
>>
>> I was thinking we'd keep it as a 'just in case' sort of thing, at
>> least until the 'probably safe' assumption of the 'mask' approach has
>> more time to settle out.
>
> From the arm64 side, the only concern I have (and this actually applies to
> our CSDB sequence as well) is the calculation of the array size by the
> caller. As Linus mentioned at the end of [1], if the determination of the
> size argument is based on a conditional branch, then masking doesn't help
> because you bound within the wrong range under speculation.
>
> We ran into this when trying to use masking to protect our uaccess routines
> where the conditional bound is either KERNEL_DS or USER_DS. It's possible
> that a prior conditional set_fs(KERNEL_DS) could defeat the masking and so
> we'd need to throw some heavy barriers in set_fs to make it robust.

At least in the conditional mask case near set_fs() usage the approach
we are taking is to use a barrier. I.e. the following guidance from
Linus:

"Basically, the rule is trivial: find all 'stac' users, and use address
masking if those users already integrate the limit check, and lfence
they don't."

...which translates to narrow the pointer for get_user() and use a
barrier  for __get_user().

Re: [PATCH v2 15/19] carl9170: prevent bounds-check bypass via speculative execution

2018-01-12 Thread Dan Williams

On Fri, Jan 12, 2018 at 12:01 PM, Christian Lamparter
 wrote:
> On Friday, January 12, 2018 7:39:50 PM CET Dan Williams wrote:
>> On Fri, Jan 12, 2018 at 6:42 AM, Christian Lamparter  
>> wrote:
>> > On Friday, January 12, 2018 1:47:46 AM CET Dan Williams wrote:
>> >> Static analysis reports that 'queue' may be a user controlled value that
>> >> is used as a data dependency to read from the 'ar9170_qmap' array. In
>> >> order to avoid potential leaks of kernel memory values, block
>> >> speculative execution of the instruction stream that could issue reads
>> >> based on an invalid result of 'ar9170_qmap[queue]'. In this case the
>> >> value of 'ar9170_qmap[queue]' is immediately reused as an index to the
>> >> 'ar->edcf' array.
>> >>
>> >> Based on an original patch by Elena Reshetova.
>> >>
>> >> Cc: Christian Lamparter 
>> >> Cc: Kalle Valo 
>> >> Cc: linux-wirel...@vger.kernel.org
>> >> Cc: netdev@vger.kernel.org
>> >> Signed-off-by: Elena Reshetova 
>> >> Signed-off-by: Dan Williams 
>> >> ---
>> > This patch (and p54, cw1200) look like the same patch?!
>> > Can you tell me what happend to:
>> >
>> > On Saturday, January 6, 2018 5:34:03 PM CET Dan Williams wrote:
>> >> On Sat, Jan 6, 2018 at 6:23 AM, Christian Lamparter  
>> >> wrote:
>> >> > And Furthermore a invalid queue (param->ac) would cause a crash in
>> >> > this line in mac80211 before it even reaches the driver [1]:
>> >> > |   sdata->tx_conf[params->ac] = p;
>> >> > |   
>> >> > |   if (drv_conf_tx(local, sdata, >>>> params->ac <<<<, &p)) {
>> >> > |^^ (this is a wrapper for the *_op_conf_tx)
>> >> >
>> >> > I don't think these chin-up exercises are needed.
>> >>
>> >> Quite the contrary, you've identified a better place in the call stack
>> >> to sanitize the input and disable speculation. Then we can kill the
>> >> whole class of the wireless driver reports at once it seems.
>> > <https://www.spinics.net/lists/netdev/msg476353.html>
>>
>> I didn't see where ac is being validated against the driver specific
>> 'queues' value in that earlier patch.
> The link to the check is right there in the earlier post. It's in
> parse_txq_params():
> <https://github.com/torvalds/linux/blob/master/net/wireless/nl80211.c#L2070>
> |   if (txq_params->ac >= NL80211_NUM_ACS)
> |   return -EINVAL;
>
> NL80211_NUM_ACS is 4
> <http://elixir.free-electrons.com/linux/v4.15-rc7/source/include/uapi/linux/nl80211.h#L3748>
>
> This check was added ever since mac80211's ieee80211_set_txq_params():
> | sdata->tx_conf[params->ac] = p;
>
> For cw1200: the driver just sets the dev->queue to 4.
> In carl9170 dev->queues is set to __AR9170_NUM_TXQ and
> p54 uses P54_QUEUE_AC_NUM.
>
> Both __AR9170_NUM_TXQ and P54_QUEUE_AC_NUM are 4.
> And this is not going to change since all drivers
> have to follow mac80211's queue API:
> <https://wireless.wiki.kernel.org/en/developers/documentation/mac80211/queues>
>
> Some background:
> In the old days (linux 2.6 and early 3.x), the parse_txq_params() function did
> not verify the "queue" value. That's why these drivers had to do it.
>
> Here's the relevant code from 2.6.39:
> <http://elixir.free-electrons.com/linux/v2.6.39/source/net/wireless/nl80211.c#L879>
> You'll notice that the check is missing there.
> Here's mac80211's ieee80211_set_txq_params for reference:
> <http://elixir.free-electrons.com/linux/v2.6.39/source/net/mac80211/cfg.c#L1197>
>
> However over time, the check in the driver has become redundant.
>

Excellent, thank you for pointing that out and the background so clearly.

What this tells me though is that we want to inject an ifence() at
this input validation point, i.e.:

if (txq_params->ac >= NL80211_NUM_ACS) {
ifence();
return -EINVAL;
}

...but the kernel, in these patches, only has ifence() defined for
x86. The only way we can sanitize the 'txq_params->ac' value without
ifence() is to do it at array access time, but then we're stuck
touching all drivers when standard kernel development practice says
'refactor common code out of drivers'.

Ugh, the bigger concern is that this driver is being flagged and not
that initial bounds check. Imagine if cw1200 and p54 had already been
converted to assume that they can just trust 'queue'. It would never
have been flagged.

I think we should focus on the get_user path and  __fcheck_files for v3.

Re: [PATCH v2 10/19] ipv4: prevent bounds-check bypass via speculative execution

2018-01-12 Thread Dan Williams

On Thu, Jan 11, 2018 at 11:59 PM, Greg KH  wrote:
> On Thu, Jan 11, 2018 at 04:47:18PM -0800, Dan Williams wrote:
>> Static analysis reports that 'offset' may be a user controlled value
>> that is used as a data dependency reading from a raw_frag_vec buffer.
>> In order to avoid potential leaks of kernel memory values, block
>> speculative execution of the instruction stream that could issue further
>> reads based on an invalid '*(rfv->c + offset)' value.
>>
>> Based on an original patch by Elena Reshetova.
>
> There is the "Co-Developed-by:" tag now, if you want to use it...

Ok, thanks.

>
>> Cc: "David S. Miller" 
>> Cc: Alexey Kuznetsov 
>> Cc: Hideaki YOSHIFUJI 
>> Cc: netdev@vger.kernel.org
>> Signed-off-by: Elena Reshetova 
>> Signed-off-by: Dan Williams 
>> ---
>>  net/ipv4/raw.c |   10 ++
>>  1 file changed, 6 insertions(+), 4 deletions(-)
>
> Ugh, what is this, the 4th time I've said "I don't think this is an
> issue, so why are you changing this code." to this patch.  To be
> followed by a "oh yeah, you are right, I'll drop it", only to see it
> show back up in the next time this patch series is sent out?
>
> Same for the other patches in this series that I have reviewed 4, maybe
> 5, times already.  The "v2" is not very true here...

The theme of the review feedback on v1 was 'don't put ifence in any
net/ code', and that was addressed.

I honestly thought the new definition of array_ptr() changed the
calculus on this patch. Given the same pattern appears in the ipv6
case, and I have yet to hear that we should drop the ipv6 patch, make
the code symmetric just for readability purposes. Otherwise we need a
comment saying why this is safe for ipv4, but maybe not safe for ipv6,
I think 'array_ptr' is effectively that comment. I.e. 'array_ptr()' is
designed to be low impact for instrumenting false positives. If that
new argument does not hold water I will definitely drop this patch.

Re: [PATCH v2 15/19] carl9170: prevent bounds-check bypass via speculative execution

2018-01-12 Thread Dan Williams

On Fri, Jan 12, 2018 at 6:42 AM, Christian Lamparter  wrote:
> On Friday, January 12, 2018 1:47:46 AM CET Dan Williams wrote:
>> Static analysis reports that 'queue' may be a user controlled value that
>> is used as a data dependency to read from the 'ar9170_qmap' array. In
>> order to avoid potential leaks of kernel memory values, block
>> speculative execution of the instruction stream that could issue reads
>> based on an invalid result of 'ar9170_qmap[queue]'. In this case the
>> value of 'ar9170_qmap[queue]' is immediately reused as an index to the
>> 'ar->edcf' array.
>>
>> Based on an original patch by Elena Reshetova.
>>
>> Cc: Christian Lamparter 
>> Cc: Kalle Valo 
>> Cc: linux-wirel...@vger.kernel.org
>> Cc: netdev@vger.kernel.org
>> Signed-off-by: Elena Reshetova 
>> Signed-off-by: Dan Williams 
>> ---
> This patch (and p54, cw1200) look like the same patch?!
> Can you tell me what happend to:
>
> On Saturday, January 6, 2018 5:34:03 PM CET Dan Williams wrote:
>> On Sat, Jan 6, 2018 at 6:23 AM, Christian Lamparter  
>> wrote:
>> > And Furthermore a invalid queue (param->ac) would cause a crash in
>> > this line in mac80211 before it even reaches the driver [1]:
>> > |   sdata->tx_conf[params->ac] = p;
>> > |   
>> > |   if (drv_conf_tx(local, sdata, >>>> params->ac <<<<, &p)) {
>> > |^^ (this is a wrapper for the *_op_conf_tx)
>> >
>> > I don't think these chin-up exercises are needed.
>>
>> Quite the contrary, you've identified a better place in the call stack
>> to sanitize the input and disable speculation. Then we can kill the
>> whole class of the wireless driver reports at once it seems.
> <https://www.spinics.net/lists/netdev/msg476353.html>

I didn't see where ac is being validated against the driver specific
'queues' value in that earlier patch.

>
> Anyway, I think there's an easy way to solve this: remove the
> "if (queue < ar->hw->queues)" check altogether. It's no longer needed
> anymore as the "queue" value is validated long before the driver code
> gets called.

Can you point me to where that validation happens?

> And from my understanding, this will fix the "In this case
> the value of 'ar9170_qmap[queue]' is immediately reused as an index to
> the 'ar->edcf' array." gripe your tool complains about.
>
> This is here's a quick test-case for carl9170.:
> ---
> diff --git a/drivers/net/wireless/ath/carl9170/main.c 
> b/drivers/net/wireless/ath/carl9170/main.c
> index 988c8857d78c..2d3afb15bb62 100644
> --- a/drivers/net/wireless/ath/carl9170/main.c
> +++ b/drivers/net/wireless/ath/carl9170/main.c
> @@ -1387,13 +1387,8 @@ static int carl9170_op_conf_tx(struct ieee80211_hw *hw,
> int ret;
>
> mutex_lock(&ar->mutex);
> -   if (queue < ar->hw->queues) {
> -   memcpy(&ar->edcf[ar9170_qmap[queue]], param, sizeof(*param));
> -   ret = carl9170_set_qos(ar);
> -   } else {
> -   ret = -EINVAL;
> -   }
> -
> +   memcpy(&ar->edcf[ar9170_qmap[queue]], param, sizeof(*param));
> +   ret = carl9170_set_qos(ar);
> mutex_unlock(&ar->mutex);
> return ret;
>  }
> ---
> What does your tool say about this?

If you take away the 'if' then I don't the tool will report on this.

> (If necessary, the "queue" value could be even sanitized with a
> queue %= ARRAY_SIZE(ar9170_qmap); before the mutex_lock.)

That is what array_ptr() is doing in a more sophisticated way.

Re: [PATCH v2 00/19] prevent bounds-check bypass via speculative execution

2018-01-11 Thread Dan Williams

On Thu, Jan 11, 2018 at 5:19 PM, Linus Torvalds
 wrote:
> On Thu, Jan 11, 2018 at 4:46 PM, Dan Williams  
> wrote:
>>
>> This series incorporates Mark Rutland's latest ARM changes and adds
>> the x86 specific implementation of 'ifence_array_ptr'. That ifence
>> based approach is provided as an opt-in fallback, but the default
>> mitigation, '__array_ptr', uses a 'mask' approach that removes
>> conditional branches instructions, and otherwise aims to redirect
>> speculation to use a NULL pointer rather than a user controlled value.
>
> Do you have any performance numbers and perhaps example code
> generation? Is this noticeable? Are there any microbenchmarks showing
> the difference between lfence use and the masking model?

I don't have performance numbers, but here's a sample code generation
from __fcheck_files, where the 'and; lea; and' sequence is portion of
array_ptr() after the mask generation with 'sbb'.

fdp = array_ptr(fdt->fd, fd, fdt->max_fds);
 8e7:   8b 02   mov(%rdx),%eax
 8e9:   48 39 c7cmp%rax,%rdi
 8ec:   48 19 c9sbb%rcx,%rcx
 8ef:   48 8b 42 08 mov0x8(%rdx),%rax
 8f3:   48 89 femov%rdi,%rsi
 8f6:   48 21 ceand%rcx,%rsi
 8f9:   48 8d 04 f0 lea(%rax,%rsi,8),%rax
 8fd:   48 21 c8and%rcx,%rax


> Having both seems good for testing, but wouldn't we want to pick one in the 
> end?

I was thinking we'd keep it as a 'just in case' sort of thing, at
least until the 'probably safe' assumption of the 'mask' approach has
more time to settle out.

>
> Also, I do think that there is one particular array load that would
> seem to be pretty obvious: the system call function pointer array.
>
> Yes, yes, the actual call is now behind a retpoline, but that protects
> against a speculative BTB access, it's not obvious that it  protects
> against the mispredict of the __NR_syscall_max comparison in
> arch/x86/entry/entry_64.S.
>
> The act of fetching code is a kind of read too. And retpoline protects
> against BTB stuffing etc, but what if the _actual_ system call
> function address is wrong (due to mis-prediction of the system call
> index check)?
>
> Should the array access in entry_SYSCALL_64_fastpath be made to use
> the masking approach?

I'll take a look. I'm firmly in the 'patch first / worry later' stance
on these investigations.

[PATCH v2 10/19] ipv4: prevent bounds-check bypass via speculative execution

2018-01-11 Thread Dan Williams

Static analysis reports that 'offset' may be a user controlled value
that is used as a data dependency reading from a raw_frag_vec buffer.
In order to avoid potential leaks of kernel memory values, block
speculative execution of the instruction stream that could issue further
reads based on an invalid '*(rfv->c + offset)' value.

Based on an original patch by Elena Reshetova.

Cc: "David S. Miller" 
Cc: Alexey Kuznetsov 
Cc: Hideaki YOSHIFUJI 
Cc: netdev@vger.kernel.org
Signed-off-by: Elena Reshetova 
Signed-off-by: Dan Williams 
---
 net/ipv4/raw.c |   10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index 125c1eab3eaa..91091a10294f 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -57,6 +57,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -472,17 +473,18 @@ static int raw_getfrag(void *from, char *to, int offset, 
int len, int odd,
   struct sk_buff *skb)
 {
struct raw_frag_vec *rfv = from;
+   char *rfv_buf;
 
-   if (offset < rfv->hlen) {
+   rfv_buf = array_ptr(rfv->hdr.c, offset, rfv->hlen);
+   if (rfv_buf) {
int copy = min(rfv->hlen - offset, len);
 
if (skb->ip_summed == CHECKSUM_PARTIAL)
-   memcpy(to, rfv->hdr.c + offset, copy);
+   memcpy(to, rfv_buf, copy);
else
skb->csum = csum_block_add(
skb->csum,
-   csum_partial_copy_nocheck(rfv->hdr.c + offset,
- to, copy, 0),
+   csum_partial_copy_nocheck(rfv_buf, to, copy, 0),
odd);
 
odd = 0;

[PATCH v2 15/19] carl9170: prevent bounds-check bypass via speculative execution

2018-01-11 Thread Dan Williams

Static analysis reports that 'queue' may be a user controlled value that
is used as a data dependency to read from the 'ar9170_qmap' array. In
order to avoid potential leaks of kernel memory values, block
speculative execution of the instruction stream that could issue reads
based on an invalid result of 'ar9170_qmap[queue]'. In this case the
value of 'ar9170_qmap[queue]' is immediately reused as an index to the
'ar->edcf' array.

Based on an original patch by Elena Reshetova.

Cc: Christian Lamparter 
Cc: Kalle Valo 
Cc: linux-wirel...@vger.kernel.org
Cc: netdev@vger.kernel.org
Signed-off-by: Elena Reshetova 
Signed-off-by: Dan Williams 
---
 drivers/net/wireless/ath/carl9170/main.c |7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/net/wireless/ath/carl9170/main.c 
b/drivers/net/wireless/ath/carl9170/main.c
index 988c8857d78c..0acfa8c22b7d 100644
--- a/drivers/net/wireless/ath/carl9170/main.c
+++ b/drivers/net/wireless/ath/carl9170/main.c
@@ -41,6 +41,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include "hw.h"
@@ -1384,11 +1385,13 @@ static int carl9170_op_conf_tx(struct ieee80211_hw *hw,
   const struct ieee80211_tx_queue_params *param)
 {
struct ar9170 *ar = hw->priv;
+   const u8 *elem;
int ret;
 
mutex_lock(&ar->mutex);
-   if (queue < ar->hw->queues) {
-   memcpy(&ar->edcf[ar9170_qmap[queue]], param, sizeof(*param));
+   elem = array_ptr(ar9170_qmap, queue, ar->hw->queues);
+   if (elem) {
+   memcpy(&ar->edcf[*elem], param, sizeof(*param));
ret = carl9170_set_qos(ar);
} else {
ret = -EINVAL;

[PATCH v2 16/19] p54: prevent bounds-check bypass via speculative execution

2018-01-11 Thread Dan Williams

Static analysis reports that 'queue' may be a user controlled value that
is used as a data dependency to read from the 'priv->qos_params' array.
In order to avoid potential leaks of kernel memory values, block
speculative execution of the instruction stream that could issue reads
based on an invalid result of 'priv->qos_params[queue]'.

Based on an original patch by Elena Reshetova.

Cc: Christian Lamparter 
Cc: Kalle Valo 
Cc: linux-wirel...@vger.kernel.org
Cc: netdev@vger.kernel.org
Signed-off-by: Elena Reshetova 
Signed-off-by: Dan Williams 
---
 drivers/net/wireless/intersil/p54/main.c |9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/drivers/net/wireless/intersil/p54/main.c 
b/drivers/net/wireless/intersil/p54/main.c
index ab6d39e12069..5ce693ff547e 100644
--- a/drivers/net/wireless/intersil/p54/main.c
+++ b/drivers/net/wireless/intersil/p54/main.c
@@ -20,6 +20,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
@@ -411,12 +412,14 @@ static int p54_conf_tx(struct ieee80211_hw *dev,
   const struct ieee80211_tx_queue_params *params)
 {
struct p54_common *priv = dev->priv;
+   struct p54_edcf_queue_param *p54_q;
int ret;
 
mutex_lock(&priv->conf_mutex);
-   if (queue < dev->queues) {
-   P54_SET_QUEUE(priv->qos_params[queue], params->aifs,
-   params->cw_min, params->cw_max, params->txop);
+   p54_q = array_ptr(priv->qos_params, queue, dev->queues);
+   if (p54_q) {
+   P54_SET_QUEUE(p54_q[0], params->aifs, params->cw_min,
+   params->cw_max, params->txop);
ret = p54_set_edcf(priv);
} else
ret = -EINVAL;

[PATCH v2 18/19] cw1200: prevent bounds-check bypass via speculative execution

2018-01-11 Thread Dan Williams

Static analysis reports that 'queue' may be a user controlled value that
is used as a data dependency to read 'txq_params' from the
'priv->tx_queue_params.params' array.  In order to avoid potential leaks
of kernel memory values, block speculative execution of the instruction
stream that could issue reads based on an invalid value of 'txq_params'.
In this case 'txq_params' is referenced later in the function.

Based on an original patch by Elena Reshetova.

Cc: Solomon Peachy 
Cc: Kalle Valo 
Cc: linux-wirel...@vger.kernel.org
Cc: netdev@vger.kernel.org
Signed-off-by: Elena Reshetova 
Signed-off-by: Dan Williams 
---
 drivers/net/wireless/st/cw1200/sta.c |   11 +++
 drivers/net/wireless/st/cw1200/wsm.h |4 +---
 2 files changed, 8 insertions(+), 7 deletions(-)

diff --git a/drivers/net/wireless/st/cw1200/sta.c 
b/drivers/net/wireless/st/cw1200/sta.c
index 38678e9a0562..7521077e50a4 100644
--- a/drivers/net/wireless/st/cw1200/sta.c
+++ b/drivers/net/wireless/st/cw1200/sta.c
@@ -14,6 +14,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "cw1200.h"
 #include "sta.h"
@@ -612,18 +613,20 @@ int cw1200_conf_tx(struct ieee80211_hw *dev, struct 
ieee80211_vif *vif,
   u16 queue, const struct ieee80211_tx_queue_params *params)
 {
struct cw1200_common *priv = dev->priv;
+   struct wsm_set_tx_queue_params *txq_params;
int ret = 0;
/* To prevent re-applying PM request OID again and again*/
bool old_uapsd_flags;
 
mutex_lock(&priv->conf_mutex);
 
-   if (queue < dev->queues) {
+   txq_params = array_ptr(priv->tx_queue_params.params, queue,
+   dev->queues);
+   if (txq_params) {
old_uapsd_flags = le16_to_cpu(priv->uapsd_info.uapsd_flags);
 
-   WSM_TX_QUEUE_SET(&priv->tx_queue_params, queue, 0, 0, 0);
-   ret = wsm_set_tx_queue_params(priv,
- 
&priv->tx_queue_params.params[queue], queue);
+   WSM_TX_QUEUE_SET(txq_params, 0, 0, 0);
+   ret = wsm_set_tx_queue_params(priv, txq_params, queue);
if (ret) {
ret = -EINVAL;
goto out;
diff --git a/drivers/net/wireless/st/cw1200/wsm.h 
b/drivers/net/wireless/st/cw1200/wsm.h
index 48086e849515..8c8d9191e233 100644
--- a/drivers/net/wireless/st/cw1200/wsm.h
+++ b/drivers/net/wireless/st/cw1200/wsm.h
@@ -1099,10 +1099,8 @@ struct wsm_tx_queue_params {
 };
 
 
-#define WSM_TX_QUEUE_SET(queue_params, queue, ack_policy, allowed_time,\
-   max_life_time)  \
+#define WSM_TX_QUEUE_SET(p, ack_policy, allowed_time, max_life_time)   \
 do {   \
-   struct wsm_set_tx_queue_params *p = &(queue_params)->params[queue]; \
p->ackPolicy = (ack_policy);\
p->allowedMediumTime = (allowed_time);  \
p->maxTransmitLifetime = (max_life_time);   \

[PATCH v2 19/19] net: mpls: prevent bounds-check bypass via speculative execution

2018-01-11 Thread Dan Williams

Static analysis reports that 'index' may be a user controlled value that
is used as a data dependency reading 'rt' from the 'platform_label'
array.  In order to avoid potential leaks of kernel memory values, block
speculative execution of the instruction stream that could issue further
reads based on an invalid 'rt' value.

Based on an original patch by Elena Reshetova.

Eric notes:
"
When val is a pointer not an integer.
Then
array2[val] = y;
/* or */
y = array2[va];

Won't happen.

val->field;

Will happen.

Which looks similar.  However the address space of pointers is too
large.  Making it impossible for an attack to know where to look in
the cache to see if "val->field" happened.  At least on the
assumption that val is an arbitrary value.

Further mpls_forward is small enough the entire scope of "rt" the
value read possibly past the bound check is auditable without too
much trouble.  I have looked and I don't see anything that could
possibly allow the value of "rt" to be exfitrated.  The problem
continuing to be that it is a pointer and the only operation on the
pointer besides derferencing it is testing if it is NULL.

Other types of timing attacks are very hard if not impossible
because any packet presenting with a value outside the bounds check
will be dropped.  So it will hard if not impossible to find
something to time to see how long it took to drop the packet.
"

The motivation of resending this patch despite the NAK is to
continue a community wide discussion on the bar for judging Spectre
changes. I.e. is any user controlled speculative pointer in the
kernel a pointer too far, especially given the current array_ptr()
implementation.

Cc: "David S. Miller" 
Cc: Eric W. Biederman 
Cc: netdev@vger.kernel.org
Signed-off-by: Elena Reshetova 
Signed-off-by: Dan Williams 
---
 net/mpls/af_mpls.c |   12 +++-
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c
index 8ca9915befc8..c92b1033adc2 100644
--- a/net/mpls/af_mpls.c
+++ b/net/mpls/af_mpls.c
@@ -8,6 +8,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -77,12 +78,13 @@ static void rtmsg_lfib(int event, u32 label, struct 
mpls_route *rt,
 static struct mpls_route *mpls_route_input_rcu(struct net *net, unsigned index)
 {
struct mpls_route *rt = NULL;
+   struct mpls_route __rcu **platform_label =
+   rcu_dereference(net->mpls.platform_label);
+   struct mpls_route __rcu **rtp;
 
-   if (index < net->mpls.platform_labels) {
-   struct mpls_route __rcu **platform_label =
-   rcu_dereference(net->mpls.platform_label);
-   rt = rcu_dereference(platform_label[index]);
-   }
+   rtp = array_ptr(platform_label, index, net->mpls.platform_labels);
+   if (rtp)
+   rt = rcu_dereference(*rtp);
return rt;
 }

[PATCH v2 09/19] ipv6: prevent bounds-check bypass via speculative execution

2018-01-11 Thread Dan Williams

Static analysis reports that 'offset' may be a user controlled value
that is used as a data dependency reading from a raw6_frag_vec buffer.
In order to avoid potential leaks of kernel memory values, block
speculative execution of the instruction stream that could issue further
reads based on an invalid '*(rfv->c + offset)' value.

Based on an original patch by Elena Reshetova.

Cc: "David S. Miller" 
Cc: Alexey Kuznetsov 
Cc: Hideaki YOSHIFUJI 
Cc: netdev@vger.kernel.org
Signed-off-by: Elena Reshetova 
Signed-off-by: Dan Williams 
---
 net/ipv6/raw.c |   10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c
index 761a473a07c5..0b7ceeb6f709 100644
--- a/net/ipv6/raw.c
+++ b/net/ipv6/raw.c
@@ -33,6 +33,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #include 
@@ -725,17 +726,18 @@ static int raw6_getfrag(void *from, char *to, int offset, 
int len, int odd,
   struct sk_buff *skb)
 {
struct raw6_frag_vec *rfv = from;
+   char *rfv_buf;
 
-   if (offset < rfv->hlen) {
+   rfv_buf = array_ptr(rfv->c, offset, rfv->hlen);
+   if (rfv_buf) {
int copy = min(rfv->hlen - offset, len);
 
if (skb->ip_summed == CHECKSUM_PARTIAL)
-   memcpy(to, rfv->c + offset, copy);
+   memcpy(to, rfv_buf, copy);
else
skb->csum = csum_block_add(
skb->csum,
-   csum_partial_copy_nocheck(rfv->c + offset,
- to, copy, 0),
+   csum_partial_copy_nocheck(rfv_buf, to, copy, 0),
odd);
 
odd = 0;

[PATCH v2 00/19] prevent bounds-check bypass via speculative execution

2018-01-11 Thread Dan Williams

Changes since v1 [1]:
* fixup the ifence definition to use alternative_2 per recent AMD
  changes in tip/x86/pti (Tom)

* drop 'nospec_ptr' (Linus, Mark)

* rename 'nospec_array_ptr' to 'array_ptr' (Alexei)

* rename 'nospec_barrier' to 'ifence' (Peter, Ingo)

* clean up occasions of 'variable assignment in if()' (Sergei, Stephen)

* make 'array_ptr' use a mask instead of an architectural ifence by
  default (Linus, Alexei)

* provide a command line and compile-time opt-in to the ifence
  mechanism, if an architecture provides 'ifence_array_ptr'.

* provide an optimized mask generation helper, 'array_ptr_mask', for
  x86 (Linus)

* move 'get_user' hardening from '__range_not_ok' to '__uaccess_begin'
  (Linus)

* drop "Thermal/int340x: prevent bounds-check..." since userspace does
  not have arbitrary control over the 'trip' index (Srinivas)

* update the changelog of "net: mpls: prevent bounds-check..." and keep
  it in the series to continue the debate about Spectre hygiene patches.
  (Eric).

* record a reviewed-by from Laurent on "[media] uvcvideo: prevent
  bounds-check..."

* update the cover letter

[1]: https://lwn.net/Articles/743376/

---

Quoting Mark's original RFC:

"Recently, Google Project Zero discovered several classes of attack
against speculative execution. One of these, known as variant-1, allows
explicit bounds checks to be bypassed under speculation, providing an
arbitrary read gadget. Further details can be found on the GPZ blog [2]
and the Documentation patch in this series."

This series incorporates Mark Rutland's latest ARM changes and adds
the x86 specific implementation of 'ifence_array_ptr'. That ifence
based approach is provided as an opt-in fallback, but the default
mitigation, '__array_ptr', uses a 'mask' approach that removes
conditional branches instructions, and otherwise aims to redirect
speculation to use a NULL pointer rather than a user controlled value.

The mask is generated by the following from Alexei, and Linus:

mask = ~(long)(_i | (_s - 1 - _i)) >> (BITS_PER_LONG - 1);

...and Linus provided an optimized mask generation helper for x86:

asm ("cmpq %1,%2; sbbq %0,%0;"
:"=r" (mask)
:"r"(sz),"r" (idx)
:"cc");

The 'array_ptr' mechanism can be switched between 'mask' and 'ifence'
via the spectre_v1={mask,ifence} command line option, and the
compile-time default is set by selecting either CONFIG_SPECTRE1_MASK or
CONFIG_SPECTRE1_IFENCE.

The 'array_ptr' infrastructure is the primary focus this patch set. The
individual patches that perform 'array_ptr' conversions are a point in
time (i.e. earlier kernel, early analysis tooling, x86 only etc...)
start at finding some of these gadgets.

Another consideration for reviewing these patches is the 'hygiene'
argument. When a patch refers to hygiene it is concerned with stopping
speculation on an unconstrained or insufficiently constrained pointer
value under userspace control. That by itself is not sufficient for
attack (per current understanding) [3], but it is a necessary
pre-condition.  So 'hygiene' refers to cleaning up those suspect
pointers regardless of whether they are usable as a gadget.

These patches are also be available via the 'nospec-v2' git branch
here:

git://git.kernel.org/pub/scm/linux/kernel/git/djbw/linux nospec-v2

Note that the BPF fix for Spectre variant1 is merged in the bpf.git
tree [4], and is not included in this branch.

[2]: 
https://googleprojectzero.blogspot.co.uk/2018/01/reading-privileged-memory-with-side.html
[3]: https://spectreattack.com/spectre.pdf
[4]: 
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git/commit/?id=b2157399cc98

---

Dan Williams (16):
  x86: implement ifence()
  x86: implement ifence_array_ptr() and array_ptr_mask()
  asm-generic/barrier: mask speculative execution flows
  x86: introduce __uaccess_begin_nospec and ASM_IFENCE
  x86: use __uaccess_begin_nospec and ASM_IFENCE in get_user paths
  ipv6: prevent bounds-check bypass via speculative execution
  ipv4: prevent bounds-check bypass via speculative execution
  vfs, fdtable: prevent bounds-check bypass via speculative execution
  userns: prevent bounds-check bypass via speculative execution
  udf: prevent bounds-check bypass via speculative execution
  [media] uvcvideo: prevent bounds-check bypass via speculative execution
  carl9170: prevent bounds-check bypass via speculative execution
  p54: prevent bounds-check bypass via speculative execution
  qla2xxx: prevent bounds-check bypass via speculative execution
  cw1200: prevent bounds-che

Re: [PATCH 10/18] qla2xxx: prevent bounds-check bypass via speculative execution

2018-01-11 Thread Dan Williams

On Sat, Jan 6, 2018 at 1:03 AM, Greg KH  wrote:
> On Fri, Jan 05, 2018 at 05:10:48PM -0800, Dan Williams wrote:
>> Static analysis reports that 'handle' may be a user controlled value
>> that is used as a data dependency to read 'sp' from the
>> 'req->outstanding_cmds' array.  In order to avoid potential leaks of
>> kernel memory values, block speculative execution of the instruction
>> stream that could issue reads based on an invalid value of 'sp'. In this
>> case 'sp' is directly dereferenced later in the function.
>
> I'm pretty sure that 'handle' comes from the hardware, not from
> userspace, from what I can tell here.  If we want to start auditing
> __iomem data sources, great!  But that's a bigger task, and one I don't
> think we are ready to tackle...

I think it falls in the hygiene bucket of shutting off an array index
from a source that could be under attacker control. Should we leave
this one un-patched while we decide if we generally have a problem
with trusting completion 'tags' from hardware? My vote is patch it for
now.

Re: [PATCH 00/18] prevent bounds-check bypass via speculative execution

2018-01-11 Thread Dan Williams

On Thu, Jan 11, 2018 at 1:54 AM, Jiri Kosina  wrote:
> On Tue, 9 Jan 2018, Josh Poimboeuf wrote:
>
>> On Tue, Jan 09, 2018 at 11:44:05AM -0800, Dan Williams wrote:
>> > On Tue, Jan 9, 2018 at 11:34 AM, Jiri Kosina  wrote:
>> > > On Fri, 5 Jan 2018, Dan Williams wrote:
>> > >
>> > > [ ... snip ... ]
>> > >> Andi Kleen (1):
>> > >>   x86, barrier: stop speculation for failed access_ok
>> > >>
>> > >> Dan Williams (13):
>> > >>   x86: implement nospec_barrier()
>> > >>   [media] uvcvideo: prevent bounds-check bypass via speculative 
>> > >> execution
>> > >>   carl9170: prevent bounds-check bypass via speculative execution
>> > >>   p54: prevent bounds-check bypass via speculative execution
>> > >>   qla2xxx: prevent bounds-check bypass via speculative execution
>> > >>   cw1200: prevent bounds-check bypass via speculative execution
>> > >>   Thermal/int340x: prevent bounds-check bypass via speculative 
>> > >> execution
>> > >>   ipv6: prevent bounds-check bypass via speculative execution
>> > >>   ipv4: prevent bounds-check bypass via speculative execution
>> > >>   vfs, fdtable: prevent bounds-check bypass via speculative 
>> > >> execution
>> > >>   net: mpls: prevent bounds-check bypass via speculative execution
>> > >>   udf: prevent bounds-check bypass via speculative execution
>> > >>   userns: prevent bounds-check bypass via speculative execution
>> > >>
>> > >> Mark Rutland (4):
>> > >>   asm-generic/barrier: add generic nospec helpers
>> > >>   Documentation: document nospec helpers
>> > >>   arm64: implement nospec_ptr()
>> > >>   arm: implement nospec_ptr()
>> > >
>> > > So considering the recent publication of [1], how come we all of a sudden
>> > > don't need the barriers in ___bpf_prog_run(), namely for LD_IMM_DW and
>> > > LDX_MEM_##SIZEOP, and something comparable for eBPF JIT?
>> > >
>> > > Is this going to be handled in eBPF in some other way?
>> > >
>> > > Without that in place, and considering Jann Horn's paper, it would seem
>> > > like PTI doesn't really lock it down fully, right?
>> >
>> > Here is the latest (v3) bpf fix:
>> >
>> > https://patchwork.ozlabs.org/patch/856645/
>> >
>> > I currently have v2 on my 'nospec' branch and will move that to v3 for
>> > the next update, unless it goes upstream before then.
>
> Daniel, I guess you're planning to send this still for 4.15?

It's pending in the bpf.git tree:


https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git/commit/?id=b2157399cc9

>> That patch seems specific to CONFIG_BPF_SYSCALL.  Is the bpf() syscall
>> the only attack vector?  Or are there other ways to run bpf programs
>> that we should be worried about?
>
> Seems like Alexei is probably the only person in the whole universe who
> isn't CCed here ... let's fix that.

He will be cc'd on v2 of this series which will be available later today.

Re: [PATCH 16/18] net: mpls: prevent bounds-check bypass via speculative execution

2018-01-09 Thread Dan Williams

On Tue, Jan 9, 2018 at 5:57 PM, Alexei Starovoitov
 wrote:
> On Tue, Jan 09, 2018 at 04:48:24PM -0800, Dan Williams wrote:
>>
>> #define __nospec_array_ptr(base, idx, sz)   \
>> ({  \
>> union { typeof(&base[0]) _ptr; unsigned long _bit; } __u;   \
>> unsigned long _i = (idx);   \
>> unsigned long _s = (sz);\
>> unsigned long _v = (long)(_i | _s - 1 - _i) \
>> >> BITS_PER_LONG - 1;   \
>> unsigned long _mask = _v * ~0UL; \
>> OPTIMIZER_HIDE_VAR(_mask);  \
>> __u._ptr = &base[_i & _mask];   \
>> __u._bit &= _mask;  \
>> __u._ptr;   \
>> })
>
> _v * ~0UL doesn't seem right and non intuitive.
> What's wrong with:
>   unsigned long _mask = ~(long)(_i | _s - 1 - _i) >> BITS_PER_LONG - 1;

Yeah, I noticed it was ok immediately after I sent that.

> and why OPTIMIZER_HIDE_VAR ?

It was in Linus' original. but that was when it had the ternary
conditional, I'll drop it. It does not change the generated assembly.

> Could you remove '&' ?

Yes, that should be __u.ptr = base + (i & _mask)

> since in doesn't work for:
> struct {
>   int fd[4];
>   ...
> } *fdt;
> it cannot be used as array_acces(fdt->fd, ...);
>
> Could you please drop nospec_ prefix since it is misleading ?

When you came up with that tweak you noted:

"The following:
[..]
is generic and no speculative flows."

> This macro doesn't prevent speculation.

It masks dangerous speculation. At least, I read nospec as "No
Spectre" and it is a prefix used in the Spectre-v2 patches.

I also want to include the option, with a static branch, to switch it
to the hard "no speculation" version with an ifence if worse comes to
worse and we find a compiler / cpu where it doesn't work. The default
will be the fast and practical implementation.

> I think array_access() was the best name so far.

For other usages I need the pointer to the array element, also
array_access() by itself is unsuitable for __fcheck_files because we
still need rcu_dereference_raw() on the element de-reference. So, I
think it's better to get a sanitized array element pointer which can
be used with rcu, READ_ONCE(), etc... directly rather than try to do
the access in the same macro.

Re: [PATCH 16/18] net: mpls: prevent bounds-check bypass via speculative execution

2018-01-09 Thread Dan Williams

On Tue, Jan 9, 2018 at 4:48 PM, Dan Williams  wrote:
> On Mon, Jan 8, 2018 at 8:13 PM, Linus Torvalds
>  wrote:
>>
>> On Mon, Jan 8, 2018 at 7:42 PM, Dan Williams  
>> wrote:
>> >
>> > originally from Linus and tweaked by Alexei and I:
>>
>> Sadly, that tweak - while clever - is wrong.
>>
>> > unsigned long _mask = ~(long)(_m - 1 - _i) >> BITS_PER_LONG - 1;\
>>
>> Why?
>>
>> Because "(long)(_m-1-_i)" is not negative just because "i >= m". It
>> can still be positive.
>>
>> Think "m = 100", "i=bignum". The subtraction will overflow and become
>> positive again, the shift will shift to zero, and then the mask will
>> become ~0.
>>
>> Now, you can fix it, but you need to be a tiny bit more clever.  In
>> particular, just make sure that you retain the high bit of "_i",
>> basically making the rule be that a negative index is not ever valid.
>>
>> And then instead of "(_m - 1 - _i)", you use "(_i | (_m - 1 - _i))".
>> Now the sign bit is set if _i had it set, _or_ if the subtraction
>> turned negative, and you don't have to worry about the overflow
>> situation.
>>
>> But it does require that extra step to be trustworthy. Still purely
>> cheap arithmetic operations, although there is possibly some
>> additional register pressure there.
>>
>> Somebody might be able to come up with something even more minimal (or
>> find a fault in my fix of your tweak).
>
> I looks like there is another problem, or I'm misreading the
> cleverness. We want the mask to be ~0 in the ok case and 0 in the
> out-of-bounds case. As far as I can see we end up with ~0 in the ok
> case, and ~1 in the bad case. Don't we need to do something like the
> following, at which point are we getting out of the realm of "cheap
> ALU instructions"?
>
> #define __nospec_array_ptr(base, idx, sz)   \
> ({  \
> union { typeof(&base[0]) _ptr; unsigned long _bit; } __u;   \
> unsigned long _i = (idx);   \
> unsigned long _s = (sz);\
> unsigned long _v = (long)(_i | _s - 1 - _i) \
> >> BITS_PER_LONG - 1;   \
> unsigned long _mask = _v * ~0UL; \
> OPTIMIZER_HIDE_VAR(_mask);  \
> __u._ptr = &base[_i & _mask];   \
> __u._bit &= _mask;  \
> __u._ptr;   \
> })

Sorry, I'm slow of course ~(-1L) is 0.

Re: [PATCH 16/18] net: mpls: prevent bounds-check bypass via speculative execution

2018-01-09 Thread Dan Williams

On Tue, Jan 9, 2018 at 4:54 PM, Eric W. Biederman  wrote:
> Dan Williams  writes:
[..]
> Sigh.
> I am not worrying about what is in the speculation window.  My
> assumption is that the speculation window could be infinite, as we don't
> know the limitations of all possible processors.
>
> I am saying there is not a way to get the data out of the speculation
> window.
>
> I was saying that when you have:
> if (x < max)
> val = array1[x];
>
> When val is a pointer not an integer.
> Then
> array2[val] = y;
> /* or */
> y = array2[va];
>
> Won't happen.
>
> val->field;
>
> Will happen.
>
> Which looks similar.  However the address space of pointers is too
> large.  Making it impossible for an attack to know where to look in the
> cache to see if "val->field" happened.  At least on the assumption that
> val is an arbitrary value.
>
> Further mpls_forward is small enough the entire scope of "rt" the value
> read possibly past the bound check is auditable without too much
> trouble.  I have looked and I don't see anything that could possibly
> allow the value of "rt" to be exfitrated.  The problem continuing to be
> that it is a pointer and the only operation on the pointer besides
> derferencing it is testing if it is NULL.
>
> Other types of timing attacks are very hard if not impossible because
> any packet presenting with a value outside the bounds check will be
> dropped.  So it will hard if not impossible to find something to time to
> see how long it took to drop the packet.
>
> So no this code path does not present with the classic signature of
> variant 1: bounds check bypass CVE-2017-5753
>
> Show me where I am wrong and I will gladly take patches.  As it is the
> mpls fast path does not appear vulnerable to the issue you are
> addressing.  So the best thing we can do for performance is nothing at
> all.

That's completely valid. Let's drop this one.

> All I am after is a plausible scenario.  I just want to see it spelled
> out which combinations of things could possibly be a problem.

In fact, I'm not here to spell out the speculative execution problem
in this path beyond the initial user controlled speculative read. As
noted in the cover letter thread, this and the other patches are
simply reports that are fully expected to be resolved as false
positives by sub-system owners in some cases. I'm otherwise more
focused in landing common infrastructure that could be used in the
future as data-flow-analysis tooling improves to find these locations
with higher accuracy. In other words, the goal at the end of this
exercise is to identify a default nospec_array_ptr() implementation
that all sub-systems could accept for when the problem report turns
out to be real, and your pushback has already resulted in good
discussion of alternate nospec_array_ptr() implementations.

Thanks for your patience to keep the conversation going.

Re: [PATCH 16/18] net: mpls: prevent bounds-check bypass via speculative execution

2018-01-09 Thread Dan Williams

On Mon, Jan 8, 2018 at 8:13 PM, Linus Torvalds
 wrote:
>
> On Mon, Jan 8, 2018 at 7:42 PM, Dan Williams  wrote:
> >
> > originally from Linus and tweaked by Alexei and I:
>
> Sadly, that tweak - while clever - is wrong.
>
> > unsigned long _mask = ~(long)(_m - 1 - _i) >> BITS_PER_LONG - 1;\
>
> Why?
>
> Because "(long)(_m-1-_i)" is not negative just because "i >= m". It
> can still be positive.
>
> Think "m = 100", "i=bignum". The subtraction will overflow and become
> positive again, the shift will shift to zero, and then the mask will
> become ~0.
>
> Now, you can fix it, but you need to be a tiny bit more clever.  In
> particular, just make sure that you retain the high bit of "_i",
> basically making the rule be that a negative index is not ever valid.
>
> And then instead of "(_m - 1 - _i)", you use "(_i | (_m - 1 - _i))".
> Now the sign bit is set if _i had it set, _or_ if the subtraction
> turned negative, and you don't have to worry about the overflow
> situation.
>
> But it does require that extra step to be trustworthy. Still purely
> cheap arithmetic operations, although there is possibly some
> additional register pressure there.
>
> Somebody might be able to come up with something even more minimal (or
> find a fault in my fix of your tweak).

I looks like there is another problem, or I'm misreading the
cleverness. We want the mask to be ~0 in the ok case and 0 in the
out-of-bounds case. As far as I can see we end up with ~0 in the ok
case, and ~1 in the bad case. Don't we need to do something like the
following, at which point are we getting out of the realm of "cheap
ALU instructions"?

#define __nospec_array_ptr(base, idx, sz)   \
({  \
union { typeof(&base[0]) _ptr; unsigned long _bit; } __u;   \
unsigned long _i = (idx);   \
unsigned long _s = (sz);\
unsigned long _v = (long)(_i | _s - 1 - _i) \
>> BITS_PER_LONG - 1;   \
unsigned long _mask = _v * ~0UL; \
OPTIMIZER_HIDE_VAR(_mask);  \
__u._ptr = &base[_i & _mask];   \
__u._bit &= _mask;  \
__u._ptr;   \
})

elem = nospec_array_ptr(array, idx, 3);
 106:   b8 02 00 00 00  mov$0x2,%eax
 10b:   48 63 ffmovslq %edi,%rdi
 10e:   48 29 f8sub%rdi,%rax
 111:   48 09 f8or %rdi,%rax
 114:   48 c1 e8 3f shr$0x3f,%rax
 118:   48 21 c7and%rax,%rdi
 11b:   48 8d 54 bc 04  lea0x4(%rsp,%rdi,4),%rdx
 120:   48 21 d0and%rdx,%rax

Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-09 Thread Dan Williams

On Tue, Jan 9, 2018 at 2:23 PM, Josh Poimboeuf  wrote:
> On Tue, Jan 09, 2018 at 01:59:04PM -0800, Dan Williams wrote:
>> > Right, but what's the purpose of preventing speculation past
>> > access_ok()?
>>
>> Caution. It's the same rationale for the nospec_array_ptr() patches.
>> If we, kernel community, can identify any possible speculation past a
>> bounds check we should inject a speculation mitigation. Unless there's
>> a way to be 100% certain that the first unwanted speculation can be
>> turned into a gadget later on in the instruction stream, err on the
>> side of shutting it down early.
>
> I'm all for being cautious.  The nospec_array_ptr() patches are fine,
> and they make sense in light of the variant 1 CVE.
>
> But that still doesn't answer my question.  I haven't seen *any*
> rationale for this patch.  It would be helpful to at least describe
> what's being protected against, even if it's hypothetical.  How can we
> review it if the commit log doesn't describe its purpose?

Certainly the changelog needs improvement, I'll roll these concerns
into v2 and we can go from there.

Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-09 Thread Dan Williams

On Tue, Jan 9, 2018 at 1:49 PM, Josh Poimboeuf  wrote:
> On Tue, Jan 09, 2018 at 01:47:09PM -0800, Dan Williams wrote:
>> On Tue, Jan 9, 2018 at 1:41 PM, Josh Poimboeuf  wrote:
>> > On Fri, Jan 05, 2018 at 06:52:07PM -0800, Linus Torvalds wrote:
>> >> On Fri, Jan 5, 2018 at 5:10 PM, Dan Williams  
>> >> wrote:
>> >> > From: Andi Kleen 
>> >> >
>> >> > When access_ok fails we should always stop speculating.
>> >> > Add the required barriers to the x86 access_ok macro.
>> >>
>> >> Honestly, this seems completely bogus.
>> >>
>> >> The description is pure garbage afaik.
>> >>
>> >> The fact is, we have to stop speculating when access_ok() does *not*
>> >> fail - because that's when we'll actually do the access. And it's that
>> >> access that needs to be non-speculative.
>> >>
>> >> That actually seems to be what the code does (it stops speculation
>> >> when __range_not_ok() returns false, but access_ok() is
>> >> !__range_not_ok()). But the explanation is crap, and dangerous.
>> >
>> > The description also seems to be missing the "why", as it's not
>> > self-evident (to me, at least).
>> >
>> > Isn't this (access_ok/uaccess_begin/ASM_STAC) too early for the lfence?
>> >
>> > i.e., wouldn't the pattern be:
>> >
>> > get_user(uval, uptr);
>> > if (uval < array_size) {
>> > lfence();
>> > foo = a2[a1[uval] * 256];
>> > }
>> >
>> > Shouldn't the lfence come much later, *after* reading the variable and
>> > comparing it and branching accordingly?
>>
>> The goal of putting the lfence in uaccess_begin() is to prevent
>> speculation past access_ok().
>
> Right, but what's the purpose of preventing speculation past
> access_ok()?

Caution. It's the same rationale for the nospec_array_ptr() patches.
If we, kernel community, can identify any possible speculation past a
bounds check we should inject a speculation mitigation. Unless there's
a way to be 100% certain that the first unwanted speculation can be
turned into a gadget later on in the instruction stream, err on the
side of shutting it down early.

Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-09 Thread Dan Williams

On Tue, Jan 9, 2018 at 1:41 PM, Josh Poimboeuf  wrote:
> On Fri, Jan 05, 2018 at 06:52:07PM -0800, Linus Torvalds wrote:
>> On Fri, Jan 5, 2018 at 5:10 PM, Dan Williams  
>> wrote:
>> > From: Andi Kleen 
>> >
>> > When access_ok fails we should always stop speculating.
>> > Add the required barriers to the x86 access_ok macro.
>>
>> Honestly, this seems completely bogus.
>>
>> The description is pure garbage afaik.
>>
>> The fact is, we have to stop speculating when access_ok() does *not*
>> fail - because that's when we'll actually do the access. And it's that
>> access that needs to be non-speculative.
>>
>> That actually seems to be what the code does (it stops speculation
>> when __range_not_ok() returns false, but access_ok() is
>> !__range_not_ok()). But the explanation is crap, and dangerous.
>
> The description also seems to be missing the "why", as it's not
> self-evident (to me, at least).
>
> Isn't this (access_ok/uaccess_begin/ASM_STAC) too early for the lfence?
>
> i.e., wouldn't the pattern be:
>
> get_user(uval, uptr);
> if (uval < array_size) {
> lfence();
> foo = a2[a1[uval] * 256];
> }
>
> Shouldn't the lfence come much later, *after* reading the variable and
> comparing it and branching accordingly?

The goal of putting the lfence in uaccess_begin() is to prevent
speculation past access_ok(). You are correct that the cpu could later
mis-speculate on uval, that's where taint analysis tooling needs to
come into play to track uval to where it is used. That's where the
nospec_array_ptr() patches come into play.

Re: [PATCH 00/18] prevent bounds-check bypass via speculative execution

2018-01-09 Thread Dan Williams

On Tue, Jan 9, 2018 at 11:34 AM, Jiri Kosina  wrote:
> On Fri, 5 Jan 2018, Dan Williams wrote:
>
> [ ... snip ... ]
>> Andi Kleen (1):
>>   x86, barrier: stop speculation for failed access_ok
>>
>> Dan Williams (13):
>>   x86: implement nospec_barrier()
>>   [media] uvcvideo: prevent bounds-check bypass via speculative execution
>>   carl9170: prevent bounds-check bypass via speculative execution
>>   p54: prevent bounds-check bypass via speculative execution
>>   qla2xxx: prevent bounds-check bypass via speculative execution
>>   cw1200: prevent bounds-check bypass via speculative execution
>>   Thermal/int340x: prevent bounds-check bypass via speculative execution
>>   ipv6: prevent bounds-check bypass via speculative execution
>>   ipv4: prevent bounds-check bypass via speculative execution
>>   vfs, fdtable: prevent bounds-check bypass via speculative execution
>>   net: mpls: prevent bounds-check bypass via speculative execution
>>   udf: prevent bounds-check bypass via speculative execution
>>   userns: prevent bounds-check bypass via speculative execution
>>
>> Mark Rutland (4):
>>   asm-generic/barrier: add generic nospec helpers
>>   Documentation: document nospec helpers
>>   arm64: implement nospec_ptr()
>>   arm: implement nospec_ptr()
>
> So considering the recent publication of [1], how come we all of a sudden
> don't need the barriers in ___bpf_prog_run(), namely for LD_IMM_DW and
> LDX_MEM_##SIZEOP, and something comparable for eBPF JIT?
>
> Is this going to be handled in eBPF in some other way?
>
> Without that in place, and considering Jann Horn's paper, it would seem
> like PTI doesn't really lock it down fully, right?

Here is the latest (v3) bpf fix:

https://patchwork.ozlabs.org/patch/856645/

I currently have v2 on my 'nospec' branch and will move that to v3 for
the next update, unless it goes upstream before then.

Re: [PATCH 16/18] net: mpls: prevent bounds-check bypass via speculative execution

2018-01-09 Thread Dan Williams

On Tue, Jan 9, 2018 at 8:17 AM, Eric W. Biederman  wrote:
> Dan Williams  writes:
[..]
> The user controlled value condition of your patchset implies that you
> assume indirect branch predictor poisoning is handled in other ways.
>
> Which means that we can assume speculation will take some variation of
> the static call chain.
>
> Further you are worrying about array accesses.  Which means you
> are worried about attacks that are the equivalent of meltdown that
> can give you reading of all memory available to the kernel.
>
>
> The mpls code in question reads a pointer from memory.
>
>
> The only thing the code does with that pointer is verify it is not NULL
> and dereference it.
>
> That does not make it possible to extricate the pointer bits via a cache
> side-channel as a pointer is 64bits wide.
>
> There might maybe be a timing attack where it is timed how long the
> packet takes to deliver.  If you can find the base address of the array,
> at best such a timeing attack will tell you is if some arbitrary cache
> line is already cached in the kernel.  Which is not the class of attack
> your patchset is worried about.  Further there are more direct ways
> to probe the cache from a local process.
>
> So I submit to you that the mpls code is not vulnerable to the class of
> attack you are addressing.
>
> Further I would argue that anything that reads a pointer from memory is
> a very strong clue that it falls outside the class of code that you are
> addressing.
>
> Show me where I am wrong and I will consider patches.

No, the concern is a second dependent read (or write) within the
speculation window after this first bounds-checked dependent read.
I.e. this mpls code path appears to have the setup condition:

if (x < max)
val = array1[x];

...but it's not clear that there is an exploit condition later on in
the instruction stream:

array2[val] = y;
/* or */
y = array2[val];

My personal paranoia says submit the patch and not worry about finding
that later exploit condition, if DaveM wants to drop the patch that's
his prerogative. In general, with the performance conscious version of
nospec_array_ptr() being the default, why worry about what is / is not
in the speculation window?

Re: [PATCH 16/18] net: mpls: prevent bounds-check bypass via speculative execution

2018-01-08 Thread Dan Williams

On Mon, Jan 8, 2018 at 7:11 PM, Eric W. Biederman  wrote:
> Dan Williams  writes:
>
>> Static analysis reports that 'index' may be a user controlled value that
>> is used as a data dependency reading 'rt' from the 'platform_label'
>> array.  In order to avoid potential leaks of kernel memory values, block
>> speculative execution of the instruction stream that could issue further
>> reads based on an invalid 'rt' value.
>
>
> In detail.
> a) This code is fast path packet forwarding code.  Introducing an
>unconditional pipeline stall is not ok.
>
>AKA either there is no speculation and so this is invulnerable
>or there is speculation and you are creating an unconditional
>pipeline stall here.
>
>My back of the napkin caluculations say that a pipeline stall
>is about 20 cycles.  Which is about the same length of time
>as a modern cache miss.
>
>On a good day this code will perform with 0 cache misses. On a less
>good day 1 cache miss.  Which means you are quite possibly doubling
>the runtime of mpls_forward.
>
> b) The array is dynamically allocated which should provide some
>protection, as it will be more difficult to predict the address
>of the array which is needed to craft an malicious userspace value.
>
> c) The code can be trivially modified to say:
>
> static struct mpls_route *mpls_route_input_rcu(struct net *net, unsigned 
> index)
> {
> struct mpls_route *rt = NULL;
>
> if (index < net->mpls.platform_labels) {
> struct mpls_route __rcu **platform_label =
> rcu_dereference(net->mpls.platform_label);
> rt = rcu_dereference(platform_label[index & ((1 << 20) - 1)]);
> }
> return rt;
> }
>
> AKA a static mask will ensure that there is not a primitive that can be
> used to access all of memory.  That is max a 1 cycle slowdown in the
> code, which is a much better trade off.
>
> d) If we care more it is straight forward to modify
>resize_platform_label_table() to ensure that the size of the array
>is always a power of 2.
>
> e) The fact that a pointer is returned from the array and it is treated
>like a pointer would seem to provide a defense against the
>exfiltration technique of using the value read as an index into
>a small array, that user space code can probe aliased cached
>lines of, to see which value was dereferenced.
>
>
> So to this patch in particular.
> Nacked-by: "Eric W. Biederman" 
>
> This code path will be difficult to exploit.  This change messes with
> performance.  There are ways to make this code path useless while
> preserving the performance of the code.
>

Thanks, Eric understood. The discussion over the weekend  came to the
conclusion that using a mask will be the default approach. The
nospec_array_ptr() will be defined to something similar to the
following, originally from Linus and tweaked by Alexei and I:

#define __nospec_array_ptr(base, idx, sz)   \
({  \
union { typeof(&base[0]) _ptr; unsigned long _bit; } __u;   \
unsigned long _i = (idx);   \
unsigned long _m = (max);   \
unsigned long _mask = ~(long)(_m - 1 - _i) >> BITS_PER_LONG - 1;\
OPTIMIZER_HIDE_VAR(_mask);  \
__u._ptr = &base[_i & _mask];   \
__u._bit &= _mask;  \
__u._ptr;   \
})

Does that address your performance concerns?

Re: [PATCH 07/18] [media] uvcvideo: prevent bounds-check bypass via speculative execution

2018-01-08 Thread Dan Williams

On Mon, Jan 8, 2018 at 3:23 AM, Laurent Pinchart
 wrote:
> Hi Dan,
>
> Thank you for the patch.
>
> On Saturday, 6 January 2018 03:10:32 EET Dan Williams wrote:
>> Static analysis reports that 'index' may be a user controlled value that
>> is used as a data dependency to read 'pin' from the
>> 'selector->baSourceID' array. In order to avoid potential leaks of
>> kernel memory values, block speculative execution of the instruction
>> stream that could issue reads based on an invalid value of 'pin'.
>
> I won't repeat the arguments already made in the thread regarding having
> documented coverity rules for this, even if I agree with them.
>
>> Based on an original patch by Elena Reshetova.
>>
>> Cc: Laurent Pinchart 
>> Cc: Mauro Carvalho Chehab 
>> Cc: linux-me...@vger.kernel.org
>> Signed-off-by: Elena Reshetova 
>> Signed-off-by: Dan Williams 
>> ---
>>  drivers/media/usb/uvc/uvc_v4l2.c |7 +--
>>  1 file changed, 5 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/media/usb/uvc/uvc_v4l2.c
>> b/drivers/media/usb/uvc/uvc_v4l2.c index 3e7e283a44a8..7442626dc20e 100644
>> --- a/drivers/media/usb/uvc/uvc_v4l2.c
>> +++ b/drivers/media/usb/uvc/uvc_v4l2.c
>> @@ -22,6 +22,7 @@
>>  #include 
>>  #include 
>>  #include 
>> +#include 
>>
>>  #include 
>>  #include 
>> @@ -810,6 +811,7 @@ static int uvc_ioctl_enum_input(struct file *file, void
>> *fh, struct uvc_entity *iterm = NULL;
>>   u32 index = input->index;
>>   int pin = 0;
>> + __u8 *elem;
>>
>>   if (selector == NULL ||
>>   (chain->dev->quirks & UVC_QUIRK_IGNORE_SELECTOR_UNIT)) {
>> @@ -820,8 +822,9 @@ static int uvc_ioctl_enum_input(struct file *file, void
>> *fh, break;
>>   }
>>   pin = iterm->id;
>> - } else if (index < selector->bNrInPins) {
>> - pin = selector->baSourceID[index];
>> + } else if ((elem = nospec_array_ptr(selector->baSourceID, index,
>> + selector->bNrInPins))) {
>> + pin = *elem;
>>   list_for_each_entry(iterm, &chain->entities, chain) {
>>   if (!UVC_ENTITY_IS_ITERM(iterm))
>>   continue;
>
> (adding a bit more context)
>
>>   if (iterm->id == pin)
>>   break;
>>   }
>>   }
>>
>>   if (iterm == NULL || iterm->id != pin)
>>   return -EINVAL;
>>
>>   memset(input, 0, sizeof(*input));
>>   input->index = index;
>>   strlcpy(input->name, iterm->name, sizeof(input->name));
>>   if (UVC_ENTITY_TYPE(iterm) == UVC_ITT_CAMERA)
>>   input->type = V4L2_INPUT_TYPE_CAMERA;
>
> So pin is used to search for an entry in the chain->entities list. Entries in
> that list are allocated separately through kmalloc and can thus end up in
> different cache lines, so I agree we have an issue. However, this is mitigated
> by the fact that typical UVC devices have a handful (sometimes up to a dozen)
> entities, so an attacker would only be able to read memory values that are
> equal to the entity IDs used by the device. Entity IDs can be freely allocated
> but typically count continuously from 0. It would take a specially-crafted UVC
> device to be able to read all memory.
>
> On the other hand, as this is nowhere close to being a fast path, I think we
> can close this potential hole as proposed in the patch. So,
>
> Reviewed-by: Laurent Pinchart 

Thanks Laurent!

> Will you merge the whole series in one go, or would you like me to take the
> patch in my tree ? In the latter case I'll wait until the nospec_array_ptr()
> gets merged in mainline.

I'll track it for now. Until the 'nospec_array_ptr()' discussion
resolves there won't be a stabilized commit-id for you to base a
branch.

Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-08 Thread Dan Williams

On Mon, Jan 8, 2018 at 3:44 PM, Linus Torvalds
 wrote:
> On Mon, Jan 8, 2018 at 1:09 PM, Dan Williams  wrote:
>> On Sat, Jan 6, 2018 at 5:20 PM, Linus Torvalds
>>  wrote:
>>> On Sat, Jan 6, 2018 at 3:31 PM, Dan Williams  
>>> wrote:
>>>>
>>>> I assume if we put this in uaccess_begin() we also need audit for
>>>> paths that use access_ok but don't do on to call uaccess_begin()? A
>>>> quick glance shows a few places where we are open coding the stac().
>>>> Perhaps land the lfence in stac() directly?
>>>
>>> Yeah, we should put it in uaccess_begin(), and in the actual user
>>> accessor helpers that do stac. Some of them probably should be changed
>>> to use uaccess_begin() instead while at it.
>>>
>>> One question for the CPU people: do we actually care and need to do
>>> this for things that might *write* to something? The speculative write
>>> obviously is killed, but does it perhaps bring in a cacheline even
>>> when killed?
>>
>> As far as I understand a write could trigger a request-for-ownership
>> read for the target cacheline.
>
> Oh, absolutely.
>
> I just wonder at what point that happens.
>
> Honestly, trying to get exclusive access to a cacheline can be _very_
> expensive (not just for the local thread), so I would actually expect
> that doing so for speculative writes is actually bad for performance.
>
> That's doubly true because - unlike reads - there is no critical
> latency issue, so trying to get the cache access started as early as
> possible simply isn't all that important.
>
> So I suspect that a write won't actually try to allocate the cacheline
> until the write has actually retired.
>
> End result: writes - unlike reads - *probably* will not speculatively
> perturb the cache with speculative write addresses.
>
>> Even though writes can trigger reads, as far as I can see the write
>> needs to be dependent on the first out-of-bounds read
>
> Yeah. A write on its own wouldn't matter, even if it were to perturb
> the cache state, because the address already comes from user space, so
> there's no new information in the cache perturbation for the attacker.
>
> But that all implies that we shouldn't need the lfence for the
> "put_user()" case, only for the get_user() (where the value we read
> would then perhaps be used to do another access).
>
> So we want to add the lfence (or "and") to get_user(), but not
> necessarily put_user().

Yes, perhaps __uaccess_begin_get() and __uaccess_begin_put() to keep
things separate?

> Agreed?

I've been thinking the "and" is only suitable for the array bounds
check, for get_user() we're trying to block speculation past
access_ok() at which point we can only do the lfence?

Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-08 Thread Dan Williams

On Sat, Jan 6, 2018 at 5:20 PM, Linus Torvalds
 wrote:
> On Sat, Jan 6, 2018 at 3:31 PM, Dan Williams  wrote:
>>
>> I assume if we put this in uaccess_begin() we also need audit for
>> paths that use access_ok but don't do on to call uaccess_begin()? A
>> quick glance shows a few places where we are open coding the stac().
>> Perhaps land the lfence in stac() directly?
>
> Yeah, we should put it in uaccess_begin(), and in the actual user
> accessor helpers that do stac. Some of them probably should be changed
> to use uaccess_begin() instead while at it.
>
> One question for the CPU people: do we actually care and need to do
> this for things that might *write* to something? The speculative write
> obviously is killed, but does it perhaps bring in a cacheline even
> when killed?

As far as I understand a write could trigger a request-for-ownership
read for the target cacheline.

> Because maybe we don't need the lfence in put_user(), only in get_user()?

Even though writes can trigger reads, as far as I can see the write
needs to be dependent on the first out-of-bounds read:

if (x < max)
y = array1[x];
put_user(array2 + y, z);

...in other words that first read should be annotated with
nospec_array_ptr() making an lfence in put_user() or other writes
moot.

yp = nospec_array_ptr(array1, x, max);
if (yp)
y = *yp;
put_user(array2 + y, z);

Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-07 Thread Dan Williams

On Sun, Jan 7, 2018 at 11:47 AM, Linus Torvalds
 wrote:
> On Sat, Jan 6, 2018 at 10:33 PM, Willy Tarreau  wrote:
>>
>> To be fair there's overreaction on both sides. The vast majority of
>> users need to get a 100% safe system and will never notice  any
>> difference.
>
> There is no such thing as a "100% safe system". Never will be - unless
> you make sure you have no users.
>
> Also, people definitely *are* noticing the performance issues with the
> current set of patches, and they are causing real problems. Go search
> for reports of Amazon AWS slowdowns.
>
> So this whole "security is so important that performance doesn't
> matter" mindset is pure and utter garbage.
>
> And the whole "normal people won't even notice" is pure garbage too.
> Don't spread that bullshit when you see actual normal people
> complaining.
>
> Performance matters. A *LOT*.

I'm thinking we should provide the option to at least build the
hot-path nospec_array_ptr() usages without an lfence.

CONFIG_SPECTRE1_PARANOIA_SAFE
CONFIG_SPECTRE1_PARANOIA_PERF

...if only for easing performance testing and let the distribution set
its policy.

Where hot-path usages can do:

nospec_relax(nospec_array_ptr())

...to optionally ellide the lfence.

Re: [PATCH 07/18] [media] uvcvideo: prevent bounds-check bypass via speculative execution

2018-01-07 Thread Dan Williams

On Sun, Jan 7, 2018 at 1:09 AM, Greg KH  wrote:
[..]
> Sorry for the confusion, no, I don't mean the "taint tracking", I mean
> the generic pattern of "speculative out of bounds access" that we are
> fixing here.
>
> Yes, as you mentioned before, there are tons of false-positives in the
> tree, as to find the real problems you have to show that userspace
> controls the access index.  But if we have a generic pattern that can
> rewrite that type of logic into one where it does not matter at all
> (i.e. like the ebpf proposed changes), then it would not be an issue if
> they are false or not, we just rewrite them all to be safe.
>
> We need to find some way not only to fix these issues now (like you are
> doing with this series), but to prevent them from every coming back into
> the codebase again.  It's that second part that we need to keep in the
> back of our minds here, while doing the first portion of this work.

I understand the goal, but I'm not sure any of our current annotation
mechanisms are suitable. We have:

__attribute__((noderef, address_space(x)))

...for the '__user' annotation and other pointers that must not be
de-referenced without a specific accessor. We also have:

__attribute__((bitwise))

...for values that should not be consumed directly without a specific
conversion like endian swapping.

The problem is that we need to see if a value derived from a userspace
controlled input is used to trigger a chain of dependent reads. As far
as I can see the annotation would need to be guided by taint analysis
to be useful, at which point we can just "annotate" the problem spot
with nospec_array_ptr(). Otherwise it seems the scope of a
"__nospec_array_index" annotation would have a low signal to noise
ratio.

Stopping speculation past a uacess_begin() boundary appears to handle
a wide swath of potential problems, and the rest likely needs taint
analysis, at least for now.

All that to say, yes, we need better tooling and infrastructure going forward.

Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-06 Thread Dan Williams

On Fri, Jan 5, 2018 at 7:09 PM, Linus Torvalds
 wrote:
> On Fri, Jan 5, 2018 at 6:52 PM, Linus Torvalds
>  wrote:
>>
>> The fact is, we have to stop speculating when access_ok() does *not*
>> fail - because that's when we'll actually do the access. And it's that
>> access that needs to be non-speculative.
>
> I also suspect we should probably do this entirely differently.
>
> Maybe the whole lfence can be part of uaccess_begin() instead (ie
> currently 'stac()'). That would fit the existing structure better, I
> think. And it would avoid any confusion about the whole "when to stop
> speculation".

I assume if we put this in uaccess_begin() we also need audit for
paths that use access_ok but don't do on to call uaccess_begin()? A
quick glance shows a few places where we are open coding the stac().
Perhaps land the lfence in stac() directly?

Re: [PATCH 00/18] prevent bounds-check bypass via speculative execution

2018-01-06 Thread Dan Williams

On Sat, Jan 6, 2018 at 11:37 AM, Dan Williams  wrote:
> On Fri, Jan 5, 2018 at 5:09 PM, Dan Williams  wrote:
>> Quoting Mark's original RFC:
>>
>> "Recently, Google Project Zero discovered several classes of attack
>> against speculative execution. One of these, known as variant-1, allows
>> explicit bounds checks to be bypassed under speculation, providing an
>> arbitrary read gadget. Further details can be found on the GPZ blog [1]
>> and the Documentation patch in this series."
>>
>> This series incorporates Mark Rutland's latest api and adds the x86
>> specific implementation of nospec_barrier. The
>> nospec_{array_ptr,ptr,barrier} helpers are then combined with a kernel
>> wide analysis performed by Elena Reshetova to address static analysis
>> reports where speculative execution on a userspace controlled value
>> could bypass a bounds check. The patches address a precondition for the
>> attack discussed in the Spectre paper [2].
>>
>> A consideration worth noting for reviewing these patches is to weigh the
>> dramatic cost of being wrong about whether a given report is exploitable
>> vs the overhead nospec_{array_ptr,ptr} may introduce. In other words,
>> lets make the bar for applying these patches be "can you prove that the
>> bounds check bypass is *not* exploitable". Consider that the Spectre
>> paper reports one example of a speculation window being ~180 cycles.
>>
>> Note that there is also a proposal from Linus, array_access [3], that
>> attempts to quash speculative execution past a bounds check without
>> introducing an lfence instruction. That may be a future optimization
>> possibility that is compatible with this api, but it would appear to
>> need guarantees from the compiler that it is not clear the kernel can
>> rely on at this point. It is also not clear that it would be a
>> significant performance win vs lfence.
>>
>> These patches also will also be available via the 'nospec' git branch
>> here:
>>
>> git://git.kernel.org/pub/scm/linux/kernel/git/djbw/linux nospec
>
> It appears that git.kernel.org has not mirrored out the new branch. In
> the meantime here's an alternative location:
>
> https://github.com/djbw/linux.git nospec
>
> If there are updates to these patches they will appear in nospec-v2,
> nospec-v3, etc... branches.

For completeness I appended the bpf fix [1] to the git branch.

https://lwn.net/Articles/743288/

Re: [PATCH 00/18] prevent bounds-check bypass via speculative execution

2018-01-06 Thread Dan Williams

On Fri, Jan 5, 2018 at 5:09 PM, Dan Williams  wrote:
> Quoting Mark's original RFC:
>
> "Recently, Google Project Zero discovered several classes of attack
> against speculative execution. One of these, known as variant-1, allows
> explicit bounds checks to be bypassed under speculation, providing an
> arbitrary read gadget. Further details can be found on the GPZ blog [1]
> and the Documentation patch in this series."
>
> This series incorporates Mark Rutland's latest api and adds the x86
> specific implementation of nospec_barrier. The
> nospec_{array_ptr,ptr,barrier} helpers are then combined with a kernel
> wide analysis performed by Elena Reshetova to address static analysis
> reports where speculative execution on a userspace controlled value
> could bypass a bounds check. The patches address a precondition for the
> attack discussed in the Spectre paper [2].
>
> A consideration worth noting for reviewing these patches is to weigh the
> dramatic cost of being wrong about whether a given report is exploitable
> vs the overhead nospec_{array_ptr,ptr} may introduce. In other words,
> lets make the bar for applying these patches be "can you prove that the
> bounds check bypass is *not* exploitable". Consider that the Spectre
> paper reports one example of a speculation window being ~180 cycles.
>
> Note that there is also a proposal from Linus, array_access [3], that
> attempts to quash speculative execution past a bounds check without
> introducing an lfence instruction. That may be a future optimization
> possibility that is compatible with this api, but it would appear to
> need guarantees from the compiler that it is not clear the kernel can
> rely on at this point. It is also not clear that it would be a
> significant performance win vs lfence.
>
> These patches also will also be available via the 'nospec' git branch
> here:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/djbw/linux nospec

It appears that git.kernel.org has not mirrored out the new branch. In
the meantime here's an alternative location:

https://github.com/djbw/linux.git nospec

If there are updates to these patches they will appear in nospec-v2,
nospec-v3, etc... branches.

Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-06 Thread Dan Williams

On Sat, Jan 6, 2018 at 11:25 AM, Alexei Starovoitov
 wrote:
> On Sat, Jan 06, 2018 at 10:54:27AM -0800, Dan Williams wrote:
>> On Sat, Jan 6, 2018 at 10:39 AM, Alexei Starovoitov
>>  wrote:
>> [..]
>> >> retpoline is variant-2, this patch series is about variant-1.
>> >
>> > that's exactly the point. Don't slow down the kernel with lfences
>> > to solve variant 1. retpoline for 2 is ok from long term kernel
>> > viability perspective.
>> >
>>
>> Setting aside that we still need to measure the impact of these
>> changes the end result will still be nospec_array_ptr() sprinkled in
>> various locations. So can we save the debate about what's inside that
>> macro on various architectures and at least proceed with annotating
>> the problematic locations? Perhaps we can go a step further and have a
>> config option to switch between the clever array_access() approach
>> from Linus that might be fine depending on the compiler, and the
>> cpu-vendor-recommended not to speculate implementation of
>> nospec_array_ptr().
>
> recommended by panicing vendors who had no better ideas?
> Ohh, speculation is exploitable, let's stop speculation.
> Instead of fighting it we can safely steer it where it doesn't leak
> kernel data. AND approach is doing exactly that.
> It probably can be made independent of compiler choice to use setbe-like insn.

Right, when that 'probably' is 'certainly' for the architecture you
care about just update the nospec_array_ptr() definition at that
point.

1 2 3 4 5 >

1 - 100 of 414 matches

Mail list logo