date:20171115

Re: [PATCH] net: bridge: add max_fdb_count

2017-11-15 Thread Nikolay Aleksandrov

On 15/11/17 21:27, Sarah Newman wrote:
> Current memory and CPU usage for managing bridge fdb entries is unbounded.
> Add a parameter max_fdb_count, controlled from sysfs, which places an upper
> limit on the number of entries. Defaults to 1024.
> 
> When max_fdb_count is met or exceeded, whether traffic is sent out a
> given port should depend on its flooding behavior.
> 
> This may instead be mitigated by filtering mac address entries in the
> PREROUTING chain of the ebtables nat table, but this is only practical
> when mac addresses are known in advance.
> 

One alternative solution: if limit is the only requirement it could be done
in user-space (even with a shell script) looking at fdb notifications and
if you reach some limit then remove the learning flag from ports, later if
enough expire turn it back on. In fact you can make any policy and if you
catch an offending port - you can disable only its learning flag and leave the
rest.

> Signed-off-by: Sarah Newman 
> ---
>  net/bridge/br_device.c   |  2 ++
>  net/bridge/br_fdb.c  | 25 -
>  net/bridge/br_private.h  |  3 +++
>  net/bridge/br_sysfs_br.c | 24 
>  4 files changed, 49 insertions(+), 5 deletions(-)
> 

As others have mentioned placing a default limit would break existing users,
so this must be a config option that is off (or very large) by default.

> diff --git a/net/bridge/br_device.c b/net/bridge/br_device.c
> index af5b8c8..aa7a7f4 100644
> --- a/net/bridge/br_device.c
> +++ b/net/bridge/br_device.c
> @@ -432,6 +432,8 @@ void br_dev_setup(struct net_device *dev)
>   br->bridge_hello_time = br->hello_time = 2 * HZ;
>   br->bridge_forward_delay = br->forward_delay = 15 * HZ;
>   br->bridge_ageing_time = br->ageing_time = BR_DEFAULT_AGEING_TIME;
> + br->max_fdb_count = 1024;> +br->fdb_count = 0;

No need to zero this explicitly, it is already zero.

>   dev->max_mtu = ETH_MAX_MTU;
>  
>   br_netfilter_rtable_init(br);
> diff --git a/net/bridge/br_fdb.c b/net/bridge/br_fdb.c
> index 4ea5c8b..0422d48 100644
> --- a/net/bridge/br_fdb.c
> +++ b/net/bridge/br_fdb.c
> @@ -179,6 +179,7 @@ static void fdb_delete(struct net_bridge *br, struct 
> net_bridge_fdb_entry *f)
>  
>   hlist_del_init_rcu(>hlist);
>   fdb_notify(br, f, RTM_DELNEIGH);
> + br->fdb_count -= 1;>call_rcu(>rcu, fdb_rcu_free);
>  }
>  
> @@ -478,7 +479,8 @@ int br_fdb_fillbuf(struct net_bridge *br, void *buf,
>   return num;
>  }
>  
> -static struct net_bridge_fdb_entry *fdb_create(struct hlist_head *head,
> +static struct net_bridge_fdb_entry *fdb_create(struct net_bridge *br,
> +struct hlist_head *head,
>  struct net_bridge_port *source,
>  const unsigned char *addr,
>  __u16 vid,
> @@ -498,6 +500,7 @@ static struct net_bridge_fdb_entry *fdb_create(struct 
> hlist_head *head,
>   fdb->added_by_external_learn = 0;
>   fdb->offloaded = 0;
>   fdb->updated = fdb->used = jiffies;
> + br->fdb_count += 1;
>   hlist_add_head_rcu(>hlist, head);
>   }
>   return fdb;
> @@ -524,7 +527,7 @@ static int fdb_insert(struct net_bridge *br, struct 
> net_bridge_port *source,
>   fdb_delete(br, fdb);
>   }
>  
> - fdb = fdb_create(head, source, addr, vid, 1, 1);
> + fdb = fdb_create(br, head, source, addr, vid, 1, 1);
>   if (!fdb)
>   return -ENOMEM;
>  
> @@ -556,6 +559,10 @@ void br_fdb_update(struct net_bridge *br, struct 
> net_bridge_port *source,
>   if (hold_time(br) == 0)
>   return;
>  
> + /* Place maximum on number of learned entries. */
> + if (br->max_fdb_count <= br->fdb_count)
> + return;
> +

checking for fdb_count in the beginning of br_fdb_update would cause stale fdb
entries, that is fdb moves between ports will stop and fdb updates will stop too
(i.e. everything expires)

>   /* ignore packets unless we are using this port */
>   if (!(source->state == BR_STATE_LEARNING ||
> source->state == BR_STATE_FORWARDING))
> @@ -591,7 +598,7 @@ void br_fdb_update(struct net_bridge *br, struct 
> net_bridge_port *source,
>   } else {
>   spin_lock(>hash_lock);
>   if (likely(!fdb_find_rcu(head, addr, vid))) {
> - fdb = fdb_create(head, source, addr, vid, 0, 0);
> + fdb = fdb_create(br, head, source, addr, vid, 0, 0);
>   if (fdb) {
>   if (unlikely(added_by_user))
>   fdb->added_by_user = 1;
> @@ -787,7 +794,7 @@ static int fdb_add_entry(struct net_bridge *br, struct 
> net_bridge_port *source,
>   if (!(flags & NLM_F_CREATE))
>   return -ENOENT;
>  
> -

Re: [PATCH] mac80211: mwl8k: Expand non-DFS 5G channels

2017-11-15 Thread Kalle Valo

Weixiao Zhang  writes:

> Signed-off-by: Weixiao Zhang 

No empty commit logs, please.

And use v2, v3 to clearly mark what's the version of the patch:

https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches#patch_version_missing

>  drivers/net/wireless/marvell/mwl8k.c | 7 ++-
>  1 file changed, 6 insertions(+), 1 deletion(-)

You should use "mac80211:" prefix only with patches touching
net/mac80211. So for this one "mwl8k:" is enough.

-- 
Kalle Valo

Re: [RFC PATCH 00/14] Introducing AF_PACKET V4 support

2017-11-15 Thread Björn Töpel

2017-11-16 4:35 GMT+01:00 Willem de Bruijn :
> On Wed, Nov 15, 2017 at 9:55 PM, Alexei Starovoitov  wrote:
>> On 11/14/17 4:20 AM, Willem de Bruijn wrote:
>>>
>>>
>>> * Limit the scope of the first patchset to Rx only, and introduce Tx
>>>   in a separate patchset.
>>
>>
>>
>> all sounds good to me except above bit.
>> I don't remember people suggesting to split it this way.
>> What's the value of it without tx?
>>
>
> We definitely need Tx for our use-cases! I'll rephrase, so the
> idea was making the initial patch set without Tx *driver*
> specific code, e.g. use ndo_xdp_xmit/flush at a later point.
>
> So AF_ZEROCOPY, the socket parts, would have Tx support.
>
> @John Did I recall that correctly?
>

 Yep, that is what I said. However, on second thought, without the
 driver tx half I guess tx will be significantly slower.
>>>
>>>
>>> The idea was that existing packet rings already send without
>>> copying, so the benefit from device driver changes is not obvious.
>>>
>>> I would leave them out for now and evaluate before possibly
>>> sending a separate patchset.
>>
>>
>> are you suggesting to use new af_zerocopy for rx and old
>> af_packet for tx ? imo that's too cumbersome to use.
>> New interface has to be symmetrical from the start.
>
> No, that tx can be implemented without device driver
> changes. At least initially.
>
> Unlike rx, tx does not need driver support to implement
> copy avoidance, as pf_packet tx_ring already has this.
>
> Having to go through ndo_start_xmit does introduce other
> overhead, notably skb alloc. Perhaps ndo_xdp_xmit is a
> better choice (but I'm not very familiar with that).
>
> If some cost is inherent to a device-independent solution
> and needs driver support to avoid it, then that can be added
> in a follow-on patchset. But this one is large already without
> the i40e tx patch.

Ideally, it would be best not having to introduce yet another xmit
ndo. I believe ndo_xdp_xmit/ndo_xdp_flush would be the best fit, but
we need to extend it with a destructor callback and potentially some
kind of DMA trait. Why DMA? For zerocopy, we know the working set of
packet buffers, so they are DMA mapped up front, whereas ndo_xdp_xmit
does yet another DMA mapping. Paying for the DMA mapping in the
fast-path is something we'd like to avoid.

Re: [PATCH v2 2/2] chcr: Add support for Inline IPSec

2017-11-15 Thread Herbert Xu

On Thu, Nov 16, 2017 at 01:19:52PM +0800, kbuild test robot wrote:
> Hi Atul,
> 
> Thank you for the patch! Yet something to improve:
> 
> [auto build test ERROR on cryptodev/master]
> [also build test ERROR on next-20171115]
> [cannot apply to v4.14]
> [if your patch is applied to the wrong git tree, please drop us a note to 
> help improve the system]
> 
> url:
> https://github.com/0day-ci/linux/commits/Atul-Gupta/cxgb4-Add-support-for-Inline-IPSec-Tx/20171112-012558
> base:   
> https://git.kernel.org/pub/scm/linux/kernel/git/herbert/cryptodev-2.6.git 
> master
> config: x86_64-randconfig-g0-11160917 (attached as .config)
> compiler: gcc-4.9 (Debian 4.9.4-2) 4.9.4
> reproduce:
> # save the attached .config to linux build tree
> make ARCH=x86_64 
> 
> All errors (new ones prefixed by >>):
> 
>drivers/crypto/chelsio/chcr_core.o: In function `chcr_uld_tx_handler':
> >> drivers/crypto/chelsio/chcr_core.c:195: undefined reference to 
> >> `chcr_ipsec_xmit'
>drivers/crypto/chelsio/chcr_core.o: In function `chcr_uld_add':
> >> drivers/crypto/chelsio/chcr_core.c:169: undefined reference to 
> >> `chcr_add_xfrmops'

Atul, can you fix this and resubmit your patches please?

Thanks!
-- 
Email: Herbert Xu <herb...@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [PATCH v3] wcn36xx: Set default BTLE coexistence config

2017-11-15 Thread Bjorn Andersson

On Mon 13 Nov 23:23 PST 2017, Ramon Fried wrote:

> From: Eyal Ilsar 
> 
> If the value for the firmware configuration parameters
> BTC_STATIC_LEN_LE_BT and BTC_STATIC_LEN_LE_WLAN are not set the duty
> cycle between BT and WLAN is such that if BT (including BLE) is active
> WLAN gets 0 bandwidth. When tuning these parameters having a too high
> value for WLAN means that BLE performance degrades.
> The "sweet" point of roughly half of the maximal values was empirically
> found to achieve a balance between BLE and Wi-Fi coexistence
> performance.
> 
> Signed-off-by: Eyal Ilsar 
> Signed-off-by: Ramon Fried 
> ---
>  drivers/net/wireless/ath/wcn36xx/smd.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/wireless/ath/wcn36xx/smd.c 
> b/drivers/net/wireless/ath/wcn36xx/smd.c
> index 9c6590d5348a..1c7598752255 100644
> --- a/drivers/net/wireless/ath/wcn36xx/smd.c
> +++ b/drivers/net/wireless/ath/wcn36xx/smd.c
> @@ -72,8 +72,10 @@ static struct wcn36xx_cfg_val wcn36xx_cfg_vals[] = {
>   WCN36XX_CFG_VAL(DYNAMIC_PS_POLL_VALUE, 0),
>   WCN36XX_CFG_VAL(TX_PWR_CTRL_ENABLE, 1),
>   WCN36XX_CFG_VAL(ENABLE_CLOSE_LOOP, 1),
> - WCN36XX_CFG_VAL(ENABLE_LPWR_IMG_TRANSITION, 0),

I still don't see why you're moving this unrelated line, can you please
help me see what I'm missing?

Thanks,
Bjorn

> + WCN36XX_CFG_VAL(BTC_STATIC_LEN_LE_BT, 12),
> + WCN36XX_CFG_VAL(BTC_STATIC_LEN_LE_WLAN, 3),
>   WCN36XX_CFG_VAL(MAX_ASSOC_LIMIT, 10),
> + WCN36XX_CFG_VAL(ENABLE_LPWR_IMG_TRANSITION, 0),
>   WCN36XX_CFG_VAL(ENABLE_MCC_ADAPTIVE_SCHEDULER, 0),
>  };
>  
> -- 
> The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
> a Linux Foundation Collaborative Project
>

Re: [PATCH] net: bridge: add max_fdb_count

2017-11-15 Thread Roopa Prabhu

On Wed, Nov 15, 2017 at 10:13 PM, Toshiaki Makita
 wrote:
> On 2017/11/16 13:54, Sarah Newman wrote:
>> On 11/15/2017 08:05 PM, Toshiaki Makita wrote:
>>> On 2017/11/16 11:25, Andrew Lunn wrote:
> Also what do the vendors using bridge for L2 offload to switch think?

 The Marvell L2 switches which DSA supports have 8K FDB/MDB entries. So
 maybe 1024 is a bit low?
>>>
>>> How about U32_MAX by default since it is currently not restricted.
>>> (assuming the field will be changed to u32 as per Stephen's feedback).
>>>
>>> Otherwise users may suffer from unexpected behavior change by updating
>>> kernel?
>>>
>>
>> U32_MAX seems like much too high a default to be helpful to a typical user. 
>> How many devices are realistically on a single bridge in the wild? Double
>> that seems like a reasonable default.
>
> I'm suggesting the most unrealistic number to essentially disable the
> restriction by default.
> My understanding is that we put a priority on not to break existing
> users even if the new restriction looks reasonable for most people.

+1 , and yes, 1024 is very low. some of the switches we use support
around 128K FDB entries and we have seen that number increase fairly
quickly in newer generation switches. Default should be no limit to
not break existing users.

Re: [PATCH] net: bridge: add max_fdb_count

2017-11-15 Thread Toshiaki Makita

On 2017/11/16 13:54, Sarah Newman wrote:
> On 11/15/2017 08:05 PM, Toshiaki Makita wrote:
>> On 2017/11/16 11:25, Andrew Lunn wrote:
 Also what do the vendors using bridge for L2 offload to switch think?
>>>
>>> The Marvell L2 switches which DSA supports have 8K FDB/MDB entries. So
>>> maybe 1024 is a bit low?
>>
>> How about U32_MAX by default since it is currently not restricted.
>> (assuming the field will be changed to u32 as per Stephen's feedback).
>>
>> Otherwise users may suffer from unexpected behavior change by updating
>> kernel?
>>
> 
> U32_MAX seems like much too high a default to be helpful to a typical user. 
> How many devices are realistically on a single bridge in the wild? Double
> that seems like a reasonable default.

I'm suggesting the most unrealistic number to essentially disable the
restriction by default.
My understanding is that we put a priority on not to break existing
users even if the new restriction looks reasonable for most people.

-- 
Toshiaki Makita

Re: [PATCH v2 2/2] chcr: Add support for Inline IPSec

2017-11-15 Thread kbuild test robot

Hi Atul,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on cryptodev/master]
[also build test ERROR on next-20171115]
[cannot apply to v4.14]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Atul-Gupta/cxgb4-Add-support-for-Inline-IPSec-Tx/20171112-012558
base:   
https://git.kernel.org/pub/scm/linux/kernel/git/herbert/cryptodev-2.6.git master
config: x86_64-randconfig-g0-11160917 (attached as .config)
compiler: gcc-4.9 (Debian 4.9.4-2) 4.9.4
reproduce:
# save the attached .config to linux build tree
make ARCH=x86_64 

All errors (new ones prefixed by >>):

   drivers/crypto/chelsio/chcr_core.o: In function `chcr_uld_tx_handler':
>> drivers/crypto/chelsio/chcr_core.c:195: undefined reference to 
>> `chcr_ipsec_xmit'
   drivers/crypto/chelsio/chcr_core.o: In function `chcr_uld_add':
>> drivers/crypto/chelsio/chcr_core.c:169: undefined reference to 
>> `chcr_add_xfrmops'

vim +195 drivers/crypto/chelsio/chcr_core.c

   152  
   153  static void *chcr_uld_add(const struct cxgb4_lld_info *lld)
   154  {
   155  struct uld_ctx *u_ctx;
   156  
   157  /* Create the device and add it in the device list */
   158  if (!(lld->ulp_crypto & ULP_CRYPTO_LOOKASIDE))
   159  return ERR_PTR(-EOPNOTSUPP);
   160  
   161  /* Create the device and add it in the device list */
   162  u_ctx = kzalloc(sizeof(*u_ctx), GFP_KERNEL);
   163  if (!u_ctx) {
   164  u_ctx = ERR_PTR(-ENOMEM);
   165  goto out;
   166  }
   167  u_ctx->lldi = *lld;
   168  if (lld->crypto & ULP_CRYPTO_IPSEC_INLINE)
 > 169  chcr_add_xfrmops(lld);
   170  out:
   171  return u_ctx;
   172  }
   173  
   174  int chcr_uld_rx_handler(void *handle, const __be64 *rsp,
   175  const struct pkt_gl *pgl)
   176  {
   177  struct uld_ctx *u_ctx = (struct uld_ctx *)handle;
   178  struct chcr_dev *dev = u_ctx->dev;
   179  const struct cpl_fw6_pld *rpl = (struct cpl_fw6_pld *)rsp;
   180  
   181  if (rpl->opcode != CPL_FW6_PLD) {
   182  pr_err("Unsupported opcode\n");
   183  return 0;
   184  }
   185  
   186  if (!pgl)
   187  work_handlers[rpl->opcode](dev, (unsigned char 
*)[1]);
   188  else
   189  work_handlers[rpl->opcode](dev, pgl->va);
   190  return 0;
   191  }
   192  
   193  int chcr_uld_tx_handler(struct sk_buff *skb, struct net_device *dev)
   194  {
 > 195  return chcr_ipsec_xmit(skb, dev);
   196  }
   197  

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: application/gzip

Re: [PATCH] rtnetlink: fix missing size for IFLA_IF_NETNSID

2017-11-15 Thread Flavio Leitner

On Mon, 6 Nov 2017 16:16:20 +0100
Jiri Benc  wrote:

> On Mon,  6 Nov 2017 15:04:54 +, Colin King wrote:
> > The size for IFLA_IF_NETNSID is missing from the size calculation
> > because the proceeding semicolon was not removed. Fix this by removing
> > the semicolon.  
> 
> Acked-by: Jiri Benc 
> 
> Thanks for spotting this! Looking at my initial code, I had that right,
> this was probably introduced during one of rebases, so I blame
> Flavio :-p (On a serious note, thank you, Flavio, for taking care of
> the rebases.)

That's right, ouch!
Thanks for fixing it.

-- 
Flavio

Re: [PATCH] net: bridge: add max_fdb_count

2017-11-15 Thread Sarah Newman

On 11/15/2017 08:05 PM, Toshiaki Makita wrote:
> On 2017/11/16 11:25, Andrew Lunn wrote:
>>> Also what do the vendors using bridge for L2 offload to switch think?
>>
>> The Marvell L2 switches which DSA supports have 8K FDB/MDB entries. So
>> maybe 1024 is a bit low?
> 
> How about U32_MAX by default since it is currently not restricted.
> (assuming the field will be changed to u32 as per Stephen's feedback).
> 
> Otherwise users may suffer from unexpected behavior change by updating
> kernel?
> 

U32_MAX seems like much too high a default to be helpful to a typical user. How 
many devices are realistically on a single bridge in the wild? Double
that seems like a reasonable default.

Additionally, since the exact limit seems controversial, it can be made a 
configuration parameter.

What about the rest?

--Sarah

[PATCH 08/12] isdn: hisax: Fix pnp_irq's error checking for setup_isurf

2017-11-15 Thread Arvind Yadav

The pnp_irq() function returns -1 if an error occurs.
pnp_irq() error checking for zero is not correct.

Signed-off-by: Arvind Yadav 
---
 drivers/isdn/hisax/isurf.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/isdn/hisax/isurf.c b/drivers/isdn/hisax/isurf.c
index 1399ddd..53e299b 100644
--- a/drivers/isdn/hisax/isurf.c
+++ b/drivers/isdn/hisax/isurf.c
@@ -238,7 +238,7 @@ int setup_isurf(struct IsdnCard *card)
cs->hw.isurf.reset = pnp_port_start(pnp_d, 0);
cs->hw.isurf.phymem = pnp_mem_start(pnp_d, 1);
cs->irq = pnp_irq(pnp_d, 0);
-   if (!cs->irq || !cs->hw.isurf.reset || 
!cs->hw.isurf.phymem) {
+   if (cs->irq == -1 || !cs->hw.isurf.reset || 
!cs->hw.isurf.phymem) {
printk(KERN_ERR "ISurfPnP:some 
resources are missing %d/%x/%lx\n",
   cs->irq, cs->hw.isurf.reset, 
cs->hw.isurf.phymem);
pnp_disable_dev(pnp_d);
-- 
1.9.1

Re: [PATCH 1/2] r8169: reinstate ALDPS for power saving

2017-11-15 Thread Kai Heng Feng

> On 15 Nov 2017, at 6:53 PM, David Miller  wrote:
> 
> From: Kai-Heng Feng 
> Date: Wed, 15 Nov 2017 04:00:18 -0500
> 
>> Commit ("r8169: enable ALDPS for power saving") caused a regression on
>> RTL8168evl/8111evl [1], so it got reverted.
>> 
>> Instead of reverting the whole commit, let's reinstate ALDPS for all
>> models other than RTL8168evl/8111evl.
>> 
>> [1] https://lkml.org/lkml/2013/1/5/36
>> 
>> Cc: Francois Romieu 
>> Cc: Hayes Wang 
>> Cc: Jörg Otte 
>> Signed-off-by: Kai-Heng Feng 
> 
> Sorry, this isn't going to work.
> 
> The problem is that we have no idea what chips this really
> works for.  All we know is that it definitely does not work
> for one particular chip variant.

Another guy tried to enable ASPM before [1], with extra knobs,
which is discouraged.
I want to do the same without the knobs, with the intention not to
re-introduce the regression to RTL8168evl/8111evl.

> The amount of coverage this change is going to get is very small as
> well, meaning an even greater change of regressions.
> 
> Therefore the only acceptable way to handle this is to have
> a white-list, specific chips that have been explicitly tested
> and are known to work with this feature, rather than the other
> way around.

I think a chip-specific whitelist will be quite hard. That requires *both*
Realtek and OEM/ODM to do the test. IIUC, the very same chip can be used
in different Laptops/Motherboards, regression may happen on a very specific 
combination.

A more plausible solution is model specific whitelist. In this case, 
extra knobs are desirable, so users can do the testing themselves before
adding new models to whitelist.

Currently users report that by enabling r8169’s ASPM on Dell Inspiron 7559,
power consumption can be drastically reduced.

> Furthermore, you're not even checking the chip version, you're
> checking instead whether the firmware is loaded or not.  That
> doesn't seem like a safe way to guard this at all.

Hopefully Hayes (or Realtek) can shed more lights on the issue. Apparently
ALDPS and ASPM for r8169 is enabled in different commercial products, just
not in Linux mainline.

Kai-Heng

[1] https://www.spinics.net/lists/netdev/msg403994.html

[PATCH 07/12] isdn: hisax: Handle return value of pnp_irq and pnp_port_start

2017-11-15 Thread Arvind Yadav

pnp_irq() and pnp_port_start() can fail here and we must check
its return value.

Signed-off-by: Arvind Yadav 
---
 drivers/isdn/hisax/hisax_fcpcipnp.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/isdn/hisax/hisax_fcpcipnp.c 
b/drivers/isdn/hisax/hisax_fcpcipnp.c
index e4f7573..7a7137d 100644
--- a/drivers/isdn/hisax/hisax_fcpcipnp.c
+++ b/drivers/isdn/hisax/hisax_fcpcipnp.c
@@ -940,6 +940,8 @@ static int fcpnp_probe(struct pnp_dev *pdev, const struct 
pnp_device_id *dev_id)
}
adapter->io = pnp_port_start(pdev, 0);
adapter->irq = pnp_irq(pdev, 0);
+   if (!adapter->io || adapter->irq == -1)
+   goto err_free;
 
printk(KERN_INFO "hisax_fcpcipnp: found adapter %s at IO %#x irq %d\n",
   (char *) dev_id->driver_data, adapter->io, adapter->irq);
-- 
1.9.1

[PATCH 12/12] isdn: hisax: Fix pnp_irq's error checking for setup_teles3

2017-11-15 Thread Arvind Yadav

The pnp_irq() function returns -1 if an error occurs.
pnp_irq() error checking for zero is not correct.

Signed-off-by: Arvind Yadav 
---
 drivers/isdn/hisax/teles3.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/isdn/hisax/teles3.c b/drivers/isdn/hisax/teles3.c
index 38fb2c1..1eef693 100644
--- a/drivers/isdn/hisax/teles3.c
+++ b/drivers/isdn/hisax/teles3.c
@@ -306,7 +306,7 @@ int setup_teles3(struct IsdnCard *card)
card->para[2] = pnp_port_start(pnp_d, 
1);
card->para[1] = pnp_port_start(pnp_d, 
0);
card->para[0] = pnp_irq(pnp_d, 0);
-   if (!card->para[0] || !card->para[1] || 
!card->para[2]) {
+   if (card->para[0] == -1 || 
!card->para[1] || !card->para[2]) {
printk(KERN_ERR "Teles PnP:some 
resources are missing %ld/%lx/%lx\n",
   card->para[0], 
card->para[1], card->para[2]);
pnp_disable_dev(pnp_d);
-- 
1.9.1

[PATCH 05/12] isdn: hisax: Fix pnp_irq's error checking for setup_hfcsx

2017-11-15 Thread Arvind Yadav

The pnp_irq() function returns -1 if an error occurs.
pnp_irq() error checking for zero is not correct.

Signed-off-by: Arvind Yadav 
---
 drivers/isdn/hisax/hfc_sx.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/isdn/hisax/hfc_sx.c b/drivers/isdn/hisax/hfc_sx.c
index 3aef8e1..99f5a93 100644
--- a/drivers/isdn/hisax/hfc_sx.c
+++ b/drivers/isdn/hisax/hfc_sx.c
@@ -1422,7 +1422,7 @@ int setup_hfcsx(struct IsdnCard *card)
}
card->para[1] = pnp_port_start(pnp_d, 
0);
card->para[0] = pnp_irq(pnp_d, 0);
-   if (!card->para[0] || !card->para[1]) {
+   if (card->para[0] == -1 || 
!card->para[1]) {
printk(KERN_ERR "HFC PnP:some 
resources are missing %ld/%lx\n",
   card->para[0], 
card->para[1]);
pnp_disable_dev(pnp_d);
-- 
1.9.1

[PATCH 09/12] isdn: hisax: Fix pnp_irq's error checking for setup_ix1micro

2017-11-15 Thread Arvind Yadav

The pnp_irq() function returns -1 if an error occurs.
pnp_irq() error checking for zero is not correct.

Signed-off-by: Arvind Yadav 
---
 drivers/isdn/hisax/ix1_micro.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/isdn/hisax/ix1_micro.c b/drivers/isdn/hisax/ix1_micro.c
index 7ae39f5..bfb79f3 100644
--- a/drivers/isdn/hisax/ix1_micro.c
+++ b/drivers/isdn/hisax/ix1_micro.c
@@ -256,7 +256,7 @@ int setup_ix1micro(struct IsdnCard *card)
}
card->para[1] = pnp_port_start(pnp_d, 
0);
card->para[0] = pnp_irq(pnp_d, 0);
-   if (!card->para[0] || !card->para[1]) {
+   if (card->para[0] == -1 || 
!card->para[1]) {
printk(KERN_ERR "ITK PnP:some 
resources are missing %ld/%lx\n",
   card->para[0], 
card->para[1]);
pnp_disable_dev(pnp_d);
-- 
1.9.1

[PATCH 04/12] isdn: hisax: Fix pnp_irq's error checking for setup_elsa_isapnp

2017-11-15 Thread Arvind Yadav

The pnp_irq() function returns -1 if an error occurs.
pnp_irq() error checking for zero is not correct.

Signed-off-by: Arvind Yadav 
---
 drivers/isdn/hisax/elsa.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/isdn/hisax/elsa.c b/drivers/isdn/hisax/elsa.c
index 03bc5d5..f8dcac8 100644
--- a/drivers/isdn/hisax/elsa.c
+++ b/drivers/isdn/hisax/elsa.c
@@ -945,7 +945,7 @@ static int setup_elsa_isapnp(struct IsdnCard *card)
card->para[1] = pnp_port_start(pnp_d, 
0);
card->para[0] = pnp_irq(pnp_d, 0);
 
-   if (!card->para[0] || !card->para[1]) {
+   if (card->para[0] == -1 || 
!card->para[1]) {
printk(KERN_ERR "Elsa PnP:some 
resources are missing %ld/%lx\n",
   card->para[0], 
card->para[1]);
pnp_disable_dev(pnp_d);
-- 
1.9.1

[PATCH 11/12] isdn: hisax: Fix pnp_irq's error checking for setup_sedlbauer_isapnp

2017-11-15 Thread Arvind Yadav

The pnp_irq() function returns -1 if an error occurs.
pnp_irq() error checking for zero is not correct.

Signed-off-by: Arvind Yadav 
---
 drivers/isdn/hisax/sedlbauer.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/isdn/hisax/sedlbauer.c b/drivers/isdn/hisax/sedlbauer.c
index f16a47b..c0b97b8 100644
--- a/drivers/isdn/hisax/sedlbauer.c
+++ b/drivers/isdn/hisax/sedlbauer.c
@@ -558,7 +558,7 @@ static int setup_sedlbauer_isapnp(struct IsdnCard *card, 
int *bytecnt)
card->para[1] = pnp_port_start(pnp_d, 0);
card->para[0] = pnp_irq(pnp_d, 0);
 
-   if (!card->para[0] || !card->para[1]) {
+   if (card->para[0] == -1 || !card->para[1]) {
printk(KERN_ERR "Sedlbauer PnP:some 
resources are missing %ld/%lx\n",
   card->para[0], card->para[1]);
pnp_disable_dev(pnp_d);
-- 
1.9.1

[PATCH 10/12] isdn: hisax: Fix pnp_irq's error checking for setup_niccy

2017-11-15 Thread Arvind Yadav

The pnp_irq() function returns -1 if an error occurs.
pnp_irq() error checking for zero is not correct.

Signed-off-by: Arvind Yadav 
---
 drivers/isdn/hisax/niccy.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/isdn/hisax/niccy.c b/drivers/isdn/hisax/niccy.c
index e4c33cf..dfbcd2e 100644
--- a/drivers/isdn/hisax/niccy.c
+++ b/drivers/isdn/hisax/niccy.c
@@ -261,7 +261,7 @@ int setup_niccy(struct IsdnCard *card)
card->para[1] = pnp_port_start(pnp_d, 0);
card->para[2] = pnp_port_start(pnp_d, 1);
card->para[0] = pnp_irq(pnp_d, 0);
-   if (!card->para[0] || !card->para[1] ||
+   if (card->para[0] == -1 || !card->para[1] ||
!card->para[2]) {
printk(KERN_ERR "NiccyPnP:some resources are "
   "missing %ld/%lx/%lx\n",
-- 
1.9.1

[PATCH 06/12] isdn: hisax: Fix pnp_irq's error checking for setup_hfcs

2017-11-15 Thread Arvind Yadav

The pnp_irq() function returns -1 if an error occurs.
pnp_irq() error checking for zero is not correct.

Signed-off-by: Arvind Yadav 
---
 drivers/isdn/hisax/hfcscard.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/isdn/hisax/hfcscard.c b/drivers/isdn/hisax/hfcscard.c
index 4672870..3cbde72 100644
--- a/drivers/isdn/hisax/hfcscard.c
+++ b/drivers/isdn/hisax/hfcscard.c
@@ -195,7 +195,7 @@ int setup_hfcs(struct IsdnCard *card)
}
card->para[1] = pnp_port_start(pnp_d, 
0);
card->para[0] = pnp_irq(pnp_d, 0);
-   if (!card->para[0] || !card->para[1]) {
+   if (card->para[0] == -1 || 
!card->para[1]) {
printk(KERN_ERR "HFC PnP:some 
resources are missing %ld/%lx\n",
   card->para[0], 
card->para[1]);
pnp_disable_dev(pnp_d);
-- 
1.9.1

[PATCH 01/12] isdn: hisax: Fix pnp_irq's error checking for setup_asuscom

2017-11-15 Thread Arvind Yadav

The pnp_irq() function returns -1 if an error occurs.
pnp_irq() error checking for zero is not correct.

Signed-off-by: Arvind Yadav 
---
 drivers/isdn/hisax/asuscom.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/isdn/hisax/asuscom.c b/drivers/isdn/hisax/asuscom.c
index 62f9c43..74c8714 100644
--- a/drivers/isdn/hisax/asuscom.c
+++ b/drivers/isdn/hisax/asuscom.c
@@ -348,7 +348,7 @@ int setup_asuscom(struct IsdnCard *card)
}
card->para[1] = pnp_port_start(pnp_d, 
0);
card->para[0] = pnp_irq(pnp_d, 0);
-   if (!card->para[0] || !card->para[1]) {
+   if (card->para[0] == -1 || 
!card->para[1]) {
printk(KERN_ERR "AsusPnP:some 
resources are missing %ld/%lx\n",
   card->para[0], 
card->para[1]);
pnp_disable_dev(pnp_d);
-- 
1.9.1

[PATCH 03/12] isdn: hisax: Fix pnp_irq's error checking for setup_diva_isapnp

2017-11-15 Thread Arvind Yadav

The pnp_irq() function returns -1 if an error occurs.
pnp_irq() error checking for zero is not correct.

Signed-off-by: Arvind Yadav 
---
 drivers/isdn/hisax/diva.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/isdn/hisax/diva.c b/drivers/isdn/hisax/diva.c
index 3fc94e7..bed749c 100644
--- a/drivers/isdn/hisax/diva.c
+++ b/drivers/isdn/hisax/diva.c
@@ -1094,7 +1094,7 @@ static int setup_diva_isapnp(struct IsdnCard *card)
}
card->para[1] = pnp_port_start(pnp_d, 0);
card->para[0] = pnp_irq(pnp_d, 0);
-   if (!card->para[0] || !card->para[1]) {
+   if (card->para[0] == -1 || !card->para[1]) {
printk(KERN_ERR "Diva PnP:some 
resources are missing %ld/%lx\n",
   card->para[0], card->para[1]);
pnp_disable_dev(pnp_d);
-- 
1.9.1

[PATCH 00/12] isdn: hisax: Fix pnp_irq's error checking

2017-11-15 Thread Arvind Yadav

The pnp_irq() function returns -1 if an error occurs.
pnp_irq() error checking for zero is not correct.

Arvind Yadav (12):
  [PATCH 01/12] isdn: hisax: Fix pnp_irq's error checking for setup_asuscom
  [PATCH 02/12] isdn: hisax: Fix pnp_irq's error checking for avm_pnp_setup
  [PATCH 03/12] isdn: hisax: Fix pnp_irq's error checking for setup_diva_isapnp
  [PATCH 04/12] isdn: hisax: Fix pnp_irq's error checking for setup_elsa_isapnp
  [PATCH 05/12] isdn: hisax: Fix pnp_irq's error checking for setup_hfcsx
  [PATCH 06/12] isdn: hisax: Fix pnp_irq's error checking for setup_hfcs
  [PATCH 07/12] isdn: hisax: Handle return value of pnp_irq and pnp_port_start
  [PATCH 08/12] isdn: hisax: Fix pnp_irq's error checking for setup_isurf
  [PATCH 09/12] isdn: hisax: Fix pnp_irq's error checking for setup_ix1micro
  [PATCH 10/12] isdn: hisax: Fix pnp_irq's error checking for setup_niccy
  [PATCH 11/12] isdn: hisax: Fix pnp_irq's error checking for 
setup_sedlbauer_isapnp
  [PATCH 12/12] isdn: hisax: Fix pnp_irq's error checking for setup_teles3

 drivers/isdn/hisax/asuscom.c| 2 +-
 drivers/isdn/hisax/avm_pci.c| 2 +-
 drivers/isdn/hisax/diva.c   | 2 +-
 drivers/isdn/hisax/elsa.c   | 2 +-
 drivers/isdn/hisax/hfc_sx.c | 2 +-
 drivers/isdn/hisax/hfcscard.c   | 2 +-
 drivers/isdn/hisax/hisax_fcpcipnp.c | 2 ++
 drivers/isdn/hisax/isurf.c  | 2 +-
 drivers/isdn/hisax/ix1_micro.c  | 2 +-
 drivers/isdn/hisax/niccy.c  | 2 +-
 drivers/isdn/hisax/sedlbauer.c  | 2 +-
 drivers/isdn/hisax/teles3.c | 2 +-
 12 files changed, 13 insertions(+), 11 deletions(-)

-- 
1.9.1

[PATCH 02/12] isdn: hisax: Fix pnp_irq's error checking for avm_pnp_setup

2017-11-15 Thread Arvind Yadav

The pnp_irq() function returns -1 if an error occurs.
pnp_irq() error checking for zero is not correct.

Signed-off-by: Arvind Yadav 
---
 drivers/isdn/hisax/avm_pci.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/isdn/hisax/avm_pci.c b/drivers/isdn/hisax/avm_pci.c
index daf3742..a18b605 100644
--- a/drivers/isdn/hisax/avm_pci.c
+++ b/drivers/isdn/hisax/avm_pci.c
@@ -805,7 +805,7 @@ static int avm_pnp_setup(struct IsdnCardState *cs)
cs->hw.avm.cfg_reg =
pnp_port_start(pnp_avm_d, 0);
cs->irq = pnp_irq(pnp_avm_d, 0);
-   if (!cs->irq) {
+   if (cs->irq == -1) {
printk(KERN_ERR "FritzPnP:No IRQ\n");
return (0);
}
-- 
1.9.1

[PATCH net] net/sctp: Always set scope_id in sctp_inet6_skb_msgname

2017-11-15 Thread Eric W. Biederman


Alexandar Potapenko while testing the kernel with KMSAN and syzkaller
discovered that in some configurations sctp would leak 4 bytes of
kernel stack.

Working with his reproducer I discovered that those 4 bytes that
are leaked is the scope id of an ipv6 address returned by recvmsg.

With a little code inspection and a shrewd guess I discovered that
sctp_inet6_skb_msgname only initializes the scope_id field for link
local ipv6 addresses to the interface index the link local address
pertains to instead of initializing the scope_id field for all ipv6
addresses.

That is almost reasonable as scope_id's are meaniningful only for link
local addresses.  Set the scope_id in all other cases to 0 which is
not a valid interface index to make it clear there is nothing useful
in the scope_id field.

There should be no danger of breaking userspace as the stack leak
guaranteed that previously meaningless random data was being returned.

Cc: sta...@vger.kernel.org
Fixes: 372f525b495c ("SCTP:  Resync with LKSCTP tree.")
History-tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
Reported-by: Alexander Potapenko 
Tested-by: Alexander Potapenko 
Signed-off-by: "Eric W. Biederman" 
---
 net/sctp/ipv6.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/sctp/ipv6.c b/net/sctp/ipv6.c
index a6dfa86c0201..3b18085e3b10 100644
--- a/net/sctp/ipv6.c
+++ b/net/sctp/ipv6.c
@@ -807,9 +807,10 @@ static void sctp_inet6_skb_msgname(struct sk_buff *skb, 
char *msgname,
addr->v6.sin6_flowinfo = 0;
addr->v6.sin6_port = sh->source;
addr->v6.sin6_addr = ipv6_hdr(skb)->saddr;
-   if (ipv6_addr_type(>v6.sin6_addr) & IPV6_ADDR_LINKLOCAL) {
+   if (ipv6_addr_type(>v6.sin6_addr) & IPV6_ADDR_LINKLOCAL)
addr->v6.sin6_scope_id = sctp_v6_skb_iif(skb);
-   }
+   else
+   addr->v6.sin6_scope_id = 0;
}
 
*addr_len = sctp_v6_addr_to_user(sctp_sk(skb->sk), addr);
-- 
2.14.1

NETDEV WATCHDOG: eth0 (dwc-eth-dwmac): transmit queue 1 timed out

2017-11-15 Thread Bhadram Varka

Hi,

I am trying to enable multi-queue in Tegra186 EQOS (which has support for 4 
channels). Observed below netdev watchdog warning. Its easily reproable with 
iperf test. 
In normal ping scenario this is not observed. I did not observe any issue if we 
disable TSO. Looks like issue in stmmac_tso_xmit() in multi-channel scenario.

[   88.801672] NETDEV WATCHDOG: eth0 (dwc-eth-dwmac): transmit queue 0 timed out
[   88.808818] [ cut here ]
[   88.813435] WARNING: CPU: 5 PID: 0 at net/sched/sch_generic.c:320 
dev_watchdog+0x2cc/0x2d8
[   88.821681] Modules linked in: dwmac_dwc_qos_eth stmmac_platform crc32_ce 
crct10dif_ce stmmac ip_tables x_tables ipv6
[   88.832290] CPU: 5 PID: 0 Comm: swapper/5 Tainted: G S  
4.14.0-rc7-01956-g9395db5-dirty #21
[   88.841663] Hardware name: NVIDIA Tegra186 P2771- Development Board (DT)
[   88.848697] task: 8001ec8fd400 task.stack: 09e38000
[   88.854606] PC is at dev_watchdog+0x2cc/0x2d8
[   88.858952] LR is at dev_watchdog+0x2cc/0x2d8
[   88.863300] pc : [] lr : [] pstate: 
2145
[   88.870678] sp : 0802bd80
[   88.873983] x29: 0802bd80 x28: 00a0
[   88.879287] x27:  x26: 8001eae2c3b0
[   88.884589] x25: 0005 x24: 8001ecb6be80
[   88.889891] x23: 8001eae2c39c x22: 8001eae2bfb0
[   88.895192] x21: 8001eae2c000 x20: 08fe7000
[   88.900493] x19: 0001 x18: 0010
[   88.905795] x17:  x16: 
[   88.911098] x15:  x14: 756f2064656d6974
[   88.916399] x13: 2031206575657571 x12: 08fe9df0
[   88.921699] x11: 08586180 x10: 642d6874652d6377
[   88.927000] x9 : 0016 x8 : 3a474f4448435441
[   88.932301] x7 : 572056454454454e x6 : 014f
[   88.937602] x5 : 0020 x4 : 
[   88.942902] x3 :  x2 : 08fec4c0
[   88.948203] x1 : 8001ec8fd400 x0 : 0041
[   88.953504] Call trace:
[   88.955944] Exception stack(0x0802bc40 to 0x0802bd80)
[   88.962371] bc40: 0041 8001ec8fd400 08fec4c0 

[   88.970184] bc60:  0020 014f 
572056454454454e
[   88.977998] bc80: 3a474f4448435441 0016 642d6874652d6377 
08586180
[   88.985811] bca0: 08fe9df0 2031206575657571 756f2064656d6974 

[   88.993624] bcc0:   0010 
0001
[   89.001439] bce0: 08fe7000 8001eae2c000 8001eae2bfb0 
8001eae2c39c
[   89.009252] bd00: 8001ecb6be80 0005 8001eae2c3b0 

[   89.017065] bd20: 00a0 0802bd80 0894a76c 
0802bd80
[   89.024879] bd40: 0894a76c 2145 00b67570 
0001
[   89.032693] bd60: 0001 8001ecb6b200 0802bd80 
0894a76c
[   89.040508] [] dev_watchdog+0x2cc/0x2d8
[   89.045900] [] call_timer_fn.isra.5+0x24/0x80
[   89.051809] [] expire_timers+0xa4/0xb0
[   89.057111] [] run_timer_softirq+0x140/0x170
[   89.062933] [] __do_softirq+0x12c/0x228
[   89.068323] [] irq_exit+0xd0/0x108
[   89.073278] [] __handle_domain_irq+0x60/0xb8
[   89.079098] [] gic_handle_irq+0x58/0xa8
[   89.084484] Exception stack(0x09e3be20 to 0x09e3bf60)
[   89.090910] be20:   0001 

[   89.098724] be40:  09e3bf60 8001ecffd000 
0001
[   89.106537] be60: 0002 09e3bee0 0a00 

[   89.114351] be80: 0001  001c3dfbd9959589 
1daf5b7a4860
[   89.122164] bea0: 0825b000  c0311284 
08fc5000
[   89.129978] bec0: 08fe9000 08fe9000 08fd04a0 
08fe9e90
[   89.137792] bee0:   8001ec8fd400 

[   89.145605] bf00:  09e3bf60 0808548c 
09e3bf60
[   89.153418] bf20: 08085490 0145  

[   89.161231] bf40:  081409c4 09e3bf60 
08085490
[   89.169044] [] el1_irq+0xb0/0x124
[   89.173912] [] arch_cpu_idle+0x10/0x18
[   89.179213] [] do_idle+0x120/0x1e0
[   89.184166] [] cpu_startup_entry+0x24/0x28
[   89.189814] [] secondary_start_kernel+0x110/0x120
[   89.196067] ---[ end trace 039d403d63546b77 ]---
 
Below are the DT changes -

diff --git a/arch/arm64/boot/dts/nvidia/tegra186.dtsi 
b/arch/arm64/boot/dts/nvidia/tegra186.dtsi
index 0b0552c..ffe1b80 100644
--- a/arch/arm64/boot/dts/nvidia/tegra186.dtsi
+++ b/arch/arm64/boot/dts/nvidia/tegra186.dtsi
@@ -27,21 +27,40 @@
#gpio-cells = <2>;
gpio-controller;
};
+
+   mtl_tx_setup: tx-queues-config {
+

Re: [PATCH] net: bridge: add max_fdb_count

2017-11-15 Thread Toshiaki Makita

On 2017/11/16 11:25, Andrew Lunn wrote:
>> Also what do the vendors using bridge for L2 offload to switch think?
> 
> The Marvell L2 switches which DSA supports have 8K FDB/MDB entries. So
> maybe 1024 is a bit low?

How about U32_MAX by default since it is currently not restricted.
(assuming the field will be changed to u32 as per Stephen's feedback).

Otherwise users may suffer from unexpected behavior change by updating
kernel?

-- 
Toshiaki Makita

Re: [RFC PATCH 00/14] Introducing AF_PACKET V4 support

2017-11-15 Thread Willem de Bruijn

On Wed, Nov 15, 2017 at 9:55 PM, Alexei Starovoitov  wrote:
> On 11/14/17 4:20 AM, Willem de Bruijn wrote:
>>
>>
>> * Limit the scope of the first patchset to Rx only, and introduce Tx
>>   in a separate patchset.
>
>
>
> all sounds good to me except above bit.
> I don't remember people suggesting to split it this way.
> What's the value of it without tx?
>

 We definitely need Tx for our use-cases! I'll rephrase, so the
 idea was making the initial patch set without Tx *driver*
 specific code, e.g. use ndo_xdp_xmit/flush at a later point.

 So AF_ZEROCOPY, the socket parts, would have Tx support.

 @John Did I recall that correctly?

>>>
>>> Yep, that is what I said. However, on second thought, without the
>>> driver tx half I guess tx will be significantly slower.
>>
>>
>> The idea was that existing packet rings already send without
>> copying, so the benefit from device driver changes is not obvious.
>>
>> I would leave them out for now and evaluate before possibly
>> sending a separate patchset.
>
>
> are you suggesting to use new af_zerocopy for rx and old
> af_packet for tx ? imo that's too cumbersome to use.
> New interface has to be symmetrical from the start.

No, that tx can be implemented without device driver
changes. At least initially.

Unlike rx, tx does not need driver support to implement
copy avoidance, as pf_packet tx_ring already has this.

Having to go through ndo_start_xmit does introduce other
overhead, notably skb alloc. Perhaps ndo_xdp_xmit is a
better choice (but I'm not very familiar with that).

If some cost is inherent to a device-independent solution
and needs driver support to avoid it, then that can be added
in a follow-on patchset. But this one is large already without
the i40e tx patch.

[PATCH] mac80211: mwl8k: Expand non-DFS 5G channels

2017-11-15 Thread Weixiao Zhang

Signed-off-by: Weixiao Zhang 
---
 drivers/net/wireless/marvell/mwl8k.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/net/wireless/marvell/mwl8k.c 
b/drivers/net/wireless/marvell/mwl8k.c
index e813b2ca740c..8e4e9b6919e0 100644
--- a/drivers/net/wireless/marvell/mwl8k.c
+++ b/drivers/net/wireless/marvell/mwl8k.c
@@ -199,7 +199,7 @@ struct mwl8k_priv {
struct ieee80211_channel channels_24[14];
struct ieee80211_rate rates_24[13];
struct ieee80211_supported_band band_50;
-   struct ieee80211_channel channels_50[4];
+   struct ieee80211_channel channels_50[9];
struct ieee80211_rate rates_50[8];
u32 ap_macids_supported;
u32 sta_macids_supported;
@@ -383,6 +383,11 @@ static const struct ieee80211_channel mwl8k_channels_50[] 
= {
{ .band = NL80211_BAND_5GHZ, .center_freq = 5200, .hw_value = 40, },
{ .band = NL80211_BAND_5GHZ, .center_freq = 5220, .hw_value = 44, },
{ .band = NL80211_BAND_5GHZ, .center_freq = 5240, .hw_value = 48, },
+   { .band = NL80211_BAND_5GHZ, .center_freq = 5745, .hw_value = 149, },
+   { .band = NL80211_BAND_5GHZ, .center_freq = 5765, .hw_value = 153, },
+   { .band = NL80211_BAND_5GHZ, .center_freq = 5785, .hw_value = 157, },
+   { .band = NL80211_BAND_5GHZ, .center_freq = 5805, .hw_value = 161, },
+   { .band = NL80211_BAND_5GHZ, .center_freq = 5825, .hw_value = 165, },
 };
 
 static const struct ieee80211_rate mwl8k_rates_50[] = {
-- 
2.11.0

[PATCH] fealnx: Fix building error on MIPS

2017-11-15 Thread Huacai Chen

This patch try to fix the building error on MIPS. The reason is MIPS
has already defined the LONG macro, which conflicts with the LONG enum
in drivers/net/ethernet/fealnx.c.

Cc: sta...@vger.kernel.org
Signed-off-by: Huacai Chen 
---
 drivers/net/ethernet/fealnx.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/fealnx.c b/drivers/net/ethernet/fealnx.c
index 2305391..ae55da6 100644
--- a/drivers/net/ethernet/fealnx.c
+++ b/drivers/net/ethernet/fealnx.c
@@ -257,8 +257,8 @@ enum rx_desc_status_bits {
RXFSD = 0x0800, /* first descriptor */
RXLSD = 0x0400, /* last descriptor */
ErrorSummary = 0x80,/* error summary */
-   RUNT = 0x40,/* runt packet received */
-   LONG = 0x20,/* long packet received */
+   RUNTPKT = 0x40, /* runt packet received */
+   LONGPKT = 0x20, /* long packet received */
FAE = 0x10, /* frame align error */
CRC = 0x08, /* crc error */
RXER = 0x04,/* receive error */
@@ -1628,7 +1628,7 @@ static int netdev_rx(struct net_device *dev)
   dev->name, rx_status);
 
dev->stats.rx_errors++; /* end of a packet. */
-   if (rx_status & (LONG | RUNT))
+   if (rx_status & (LONGPKT | RUNTPKT))
dev->stats.rx_length_errors++;
if (rx_status & RXER)
dev->stats.rx_frame_errors++;
-- 
2.7.0

Re: [PATCH] net: bridge: add max_fdb_count

2017-11-15 Thread Andrew Lunn

> @@ -556,6 +559,10 @@ void br_fdb_update(struct net_bridge *br, struct 
> net_bridge_port *source,
>   if (hold_time(br) == 0)
>   return;
>  
> + /* Place maximum on number of learned entries. */
> + if (br->max_fdb_count <= br->fdb_count)
> + return;
> +

Hi Sarah

We probably want a rate limited warning printed here. If the table is
full, it will have to start flooding frames, which is not good for
performance. Having something in the log will help debug the problem.
The log will also give a hit that a DoS attack could be happening.

Andrew

Re: [RFC PATCH 00/14] Introducing AF_PACKET V4 support

2017-11-15 Thread Alexei Starovoitov


On 11/14/17 4:20 AM, Willem de Bruijn wrote:


* Limit the scope of the first patchset to Rx only, and introduce Tx
  in a separate patchset.



all sounds good to me except above bit.
I don't remember people suggesting to split it this way.
What's the value of it without tx?



We definitely need Tx for our use-cases! I'll rephrase, so the
idea was making the initial patch set without Tx *driver*
specific code, e.g. use ndo_xdp_xmit/flush at a later point.

So AF_ZEROCOPY, the socket parts, would have Tx support.

@John Did I recall that correctly?



Yep, that is what I said. However, on second thought, without the
driver tx half I guess tx will be significantly slower.


The idea was that existing packet rings already send without
copying, so the benefit from device driver changes is not obvious.

I would leave them out for now and evaluate before possibly
sending a separate patchset.


are you suggesting to use new af_zerocopy for rx and old
af_packet for tx ? imo that's too cumbersome to use.
New interface has to be symmetrical from the start.

Re: [RFC PATCH net-next 0/2] Configuring PFC stall prevention via ethtool

2017-11-15 Thread Andrew Lunn

> What do other vendors support? Time? Number of pause frames sent?

So i checked a few Marvell Switches. You can also specify a time. It
is a little bit more complex than that, since the units of time depend
on the link speed. But converting a time in ms to what the register
wants is possible.

So i'm thinking rather than a poorly defined 'Auto', passing a time
would be better.

  Andrew

Re: [PATCH] net: bridge: add max_fdb_count

2017-11-15 Thread Andrew Lunn

> Also what do the vendors using bridge for L2 offload to switch think?

The Marvell L2 switches which DSA supports have 8K FDB/MDB entries. So
maybe 1024 is a bit low?

How big is an FDB entry? Even an OpenWRT/LEDE class devices with 512MB
RAM can probably support 1MB of RAM for FDB entries.

> Many switches need to clone table, and similar limits must be in
> other versions.

DSA does not maintain synchronisation between the hardware tables and
the software tables. So no cloning. We expect both bridges to perform
learning based on the frames. This means we should be able to handle
one bridge flooding because its table is full, while the other has an
entry and forwards out the correct port.

 Andrew

Re: [PATCH] net: ethernet: ti: cpsw: fix min eth packet size

2017-11-15 Thread David Miller

From: Grygorii Strashko 
Date: Wed, 15 Nov 2017 09:46:35 -0600

> Now CPSW driver configures min eth packet size to 60 octets (ETH_ZLEN)
> which works in most of cases, but when port VLAN is configured on some
> switch port, it also can be configured to force all egress packets to be
> VLAN untagged. And in this case, CPSW driver will pad small packets to 60
> octets, but final packet size on port egress can became less than 60 octets
> due to VLAN tag removal and packet will be dropped.
> 
> Hence, fix it by accounting VLAN header in CPSW min eth packet size. While
> here, use proper defines for CPSW_MAX_PACKET_SIZE also, instead of open
> coding.
> 
> Signed-off-by: Grygorii Strashko 

Applied.

Re: [net-next 1/1] tipc: enforce valid ratio between skb truesize and contents

2017-11-15 Thread David Miller

From: Jon Maloy 
Date: Wed, 15 Nov 2017 21:23:56 +0100

> The socket level flow control is based on the assumption that incoming
> buffers meet the condition (skb->truesize / roundup(skb->len) <= 4),
> where the latter value is rounded off upwards to the nearest 1k number.
> This does empirically hold true for the device drivers we know, but we
> cannot trust that it will always be so, e.g., in a system with jumbo
> frames and very small packets.
> 
> We now introduce a check for this condition at packet arrival, and if
> we find it to be false, we copy the packet to a new, smaller buffer,
> where the condition will be true. We expect this to affect only a small
> fraction of all incoming packets, if at all.
> 
> Acked-by: Ying Xue 
> Signed-off-by: Jon Maloy 

Applied.

Re: [PATCH net v2] hv_netvsc: preserve hw_features on mtu/channels/ringparam changes

2017-11-15 Thread David Miller

From: Vitaly Kuznetsov 
Date: Wed, 15 Nov 2017 15:12:55 +0100

> rndis_filter_device_add() is called both from netvsc_probe() when we
> initially create the device and from set channels/mtu/ringparam
> routines where we basically remove the device and add it back.
> 
> hw_features is reset in rndis_filter_device_add() and filled with
> host data. However, we lose all additional flags which are set outside
> of the driver, e.g. register_netdevice() adds NETIF_F_SOFT_FEATURES and
> many others.
> 
> Unfortunately, calls to rndis_{query_hwcaps(), _set_offload_params()}
> calls cannot be avoided on every RNDIS reset: host expects us to set
> required features explicitly. Moreover, in theory hardware capabilities
> can change and we need to reflect the change in hw_features.
> 
> Reset net->hw_features bits according to host data in
> rndis_netdev_set_hwcaps(), clear corresponding feature bits
> from net->features in case some features went missing (will never happen
> in real life I guess but let's be consistent).
> 
> Signed-off-by: Vitaly Kuznetsov 

Applied.

Re: [PATCH net-next] netfilter: add ifdef around ctnetlink_proto_size

2017-11-15 Thread David Miller

From: Pablo Neira Ayuso 
Date: Wed, 15 Nov 2017 18:28:21 +0100

> From: Arnd Bergmann 
> 
> This function is no longer marked 'inline', so we now get a warning
> when it is unused:
> 
> net/netfilter/nf_conntrack_netlink.c:536:15: error: 'ctnetlink_proto_size' 
> defined but not used [-Werror=unused-function]
> 
> We could mark it inline again, mark it __maybe_unused, or add an #ifdef
> around the definition. I'm picking the third approach here since that
> seems to be what the rest of the file has.
> 
> Fixes: 5caaed151a68 ("netfilter: conntrack: don't cache nlattr_tuple_size 
> result in nla_size")
> Signed-off-by: Arnd Bergmann 
> Signed-off-by: Pablo Neira Ayuso 

Applied.

Re: [PATCH] qed: use kzalloc instead of kmalloc and memset

2017-11-15 Thread David Miller

From: Colin King 
Date: Wed, 15 Nov 2017 14:03:20 +

> From: Colin Ian King 
> 
> Replace kmalloc followed by a memset with kzalloc
> 
> Signed-off-by: Colin Ian King 

Applied.

Re: [net-next v2] ipv6: sr: update the struct ipv6_sr_hdr

2017-11-15 Thread David Miller

From: Ahmed Abdelsalam 
Date: Wed, 15 Nov 2017 15:34:23 +0100

> The IPv6 Segment Routing Header (SRH) format has been updated (revision 6
> of the SRH ietf draft). The update includes the following SRH fields:
> 
> (1) The "First Segment" field changed to be "Last Entry" which contains
> the index, in the Segment List, of the last element of the Segment List.
> 
> (2) The 16 bit "reserved" field now is used as a "tag" which tags a packet
> as part of a class or group of packets, e.g.,packets sharing the same
> set of properties.
> 
> This patch updates the struct ipv6_sr_hdr, so it complies with the updated
> SRH draft. The 16 bit "reserved" field is changed to be "tag", In addition
> a comment is added to the "first_segment" field, showing that it represents
> the "Last Entry" field of the SRH.
> 
> Signed-off-by: Ahmed Abdelsalam 

Applied.

Re: [PATCH net] genetlink: fix genlmsg_nlhdr()

2017-11-15 Thread David Miller

From: Michal Kubecek 
Date: Wed, 15 Nov 2017 13:09:32 +0100 (CET)

> According to the description, first argument of genlmsg_nlhdr() points to
> what genlmsg_put() returns, i.e. beginning of user header. Therefore we
> should only subtract size of genetlink header and netlink message header,
> not user header.
> 
> This also means we don't need to pass the pointer to genetlink family and
> the same is true for genl_dump_check_consistent() which is the only caller
> of genlmsg_nlhdr(). (Note that at the moment, these functions are only
> used for families which do not have user header so that they are not
> affected.)
> 
> Fixes: 670dc2833d14 ("netlink: advertise incomplete dumps")
> Signed-off-by: Michal Kubecek 

Applied.

Re: [PATCHv2 net] sctp: check stream reset info len before making reconf chunk

2017-11-15 Thread David Miller

From: Xin Long 
Date: Wed, 15 Nov 2017 17:00:11 +0800

> Now when resetting stream, if both in and out flags are set, the info
> len can reach:
>   sizeof(struct sctp_strreset_outreq) + SCTP_MAX_STREAM(65535) +
>   sizeof(struct sctp_strreset_inreq)  + SCTP_MAX_STREAM(65535)
> even without duplicated stream no, this value is far greater than the
> chunk's max size.
> 
> _sctp_make_chunk doesn't do any check for this, which would cause the
> skb it allocs is huge, syzbot even reported a crash due to this.
> 
> This patch is to check stream reset info len before making reconf
> chunk and return EINVAL if the len exceeds chunk's capacity.
> 
> Thanks Marcelo and Neil for making this clear.
> 
> v1->v2:
>   - move the check into sctp_send_reset_streams instead.
> 
> Fixes: cc16f00f6529 ("sctp: add support for generating stream reconf ssn 
> reset request chunk")
> Reported-by: Dmitry Vyukov 
> Signed-off-by: Xin Long 

Applied.

Re: [PATCHv2 net] sctp: use the right sk after waking up from wait_buf sleep

2017-11-15 Thread David Miller

From: Xin Long 
Date: Wed, 15 Nov 2017 16:57:26 +0800

> Commit dfcb9f4f99f1 ("sctp: deny peeloff operation on asocs with threads
> sleeping on it") fixed the race between peeloff and wait sndbuf by
> checking waitqueue_active(>wait) in sctp_do_peeloff().
> 
> But it actually doesn't work, as even if waitqueue_active returns false
> the waiting sndbuf thread may still not yet hold sk lock. After asoc is
> peeled off, sk is not asoc->base.sk any more, then to hold the old sk
> lock couldn't make assoc safe to access.
> 
> This patch is to fix this by changing to hold the new sk lock if sk is
> not asoc->base.sk, meanwhile, also set the sk in sctp_sendmsg with the
> new sk.
> 
> With this fix, there is no more race between peeloff and waitbuf, the
> check 'waitqueue_active' in sctp_do_peeloff can be removed.
> 
> Thanks Marcelo and Neil for making this clear.
> 
> v1->v2:
>   fix it by changing to lock the new sock instead of adding a flag in asoc.
> 
> Suggested-by: Neil Horman 
> Signed-off-by: Xin Long 

Applied.

Re: [PATCHv2 net] sctp: do not free asoc when it is already dead in sctp_sendmsg

2017-11-15 Thread David Miller

From: Xin Long 
Date: Wed, 15 Nov 2017 16:55:54 +0800

> Now in sctp_sendmsg sctp_wait_for_sndbuf could schedule out without
> holding sock sk. It means the current asoc can be freed elsewhere,
> like when receiving an abort packet.
> 
> If the asoc is just created in sctp_sendmsg and sctp_wait_for_sndbuf
> returns err, the asoc will be freed again due to new_asoc is not nil.
> An use-after-free issue would be triggered by this.
> 
> This patch is to fix it by setting new_asoc with nil if the asoc is
> already dead when cpu schedules back, so that it will not be freed
> again in sctp_sendmsg.
> 
> v1->v2:
>   set new_asoc as nil in sctp_sendmsg instead of sctp_wait_for_sndbuf.
> 
> Suggested-by: Neil Horman 
> Reported-by: Dmitry Vyukov 
> Signed-off-by: Xin Long 

Applied.

Re: [RFC PATCH 01/14] packet: introduce AF_PACKET V4 userspace API

2017-11-15 Thread David Miller

From: chet l 
Date: Wed, 15 Nov 2017 14:34:32 -0800

> I have not reviewed the entire patchset but I think if we could add a
> version_hdr and then unionize the fields, it might be easier to add
> SVM support without having to spin v5. I could be wrong though.

Please, NO VERSION FIELDS!

Design things properly from the start rather than using a crutch of
being able to "adjust things later".

Re: [GIT] Networking

2017-11-15 Thread David Miller

From: Daniel Borkmann 
Date: Wed, 15 Nov 2017 23:15:20 +0100

> On 11/15/2017 09:19 PM, Linus Torvalds wrote:
>> On Wed, Nov 15, 2017 at 3:33 AM, David Miller  wrote:
>>>
>>> Highlights:
>> 
>> Lowlights:
>> 
>>  1) it duplicated a commit from the hrtimer tree, which had been
>> cleaned up and rewritten, but then merging the second copy of the
>> commit re-introduced the bad code that had been cleaned up.
>> 
>> I'm talking about commits
>> 
>>  - 7d9285e82db5:
>>   perf/bpf: Extend the perf_event_read_local() interface, a.k.a.
>> "bpf: perf event change needed for subsequent bpf helpers"
>>  - 97562633bcba
>>   bpf: perf event change needed for subsequent bpf helpers
>> 
>> where apparently there was no discussion between the groups about the
>> subsequent changes.
>> 
>> And this must have shown up in linux-next as a conflict, but no
>> mention of it from either the perf event tree or the networking tree
>> merge.
>> 
>> Although it is of course possible that depending on merge order, the
>> problem never showed up in next.
> 
> Sorry about that, it was discussed that the patch in [1] would get
> routed through net-next and again cherry-picked from tracing folks
> due to conflicting changes in perf event tree that were being worked
> on to avoid later merge conflicts - clearly that didn't give the
> desired result.
> 
> There was a subsequent discussion in [2] but not sure if cherry-picking
> 0d3d73aac2ff ("perf/core: Rewrite event timekeeping") into net-next
> would have made it better or worse. We'll have a bpf sub-tree up and
> running soon for the next development cycle that can be pulled from
> by different parties when needed; potentially this could reduce such
> conflicts between trees in future. Sorry for the trouble.

Yeah, sorry about all of this.

I had hoped that since the patch was being appied to both trees in
order to avoid merge problems, no modifications would have been made
to the change at either end.  This obviously didn't happen.

I also didn't communicate the issue to you clearly in the pull
request, and for this I apologize.

As Daniel says, we realize that bpf is breaching multiple subsystems
more and more so over time, and we hope a bpf GIT tree will help
alleviate this moving forward.

Thanks!

Re: [PATCH] qed: fix unnecessary call to memset cocci warnings

2017-11-15 Thread David Miller

From: Vasyl Gomonovych 
Date: Wed, 15 Nov 2017 20:58:15 +0100

> @@ -1277,11 +1277,10 @@ static struct qed_dcbx_get *qed_dcbnl_get_dcbx(struct 
> qed_hwfn *hwfn,
>  {
>   struct qed_dcbx_get *dcbx_info;
>  
> - dcbx_info = kmalloc(sizeof(*dcbx_info), GFP_ATOMIC);
> + dcbx_info = kzmalloc(sizeof(*dcbx_info), GFP_ATOMIC);

Thanks for not even compile testing your patch.

Re: [PATCH] vhost/scsi: Use safe iteration in vhost_scsi_complete_cmd_work()

2017-11-15 Thread Byungchul Park


On 11/16/2017 10:30 AM, Michael S. Tsirkin wrote:

On Thu, Nov 16, 2017 at 08:52:39AM +0900, Byungchul Park wrote:

On Thu, Nov 09, 2017 at 09:17:29AM +0900, Byungchul Park wrote:

I am sorry for having made a mistake on it.


Hello Nicholas,

Please consider this patch urgently. I'm sorry for having changed the
original behavior with the previous patch.

The safe version of llist API should be used to keep the original
behavior.

Thanks,
Byungchul


I have included this patch in my tree.


Thank you.

--
Thanks,
Byungchul

Re: [PATCH] vhost/scsi: Use safe iteration in vhost_scsi_complete_cmd_work()

2017-11-15 Thread Michael S. Tsirkin

On Thu, Nov 16, 2017 at 08:52:39AM +0900, Byungchul Park wrote:
> On Thu, Nov 09, 2017 at 09:17:29AM +0900, Byungchul Park wrote:
> > I am sorry for having made a mistake on it.
> 
> Hello Nicholas,
> 
> Please consider this patch urgently. I'm sorry for having changed the
> original behavior with the previous patch.
> 
> The safe version of llist API should be used to keep the original
> behavior.
> 
> Thanks,
> Byungchul

I have included this patch in my tree.

> > -8<-
> > >From ba9a0f76dffceffa4fa3aa2d9be49cdb0d9b7d4f Mon Sep 17 00:00:00 2001
> > From: Byungchul Park 
> > Date: Thu, 9 Nov 2017 09:00:21 +0900
> > Subject: [PATCH] vhost/scsi: Use safe iteration in 
> > vhost_scsi_complete_cmd_work()
> > 
> > The following patch changed the behavior which originally did safe
> > iteration. Make it safe as it was.
> > 
> >12bdcbd539c6327c09da0503c674733cb2d82cb5
> >vhost/scsi: Don't reinvent the wheel but use existing llist API
> > 
> > Signed-off-by: Byungchul Park 
> > ---
> >  drivers/vhost/scsi.c | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
> > index 046f6d2..46539ca 100644
> > --- a/drivers/vhost/scsi.c
> > +++ b/drivers/vhost/scsi.c
> > @@ -519,7 +519,7 @@ static void vhost_scsi_complete_cmd_work(struct 
> > vhost_work *work)
> > vs_completion_work);
> > DECLARE_BITMAP(signal, VHOST_SCSI_MAX_VQ);
> > struct virtio_scsi_cmd_resp v_rsp;
> > -   struct vhost_scsi_cmd *cmd;
> > +   struct vhost_scsi_cmd *cmd, *t;
> > struct llist_node *llnode;
> > struct se_cmd *se_cmd;
> > struct iov_iter iov_iter;
> > @@ -527,7 +527,7 @@ static void vhost_scsi_complete_cmd_work(struct 
> > vhost_work *work)
> >  
> > bitmap_zero(signal, VHOST_SCSI_MAX_VQ);
> > llnode = llist_del_all(>vs_completion_list);
> > -   llist_for_each_entry(cmd, llnode, tvc_completion_list) {
> > +   llist_for_each_entry_safe(cmd, t, llnode, tvc_completion_list) {
> > se_cmd = >tvc_se_cmd;
> >  
> > pr_debug("%s tv_cmd %p resid %u status %#02x\n", __func__,
> > -- 
> > 1.9.1

Re: Request for -stable inclusion: time stamping fix for nfp

2017-11-15 Thread David Miller

From: Guillaume Nault 
Date: Wed, 15 Nov 2017 17:20:46 +0100

> Can you please queue commit 46f1c52e66db
> ("nfp: TX time stamp packets before HW doorbell is rung") for -stable?
> We got hit but this bug in the late summer. We run this fix internally
> since a couple of months, but that'd be better to have it officially
> backported so everyone can benefit of it.

Queued up.

[PATCH net-next] driver: ipvlan: Add new func ipvlan_is_valid_dev instead of duplicated codes

2017-11-15 Thread gfree . wind

From: Gao Feng 

There are multiple duplicated condition checks in the current codes, so
I add the new func ipvlan_is_valid_dev instead of the duplicated codes to
check if the netdev is real ipvlan dev.

Signed-off-by: Gao Feng 
---
 drivers/net/ipvlan/ipvlan_main.c | 33 +
 1 file changed, 17 insertions(+), 16 deletions(-)

diff --git a/drivers/net/ipvlan/ipvlan_main.c b/drivers/net/ipvlan/ipvlan_main.c
index a266aa4..2ec1ea2 100644
--- a/drivers/net/ipvlan/ipvlan_main.c
+++ b/drivers/net/ipvlan/ipvlan_main.c
@@ -844,6 +844,19 @@ static void ipvlan_del_addr6(struct ipvl_dev *ipvlan, 
struct in6_addr *ip6_addr)
return ipvlan_del_addr(ipvlan, ip6_addr, true);
 }
 
+static bool ipvlan_is_valid_dev(const struct net_device *dev)
+{
+   struct ipvl_dev *ipvlan = netdev_priv(dev);
+
+   if (!netif_is_ipvlan(dev))
+   return false;
+
+   if (!ipvlan || !ipvlan->port)
+   return false;
+
+   return true;
+}
+
 static int ipvlan_addr6_event(struct notifier_block *unused,
  unsigned long event, void *ptr)
 {
@@ -851,10 +864,7 @@ static int ipvlan_addr6_event(struct notifier_block 
*unused,
struct net_device *dev = (struct net_device *)if6->idev->dev;
struct ipvl_dev *ipvlan = netdev_priv(dev);
 
-   if (!netif_is_ipvlan(dev))
-   return NOTIFY_DONE;
-
-   if (!ipvlan || !ipvlan->port)
+   if (!ipvlan_is_valid_dev(dev))
return NOTIFY_DONE;
 
switch (event) {
@@ -882,10 +892,7 @@ static int ipvlan_addr6_validator_event(struct 
notifier_block *unused,
if (in_softirq())
return NOTIFY_DONE;
 
-   if (!netif_is_ipvlan(dev))
-   return NOTIFY_DONE;
-
-   if (!ipvlan || !ipvlan->port)
+   if (!ipvlan_is_valid_dev(dev))
return NOTIFY_DONE;
 
switch (event) {
@@ -926,10 +933,7 @@ static int ipvlan_addr4_event(struct notifier_block 
*unused,
struct ipvl_dev *ipvlan = netdev_priv(dev);
struct in_addr ip4_addr;
 
-   if (!netif_is_ipvlan(dev))
-   return NOTIFY_DONE;
-
-   if (!ipvlan || !ipvlan->port)
+   if (!ipvlan_is_valid_dev(dev))
return NOTIFY_DONE;
 
switch (event) {
@@ -955,10 +959,7 @@ static int ipvlan_addr4_validator_event(struct 
notifier_block *unused,
struct net_device *dev = (struct net_device *)ivi->ivi_dev->dev;
struct ipvl_dev *ipvlan = netdev_priv(dev);
 
-   if (!netif_is_ipvlan(dev))
-   return NOTIFY_DONE;
-
-   if (!ipvlan || !ipvlan->port)
+   if (!ipvlan_is_valid_dev(dev))
return NOTIFY_DONE;
 
switch (event) {
-- 
1.9.1

Re: [RFC PATCH net-next 0/2] Configuring PFC stall prevention via ethtool

2017-11-15 Thread Andrew Lunn

On Wed, Nov 15, 2017 at 09:00:09PM +0200, Eran Ben Elisha wrote:
> From: Inbar Karmy 
> 
> This RFC adds support for configuring PFC stall prevention through ethtool.
> 
> In the event where the device unexpectedly becomes unresponsive for a long
> period of time, flow control mechanism may propagate pause frames which will
> cause congestion spreading to the entire network.
> 
> To prevent this scenario, the device may implement a protection mechanism for
> monitoring and resolving such state.  The following patches allow the user to
> control the stall prevention functionality.
> 
> PFC stall prevention configuration is done via ethtool -a (pause).
> Two modes are introduced:
> Default - current behavior per driver.
> Auto - protection mechanism controlled automatically by the driver.

Why Auto?

Down in the driver you seem to translate this to a time. And it looks
like your hardware is flexible on that time, it can probably do at
least 8s to 100ms.

Why not specify a time?

What do other vendors support? Time? Number of pause frames sent?

 Andrew

Re: [PATCH] ibmveth: Kernel crash LSO offload flag toggle

2017-11-15 Thread Benjamin Herrenschmidt

On Wed, 2017-11-15 at 10:45 -0600, Bryant G. Ly wrote:
> This patch just closes the window, bad things can still happen. I wanted to 
> leave it
> up to the people who actively develop in ibmveth to close the window, since 
> introducing
> a lock can be expensive in tx. 

You don't need to instroduce a lock. The network stack already have a
per-queue lock, you just use the existing one.

Look at what I did in sungem or ftgmac100 with the reset task, those
are fairly simple drivers and should illustrate the technique.

Cheers,
Ben.

Re: [PATCH] vhost/scsi: Use safe iteration in vhost_scsi_complete_cmd_work()

2017-11-15 Thread Byungchul Park

On Thu, Nov 09, 2017 at 09:17:29AM +0900, Byungchul Park wrote:
> I am sorry for having made a mistake on it.

Hello Nicholas,

Please consider this patch urgently. I'm sorry for having changed the
original behavior with the previous patch.

The safe version of llist API should be used to keep the original
behavior.

Thanks,
Byungchul

> -8<-
> >From ba9a0f76dffceffa4fa3aa2d9be49cdb0d9b7d4f Mon Sep 17 00:00:00 2001
> From: Byungchul Park 
> Date: Thu, 9 Nov 2017 09:00:21 +0900
> Subject: [PATCH] vhost/scsi: Use safe iteration in 
> vhost_scsi_complete_cmd_work()
> 
> The following patch changed the behavior which originally did safe
> iteration. Make it safe as it was.
> 
>12bdcbd539c6327c09da0503c674733cb2d82cb5
>vhost/scsi: Don't reinvent the wheel but use existing llist API
> 
> Signed-off-by: Byungchul Park 
> ---
>  drivers/vhost/scsi.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
> index 046f6d2..46539ca 100644
> --- a/drivers/vhost/scsi.c
> +++ b/drivers/vhost/scsi.c
> @@ -519,7 +519,7 @@ static void vhost_scsi_complete_cmd_work(struct 
> vhost_work *work)
>   vs_completion_work);
>   DECLARE_BITMAP(signal, VHOST_SCSI_MAX_VQ);
>   struct virtio_scsi_cmd_resp v_rsp;
> - struct vhost_scsi_cmd *cmd;
> + struct vhost_scsi_cmd *cmd, *t;
>   struct llist_node *llnode;
>   struct se_cmd *se_cmd;
>   struct iov_iter iov_iter;
> @@ -527,7 +527,7 @@ static void vhost_scsi_complete_cmd_work(struct 
> vhost_work *work)
>  
>   bitmap_zero(signal, VHOST_SCSI_MAX_VQ);
>   llnode = llist_del_all(>vs_completion_list);
> - llist_for_each_entry(cmd, llnode, tvc_completion_list) {
> + llist_for_each_entry_safe(cmd, t, llnode, tvc_completion_list) {
>   se_cmd = >tvc_se_cmd;
>  
>   pr_debug("%s tv_cmd %p resid %u status %#02x\n", __func__,
> -- 
> 1.9.1

Re: [PATCH 1/2] r8169: reinstate ALDPS for power saving

2017-11-15 Thread Francois Romieu

David Miller  :
[...]
> The amount of coverage this change is going to get is very small as
> well, meaning an even greater chance of regressions.

Yes.

> Therefore the only acceptable way to handle this is to have
> a white-list, specific chips that have been explicitly tested
> and are known to work with this feature, rather than the other
> way around.
> 
> Furthermore, you're not even checking the chip version, you're
> checking instead whether the firmware is loaded or not.  That
> doesn't seem like a safe way to guard this at all.

Actually the chip specific xyz_hw_phy_config methods call the relevant
aldps enabling helper _but_ the 8168evl dedicated xyz_hw_phy_config
doesn't. The firmware loaded check is just a distraction for the
busy reviewer.

-- 
Ueimor

Re: [patch net-next v2 00/10] net: sched: allow qdiscs to share filter block instances

2017-11-15 Thread Cong Wang

On Sun, Nov 12, 2017 at 7:55 AM, Jiri Pirko  wrote:
> From: Jiri Pirko 
>
> Currently the filters added to qdiscs are independent. So for example if you
> have 2 netdevices and you create ingress qdisc on both and you want to add
> identical filter rules both, you need to add them twice. This patchset
> makes this easier and mainly saves resources allowing to share all filters
> within a qdisc - I call it a "filter block". Also this helps to save
> resources when we do offload to hw for example to expensive TCAM.
>
> So back to the example. First, we create 2 qdiscs. Both will share
> block number 22. "22" is just an identification. If we don't pass any
> block number, a new one will be generated by kernel:

Should not block 0 by used by default if not specified by user?
Why a new one?


>
> $ tc qdisc add dev ens7 ingress block 22
> 
> $ tc qdisc add dev ens8 ingress block 22
> 
>
> Now if we list the qdiscs, we will see the block index in the output:
>
> $ tc qdisc
> qdisc ingress : dev ens7 parent :fff1 block 22
> qdisc ingress : dev ens8 parent :fff1 block 22
>
> Now we can add filter to any of qdiscs sharing the same block:
>
> $ tc filter add dev ens7 ingress protocol ip pref 25 flower dst_ip 
> 192.168.0.0/16 action drop
>

So you don't need to specify block 22 for this filter?
Because there is only one block???


>
> We will see the same output if we list filters for ens7 and ens8, including 
> stats:
>
> $ tc -s filter show dev ens7 ingress
> filter protocol ip pref 25 flower chain 0
> filter protocol ip pref 25 flower chain 0 handle 0x1
>   eth_type ipv4
>   dst_ip 192.168.0.0/16
>   not_in_hw
> action order 1: gact action drop
>  random type none pass val 0
>  index 1 ref 1 bind 1 installed 39 sec used 2 sec
> Action statistics:
> Sent 3108 bytes 37 pkt (dropped 37, overlimits 0 requeues 0)
> backlog 0b 0p requeues 0


Don't see which block it belongs to here.

Re: linux-next: build warning after merge of the netfilter-next tree

2017-11-15 Thread Stephen Rothwell

Hi Pablo,

On Thu, 9 Nov 2017 00:40:14 +0100 Pablo Neira Ayuso  wrote:
>
> On Wed, Nov 08, 2017 at 07:00:52PM +1100, Stephen Rothwell wrote:
> > 
> > On Tue, 7 Nov 2017 11:02:48 +1100 Stephen Rothwell  
> > wrote:  
> > >
> > > After merging the netfilter-next tree, today's linux-next build (powerpc
> > > ppc64_defconfig) produced this warning:
> > > 
> > > net/netfilter/nf_conntrack_netlink.c:536:15: warning: 
> > > 'ctnetlink_proto_size' defined but not used [-Wunused-function]
> > >  static size_t ctnetlink_proto_size(const struct nf_conn *ct)
> > >^
> > > 
> > > Introduced by commit
> > > 
> > >   5caaed151a68 ("netfilter: conntrack: don't cache nlattr_tuple_size 
> > > result in nla_size")  
> > 
> > I assume that this warning will now be in the net-next tree ...  
> 
> It's my fault, I'll fix this in my next batch, sorry for the inconvenience.

This has now made it into Linus' tree :-(

-- 
Cheers,
Stephen Rothwell

Re: [RFC PATCH 01/14] packet: introduce AF_PACKET V4 userspace API

2017-11-15 Thread chet l

On Tue, Oct 31, 2017 at 5:41 AM, Björn Töpel  wrote:
> From: Björn Töpel 
>

> +/*
> + * struct tpacket_memreg_req is used in conjunction with PACKET_MEMREG
> + * to register user memory which should be used to store the packet
> + * data.
> + *
> + * There are some constraints for the memory being registered:
> + * - The memory area has to be memory page size aligned.
> + * - The frame size has to be a power of 2.
> + * - The frame size cannot be smaller than 2048B.
> + * - The frame size cannot be larger than the memory page size.
> + *
> + * Corollary: The number of frames that can be stored is
> + * len / frame_size.
> + *
> + */
> +struct tpacket_memreg_req {
> +   unsigned long   addr;   /* Start of packet data area */
> +   unsigned long   len;/* Length of packet data area */
> +   unsigned intframe_size; /* Frame size */
> +   unsigned intdata_headroom;  /* Frame head room */
> +};
> +

I have not reviewed the entire patchset but I think if we could add a
version_hdr and then unionize the fields, it might be easier to add
SVM support without having to spin v5. I could be wrong though.


Chetan

Re: [RFC PATCH 01/14] packet: introduce AF_PACKET V4 userspace API

2017-11-15 Thread chet l

>
> Actually, we started out with that approach, where the packet_mmap
> call mapped Tx/Rx descriptor rings and the packet buffer region. We
> later moved to this (register umem) approach, because it's more
> flexible for user space, not having to use a AF_PACKET specific
> allocator (i.e. continue to use regular mallocs, huge pages and such).
>


One quick question:
Any thoughts on SVM support?
Is SVM support going to be so disruptive that we will need to churn a tp_v5?

If not then to accommodate future SVM enablement do you think it might
make sense to add/stuff a control-info union in the tp4_queue (or umem
etc). And then in the future, I think setmemreg (or something else)
would need to pass the PASID in addition to the malloc'd addr.
Assumption here is that the user-app will bind PID<->PASID before
invoking the AF_ZC setup.



> Björn

Chetan

Re: [PATCH RfC 1/2] net: phy: core: remove now uneeded disabling of interrupts

2017-11-15 Thread Heiner Kallweit

Am 15.11.2017 um 23:04 schrieb Florian Fainelli:
> On 11/12/2017 01:08 PM, Heiner Kallweit wrote:
>> After commits c974bdbc3e "net: phy: Use threaded IRQ, to allow IRQ from
>> sleeping devices" and 664fcf123a30 "net: phy: Threaded interrupts allow
>> some simplification" all relevant code pieces run in process context
>> anyway and I don't think we need the disabling of interrupts any longer.
>>
>> Interestingly enough, latter commit already removed the comment
>> explaining why interrupts need to be temporarily disabled.
>>
>> On my system phy interrupt mode works fine with this patch.
>> However I may miss something, especially in the context of shared phy
>> interrupts, therefore I'd appreciate if more people could test this.
> 
> I am not currently in a position to test this, but this looks very
> similar, if not identical to what Ard submitted a few days earlier:
> 

Thanks for the hint. Indeed it's exactly the same patch, so the one
sent by me can be disregarded.

> https://patchwork.kernel.org/patch/10048901/
> 
> Since net-next is closed at the moment, that should allow us to give
> this some good amount of testing.
> 
> Thanks
> 
>>
>> Signed-off-by: Heiner Kallweit 
>> ---
>>  drivers/net/phy/phy.c | 26 ++
>>  include/linux/phy.h   |  1 -
>>  2 files changed, 2 insertions(+), 25 deletions(-)
>>
>> diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
>> index 2b1e67bc1..b3784c9a2 100644
>> --- a/drivers/net/phy/phy.c
>> +++ b/drivers/net/phy/phy.c
>> @@ -629,9 +629,6 @@ static irqreturn_t phy_interrupt(int irq, void *phy_dat)
>>  if (PHY_HALTED == phydev->state)
>>  return IRQ_NONE;/* It can't be ours.  */
>>  
>> -disable_irq_nosync(irq);
>> -atomic_inc(>irq_disable);
>> -
>>  phy_change(phydev);
>>  
>>  return IRQ_HANDLED;
>> @@ -689,7 +686,6 @@ static int phy_disable_interrupts(struct phy_device 
>> *phydev)
>>   */
>>  int phy_start_interrupts(struct phy_device *phydev)
>>  {
>> -atomic_set(>irq_disable, 0);
>>  if (request_threaded_irq(phydev->irq, NULL, phy_interrupt,
>>   IRQF_ONESHOT | IRQF_SHARED,
>>   phydev_name(phydev), phydev) < 0) {
>> @@ -716,13 +712,6 @@ int phy_stop_interrupts(struct phy_device *phydev)
>>  
>>  free_irq(phydev->irq, phydev);
>>  
>> -/* If work indeed has been cancelled, disable_irq() will have
>> - * been left unbalanced from phy_interrupt() and enable_irq()
>> - * has to be called so that other devices on the line work.
>> - */
>> -while (atomic_dec_return(>irq_disable) >= 0)
>> -enable_irq(phydev->irq);
>> -
>>  return err;
>>  }
>>  EXPORT_SYMBOL(phy_stop_interrupts);
>> @@ -736,7 +725,7 @@ void phy_change(struct phy_device *phydev)
>>  if (phy_interrupt_is_valid(phydev)) {
>>  if (phydev->drv->did_interrupt &&
>>  !phydev->drv->did_interrupt(phydev))
>> -goto ignore;
>> +return;
>>  
>>  if (phy_disable_interrupts(phydev))
>>  goto phy_err;
>> @@ -748,27 +737,16 @@ void phy_change(struct phy_device *phydev)
>>  mutex_unlock(>lock);
>>  
>>  if (phy_interrupt_is_valid(phydev)) {
>> -atomic_dec(>irq_disable);
>> -enable_irq(phydev->irq);
>> -
>>  /* Reenable interrupts */
>>  if (PHY_HALTED != phydev->state &&
>>  phy_config_interrupt(phydev, PHY_INTERRUPT_ENABLED))
>> -goto irq_enable_err;
>> +goto phy_err;
>>  }
>>  
>>  /* reschedule state queue work to run as soon as possible */
>>  phy_trigger_machine(phydev, true);
>>  return;
>>  
>> -ignore:
>> -atomic_dec(>irq_disable);
>> -enable_irq(phydev->irq);
>> -return;
>> -
>> -irq_enable_err:
>> -disable_irq(phydev->irq);
>> -atomic_inc(>irq_disable);
>>  phy_err:
>>  phy_error(phydev);
>>  }
>> diff --git a/include/linux/phy.h b/include/linux/phy.h
>> index dc82a07cb..8a87e441f 100644
>> --- a/include/linux/phy.h
>> +++ b/include/linux/phy.h
>> @@ -468,7 +468,6 @@ struct phy_device {
>>  /* Interrupt and Polling infrastructure */
>>  struct work_struct phy_queue;
>>  struct delayed_work state_queue;
>> -atomic_t irq_disable;
>>  
>>  struct mutex lock;
>>  
>>
> 
>

Re: [GIT] Networking

2017-11-15 Thread Daniel Borkmann

On 11/15/2017 09:19 PM, Linus Torvalds wrote:
> On Wed, Nov 15, 2017 at 3:33 AM, David Miller  wrote:
>>
>> Highlights:
> 
> Lowlights:
> 
>  1) it duplicated a commit from the hrtimer tree, which had been
> cleaned up and rewritten, but then merging the second copy of the
> commit re-introduced the bad code that had been cleaned up.
> 
> I'm talking about commits
> 
>  - 7d9285e82db5:
>   perf/bpf: Extend the perf_event_read_local() interface, a.k.a.
> "bpf: perf event change needed for subsequent bpf helpers"
>  - 97562633bcba
>   bpf: perf event change needed for subsequent bpf helpers
> 
> where apparently there was no discussion between the groups about the
> subsequent changes.
> 
> And this must have shown up in linux-next as a conflict, but no
> mention of it from either the perf event tree or the networking tree
> merge.
> 
> Although it is of course possible that depending on merge order, the
> problem never showed up in next.

Sorry about that, it was discussed that the patch in [1] would get
routed through net-next and again cherry-picked from tracing folks
due to conflicting changes in perf event tree that were being worked
on to avoid later merge conflicts - clearly that didn't give the
desired result.

There was a subsequent discussion in [2] but not sure if cherry-picking
0d3d73aac2ff ("perf/core: Rewrite event timekeeping") into net-next
would have made it better or worse. We'll have a bpf sub-tree up and
running soon for the next development cycle that can be pulled from
by different parties when needed; potentially this could reduce such
conflicts between trees in future. Sorry for the trouble.

Thanks,
Daniel

  [1] https://patchwork.ozlabs.org/patch/821919/
  [2] https://lkml.org/lkml/2017/11/1/53

Re: [PATCH RfC 1/2] net: phy: core: remove now uneeded disabling of interrupts

2017-11-15 Thread Florian Fainelli

On 11/12/2017 01:08 PM, Heiner Kallweit wrote:
> After commits c974bdbc3e "net: phy: Use threaded IRQ, to allow IRQ from
> sleeping devices" and 664fcf123a30 "net: phy: Threaded interrupts allow
> some simplification" all relevant code pieces run in process context
> anyway and I don't think we need the disabling of interrupts any longer.
> 
> Interestingly enough, latter commit already removed the comment
> explaining why interrupts need to be temporarily disabled.
> 
> On my system phy interrupt mode works fine with this patch.
> However I may miss something, especially in the context of shared phy
> interrupts, therefore I'd appreciate if more people could test this.

I am not currently in a position to test this, but this looks very
similar, if not identical to what Ard submitted a few days earlier:

https://patchwork.kernel.org/patch/10048901/

Since net-next is closed at the moment, that should allow us to give
this some good amount of testing.

Thanks

> 
> Signed-off-by: Heiner Kallweit 
> ---
>  drivers/net/phy/phy.c | 26 ++
>  include/linux/phy.h   |  1 -
>  2 files changed, 2 insertions(+), 25 deletions(-)
> 
> diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
> index 2b1e67bc1..b3784c9a2 100644
> --- a/drivers/net/phy/phy.c
> +++ b/drivers/net/phy/phy.c
> @@ -629,9 +629,6 @@ static irqreturn_t phy_interrupt(int irq, void *phy_dat)
>   if (PHY_HALTED == phydev->state)
>   return IRQ_NONE;/* It can't be ours.  */
>  
> - disable_irq_nosync(irq);
> - atomic_inc(>irq_disable);
> -
>   phy_change(phydev);
>  
>   return IRQ_HANDLED;
> @@ -689,7 +686,6 @@ static int phy_disable_interrupts(struct phy_device 
> *phydev)
>   */
>  int phy_start_interrupts(struct phy_device *phydev)
>  {
> - atomic_set(>irq_disable, 0);
>   if (request_threaded_irq(phydev->irq, NULL, phy_interrupt,
>IRQF_ONESHOT | IRQF_SHARED,
>phydev_name(phydev), phydev) < 0) {
> @@ -716,13 +712,6 @@ int phy_stop_interrupts(struct phy_device *phydev)
>  
>   free_irq(phydev->irq, phydev);
>  
> - /* If work indeed has been cancelled, disable_irq() will have
> -  * been left unbalanced from phy_interrupt() and enable_irq()
> -  * has to be called so that other devices on the line work.
> -  */
> - while (atomic_dec_return(>irq_disable) >= 0)
> - enable_irq(phydev->irq);
> -
>   return err;
>  }
>  EXPORT_SYMBOL(phy_stop_interrupts);
> @@ -736,7 +725,7 @@ void phy_change(struct phy_device *phydev)
>   if (phy_interrupt_is_valid(phydev)) {
>   if (phydev->drv->did_interrupt &&
>   !phydev->drv->did_interrupt(phydev))
> - goto ignore;
> + return;
>  
>   if (phy_disable_interrupts(phydev))
>   goto phy_err;
> @@ -748,27 +737,16 @@ void phy_change(struct phy_device *phydev)
>   mutex_unlock(>lock);
>  
>   if (phy_interrupt_is_valid(phydev)) {
> - atomic_dec(>irq_disable);
> - enable_irq(phydev->irq);
> -
>   /* Reenable interrupts */
>   if (PHY_HALTED != phydev->state &&
>   phy_config_interrupt(phydev, PHY_INTERRUPT_ENABLED))
> - goto irq_enable_err;
> + goto phy_err;
>   }
>  
>   /* reschedule state queue work to run as soon as possible */
>   phy_trigger_machine(phydev, true);
>   return;
>  
> -ignore:
> - atomic_dec(>irq_disable);
> - enable_irq(phydev->irq);
> - return;
> -
> -irq_enable_err:
> - disable_irq(phydev->irq);
> - atomic_inc(>irq_disable);
>  phy_err:
>   phy_error(phydev);
>  }
> diff --git a/include/linux/phy.h b/include/linux/phy.h
> index dc82a07cb..8a87e441f 100644
> --- a/include/linux/phy.h
> +++ b/include/linux/phy.h
> @@ -468,7 +468,6 @@ struct phy_device {
>   /* Interrupt and Polling infrastructure */
>   struct work_struct phy_queue;
>   struct delayed_work state_queue;
> - atomic_t irq_disable;
>  
>   struct mutex lock;
>  
> 


-- 
Florian

Re: [PATCH 2/2] net: phy: remove generic settings for callbacks config_aneg and read_status from drivers

2017-11-15 Thread Florian Fainelli

On 11/15/2017 01:45 PM, Heiner Kallweit wrote:
> Remove generic settings for callbacks config_aneg and read_status
> from drivers.
> 
> Signed-off-by: Heiner Kallweit 

Nice diffstat:

Reviewed-by: Florian Fainelli 
-- 
Florian

Re: [PATCH 1/2] net: phy: core: use genphy version of callbacks read_status and config_aneg per default

2017-11-15 Thread Florian Fainelli

On 11/15/2017 01:42 PM, Heiner Kallweit wrote:
> read_status and config_aneg are the only mandatory callbacks and most
> of the time the generic implementation is used by drivers.
> So make the core fall back to the generic version if a driver doesn't
> implement the respective callback.
> 
> Also currently the core doesn't seem to verify that drivers implement
> the mandatory calls. If a driver doesn't do so we'd just get a NPE.

Right, which is not an unusual way to signal that something is mandatory.

> With this patch this potential issue doesn't exit any longer.
> 
> Signed-off-by: Heiner Kallweit 

Reviewed-by: Florian Fainelli 

Note that net-next is closed at the moment, so you will have to resubmit
this when the tree opens back again.

> ---
>  drivers/net/phy/phy.c |  5 -
>  include/linux/phy.h   | 33 ++---
>  2 files changed, 22 insertions(+), 16 deletions(-)
> 
> diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
> index 2b1e67bc1..a0e7605dc 100644
> --- a/drivers/net/phy/phy.c
> +++ b/drivers/net/phy/phy.c
> @@ -493,7 +493,10 @@ static int phy_start_aneg_priv(struct phy_device 
> *phydev, bool sync)
>   /* Invalidate LP advertising flags */
>   phydev->lp_advertising = 0;
>  
> - err = phydev->drv->config_aneg(phydev);
> + if (phydev->drv->config_aneg)
> + err = phydev->drv->config_aneg(phydev);
> + else
> + err = genphy_config_aneg(phydev);
>   if (err < 0)
>   goto out_unlock;
>  
> diff --git a/include/linux/phy.h b/include/linux/phy.h
> index dc82a07cb..958b5162a 100644
> --- a/include/linux/phy.h
> +++ b/include/linux/phy.h
> @@ -497,13 +497,13 @@ struct phy_device {
>   * flags: A bitfield defining certain other features this PHY
>   *   supports (like interrupts)
>   *
> - * The drivers must implement config_aneg and read_status.  All
> - * other functions are optional. Note that none of these
> - * functions should be called from interrupt time.  The goal is
> - * for the bus read/write functions to be able to block when the
> - * bus transaction is happening, and be freed up by an interrupt
> - * (The MPC85xx has this ability, though it is not currently
> - * supported in the driver).
> + * All functions are optional. If config_aneg or read_status
> + * are not implemented, the phy core uses the genphy versions.
> + * Note that none of these functions should be called from
> + * interrupt time. The goal is for the bus read/write functions
> + * to be able to block when the bus transaction is happening,
> + * and be freed up by an interrupt (The MPC85xx has this ability,
> + * though it is not currently supported in the driver).
>   */
>  struct phy_driver {
>   struct mdio_driver_common mdiodrv;
> @@ -841,14 +841,6 @@ int phy_aneg_done(struct phy_device *phydev);
>  int phy_stop_interrupts(struct phy_device *phydev);
>  int phy_restart_aneg(struct phy_device *phydev);
>  
> -static inline int phy_read_status(struct phy_device *phydev)
> -{
> - if (!phydev->drv)
> - return -EIO;
> -
> - return phydev->drv->read_status(phydev);
> -}
> -
>  #define phydev_err(_phydev, format, args...) \
>   dev_err(&_phydev->mdio.dev, format, ##args)
>  
> @@ -890,6 +882,17 @@ int genphy_c45_read_pma(struct phy_device *phydev);
>  int genphy_c45_pma_setup_forced(struct phy_device *phydev);
>  int genphy_c45_an_disable_aneg(struct phy_device *phydev);
>  
> +static inline int phy_read_status(struct phy_device *phydev)
> +{
> + if (!phydev->drv)
> + return -EIO;
> +
> + if (phydev->drv->read_status)
> + return phydev->drv->read_status(phydev);
> + else
> + return genphy_read_status(phydev);
> +}
> +
>  void phy_driver_unregister(struct phy_driver *drv);
>  void phy_drivers_unregister(struct phy_driver *drv, int n);
>  int phy_driver_register(struct phy_driver *new_driver, struct module *owner);
> 


-- 
Florian

[PATCH 2/2] net: phy: remove generic settings for callbacks config_aneg and read_status from drivers

2017-11-15 Thread Heiner Kallweit

Remove generic settings for callbacks config_aneg and read_status
from drivers.

Signed-off-by: Heiner Kallweit 
---
 drivers/net/phy/amd.c|  2 --
 drivers/net/phy/at803x.c |  6 --
 drivers/net/phy/bcm-cygnus.c |  2 --
 drivers/net/phy/bcm63xx.c|  4 
 drivers/net/phy/bcm7xxx.c|  6 --
 drivers/net/phy/broadcom.c   | 30 --
 drivers/net/phy/cicada.c |  4 
 drivers/net/phy/davicom.c|  5 -
 drivers/net/phy/dp83640.c|  2 --
 drivers/net/phy/dp83822.c|  2 --
 drivers/net/phy/dp83848.c|  2 --
 drivers/net/phy/dp83867.c|  2 --
 drivers/net/phy/icplus.c |  4 
 drivers/net/phy/intel-xway.c | 12 
 drivers/net/phy/lxt.c|  5 -
 drivers/net/phy/marvell.c|  9 -
 drivers/net/phy/meson-gxl.c  |  2 --
 drivers/net/phy/micrel.c | 24 
 drivers/net/phy/microchip.c  |  1 -
 drivers/net/phy/national.c   |  2 --
 drivers/net/phy/phy_device.c |  2 --
 drivers/net/phy/qsemi.c  |  2 --
 drivers/net/phy/realtek.c| 12 
 drivers/net/phy/rockchip.c   |  1 -
 drivers/net/phy/smsc.c   |  9 -
 drivers/net/phy/ste10Xp.c|  4 
 drivers/net/phy/uPD60620.c   |  1 -
 drivers/net/phy/vitesse.c| 12 
 28 files changed, 169 deletions(-)

diff --git a/drivers/net/phy/amd.c b/drivers/net/phy/amd.c
index 18141c022..6fe5dc920 100644
--- a/drivers/net/phy/amd.c
+++ b/drivers/net/phy/amd.c
@@ -68,8 +68,6 @@ static struct phy_driver am79c_driver[] = { {
.features   = PHY_BASIC_FEATURES,
.flags  = PHY_HAS_INTERRUPT,
.config_init= am79c_config_init,
-   .config_aneg= genphy_config_aneg,
-   .read_status= genphy_read_status,
.ack_interrupt  = am79c_ack_interrupt,
.config_intr= am79c_config_intr,
 } };
diff --git a/drivers/net/phy/at803x.c b/drivers/net/phy/at803x.c
index 5f93e6add..de7dd6566 100644
--- a/drivers/net/phy/at803x.c
+++ b/drivers/net/phy/at803x.c
@@ -408,8 +408,6 @@ static struct phy_driver at803x_driver[] = {
.resume = at803x_resume,
.features   = PHY_GBIT_FEATURES,
.flags  = PHY_HAS_INTERRUPT,
-   .config_aneg= genphy_config_aneg,
-   .read_status= genphy_read_status,
.ack_interrupt  = at803x_ack_interrupt,
.config_intr= at803x_config_intr,
 }, {
@@ -426,8 +424,6 @@ static struct phy_driver at803x_driver[] = {
.resume = at803x_resume,
.features   = PHY_BASIC_FEATURES,
.flags  = PHY_HAS_INTERRUPT,
-   .config_aneg= genphy_config_aneg,
-   .read_status= genphy_read_status,
.ack_interrupt  = at803x_ack_interrupt,
.config_intr= at803x_config_intr,
 }, {
@@ -443,8 +439,6 @@ static struct phy_driver at803x_driver[] = {
.resume = at803x_resume,
.features   = PHY_GBIT_FEATURES,
.flags  = PHY_HAS_INTERRUPT,
-   .config_aneg= genphy_config_aneg,
-   .read_status= genphy_read_status,
.aneg_done  = at803x_aneg_done,
.ack_interrupt  = _ack_interrupt,
.config_intr= _config_intr,
diff --git a/drivers/net/phy/bcm-cygnus.c b/drivers/net/phy/bcm-cygnus.c
index 3fe8cc5c1..683812983 100644
--- a/drivers/net/phy/bcm-cygnus.c
+++ b/drivers/net/phy/bcm-cygnus.c
@@ -136,8 +136,6 @@ static struct phy_driver bcm_cygnus_phy_driver[] = {
.name  = "Broadcom Cygnus PHY",
.features  = PHY_GBIT_FEATURES,
.config_init   = bcm_cygnus_config_init,
-   .config_aneg   = genphy_config_aneg,
-   .read_status   = genphy_read_status,
.ack_interrupt = bcm_phy_ack_intr,
.config_intr   = bcm_phy_config_intr,
.suspend   = genphy_suspend,
diff --git a/drivers/net/phy/bcm63xx.c b/drivers/net/phy/bcm63xx.c
index b0492ef2c..cf1461374 100644
--- a/drivers/net/phy/bcm63xx.c
+++ b/drivers/net/phy/bcm63xx.c
@@ -69,8 +69,6 @@ static struct phy_driver bcm63xx_driver[] = {
.features   = (PHY_BASIC_FEATURES | SUPPORTED_Pause),
.flags  = PHY_HAS_INTERRUPT | PHY_IS_INTERNAL,
.config_init= bcm63xx_config_init,
-   .config_aneg= genphy_config_aneg,
-   .read_status= genphy_read_status,
.ack_interrupt  = bcm_phy_ack_intr,
.config_intr= bcm63xx_config_intr,
 }, {
@@ -81,8 +79,6 @@ static struct phy_driver bcm63xx_driver[] = {
.features   = (PHY_BASIC_FEATURES | SUPPORTED_Pause),
.flags  = PHY_HAS_INTERRUPT | PHY_IS_INTERNAL,
.config_init= bcm63xx_config_init,
-   .config_aneg= genphy_config_aneg,
-   .read_status= genphy_read_status,
.ack_interrupt  =

[PATCH 1/2] net: phy: core: use genphy version of callbacks read_status and config_aneg per default

2017-11-15 Thread Heiner Kallweit

read_status and config_aneg are the only mandatory callbacks and most
of the time the generic implementation is used by drivers.
So make the core fall back to the generic version if a driver doesn't
implement the respective callback.

Also currently the core doesn't seem to verify that drivers implement
the mandatory calls. If a driver doesn't do so we'd just get a NPE.
With this patch this potential issue doesn't exit any longer.

Signed-off-by: Heiner Kallweit 
---
 drivers/net/phy/phy.c |  5 -
 include/linux/phy.h   | 33 ++---
 2 files changed, 22 insertions(+), 16 deletions(-)

diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
index 2b1e67bc1..a0e7605dc 100644
--- a/drivers/net/phy/phy.c
+++ b/drivers/net/phy/phy.c
@@ -493,7 +493,10 @@ static int phy_start_aneg_priv(struct phy_device *phydev, 
bool sync)
/* Invalidate LP advertising flags */
phydev->lp_advertising = 0;
 
-   err = phydev->drv->config_aneg(phydev);
+   if (phydev->drv->config_aneg)
+   err = phydev->drv->config_aneg(phydev);
+   else
+   err = genphy_config_aneg(phydev);
if (err < 0)
goto out_unlock;
 
diff --git a/include/linux/phy.h b/include/linux/phy.h
index dc82a07cb..958b5162a 100644
--- a/include/linux/phy.h
+++ b/include/linux/phy.h
@@ -497,13 +497,13 @@ struct phy_device {
  * flags: A bitfield defining certain other features this PHY
  *   supports (like interrupts)
  *
- * The drivers must implement config_aneg and read_status.  All
- * other functions are optional. Note that none of these
- * functions should be called from interrupt time.  The goal is
- * for the bus read/write functions to be able to block when the
- * bus transaction is happening, and be freed up by an interrupt
- * (The MPC85xx has this ability, though it is not currently
- * supported in the driver).
+ * All functions are optional. If config_aneg or read_status
+ * are not implemented, the phy core uses the genphy versions.
+ * Note that none of these functions should be called from
+ * interrupt time. The goal is for the bus read/write functions
+ * to be able to block when the bus transaction is happening,
+ * and be freed up by an interrupt (The MPC85xx has this ability,
+ * though it is not currently supported in the driver).
  */
 struct phy_driver {
struct mdio_driver_common mdiodrv;
@@ -841,14 +841,6 @@ int phy_aneg_done(struct phy_device *phydev);
 int phy_stop_interrupts(struct phy_device *phydev);
 int phy_restart_aneg(struct phy_device *phydev);
 
-static inline int phy_read_status(struct phy_device *phydev)
-{
-   if (!phydev->drv)
-   return -EIO;
-
-   return phydev->drv->read_status(phydev);
-}
-
 #define phydev_err(_phydev, format, args...)   \
dev_err(&_phydev->mdio.dev, format, ##args)
 
@@ -890,6 +882,17 @@ int genphy_c45_read_pma(struct phy_device *phydev);
 int genphy_c45_pma_setup_forced(struct phy_device *phydev);
 int genphy_c45_an_disable_aneg(struct phy_device *phydev);
 
+static inline int phy_read_status(struct phy_device *phydev)
+{
+   if (!phydev->drv)
+   return -EIO;
+
+   if (phydev->drv->read_status)
+   return phydev->drv->read_status(phydev);
+   else
+   return genphy_read_status(phydev);
+}
+
 void phy_driver_unregister(struct phy_driver *drv);
 void phy_drivers_unregister(struct phy_driver *drv, int n);
 int phy_driver_register(struct phy_driver *new_driver, struct module *owner);
-- 
2.15.0

Re: [PATCH] net: bridge: add max_fdb_count

2017-11-15 Thread Egil Hjelmeland




Den 15. nov. 2017 20:27, skrev Sarah Newman:

Current memory and CPU usage for managing bridge fdb entries is unbounded.
Add a parameter max_fdb_count, controlled from sysfs, which places an upper
limit on the number of entries. Defaults to 1024.

When max_fdb_count is met or exceeded, whether traffic is sent out a
given port should depend on its flooding behavior.

This may instead be mitigated by filtering mac address entries in the
PREROUTING chain of the ebtables nat table, but this is only practical
when mac addresses are known in advance.

Signed-off-by: Sarah Newman 
---
  net/bridge/br_device.c   |  2 ++
  net/bridge/br_fdb.c  | 25 -
  net/bridge/br_private.h  |  3 +++
  net/bridge/br_sysfs_br.c | 24 
  4 files changed, 49 insertions(+), 5 deletions(-)

diff --git a/net/bridge/br_sysfs_br.c b/net/bridge/br_sysfs_br.c
index 723f25e..18fabdf 100644
--- a/net/bridge/br_sysfs_br.c
+++ b/net/bridge/br_sysfs_br.c
@@ -335,6 +335,28 @@ static ssize_t flush_store(struct device *d,
  }
  static DEVICE_ATTR_WO(flush);
  
+static ssize_t max_fdb_count_show(struct device *d, struct device_attribute *attr,

+char *buf)
+{
+   struct net_bridge *br = to_bridge(d);
+   return sprintf(buf, "%lu\n", br->max_fdb_count);
+}
+
+static ssize_t max_fdb_count_store(struct device *d, struct device_attribute 
*attr,
+ const char *buf, size_t len)
+{
+   return store_bridge_parm(d, buf, len, br_set_max_fdb_count);
+}
+static DEVICE_ATTR_RW(max_fdb_count);
+
+static ssize_t fdb_count_show(struct device *d, struct device_attribute *attr,
+   char *buf)
+{
+   struct net_bridge *br = to_bridge(d);
+   return sprintf(buf, "%lu\n", br->fdb_count);
+}
+static DEVICE_ATTR_RO(fdb_count);
+
  #ifdef CONFIG_BRIDGE_IGMP_SNOOPING
  static ssize_t multicast_router_show(struct device *d,
 struct device_attribute *attr, char *buf)
@@ -830,6 +852,8 @@ static ssize_t vlan_stats_enabled_store(struct device *d,
_attr_gc_timer.attr,
_attr_group_addr.attr,
_attr_flush.attr,
+   _attr_max_fdb_count.attr,
+   _attr_fdb_count.attr,
  #ifdef CONFIG_BRIDGE_IGMP_SNOOPING
_attr_multicast_router.attr,
_attr_multicast_snooping.attr,





Documentation/filesystems/sysfs.txt:

All new sysfs attributes must be documented in Documentation/ABI. See 
also Documentation/ABI/README for more information.



Egil

Re: [PATCH v3 2/4] net: nb8800: Simplify nb8800_pause_config()

2017-11-15 Thread Andrew Lunn

On Wed, Nov 15, 2017 at 04:19:56PM +0100, Marc Gonzalez wrote:
> On 15/11/2017 16:03, Andrew Lunn wrote:
> 
> > On Wed, Nov 15, 2017 at 03:33:47PM +0100, Marc Gonzalez wrote:
> > 
> >> On 15/11/2017 15:17, Andrew Lunn wrote:
> >>
> >> In our local branch, I have completely disabled flow control support,
> >> so I don't have to worry about this problem.
> > 
> > That is an interesting statement. You now know there is an issue here,
> > your solution is to fix your private branch and leave mainline as is.
> 
> All my patches are NACKed, what would you have me do?

Hi Marc

You need to consider your own maintenance burden. You want your local
branch to be as near to mainline as possible. Each change you have
means additional maintenance work for you. It also possibly means
additional work for your customers.

You seem to think flow control in your hardware is too broken to be
usable. So you probably want to submit a patch to mainline disabling
it. If it is accepted, that is one less patch you need to maintain.

Andrew

[net-next 1/1] tipc: enforce valid ratio between skb truesize and contents

2017-11-15 Thread Jon Maloy

The socket level flow control is based on the assumption that incoming
buffers meet the condition (skb->truesize / roundup(skb->len) <= 4),
where the latter value is rounded off upwards to the nearest 1k number.
This does empirically hold true for the device drivers we know, but we
cannot trust that it will always be so, e.g., in a system with jumbo
frames and very small packets.

We now introduce a check for this condition at packet arrival, and if
we find it to be false, we copy the packet to a new, smaller buffer,
where the condition will be true. We expect this to affect only a small
fraction of all incoming packets, if at all.

Acked-by: Ying Xue 
Signed-off-by: Jon Maloy 
---
 net/tipc/msg.c  | 24 +---
 net/tipc/msg.h  |  7 ++-
 net/tipc/node.c |  2 +-
 3 files changed, 24 insertions(+), 9 deletions(-)

diff --git a/net/tipc/msg.c b/net/tipc/msg.c
index 1649d45..b0d07b3 100644
--- a/net/tipc/msg.c
+++ b/net/tipc/msg.c
@@ -174,7 +174,7 @@ int tipc_buf_append(struct sk_buff **headbuf, struct 
sk_buff **buf)
 
if (fragid == LAST_FRAGMENT) {
TIPC_SKB_CB(head)->validated = false;
-   if (unlikely(!tipc_msg_validate(head)))
+   if (unlikely(!tipc_msg_validate()))
goto err;
*buf = head;
TIPC_SKB_CB(head)->tail = NULL;
@@ -201,11 +201,21 @@ int tipc_buf_append(struct sk_buff **headbuf, struct 
sk_buff **buf)
  * TIPC will ignore the excess, under the assumption that it is optional info
  * introduced by a later release of the protocol.
  */
-bool tipc_msg_validate(struct sk_buff *skb)
+bool tipc_msg_validate(struct sk_buff **_skb)
 {
-   struct tipc_msg *msg;
+   struct sk_buff *skb = *_skb;
+   struct tipc_msg *hdr;
int msz, hsz;
 
+   /* Ensure that flow control ratio condition is satisfied */
+   if (unlikely(skb->truesize / buf_roundup_len(skb) > 4)) {
+   skb = skb_copy(skb, GFP_ATOMIC);
+   if (!skb)
+   return false;
+   kfree_skb(*_skb);
+   *_skb = skb;
+   }
+
if (unlikely(TIPC_SKB_CB(skb)->validated))
return true;
if (unlikely(!pskb_may_pull(skb, MIN_H_SIZE)))
@@ -217,11 +227,11 @@ bool tipc_msg_validate(struct sk_buff *skb)
if (unlikely(!pskb_may_pull(skb, hsz)))
return false;
 
-   msg = buf_msg(skb);
-   if (unlikely(msg_version(msg) != TIPC_VERSION))
+   hdr = buf_msg(skb);
+   if (unlikely(msg_version(hdr) != TIPC_VERSION))
return false;
 
-   msz = msg_size(msg);
+   msz = msg_size(hdr);
if (unlikely(msz < hsz))
return false;
if (unlikely((msz - hsz) > TIPC_MAX_USER_MSG_SIZE))
@@ -411,7 +421,7 @@ bool tipc_msg_extract(struct sk_buff *skb, struct sk_buff 
**iskb, int *pos)
skb_pull(*iskb, offset);
imsz = msg_size(buf_msg(*iskb));
skb_trim(*iskb, imsz);
-   if (unlikely(!tipc_msg_validate(*iskb)))
+   if (unlikely(!tipc_msg_validate(iskb)))
goto none;
*pos += align(imsz);
return true;
diff --git a/net/tipc/msg.h b/net/tipc/msg.h
index d97d1ea..791b058 100644
--- a/net/tipc/msg.h
+++ b/net/tipc/msg.h
@@ -926,7 +926,7 @@ static inline bool msg_is_reset(struct tipc_msg *hdr)
 }
 
 struct sk_buff *tipc_buf_acquire(u32 size, gfp_t gfp);
-bool tipc_msg_validate(struct sk_buff *skb);
+bool tipc_msg_validate(struct sk_buff **_skb);
 bool tipc_msg_reverse(u32 own_addr, struct sk_buff **skb, int err);
 void tipc_skb_reject(struct net *net, int err, struct sk_buff *skb,
 struct sk_buff_head *xmitq);
@@ -954,6 +954,11 @@ static inline u16 buf_seqno(struct sk_buff *skb)
return msg_seqno(buf_msg(skb));
 }
 
+static inline int buf_roundup_len(struct sk_buff *skb)
+{
+   return (skb->len / 1024 + 1) * 1024;
+}
+
 /* tipc_skb_peek(): peek and reserve first buffer in list
  * @list: list to be peeked in
  * Returns pointer to first buffer in list, if any
diff --git a/net/tipc/node.c b/net/tipc/node.c
index 009a816..507017f 100644
--- a/net/tipc/node.c
+++ b/net/tipc/node.c
@@ -1539,7 +1539,7 @@ void tipc_rcv(struct net *net, struct sk_buff *skb, 
struct tipc_bearer *b)
__skb_queue_head_init();
 
/* Ensure message is well-formed before touching the header */
-   if (unlikely(!tipc_msg_validate(skb)))
+   if (unlikely(!tipc_msg_validate()))
goto discard;
hdr = buf_msg(skb);
usr = msg_user(hdr);
-- 
2.1.4

Re: [GIT] Networking

2017-11-15 Thread Linus Torvalds

On Wed, Nov 15, 2017 at 3:33 AM, David Miller  wrote:
>
> Highlights:

Lowlights:

 1) it duplicated a commit from the hrtimer tree, which had been
cleaned up and rewritten, but then merging the second copy of the
commit re-introduced the bad code that had been cleaned up.

I'm talking about commits

 - 7d9285e82db5:
  perf/bpf: Extend the perf_event_read_local() interface, a.k.a.
"bpf: perf event change needed for subsequent bpf helpers"
 - 97562633bcba
  bpf: perf event change needed for subsequent bpf helpers

where apparently there was no discussion between the groups about the
subsequent changes.

And this must have shown up in linux-next as a conflict, but no
mention of it from either the perf event tree or the networking tree
merge.

Although it is of course possible that depending on merge order, the
problem never showed up in next.

Linus

Re: [PATCH] net: bridge: add max_fdb_count

2017-11-15 Thread Stephen Hemminger

On Wed, 15 Nov 2017 11:27:07 -0800
Sarah Newman  wrote:

> Current memory and CPU usage for managing bridge fdb entries is unbounded.
> Add a parameter max_fdb_count, controlled from sysfs, which places an upper
> limit on the number of entries. Defaults to 1024.
> 
> When max_fdb_count is met or exceeded, whether traffic is sent out a
> given port should depend on its flooding behavior.
> 
> This may instead be mitigated by filtering mac address entries in the
> PREROUTING chain of the ebtables nat table, but this is only practical
> when mac addresses are known in advance.
> 
> Signed-off-by: Sarah Newman 

Thanks for your patch, I can see how this can be a real
problem, especially for a learning bridge.

There are some details which probably need to be resolved before
this can be applied.

* if the bridge is being DoS attacked by random MAC addresses
  then the policy under spam needs to be considered.
  Not sure if not inserting the new entry is the best policy.

* The counter probably doesn't need to be unsigned long (64 bit on x86_64)
  maybe u32 is enough.

* The bridge attributes in general are controllable both by netlink
  and sysfs. It would make sense to do this for max fdb as well.

Also what do the vendors using bridge for L2 offload to switch think?
Many switches need to clone table, and similar limits must be in other
versions. FreeBSD has a limit like this.

[PATCH] qed: fix unnecessary call to memset cocci warnings

2017-11-15 Thread Vasyl Gomonovych

Use kzalloc rather than kmalloc followed by memset with 0
drivers/net/ethernet/qlogic/qed/qed_dcbx.c:1280:13-20: WARNING:
kzalloc should be used for dcbx_info, instead of kmalloc/memset
Generated by: scripts/coccinelle/api/alloc/kzalloc-simple.cocci

Originally was fixed in:
'commit 561ed23331df ("qed: fix kzalloc-simple.cocci warnings")'
But later warning was introduced in:
'commit 62289ba27558 ("qed: Fix possible system hang in the
dcbnl-getdcbx() path.")'

Signed-off-by: Vasyl Gomonovych 
---
 drivers/net/ethernet/qlogic/qed/qed_dcbx.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/qlogic/qed/qed_dcbx.c 
b/drivers/net/ethernet/qlogic/qed/qed_dcbx.c
index 8f6ccc0c39e5..9a1bd996a640 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_dcbx.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_dcbx.c
@@ -1277,11 +1277,10 @@ static struct qed_dcbx_get *qed_dcbnl_get_dcbx(struct 
qed_hwfn *hwfn,
 {
struct qed_dcbx_get *dcbx_info;
 
-   dcbx_info = kmalloc(sizeof(*dcbx_info), GFP_ATOMIC);
+   dcbx_info = kzmalloc(sizeof(*dcbx_info), GFP_ATOMIC);
if (!dcbx_info)
return NULL;
 
-   memset(dcbx_info, 0, sizeof(*dcbx_info));
if (qed_dcbx_query_params(hwfn, dcbx_info, type)) {
kfree(dcbx_info);
return NULL;
-- 
1.9.1

[RFC PATCH net-next 1/2] ethtool: Add support for configuring PFC stall prevention in ethtool

2017-11-15 Thread Eran Ben Elisha

From: Inbar Karmy 

In the event where the device unexpectedly becomes unresponsive
for a long period of time, flow control mechanism may propagate
pause frames which will cause congestion spreading to the entire
network.
To prevent this scenario, when the device is stalled for a period
longer than a pre-configured timeout, flow control mechanisms are
automatically disabled.
This patch defines a new ETHTOOL_GPFCPREVENTION/ETHTOOL_SPFCPREVENTION API,
handled by the new get_pfc_prevention_mode/set_pfc_prevention_mode
callbacks.
This API provides support for configuring flow control protection mechanism
into two deferent modes: default/auto.

Signed-off-by: Inbar Karmy 
Reviewed-by: Eran Ben Elisha 
---
 include/linux/ethtool.h  |  8 
 include/uapi/linux/ethtool.h | 20 
 net/core/ethtool.c   | 39 +++
 3 files changed, 67 insertions(+)

diff --git a/include/linux/ethtool.h b/include/linux/ethtool.h
index 2ec41a7..f6d5e26 100644
--- a/include/linux/ethtool.h
+++ b/include/linux/ethtool.h
@@ -310,6 +310,10 @@ bool ethtool_convert_link_mode_to_legacy_u32(u32 
*legacy_u32,
  * fields should be ignored (use %__ETHTOOL_LINK_MODE_MASK_NBITS
  * instead of the latter), any change to them will be overwritten
  * by kernel. Returns a negative error code or zero.
+ * @get_pfc_prevention_mode: Report pfc storm prevention mode-
+ * default/auto.
+ * @set_pfc_prevention_mode: Set pfc storm prevention mode. Returns
+ * a negative error code or zero.
  *
  * All operations are optional (i.e. the function pointer may be set
  * to %NULL) and callers must take this into account.  Callers must
@@ -400,5 +404,9 @@ struct ethtool_ops {
  struct ethtool_fecparam *);
int (*set_fecparam)(struct net_device *,
  struct ethtool_fecparam *);
+   int (*get_pfc_prevention_mode)(struct net_device *,
+  struct ethtool_pfc_prevention *);
+   int (*set_pfc_prevention_mode)(struct net_device *,
+  struct ethtool_pfc_prevention *);
 };
 #endif /* _LINUX_ETHTOOL_H */
diff --git a/include/uapi/linux/ethtool.h b/include/uapi/linux/ethtool.h
index ac71559..6ba0d43 100644
--- a/include/uapi/linux/ethtool.h
+++ b/include/uapi/linux/ethtool.h
@@ -552,6 +552,24 @@ struct ethtool_pauseparam {
 #define ETH_GSTRING_LEN32
 
 /**
+ * struct ethtool_pfc_prevention - flow control stall prevention mode
+ * @cmd: Command number = %ETHTOOL_GPFCPREVENTION or %ETHTOOL_SPFCPREVENTION
+ * @mode: Flag to define the time till activate pfc stall prevention:
+ *   auto, default.
+ */
+
+enum pfc_prevention_type {
+   ETH_PFC_PREVENTION_DEFAULT  = 0,
+   ETH_PFC_PREVENTION_AUTO,
+};
+
+struct ethtool_pfc_prevention {
+   __u32   cmd;
+   __u32   mode;
+   __u32   reserved;
+};
+
+/**
  * enum ethtool_stringset - string set ID
  * @ETH_SS_TEST: Self-test result names, for use with %ETHTOOL_TEST
  * @ETH_SS_STATS: Statistic names, for use with %ETHTOOL_GSTATS
@@ -1374,6 +1392,8 @@ enum ethtool_fec_config_bits {
 #define ETHTOOL_PHY_STUNABLE   0x004f /* Set PHY tunable configuration */
 #define ETHTOOL_GFECPARAM  0x0050 /* Get FEC settings */
 #define ETHTOOL_SFECPARAM  0x0051 /* Set FEC settings */
+#define ETHTOOL_GPFCPREVENTION  0x0052 /* Get PFC stall prevention 
configuration */
+#define ETHTOOL_SPFCPREVENTION  0x0053 /* Set PFC stall prevention 
configuration*/
 
 /* compatibility with older code */
 #define SPARC_ETH_GSET ETHTOOL_GSET
diff --git a/net/core/ethtool.c b/net/core/ethtool.c
index f8fcf45..a656ecc 100644
--- a/net/core/ethtool.c
+++ b/net/core/ethtool.c
@@ -2556,6 +2556,38 @@ static int ethtool_set_fecparam(struct net_device *dev, 
void __user *useraddr)
return dev->ethtool_ops->set_fecparam(dev, );
 }
 
+static int ethtool_get_pfc_prevention_mode(struct net_device *dev, void __user 
*useraddr)
+{
+   struct ethtool_pfc_prevention preventionmod = { .cmd = 
ETHTOOL_GPFCPREVENTION };
+   int ret;
+
+   if (!dev->ethtool_ops->get_pfc_prevention_mode)
+   return -EOPNOTSUPP;
+
+   ret = dev->ethtool_ops->get_pfc_prevention_mode(dev, );
+
+   if (copy_to_user(useraddr, , sizeof(preventionmod)))
+   return -EFAULT;
+
+   return ret;
+}
+
+static int ethtool_set_pfc_prevention_mode(struct net_device *dev, void __user 
*useraddr)
+{
+   struct ethtool_pfc_prevention preventionmod;
+
+   if (!dev->ethtool_ops->set_pfc_prevention_mode)
+   return -EOPNOTSUPP;
+
+   if (copy_from_user(, useraddr, sizeof(preventionmod)))
+   return -EFAULT;
+
+   if (preventionmod.reserved)
+   return -EINVAL;
+
+   return

[RFC PATCH net-next 0/2] Configuring PFC stall prevention via ethtool

2017-11-15 Thread Eran Ben Elisha

From: Inbar Karmy 

This RFC adds support for configuring PFC stall prevention through ethtool.

In the event where the device unexpectedly becomes unresponsive for a long
period of time, flow control mechanism may propagate pause frames which will
cause congestion spreading to the entire network.

To prevent this scenario, the device may implement a protection mechanism for
monitoring and resolving such state.  The following patches allow the user to
control the stall prevention functionality.

PFC stall prevention configuration is done via ethtool -a (pause).
Two modes are introduced:
Default - current behavior per driver.
Auto - protection mechanism controlled automatically by the driver.
Due to lack of extension ability of ethtool_ops set_pauseparam, a new
ethtool_ops get_pfc_prevention_mode is introduced.

I based this patch set on net-next commit 50895b9de1d3("tcp: highest_sack fix
").

Inbar Karmy (2):
  ethtool: Add support for configuring PFC stall prevention in ethtool
  net/mlx5e: PFC stall prevention support

 .../net/ethernet/mellanox/mlx5/core/en_ethtool.c   | 51 
 drivers/net/ethernet/mellanox/mlx5/core/port.c | 56 ++
 include/linux/ethtool.h|  8 
 include/linux/mlx5/mlx5_ifc.h  | 22 +++--
 include/linux/mlx5/port.h  |  5 ++
 include/uapi/linux/ethtool.h   | 20 
 net/core/ethtool.c | 39 +++
 7 files changed, 188 insertions(+), 13 deletions(-)

-- 
1.8.3.1

[RFC PATCH net-next 2/2] net/mlx5e: PFC stall prevention support

2017-11-15 Thread Eran Ben Elisha

From: Inbar Karmy 

Implement set/get functions to configure PFC stall prevention
mode by ethtool.
On default the stall prevention timeout is configured to 8 sec.
Auto mode will set the stall prevention timeout to be 100 msec.

Signed-off-by: Inbar Karmy 
Reviewed-by: Eran Ben Elisha 
---
 .../net/ethernet/mellanox/mlx5/core/en_ethtool.c   | 51 
 drivers/net/ethernet/mellanox/mlx5/core/port.c | 56 ++
 include/linux/mlx5/mlx5_ifc.h  | 22 +++--
 include/linux/mlx5/port.h  |  5 ++
 4 files changed, 121 insertions(+), 13 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
index 23425f0..74096f1 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
@@ -1180,6 +1180,55 @@ static int mlx5e_set_pauseparam(struct net_device 
*netdev,
return err;
 }
 
+#define MLX5E_PFC_PREVEN_CRITICAL_AUTO_MSEC100
+#define MLX5E_PFC_PREVEN_MINOR_AUTO_MSEC   85
+#define MLX5E_PFC_PREVEN_CRITICAL_DEFAULT_MSEC 8000
+#define MLX5E_PFC_PREVEN_MINOR_DEFAULT_MSEC6800
+
+static int mlx5e_get_pfc_prevention_mode(struct net_device *netdev,
+struct ethtool_pfc_prevention 
*pfc_preven)
+{
+   struct mlx5e_priv *priv= netdev_priv(netdev);
+   struct mlx5_core_dev *mdev = priv->mdev;
+   u16 pfc_prevention_critical;
+   int err;
+
+   if (!MLX5_CAP_PCAM_FEATURE((priv)->mdev, pfcc_mask))
+   return -EOPNOTSUPP;
+
+   err = mlx5_query_port_pfc_prevention(mdev, _prevention_critical);
+
+   pfc_preven->mode = (pfc_prevention_critical ==
+   MLX5E_PFC_PREVEN_CRITICAL_DEFAULT_MSEC) ?
+   ETH_PFC_PREVENTION_DEFAULT : 
ETH_PFC_PREVENTION_AUTO;
+   return err;
+}
+
+static int mlx5e_set_pfc_prevention_mode(struct net_device *netdev,
+struct ethtool_pfc_prevention 
*pfc_preven)
+{
+   struct mlx5e_priv *priv= netdev_priv(netdev);
+   struct mlx5_core_dev *mdev = priv->mdev;
+   u16 pfc_prevention_critical;
+   u16 pfc_prevention_minor;
+   int err;
+
+   if (!MLX5_CAP_PCAM_FEATURE((priv)->mdev, pfcc_mask))
+   return -EOPNOTSUPP;
+
+   pfc_prevention_critical = (pfc_preven->mode == 
ETH_PFC_PREVENTION_DEFAULT) ?
+  MLX5E_PFC_PREVEN_CRITICAL_DEFAULT_MSEC :
+  MLX5E_PFC_PREVEN_CRITICAL_AUTO_MSEC;
+   pfc_prevention_minor = (pfc_prevention_critical ==
+   MLX5E_PFC_PREVEN_CRITICAL_DEFAULT_MSEC) ?
+   MLX5E_PFC_PREVEN_MINOR_DEFAULT_MSEC :
+   MLX5E_PFC_PREVEN_MINOR_AUTO_MSEC;
+   err = mlx5_set_port_pfc_prevention(mdev, pfc_prevention_critical,
+  pfc_prevention_minor);
+
+   return err;
+}
+
 int mlx5e_ethtool_get_ts_info(struct mlx5e_priv *priv,
  struct ethtool_ts_info *info)
 {
@@ -1696,6 +1745,8 @@ static int mlx5e_flash_device(struct net_device *dev,
.set_tunable   = mlx5e_set_tunable,
.get_pauseparam= mlx5e_get_pauseparam,
.set_pauseparam= mlx5e_set_pauseparam,
+   .get_pfc_prevention_mode = mlx5e_get_pfc_prevention_mode,
+   .set_pfc_prevention_mode = mlx5e_set_pfc_prevention_mode,
.get_ts_info   = mlx5e_get_ts_info,
.set_phys_id   = mlx5e_set_phys_id,
.get_wol   = mlx5e_get_wol,
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/port.c 
b/drivers/net/ethernet/mellanox/mlx5/core/port.c
index c37d00c..d2f7357 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/port.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/port.c
@@ -483,6 +483,16 @@ int mlx5_core_query_ib_ppcnt(struct mlx5_core_dev *dev,
 }
 EXPORT_SYMBOL_GPL(mlx5_core_query_ib_ppcnt);
 
+static int mlx5_query_pfcc_reg(struct mlx5_core_dev *dev, u32 *out, u32 
out_size)
+{
+   u32 in[MLX5_ST_SZ_DW(pfcc_reg)] = {0};
+
+   MLX5_SET(pfcc_reg, in, local_port, 1);
+
+   return mlx5_core_access_reg(dev, in, sizeof(in), out,
+   out_size, MLX5_REG_PFCC, 0, 0);
+}
+
 int mlx5_set_port_pause(struct mlx5_core_dev *dev, u32 rx_pause, u32 tx_pause)
 {
u32 in[MLX5_ST_SZ_DW(pfcc_reg)] = {0};
@@ -500,13 +510,10 @@ int mlx5_set_port_pause(struct mlx5_core_dev *dev, u32 
rx_pause, u32 tx_pause)
 int mlx5_query_port_pause(struct mlx5_core_dev *dev,
  u32 *rx_pause, u32 *tx_pause)
 {
-   u32 in[MLX5_ST_SZ_DW(pfcc_reg)] = {0};
u32 out[MLX5_ST_SZ_DW(pfcc_reg)];
int err;
 
-   MLX5_SET(pfcc_reg, in, local_port, 1);
-   err = mlx5_core_access_reg(dev, in, sizeof(in),

Re: [PATCH] net: bridge: add max_fdb_count

2017-11-15 Thread Sarah Newman

On 11/15/2017 11:27 AM, Sarah Newman wrote:
> Current memory and CPU usage for managing bridge fdb entries is unbounded.
> Add a parameter max_fdb_count, controlled from sysfs, which places an upper
> limit on the number of entries. Defaults to 1024.
> 
> When max_fdb_count is met or exceeded, whether traffic is sent out a
> given port should depend on its flooding behavior.
> 
> This may instead be mitigated by filtering mac address entries in the
> PREROUTING chain of the ebtables nat table, but this is only practical
> when mac addresses are known in advance.
> 
> Signed-off-by: Sarah Newman 

I would love to improve this patch, but have limited time to devote to this...
What I would try first would be to maintain a data structure roughly ordered
based on both number of times an address was observed as well as age and evict
the least used, oldest entry when max_fdb_count was reached.

--Sarah

[PATCH] net: bridge: add max_fdb_count

2017-11-15 Thread Sarah Newman

Current memory and CPU usage for managing bridge fdb entries is unbounded.
Add a parameter max_fdb_count, controlled from sysfs, which places an upper
limit on the number of entries. Defaults to 1024.

When max_fdb_count is met or exceeded, whether traffic is sent out a
given port should depend on its flooding behavior.

This may instead be mitigated by filtering mac address entries in the
PREROUTING chain of the ebtables nat table, but this is only practical
when mac addresses are known in advance.

Signed-off-by: Sarah Newman 
---
 net/bridge/br_device.c   |  2 ++
 net/bridge/br_fdb.c  | 25 -
 net/bridge/br_private.h  |  3 +++
 net/bridge/br_sysfs_br.c | 24 
 4 files changed, 49 insertions(+), 5 deletions(-)

diff --git a/net/bridge/br_device.c b/net/bridge/br_device.c
index af5b8c8..aa7a7f4 100644
--- a/net/bridge/br_device.c
+++ b/net/bridge/br_device.c
@@ -432,6 +432,8 @@ void br_dev_setup(struct net_device *dev)
br->bridge_hello_time = br->hello_time = 2 * HZ;
br->bridge_forward_delay = br->forward_delay = 15 * HZ;
br->bridge_ageing_time = br->ageing_time = BR_DEFAULT_AGEING_TIME;
+   br->max_fdb_count = 1024;
+   br->fdb_count = 0;
dev->max_mtu = ETH_MAX_MTU;
 
br_netfilter_rtable_init(br);
diff --git a/net/bridge/br_fdb.c b/net/bridge/br_fdb.c
index 4ea5c8b..0422d48 100644
--- a/net/bridge/br_fdb.c
+++ b/net/bridge/br_fdb.c
@@ -179,6 +179,7 @@ static void fdb_delete(struct net_bridge *br, struct 
net_bridge_fdb_entry *f)
 
hlist_del_init_rcu(>hlist);
fdb_notify(br, f, RTM_DELNEIGH);
+   br->fdb_count -= 1;
call_rcu(>rcu, fdb_rcu_free);
 }
 
@@ -478,7 +479,8 @@ int br_fdb_fillbuf(struct net_bridge *br, void *buf,
return num;
 }
 
-static struct net_bridge_fdb_entry *fdb_create(struct hlist_head *head,
+static struct net_bridge_fdb_entry *fdb_create(struct net_bridge *br,
+  struct hlist_head *head,
   struct net_bridge_port *source,
   const unsigned char *addr,
   __u16 vid,
@@ -498,6 +500,7 @@ static struct net_bridge_fdb_entry *fdb_create(struct 
hlist_head *head,
fdb->added_by_external_learn = 0;
fdb->offloaded = 0;
fdb->updated = fdb->used = jiffies;
+   br->fdb_count += 1;
hlist_add_head_rcu(>hlist, head);
}
return fdb;
@@ -524,7 +527,7 @@ static int fdb_insert(struct net_bridge *br, struct 
net_bridge_port *source,
fdb_delete(br, fdb);
}
 
-   fdb = fdb_create(head, source, addr, vid, 1, 1);
+   fdb = fdb_create(br, head, source, addr, vid, 1, 1);
if (!fdb)
return -ENOMEM;
 
@@ -556,6 +559,10 @@ void br_fdb_update(struct net_bridge *br, struct 
net_bridge_port *source,
if (hold_time(br) == 0)
return;
 
+   /* Place maximum on number of learned entries. */
+   if (br->max_fdb_count <= br->fdb_count)
+   return;
+
/* ignore packets unless we are using this port */
if (!(source->state == BR_STATE_LEARNING ||
  source->state == BR_STATE_FORWARDING))
@@ -591,7 +598,7 @@ void br_fdb_update(struct net_bridge *br, struct 
net_bridge_port *source,
} else {
spin_lock(>hash_lock);
if (likely(!fdb_find_rcu(head, addr, vid))) {
-   fdb = fdb_create(head, source, addr, vid, 0, 0);
+   fdb = fdb_create(br, head, source, addr, vid, 0, 0);
if (fdb) {
if (unlikely(added_by_user))
fdb->added_by_user = 1;
@@ -787,7 +794,7 @@ static int fdb_add_entry(struct net_bridge *br, struct 
net_bridge_port *source,
if (!(flags & NLM_F_CREATE))
return -ENOENT;
 
-   fdb = fdb_create(head, source, addr, vid, 0, 0);
+   fdb = fdb_create(br, head, source, addr, vid, 0, 0);
if (!fdb)
return -ENOMEM;
 
@@ -1081,7 +1088,7 @@ int br_fdb_external_learn_add(struct net_bridge *br, 
struct net_bridge_port *p,
head = >hash[br_mac_hash(addr, vid)];
fdb = br_fdb_find(br, addr, vid);
if (!fdb) {
-   fdb = fdb_create(head, p, addr, vid, 0, 0);
+   fdb = fdb_create(br, head, p, addr, vid, 0, 0);
if (!fdb) {
err = -ENOMEM;
goto err_unlock;
@@ -1147,3 +1154,11 @@ void br_fdb_offloaded_set(struct net_bridge *br, struct 
net_bridge_port *p,
 
spin_unlock_bh(>hash_lock);
 }
+
+int br_set_max_fdb_count(struct net_bridge *br, unsigned long max_fdb_count)
+{
+   spin_lock_bh(>lock);
+   br->max_fdb_count =

Re: [PATCH v3 1/8] dt-bindings: Add Cavium Octeon Common Ethernet Interface.

2017-11-15 Thread David Daney


On 11/15/2017 11:18 AM, Rob Herring wrote:

On Thu, Nov 09, 2017 at 11:29:08AM -0800, David Daney wrote:

From: Carlos Munoz 

Add bindings for Common Ethernet Interface (BGX) block.

Signed-off-by: Carlos Munoz 
Signed-off-by: Steven J. Hill 
Signed-off-by: David Daney 
---
  .../devicetree/bindings/net/cavium-bgx.txt | 61 ++
  1 file changed, 61 insertions(+)
  create mode 100644 Documentation/devicetree/bindings/net/cavium-bgx.txt

diff --git a/Documentation/devicetree/bindings/net/cavium-bgx.txt 
b/Documentation/devicetree/bindings/net/cavium-bgx.txt
new file mode 100644
index ..6b1f8b994c20
--- /dev/null
+++ b/Documentation/devicetree/bindings/net/cavium-bgx.txt
@@ -0,0 +1,61 @@
+* Common Ethernet Interface (BGX) block
+
+Properties:
+
+- compatible: "cavium,octeon-7890-bgx": Compatibility with all cn7xxx SOCs.
+
+- reg: The base address of the BGX block.
+
+- #address-cells: Must be <1>.
+
+- #size-cells: Must be <0>.  BGX addresses have no size component.
+
+A BGX block has several children, each representing an Ethernet
+interface.
+
+
+* Ethernet Interface (BGX port) connects to PKI/PKO
+
+Properties:
+
+- compatible: "cavium,octeon-7890-bgx-port": Compatibility with all
+ cn7xxx SOCs.
+
+ "cavium,octeon-7360-xcv": Compatibility with cn73xx SOCs
+ for RGMII.
+
+- reg: The index of the interface within the BGX block.


There's no memory mapped region associated with the sub-blocks?


No.  Many of the sub block control bits are co-located in the same 
registers.






+
+Optional properties:
+
+- local-mac-address: Mac address for the interface.
+
+- phy-handle: phandle to the phy node connected to the interface.
+
+- phy-mode: described in ethernet.txt.
+
+- fixed-link: described in fixed-link.txt.
+
+Example:
+
+   ethernet-mac-nexus@11800e000 {
+   compatible = "cavium,octeon-7890-bgx";
+   reg = <0x00011800 0xe000 0x 0x0100>;
+   #address-cells = <1>;
+   #size-cells = <0>;
+
+   ethernet-mac@0 {


ethernet@0



Good point.  I will fix that.





+   compatible = "cavium,octeon-7360-xcv";
+   reg = <0>;
+   local-mac-address = [ 00 01 23 45 67 89 ];
+   phy-handle = <>;
+   phy-mode = "rgmii-rxid"
+   };
+   ethernet-mac@1 {


ditto.

Otherwise,

Acked-by: Rob Herring 


Thanks,
David Daney



+   compatible = "cavium,octeon-7890-bgx-port";
+   reg = <1>;
+   local-mac-address = [ 00 01 23 45 67 8a ];
+   phy-handle = <>;
+   phy-mode = "sgmii"
+   };
+   };
--
2.13.6

Re: [PATCH v3 1/8] dt-bindings: Add Cavium Octeon Common Ethernet Interface.

2017-11-15 Thread Rob Herring

On Thu, Nov 09, 2017 at 11:29:08AM -0800, David Daney wrote:
> From: Carlos Munoz 
> 
> Add bindings for Common Ethernet Interface (BGX) block.
> 
> Signed-off-by: Carlos Munoz 
> Signed-off-by: Steven J. Hill 
> Signed-off-by: David Daney 
> ---
>  .../devicetree/bindings/net/cavium-bgx.txt | 61 
> ++
>  1 file changed, 61 insertions(+)
>  create mode 100644 Documentation/devicetree/bindings/net/cavium-bgx.txt
> 
> diff --git a/Documentation/devicetree/bindings/net/cavium-bgx.txt 
> b/Documentation/devicetree/bindings/net/cavium-bgx.txt
> new file mode 100644
> index ..6b1f8b994c20
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/net/cavium-bgx.txt
> @@ -0,0 +1,61 @@
> +* Common Ethernet Interface (BGX) block
> +
> +Properties:
> +
> +- compatible: "cavium,octeon-7890-bgx": Compatibility with all cn7xxx SOCs.
> +
> +- reg: The base address of the BGX block.
> +
> +- #address-cells: Must be <1>.
> +
> +- #size-cells: Must be <0>.  BGX addresses have no size component.
> +
> +A BGX block has several children, each representing an Ethernet
> +interface.
> +
> +
> +* Ethernet Interface (BGX port) connects to PKI/PKO
> +
> +Properties:
> +
> +- compatible: "cavium,octeon-7890-bgx-port": Compatibility with all
> +   cn7xxx SOCs.
> +
> +   "cavium,octeon-7360-xcv": Compatibility with cn73xx SOCs
> +   for RGMII.
> +
> +- reg: The index of the interface within the BGX block.

There's no memory mapped region associated with the sub-blocks?

> +
> +Optional properties:
> +
> +- local-mac-address: Mac address for the interface.
> +
> +- phy-handle: phandle to the phy node connected to the interface.
> +
> +- phy-mode: described in ethernet.txt.
> +
> +- fixed-link: described in fixed-link.txt.
> +
> +Example:
> +
> + ethernet-mac-nexus@11800e000 {
> + compatible = "cavium,octeon-7890-bgx";
> + reg = <0x00011800 0xe000 0x 0x0100>;
> + #address-cells = <1>;
> + #size-cells = <0>;
> +
> + ethernet-mac@0 {

ethernet@0

> + compatible = "cavium,octeon-7360-xcv";
> + reg = <0>;
> + local-mac-address = [ 00 01 23 45 67 89 ];
> + phy-handle = <>;
> + phy-mode = "rgmii-rxid"
> + };
> + ethernet-mac@1 {

ditto.

Otherwise,

Acked-by: Rob Herring 

> + compatible = "cavium,octeon-7890-bgx-port";
> + reg = <1>;
> + local-mac-address = [ 00 01 23 45 67 8a ];
> + phy-handle = <>;
> + phy-mode = "sgmii"
> + };
> + };
> -- 
> 2.13.6
>

Re: [RFC] [PATCH] netns: Fix race in virtual interface bringup

2017-11-15 Thread Dan Rue

Adding CC netdev

Can someone comment on the expected behavior of this test case?

Here's the isolated test:

ip netns del tst_net_ns0
ip netns del tst_net_ns1
ip netns add tst_net_ns0
ip netns add tst_net_ns1
ip netns exec tst_net_ns0 ip link add veth0 type veth peer name veth1
ip netns exec tst_net_ns0 ip link set veth1 netns tst_net_ns1

ip netns exec tst_net_ns0 ifconfig veth0 inet6 add fd00::2/64
ip netns exec tst_net_ns1 ifconfig veth1 inet6 add fd00::3/64
ip netns exec tst_net_ns0 ifconfig veth0 up
ip netns exec tst_net_ns1 ifconfig veth1 up

#sleep 2

ip netns exec tst_net_ns0 ping6 -q -c2 -I veth0 fd00::3

This is essentially what LTP is running. Sometimes, on some systems,
ping6 fails with "connect: Cannot assign requested address". Adding a
"sleep 2" always fixes it (but we'd obviously like to avoid a hard coded
sleep in the test).

Questions:
1) Is the behavior of "ifconfig up" intentionally asynchronous (I
believe so, based on dmesg)? If so, what is the correct way to find out
when the interface is available?

Thank you!
Dan

On Thu, Nov 09, 2017 at 02:38:41PM -0600, Dan Rue wrote:
> Symptoms (+ command, error):
> netns_comm_ip_ipv6_ioctl:
> + ip netns exec tst_net_ns1 ping6 -q -c2 -I veth1 fd00::2
> connect: Cannot assign requested address
> 
> netns_comm_ip_ipv6_netlink:
> + ip netns exec tst_net_ns0 ping6 -q -c2 -I veth0 fd00::3
> connect: Cannot assign requested address
> 
> netns_comm_ns_exec_ipv6_ioctl:
> + ns_exec 6689 net ping6 -q -c2 -I veth0 fd00::3
> connect: Cannot assign requested address
> 
> netns_comm_ns_exec_ipv6_netlin:
> + ns_exec 6891 net ping6 -q -c2 -I veth0 fd00::3
> connect: Cannot assign requested address
> 
> The error is coming from ping6, which is trying to get an IP address for
> veth0 (due to -I veth0), but cannot. Waiting for two seconds fixes the
> test in my testcases. 1 second is not long enough.
> 
> dmesg shows the following during the test:
> 
> [Nov 7 15:39] LTP: starting netns_comm_ip_ipv6_ioctl (netns_comm.sh ip 
> ipv6 ioctl)
> [  +0.302401] IPv6: ADDRCONF(NETDEV_UP): veth0: link is not ready
> [  +0.048059] IPv6: ADDRCONF(NETDEV_CHANGE): veth0: link becomes ready
> 
> Signed-off-by: Dan Rue 
> ---
> 
> We've periodically hit this problem across many arm64 kernels and boards, and
> it seems to be caused by "ping6" running before the virtual interface is
> actually ready. "sleep 2" works around the issue and proves that it is a race
> condition, but I would prefer something faster and deterministic. Please
> suggest a better implementation.
> 
> Also, is it correct that "ifconfig veth0 up" returns before the interface is
> actually ready?
> 
> See also this isolated test script:
> https://gist.github.com/danrue/7b76bbcbc23a6296030b7295650b69f3
> 
>  testcases/kernel/containers/netns/netns_helper.sh | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/testcases/kernel/containers/netns/netns_helper.sh 
> b/testcases/kernel/containers/netns/netns_helper.sh
> index a95cdf206..99172c0c0 100755
> --- a/testcases/kernel/containers/netns/netns_helper.sh
> +++ b/testcases/kernel/containers/netns/netns_helper.sh
> @@ -285,6 +285,7 @@ netns_set_ip()
>   tst_brkm TBROK "enabling veth1 device failed"
>   ;;
>   esac
> + sleep 2
>  }
>  
>  netns_ns_exec_cleanup()
> -- 
> 2.13.6
>

Re: [PATCH] net: recvmsg: Unconditionally zero struct sockaddr_storage

2017-11-15 Thread Eric W. Biederman

Kees Cook  writes:

> On Wed, Nov 1, 2017 at 5:48 AM, Eric W. Biederman  
> wrote:
>> Eric Dumazet  writes:
>>
>>> On Tue, 2017-10-31 at 09:14 -0700, Kees Cook wrote:
 Some protocols do not correctly wipe the contents of the on-stack
 struct sockaddr_storage sent down into recvmsg() (e.g. SCTP), and leak
 kernel stack contents to userspace. This wipes it unconditionally before
 per-protocol handlers run.

 Note that leaks like this are mitigated by building with
 CONFIG_GCC_PLUGIN_STRUCTLEAK_BYREF_ALL=y

 Reported-by: Alexander Potapenko 
 Cc: "David S. Miller" 
 Cc: netdev@vger.kernel.org
 Signed-off-by: Kees Cook 
 ---
  net/socket.c | 1 +
  1 file changed, 1 insertion(+)

 diff --git a/net/socket.c b/net/socket.c
 index c729625eb5d3..34183f4fbdf8 100644
 --- a/net/socket.c
 +++ b/net/socket.c
 @@ -2188,6 +2188,7 @@ static int ___sys_recvmsg(struct socket *sock, 
 struct user_msghdr __user *msg,
  struct sockaddr __user *uaddr;
  int __user *uaddr_len = COMPAT_NAMELEN(msg);

 +memset(, 0, sizeof(addr));
  msg_sys->msg_name = 

>>>
>>> This kind of patch comes every year.
>>>
>>> Standard answer is : We fix the buggy protocol, we do not make
>>> everything slower just because we are lazy.
>>>
>>> struct sockaddr is 128 bytes, but IPV4 only uses a fraction of it.
>>>
>>> Also memset() is using long word stores, so next 4-byte or 2-byte stores
>>> on same location hit a performance problem on x86.
>>>
>>> By adding all these defensive programming, we would give strong
>>> incentives to bypass the kernel for networking. That would be bad.
>>
>> In this case it looks like the root cause is something in sctp
>> not filling in the ipv6 sin6_scope_id.
>>
>> Which is not only a leak but a correctness bug.
>>
>> I ran the reproducer test program and while none of the leak checkers
>> are telling me anything I have gotten as far as seeing that the returned
>> length is correct and sometimes nonsense.
>>
>> Hmm.
>>
>> At a quick look it looks like all that is necessary is to do this:
>>
>> diff --git a/net/sctp/ipv6.c b/net/sctp/ipv6.c
>> index 51c488769590..6301913d0516 100644
>> --- a/net/sctp/ipv6.c
>> +++ b/net/sctp/ipv6.c
>> @@ -807,9 +807,10 @@ static void sctp_inet6_skb_msgname(struct sk_buff *skb, 
>> char *msgname,
>> addr->v6.sin6_flowinfo = 0;
>> addr->v6.sin6_port = sh->source;
>> addr->v6.sin6_addr = ipv6_hdr(skb)->saddr;
>> -   if (ipv6_addr_type(>v6.sin6_addr) & 
>> IPV6_ADDR_LINKLOCAL) {
>> +   if (ipv6_addr_type(>v6.sin6_addr) & 
>> IPV6_ADDR_LINKLOCAL)
>> addr->v6.sin6_scope_id = sctp_v6_skb_iif(skb);
>> -   }
>> +   else
>> +   addr->v6.sin6_scope_id = 0;
>> }
>>
>> *addr_len = sctp_v6_addr_to_user(sctp_sk(skb->sk), addr);
>>
>
> It looks like this never landed anywhere? Eric, are you able to resend
> this as a full patch?

I will take a look.  I have not conducted a thorough review to make
certain that is everything.  I was hoping someone else would pick that
change up and run with it.   However the change seems obviously correct
as is, so I don't have any problem sending just this bit.

Eric

Re: [PATCH V4 net 0/2] ipv6: fix flowlabel issue for reset packet

2017-11-15 Thread Martin KaFai Lau

On Tue, Nov 14, 2017 at 01:59:03PM -0800, Shaohua Li wrote:
> On Tue, Nov 14, 2017 at 11:13:10AM -0800, Tom Herbert wrote:
> > On Tue, Nov 14, 2017 at 10:24 AM, Shaohua Li  wrote:
> > > On Wed, Nov 08, 2017 at 09:44:51AM -0800, Tom Herbert wrote:
> > >> On Fri, Aug 18, 2017 at 3:27 PM, David Miller  
> > >> wrote:
> > >> > From: Martin KaFai Lau 
> > >> > Date: Fri, 18 Aug 2017 13:51:36 -0700
> > >> >
> > >> >> It seems like that middle box specifically drops TCP_RST if it
> > >> >> does not know anything about this flow.  Since the flowlabel of the 
> > >> >> TCP_RST
> > >> >> (sent in TW state) is always different, it always lands to a 
> > >> >> different middle
> > >> >> box.  All of these TCP_RST cannot be delivered.
> > >> >
> > >> > This really is illegal behavior.  The flow label is not a flow _KEY_
> > >> > by any definition whatsoever.
> > >> >
> > >> > Flow labels are an optimization, not a determinant for flow matching
> > >> > particularly for proper TCP state processing.
> > >> >
> > >> > I'd rather you invest all of this energy getting that vendor to fix
> > >> > their kit.
> > >> >
> > >> We're now seeing several router vendors recommending people to not use
> > >> flow labels for ECMP hashing. This is precisely because when a flow
> > >> label changes, network devices that maintain state (firewalls, NAT,
> > >> load balancers) can't deal with packets being rerouted so connections
> > >> are dropped. Unfortunately, the need for packets of a flow to always
> > >> follow the same path has become an implicit requirement that I think
> > >> we need follow at least as the default behavior.
> > >>
> > >> Martin: is there any change you could resurrect these patches? In
> > >> order to solve the general problem of making routing consistent, I
> > >> believe we want to keep sk_tx_hash consistent for the connection from
> > >> which a consistent flow label can be derived. To avoid the overhead of
> > >> a hash field in sk_common, maybe we could initially set a connection
> > >> hash to a five-tuple hash for a flow instead of a random value? So in
> > >> TW state the consistent hash can be computed on the fly.
> > >
> > > Hi Tom,
> > > Do we really need to use the five-tupe hash? There are several places 
> > > using
> > > current random hash, which looks more lightweight. To fix issue, we only 
> > > need
> > > to make sure reset packet include the correct flowlabel. Like what my 
> > > previous
> > > patch did, we can set tw->tw_flowlabel in tcp_time_wait based on txhash 
> > > and use
> > > it reset packet. In this way we can use the random hash and not add extra 
> > > field
> > > in sock.
> > >
> > Shaohua,
> > 
> > But that patch discards the full txhash in TW. So it's not just a
> > problem with the flow label. sk_tx_hash can also be used for route
> > selection in ECMP, port selection we're doing tunneling, etc. The
> > general solution should maintains tx_hash or be able to reconstruct it
> > in any state, flow label fix is a point solution.
> 
> Hi Tom,
> 
> do you want to keep sk_rethink_txhash() then? If we changed the hash to random
> number, we can't reconstruct it for sure.
A followup question on rethink.  Does it mean we need
a new sysctl (persistent_txhash) to avoid sk_rethink_txhash() together
such that it keeps the routing decision consistent (e.g. flowlabel) ?

> 
> Thanks,
> Shaohua

Re: SRIOV switchdev mode BoF minutes

2017-11-15 Thread Alexander Duyck

On Tue, Nov 14, 2017 at 8:02 PM, Jakub Kicinski
 wrote:
> On Tue, 14 Nov 2017 19:04:36 -0800, Alexander Duyck wrote:
>> On Tue, Nov 14, 2017 at 3:36 PM, Jakub Kicinski
>>  wrote:
>> > On Tue, 14 Nov 2017 15:05:08 -0800, Alexander Duyck wrote:
>> >> >> We basically need to do some feasability research to see if we can
>> >> >> actually meet all the requirements for switchdev on i40e. We have been
>> >> >> getting mixed messages where we are given a great many "yes, but" type
>> >> >> answers. For i40e we are looking into it but I don't have high
>> >> >> confidence in our ability to actually support it in hardare/firmware.
>> >> >> If it were as easy as you have been led to believe, we would have done
>> >> >> it months ago when we were researching the requirements to support 
>> >> >> switchdev
>> >> >
>> >> > wait, Sridhar made seven rounds of his submission (this is the v7
>> >> > pointer [1]) and you
>> >> > still don't know if what you were attempting to push upstream can
>> >> > work, something is
>> >> > weird here, can you clarify? Jeff?
>> >>
>> >> Not weird so much as stubborn. The patches were being pushed based on
>> >> the assumption that the community would accept a NIC generating port
>> >> representors that didn't necessarily pass traffic, and then even when
>> >> we had them passing traffic the PF still wasn't configured to handle
>> >> being the default destination for traffic without any rules
>> >> associated, instead VFs would directly send to the outside world.
>> >
>> > Perhaps the way forward is to lift the requirement on passing traffic,
>> > as long as the limitation is clearly expressed to the users.
>>
>> No, I am not arguing for that because then SwitchDev will fall into
>> disarray. If we want to have a strict definition for what is SwitchDev
>> and what isn't I am okay with that. It gives us a definition of what
>> our hardware needs to do in order to support it and without that we
>> are going to get hardware that just bends the rules to claim support
>> for it.
>
> Let me make sure we understand each other.  The switchdev SR-IOV mode is
> what happens when user requests DEVLINK_ESWITCH_MODE_SWITCHDEV.  Are you
> saying you are opposed to adding DEVLINK_ESWITCH_MODE_VEPA?

I wouldn't say I am opposed to that idea. We just need to clearly
define what MODE_VEPA is. I would say that even in MODE_VEPA we would
be passing traffic. The limitation though is that we wouldn't have the
same mechanisms in place to route the traffic.

The big issue with VEPA is that the traffic is routed to an external
entity before it makes a hairpin turn and comes back. As such we don't
have the actual origin of the packet to work with other than MAC and
VLAN. As far as directing a packet to a specific port the only way we
really have of doing that is to direct it to the MAC/VLAN pair for the
VF. This is one of the reasons why I am thinking source mode macvlan
is the solution to go with for something like this. Basically the
source mode macvlan can get pretty close to identifying the origin of
any packet that came from the VF assuming it is programmed with all
the MAC entries belonging to the VF. The only case where this doesn't
work is the "trusted" legacy mode VF that is running in promiscuous
with anti-spoof disabled.

>> All I am asking for is for us to not close the door to the possibility
>> of adding features to legacy SR-IOV. I am hoping to use a source
>> macvlan based approach to make it so that we can support "port
>> representors" for devices that can't support full SwitchDev. The idea
>> would be to use them to get as close to SwitchDev level support on
>> legacy devices as possible without using full SwitchDev. That should
>> solve a good part of the issue, but I am pretty certain I need to be
>> able to extend legacy SR-IOV in order to support it. I had talked with
>> Jiri at netdev 2.1 about it back when we had submitted the v7 patches,
>> and the decision was to look at doing "port representors" but don't
>> associate them with SwitchDev. I was out on Sabbatical for most of the
>> summer and I am just now starting on the macvlan work I had planned. I
>> hope to have it done before the next netdev and then we can discuss it
>> there if it needs more discussion than what we can have on the mailing
>> list.
>
> I don't know what you mean with the macvlan based approach.  Could you
> perhaps describe it in more detail?  Will it allow users to configure
> forwarding and queueing with existing, standard tools and APIs?

So there are a few issues with our devices doing SwitchDev mode that I
am trying to address.

One of the issues is that we have no direct way to figure out where
the packets are coming from as I described above. So instead of us
implementing multiple approaches for the same thing my thought was to
look at using source mode macvlan which does filtering on the source
MAC address instead of the destination. It

[PATCH][v4] uprobes/x86: emulate push insns for uprobe on x86

2017-11-15 Thread Yonghong Song

Uprobe is a tracing mechanism for userspace programs.
Typical uprobe will incur overhead of two traps.
First trap is caused by replaced trap insn, and
the second trap is to execute the original displaced
insn in user space.

To reduce the overhead, kernel provides hooks
for architectures to emulate the original insn
and skip the second trap. In x86, emulation
is done for certain branch insns.

This patch extends the emulation to "push "
insns. These insns are typical in the beginning
of the function. For example, bcc
in https://github.com/iovisor/bcc repo provides
tools to measure funclantency, detect memleak, etc.
The tools will place uprobes in the beginning of
function and possibly uretprobes at the end of function.
This patch is able to reduce the trap overhead for
uprobe from 2 to 1.

Without this patch, uretprobe will typically incur
three traps. With this patch, if the function starts
with "push" insn, the number of traps can be
reduced from 3 to 2.

An experiment was conducted on two local VMs,
fedora 26 64-bit VM and 32-bit VM, both 4 processors
and 4GB memory, booted with latest tip repo (and this patch).
The host is MacBook with intel i7 processor.

The test program looks like
  #include 
  #include 
  #include 
  #include 

  static void test() __attribute__((noinline));
  void test() {}
  int main() {
struct timeval start, end;

gettimeofday(, NULL);
for (int i = 0; i < 100; i++) {
  test();
}
gettimeofday(, NULL);

printf("%ld\n", ((end.tv_sec * 100 + end.tv_usec)
 - (start.tv_sec * 100 + start.tv_usec)));
return 0;
  }

The program is compiled without optimization, and
the first insn for function "test" is "push %rbp".
The host is relatively idle.

Before the test run, the uprobe is inserted as below for uprobe:
  echo 'p :' > /sys/kernel/debug/tracing/uprobe_events
  echo 1 > /sys/kernel/debug/tracing/events/uprobes/enable
and for uretprobe:
  echo 'r :' > /sys/kernel/debug/tracing/uprobe_events
  echo 1 > /sys/kernel/debug/tracing/events/uprobes/enable

Unit: microsecond(usec) per loop iteration

x86_64  W/ this patch   W/O this patch
uprobe  1.553.1
uretprobe   2.0 3.6

x86_32  W/ this patch   W/O this patch
uprobe  1.413.5
uretprobe   1.754.0

You can see that this patch significantly reduced the overhead,
50% for uprobe and 44% for uretprobe on x86_64, and even more
on x86_32.

Signed-off-by: Yonghong Song 
---
 arch/x86/include/asm/uprobes.h |   4 ++
 arch/x86/kernel/uprobes.c  | 107 +++--
 2 files changed, 107 insertions(+), 4 deletions(-)

Changelogs:
v3 -> v4:
  . Revert most of v3 change as 32bit emulation is not really working
on x86_64 platform as among other issues, function emulate_push_stack()
needs to account for 32bit app on 64bit platform.
A separate effort is ongoing to address this issue.
v2 -> v3:
  . Do not emulate 32bit application on x86_64 platforms
v1 -> v2:
  . Address Oleg's comments

diff --git a/arch/x86/include/asm/uprobes.h b/arch/x86/include/asm/uprobes.h
index 74f4c2f..d8bfa98 100644
--- a/arch/x86/include/asm/uprobes.h
+++ b/arch/x86/include/asm/uprobes.h
@@ -53,6 +53,10 @@ struct arch_uprobe {
u8  fixups;
u8  ilen;
}   defparam;
+   struct {
+   u8  reg_offset; /* to the start of pt_regs */
+   u8  ilen;
+   }   push;
};
 };
 
diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
index a3755d2..85c7ef2 100644
--- a/arch/x86/kernel/uprobes.c
+++ b/arch/x86/kernel/uprobes.c
@@ -528,11 +528,11 @@ static int default_pre_xol_op(struct arch_uprobe 
*auprobe, struct pt_regs *regs)
return 0;
 }
 
-static int push_ret_address(struct pt_regs *regs, unsigned long ip)
+static int emulate_push_stack(struct pt_regs *regs, unsigned long val)
 {
unsigned long new_sp = regs->sp - sizeof_long();
 
-   if (copy_to_user((void __user *)new_sp, , sizeof_long()))
+   if (copy_to_user((void __user *)new_sp, , sizeof_long()))
return -EFAULT;
 
regs->sp = new_sp;
@@ -566,7 +566,7 @@ static int default_post_xol_op(struct arch_uprobe *auprobe, 
struct pt_regs *regs
regs->ip += correction;
} else if (auprobe->defparam.fixups & UPROBE_FIX_CALL) {
regs->sp += sizeof_long(); /* Pop incorrect return address */
-   if (push_ret_address(regs, utask->vaddr + 
auprobe->defparam.ilen))
+   if (emulate_push_stack(regs, utask->vaddr + 
auprobe->defparam.ilen))
return -ERESTART;
}
/* popf; tell the caller to not touch TF */
@@ -655,7 +655,7 @@ static bool branch_emulate_op(struct arch_uprobe *auprobe, 
struct pt_regs *regs)

Re: [PATCH iproute2/net-next v2]tc: B.W limits can now be specified in %.

2017-11-15 Thread Stephen Hemminger

On Wed, 15 Nov 2017 07:06:21 +0530
Nishanth Devarajan  wrote:

> int parse_percent_rate(char *rate, const char *str, char *dev)
You aren't modifyin dev so it should be const char *
> +{
> + long max_rate_bits;
> + int ret, saved_errno;
> + double perc, rate_bits;
> + char *str_perc, *p;
> +
> + if (!dev[0]) {
> + fprintf(stderr, "No device specified; specify device to rate 
> limit by percentage\n");
> + return -1;
> + }
> +
> + if (read_prop(dev, "speed", _rate_bits))
> + return -1;

If speed is unknown, then many will devices will return -1.
You need to handle that.

Since speed reported by kernel is in mbits per second, it
would make sense to rename max_rate_bits to mbit.

> +
> + ret = sscanf(str, "%m[0-9.%]", _perc);
> + if (ret != 1)
> + goto malf;
> +
> + /* Make sure there's only one percent sign and it's at the end */
> + perc = strtod(str_perc, );
> + if (*p != '%' || *(p++) != '\0')
> + goto malf;

There already is parse_percent in tc/q_netem.c.
Please move that to util and use that instead of coding here.

> +
> + saved_errno = errno;
> + free(str_perc);
> +
> + if (perc > 100.0 || perc < 0.0 || saved_errno == ERANGE) {
> + fprintf(stderr, "Invalid rate specified; should be between 
> [0,100]%% but is %s\n", str);
> + return -1;
> + }
> +
> + rate_bits = (perc * max_rate_bits) / 100.0;
> +
> + ret = snprintf(rate, 20, "%lf", rate_bits);
> + if (ret <= 0 || ret >= 20) {
> + fprintf(stderr, "Unable to parse calculated rate\n");
> + return -1;
> + }
> +
> + return 0;
> +
> +malf:
> + fprintf(stderr, "Specified rate value could not be read or is 
> malformed\n");
> + return -1;
> +}

Re: [RFC PATCH 06/17] net: sched: explicit locking in gso_cpu fallback

2017-11-15 Thread Willem de Bruijn

On Wed, Nov 15, 2017 at 10:11 AM, John Fastabend
 wrote:
> On 11/14/2017 04:41 PM, Willem de Bruijn wrote:
>>>  /* use instead of qdisc->dequeue() for all qdiscs queried with ->peek() */
>>>  static inline struct sk_buff *qdisc_dequeue_peeked(struct Qdisc *sch)
>>>  {
>>> -   struct sk_buff *skb = sch->gso_skb;
>>> +   struct sk_buff *skb = skb_peek(>gso_skb);
>>>
>>> if (skb) {
>>> -   sch->gso_skb = NULL;
>>> +   skb = __skb_dequeue(>gso_skb);
>>> qdisc_qstats_backlog_dec(sch, skb);
>>> sch->q.qlen--;
>>
>> In lockless qdiscs, can this race, so that __skb_dequeue returns NULL?
>> Same for its use in qdisc_peek_dequeued.
>>
>
> Yes, agree if this was used in lockless qdisc it could race. However,
> I don't think it is actually used in the lockless cases yet. For pfifo
> fast __skb_array_peek is used.

Oh right. That will be easy to miss when other qdiscs are converted
to lockless. Perhaps another location to add lockdep annotations.

Related: what happens when pfifo_fast is used as a leaf in a non-lockless
qdisc hierarchy, say htb? The individual leaves will still have
TCQ_F_NOLOCK, so will try to take the qdisc lock in dequeue_skb
and other locations, but that is already held?

Thanks for revising and resubmitting the patchset, btw!

Re: [PATCH net-next v2 0/2] retire IPX and NCPFS

2017-11-15 Thread Stephen Hemminger

On Wed, 15 Nov 2017 09:58:33 +0900 (KST)
David Miller  wrote:

> From: Stephen Hemminger 
> Date: Tue, 14 Nov 2017 08:37:13 -0800
> 
> > These are both old decrepit protocols that need to be sent
> > to pasture.  
> 
> These need to go to gregkh and his staging/ tree, not net-next.

Ok, just wanted to get acceptance from netdev for moving

[PATCH net-next] netfilter: add ifdef around ctnetlink_proto_size

2017-11-15 Thread Pablo Neira Ayuso

From: Arnd Bergmann 

This function is no longer marked 'inline', so we now get a warning
when it is unused:

net/netfilter/nf_conntrack_netlink.c:536:15: error: 'ctnetlink_proto_size' 
defined but not used [-Werror=unused-function]

We could mark it inline again, mark it __maybe_unused, or add an #ifdef
around the definition. I'm picking the third approach here since that
seems to be what the rest of the file has.

Fixes: 5caaed151a68 ("netfilter: conntrack: don't cache nlattr_tuple_size 
result in nla_size")
Signed-off-by: Arnd Bergmann 
Signed-off-by: Pablo Neira Ayuso 
---
@David: please place this fix into your net-next tree, it is calming
down a compilation warning. Thanks!

 net/netfilter/nf_conntrack_netlink.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/netfilter/nf_conntrack_netlink.c 
b/net/netfilter/nf_conntrack_netlink.c
index 71a43ed19a0f..96277706b379 100644
--- a/net/netfilter/nf_conntrack_netlink.c
+++ b/net/netfilter/nf_conntrack_netlink.c
@@ -533,6 +533,7 @@ ctnetlink_fill_info(struct sk_buff *skb, u32 portid, u32 
seq, u32 type,
return -1;
 }
 
+#if defined(CONFIG_NETFILTER_NETLINK_GLUE_CT) || 
defined(CONFIG_NF_CONNTRACK_EVENTS)
 static size_t ctnetlink_proto_size(const struct nf_conn *ct)
 {
const struct nf_conntrack_l3proto *l3proto;
@@ -552,6 +553,7 @@ static size_t ctnetlink_proto_size(const struct nf_conn *ct)
 
return len + len4;
 }
+#endif
 
 static inline size_t ctnetlink_acct_size(const struct nf_conn *ct)
 {
-- 
2.11.0

[PATCH 6/6] bpf: add new test test_many_kprobe

2017-11-15 Thread Song Liu

The test compares old text based kprobe API with PERF_TYPE_PROBE.

Here is a sample output of this test:

Creating 1000 kprobes with text-based API takes 6.979683 seconds
Cleaning 1000 kprobes with text-based API takes 84.897687 seconds
Creating 1000 kprobes with PERF_TYPE_PROBE (function name) takes 5.077558 
seconds
Cleaning 1000 kprobes with PERF_TYPE_PROBE (function name) takes 81.241354 
seconds
Creating 1000 kprobes with PERF_TYPE_PROBE (function addr) takes 5.218255 
seconds
Cleaning 1000 kprobes with PERF_TYPE_PROBE (function addr) takes 80.010731 
seconds

Signed-off-by: Song Liu 
Reviewed-by: Josef Bacik 
---
 samples/bpf/Makefile|   3 +
 samples/bpf/bpf_load.c  |   5 +-
 samples/bpf/bpf_load.h  |   4 +
 samples/bpf/test_many_kprobe_user.c | 184 
 4 files changed, 193 insertions(+), 3 deletions(-)
 create mode 100644 samples/bpf/test_many_kprobe_user.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 9b4a66e..ec92f35 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -42,6 +42,7 @@ hostprogs-y += xdp_redirect
 hostprogs-y += xdp_redirect_map
 hostprogs-y += xdp_monitor
 hostprogs-y += syscall_tp
+hostprogs-y += test_many_kprobe
 
 # Libbpf dependencies
 LIBBPF := ../../tools/lib/bpf/bpf.o
@@ -87,6 +88,7 @@ xdp_redirect-objs := bpf_load.o $(LIBBPF) xdp_redirect_user.o
 xdp_redirect_map-objs := bpf_load.o $(LIBBPF) xdp_redirect_map_user.o
 xdp_monitor-objs := bpf_load.o $(LIBBPF) xdp_monitor_user.o
 syscall_tp-objs := bpf_load.o $(LIBBPF) syscall_tp_user.o
+test_many_kprobe-objs := bpf_load.o $(LIBBPF) test_many_kprobe_user.o
 
 # Tell kbuild to always build the programs
 always := $(hostprogs-y)
@@ -172,6 +174,7 @@ HOSTLOADLIBES_xdp_redirect += -lelf
 HOSTLOADLIBES_xdp_redirect_map += -lelf
 HOSTLOADLIBES_xdp_monitor += -lelf
 HOSTLOADLIBES_syscall_tp += -lelf
+HOSTLOADLIBES_test_many_kprobe += -lelf
 
 # Allows pointing LLC/CLANG to a LLVM backend with bpf support, redefine on 
cmdline:
 #  make samples/bpf/ LLC=~/git/llvm/build/bin/llc 
CLANG=~/git/llvm/build/bin/clang
diff --git a/samples/bpf/bpf_load.c b/samples/bpf/bpf_load.c
index dc6d843..ab514c1 100644
--- a/samples/bpf/bpf_load.c
+++ b/samples/bpf/bpf_load.c
@@ -637,9 +637,8 @@ void read_trace_pipe(void)
}
 }
 
-#define MAX_SYMS 30
-static struct ksym syms[MAX_SYMS];
-static int sym_cnt;
+struct ksym syms[MAX_SYMS];
+int sym_cnt;
 
 static int ksym_cmp(const void *p1, const void *p2)
 {
diff --git a/samples/bpf/bpf_load.h b/samples/bpf/bpf_load.h
index e7a8a21..16bc263 100644
--- a/samples/bpf/bpf_load.h
+++ b/samples/bpf/bpf_load.h
@@ -67,6 +67,10 @@ static inline __u64 ptr_to_u64(const void *ptr)
return (__u64) (unsigned long) ptr;
 }
 
+#define MAX_SYMS 30
+extern struct ksym syms[MAX_SYMS];
+extern int sym_cnt;
+
 int load_kallsyms(void);
 struct ksym *ksym_search(long key);
 int set_link_xdp_fd(int ifindex, int fd, __u32 flags);
diff --git a/samples/bpf/test_many_kprobe_user.c 
b/samples/bpf/test_many_kprobe_user.c
new file mode 100644
index 000..70b680e
--- /dev/null
+++ b/samples/bpf/test_many_kprobe_user.c
@@ -0,0 +1,184 @@
+/* Copyright (c) 2017 Facebook
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+#define _GNU_SOURCE
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include "libbpf.h"
+#include "bpf_load.h"
+#include "perf-sys.h"
+
+#define MAX_KPROBES 1000
+
+#define DEBUGFS "/sys/kernel/debug/tracing/"
+
+int kprobes[MAX_KPROBES] = {0};
+int kprobe_count;
+int perf_event_fds[MAX_KPROBES];
+const char license[] = "GPL";
+
+static __u64 time_get_ns(void)
+{
+   struct timespec ts;
+
+   clock_gettime(CLOCK_MONOTONIC, );
+   return ts.tv_sec * 10ull + ts.tv_nsec;
+}
+
+static int kprobe_api(char *func, void *addr, bool use_new_api)
+{
+   int efd;
+   struct perf_event_attr attr = {};
+   struct probe_desc pd;
+   char buf[256];
+   int err, id;
+
+   attr.sample_type = PERF_SAMPLE_RAW;
+   attr.sample_period = 1;
+   attr.wakeup_events = 1;
+
+   if (use_new_api) {
+   attr.type = PERF_TYPE_PROBE;
+   if (func) {
+   pd.func = ptr_to_u64(func);
+   pd.offset = 0;
+   } else {
+   pd.func = 0;
+   pd.offset = ptr_to_u64(addr);
+   }
+
+   attr.probe_desc = ptr_to_u64();
+   } else {
+   attr.type = PERF_TYPE_TRACEPOINT;
+   snprintf(buf, sizeof(buf),
+"echo 'p:%s %s' >> 
/sys/kernel/debug/tracing/kprobe_events",
+func, func);
+   err =

[PATCH 1/6] perf: Add new type PERF_TYPE_PROBE

2017-11-15 Thread Song Liu

A new perf type PERF_TYPE_PROBE is added to allow creating [k,u]probe
with perf_event_open. These [k,u]probe are associated with the file
decriptor created by perf_event_open, thus are easy to clean when
the file descriptor is destroyed.

Struct probe_desc and two flags, is_uprobe and is_return, are added
to describe the probe being created with perf_event_open.

Note: We use type __u64 for pointer probe_desc instead of __aligned_u64.
The reason here is to avoid changing the size of struct perf_event_attr,
and breaking new-kernel-old-utility scenario. To avoid alignment problem
with the pointer, we will (in the following patches) copy probe_desc to
__aligned_u64 before using it as pointer.

Signed-off-by: Song Liu 
Reviewed-by: Yonghong Song 
Reviewed-by: Josef Bacik 
Acked-by: Alexei Starovoitov 
---
 include/uapi/linux/perf_event.h | 35 +--
 1 file changed, 33 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 362493a..cc42d59 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -33,6 +33,7 @@ enum perf_type_id {
PERF_TYPE_HW_CACHE  = 3,
PERF_TYPE_RAW   = 4,
PERF_TYPE_BREAKPOINT= 5,
+   PERF_TYPE_PROBE = 6,
 
PERF_TYPE_MAX,  /* non-ABI */
 };
@@ -299,6 +300,29 @@ enum perf_event_read_format {
 #define PERF_ATTR_SIZE_VER4104 /* add: sample_regs_intr */
 #define PERF_ATTR_SIZE_VER5112 /* add: aux_watermark */
 
+#define MAX_PROBE_FUNC_NAME_LEN64
+/*
+ * Describe a kprobe or uprobe for PERF_TYPE_PROBE.
+ * perf_event_attr.probe_desc will point to this structure. is_uprobe
+ * and is_return are used to differentiate different types of probe
+ * (k/u, probe/retprobe).
+ *
+ * The two unions should be used as follows:
+ * For uprobe: use path and offset;
+ * For kprobe: if func is empty, use addr
+ * if func is not emtpy, use func and offset
+ */
+struct probe_desc {
+   union {
+   __aligned_u64   func;
+   __aligned_u64   path;
+   };
+   union {
+   __aligned_u64   addr;
+   __u64   offset;
+   };
+};
+
 /*
  * Hardware event_id to monitor via a performance monitoring event:
  *
@@ -320,7 +344,10 @@ struct perf_event_attr {
/*
 * Type specific configuration information.
 */
-   __u64   config;
+   union {
+   __u64   config;
+   __u64   probe_desc; /* ptr to struct probe_desc */
+   };
 
union {
__u64   sample_period;
@@ -370,7 +397,11 @@ struct perf_event_attr {
context_switch :  1, /* context switch data */
write_backward :  1, /* Write ring buffer from 
end to beginning */
namespaces :  1, /* include namespaces data 
*/
-   __reserved_1   : 35;
+
+   /* For PERF_TYPE_PROBE */
+   is_uprobe  :  1, /* 0: kprobe, 1: uprobe */
+   is_return  :  1, /* 0: probe, 1: retprobe */
+   __reserved_1   : 33;
 
union {
__u32   wakeup_events;/* wakeup every n events */
-- 
2.9.5

Re: [PATCH][v3] uprobes/x86: emulate push insns for uprobe on x86

2017-11-15 Thread Yonghong Song




On 11/15/17 9:07 AM, Oleg Nesterov wrote:

On 11/15, Oleg Nesterov wrote:


So please, check if uprobe_init_insn() fails or not in this case. After that
we will know whether your patch needs the additional is_64bit_mm() check in
push_setup_xol_ops() or not.


OK, I did the check for you.

uprobe_init_insn() doesn't fail but insn_init(x86_64 => 0) parse it as
single-byte insn with OPCODE1 == 0x41, so push_setup_xol_ops() doesn't
need to worry about compat tasks.

In short, your "V2" should be fine except you can factor out
auprobe->push.ilen initialization (as you did in V3). Please send V4.


Thanks a lot! I am just about to use inline asm or binary rewriter to 
create such a code for testing...


I will send V4 shortly.



Oleg.

[PATCH 3/6] perf: implement kprobe support to PERF_TYPE_PROBE

2017-11-15 Thread Song Liu

A new pmu, perf_probe, is created for PERF_TYPE_PROBE. Based on
input from perf_event_open(), perf_probe creates a kprobe (or
kretprobe) for the perf_event. This kprobe is private to this
perf_event, and thus not added to global lists, and not
available in tracefs.

Two functions, create_local_trace_kprobe() and
destroy_local_trace_kprobe()  are added to created and destroy these
local trace_kprobe.

Signed-off-by: Song Liu 
Reviewed-by: Yonghong Song 
Reviewed-by: Josef Bacik 
---
 include/linux/trace_events.h|  2 +
 kernel/events/core.c| 41 +--
 kernel/trace/trace_event_perf.c | 81 
 kernel/trace/trace_kprobe.c | 91 +
 kernel/trace/trace_probe.h  |  7 
 5 files changed, 211 insertions(+), 11 deletions(-)

diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
index 2bcb4dc..743e68d 100644
--- a/include/linux/trace_events.h
+++ b/include/linux/trace_events.h
@@ -494,6 +494,8 @@ extern int  perf_trace_init(struct perf_event *event);
 extern void perf_trace_destroy(struct perf_event *event);
 extern int  perf_trace_add(struct perf_event *event, int flags);
 extern void perf_trace_del(struct perf_event *event, int flags);
+extern int  perf_probe_init(struct perf_event *event);
+extern void perf_probe_destroy(struct perf_event *event);
 extern int  ftrace_profile_set_filter(struct perf_event *event, int event_id,
 char *filter_str);
 extern void ftrace_profile_free_filter(struct perf_event *event);
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 81dd57b..95c6610 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7966,6 +7966,28 @@ static int perf_tp_event_init(struct perf_event *event)
return 0;
 }
 
+static int perf_probe_event_init(struct perf_event *event)
+{
+   int err;
+
+   if (event->attr.type != PERF_TYPE_PROBE)
+   return -ENOENT;
+
+   /*
+* no branch sampling for probe events
+*/
+   if (has_branch_stack(event))
+   return -EOPNOTSUPP;
+
+   err = perf_probe_init(event);
+   if (err)
+   return err;
+
+   event->destroy = perf_probe_destroy;
+
+   return 0;
+}
+
 static struct pmu perf_tracepoint = {
.task_ctx_nr= perf_sw_context,
 
@@ -7977,9 +7999,20 @@ static struct pmu perf_tracepoint = {
.read   = perf_swevent_read,
 };
 
+static struct pmu perf_probe = {
+   .task_ctx_nr= perf_sw_context,
+   .event_init = perf_probe_event_init,
+   .add= perf_trace_add,
+   .del= perf_trace_del,
+   .start  = perf_swevent_start,
+   .stop   = perf_swevent_stop,
+   .read   = perf_swevent_read,
+};
+
 static inline void perf_tp_register(void)
 {
perf_pmu_register(_tracepoint, "tracepoint", PERF_TYPE_TRACEPOINT);
+   perf_pmu_register(_probe, "probe", PERF_TYPE_PROBE);
 }
 
 static void perf_event_free_filter(struct perf_event *event)
@@ -8061,7 +8094,8 @@ static int perf_event_set_bpf_prog(struct perf_event 
*event, u32 prog_fd)
bool is_kprobe, is_tracepoint, is_syscall_tp;
struct bpf_prog *prog;
 
-   if (event->attr.type != PERF_TYPE_TRACEPOINT)
+   if (event->attr.type != PERF_TYPE_TRACEPOINT &&
+   event->attr.type != PERF_TYPE_PROBE)
return perf_event_set_bpf_handler(event, prog_fd);
 
if (event->tp_event->prog)
@@ -8533,8 +8567,9 @@ static int perf_event_set_filter(struct perf_event 
*event, void __user *arg)
char *filter_str;
int ret = -EINVAL;
 
-   if ((event->attr.type != PERF_TYPE_TRACEPOINT ||
-   !IS_ENABLED(CONFIG_EVENT_TRACING)) &&
+   if (((event->attr.type != PERF_TYPE_TRACEPOINT &&
+ event->attr.type != PERF_TYPE_PROBE) ||
+!IS_ENABLED(CONFIG_EVENT_TRACING)) &&
!has_addr_filter(event))
return -EINVAL;
 
diff --git a/kernel/trace/trace_event_perf.c b/kernel/trace/trace_event_perf.c
index 13ba2d3..bf9b99b 100644
--- a/kernel/trace/trace_event_perf.c
+++ b/kernel/trace/trace_event_perf.c
@@ -8,6 +8,7 @@
 #include 
 #include 
 #include "trace.h"
+#include "trace_probe.h"
 
 static char __percpu *perf_trace_buf[PERF_NR_CONTEXTS];
 
@@ -229,6 +230,74 @@ int perf_trace_init(struct perf_event *p_event)
return ret;
 }
 
+#ifdef CONFIG_KPROBE_EVENTS
+static int perf_probe_create_kprobe(struct perf_event *p_event,
+   struct probe_desc *pd, char *name)
+{
+   struct trace_event_call *tp_event;
+   int ret;
+
+   tp_event = create_local_trace_kprobe(
+   name, (void *)(unsigned long)(pd->addr), pd->offset,
+   p_event->attr.is_return);
+   if (IS_ERR(tp_event))
+   return PTR_ERR(tp_event);
+   ret =

[PATCH 4/6] perf: implement uprobe support to PERF_TYPE_PROBE

2017-11-15 Thread Song Liu

This patch adds uprobe support to perf_probe with similar pattern
as previous patch (for kprobe).

Two functions, create_local_trace_uprobe() and
destroy_local_trace_uprobe(), are created so a uprobe can be created
and attached to the file descriptor created by perf_event_open().

Signed-off-by: Song Liu 
Reviewed-by: Yonghong Song 
Reviewed-by: Josef Bacik 
---
 kernel/trace/trace_event_perf.c | 48 +-
 kernel/trace/trace_probe.h  |  4 ++
 kernel/trace/trace_uprobe.c | 90 -
 3 files changed, 131 insertions(+), 11 deletions(-)

diff --git a/kernel/trace/trace_event_perf.c b/kernel/trace/trace_event_perf.c
index bf9b99b..4e4de84 100644
--- a/kernel/trace/trace_event_perf.c
+++ b/kernel/trace/trace_event_perf.c
@@ -256,6 +256,39 @@ static int perf_probe_create_kprobe(struct perf_event 
*p_event,
 }
 #endif /* CONFIG_KPROBE_EVENTS */
 
+#ifdef CONFIG_UPROBE_EVENTS
+static int perf_probe_create_uprobe(struct perf_event *p_event,
+   struct probe_desc *pd, char *name)
+{
+   struct trace_event_call *tp_event;
+   int ret;
+
+   if (!name)
+   return -EINVAL;
+   tp_event = create_local_trace_uprobe(
+   name, pd->offset, p_event->attr.is_return);
+   if (IS_ERR(tp_event))
+   return PTR_ERR(tp_event);
+   /*
+* local trace_uprobe need to hold event_mutex to call
+* uprobe_buffer_enable() and uprobe_buffer_disable().
+* event_mutex is not required for local trace_kprobes.
+*/
+   mutex_lock(_mutex);
+   ret = perf_trace_event_init(tp_event, p_event);
+   if (ret)
+   destroy_local_trace_uprobe(tp_event);
+   mutex_unlock(_mutex);
+   return ret;
+}
+#else
+static int perf_probe_create_uprobe(struct perf_event *p_event,
+   struct probe_desc *pd, char *name)
+{
+   return -EOPNOTSUPP;
+}
+#endif /* CONFIG_KPROBE_EVENTS */
+
 int perf_probe_init(struct perf_event *p_event)
 {
struct probe_desc pd;
@@ -292,7 +325,7 @@ int perf_probe_init(struct perf_event *p_event)
if (!p_event->attr.is_uprobe)
ret = perf_probe_create_kprobe(p_event, , name);
else
-   ret = -EOPNOTSUPP;
+   ret = perf_probe_create_uprobe(p_event, , name);
 out:
kfree(name);
return ret;
@@ -308,13 +341,26 @@ void perf_trace_destroy(struct perf_event *p_event)
 
 void perf_probe_destroy(struct perf_event *p_event)
 {
+   /*
+* local trace_uprobe need to hold event_mutex to call
+* uprobe_buffer_enable() and uprobe_buffer_disable().
+* event_mutex is not required for local trace_kprobes.
+*/
+   if (p_event->attr.is_uprobe)
+   mutex_lock(_mutex);
perf_trace_event_close(p_event);
perf_trace_event_unreg(p_event);
+   if (p_event->attr.is_uprobe)
+   mutex_unlock(_mutex);
 
if (!p_event->attr.is_uprobe) {
 #ifdef CONFIG_KPROBE_EVENTS
destroy_local_trace_kprobe(p_event->tp_event);
 #endif
+   } else {
+#ifdef CONFIG_UPROBE_EVENTS
+   destroy_local_trace_uprobe(p_event->tp_event);
+#endif
}
 }
 
diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h
index 910ae1b..86b5925 100644
--- a/kernel/trace/trace_probe.h
+++ b/kernel/trace/trace_probe.h
@@ -417,4 +417,8 @@ extern struct trace_event_call *
 create_local_trace_kprobe(char *func, void *addr, unsigned long offs,
  bool is_return);
 extern void destroy_local_trace_kprobe(struct trace_event_call *event_call);
+
+extern struct trace_event_call *
+create_local_trace_uprobe(char *name, unsigned long offs, bool is_return);
+extern void destroy_local_trace_uprobe(struct trace_event_call *event_call);
 #endif
diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
index 4525e02..4d805d2 100644
--- a/kernel/trace/trace_uprobe.c
+++ b/kernel/trace/trace_uprobe.c
@@ -31,8 +31,8 @@
 #define UPROBE_EVENT_SYSTEM"uprobes"
 
 struct uprobe_trace_entry_head {
-   struct trace_entry  ent;
-   unsigned long   vaddr[];
+   struct trace_entry  ent;
+   unsigned long   vaddr[];
 };
 
 #define SIZEOF_TRACE_ENTRY(is_return)  \
@@ -1293,16 +1293,25 @@ static struct trace_event_functions uprobe_funcs = {
.trace  = print_uprobe_event
 };
 
-static int register_uprobe_event(struct trace_uprobe *tu)
+static inline void init_trace_event_call(struct trace_uprobe *tu,
+struct trace_event_call *call)
 {
-   struct trace_event_call *call = >tp.call;
-   int ret;
-
-   /* Initialize trace_event_call */
INIT_LIST_HEAD(>class->fields);
call->event.funcs = _funcs;
call->class->define_fields = uprobe_event_define_fields;
 
+

[PATCH] bcc: Try use new API to create [k,u]probe with perf_event_open

2017-11-15 Thread Song Liu

New kernel API allows creating [k,u]probe with perf_event_open.
This patch tries to use the new API. If the new API doesn't work,
we fall back to old API.

bpf_detach_probe() looks up the event being removed. If the event
is not found, we skip the clean up procedure.

Signed-off-by: Song Liu 
---
 src/cc/libbpf.c | 224 +++-
 1 file changed, 155 insertions(+), 69 deletions(-)

diff --git a/src/cc/libbpf.c b/src/cc/libbpf.c
index 77413df..d7be0a9 100644
--- a/src/cc/libbpf.c
+++ b/src/cc/libbpf.c
@@ -520,38 +520,66 @@ int bpf_attach_socket(int sock, int prog) {
   return setsockopt(sock, SOL_SOCKET, SO_ATTACH_BPF, , sizeof(prog));
 }
 
+/*
+ * new kernel API allows creating [k,u]probe with perf_event_open, which
+ * makes it easier to clean up the [k,u]probe. This function tries to
+ * create pfd with the new API.
+ */
+static int bpf_try_perf_event_open_with_probe(struct probe_desc *pd, int pid,
+int cpu, int group_fd, int is_uprobe, int is_return)
+{
+  struct perf_event_attr attr = {};
+
+  attr.type = PERF_TYPE_PROBE;
+  attr.probe_desc = ptr_to_u64(pd);
+  attr.sample_type = PERF_SAMPLE_RAW | PERF_SAMPLE_CALLCHAIN;
+  attr.sample_period = 1;
+  attr.wakeup_events = 1;
+  attr.is_uprobe = is_uprobe;
+  attr.is_return = is_return;
+  return syscall(__NR_perf_event_open, , pid, cpu, group_fd,
+ PERF_FLAG_FD_CLOEXEC);
+}
+
 static int bpf_attach_tracing_event(int progfd, const char *event_path,
-struct perf_reader *reader, int pid, int cpu, int group_fd) {
-  int efd, pfd;
+struct perf_reader *reader, int pid, int cpu, int group_fd, int pfd) {
+  int efd;
   ssize_t bytes;
   char buf[256];
   struct perf_event_attr attr = {};
 
-  snprintf(buf, sizeof(buf), "%s/id", event_path);
-  efd = open(buf, O_RDONLY, 0);
-  if (efd < 0) {
-fprintf(stderr, "open(%s): %s\n", buf, strerror(errno));
-return -1;
-  }
+  /*
+   * Only look up id and call perf_event_open when
+   * bpf_try_perf_event_open_with_probe() didn't returns valid pfd.
+   */
+  if (pfd < 0) {
+snprintf(buf, sizeof(buf), "%s/id", event_path);
+efd = open(buf, O_RDONLY, 0);
+if (efd < 0) {
+  fprintf(stderr, "open(%s): %s\n", buf, strerror(errno));
+  return -1;
+}
 
-  bytes = read(efd, buf, sizeof(buf));
-  if (bytes <= 0 || bytes >= sizeof(buf)) {
-fprintf(stderr, "read(%s): %s\n", buf, strerror(errno));
+bytes = read(efd, buf, sizeof(buf));
+if (bytes <= 0 || bytes >= sizeof(buf)) {
+  fprintf(stderr, "read(%s): %s\n", buf, strerror(errno));
+  close(efd);
+  return -1;
+}
 close(efd);
-return -1;
-  }
-  close(efd);
-  buf[bytes] = '\0';
-  attr.config = strtol(buf, NULL, 0);
-  attr.type = PERF_TYPE_TRACEPOINT;
-  attr.sample_type = PERF_SAMPLE_RAW | PERF_SAMPLE_CALLCHAIN;
-  attr.sample_period = 1;
-  attr.wakeup_events = 1;
-  pfd = syscall(__NR_perf_event_open, , pid, cpu, group_fd, 
PERF_FLAG_FD_CLOEXEC);
-  if (pfd < 0) {
-fprintf(stderr, "perf_event_open(%s/id): %s\n", event_path, 
strerror(errno));
-return -1;
+buf[bytes] = '\0';
+attr.config = strtol(buf, NULL, 0);
+attr.type = PERF_TYPE_TRACEPOINT;
+attr.sample_type = PERF_SAMPLE_RAW | PERF_SAMPLE_CALLCHAIN;
+attr.sample_period = 1;
+attr.wakeup_events = 1;
+pfd = syscall(__NR_perf_event_open, , pid, cpu, group_fd, 
PERF_FLAG_FD_CLOEXEC);
+if (pfd < 0) {
+  fprintf(stderr, "perf_event_open(%s/id): %s\n", event_path, 
strerror(errno));
+  return -1;
+}
   }
+
   perf_reader_set_fd(reader, pfd);
 
   if (perf_reader_mmap(reader, attr.type, attr.sample_type) < 0)
@@ -579,31 +607,41 @@ void * bpf_attach_kprobe(int progfd, enum 
bpf_probe_attach_type attach_type, con
   char event_alias[128];
   struct perf_reader *reader = NULL;
   static char *event_type = "kprobe";
+  struct probe_desc pd;
+  int pfd;
 
   reader = perf_reader_new(cb, NULL, NULL, cb_cookie, 
probe_perf_reader_page_cnt);
   if (!reader)
 goto error;
 
-  snprintf(buf, sizeof(buf), "/sys/kernel/debug/tracing/%s_events", 
event_type);
-  kfd = open(buf, O_WRONLY | O_APPEND, 0);
-  if (kfd < 0) {
-fprintf(stderr, "open(%s): %s\n", buf, strerror(errno));
-goto error;
-  }
+  /* try use new API to create kprobe */
+  pd.func = ptr_to_u64((void *)fn_name);
+  pd.offset = 0;
+  pfd = bpf_try_perf_event_open_with_probe(, pid, cpu, group_fd, 0,
+   attach_type != BPF_PROBE_ENTRY);
 
-  snprintf(event_alias, sizeof(event_alias), "%s_bcc_%d", ev_name, getpid());
-  snprintf(buf, sizeof(buf), "%c:%ss/%s %s", attach_type==BPF_PROBE_ENTRY ? 
'p' : 'r',
-   event_type, event_alias, fn_name);
-  if (write(kfd, buf, strlen(buf)) < 0) {
-if (errno == EINVAL)
-  fprintf(stderr, "check dmesg output for possible cause\n");
+  if (pfd < 0) {
+snprintf(buf, sizeof(buf), "/sys/kernel/debug/tracing/%s_events", 
event_type);
+kfd =

[PATCH 5/6] bpf: add option for bpf_load.c to use PERF_TYPE_PROBE

2017-11-15 Thread Song Liu

Function load_and_attach() is updated to be able to create kprobes
with either old text based API, or the new PERF_TYPE_PROBE API.

A global flag use_perf_type_probe is added to select between the
two APIs.

Signed-off-by: Song Liu 
Reviewed-by: Josef Bacik 
---
 samples/bpf/bpf_load.c | 56 --
 samples/bpf/bpf_load.h |  8 
 2 files changed, 44 insertions(+), 20 deletions(-)

diff --git a/samples/bpf/bpf_load.c b/samples/bpf/bpf_load.c
index 2325d7a..dc6d843 100644
--- a/samples/bpf/bpf_load.c
+++ b/samples/bpf/bpf_load.c
@@ -8,7 +8,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
@@ -42,6 +41,7 @@ int prog_array_fd = -1;
 
 struct bpf_map_data map_data[MAX_MAPS];
 int map_data_count = 0;
+bool use_perf_type_probe = true;
 
 static int populate_prog_array(const char *event, int prog_fd)
 {
@@ -70,8 +70,9 @@ static int load_and_attach(const char *event, struct bpf_insn 
*prog, int size)
size_t insns_cnt = size / sizeof(struct bpf_insn);
enum bpf_prog_type prog_type;
char buf[256];
-   int fd, efd, err, id;
+   int fd, efd, err, id = -1;
struct perf_event_attr attr = {};
+   struct probe_desc pd;
 
attr.type = PERF_TYPE_TRACEPOINT;
attr.sample_type = PERF_SAMPLE_RAW;
@@ -128,7 +129,7 @@ static int load_and_attach(const char *event, struct 
bpf_insn *prog, int size)
return populate_prog_array(event, fd);
}
 
-   if (is_kprobe || is_kretprobe) {
+   if (!use_perf_type_probe && (is_kprobe || is_kretprobe)) {
if (is_kprobe)
event += 7;
else
@@ -169,27 +170,42 @@ static int load_and_attach(const char *event, struct 
bpf_insn *prog, int size)
strcat(buf, "/id");
}
 
-   efd = open(buf, O_RDONLY, 0);
-   if (efd < 0) {
-   printf("failed to open event %s\n", event);
-   return -1;
-   }
-
-   err = read(efd, buf, sizeof(buf));
-   if (err < 0 || err >= sizeof(buf)) {
-   printf("read from '%s' failed '%s'\n", event, strerror(errno));
-   return -1;
+   if (use_perf_type_probe && (is_kprobe || is_kretprobe)) {
+   attr.type = PERF_TYPE_PROBE;
+   pd.func = ptr_to_u64(event + strlen(is_kprobe ? "kprobe/"
+   : "kretprobe/"));
+   pd.offset = 0;
+   attr.is_return  = !!is_kretprobe;
+   attr.probe_desc = ptr_to_u64();
+   } else {
+   efd = open(buf, O_RDONLY, 0);
+   if (efd < 0) {
+   printf("failed to open event %s\n", event);
+   return -1;
+   }
+   err = read(efd, buf, sizeof(buf));
+   if (err < 0 || err >= sizeof(buf)) {
+   printf("read from '%s' failed '%s'\n", event,
+  strerror(errno));
+   return -1;
+   }
+   close(efd);
+   buf[err] = 0;
+   id = atoi(buf);
+   attr.config = id;
}
 
-   close(efd);
-
-   buf[err] = 0;
-   id = atoi(buf);
-   attr.config = id;
-
efd = sys_perf_event_open(, -1/*pid*/, 0/*cpu*/, -1/*group_fd*/, 
0);
if (efd < 0) {
-   printf("event %d fd %d err %s\n", id, efd, strerror(errno));
+   if (use_perf_type_probe && (is_kprobe || is_kretprobe))
+   printf("k%sprobe %s fd %d err %s\n",
+  is_kprobe ? "" : "ret",
+  event + strlen(is_kprobe ? "kprobe/"
+ : "kretprobe/"),
+  efd, strerror(errno));
+   else
+   printf("event %d fd %d err %s\n", id, efd,
+  strerror(errno));
return -1;
}
event_fd[prog_cnt - 1] = efd;
diff --git a/samples/bpf/bpf_load.h b/samples/bpf/bpf_load.h
index 7d57a42..e7a8a21 100644
--- a/samples/bpf/bpf_load.h
+++ b/samples/bpf/bpf_load.h
@@ -2,6 +2,7 @@
 #ifndef __BPF_LOAD_H
 #define __BPF_LOAD_H
 
+#include 
 #include "libbpf.h"
 
 #define MAX_MAPS 32
@@ -38,6 +39,8 @@ extern int map_fd[MAX_MAPS];
 extern struct bpf_map_data map_data[MAX_MAPS];
 extern int map_data_count;
 
+extern bool use_perf_type_probe;
+
 /* parses elf file compiled by llvm .c->.o
  * . parses 'maps' section and creates maps via BPF syscall
  * . parses 'license' section and passes it to syscall
@@ -59,6 +62,11 @@ struct ksym {
char *name;
 };
 
+static inline __u64 ptr_to_u64(const void *ptr)
+{
+   return (__u64) (unsigned long) ptr;
+}
+
 int load_kallsyms(void);
 struct ksym *ksym_search(long key);
 int set_link_xdp_fd(int ifindex, int fd, __u32 flags);
-- 
2.9.5

[PATCH] perf_event_open.2: add new type PERF_TYPE_PROBE

2017-11-15 Thread Song Liu

A new type PERF_TYPE_PROBE is being added to perf_event_attr. This
patch adds information about this type.

Note: the following two flags are also added to the man page. They
are from perf_event.h in latest kernel repo. However, they are not
related to PERF_TYPE_PROBE. Therefore, their usage are not included
in this patch.

  write_backward :  1
  namespaces :  1

Signed-off-by: Song Liu 
---
 man2/perf_event_open.2 | 82 --
 1 file changed, 80 insertions(+), 2 deletions(-)

diff --git a/man2/perf_event_open.2 b/man2/perf_event_open.2
index c91da3f..abeec0a 100644
--- a/man2/perf_event_open.2
+++ b/man2/perf_event_open.2
@@ -205,7 +205,12 @@ for the event being created.
 struct perf_event_attr {
 __u32 type; /* Type of event */
 __u32 size; /* Size of attribute structure */
-__u64 config;   /* Type-specific configuration */
+
+/* Type-specific configuration */
+union {
+__u64 config;
+__u64 probe_desc; /* ptr to struct probe_desc */
+};
 
 union {
 __u64 sample_period;/* Period of sampling */
@@ -244,8 +249,13 @@ struct perf_event_attr {
due to exec */
   use_clockid:  1,  /* use clockid for time fields */
   context_switch :  1,  /* context switch data */
+  write_backward :  1,  /* Write ring buffer from end to beginning */
+  namespaces :  1,  /* include namespaces data */
 
-  __reserved_1   : 37;
+  /* For PERF_TYPE_PROBE */
+  is_uprobe  :  1,  /* 0 for kprobe, 1 for uprobe */
+  is_return  :  1,  /* 0 for [k,u]probe, 1 for [k,u]retprobe */
+  __reserved_1   : 33;
 
 union {
 __u32 wakeup_events;/* wakeup every n events */
@@ -336,6 +346,13 @@ field.
 For instance,
 .I /sys/bus/event_source/devices/cpu/type
 contains the value for the core CPU PMU, which is usually 4.
+.TP
+.BR PERF_TYPE_PROBE " (since Linux 4.TBD)"
+This indicates a kprobe or uprobe should be created and
+attached to the file descriptor.
+See fields
+.IR probe_desc ", " is_uprobe ", and " is_return
+for more details.
 .RE
 .TP
 .I "size"
@@ -627,6 +644,67 @@ then leave
 .I config
 set to zero.
 Its parameters are set in other places.
+.PP
+If
+.I type
+is
+.BR PERF_TYPE_PROBE ,
+.I probe_desc
+is used instead of
+.IR config .
+.RE
+.TP
+.I probe_desc
+The
+.I probe_desc
+field is used with
+.I type
+of
+.BR PERF_TYPE_PROBE ,
+to save a pointer to struct probe_desc:
+.PP
+.in +8n
+.EX
+struct probe_desc {
+union {
+__aligned_u64 func;
+__aligned_u64 path;
+};
+union {
+__aligned_u64 addr;
+__u64 offset;
+};
+};
+.EE
+Different fields of struct probe_desc are used to describe kprobes
+and uprobes. For kprobes: use
+.I func
+and
+.IR offset ,
+or use
+.I addr
+and leave
+.I func
+as NULL. For uprobe: use
+.I path
+and
+.IR offset .
+.RE
+.TP
+.IR is_uprobe ", " is_return
+These two bits are used with
+.I type
+of
+.BR PERF_TYPE_PROBE ,
+to specify type of the probe:
+.PP
+.in +8n
+.EX
+is_uprobe == 0, is_return == 0: kprobe
+is_uprobe == 0, is_return == 1: kretprobe
+is_uprobe == 1, is_return == 0: uprobe
+is_uprobe == 1, is_return == 1: uretprobe
+.EE
 .RE
 .TP
 .IR sample_period ", " sample_freq
-- 
2.9.5

[PATCH 2/6] perf: copy new perf_event.h to tools/include/uapi

2017-11-15 Thread Song Liu

perf_event.h is updated in previous patch, this patch applies same
changes to the tools/ version. This is part is put in a separate
patch in case the two files are back ported separately.

Signed-off-by: Song Liu 
Reviewed-by: Yonghong Song 
Reviewed-by: Josef Bacik 
Acked-by: Alexei Starovoitov 
---
 tools/include/uapi/linux/perf_event.h | 35 +--
 1 file changed, 33 insertions(+), 2 deletions(-)

diff --git a/tools/include/uapi/linux/perf_event.h 
b/tools/include/uapi/linux/perf_event.h
index 362493a..cc42d59 100644
--- a/tools/include/uapi/linux/perf_event.h
+++ b/tools/include/uapi/linux/perf_event.h
@@ -33,6 +33,7 @@ enum perf_type_id {
PERF_TYPE_HW_CACHE  = 3,
PERF_TYPE_RAW   = 4,
PERF_TYPE_BREAKPOINT= 5,
+   PERF_TYPE_PROBE = 6,
 
PERF_TYPE_MAX,  /* non-ABI */
 };
@@ -299,6 +300,29 @@ enum perf_event_read_format {
 #define PERF_ATTR_SIZE_VER4104 /* add: sample_regs_intr */
 #define PERF_ATTR_SIZE_VER5112 /* add: aux_watermark */
 
+#define MAX_PROBE_FUNC_NAME_LEN64
+/*
+ * Describe a kprobe or uprobe for PERF_TYPE_PROBE.
+ * perf_event_attr.probe_desc will point to this structure. is_uprobe
+ * and is_return are used to differentiate different types of probe
+ * (k/u, probe/retprobe).
+ *
+ * The two unions should be used as follows:
+ * For uprobe: use path and offset;
+ * For kprobe: if func is empty, use addr
+ * if func is not emtpy, use func and offset
+ */
+struct probe_desc {
+   union {
+   __aligned_u64   func;
+   __aligned_u64   path;
+   };
+   union {
+   __aligned_u64   addr;
+   __u64   offset;
+   };
+};
+
 /*
  * Hardware event_id to monitor via a performance monitoring event:
  *
@@ -320,7 +344,10 @@ struct perf_event_attr {
/*
 * Type specific configuration information.
 */
-   __u64   config;
+   union {
+   __u64   config;
+   __u64   probe_desc; /* ptr to struct probe_desc */
+   };
 
union {
__u64   sample_period;
@@ -370,7 +397,11 @@ struct perf_event_attr {
context_switch :  1, /* context switch data */
write_backward :  1, /* Write ring buffer from 
end to beginning */
namespaces :  1, /* include namespaces data 
*/
-   __reserved_1   : 35;
+
+   /* For PERF_TYPE_PROBE */
+   is_uprobe  :  1, /* 0: kprobe, 1: uprobe */
+   is_return  :  1, /* 0: probe, 1: retprobe */
+   __reserved_1   : 33;
 
union {
__u32   wakeup_events;/* wakeup every n events */
-- 
2.9.5

[PATCH 0/6] enable creating [k,u]probe with perf_event_open

2017-11-15 Thread Song Liu

Changes RFC v2 to PATCH v1:
  Check type PERF_TYPE_PROBE in perf_event_set_filter().
  Rebase on to tip perf/core.

Changes RFC v1 to RFC v2:
  Fix build issue reported by kbuild test bot by adding ifdef of
  CONFIG_KPROBE_EVENTS, and CONFIG_UPROBE_EVENTS.

RFC v1 cover letter:

This is to follow up the discussion over "new kprobe api" at Linux
Plumbers 2017:

https://www.linuxplumbersconf.org/2017/ocw/proposals/4808

With current kernel, user space tools can only create/destroy [k,u]probes
with a text-based API (kprobe_events and uprobe_events in tracefs). This
approach relies on user space to clean up the [k,u]probe after using them.
However, this is not easy for user space to clean up properly.

To solve this problem, we introduce a file descriptor based API.
Specifically, we extended perf_event_open to create [k,u]probe, and attach
this [k,u]probe to the file descriptor created by perf_event_open. These
[k,u]probe are associated with this file descriptor, so they are not
available in tracefs.

We reuse large portion of existing trace_kprobe and trace_uprobe code.
Currently, the file descriptor API does not support arguments as the
text-based API does. This should not be a problem, as user of the file
decriptor based API read data through other methods (bpf, etc.).

I also include a patch to to bcc, and a patch to man-page perf_even_open.
Please see the list below. A fork of bcc with this patch is also available
on github:

  https://github.com/liu-song-6/bcc/tree/perf_event_opn

Thanks,
Song

man-pages patch:
  perf_event_open.2: add new type PERF_TYPE_PROBE

bcc patch:
  bcc: Try use new API to create [k,u]probe with perf_event_open

kernel patches:

Song Liu (6):
  perf: Add new type PERF_TYPE_PROBE
  perf: copy new perf_event.h to tools/include/uapi
  perf: implement kprobe support to PERF_TYPE_PROBE
  perf: implement uprobe support to PERF_TYPE_PROBE
  bpf: add option for bpf_load.c to use PERF_TYPE_PROBE
  bpf: add new test test_many_kprobe

 include/linux/trace_events.h  |   2 +
 include/uapi/linux/perf_event.h   |  35 ++-
 kernel/events/core.c  |  41 +++-
 kernel/trace/trace_event_perf.c   | 127 +++
 kernel/trace/trace_kprobe.c   |  91 +++--
 kernel/trace/trace_probe.h|  11 ++
 kernel/trace/trace_uprobe.c   |  90 +++--
 samples/bpf/Makefile  |   3 +
 samples/bpf/bpf_load.c|  61 ++-
 samples/bpf/bpf_load.h|  12 +++
 samples/bpf/test_many_kprobe_user.c   | 184 ++
 tools/include/uapi/linux/perf_event.h |  35 ++-
 12 files changed, 644 insertions(+), 48 deletions(-)
 create mode 100644 samples/bpf/test_many_kprobe_user.c

--
2.9.5

Re: [PATCH][v3] uprobes/x86: emulate push insns for uprobe on x86

2017-11-15 Thread Oleg Nesterov

On 11/15, Oleg Nesterov wrote:
>
> So please, check if uprobe_init_insn() fails or not in this case. After that
> we will know whether your patch needs the additional is_64bit_mm() check in
> push_setup_xol_ops() or not.

OK, I did the check for you.

uprobe_init_insn() doesn't fail but insn_init(x86_64 => 0) parse it as
single-byte insn with OPCODE1 == 0x41, so push_setup_xol_ops() doesn't
need to worry about compat tasks.

In short, your "V2" should be fine except you can factor out
auprobe->push.ilen initialization (as you did in V3). Please send V4.

Oleg.

Re: question lan9303: DSA concurrency and locking,

2017-11-15 Thread Vivien Didelot

Hi,

>> Does DSA offer any protection against concurrent calls of
>> dsa_switch_ops?

This is something I thought about for a while. Since DSA offers an
abstraction of different net stack entry points to its drivers, like
netlink (bridge, etc.) or ioctl (ethtool), it would make sense to add a
mutex at the DSA layer.

It would naturally go in the dsa_switch structure, and be (un)locked by
DSA core. But a switch fabric might be composed of multiple devices, so
this locking must happen at the dsa_switch_tree level. The entry points
to the DSA core are the dsa_port structure, either accessed via a master
interface's dsa_ptr pointer, or via a slave interface's private data.

So ideally the locking of the control path must occur when notifying an
operation to every device of the tree, i.e. in dsa_port_notify:

mutex_lock(>lock);
err = raw_notifier_call_chain(>nh, e, v);
mutex_lock(>lock);

Unfortunately the code is not ready for that yet, because not all calls
to ds->ops->foo are centralized yet. But we are slowly going that way.
In the meantime, DSA drivers handle locking themselves when necessary.


Thanks,

Vivien

1 2 >

1 - 100 of 188 matches

Mail list logo