Re: [net-next 00/14][pull request] 100GbE Intel Wired LAN Driver Updates 2016-08-29

2016-08-29 Thread David Miller
From: Jeff Kirsher 
Date: Mon, 29 Aug 2016 02:13:32 -0700

> This series contains updates to fm10k only.

Pulled, thanks Jeff.



Re: [net-next RFC v2 4/9] bpf, security: Add Checmate security LSM and BPF program type

2016-08-29 Thread Sargun Dhillon
On Mon, Aug 29, 2016 at 02:49:17PM -0700, Alexei Starovoitov wrote:
> On 8/29/16 12:24 PM, Tejun Heo wrote:
> >Hello, Sargun.
> >
> >On Mon, Aug 29, 2016 at 11:49:07AM -0700, Sargun Dhillon wrote:
> >>It would be a separate hook per LSM hook. Why wouldn't we want a separate 
> >>bpf
> >>hook per lsm hook? I think if one program has to handle them all, the first
> >>program would be looking up the hook program in a bpf prog array. If you 
> >>think
> >>it's better to have this logic in the BPF program, that makes sense.
> >>
> >>I had a version of this patch that allowed you to attach a prog array 
> >>instead,
> >>but I think that it's cleaner attaching a program per lsm hook. In addition,
> >>there's a performance impact that comes from these hooks, so I wouldn't 
> >>want to
> >>execute unnecessary code if it's avoidable.
> >
> >Hmm... it doesn't really matter how the backend part looks like and if
> >we need to implement per-call hooks to lower runtime overhead, sure.
> >I was mostly worried about the approach propagating through the
> >userland visible interface.
> >
> >>The prog array approach also makes stacking filters difficult. If people 
> >>want
> >>multiple filters per hook, the orchestrator would have to rewrite the 
> >>existing
> >>filters to be cooperative.
> >
> >I'm not really sure "stacking" in the kernel side is a good idea.
> >Please see below.
> >
> >>>I'm not convinced about the approach.  It's an approach which pretty
> >>>much requires future extensions while being rigid.  Not a good
> >>>combination.
> >>
> >>Do you have an alternative recommendation? Maybe just a set of 5 u64s
> >>as the context object along with the hook ID?
> >
> >cgroup fs doesn't seem like the right interface for this but if it
> >were I'd go for named hook IDs instead of opaque numbers.
> >
> >>>Unless this is properly delegatable, IOW, it's safe to fully delegate
> >>>to a lesser security domain for all operations including program
> >>>loading and assignment (I can't see how that'd be the case), making it
> >>>an explicit controller doesn't work in terms of userland interface.
> >>>It's fine for bpf / lsm / whatever to attach to cgroups by extending
> >>>struct cgroup itself or implementing an implicit controller but to be
> >>>visible as an explicit controller it must be able to follow cgroup
> >>>interface rules including delegation.  If not, it's best to come
> >>>through the interface which enforces the required permission checks
> >>>and then talk to cgroup from there.  This was also an issue with
> >>>network cgroup bpf programs that Daniel Mack is working on.  Please
> >>>chat with him.
> >>
> >>Program assignment is possible by lesser security domains. Program loading 
> >>is
> >>limited to CAP_SYS_ADMIN in init_user_ns. We could make it accessible to
> >>CAP_SYS_ADMIN in any userns, but the reasoning behind this is that 
> >>Checmate
> >>BPF programs can leak kernel pointers.
> >
> >That doesn't make much sense to me.  Delegation doesn't mean much if a
> >delegatee can't load its own program (and I don't see how one can
> >delegate kernel pointer access to !root).  Also, unless there's
> >per-program fine control on who can load it, it seems pretty dangerous
> >to let anyone load any program.
> >
> >>Could we potentially restrict it to only CAP_MAC_OVERRIDE, while still 
> >>meeting
> >>cgroup delegation requirements?
> >
> >Wouldn't it make far more sense to pass cgroup fd to bpf syscall so
> >that "load this program" and "attach this program to the cgroup
> >identified by this fd" through the same interface and permission
> >checks?  cgroup participating in bpf operations is all fine but
> >splitting the userland interface across two domains seems like a bad
> >idea.
> >
> >>Filters which are higher up in the hierarchy will still be enforced during
> >>delegation. This was an explicit design, as the "Orchestrator in 
> >>Orchestrator"
> >>use case needs to be supported.
> >
> >Given that program loading is restricted to root, wouldn't it be a
> >lot more efficient approach to let userland multiplex multiple
> >programs?  Walking the tree executing bpf programs each time one of
> >these operations runs can be pretty expensive.  Imagine a tree like
> >the following.
> >
> > A - B - C
> >   \ D
> >
> >Let's say program is currently loaded on D.  If someone wants to add a
> >program on B, userland can load the program on B, combine B's and D's
> >program and compile them into a single program and load it on D.  The
> >only thing kernel would need to do in terms of hierarchy is finding
> >what's the closest program to execute.  In the above example, C would
> >need to use B's program and that can be determined on program
> >assignment time rather than on each operation.
> 
> I think that's exactly what Daniel's patches are doing and imo
> it makes sense to keep this style for lsm as well
> and also apply the concept of hook_id.
> Daniel adds two commands to bpf syscall to 

[PATCH net v3 3/9] net: ethernet: mediatek: fix API usage with skb_free_frag

2016-08-29 Thread sean.wang
From: Sean Wang 

use skb_free_frag() instead of legacy put_page()

Signed-off-by: Sean Wang 
Acked-by: John Crispin 
---
 drivers/net/ethernet/mediatek/mtk_eth_soc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.c 
b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
index 33a890c..0f0072e 100644
--- a/drivers/net/ethernet/mediatek/mtk_eth_soc.c
+++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
@@ -868,7 +868,7 @@ static int mtk_poll_rx(struct napi_struct *napi, int budget,
/* receive data */
skb = build_skb(data, ring->frag_size);
if (unlikely(!skb)) {
-   put_page(virt_to_head_page(new_data));
+   skb_free_frag(new_data);
netdev->stats.rx_dropped++;
goto release_desc;
}
-- 
1.9.1
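
For context, skb_free_frag() is the dedicated helper for releasing a
buffer that came from the page-fragment allocators (netdev_alloc_frag()
and friends). Around v4.8 it is essentially the following one-line
wrapper (a sketch from memory, not quoted from this series):

/* include/linux/skbuff.h, approximately as of v4.8 */
static inline void skb_free_frag(void *addr)
{
	__free_page_frag(addr);	/* drops the page-frag reference correctly */
}

Using the helper keeps the fragment refcounting details in one place
instead of open-coding put_page(virt_to_head_page(...)) in drivers.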



[PATCH net v3 6/9] net: ethernet: mediatek: fix issue of driver removal with interface is up

2016-08-29 Thread sean.wang
From: Sean Wang 

mtk_stop() must be called first on module removal to free the
DMA resources acquired by mtk_open() and to restore the state
it changed.

Signed-off-by: Sean Wang 
---
 drivers/net/ethernet/mediatek/mtk_eth_soc.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.c 
b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
index 5bfca65..8dcd008 100644
--- a/drivers/net/ethernet/mediatek/mtk_eth_soc.c
+++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
@@ -1903,6 +1903,14 @@ err_free_dev:
 static int mtk_remove(struct platform_device *pdev)
 {
struct mtk_eth *eth = platform_get_drvdata(pdev);
+   int i;
+
+   /* stop all devices to make sure that dma is properly shut down */
+   for (i = 0; i < MTK_MAC_COUNT; i++) {
+   if (!eth->netdev[i])
+   continue;
+   mtk_stop(eth->netdev[i]);
+   }
 
clk_disable_unprepare(eth->clks[MTK_CLK_ETHIF]);
clk_disable_unprepare(eth->clks[MTK_CLK_ESW]);
-- 
1.9.1



[PATCH net v3 9/9] net: ethernet: mediatek: fix error handling inside mtk_mdio_init

2016-08-29 Thread sean.wang
From: Sean Wang 

Return -ENODEV if the MDIO bus is disabled in the device tree.

Signed-off-by: Sean Wang 
Acked-by: John Crispin 
Reviewed-by: Andrew Lunn 
---
 drivers/net/ethernet/mediatek/mtk_eth_soc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.c 
b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
index 2fc48bb..dfc12ab 100644
--- a/drivers/net/ethernet/mediatek/mtk_eth_soc.c
+++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
@@ -304,7 +304,7 @@ static int mtk_mdio_init(struct mtk_eth *eth)
}
 
if (!of_device_is_available(mii_np)) {
-   ret = 0;
+   ret = -ENODEV;
goto err_put_node;
}
 
-- 
1.9.1



[PATCH net v3 4/9] net: ethernet: mediatek: remove redundant free_irq for devm_request_irq allocated irq

2016-08-29 Thread sean.wang
From: Sean Wang 

These IRQs are not shared, and they are disabled when the ethernet
device stops. An IRQ requested with devm_request_irq() is freed
automatically on driver detach, so the explicit free_irq() calls are
redundant.

Signed-off-by: Sean Wang 
Acked-by: John Crispin 
---
 drivers/net/ethernet/mediatek/mtk_eth_soc.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.c 
b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
index 0f0072e..8d87748 100644
--- a/drivers/net/ethernet/mediatek/mtk_eth_soc.c
+++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
@@ -1510,8 +1510,6 @@ static void mtk_uninit(struct net_device *dev)
phy_disconnect(mac->phy_dev);
mtk_mdio_cleanup(eth);
mtk_irq_disable(eth, ~0);
-   free_irq(eth->irq[1], dev);
-   free_irq(eth->irq[2], dev);
 }
 
 static int mtk_do_ioctl(struct net_device *dev, struct ifreq *ifr, int cmd)
-- 
1.9.1
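
For context, the devm pattern this patch relies on looks roughly like
the following probe-time sketch (hedged; the handler name and IRQ index
are illustrative, not copied from the driver):

/* err and eth are driver state, elided.  The devres core releases the
 * IRQ automatically when the driver detaches, so no free_irq() is
 * needed in any teardown path. */
err = devm_request_irq(&pdev->dev, eth->irq[1], mtk_handle_irq_tx, 0,
		       dev_name(&pdev->dev), eth);
if (err)
	return err;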



[PATCH net v3 7/9] net: ethernet: mediatek: fix the missing of_node_put() after node is used done inside mtk_mdio_init

2016-08-29 Thread sean.wang
From: Sean Wang 

This patch adds the missing of_node_put() once the node returned by
of_get_child_by_name() is no longer used.

Signed-off-by: Sean Wang 
Acked-by: John Crispin 
---
 drivers/net/ethernet/mediatek/mtk_eth_soc.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.c 
b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
index 8dcd008..f77e173 100644
--- a/drivers/net/ethernet/mediatek/mtk_eth_soc.c
+++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
@@ -324,6 +324,7 @@ static int mtk_mdio_init(struct mtk_eth *eth)
err = of_mdiobus_register(eth->mii_bus, mii_np);
if (err)
goto err_free_bus;
+   of_node_put(mii_np);
 
return 0;
 
-- 
1.9.1
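
The underlying rule: of_get_child_by_name() returns the node with an
elevated reference count, so every path that is finished with the node
must drop that reference. A minimal sketch of the balanced pattern
(setup_from_node() is a hypothetical helper):

struct device_node *np;
int ret;

np = of_get_child_by_name(dev->of_node, "mdio-bus");
if (!np)
	return -ENODEV;
ret = setup_from_node(np);	/* hypothetical helper */
of_node_put(np);		/* balance the get on success and error alike */
return ret;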



[PATCH net v3 1/9] net: ethernet: mediatek: fix fails from TX housekeeping due to incorrect port setup

2016-08-29 Thread sean.wang
From: Sean Wang 

Which net device an SKB completion is meant for depends on the forward
port set in the txd4 field of the corresponding TX descriptor. This
information was not set up for SKB fragments, which leads to watchdog
timeouts from the upper layer, so fix it up.

Signed-off-by: Sean Wang 
Acked-by: John Crispin 
---
 drivers/net/ethernet/mediatek/mtk_eth_soc.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.c 
b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
index 1801fd8..6e4a6ca 100644
--- a/drivers/net/ethernet/mediatek/mtk_eth_soc.c
+++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
@@ -587,14 +587,15 @@ static int mtk_tx_map(struct sk_buff *skb, struct 
net_device *dev,
dma_addr_t mapped_addr;
unsigned int nr_frags;
int i, n_desc = 1;
-   u32 txd4 = 0;
+   u32 txd4 = 0, fport;
 
itxd = ring->next_free;
if (itxd == ring->last_free)
return -ENOMEM;
 
/* set the forward port */
-   txd4 |= (mac->id + 1) << TX_DMA_FPORT_SHIFT;
+   fport = (mac->id + 1) << TX_DMA_FPORT_SHIFT;
+   txd4 |= fport;
 
tx_buf = mtk_desc_to_tx_buf(ring, itxd);
memset(tx_buf, 0, sizeof(*tx_buf));
@@ -652,7 +653,7 @@ static int mtk_tx_map(struct sk_buff *skb, struct 
net_device *dev,
WRITE_ONCE(txd->txd3, (TX_DMA_SWC |
   TX_DMA_PLEN0(frag_map_size) |
   last_frag * TX_DMA_LS0));
-   WRITE_ONCE(txd->txd4, 0);
+   WRITE_ONCE(txd->txd4, fport);
 
tx_buf->skb = (struct sk_buff *)MTK_DMA_DUMMY_DESC;
tx_buf = mtk_desc_to_tx_buf(ring, txd);
-- 
1.9.1



[PATCH net v3 0/9] net: ethernet: mediatek: a couple of fixes

2016-08-29 Thread sean.wang
From: Sean Wang 

This series contains a couple of fixes that came out of integrating
with Linux 4.8-rc1. They have all been verified to work on 4.8-rc1.

Changes since v1:
- use a loop to work out whether all required clocks are ready,
instead of tedious repetition
- remove the redundant pinctrl setup that is already done by the core
driver (thanks to Andrew Lunn for the careful and patient review)
- split distinct changes into separate patches
- rename the variable err to ret for readability

Changes since v2:
- restore the original clock disabling sequence, which was changed
accidentally in the last version
- refine the commit logs that could cause misunderstanding of what
the changes do
- refine a commit log whose footnote was lost due to improper
delimiter use

Sean Wang (9):
  net: ethernet: mediatek: fix fails from TX housekeeping due to
incorrect port setup
  net: ethernet: mediatek: fix incorrect return value of devm_clk_get
with EPROBE_DEFER
  net: ethernet: mediatek: fix API usage with skb_free_frag
  net: ethernet: mediatek: remove redundant free_irq for
devm_request_irq allocated irq
  net: ethernet: mediatek: fix logic unbalance between probe and remove
  net: ethernet: mediatek: fix issue of driver removal with interface is
up
  net: ethernet: mediatek: fix the missing of_node_put() after node is
used done inside mtk_mdio_init
  net: ethernet: mediatek: use devm_mdiobus_alloc instead of
mdiobus_alloc inside mtk_mdio_init
  net: ethernet: mediatek: fix error handling inside mtk_mdio_init

 drivers/net/ethernet/mediatek/mtk_eth_soc.c | 82 +++--
 drivers/net/ethernet/mediatek/mtk_eth_soc.h | 22 +---
 2 files changed, 56 insertions(+), 48 deletions(-)

-- 
1.9.1



[PATCH net v3 2/9] net: ethernet: mediatek: fix incorrect return value of devm_clk_get with EPROBE_DEFER

2016-08-29 Thread sean.wang
From: Sean Wang 

1) If devm_clk_get() returns -EPROBE_DEFER, we should defer probing
the driver. The change is verified and works on 4.8-rc1 together
with the latest clk-next code for MT7623.
2) Use a loop to work out whether all required clocks are available.

Signed-off-by: Sean Wang 
---
 drivers/net/ethernet/mediatek/mtk_eth_soc.c | 39 -
 drivers/net/ethernet/mediatek/mtk_eth_soc.h | 22 ++--
 2 files changed, 36 insertions(+), 25 deletions(-)

diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.c 
b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
index 6e4a6ca..33a890c 100644
--- a/drivers/net/ethernet/mediatek/mtk_eth_soc.c
+++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
@@ -50,6 +50,10 @@ static const struct mtk_ethtool_stats {
MTK_ETHTOOL_STAT(rx_flow_control_packets),
 };
 
+static const char * const mtk_clks_source_name[] = {
+   "ethif", "esw", "gp1", "gp2"
+};
+
 void mtk_w32(struct mtk_eth *eth, u32 val, unsigned reg)
 {
__raw_writel(val, eth->base + reg);
@@ -1811,6 +1815,7 @@ static int mtk_probe(struct platform_device *pdev)
if (!eth)
return -ENOMEM;
 
+   eth->dev = &pdev->dev;
eth->base = devm_ioremap_resource(&pdev->dev, res);
if (IS_ERR(eth->base))
return PTR_ERR(eth->base);
@@ -1845,21 +1850,21 @@ static int mtk_probe(struct platform_device *pdev)
return -ENXIO;
}
}
+   for (i = 0; i < ARRAY_SIZE(eth->clks); i++) {
+   eth->clks[i] = devm_clk_get(eth->dev,
+   mtk_clks_source_name[i]);
+   if (IS_ERR(eth->clks[i])) {
+   if (PTR_ERR(eth->clks[i]) == -EPROBE_DEFER)
+   return -EPROBE_DEFER;
+   return -ENODEV;
+   }
+   }
 
-   eth->clk_ethif = devm_clk_get(&pdev->dev, "ethif");
-   eth->clk_esw = devm_clk_get(&pdev->dev, "esw");
-   eth->clk_gp1 = devm_clk_get(&pdev->dev, "gp1");
-   eth->clk_gp2 = devm_clk_get(&pdev->dev, "gp2");
-   if (IS_ERR(eth->clk_esw) || IS_ERR(eth->clk_gp1) ||
-   IS_ERR(eth->clk_gp2) || IS_ERR(eth->clk_ethif))
-   return -ENODEV;
-
-   clk_prepare_enable(eth->clk_ethif);
-   clk_prepare_enable(eth->clk_esw);
-   clk_prepare_enable(eth->clk_gp1);
-   clk_prepare_enable(eth->clk_gp2);
+   clk_prepare_enable(eth->clks[MTK_CLK_ETHIF]);
+   clk_prepare_enable(eth->clks[MTK_CLK_ESW]);
+   clk_prepare_enable(eth->clks[MTK_CLK_GP1]);
+   clk_prepare_enable(eth->clks[MTK_CLK_GP2]);
 
-   eth->dev = &pdev->dev;
eth->msg_enable = netif_msg_init(mtk_msg_level, MTK_DEFAULT_MSG_ENABLE);
INIT_WORK(&eth->pending_work, mtk_pending_work);
 
@@ -1902,10 +1907,10 @@ static int mtk_remove(struct platform_device *pdev)
 {
struct mtk_eth *eth = platform_get_drvdata(pdev);
 
-   clk_disable_unprepare(eth->clk_ethif);
-   clk_disable_unprepare(eth->clk_esw);
-   clk_disable_unprepare(eth->clk_gp1);
-   clk_disable_unprepare(eth->clk_gp2);
+   clk_disable_unprepare(eth->clks[MTK_CLK_ETHIF]);
+   clk_disable_unprepare(eth->clks[MTK_CLK_ESW]);
+   clk_disable_unprepare(eth->clks[MTK_CLK_GP1]);
+   clk_disable_unprepare(eth->clks[MTK_CLK_GP2]);
 
netif_napi_del(&eth->tx_napi);
netif_napi_del(&eth->rx_napi);
diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.h 
b/drivers/net/ethernet/mediatek/mtk_eth_soc.h
index f82e3ac..6e1ade7 100644
--- a/drivers/net/ethernet/mediatek/mtk_eth_soc.h
+++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.h
@@ -290,6 +290,17 @@ enum mtk_tx_flags {
MTK_TX_FLAGS_PAGE0  = 0x02,
 };
 
+/* This enum allows us to identify how the clock is defined on the array of the
+ * clock in the order
+ */
+enum mtk_clks_map {
+   MTK_CLK_ETHIF,
+   MTK_CLK_ESW,
+   MTK_CLK_GP1,
+   MTK_CLK_GP2,
+   MTK_CLK_MAX
+};
+
 /* struct mtk_tx_buf - This struct holds the pointers to the memory pointed at
  * by the TX descriptors
  * @skb:   The SKB pointer of the packet being sent
@@ -370,10 +381,7 @@ struct mtk_rx_ring {
  * @scratch_ring:  Newer SoCs need memory for a second HW managed TX ring
  * @phy_scratch_ring:  physical address of scratch_ring
  * @scratch_head:  The scratch memory that scratch_ring points to.
- * @clk_ethif: The ethif clock
- * @clk_esw:   The switch clock
- * @clk_gp1:   The gmac1 clock
- * @clk_gp2:   The gmac2 clock
+ * @clks:  clock array for all clocks required
  * @mii_bus:   If there is a bus we need to create an instance for it
  * @pending_work:  The workqueue used to reset the dma ring
  */
@@ -400,10 +408,8 @@ struct mtk_eth {
struct mtk_tx_dma   *scratch_ring;
dma_addr_t  phy_scratch_ring;
  

[PATCH net v3 8/9] net: ethernet: mediatek: use devm_mdiobus_alloc instead of mdiobus_alloc inside mtk_mdio_init

2016-08-29 Thread sean.wang
From: Sean Wang 

Many parts of the driver use devm_* APIs to benefit from device
resource management, so use devm_mdiobus_alloc() instead of
mdiobus_alloc() as well, for a cleaner code flow.

Using the common code provided by devm_* helps to
1) simplify the code flow, as [1] explains
2) decrease the risk of incorrect error handling
3) promote the API: only a few drivers have used it since it was
introduced in Linux 3.16.

Ref:
[1] https://patchwork.ozlabs.org/patch/344093/

Signed-off-by: Sean Wang 
Reviewed-by: Andrew Lunn 
---
 drivers/net/ethernet/mediatek/mtk_eth_soc.c | 23 ++++++-----------------
 1 file changed, 6 insertions(+), 17 deletions(-)

diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.c 
b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
index f77e173..2fc48bb 100644
--- a/drivers/net/ethernet/mediatek/mtk_eth_soc.c
+++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
@@ -295,7 +295,7 @@ err_phy:
 static int mtk_mdio_init(struct mtk_eth *eth)
 {
struct device_node *mii_np;
-   int err;
+   int ret;
 
mii_np = of_get_child_by_name(eth->dev->of_node, "mdio-bus");
if (!mii_np) {
@@ -304,13 +304,13 @@ static int mtk_mdio_init(struct mtk_eth *eth)
}
 
if (!of_device_is_available(mii_np)) {
-   err = 0;
+   ret = 0;
goto err_put_node;
}
 
-   eth->mii_bus = mdiobus_alloc();
+   eth->mii_bus = devm_mdiobus_alloc(eth->dev);
if (!eth->mii_bus) {
-   err = -ENOMEM;
+   ret = -ENOMEM;
goto err_put_node;
}
 
@@ -321,20 +321,11 @@ static int mtk_mdio_init(struct mtk_eth *eth)
eth->mii_bus->parent = eth->dev;
 
snprintf(eth->mii_bus->id, MII_BUS_ID_SIZE, "%s", mii_np->name);
-   err = of_mdiobus_register(eth->mii_bus, mii_np);
-   if (err)
-   goto err_free_bus;
-   of_node_put(mii_np);
-
-   return 0;
-
-err_free_bus:
-   mdiobus_free(eth->mii_bus);
+   ret = of_mdiobus_register(eth->mii_bus, mii_np);
 
 err_put_node:
of_node_put(mii_np);
-   eth->mii_bus = NULL;
-   return err;
+   return ret;
 }
 
 static void mtk_mdio_cleanup(struct mtk_eth *eth)
@@ -343,8 +334,6 @@ static void mtk_mdio_cleanup(struct mtk_eth *eth)
return;
 
mdiobus_unregister(eth->mii_bus);
-   of_node_put(eth->mii_bus->dev.of_node);
-   mdiobus_free(eth->mii_bus);
 }
 
 static inline void mtk_irq_disable(struct mtk_eth *eth, u32 mask)
-- 
1.9.1



[PATCH net v3 5/9] net: ethernet: mediatek: fix logic unbalance between probe and remove

2016-08-29 Thread sean.wang
From: Sean Wang 

The original mdio_cleanup call is not placed symmetrically with
mdio_init, so relocate mdio_cleanup to the right place.

Signed-off-by: Sean Wang 
Acked-by: John Crispin 
---
 drivers/net/ethernet/mediatek/mtk_eth_soc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.c 
b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
index 8d87748..5bfca65 100644
--- a/drivers/net/ethernet/mediatek/mtk_eth_soc.c
+++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
@@ -1508,7 +1508,6 @@ static void mtk_uninit(struct net_device *dev)
struct mtk_eth *eth = mac->hw;
 
phy_disconnect(mac->phy_dev);
-   mtk_mdio_cleanup(eth);
mtk_irq_disable(eth, ~0);
 }
 
@@ -1913,6 +1912,7 @@ static int mtk_remove(struct platform_device *pdev)
netif_napi_del(&eth->tx_napi);
netif_napi_del(&eth->rx_napi);
mtk_cleanup(eth);
+   mtk_mdio_cleanup(eth);
 
return 0;
 }
-- 
1.9.1
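
The general rule being restored: teardown in .remove should mirror setup
in .probe in reverse order, rather than being split across unrelated
callbacks. A hedged sketch with a hypothetical driver (eth/ndev elided):

static int foo_probe(struct platform_device *pdev)
{
	int err = foo_mdio_init(eth);	/* set up at probe time ... */
	if (err)
		return err;
	return register_netdev(ndev);
}

static int foo_remove(struct platform_device *pdev)
{
	unregister_netdev(ndev);
	foo_mdio_cleanup(eth);		/* ... so tear down at remove time,
					 * not in .ndo_uninit */
	return 0;
}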



Re: [PATCH net 1/2] ipv6: add missing netconf notif when 'all' is updated

2016-08-29 Thread YOSHIFUJI Hideaki


Nicolas Dichtel wrote:
> The 'default' value was not advertised.
> 
> Fixes: f3a1bfb11ccb ("rtnl/ipv6: use netconf msg to advertise forwarding 
> status")
> Signed-off-by: Nicolas Dichtel 
> ---
>  net/ipv6/addrconf.c | 7 +++++++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
> index f418d2eaeddd..299f0656e87f 100644
> --- a/net/ipv6/addrconf.c
> +++ b/net/ipv6/addrconf.c
> @@ -778,7 +778,14 @@ static int addrconf_fixup_forwarding(struct ctl_table 
> *table, int *p, int newf)
>   }
>  
>   if (p == &net->ipv6.devconf_all->forwarding) {
> + int old_dftl = net->ipv6.devconf_dflt->forwarding;

dflt, not dftl?

> +
>   net->ipv6.devconf_dflt->forwarding = newf;
> + if ((!newf) ^ (!old_dftl))
> + inet6_netconf_notify_devconf(net, NETCONFA_FORWARDING,
> +  NETCONFA_IFINDEX_DEFAULT,
> +  net->ipv6.devconf_dflt);
> +
>   addrconf_forward_change(net, newf);
>   if ((!newf) ^ (!old))
>   inet6_netconf_notify_devconf(net, NETCONFA_FORWARDING,
> 

-- 
Hideaki Yoshifuji 
Technical Division, MIRACLE LINUX CORPORATION


Re: [PATCH] net/mlx4_en: protect ring->xdp_prog with rcu_read_lock

2016-08-29 Thread Tariq Toukan

Hi Brenden,

The solution should be XDP-specific and must not hurt the
regular flow.


On 26/08/2016 11:38 PM, Brenden Blanco wrote:

Depending on the preempt mode, the bpf_prog stored in xdp_prog may be
freed despite the use of call_rcu inside bpf_prog_put. The situation is
possible when running in PREEMPT_RCU=y mode, for instance, since the rcu
callback for destroying the bpf prog can run even during the bh handling
in the mlx4 rx path.

Several options were considered before this patch was settled on:

Add a napi_synchronize loop in mlx4_xdp_set, which would occur after all
of the rings are updated with the new program.
This approach has the disadvantage that as the number of rings
increases, the speed of udpate will slow down significantly due to
napi_synchronize's msleep(1).
I prefer this option as it doesn't hurt the data path. A delay in a 
control command can be tolerated.

Add a new rcu_head in bpf_prog_aux, to be used by a new bpf_prog_put_bh.
The action of the bpf_prog_put_bh would be to then call bpf_prog_put
later. Those drivers that consume a bpf prog in a bh context (like mlx4)
would then use the bpf_prog_put_bh instead when the ring is up. This has
the problem of complexity, in maintaining proper refcnts and rcu lists,
and would likely be harder to review. In addition, this approach to
freeing must be exclusive with other frees of the bpf prog, for instance
a _bh prog must not be referenced from a prog array that is consumed by
a non-_bh prog.

The placement of rcu_read_lock in this patch is functionally the same as
putting an rcu_read_lock in napi_poll. Actually doing so could be a
potentially controversial change, but would bring the implementation in
line with sk_busy_loop (though of course the nature of those two paths
is substantially different), and would also avoid future copy/paste
problems with future supporters of XDP. Still, this patch does not take
that opinionated option.
So you decided to add a lock for all non-XDP flows, which are 99% of the 
cases.

We should avoid this.


Testing was done with kernels in either PREEMPT_RCU=y or
CONFIG_PREEMPT_VOLUNTARY=y+PREEMPT_RCU=n modes, with neither exhibiting
any drawback. With PREEMPT_RCU=n, the extra call to rcu_read_lock did
not show up in the perf report whatsoever, and with PREEMPT_RCU=y the
overhead of rcu_read_lock (according to perf) was the same before/after.
In the rx path, rcu_read_lock is eventually called for every packet
from netif_receive_skb_internal, so the napi poll call's rcu_read_lock
is easily amortized.

For now, I don't agree with this fix.
Let me think about the options you suggested.
I also need to do my perf tests.

Regards,
Tariq
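
For readers following the thread, the pattern Brenden's patch applies in
the rx path is roughly the following (a sketch, not the exact mlx4 code):

rcu_read_lock();	/* pins the prog: bpf_prog_put() frees via call_rcu */
prog = rcu_dereference(ring->xdp_prog);
if (prog)
	act = BPF_PROG_RUN(prog, &xdp);
rcu_read_unlock();

The disagreement above is whether this read-side cost belongs in the hot
path for all flows or only when an XDP program is attached.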



Re: [PATCH net-next 0/6] perf, bpf: add support for bpf in sw/hw perf_events

2016-08-29 Thread Brendan Gregg
On Mon, Aug 29, 2016 at 5:19 AM, Peter Zijlstra  wrote:
>
> On Fri, Aug 26, 2016 at 07:31:18PM -0700, Alexei Starovoitov wrote:
> > Hi Peter, Dave,
> >
> > this patch set is a follow up to the discussion:
> > https://lkml.org/lkml/2016/8/4/304
> > It turned out to be simpler than what we discussed.
> >
> > Patches 1-3 is a bpf-side prep for the main patch 4
> > that adds bpf program as an overflow_handler to sw and hw perf_events.
> > Peter, please review.
> >
> > Patches 5 and 6 are tests/examples from myself and Brendan.
>
> Brendan, so this works for you without extra hacks required?

Yes, thanks for checking, I've done both IP and stack sampling so far with it.

Brendan


Re: [PATCH] bonding: Allow tun-interfaces as slaves

2016-08-29 Thread Ding Tianhong
On 2016/8/12 2:24, Jörn Engel wrote:
> On Wed, Aug 10, 2016 at 05:58:38PM -0700, Jay Vosburgh wrote:
>> Jörn Engel  wrote:
>>> On Wed, Aug 10, 2016 at 02:26:49PM -0700, Jörn Engel wrote:

 Having to set one more parameter is a bit annoying.  It would have to be
 documented in a prominent place and people would still often miss it.
 So I wonder if we can make the interface a little nicer.

 Options:
 - If there are no slaves yet and the first slave added is tun, we trust
   the users to know what they are doing.  Automatically set
   bond->params.fail_over_mac = BOND_FOM_KEEPMAC
   Maybe do a printk to inform the user in case of a mistake.
>>
>>  I don't think this is feasible, as I don't see a reliable way to
>> test for a slave being a tun device (ARPHRD_NONE is not just tun, and we
>> cannot check the ops as they are not statically built into the kernel).
>> I'm also not sure that heuristics are the proper way to enable this
>> functionality in general.
> 
> I was looking for a slightly more generic thing than "is this device
> tun?".  Something along the lines of "is this device L3 only?".  We can
> always introduce a new flag and have tun set the flag.  Naïve me thought
> ARPHRD_NONE might already match what I was looking for.
> 

I think there is no such flag to distinguish a tun device. If you insist
on supporting a new flag for tun devices, you could send a patch and we
could review it; otherwise BOND_FOM_KEEPMAC is enough to fix this
problem.

Thanks
Ding

> But if such an approach causes problems for others, it is a non-starter.
> 
 - If we get an error and the slave device is tun, do a printk giving the
   user enough information to find this parameter.
>>
>>  This could probably be done as a change the existing logic, e.g.,
>>
>> diff --git a/drivers/net/bonding/bond_main.c 
>> b/drivers/net/bonding/bond_main.c
>> index 1f276fa30ba6..019c1a689aae 100644
>> --- a/drivers/net/bonding/bond_main.c
>> +++ b/drivers/net/bonding/bond_main.c
>> @@ -1443,6 +1443,9 @@ int bond_enslave(struct net_device *bond_dev, struct 
>> net_device *slave_dev)
>>  res = -EOPNOTSUPP;
>>  goto err_undo_flags;
>>  }
>> +} else if (BOND_MODE(bond) != BOND_MODE_ACTIVEBACKUP &&
>> +   bond->params.fail_over_mac != BOND_FOM_KEEPMAC) {
>> +netdev_err(bond_dev, "The slave device 
>> specified does not support setting the MAC address, but fail_over_mac is not 
>> set to keepmac\n");
>>  }
>>  }
>>  
>>  I haven't tested this, and I'm not sure it will get all corner
>> cases correct, but this should basically cover it.
> 
> Nit: Indentation is wrong (two tabs instead of one).
> 
> It should provide enough information for anyone that reads kernel messages.
> Works for me.
> 
> [588380.721349] bond1: Adding slave tun0
> [588380.721402] bond1: The slave device specified does not support setting 
> the MAC address
> [588380.721404] bond1: The slave device specified does not support setting 
> the MAC address, but fail_over_mac is not set to keepmac
> 
> Jörn
> 
> --
> It is easier to lead men to combat, stirring up their passion, than to
> restrain them and direct them toward the patient labours of peace.
> -- Andre Gide
> 
> .
> 




[net] i40e: Fix kernel panic on enable/disable LLDP

2016-08-29 Thread Jeff Kirsher
From: Dave Ertman 

If DCB is configured on the link partner switch with an
unsupported traffic class configuration (e.g. non-contiguous TCs),
the driver is flagging DCB as disabled.  But, for future DCB
LLDPDUs, the driver was checking if the interface was DCB capable
instead of enabled.  This was causing a kernel panic when LLDP
was enabled/disabled on the link partner switch.

This patch corrects the situation by having the LLDP event handler
check the correct flag in the pf structure.  It also cleans up the
setting and clearing of the enabled flag for other checks.

Signed-off-by: Dave Ertman 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_main.c | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c 
b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 828ed28..d0b3a1b 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -5113,9 +5113,13 @@ static int i40e_init_pf_dcb(struct i40e_pf *pf)
   DCB_CAP_DCBX_VER_IEEE;
 
pf->flags |= I40E_FLAG_DCB_CAPABLE;
-   /* Enable DCB tagging only when more than one TC */
+   /* Enable DCB tagging only when more than one TC
+* or explicitly disable if only one TC
+*/
if (i40e_dcb_get_num_tc(&hw->local_dcbx_config) > 1)
pf->flags |= I40E_FLAG_DCB_ENABLED;
+   else
+   pf->flags &= ~I40E_FLAG_DCB_ENABLED;
dev_dbg(&pf->pdev->dev,
"DCBX offload is supported for this PF.\n");
}
@@ -5716,7 +5720,7 @@ static int i40e_handle_lldp_event(struct i40e_pf *pf,
u8 type;
 
/* Not DCB capable or capability disabled */
-   if (!(pf->flags & I40E_FLAG_DCB_CAPABLE))
+   if (!(pf->flags & I40E_FLAG_DCB_ENABLED))
return ret;
 
/* Ignore if event is not for Nearest Bridge */
@@ -7896,6 +7900,7 @@ static int i40e_init_interrupt_scheme(struct i40e_pf *pf)
 #endif
   I40E_FLAG_RSS_ENABLED|
   I40E_FLAG_DCB_CAPABLE|
+  I40E_FLAG_DCB_ENABLED|
   I40E_FLAG_SRIOV_ENABLED  |
   I40E_FLAG_FD_SB_ENABLED  |
   I40E_FLAG_FD_ATR_ENABLED |
@@ -10502,6 +10507,7 @@ static void i40e_determine_queue_usage(struct i40e_pf 
*pf)
   I40E_FLAG_FD_SB_ENABLED  |
   I40E_FLAG_FD_ATR_ENABLED |
   I40E_FLAG_DCB_CAPABLE|
+  I40E_FLAG_DCB_ENABLED|
   I40E_FLAG_SRIOV_ENABLED  |
   I40E_FLAG_VMDQ_ENABLED);
} else if (!(pf->flags & (I40E_FLAG_RSS_ENABLED |
@@ -10525,7 +10531,8 @@ static void i40e_determine_queue_usage(struct i40e_pf 
*pf)
/* Not enough queues for all TCs */
if ((pf->flags & I40E_FLAG_DCB_CAPABLE) &&
(queues_left < I40E_MAX_TRAFFIC_CLASS)) {
-   pf->flags &= ~I40E_FLAG_DCB_CAPABLE;
+   pf->flags &= ~(I40E_FLAG_DCB_CAPABLE |
+   I40E_FLAG_DCB_ENABLED);
dev_info(&pf->pdev->dev, "not enough queues for DCB. 
DCB is disabled.\n");
}
pf->num_lan_qps = max_t(int, pf->rss_size_max,
@@ -10922,7 +10929,7 @@ static int i40e_probe(struct pci_dev *pdev, const 
struct pci_device_id *ent)
err = i40e_init_pf_dcb(pf);
if (err) {
dev_info(&pdev->dev, "DCB init failed %d, disabled\n", err);
-   pf->flags &= ~I40E_FLAG_DCB_CAPABLE;
+   pf->flags &= ~(I40E_FLAG_DCB_CAPABLE & I40E_FLAG_DCB_ENABLED);
/* Continue without DCB enabled */
}
 #endif /* CONFIG_I40E_DCB */
-- 
2.7.4



Re: [PATCH net-next 3/6] bpf: perf_event progs should only use preallocated maps

2016-08-29 Thread Daniel Borkmann

On 08/27/2016 04:31 AM, Alexei Starovoitov wrote:

Make sure that BPF_PROG_TYPE_PERF_EVENT programs only use
preallocated hash maps, since doing memory allocation
in overflow_handler can crash depending on where nmi got triggered.

Signed-off-by: Alexei Starovoitov 


Acked-by: Daniel Borkmann 


Re: [PATCH net-next 2/6] bpf: introduce BPF_PROG_TYPE_PERF_EVENT program type

2016-08-29 Thread Daniel Borkmann

On 08/27/2016 04:31 AM, Alexei Starovoitov wrote:

Introduce BPF_PROG_TYPE_PERF_EVENT programs that can be attached to
HW and SW perf events (PERF_TYPE_HARDWARE and PERF_TYPE_SOFTWARE
correspondingly in uapi/linux/perf_event.h)

The program visible context meta structure is
struct bpf_perf_event_data {
 struct pt_regs regs;
  __u64 sample_period;
};
which is accessible directly from the program:
int bpf_prog(struct bpf_perf_event_data *ctx)
{
   ... ctx->sample_period ...
   ... ctx->regs.ip ...
}

The bpf verifier rewrites the accesses into kernel internal
struct bpf_perf_event_data_kern which allows changing
struct perf_sample_data without affecting bpf programs.
New fields can be added to the end of struct bpf_perf_event_data
in the future.

Signed-off-by: Alexei Starovoitov 


Two things I noticed below, otherwise for BPF bits:

Acked-by: Daniel Borkmann 

[...]


+static bool pe_prog_is_valid_access(int off, int size, enum bpf_access_type 
type,
+   enum bpf_reg_type *reg_type)
+{
+   if (off < 0 || off >= sizeof(struct bpf_perf_event_data))
+   return false;
+   if (type != BPF_READ)
+   return false;
+   if (off % size != 0)
+   return false;
+   if (off == offsetof(struct bpf_perf_event_data, sample_period) &&
+   size != sizeof(u64))
+   return false;
+   if (size != sizeof(long))
+   return false;


Wouldn't this one rather need to be:

if (off == offsetof(struct bpf_perf_event_data, sample_period)) {
if (size != sizeof(u64))
return false;
} else {
if (size != sizeof(long))
return false;
}

Otherwise on 32bit accessing sample_period might fail?


+   return true;
+}
+
+static u32 pe_prog_convert_ctx_access(enum bpf_access_type type, int dst_reg,
+ int src_reg, int ctx_off,
+ struct bpf_insn *insn_buf,
+ struct bpf_prog *prog)
+{
+   struct bpf_insn *insn = insn_buf;
+
+   switch (ctx_off) {
+   case offsetof(struct bpf_perf_event_data, sample_period):


Would be good to add a test as we usually have done:

BUILD_BUG_ON(FIELD_SIZEOF(struct perf_sample_data, period) != 8);


+   *insn++ = BPF_LDX_MEM(bytes_to_bpf_size(FIELD_SIZEOF(struct 
bpf_perf_event_data_kern, data)),
+ dst_reg, src_reg,
+ offsetof(struct bpf_perf_event_data_kern, 
data));
+   *insn++ = BPF_LDX_MEM(BPF_DW, dst_reg, dst_reg,
+ offsetof(struct perf_sample_data, 
period));
+   break;
+   default:
+   *insn++ = BPF_LDX_MEM(bytes_to_bpf_size(FIELD_SIZEOF(struct 
bpf_perf_event_data_kern, regs)),
+ dst_reg, src_reg,
+ offsetof(struct bpf_perf_event_data_kern, 
regs));
+   *insn++ = BPF_LDX_MEM(bytes_to_bpf_size(sizeof(long)),
+ dst_reg, dst_reg, ctx_off);
+   break;
+   }
+   return insn - insn_buf;
+}
+


Re: [PATCH net-next 1/6] bpf: support 8-byte metafield access

2016-08-29 Thread Daniel Borkmann

On 08/27/2016 04:31 AM, Alexei Starovoitov wrote:

The verifier supported only 4-byte metafields in
struct __sk_buff and struct xdp_md. The metafields in upcoming
struct bpf_perf_event are 8-byte to match register width in struct pt_regs.
Teach verifier to recognize 8-byte metafield access.
The patch doesn't affect safety of sockets and xdp programs.
They check for 4-byte only ctx access before these conditions are hit.

Signed-off-by: Alexei Starovoitov 


Acked-by: Daniel Borkmann 


[NetDev] [ANNOUNCE] Netdev 1.2 weekly updates (30th August, 2016)

2016-08-29 Thread Hajime Tazaki

Hello folks,

I hope you're doing well.
Here is a weekly update of Netdev 1.2 Tokyo.

The early bird registration is still available.  Please
don't miss the discounted ticket - your early registration
will definitely help us prepare the conference.

http://netdevconf.org/1.2/registration.html


== Newly accepted sessions ==

Here are the sessions newly accepted in the last week.  We're
going to announce more once they are confirmed, so stay
tuned for more exciting sessions!

Full list of accepted sessions is available here.

http://netdevconf.org/1.2/accepted-sessions.html

* Talk

- The adventures of a Suricate in eBPF land
  by Eric Leblond

- Accelerating Linux IP Virtual Server with OpenNPU
  by Gilad Ben-Yossef

- Advanced programmability and recent updates with tc's cls_bpf
  by Daniel Borkmann

- Linux Forwarding Stack Fastpath
  by Nishit Shah, Jagdish Motwani

- VNIC offloads fact or fiction?
  by Stephen Hemminger

* Tutorial

- Introduction to switchdev SR-IOV offloads
  by Or Gerlitz, Hadar Hen-Zion, Amir Vadai, Rony Efraim


== Sponsors ==

Zen Load Balancer has become a Bronze sponsor. Thank you for
the support!

Our sponsors:
- Platinum
Verizon, Facebook, Cumulus Networks
- Gold
Mojatatu Networks, VMWare, Google, NTT, LinkedIn (new)
- Silver
NetApp, IIJ, Netronome, SolarFlare, Mellanox, Sophos (new)
- Bronze
Zen Load Balancer (new)

Twitter: https://twitter.com/netdev01
Web: http://netdevconf.org/1.2/

== Others ==

Be prepared for your travel! Hotel and travel information
is available on the web pages.

http://netdevconf.org/1.2/travel.html
http://netdevconf.org/1.2/hotel.html


Looking forward to seeing you in Tokyo very soon.

-- Hajime



Re: [net-next RFC v2 4/9] bpf, security: Add Checmate security LSM and BPF program type

2016-08-29 Thread Andy Lutomirski
On Aug 29, 2016 3:19 PM, "Mickaël Salaün"  wrote:
>
>
> On 29/08/2016 23:49, Alexei Starovoitov wrote:
> > On 8/29/16 12:24 PM, Tejun Heo wrote:
> >> Hello, Sargun.
> >>
> >> On Mon, Aug 29, 2016 at 11:49:07AM -0700, Sargun Dhillon wrote:
> >>> It would be a separate hook per LSM hook. Why wouldn't we want a
> >>> separate bpf
> >>> hook per lsm hook? I think if one program has to handle them all, the
> >>> first
> >>> program would be looking up the hook program in a bpf prog array. If
> >>> you think
> >>> it's better to have this logic in the BPF program, that makes sense.
> >>>
> >>> I had a version of this patch that allowed you to attach a prog array
> >>> instead,
> >>> but I think that it's cleaner attaching a program per lsm hook. In
> >>> addition,
> >>> there's a performance impact that comes from these hooks, so I
> >>> wouldn't want to
> >>> execute unnecessary code if it's avoidable.
> >>
> >> Hmm... it doesn't really matter how the backend part looks like and if
> >> we need to implement per-call hooks to lower runtime overhead, sure.
> >> I was mostly worried about the approach propagating through the
> >> userland visible interface.
> >>
> >>> The prog array approach also makes stacking filters difficult. If
> >>> people want
> >>> multiple filters per hook, the orchestrator would have to rewrite the
> >>> existing
> >>> filters to be cooperative.
> >>
> >> I'm not really sure "stacking" in the kernel side is a good idea.
> >> Please see below.
> >>
>  I'm not convinced about the approach.  It's an approach which pretty
>  much requires future extensions while being rigid.  Not a good
>  combination.
> >>>
> >>> Do you have an alternative recommendation? Maybe just a set of 5 u64s
> >>> as the context object along with the hook ID?
> >>
> >> cgroup fs doesn't seem like the right interface for this but if it
> >> were I'd go for named hook IDs instead of opaque numbers.
> >>
>  Unless this is properly delegatable, IOW, it's safe to fully delegate
>  to a lesser security domain for all operations including program
>  loading and assignment (I can't see how that'd be the case), making it
>  an explicit controller doesn't work in terms of userland interface.
>  It's fine for bpf / lsm / whatever to attach to cgroups by extending
>  struct cgroup itself or implementing an implicit controller but to be
>  visible as an explicit controller it must be able to follow cgroup
>  interface rules including delegation.  If not, it's best to come
>  through the interface which enforces the required permission checks
>  and then talk to cgroup from there.  This was also an issue with
>  network cgroup bpf programs that Daniel Mack is working on.  Please
>  chat with him.
> >>>
> >>> Program assignment is possible by lesser security domains. Program
> >>> loading is
> >>> limited to CAP_SYS_ADMIN in init_user_ns. We could make it accessible to
> >>> CAP_SYS_ADMIN in any userns, but the reasoning behind this is that
> >>> Checmate
> >>> BPF programs can leak kernel pointers.
> >>
> >> That doesn't make much sense to me.  Delegation doesn't mean much if a
> >> delegatee can't load its own program (and I don't see how one can
> >> delegate kernel pointer access to !root).  Also, unless there's
> >> per-program fine control on who can load it, it seems pretty dangerous
> >> to let anyone load any program.
> >>
> >>> Could we potentially restrict it to only CAP_MAC_OVERRIDE, while
> >>> still meeting
> >>> cgroup delegation requirements?
> >>
> >> Wouldn't it make far more sense to pass cgroup fd to bpf syscall so
> >> that "load this program" and "attach this program to the cgroup
> >> identified by this fd" through the same interface and permission
> >> checks?  cgroup participating in bpf operations is all fine but
> >> splitting the userland interface across two domains seems like a bad
> >> idea.
> >>
> >>> Filters which are higher up in the hierarchy will still be enforced
> >>> during
> >>> delegation. This was an explicit design, as the "Orchestrator in
> >>> Orchestrator"
> >>> use case needs to be supported.
> >>
> >> Given that program loading is restricted to root, wouldn't it be a
> >> lot more efficient approach to let userland multiplex multiple
> >> programs?  Walking the tree executing bpf programs each time one of
> >> these operations runs can be pretty expensive.  Imagine a tree like
> >> the following.
> >>
> >> A - B - C
> >>   \ D
> >>
> >> Let's say program is currently loaded on D.  If someone wants to add a
> >> program on B, userland can load the program on B, combine B's and D's
> >> program and compile them into a single program and load it on D.  The
> >> only thing kernel would need to do in terms of hierarchy is finding
> >> what's the closest program to execute.  In the above example, C would
> >> need to use B's program and that can be determined on program
> >> assignment time 

Re: [PATCH v3 4/6] net: filter: run cgroup eBPF ingress programs

2016-08-29 Thread Daniel Borkmann

On 08/26/2016 09:58 PM, Daniel Mack wrote:

If the cgroup associated with the receiving socket has an eBPF
programs installed, run them from sk_filter_trim_cap().

eBPF programs used in this context are expected to either return 1 to
let the packet pass, or != 1 to drop them. The programs have access to
the full skb, including the MAC headers.

Note that cgroup_bpf_run_filter() is stubbed out as static inline nop
for !CONFIG_CGROUP_BPF, and is otherwise guarded by a static key if
the feature is unused.

Signed-off-by: Daniel Mack 
---
  net/core/filter.c | 5 +++++
  1 file changed, 5 insertions(+)

diff --git a/net/core/filter.c b/net/core/filter.c
index bc04e5c..163f75b 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -78,6 +78,11 @@ int sk_filter_trim_cap(struct sock *sk, struct sk_buff *skb, 
unsigned int cap)
if (skb_pfmemalloc(skb) && !sock_flag(sk, SOCK_MEMALLOC))
return -ENOMEM;

+   err = cgroup_bpf_run_filter(sk, skb,
+   BPF_ATTACH_TYPE_CGROUP_INET_INGRESS);


Maybe just BPF_CGROUP_INET_{IN,E}GRESS (seems less cluttered, and we know
these were set via bpf(2) as attach_type anyway)?


+   if (err)
+   return err;
+
err = security_sock_rcv_skb(sk, skb);
if (err)
return err;
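
For the semantics described above (return 1 to let the packet pass,
anything else to drop it), a filter at this attach point could look
roughly like the following restricted-C sketch (illustrative names,
built with clang -target bpf):

#include <linux/bpf.h>

/* pass packets up to 1500 bytes, drop anything larger */
int ingress_filter(struct __sk_buff *skb)
{
	return skb->len <= 1500;	/* 1 = pass, 0 = drop */
}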





Re: [PATCH v3 2/6] cgroup: add support for eBPF programs

2016-08-29 Thread Sargun Dhillon
On Fri, Aug 26, 2016 at 09:58:48PM +0200, Daniel Mack wrote:
> This patch adds two sets of eBPF program pointers to struct cgroup.
> One for such that are directly pinned to a cgroup, and one for such
> that are effective for it.
> 
> To illustrate the logic behind that, assume the following example
> cgroup hierarchy.
> 
>   A - B - C
> \ D - E
> 
> If only B has a program attached, it will be effective for B, C, D
> and E. If D then attaches a program itself, that will be effective for
> both D and E, and the program in B will only affect B and C. Only one
> program of a given type is effective for a cgroup.
> 
How does this work when running an orchestrator within an orchestrator? The
Docker-in-Docker / Mesos-in-Mesos use case, where the top-level orchestrator
is observing the traffic, and an orchestrator within it also needs to run
filters.

In this case, I'd like to run E's filter, then if it returns 0, D's, and B's,
and so on. Is it possible to allow this, either by flattening out the data
structure (copy a ref to the bpf programs to C and E) or something similar?


> Attaching and detaching programs will be done through the bpf(2)
> syscall. For now, ingress and egress inet socket filtering are the
> only supported use-cases.
> 
> Signed-off-by: Daniel Mack 
> ---
>  include/linux/bpf-cgroup.h  |  70 +++
>  include/linux/cgroup-defs.h |   4 ++
>  init/Kconfig|  12 
>  kernel/bpf/Makefile |   1 +
>  kernel/bpf/cgroup.c | 165 
> 
>  kernel/cgroup.c |  18 +
>  6 files changed, 270 insertions(+)
>  create mode 100644 include/linux/bpf-cgroup.h
>  create mode 100644 kernel/bpf/cgroup.c
> 
> diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
> new file mode 100644
> index 000..a5a25c1
> --- /dev/null
> +++ b/include/linux/bpf-cgroup.h
> @@ -0,0 +1,70 @@
> +#ifndef _BPF_CGROUP_H
> +#define _BPF_CGROUP_H
> +
> +#include 
> +#include 
> +
> +struct sock;
> +struct cgroup;
> +struct sk_buff;
> +
> +#ifdef CONFIG_CGROUP_BPF
> +
> +extern struct static_key_false cgroup_bpf_enabled_key;
+#define cgroup_bpf_enabled static_branch_unlikely(&cgroup_bpf_enabled_key)
> +
> +struct cgroup_bpf {
> + /*
> +  * Store two sets of bpf_prog pointers, one for programs that are
> +  * pinned directly to this cgroup, and one for those that are effective
> +  * when this cgroup is accessed.
> +  */
> + struct bpf_prog *prog[__MAX_BPF_ATTACH_TYPE];
> + struct bpf_prog *effective[__MAX_BPF_ATTACH_TYPE];
> +};
> +
> +void cgroup_bpf_put(struct cgroup *cgrp);
> +void cgroup_bpf_inherit(struct cgroup *cgrp, struct cgroup *parent);
> +
> +void __cgroup_bpf_update(struct cgroup *cgrp,
> +  struct cgroup *parent,
> +  struct bpf_prog *prog,
> +  enum bpf_attach_type type);
> +
> +/* Wrapper for __cgroup_bpf_update() protected by cgroup_mutex */
> +void cgroup_bpf_update(struct cgroup *cgrp,
> +struct bpf_prog *prog,
> +enum bpf_attach_type type);
> +
> +int __cgroup_bpf_run_filter(struct sock *sk,
> + struct sk_buff *skb,
> + enum bpf_attach_type type);
> +
> +/* Wrapper for __cgroup_bpf_run_filter() guarded by cgroup_bpf_enabled */
> +static inline int cgroup_bpf_run_filter(struct sock *sk,
> + struct sk_buff *skb,
> + enum bpf_attach_type type)
> +{
> + if (cgroup_bpf_enabled)
> + return __cgroup_bpf_run_filter(sk, skb, type);
> +
> + return 0;
> +}
> +
> +#else
> +
> +struct cgroup_bpf {};
> +static inline void cgroup_bpf_put(struct cgroup *cgrp) {}
> +static inline void cgroup_bpf_inherit(struct cgroup *cgrp,
> +   struct cgroup *parent) {}
> +
> +static inline int cgroup_bpf_run_filter(struct sock *sk,
> + struct sk_buff *skb,
> + enum bpf_attach_type type)
> +{
> + return 0;
> +}
> +
> +#endif /* CONFIG_CGROUP_BPF */
> +
> +#endif /* _BPF_CGROUP_H */
> diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
> index 5b17de6..861b467 100644
> --- a/include/linux/cgroup-defs.h
> +++ b/include/linux/cgroup-defs.h
> @@ -16,6 +16,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #ifdef CONFIG_CGROUPS
>  
> @@ -300,6 +301,9 @@ struct cgroup {
>   /* used to schedule release agent */
>   struct work_struct release_agent_work;
>  
> + /* used to store eBPF programs */
> + struct cgroup_bpf bpf;
> +
>   /* ids of the ancestors at each level including self */
>   int ancestor_ids[];
>  };
> diff --git a/init/Kconfig b/init/Kconfig
> index cac3f09..5a89c83 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -1144,6 +1144,18 @@ config CGROUP_PERF
>  
> 

Re: [PATCH net-next 0/6] perf, bpf: add support for bpf in sw/hw perf_events

2016-08-29 Thread Alexei Starovoitov
On Mon, Aug 29, 2016 at 12:58:00PM +0200, Peter Zijlstra wrote:
> On Fri, Aug 26, 2016 at 07:31:18PM -0700, Alexei Starovoitov wrote:
> > Hi Peter, Dave,
> > 
> > this patch set is a follow up to the discussion:
> > https://lkml.org/lkml/2016/8/4/304
> 
> Please don't use lkml.org links, that site is broken too often.
> 
> The canonical reference is:
> 
>   https://lkml.kernel.org/r/$MSGID
> 
> That allows the kernel.org people to ensure the links stays valid by
> redirecting to a functional archive if the current one (marc.info) were
> to drop off the intarweb.
> 
> Also, by including the msg-id, people can find the emails in their local
> archives.

Makes sense though I couldn't figure out how to find that msg-id.
This one
https://lkml.kernel.org/r/20160804142853.GO6862%20()%20twins%20!%20programming%20!%20kicks-ass%20!%20net
works, but really ugly.



Re: [PATCH v3 3/6] bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH commands

2016-08-29 Thread Daniel Borkmann

On 08/26/2016 09:58 PM, Daniel Mack wrote:

Extend the bpf(2) syscall by two new commands, BPF_PROG_ATTACH and
BPF_PROG_DETACH which allow attaching and detaching eBPF programs
to a target.

On the API level, the target could be anything that has an fd in
userspace, hence the name of the field in union bpf_attr is called
'target_fd'.

When called with BPF_ATTACH_TYPE_CGROUP_INET_{E,IN}GRESS, the target is
expected to be a valid file descriptor of a cgroup v2 directory which
has the bpf controller enabled. These are the only use-cases
implemented by this patch at this point, but more can be added.

If a program of the given type already exists in the given cgroup,
the program is swapped automically, so userspace does not have to drop
an existing program first before installing a new one, which would
otherwise leave a gap in which no program is attached.

For more information on the propagation logic to subcgroups, please
refer to the bpf cgroup controller implementation.

The API is guarded by CAP_NET_ADMIN.

Signed-off-by: Daniel Mack 
syscall


^^^ slipped in?


---
  include/uapi/linux/bpf.h |  9 ++
  kernel/bpf/syscall.c | 83 
  2 files changed, 92 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 1d5db42..4cc2dcf 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -73,6 +73,8 @@ enum bpf_cmd {
BPF_PROG_LOAD,
BPF_OBJ_PIN,
BPF_OBJ_GET,
+   BPF_PROG_ATTACH,
+   BPF_PROG_DETACH,
  };

  enum bpf_map_type {
@@ -147,6 +149,13 @@ union bpf_attr {
__aligned_u64   pathname;
__u32   bpf_fd;
};
+
+   struct { /* anonymous struct used by BPF_PROG_ATTACH/DETACH commands */
+   __u32   target_fd;  /* container object to attach 
to */
+   __u32   attach_bpf_fd;  /* eBPF program to attach */
+   __u32   attach_type;/* BPF_ATTACH_TYPE_* */
+   __u64   attach_flags;


Could we just do ...

__u32 dst_fd;
__u32 src_fd;
__u32 attach_type;

... and leave flags out, since unused anyway? Also see below.


+   };
  } __attribute__((aligned(8)));

  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 228f962..cc4d603 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -822,6 +822,79 @@ static int bpf_obj_get(const union bpf_attr *attr)
return bpf_obj_get_user(u64_to_ptr(attr->pathname));
  }

+#ifdef CONFIG_CGROUP_BPF


#define BPF_PROG_ATTACH_LAST_FIELD attach_type
#define BPF_PROG_DETACH_LAST_FIELD BPF_PROG_ATTACH_LAST_FIELD


+static int bpf_prog_attach(const union bpf_attr *attr)
+{
+   struct bpf_prog *prog;
+
+   if (!capable(CAP_NET_ADMIN))
+   return -EPERM;
+
+   /* Flags are unused for now */
+   if (attr->attach_flags != 0)
+   return -EINVAL;


Correct way would be to:

if (CHECK_ATTR(BPF_PROG_ATTACH))
return -EINVAL;


+
+   switch (attr->attach_type) {
+   case BPF_ATTACH_TYPE_CGROUP_INET_INGRESS:
+   case BPF_ATTACH_TYPE_CGROUP_INET_EGRESS: {
+   struct cgroup *cgrp;
+
+   prog = bpf_prog_get_type(attr->attach_bpf_fd,
+BPF_PROG_TYPE_CGROUP_SOCKET_FILTER);
+   if (IS_ERR(prog))
+   return PTR_ERR(prog);
+
+   cgrp = cgroup_get_from_fd(attr->target_fd);
+   if (IS_ERR(cgrp)) {
+   bpf_prog_put(prog);
+   return PTR_ERR(cgrp);
+   }
+
+   cgroup_bpf_update(cgrp, prog, attr->attach_type);
+   cgroup_put(cgrp);
+
+   break;
+   }
+
+   default:
+   return -EINVAL;
+   }
+
+   return 0;
+}
+
+static int bpf_prog_detach(const union bpf_attr *attr)
+{
+   if (!capable(CAP_NET_ADMIN))
+   return -EPERM;
+
+   /* Flags are unused for now */
+   if (attr->attach_flags != 0)
+   return -EINVAL;


if (CHECK_ATTR(BPF_PROG_DETACH))
return -EINVAL;


+   switch (attr->attach_type) {
+   case BPF_ATTACH_TYPE_CGROUP_INET_INGRESS:
+   case BPF_ATTACH_TYPE_CGROUP_INET_EGRESS: {
+   struct cgroup *cgrp;
+
+   cgrp = cgroup_get_from_fd(attr->target_fd);
+   if (IS_ERR(cgrp))
+   return PTR_ERR(cgrp);
+
+   cgroup_bpf_update(cgrp, NULL, attr->attach_type);
+   cgroup_put(cgrp);
+
+   break;
+   }
+
+   default:
+   return -EINVAL;
+   }
+
+   return 0;
+}
+#endif /* CONFIG_CGROUP_BPF */
+
  SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, 
size)
  {
union bpf_attr attr = {};
@@ -888,6 +961,16 @@ 

Re: [PATCH] bonding: Allow tun-interfaces as slaves

2016-08-29 Thread Jörn Engel
Ping.

Not sure if we reached some conclusion yet.  If we did, I must have
missed it.

Jörn

--
If you managed to confuse the compiler, you lost your human readers
a long time ago.
-- hummassa


Re: [PATCH v3 2/6] cgroup: add support for eBPF programs

2016-08-29 Thread Daniel Borkmann

On 08/26/2016 09:58 PM, Daniel Mack wrote:

This patch adds two sets of eBPF program pointers to struct cgroup.
One for such that are directly pinned to a cgroup, and one for such
that are effective for it.

To illustrate the logic behind that, assume the following example
cgroup hierarchy.

   A - B - C
 \ D - E

If only B has a program attached, it will be effective for B, C, D
and E. If D then attaches a program itself, that will be effective for
both D and E, and the program in B will only affect B and C. Only one
program of a given type is effective for a cgroup.

Attaching and detaching programs will be done through the bpf(2)
syscall. For now, ingress and egress inet socket filtering are the
only supported use-cases.

Signed-off-by: Daniel Mack 

[...]

+void __cgroup_bpf_update(struct cgroup *cgrp,
+struct cgroup *parent,
+struct bpf_prog *prog,
+enum bpf_attach_type type)
+{
+   struct bpf_prog *old_prog, *effective;
+   struct cgroup_subsys_state *pos;
+
+   old_prog = xchg(cgrp->bpf.prog + type, prog);
+
+   if (prog)
+   static_branch_inc(&cgroup_bpf_enabled_key);
+
+   if (old_prog) {
+   bpf_prog_put(old_prog);
+   static_branch_dec(&cgroup_bpf_enabled_key);
+   }
+
+   effective = (!prog && parent) ?
+   rcu_dereference_protected(parent->bpf.effective[type],
+ lockdep_is_held(&cgroup_mutex)) :
+   prog;
+
+   css_for_each_descendant_pre(pos, &cgrp->self) {
+   struct cgroup *desc = container_of(pos, struct cgroup, self);
+
+   /* skip the subtree if the descendant has its own program */
+   if (desc->bpf.prog[type] && desc != cgrp)
+   pos = css_rightmost_descendant(pos);
+   else
+   rcu_assign_pointer(desc->bpf.effective[type],
+  effective);
+   }


Shouldn't the old_prog reference only be released right here at the end
instead of above (otherwise this could race)?


+}




[PATCH RFC 0/2] net: Iterate over cpu_present_mask during calculation of percpu statistics

2016-08-29 Thread Kirill Tkhai
Many statistics variables in the kernel are percpu. This allows them to be
neither atomic nor otherwise synchronized. The resulting value is calculated
as the sum of the values on every possible cpu.

The problem is that this scales badly: the calculation may take a lot of time.
For example, some machine configurations have many possible cpus, like below:

"smpboot: Allowing 192 CPUs, 160 hotplug CPUs"

There are only 32 real cpus, but 192 possible cpus.

I had a report about very slow getifaddrs() on an older kernel, where only
590 getifaddrs() calls/second were possible on a Xeon(R) CPU E5-2667 v3 @ 3.20GHz.

The patchset aims to begin solving this problem. It makes it possible to
iterate over the present cpus mask instead of the possible one. When a cpu
is going down, its statistics are moved to an alive cpu. This is done in a
CPU_DYING callback, which runs while the machine is stopped, so iteration
over the present cpus mask is safe with preemption disabled.
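
As a minimal sketch of the resulting read-side pattern, for some percpu
u64 counter (the variable names here are illustrative):

	u64 __percpu *counter;	/* some percpu statistic */
	u64 sum = 0;
	int cpu;

	preempt_disable();	/* CPU_DYING runs while the machine is
				 * stopped, so the present mask and the
				 * per-cpu values cannot move under us */
	for_each_present_cpu(cpu)
		sum += *per_cpu_ptr(counter, cpu);
	preempt_enable();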

The patchset could exclude even offline cpus, but I didn't do that, because
the main problem seems to be possible cpus. Also, that would require some
changes in kernel/cpu.c, so I'd like to hear people's opinion on whether
this is worthwhile first.

One more question is whether the rest of the kernel needs the same facility,
in which case the patchset should be made more generic.

Please, comment!

For the above configuration the patchset improves the test below by 2.9x:

#define _GNU_SOURCE /* To get defns of NI_MAXSERV and NI_MAXHOST */
#include <arpa/inet.h>
#include <sys/socket.h>
#include <netdb.h>
#include <ifaddrs.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>

int do_gia()
{
   struct ifaddrs *ifaddr, *ifa;
   int family, s, n;
   char host[NI_MAXHOST];

   if (getifaddrs(&ifaddr) == -1) {
   perror("getifaddrs");
   exit(EXIT_FAILURE);
   }

   /* touch the data  */
   for (ifa = ifaddr, n = 0; ifa != NULL; ifa = ifa->ifa_next, n++) {
   if (ifa->ifa_addr == NULL)
   continue;
   family = ifa->ifa_addr->sa_family;
   }
   freeifaddrs(ifaddr);
   return 0;
}

int main(int argc, char *argv[])
{
int i;
for(i=0; i<1; i++)
   do_gia();
}

---

Kirill Tkhai (2):
  net: Implement net_stats callbacks
  net: Iterate over present cpus only during ipstats calculation


 include/net/stats.h |9 ++
 net/core/Makefile   |1 +
 net/core/stats.c|   83 +++
 net/ipv6/addrconf.c |4 ++
 net/ipv6/af_inet6.c |   56 ++
 5 files changed, 152 insertions(+), 1 deletion(-)
 create mode 100644 include/net/stats.h
 create mode 100644 net/core/stats.c

--
Signed-off-by: Kirill Tkhai 


Re: kernel BUG at net/unix/garbage.c:149!"

2016-08-29 Thread Miklos Szeredi
On Sat, Aug 27, 2016 at 11:55 AM, Miklos Szeredi  wrote:
>
> On Wed, Aug 24, 2016 at 11:40 PM, Hannes Frederic Sowa
>  wrote:
> > On 24.08.2016 16:24, Nikolay Borisov wrote:
> >> Hello,
> >>
> >> I hit the following BUG:
> >>
> >> [1851513.239831] [ cut here ]
> >> [1851513.240079] kernel BUG at net/unix/garbage.c:149!
> >> [1851513.240313] invalid opcode:  [#1] SMP
> >> [1851513.248320] CPU: 37 PID: 11683 Comm: nginx Tainted: G   O
> >> 4.4.14-clouder3 #26
> >> [1851513.248719] Hardware name: Supermicro X10DRi/X10DRi, BIOS 1.1 
> >> 04/14/2015
> >> [1851513.248966] task: 883b0f6f ti: 880189cf task.ti: 
> >> 880189cf
> >> [1851513.249361] RIP: 0010:[]  [] 
> >> unix_notinflight+0x8d/0x90
> >> [1851513.249846] RSP: 0018:880189cf3cf8  EFLAGS: 00010246
> >> [1851513.250082] RAX: 883b05491968 RBX: 883b05491680 RCX: 
> >> 8807f9967330
> >> [1851513.250476] RDX: 0001 RSI: 882e6d8bae00 RDI: 
> >> 82073f10
> >> [1851513.250886] RBP: 880189cf3d08 R08: 880cbc70e200 R09: 
> >> 00018021
> >> [1851513.251280] R10: 883fff3b9dc0 R11: ea0032f1c380 R12: 
> >> 883fbaf5
> >> [1851513.251674] R13: 815f6354 R14: 881a7c77b140 R15: 
> >> 881a7c7792c0
> >> [1851513.252083] FS:  7f4f19573720() GS:883fff3a() 
> >> knlGS:
> >> [1851513.252481] CS:  0010 DS:  ES:  CR0: 80050033
> >> [1851513.252717] CR2: 013062d8 CR3: 001712f32000 CR4: 
> >> 001406e0
> >> [1851513.253116] Stack:
> >> [1851513.253345]   880189cf3d40 880189cf3d28 
> >> 815f4383
> >> [1851513.254022]  8839ee11a800 8839ee11a800 880189cf3d60 
> >> 815f53b8
> >> [1851513.254685]   883406788de0  
> >> 
> >> [1851513.255360] Call Trace:
> >> [1851513.255594]  [] unix_detach_fds.isra.19+0x43/0x50
> >> [1851513.255851]  [] unix_destruct_scm+0x48/0x80
> >> [1851513.256090]  [] skb_release_head_state+0x4f/0xb0
> >> [1851513.256328]  [] skb_release_all+0x12/0x30
> >> [1851513.256564]  [] kfree_skb+0x32/0xa0
> >> [1851513.256810]  [] unix_release_sock+0x1e4/0x2c0
> >> [1851513.257046]  [] unix_release+0x20/0x30
> >> [1851513.257284]  [] sock_release+0x1f/0x80
> >> [1851513.257521]  [] sock_close+0x12/0x20
> >> [1851513.257769]  [] __fput+0xea/0x1f0
> >> [1851513.258005]  [] fput+0xe/0x10
> >> [1851513.258244]  [] task_work_run+0x7f/0xb0
> >> [1851513.258488]  [] exit_to_usermode_loop+0xc0/0xd0
> >> [1851513.258728]  [] syscall_return_slowpath+0x80/0xf0
> >> [1851513.258983]  [] int_ret_from_sys_call+0x25/0x9f
> >> [1851513.259222] Code: 7e 5b 41 5c 5d c3 48 8b 8b e8 02 00 00 48 8b 93 f0 
> >> 02 00 00 48 89 51 08 48 89 0a 48 89 83 e8 02 00 00 48 89 83 f0 02 00 00 eb 
> >> b8 <0f> 0b 90 0f 1f 44 00 00 55 48 c7 c7 10 3f 07 82 48 89 e5 41 54
> >> [1851513.268473] RIP  [] unix_notinflight+0x8d/0x90
> >> [1851513.268793]  RSP 
> >>
> >> That's essentially BUG_ON(list_empty(&u->link));
> >>
> >> I see that all the code involving the ->link member hasn't really been
> >> touched since it was introduced in 2007. So this must be a latent bug.
> >> This is the first time I've observed it. The state
> >> of the struct unix_sock can be found here http://sprunge.us/WCMW . 
> >> Evidently,
> >> there are no inflight sockets.
>
> Weird.  If it was the BUG_ON(!list_empty(&u->link)) I'd understand,
> because the code in scan children looks fishy: what prevents "embryos"
> from fledging to full socket status and going in-flight?
>
> But this one, I cannot imagine any scenario.
>
> Can we have access to the crashdump?

crash> list -H gc_inflight_list unix_sock.link -s unix_sock.inflight |
grep counter | cut -d= -f2 | awk '{s+=$1} END {print s}'
130
crash> p unix_tot_inflight
unix_tot_inflight = $2 = 135

We've lost track of a total of five inflight sockets, so it's not a
one-off thing.  Really weird...  Now off to sleep, maybe I'll dream of
the solution.

Btw. I notice that this is a patched kernel.  Nothing in there that
could be relevant to this bug?

Thanks,
Miklos


[PATCH RFC 2/2] net: Iterate over present cpus only during ipstats calculation

2016-08-29 Thread Kirill Tkhai
Use the net_stats callback to iterate over the present cpus mask only.
This gives a significant performance improvement on configurations
with a large number of possible cpus.

Signed-off-by: Kirill Tkhai 
---
 net/ipv6/addrconf.c |4 +++-
 net/ipv6/af_inet6.c |   56 +++
 2 files changed, 59 insertions(+), 1 deletion(-)

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index df8425f..2ab088d 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -4970,10 +4970,12 @@ static inline void __snmp6_fill_stats64(u64 *stats, 
void __percpu *mib,
memset(buff, 0, sizeof(buff));
buff[0] = IPSTATS_MIB_MAX;
 
-   for_each_possible_cpu(c) {
+   preempt_disable();
+   for_each_present_cpu(c) {
for (i = 1; i < IPSTATS_MIB_MAX; i++)
buff[i] += snmp_get_cpu_field64(mib, c, i, syncpoff);
}
+   preempt_enable();
 
memcpy(stats, buff, IPSTATS_MIB_MAX * sizeof(u64));
	memset(&stats[IPSTATS_MIB_MAX], 0, pad);
diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index 46ad699..1c7a431 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -61,6 +61,7 @@
 #include 
 #endif
 #include 
+#include <net/stats.h>
 
 #include 
 #include 
@@ -785,6 +786,55 @@ static void ipv6_cleanup_mibs(struct net *net)
kfree(net->mib.icmpv6msg_statistics);
 }
 
+static void in6_move_stats(int src_cpu, void __percpu *mib, int max)
+{
+   int i, target_cpu;
+   u64 *ptr, v;
+
+   target_cpu = cpumask_first(cpu_online_mask);
+
+   BUG_ON(target_cpu >= nr_cpu_ids || src_cpu != smp_processor_id());
+
+   for (i = 1; i < max; i++) {
+   ptr = ((u64 *)per_cpu_ptr(mib, src_cpu)) + i;
+   v = *ptr;
+   *ptr = 0;
+
+   ptr = ((u64 *)per_cpu_ptr(mib, target_cpu)) + i;
+   *ptr += v;
+   }
+}
+
+static int ipv6_stats_cb(int cpu)
+{
+   struct net_device *dev;
+   struct inet6_dev *idev;
+   struct net *net;
+
+   WARN_ON(!irqs_disabled());
+
+   rcu_read_lock();
+   write_lock(&dev_base_lock);
+
+   for_each_net(net) {
+   if (!maybe_get_net(net))
+   continue;
+   for_each_netdev(net, dev) {
+   idev = in6_dev_get(dev);
+   if (!idev)
+   continue;
+   in6_move_stats(cpu, idev->stats.ipv6, IPSTATS_MIB_MAX);
+   in6_dev_put(idev);
+   }
+   put_net(net);
+   }
+
+   write_unlock(&dev_base_lock);
+   rcu_read_unlock();
+
+   return 0;
+}
+
 static int __net_init inet6_net_init(struct net *net)
 {
int err = 0;
@@ -986,6 +1036,10 @@ static int __init inet6_init(void)
if (err)
goto pingv6_fail;
 
+   err = register_net_stats_cb(ipv6_stats_cb);
+   if (err)
+   goto net_stats_fail;
+
err = calipso_init();
if (err)
goto calipso_fail;
@@ -1003,6 +1057,8 @@ static int __init inet6_init(void)
calipso_exit();
 #endif
 calipso_fail:
+   unregister_net_stats_cb(ipv6_stats_cb);
+net_stats_fail:
pingv6_exit();
 pingv6_fail:
ipv6_packet_cleanup();



Re: [net-next RFC v2 8/9] samples/bpf: Add limit_connections, remap_bind checmate examples / tests

2016-08-29 Thread Alexei Starovoitov
On Mon, Aug 29, 2016 at 04:47:46AM -0700, Sargun Dhillon wrote:
> 1) limit_connections
> This program performs connection limiting using a probabilistic
> data structure. It ensures that for a given 2-tuple, there will never be
> more than 10 connections. The parameters themselves are adjustable
> to allow for trading off memory usage vs. collision likelihood. The
> reason for not refcnting 2-tuples using atomic counters is the lack of
> a safe free mechanism.
> 
> In order to run this program, you may need to bump your ulimit -l.
> 
> 2) remap_bind
> This program rewrites binds from 6789 to 12345. It is meant to mimic
> the usage of DNAT.

these two are great examples of what lsm+bpf can be capable of.
Thanks!
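
For readers curious what the remap_bind program could look like: a hedged
sketch below, where the section name, the checmate_ctx layout and the
socket_bind.address field are assumptions; only bpf_probe_write_checmate
is taken from this series:

	SEC("checmate/socket_bind")
	int remap_bind(struct checmate_ctx *ctx)
	{
		struct sockaddr_in addr = {};

		/* ctx->socket_bind.address is an assumed field name */
		bpf_probe_read(&addr, sizeof(addr), ctx->socket_bind.address);
		if (addr.sin_family == AF_INET && addr.sin_port == htons(6789)) {
			addr.sin_port = htons(12345);
			bpf_probe_write_checmate(ctx, ctx->socket_bind.address,
						 &addr, sizeof(addr));
		}
		return 0;
	}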

> Signed-off-by: Sargun Dhillon 
> ---
>  samples/bpf/Makefile  |  10 ++
>  samples/bpf/bpf_helpers.h |   2 +
>  samples/bpf/bpf_load.c|  11 +-
>  samples/bpf/checmate_limit_connections_kern.c | 146 
> ++
>  samples/bpf/checmate_limit_connections_user.c | 113 
>  samples/bpf/checmate_remap_bind_kern.c|  28 +
>  samples/bpf/checmate_remap_bind_user.c|  82 +++
>  7 files changed, 389 insertions(+), 3 deletions(-)
>  create mode 100644 samples/bpf/checmate_limit_connections_kern.c
>  create mode 100644 samples/bpf/checmate_limit_connections_user.c
>  create mode 100644 samples/bpf/checmate_remap_bind_kern.c
>  create mode 100644 samples/bpf/checmate_remap_bind_user.c
> 
> diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
> index 5d2c178..ee5de8c 100644
> --- a/samples/bpf/Makefile
> +++ b/samples/bpf/Makefile
> @@ -25,6 +25,8 @@ hostprogs-y += test_cgrp2_array_pin
>  hostprogs-y += xdp1
>  hostprogs-y += xdp2
>  hostprogs-y += test_current_task_under_cgroup
> +hostprogs-y += checmate_remap_bind
> +hostprogs-y += checmate_limit_connections
>  
>  test_verifier-objs := test_verifier.o libbpf.o
>  test_maps-objs := test_maps.o libbpf.o
> @@ -52,6 +54,10 @@ xdp1-objs := bpf_load.o libbpf.o xdp1_user.o
>  xdp2-objs := bpf_load.o libbpf.o xdp1_user.o
>  test_current_task_under_cgroup-objs := bpf_load.o libbpf.o cgroup_helpers.o \
>  test_current_task_under_cgroup_user.o
> +checmate_remap_bind-objs := bpf_load.o libbpf.o cgroup_helpers.o \
> + checmate_remap_bind_user.o
> +checmate_limit_connections-objs := bpf_load.o libbpf.o cgroup_helpers.o \
> +checmate_limit_connections_user.o
>  
>  # Tell kbuild to always build the programs
>  always := $(hostprogs-y)
> @@ -79,6 +85,8 @@ always += test_cgrp2_tc_kern.o
>  always += xdp1_kern.o
>  always += xdp2_kern.o
>  always += test_current_task_under_cgroup_kern.o
> +always += checmate_remap_bind_kern.o
> +always += checmate_limit_connections_kern.o
>  
>  HOSTCFLAGS += -I$(objtree)/usr/include
>  
> @@ -103,6 +111,8 @@ HOSTLOADLIBES_test_overhead += -lelf -lrt
>  HOSTLOADLIBES_xdp1 += -lelf
>  HOSTLOADLIBES_xdp2 += -lelf
>  HOSTLOADLIBES_test_current_task_under_cgroup += -lelf
> +HOSTLOADLIBES_checmate_remap_bind += -lelf
> +HOSTLOADLIBES_checmate_limit_connections += -lelf
>  
>  # Allows pointing LLC/CLANG to a LLVM backend with bpf support, redefine on 
> cmdline:
>  #  make samples/bpf/ LLC=~/git/llvm/build/bin/llc 
> CLANG=~/git/llvm/build/bin/clang
> diff --git a/samples/bpf/bpf_helpers.h b/samples/bpf/bpf_helpers.h
> index bbdf62a..da97ced 100644
> --- a/samples/bpf/bpf_helpers.h
> +++ b/samples/bpf/bpf_helpers.h
> @@ -55,6 +55,8 @@ static int (*bpf_skb_get_tunnel_opt)(void *ctx, void *md, 
> int size) =
>   (void *) BPF_FUNC_skb_get_tunnel_opt;
>  static int (*bpf_skb_set_tunnel_opt)(void *ctx, void *md, int size) =
>   (void *) BPF_FUNC_skb_set_tunnel_opt;
> +static int (*bpf_probe_write_checmate)(void *ctx, void *dst, void *src, int 
> len) =
> + (void *) BPF_FUNC_probe_write_checmate;
>  
>  /* llvm builtin functions that eBPF C program may use to
>   * emit BPF_LD_ABS and BPF_LD_IND instructions
> diff --git a/samples/bpf/bpf_load.c b/samples/bpf/bpf_load.c
> index 0cfda23..e12460a 100644
> --- a/samples/bpf/bpf_load.c
> +++ b/samples/bpf/bpf_load.c
> @@ -51,6 +51,7 @@ static int load_and_attach(const char *event, struct 
> bpf_insn *prog, int size)
>   bool is_kretprobe = strncmp(event, "kretprobe/", 10) == 0;
>   bool is_tracepoint = strncmp(event, "tracepoint/", 11) == 0;
>   bool is_xdp = strncmp(event, "xdp", 3) == 0;
> + bool is_checmate = strncmp(event, "checmate", 8) == 0;
>   enum bpf_prog_type prog_type;
>   char buf[256];
>   int fd, efd, err, id;
> @@ -69,6 +70,8 @@ static int load_and_attach(const char *event, struct 
> bpf_insn *prog, int size)
>   prog_type = BPF_PROG_TYPE_TRACEPOINT;
>   } else if (is_xdp) {
>   prog_type = BPF_PROG_TYPE_XDP;
> + } else if (is_checmate) {
> + prog_type = 

Re: [PATCH v3 5/6] net: core: run cgroup eBPF egress programs

2016-08-29 Thread Sargun Dhillon
On Tue, Aug 30, 2016 at 12:03:23AM +0200, Daniel Borkmann wrote:
> On 08/26/2016 09:58 PM, Daniel Mack wrote:
> >If the cgroup associated with the receiving socket has an eBPF
> >programs installed, run them from __dev_queue_xmit().
> >
> >eBPF programs used in this context are expected to either return 1 to
> >let the packet pass, or != 1 to drop them. The programs have access to
> >the full skb, including the MAC headers.
> >
> >Note that cgroup_bpf_run_filter() is stubbed out as static inline nop
> >for !CONFIG_CGROUP_BPF, and is otherwise guarded by a static key if
> >the feature is unused.
> >
> >Signed-off-by: Daniel Mack 
> >---
> >  net/core/dev.c | 6 ++
> >  1 file changed, 6 insertions(+)
> >
> >diff --git a/net/core/dev.c b/net/core/dev.c
> >index a75df86..17484e6 100644
> >--- a/net/core/dev.c
> >+++ b/net/core/dev.c
> >@@ -141,6 +141,7 @@
> >  #include 
> >  #include 
> >  #include 
> >+#include 
> >
> >  #include "net-sysfs.h"
> >
> >@@ -3329,6 +3330,11 @@ static int __dev_queue_xmit(struct sk_buff *skb, void 
> >*accel_priv)
> > if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_SCHED_TSTAMP))
> > __skb_tstamp_tx(skb, NULL, skb->sk, SCM_TSTAMP_SCHED);
> >
> >+rc = cgroup_bpf_run_filter(skb->sk, skb,
> >+   BPF_ATTACH_TYPE_CGROUP_INET_EGRESS);
> >+if (rc)
> >+return rc;
> 
> This would leak the whole skb by the way.
> 
> Apart from that, could this be modeled w/o affecting the forwarding path (at 
> some
> local output point where we know to have a valid socket)? Then you could also 
> drop
> the !sk and sk->sk_family tests, and we wouldn't need to replicate parts of 
> what
> clsact is doing as well. Hmm, maybe access to src/dst mac could be handled to 
> be
> just zeroes since not available at that point?
> 
> > /* Disable soft irqs for various locks below. Also
> >  * stops preemption for RCU.
> >  */
> >
Given this patchset only affects AF_INET and AF_INET6, why not put the hooks
at ip_output and ip6_output?


Re: [net-next RFC v2 4/9] bpf, security: Add Checmate security LSM and BPF program type

2016-08-29 Thread Mickaël Salaün

On 29/08/2016 23:49, Alexei Starovoitov wrote:
[...]

Re: [PATCH v3 1/6] bpf: add new prog type for cgroup socket filtering

2016-08-29 Thread Daniel Borkmann

On 08/26/2016 09:58 PM, Daniel Mack wrote:

For now, this program type is equivalent to BPF_PROG_TYPE_SOCKET_FILTER in
terms of checks during the verification process. It may access the skb as
well.

Programs of this type will be attached to cgroups for network filtering
and accounting.

Signed-off-by: Daniel Mack 
---
  include/uapi/linux/bpf.h | 7 +++
  kernel/bpf/verifier.c| 1 +
  net/core/filter.c| 6 ++
  3 files changed, 14 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index e4c5a1b..1d5db42 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -95,6 +95,13 @@ enum bpf_prog_type {
BPF_PROG_TYPE_SCHED_ACT,
BPF_PROG_TYPE_TRACEPOINT,
BPF_PROG_TYPE_XDP,
+   BPF_PROG_TYPE_CGROUP_SOCKET_FILTER,
+};


Nit: can we drop the _FILTER suffix? So just leaving it
at BPF_PROG_TYPE_CGROUP_SOCKET. Some of these use cases
might not always strictly be related to filtering, so
seems cleaner to just leave it out everywhere.


+
+enum bpf_attach_type {
+   BPF_ATTACH_TYPE_CGROUP_INET_INGRESS,
+   BPF_ATTACH_TYPE_CGROUP_INET_EGRESS,
+   __MAX_BPF_ATTACH_TYPE
  };


#define MAX_BPF_ATTACH_TYPE __MAX_BPF_ATTACH_TYPE

And then use that in your follow-up patches for declaring
arrays, etc?
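
For instance, a sketch of what that would look like, assuming the struct
from patch 2/6 is named cgroup_bpf (the prog/effective field names are
taken from that patch):

	struct cgroup_bpf {
		struct bpf_prog *prog[MAX_BPF_ATTACH_TYPE];
		struct bpf_prog __rcu *effective[MAX_BPF_ATTACH_TYPE];
	};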



  #define BPF_PSEUDO_MAP_FD 1
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index abb61f3..12ca880 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -1805,6 +1805,7 @@ static bool may_access_skb(enum bpf_prog_type type)
case BPF_PROG_TYPE_SOCKET_FILTER:
case BPF_PROG_TYPE_SCHED_CLS:
case BPF_PROG_TYPE_SCHED_ACT:
+   case BPF_PROG_TYPE_CGROUP_SOCKET_FILTER:
return true;
default:
return false;
diff --git a/net/core/filter.c b/net/core/filter.c
index a83766b..bc04e5c 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2848,12 +2848,18 @@ static struct bpf_prog_type_list xdp_type __read_mostly 
= {
.type   = BPF_PROG_TYPE_XDP,
  };

+static struct bpf_prog_type_list cg_sk_filter_type __read_mostly = {
+   .ops    = &sk_filter_ops,
+   .type   = BPF_PROG_TYPE_CGROUP_SOCKET_FILTER,
+};
+
  static int __init register_sk_filter_ops(void)
  {
	bpf_register_prog_type(&sk_filter_type);
	bpf_register_prog_type(&sched_cls_type);
	bpf_register_prog_type(&sched_act_type);
	bpf_register_prog_type(&xdp_type);
+	bpf_register_prog_type(&cg_sk_filter_type);

return 0;
  }





Re: [PATCH v3 5/6] net: core: run cgroup eBPF egress programs

2016-08-29 Thread Daniel Borkmann

On 08/26/2016 09:58 PM, Daniel Mack wrote:

If the cgroup associated with the receiving socket has an eBPF
programs installed, run them from __dev_queue_xmit().

eBPF programs used in this context are expected to either return 1 to
let the packet pass, or != 1 to drop them. The programs have access to
the full skb, including the MAC headers.

Note that cgroup_bpf_run_filter() is stubbed out as static inline nop
for !CONFIG_CGROUP_BPF, and is otherwise guarded by a static key if
the feature is unused.

Signed-off-by: Daniel Mack 
---
  net/core/dev.c | 6 ++
  1 file changed, 6 insertions(+)

diff --git a/net/core/dev.c b/net/core/dev.c
index a75df86..17484e6 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -141,6 +141,7 @@
  #include 
  #include 
  #include 
+#include 

  #include "net-sysfs.h"

@@ -3329,6 +3330,11 @@ static int __dev_queue_xmit(struct sk_buff *skb, void 
*accel_priv)
if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_SCHED_TSTAMP))
__skb_tstamp_tx(skb, NULL, skb->sk, SCM_TSTAMP_SCHED);

+   rc = cgroup_bpf_run_filter(skb->sk, skb,
+  BPF_ATTACH_TYPE_CGROUP_INET_EGRESS);
+   if (rc)
+   return rc;


This would leak the whole skb by the way.

Apart from that, could this be modeled w/o affecting the forwarding path (at 
some
local output point where we know to have a valid socket)? Then you could also 
drop
the !sk and sk->sk_family tests, and we wouldn't need to replicate parts of what
clsact is doing as well. Hmm, maybe access to src/dst mac could be handled to be
just zeroes since not available at that point?


/* Disable soft irqs for various locks below. Also
 * stops preemption for RCU.
 */
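
A minimal sketch of plugging the leak pointed out above would be to
consume the skb on the drop path before returning:

	rc = cgroup_bpf_run_filter(skb->sk, skb,
				   BPF_ATTACH_TYPE_CGROUP_INET_EGRESS);
	if (rc) {
		kfree_skb(skb);	/* don't leak the skb when the program drops it */
		return rc;
	}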



Re: [net-next RFC v2 4/9] bpf, security: Add Checmate security LSM and BPF program type

2016-08-29 Thread Alexei Starovoitov

On 8/29/16 12:24 PM, Tejun Heo wrote:

Hello, Sargun.

On Mon, Aug 29, 2016 at 11:49:07AM -0700, Sargun Dhillon wrote:

It would be a separate hook per LSM hook. Why wouldn't we want a separate bpf
hook per lsm hook? I think if one program has to handle them all, the first
program would be looking up the hook program in a bpf prog array. If you think
it's better to have this logic in the BPF program, that makes sense.

I had a version of this patch that allowed you to attach a prog array instead,
but I think that it's cleaner attaching a program per lsm hook. In addition,
there's a performance impact that comes from these hooks, so I wouldn't want to
execute unnecessary code if it's avoidable.


Hmm... it doesn't really matter how the backend part looks like and if
we need to implement per-call hooks to lower runtime overhead, sure.
I was mostly worried about the approach propagating through the
userland visible interface.


The prog array approach also makes stacking filters difficult. If people want
multiple filters per hook, the orchestrator would have to rewrite the existing
filters to be cooperative.


I'm not really sure "stacking" in the kernel side is a good idea.
Please see below.


I'm not convinced about the approach.  It's an approach which pretty
much requires future extensions while being rigid.  Not a good
combination.


Do you have an alternative recommendation? Maybe just a set of 5 u64s
as the context object along with the hook ID?


cgroup fs doesn't seem like the right interface for this but if it
were I'd go for named hook IDs instead of opaque numbers.


Unless this is properly delegatable, IOW, it's safe to fully delegate
to a lesser security domain for all operations including program
loading and assignment (I can't see how that'd be the case), making it
an explicit controller doesn't work in terms of userland interface.
It's fine for bpf / lsm / whatever to attach to cgroups by extending
struct cgroup itself or implementing an implicit controller but to be
visible as an explicit controller it must be able to follow cgroup
interface rules including delegation.  If not, it's best to come
through the interface which enforces the required permission checks
and then talk to cgroup from there.  This was also an issue with
network cgroup bpf programs that Daniel Mack is working on.  Please
chat with him.


Program assignment is possible by lesser security domains. Program loading is
limited to CAP_SYS_ADMIN in init_user_ns. We could make it accessible to
CAP_SYS_ADMIN in any userns, but the reasoning behind this is that Checmate
BPF programs can leak kernel pointers.


That doesn't make much sense to me.  Delegation doesn't mean much if a
delegatee can't load its own program (and I don't see how one can
delegate kernel pointer access to !root).  Also, unless there's
per-program fine control on who can load it, it seems pretty dangerous
to let anyone load any program.


Could we potentially restrict it to only CAP_MAC_OVERRIDE, while still meeting
cgroup delegation requirements?


Wouldn't it make far more sense to pass cgroup fd to bpf syscall so
that "load this program" and "attach this program to the cgroup
identified by this fd" through the same interface and permission
checks?  cgroup participating in bpf operations is all fine but
splitting the userland interface across two domains seems like a bad
idea.


Filters which are higher up in the heirarchy will still be enforced during
delegation. This was an explicit design, as the "Orchestrator in Orchestrator"
use case needs to be supported.


Given that program loading is restricted to root, wouldn't it be a
lot more efficient approach to let userland multiplex multiple
programs?  Walking the tree executing bpf programs each time one of
these operations runs can be pretty expensive.  Imagine a tree like
the following.

A - B - C
  \ D

Let's say program is currently loaded on D.  If someone wants to add a
program on B, userland can load the program on B, combine B's and D's
program and compile them into a single program and load it on D.  The
only thing kernel would need to do in terms of hierarchy is finding
what's the closest program to execute.  In the above example, C would
need to use B's program and that can be determined on program
assignment time rather than on each operation.


I think that's exactly what Daniel's patches are doing and imo
it makes sense to keep this style for lsm as well
and also apply the concept of hook_id.
Daniel adds two commands to bpf syscall to attach/detach from cgroup
with hook_id.
Initially two hooks will be for socket rx and tx.
Then all interesting lsm hooks can be added one by one.
Daniel's prog type will be bpf_prog_type_cgroup_socket_filter.
LSM's prog type will be bpf_prog_type_lsm.
And verifier can check type safety since the lsm hook_id will be
passed at the program load time.
See another thread we had with Mickael.

landlock and checmate are very similar and 

[PATCH RFC 1/2] net: Implement net_stats callbacks

2016-08-29 Thread Kirill Tkhai
net_stats callbacks are functions which are called
while a cpu is going down. They operate on percpu
statistics and should move them from the dying cpu
to an alive one.

The callbacks are called at the CPU_DYING stage,
when the machine is stopped, and they are executed
on the dying cpu.

This allows the summation of percpu statistics to
iterate over present cpus only, instead of all
possible cpus, minimizing its overhead. It may give
a significant performance improvement on
configurations with a big number of possible cpus.

Signed-off-by: Kirill Tkhai 
---
 include/net/stats.h |9 ++
 net/core/Makefile   |1 +
 net/core/stats.c|   83 +++
 3 files changed, 93 insertions(+)
 create mode 100644 include/net/stats.h
 create mode 100644 net/core/stats.c

diff --git a/include/net/stats.h b/include/net/stats.h
new file mode 100644
index 000..7aebc56
--- /dev/null
+++ b/include/net/stats.h
@@ -0,0 +1,9 @@
+#ifndef _NET_STATS_H_
+#define _NET_STATS_H_
+
+typedef int (*net_stats_cb_t)(int cpu);
+
+extern int register_net_stats_cb(net_stats_cb_t func);
+extern int unregister_net_stats_cb(net_stats_cb_t func);
+
+#endif /* _NET_STATS_H_ */
diff --git a/net/core/Makefile b/net/core/Makefile
index d6508c2..c1093c3 100644
--- a/net/core/Makefile
+++ b/net/core/Makefile
@@ -27,3 +27,4 @@ obj-$(CONFIG_LWTUNNEL) += lwtunnel.o
 obj-$(CONFIG_DST_CACHE) += dst_cache.o
 obj-$(CONFIG_HWBM) += hwbm.o
 obj-$(CONFIG_NET_DEVLINK) += devlink.o
+obj-$(CONFIG_HOTPLUG_CPU) += stats.o
diff --git a/net/core/stats.c b/net/core/stats.c
new file mode 100644
index 000..0307e2c
--- /dev/null
+++ b/net/core/stats.c
@@ -0,0 +1,83 @@
+#include 
+#include 
+#include 
+#include 
+
+struct net_stats_cb_entry {
+   struct list_head list;
+   net_stats_cb_t func;
+};
+
+static DEFINE_SPINLOCK(net_stats_lock);
+static LIST_HEAD(net_stats_cb_list);
+
+int register_net_stats_cb(net_stats_cb_t func)
+{
+   struct net_stats_cb_entry *entry = vmalloc(sizeof(*entry));
+
+   if (!entry)
+   return -ENOMEM;
+   entry->func = func;
+   spin_lock(&net_stats_lock);
+   list_add_tail(&entry->list, &net_stats_cb_list);
+   spin_unlock(&net_stats_lock);
+   return 0;
+}
+EXPORT_SYMBOL(register_net_stats_cb);
+
+int unregister_net_stats_cb(net_stats_cb_t func)
+{
+   struct net_stats_cb_entry *entry;
+   bool found = false;
+
+   spin_lock(&net_stats_lock);
+   list_for_each_entry(entry, &net_stats_cb_list, list) {
+   if (entry->func == func) {
+   list_del(&entry->list);
+   found = true;
+   break;
+   }
+   }
+   spin_unlock(&net_stats_lock);
+
+   if (!found)
+   return -ENOENT;
+
+   vfree(entry);
+   return 0;
+}
+EXPORT_SYMBOL(unregister_net_stats_cb);
+
+static int net_stats_cpu_notify(struct notifier_block *nb,
+   unsigned long action, void *hcpu)
+{
+   struct net_stats_cb_entry *entry;
+   long cpu = (long)hcpu;
+   int ret = 0;
+
+   if ((action & ~CPU_TASKS_FROZEN) == CPU_DYING) {
+   /* We call callbacks in dying stage, when machine is stopped */
+   spin_lock(&net_stats_lock);
+   list_for_each_entry(entry, &net_stats_cb_list, list) {
+   ret = entry->func(cpu);
+   if (ret)
+   break;
+   }
+   spin_unlock(&net_stats_lock);
+
+   if (ret)
+   return NOTIFY_BAD;
+   }
+   return NOTIFY_OK;
+}
+
+static struct notifier_block net_stats_nfb = {
+   .notifier_call = net_stats_cpu_notify,
+};
+
+static int __init net_stats_init(void)
+{
+   return register_cpu_notifier(&net_stats_nfb);
+}
+
+subsys_initcall(net_stats_init);



Re: [PATCH] net: ethernet: renesas: sh_eth: do not access POST registers if not exist

2016-08-29 Thread Sergei Shtylyov

Hello.

On 08/29/2016 05:41 PM, Chris Brandt wrote:


The RZ/A1 has a TSU, but since it only has one Ethernet port, it does
not have POST registers.


   I'm not sure the reason is having one port... do you have the old SH manuals 
somewhere? :-)


Yes, I used to support the SH7757.


   Good to have someone with the old SH manuals. :-)


It had dual ETHER and dual GETHER (but only the GETHER had a TSU).
Since the GETHER had 2 ports and the same TSU is used for both, there is an 
option whereby, as you put in CAM entries for multicast frames and such, you can 
select whether you would like CAM-entry packets received on one port to be 
automatically relayed over to the other port for processing (i.e., a bridge 
network).
The POST1-POST4 registers are what you use to select what CAM entries you want 
relayed.


   SH7757 is not the only platform with TSU, there's e.g. R8A7740 ARM SoC 
which only has 1 GETHER channel...



For example:

   Register: CAM Entry Table POST1 Register (TSU_POST1)
   Bits: 31 to 28
   Bit name: POST0[3:0]
Description: These bits set the conditions for referring to CAM entry table 0. 
By
 setting multiple bits to 1, multiple conditions can be selected.

 POST0[3]: CAM entry table 0 is referred to in port 0 reception
 POST0[2]: CAM entry table 0 is referred to in port 0 to 1 relay.
 POST0[1]: CAM entry table 0 is referred to in port 1 reception.
 POST0[0]: CAM entry table 0 is referred to in port 1 to 0 relay


So, as you can see, having the POST registers means nothing if you only have 1 
Ethernet port
attached to the TSU.


   That probably means we have more issues on other SoCs having TSU...


Note, if you look at function sh_eth_tsu_get_post_bit, the relay functionality 
is never used
anyway because each entry will only ever be set to "port 0 reception" or "port 1 
reception".


   Yes, seeing this...


.shift_rd0  = 1,
 };

@@ -2460,6 +2461,9 @@ static void sh_eth_tsu_enable_cam_entry_post(struct 
net_device *ndev,
u32 tmp;
void *reg_offset;

+   if (mdp->cd->tsu_no_post)
+   return;
+
reg_offset = sh_eth_tsu_get_post_reg_offset(mdp, entry);


   I'd check check for SH_ETH_OFFSET_INVALID in the above function and return 
NULL if so; then
we can check for NULL here...
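
A sketch of that variant, assuming the existing helper name and the
reg_offset table in sh_eth_private (only SH_ETH_OFFSET_INVALID is
certain from this thread):

	static void *sh_eth_tsu_get_post_reg_offset(struct sh_eth_private *mdp,
						    int entry)
	{
		if (mdp->reg_offset[TSU_POST1] == SH_ETH_OFFSET_INVALID)
			return NULL;	/* this TSU has no POST registers */

		/* ... existing offset computation unchanged ... */
	}

	/* and in sh_eth_tsu_enable_cam_entry_post(): */
	reg_offset = sh_eth_tsu_get_post_reg_offset(mdp, entry);
	if (!reg_offset)
		return;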


   I hadn't thought it thru it seems... I meant that the check includes 
WARN_ON() but we generally should avoid hitting this code, so another check is 
needed beforehand, like that one you added...



That was similar to my first fix. But then I got to thinking that if a new RZ 
comes out with dual Ethernet, the designers might add the POST registers back 
into the TSU. And in that case, it would be nice to reuse the 
sh_eth_is_rz_fast_ether array (just add in the extra registers).


   It seems having something like 'dual_ch' flag would fit...


But...maybe checking for SH_ETH_OFFSET_INVALID is good for now instead of 
worrying about that.


   We just need 2 patches, I think...


Oh, and I'll have to correct your language and terminology. :-/ Should be
"if they don't exist" in the subject.


Ya, my first subject line was too long,


   For what?


so I kept trying to shorten it and then it turned into bad grammar.

I think a more clear subject is "Fix TSU without relay feature accesses"


   Or a "single channel TSU".


Chris


MBR, Sergei



Re: [RFCv2 16/16] nfp: bpf: add offload of TC direct action mode

2016-08-29 Thread Daniel Borkmann

On 08/26/2016 08:06 PM, Jakub Kicinski wrote:

Add offload of TC in direct action mode.  We just need
to provide appropriate checks in the verifier and
a new outro block to translate the exit codes to what
data path expects

Signed-off-by: Jakub Kicinski 

[...]

+static void nfp_outro_tc_da(struct nfp_prog *nfp_prog)
+{
+   /* TC direct-action mode:


Would have made this the only supported mode, but I understand you
want to have the legacy drop/redir actions, fair enough.


+ *   0,1   ok       NOT SUPPORTED[1]
+ *   2     drop     0x22 -> drop,  count as stat1
+ *   4,5   nuke     0x02 -> drop
+ *   7     redir    0x44 -> redir, count as stat2
+ *   *     unspec   0x11 -> pass,  count as stat0
+*
+* [1] We can't support OK and RECLASSIFY because we can't tell TC
+* the exact decision made.  We are forced to support UNSPEC
+* to handle aborts so that's the only one we handle for passing
+* packets up the stack.


In da mode, RECLASSIFY is not supported, so this one could be scratched.
For the OK and UNSPEC part, couldn't both be treated the same (as in: OK /
pass to stack roughly equivalent as in sch_handle_ingress())? Or is the
issue that you cannot populate skb->tc_index when passing to stack (maybe
just fine to leave it at 0 for now)?
Just curious, does TC_ACT_REDIRECT work in this scenario?


+*/
+   /* Target for aborts */
+   nfp_prog->tgt_abort = nfp_prog_current_offset(nfp_prog);
+
+   emit_br_def(nfp_prog, nfp_prog->tgt_done, 2);
+
+   emit_alu(nfp_prog, reg_a(0),
+reg_none(), ALU_OP_NONE, NFP_BPF_ABI_FLAGS);
+   emit_ld_field(nfp_prog, reg_a(0), 0xc, reg_imm(0x11), SHF_SC_L_SHF, 16);
+
+   /* Target for normal exits */
+   nfp_prog->tgt_out = nfp_prog_current_offset(nfp_prog);
+
+   /* if R0 > 7 jump to abort */
+   emit_alu(nfp_prog, reg_none(), reg_imm(7), ALU_OP_SUB, reg_b(0));
+   emit_br(nfp_prog, BR_BLO, nfp_prog->tgt_abort, 0);
+   emit_alu(nfp_prog, reg_a(0),
+reg_none(), ALU_OP_NONE, NFP_BPF_ABI_FLAGS);
+
+   wrp_immed(nfp_prog, reg_b(2), 0x41221211);
+   wrp_immed(nfp_prog, reg_b(3), 0x41001211);
+
+   emit_shf(nfp_prog, reg_a(1),
+reg_none(), SHF_OP_NONE, reg_b(0), SHF_SC_L_SHF, 2);
+
+   emit_alu(nfp_prog, reg_none(), reg_a(1), ALU_OP_OR, reg_imm(0));
+   emit_shf(nfp_prog, reg_a(2),
+reg_imm(0xf), SHF_OP_AND, reg_b(2), SHF_SC_R_SHF, 0);
+
+   emit_alu(nfp_prog, reg_none(), reg_a(1), ALU_OP_OR, reg_imm(0));
+   emit_shf(nfp_prog, reg_b(2),
+reg_imm(0xf), SHF_OP_AND, reg_b(3), SHF_SC_R_SHF, 0);
+
+   emit_br_def(nfp_prog, nfp_prog->tgt_done, 2);
+
+   emit_shf(nfp_prog, reg_b(2),
+reg_a(2), SHF_OP_OR, reg_b(2), SHF_SC_L_SHF, 4);
+   emit_ld_field(nfp_prog, reg_a(0), 0xc, reg_b(2), SHF_SC_L_SHF, 16);
+}


[PATCH net-next] MAINTAINERS: update to working email address

2016-08-29 Thread Andy Gospodarek
Signed-off-by: Andy Gospodarek 
---
 MAINTAINERS | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index 99f9527..ccf186d 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -2484,7 +2484,7 @@ F:include/net/bluetooth/
 BONDING DRIVER
 M: Jay Vosburgh 
 M: Veaceslav Falico 
-M: Andy Gospodarek 
+M: Andy Gospodarek 
 L: netdev@vger.kernel.org
 W: http://sourceforge.net/projects/bonding/
 S: Supported
-- 
2.1.4



Re: [RFCv2 11/16] net: cls_bpf: allow offloaded filters to update stats

2016-08-29 Thread Daniel Borkmann

On 08/26/2016 08:06 PM, Jakub Kicinski wrote:

Call into offloaded filters to update stats.

Signed-off-by: Jakub Kicinski 


Acked-by: Daniel Borkmann 


[PATCH net-next 0/3] net: dsa: add MDB support

2016-08-29 Thread Vivien Didelot
This patchset adds the switchdev MDB object support to the DSA layer.

The MDB support for the mv88e6xxx driver is very similar to the FDB
support. The FDB operations care about unicast addresses while the MDB
operations care about multicast addresses.

Both operation sets load/purge/dump the Address Translation Unit (ATU),
thus common code is used.

Vivien Didelot (3):
  net: dsa: add MDB support
  net: dsa: mv88e6xxx: make switchdev DB ops generic
  net: dsa: mv88e6xxx: add MDB support

 Documentation/networking/dsa/dsa.txt |  23 +
 drivers/net/dsa/mv88e6xxx/chip.c | 161 ++-
 include/net/dsa.h|  16 
 net/dsa/slave.c  |  55 
 4 files changed, 213 insertions(+), 42 deletions(-)

-- 
2.9.3



[PATCH net-next 3/3] net: dsa: mv88e6xxx: add MDB support

2016-08-29 Thread Vivien Didelot
Add support for the MDB operations. This consists of
loading/purging/dumping multicast addresses for a given port in the ATU.

Signed-off-by: Vivien Didelot 
---
 drivers/net/dsa/mv88e6xxx/chip.c | 65 
 1 file changed, 65 insertions(+)

diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
index 93abfff..812cb47 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.c
+++ b/drivers/net/dsa/mv88e6xxx/chip.c
@@ -2240,6 +2240,15 @@ static int mv88e6xxx_port_db_dump_one(struct 
mv88e6xxx_chip *chip,
fdb->ndm_state = NUD_NOARP;
else
fdb->ndm_state = NUD_REACHABLE;
+   } else {
+   struct switchdev_obj_port_mdb *mdb;
+
+   if (!is_multicast_ether_addr(addr.mac))
+   continue;
+
+   mdb = SWITCHDEV_OBJ_PORT_MDB(obj);
+   mdb->vid = vid;
+   ether_addr_copy(mdb->addr, addr.mac);
}
 
err = cb(obj);
@@ -3988,6 +3997,58 @@ free:
return NULL;
 }
 
+static int mv88e6xxx_port_mdb_prepare(struct dsa_switch *ds, int port,
+ const struct switchdev_obj_port_mdb *mdb,
+ struct switchdev_trans *trans)
+{
+   /* We don't need any dynamic resource from the kernel (yet),
+* so skip the prepare phase.
+*/
+
+   return 0;
+}
+
+static void mv88e6xxx_port_mdb_add(struct dsa_switch *ds, int port,
+  const struct switchdev_obj_port_mdb *mdb,
+  struct switchdev_trans *trans)
+{
+   struct mv88e6xxx_chip *chip = ds_to_priv(ds);
+
+   mutex_lock(&chip->reg_lock);
+   if (mv88e6xxx_port_db_load_purge(chip, port, mdb->addr, mdb->vid,
+GLOBAL_ATU_DATA_STATE_MC_STATIC))
+   netdev_err(ds->ports[port].netdev, "failed to load multicast MAC address\n");
+   mutex_unlock(&chip->reg_lock);
+}
+
+static int mv88e6xxx_port_mdb_del(struct dsa_switch *ds, int port,
+ const struct switchdev_obj_port_mdb *mdb)
+{
+   struct mv88e6xxx_chip *chip = ds_to_priv(ds);
+   int err;
+
+   mutex_lock(&chip->reg_lock);
+   err = mv88e6xxx_port_db_load_purge(chip, port, mdb->addr, mdb->vid,
+  GLOBAL_ATU_DATA_STATE_UNUSED);
+   mutex_unlock(&chip->reg_lock);
+
+   return err;
+}
+
+static int mv88e6xxx_port_mdb_dump(struct dsa_switch *ds, int port,
+  struct switchdev_obj_port_mdb *mdb,
+  int (*cb)(struct switchdev_obj *obj))
+{
+   struct mv88e6xxx_chip *chip = ds_to_priv(ds);
+   int err;
+
+   mutex_lock(&chip->reg_lock);
+   err = mv88e6xxx_port_db_dump(chip, port, &mdb->obj, cb);
+   mutex_unlock(&chip->reg_lock);
+
+   return err;
+}
+
 static struct dsa_switch_ops mv88e6xxx_switch_ops = {
.probe  = mv88e6xxx_drv_probe,
.get_tag_protocol   = mv88e6xxx_get_tag_protocol,
@@ -4023,6 +4084,10 @@ static struct dsa_switch_ops mv88e6xxx_switch_ops = {
.port_fdb_add   = mv88e6xxx_port_fdb_add,
.port_fdb_del   = mv88e6xxx_port_fdb_del,
.port_fdb_dump  = mv88e6xxx_port_fdb_dump,
+   .port_mdb_prepare   = mv88e6xxx_port_mdb_prepare,
+   .port_mdb_add   = mv88e6xxx_port_mdb_add,
+   .port_mdb_del   = mv88e6xxx_port_mdb_del,
+   .port_mdb_dump  = mv88e6xxx_port_mdb_dump,
 };
 
 static int mv88e6xxx_register_switch(struct mv88e6xxx_chip *chip,
-- 
2.9.3



[PATCH net-next 2/3] net: dsa: mv88e6xxx: make switchdev DB ops generic

2016-08-29 Thread Vivien Didelot
The MDB support for the mv88e6xxx driver will be very similar to the FDB
support, since it consists of loading/purging/dumping addresses to/from
the Address Translation Unit (ATU).

Prepare the support for MDB by making the FDB code accessing the ATU
generic. The FDB operations now provide access to the unicast addresses
while the MDB operations will provide access to the multicast addresses.

Signed-off-by: Vivien Didelot 
---
 drivers/net/dsa/mv88e6xxx/chip.c | 98 ++--
 1 file changed, 55 insertions(+), 43 deletions(-)

diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
index 750d01d..93abfff 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.c
+++ b/drivers/net/dsa/mv88e6xxx/chip.c
@@ -2093,9 +2093,9 @@ static int _mv88e6xxx_atu_load(struct mv88e6xxx_chip 
*chip,
return _mv88e6xxx_atu_cmd(chip, entry->fid, GLOBAL_ATU_OP_LOAD_DB);
 }
 
-static int _mv88e6xxx_port_fdb_load(struct mv88e6xxx_chip *chip, int port,
-   const unsigned char *addr, u16 vid,
-   u8 state)
+static int mv88e6xxx_port_db_load_purge(struct mv88e6xxx_chip *chip, int port,
+   const unsigned char *addr, u16 vid,
+   u8 state)
 {
struct mv88e6xxx_atu_entry entry = { 0 };
struct mv88e6xxx_vtu_stu_entry vlan;
@@ -2134,15 +2134,12 @@ static void mv88e6xxx_port_fdb_add(struct dsa_switch 
*ds, int port,
   const struct switchdev_obj_port_fdb *fdb,
   struct switchdev_trans *trans)
 {
-   int state = is_multicast_ether_addr(fdb->addr) ?
-   GLOBAL_ATU_DATA_STATE_MC_STATIC :
-   GLOBAL_ATU_DATA_STATE_UC_STATIC;
struct mv88e6xxx_chip *chip = ds_to_priv(ds);
 
	mutex_lock(&chip->reg_lock);
-   if (_mv88e6xxx_port_fdb_load(chip, port, fdb->addr, fdb->vid, state))
-   netdev_err(ds->ports[port].netdev,
-  "failed to load MAC address\n");
+   if (mv88e6xxx_port_db_load_purge(chip, port, fdb->addr, fdb->vid,
+GLOBAL_ATU_DATA_STATE_UC_STATIC))
+   netdev_err(ds->ports[port].netdev, "failed to load unicast MAC address\n");
	mutex_unlock(&chip->reg_lock);
 }
 
@@ -2150,14 +2147,14 @@ static int mv88e6xxx_port_fdb_del(struct dsa_switch 
*ds, int port,
  const struct switchdev_obj_port_fdb *fdb)
 {
struct mv88e6xxx_chip *chip = ds_to_priv(ds);
-   int ret;
+   int err;
 
	mutex_lock(&chip->reg_lock);
-   ret = _mv88e6xxx_port_fdb_load(chip, port, fdb->addr, fdb->vid,
-  GLOBAL_ATU_DATA_STATE_UNUSED);
+   err = mv88e6xxx_port_db_load_purge(chip, port, fdb->addr, fdb->vid,
+  GLOBAL_ATU_DATA_STATE_UNUSED);
	mutex_unlock(&chip->reg_lock);
 
-   return ret;
+   return err;
 }
 
 static int _mv88e6xxx_atu_getnext(struct mv88e6xxx_chip *chip, u16 fid,
@@ -2205,10 +2202,10 @@ static int _mv88e6xxx_atu_getnext(struct mv88e6xxx_chip 
*chip, u16 fid,
return 0;
 }
 
-static int _mv88e6xxx_port_fdb_dump_one(struct mv88e6xxx_chip *chip,
-   u16 fid, u16 vid, int port,
-   struct switchdev_obj_port_fdb *fdb,
-   int (*cb)(struct switchdev_obj *obj))
+static int mv88e6xxx_port_db_dump_one(struct mv88e6xxx_chip *chip,
+ u16 fid, u16 vid, int port,
+ struct switchdev_obj *obj,
+ int (*cb)(struct switchdev_obj *obj))
 {
struct mv88e6xxx_atu_entry addr = {
.mac = { 0xff, 0xff, 0xff, 0xff, 0xff, 0xff },
@@ -,72 +2219,87 @@ static int _mv88e6xxx_port_fdb_dump_one(struct 
mv88e6xxx_chip *chip,
do {
		err = _mv88e6xxx_atu_getnext(chip, fid, &addr);
if (err)
-   break;
+   return err;
 
if (addr.state == GLOBAL_ATU_DATA_STATE_UNUSED)
break;
 
-   if (!addr.trunk && addr.portv_trunkid & BIT(port)) {
-   bool is_static = addr.state ==
-   (is_multicast_ether_addr(addr.mac) ?
-GLOBAL_ATU_DATA_STATE_MC_STATIC :
-GLOBAL_ATU_DATA_STATE_UC_STATIC);
+   if (addr.trunk || (addr.portv_trunkid & BIT(port)) == 0)
+   continue;
+
+   if (obj->id == SWITCHDEV_OBJ_ID_PORT_FDB) {
+   struct switchdev_obj_port_fdb *fdb;
 
+   if (!is_unicast_ether_addr(addr.mac))
+   continue;
+
+   fdb = 

[PATCH net-next 1/3] net: dsa: add MDB support

2016-08-29 Thread Vivien Didelot
Add SWITCHDEV_OBJ_ID_PORT_MDB support to the DSA layer.

Signed-off-by: Vivien Didelot 
---
 Documentation/networking/dsa/dsa.txt | 23 +++
 include/net/dsa.h| 16 +++
 net/dsa/slave.c  | 55 
 3 files changed, 94 insertions(+)

diff --git a/Documentation/networking/dsa/dsa.txt 
b/Documentation/networking/dsa/dsa.txt
index 44ed453..6db7bc8 100644
--- a/Documentation/networking/dsa/dsa.txt
+++ b/Documentation/networking/dsa/dsa.txt
@@ -584,6 +584,29 @@ of DSA, would be the its port-based VLAN, used by the 
associated bridge device.
   function that the driver has to call for each MAC address known to be behind
   the given port. A switchdev object is used to carry the VID and FDB info.
 
+- port_mdb_prepare: bridge layer function invoked when the bridge prepares the
+  installation of a multicast group database entry. If the operation is not
+  supported, this function should return -EOPNOTSUPP to inform the bridge code
+  to fallback to a software implementation. No hardware setup must be done in
+  this function. See port_fdb_add for this and details.
+
+- port_mdb_add: bridge layer function invoked when the bridge wants to install 
+  a multicast group database entry, the switch hardware should be programmed 
+  with the specified address in the specified VLAN ID in the forwarding 
database 
+  associated with this VLAN ID.
+
+Note: VLAN ID 0 corresponds to the port private database, which, in the context
+of DSA, would be its port-based VLAN, used by the associated bridge device.
+
+- port_mdb_del: bridge layer function invoked when the bridge wants to remove a
+  multicast group database entry, the switch hardware should be programmed to 
+  delete the specified MAC address from the specified VLAN ID if it was mapped 
+  into this port forwarding database.
+
+- port_mdb_dump: bridge layer function invoked with a switchdev callback
+  function that the driver has to call for each MAC address known to be behind
+  the given port. A switchdev object is used to carry the VID and MDB info.
+
 TODO
 
 
diff --git a/include/net/dsa.h b/include/net/dsa.h
index 2ebeba4..39f90c4 100644
--- a/include/net/dsa.h
+++ b/include/net/dsa.h
@@ -234,6 +234,7 @@ static inline u8 dsa_upstream_port(struct dsa_switch *ds)
 struct switchdev_trans;
 struct switchdev_obj;
 struct switchdev_obj_port_fdb;
+struct switchdev_obj_port_mdb;
 struct switchdev_obj_port_vlan;
 
 struct dsa_switch_ops {
@@ -369,6 +370,21 @@ struct dsa_switch_ops {
int (*port_fdb_dump)(struct dsa_switch *ds, int port,
 struct switchdev_obj_port_fdb *fdb,
 int (*cb)(struct switchdev_obj *obj));
+
+   /*
+* Multicast group database
+*/
+   int (*port_mdb_prepare)(struct dsa_switch *ds, int port,
+   const struct switchdev_obj_port_mdb *mdb,
+   struct switchdev_trans *trans);
+   void(*port_mdb_add)(struct dsa_switch *ds, int port,
+   const struct switchdev_obj_port_mdb *mdb,
+   struct switchdev_trans *trans);
+   int (*port_mdb_del)(struct dsa_switch *ds, int port,
+   const struct switchdev_obj_port_mdb *mdb);
+   int (*port_mdb_dump)(struct dsa_switch *ds, int port,
+struct switchdev_obj_port_mdb *mdb,
+int (*cb)(struct switchdev_obj *obj));
 };
 
 void register_switch_driver(struct dsa_switch_ops *type);
diff --git a/net/dsa/slave.c b/net/dsa/slave.c
index 9f6c2a2..9ecbe78 100644
--- a/net/dsa/slave.c
+++ b/net/dsa/slave.c
@@ -290,6 +290,50 @@ static int dsa_slave_port_fdb_dump(struct net_device *dev,
return -EOPNOTSUPP;
 }
 
+static int dsa_slave_port_mdb_add(struct net_device *dev,
+ const struct switchdev_obj_port_mdb *mdb,
+ struct switchdev_trans *trans)
+{
+   struct dsa_slave_priv *p = netdev_priv(dev);
+   struct dsa_switch *ds = p->parent;
+
+   if (switchdev_trans_ph_prepare(trans)) {
+   if (!ds->ops->port_mdb_prepare || !ds->ops->port_mdb_add)
+   return -EOPNOTSUPP;
+
+   return ds->ops->port_mdb_prepare(ds, p->port, mdb, trans);
+   }
+
+   ds->ops->port_mdb_add(ds, p->port, mdb, trans);
+
+   return 0;
+}
+
+static int dsa_slave_port_mdb_del(struct net_device *dev,
+ const struct switchdev_obj_port_mdb *mdb)
+{
+   struct dsa_slave_priv *p = netdev_priv(dev);
+   struct dsa_switch *ds = p->parent;
+
+   if (ds->ops->port_mdb_del)
+   return ds->ops->port_mdb_del(ds, p->port, mdb);
+
+   return -EOPNOTSUPP;
+}
+
+static int dsa_slave_port_mdb_dump(struct net_device *dev,

Re: [RFCv2 07/16] bpf: enable non-core use of the verfier

2016-08-29 Thread Daniel Borkmann

On 08/29/2016 10:13 PM, Daniel Borkmann wrote:

On 08/27/2016 07:32 PM, Alexei Starovoitov wrote:

On Sat, Aug 27, 2016 at 12:40:04PM +0100, Jakub Kicinski wrote:

On Fri, 26 Aug 2016 16:29:05 -0700, Alexei Starovoitov wrote:

On Fri, Aug 26, 2016 at 07:06:06PM +0100, Jakub Kicinski wrote:

Advanced JIT compilers and translators may want to use
eBPF verifier as a base for parsers or to perform custom
checks and validations.

Add ability for external users to invoke the verifier
and provide callbacks to be invoked for every instruction
checked.  For now only the most basic callback for
per-instruction pre-interpretation checks is added.  More
advanced users may also like to have a per-instruction post
callback and a state comparison callback.

Signed-off-by: Jakub Kicinski 


I like the approach. Making verifier into 'bytecode parser'
that JITs can reuse is a good design choice.


+1


The only thing I would suggest is to tweak the verifier to
avoid in-place state recording. Then I think patch 8 for
clone/unclone of the program won't be needed, since verifier
will be read-only from bytecode point of view and patch 9
also will be slightly cleaner.
I think there are very few places in verifier that do this
state keeping inside insn. It was bugging me for some time.
Good time to clean that up.
Unless I misunderstand the patches 7,8,9...


Agreed, I think the verifier only modifies the program to
store pointer types in imm field.  I will try to come up
a way around this, any suggestions?  Perhaps state_equal()


probably array_of_insn_aux_data[num_insns] should do it.
Unlike reg_state that is forked on branches, this array
is only one.


This would be for struct nfp_insn_meta, right? So, struct
bpf_ext_parser_ops could become:

static const struct bpf_ext_parser_ops nfp_bpf_pops = {
 .insn_hook = nfp_verify_insn,
 .insn_size = sizeof(struct nfp_insn_meta),
};

... where bpf_parse() would prealloc that f.e. in env->insn_meta[].


(Well, actually everything can live in env->private_data.)
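
As a rough shape of that aux-array idea (all names below are assumed,
not existing verifier API):

	struct bpf_insn_aux_data {
		enum bpf_reg_type ptr_type;	/* recorded here instead of
						 * rewriting insn->imm in place */
	};

	/* one slot per instruction, allocated once before the walk */
	env->insn_aux_data = vzalloc(env->prog->len *
				     sizeof(*env->insn_aux_data));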


logic could be modified to downgrade pointers to UNKNOWNs
when it detects other state had incompatible pointer type.


There is also small concern for patches 5 and 6 that add
more register state information. Potentially that extra
state can prevent states_equal() to recognize equivalent
states. Only patch 9 uses that info, right?


5 and 6 recognize more constant loads, those can only
upgrade some UNKNOWN_VALUEs to CONST_IMMs.  So yes, if the
verifier hits the CONST first and then tries with UNKNOWN
it will have to reverify the path.


Agree, was also my concern when I read patch 5 and 6. It would
not only be related to types, but also different imm values,
where the memcmp() could fail on. Potentially the latter can be
avoided by only checking types which should be sufficient. Hmm,
maybe only bpf_parse() should go through this stricter mode since
only relevant for drivers (otoh downside would be that bugs
would end up less likely to be found).


Another question is do you need all state walking that
verifier does or single linear pass through insns
would have worked?
Looks like you're only using CONST_IMM and PTR_TO_CTX
state, right?


I think I need all the parsing.  Right now I mostly need
the verification to check that exit codes are specific
CONST_IMMs.  Clang quite happily does this:

r0 <- 0
if (...)
r0 <- 1
exit


I see. Indeed then you'd need the verifier to walk all paths
to make sure constant return values.


I think this would still not cover the cases where you'd fetch
a return value/verdict from a map, but this should be ignored/
rejected for now, also since majority of programs are not written
in such a way.


If you only need yes/no check then such info can probably be
collected unconditionally during initial program load.
Like prog->cb_access flag.


One other comment wrt the header, when you move these things
there, would be good to prefix with bpf_* so that this doesn't
clash in future with other header files.




Re: [RFCv2 07/16] bpf: enable non-core use of the verfier

2016-08-29 Thread Daniel Borkmann

On 08/27/2016 07:32 PM, Alexei Starovoitov wrote:

On Sat, Aug 27, 2016 at 12:40:04PM +0100, Jakub Kicinski wrote:

On Fri, 26 Aug 2016 16:29:05 -0700, Alexei Starovoitov wrote:

On Fri, Aug 26, 2016 at 07:06:06PM +0100, Jakub Kicinski wrote:

Advanced JIT compilers and translators may want to use
eBPF verifier as a base for parsers or to perform custom
checks and validations.

Add the ability for external users to invoke the verifier
and provide callbacks to be invoked for every instruction
checked.  For now, only the most basic callback for
per-instruction pre-interpretation checks is added.  More
advanced users may also like to have a per-instruction post
callback and a state comparison callback.

Signed-off-by: Jakub Kicinski 


I like the approach. Making the verifier into a 'bytecode parser'
that JITs can reuse is a good design choice.


+1


The only thing I would suggest is to tweak the verifier to
avoid in-place state recording. Then I think patch 8 for
clone/unclone of the program won't be needed, since the verifier
will be read-only from the bytecode point of view, and patch 9
will also be slightly cleaner.
I think there are very few places in the verifier that do this
state keeping inside the insn. It has been bugging me for some time.
Good time to clean that up.
Unless I misunderstand patches 7, 8, 9...


Agreed, I think the verifier only modifies the program to
store pointer types in the imm field.  I will try to come up
with a way around this, any suggestions?  Perhaps the states_equal()


probably an array_of_insn_aux_data[num_insns] should do it.
Unlike the reg_state, which is forked on branches, there is
only one such array.


This would be for struct nfp_insn_meta, right? So, struct
bpf_ext_parser_ops could become:

static const struct bpf_ext_parser_ops nfp_bpf_pops = {
.insn_hook = nfp_verify_insn,
.insn_size = sizeof(struct nfp_insn_meta),
};

... where bpf_parse() would prealloc that f.e. in env->insn_meta[].

(Well, actually everything can live in env->private_data.)


logic could be modified to downgrade pointers to UNKNOWNs
when it detects that the other state had an incompatible pointer type.
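
A minimal sketch of the side-table idea suggested above (illustrative
names only, not the posted patches): the aux data lives outside the
insn stream, so the verifier never has to write into insn->imm.

struct bpf_insn_aux_data {
	enum bpf_reg_type ptr_type;	/* pointer type recorded for this insn */
};

static int alloc_insn_aux_data(struct verifier_env *env, u32 insn_cnt)
{
	/* one flat array, indexed by instruction number */
	env->insn_aux_data = kcalloc(insn_cnt,
				     sizeof(struct bpf_insn_aux_data),
				     GFP_KERNEL);
	return env->insn_aux_data ? 0 : -ENOMEM;
}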


There is also a small concern about patches 5 and 6, which add
more register state information. Potentially that extra
state can prevent states_equal() from recognizing equivalent
states. Only patch 9 uses that info, right?


5 and 6 recognize more constant loads, those can only
upgrade some UNKNOWN_VALUEs to CONST_IMMs.  So yes, if the
verifier hits the CONST first and then tries with UNKNOWN
it will have to reverify the path.


Agreed, that was also my concern when I read patches 5 and 6. It would
not only be related to types but also to different imm values, which
the memcmp() could fail on. Potentially the latter can be avoided by
only checking types, which should be sufficient. Hmm, maybe only
bpf_parse() should go through this stricter mode since it is only
relevant for drivers (otoh the downside would be that bugs would end
up less likely to be found).


Another question: do you need all the state walking that the
verifier does, or would a single linear pass through the insns
have worked?
Looks like you're only using the CONST_IMM and PTR_TO_CTX
state, right?


I think I need all the parsing.  Right now I mostly need
the verification to check that exit codes are specific
CONST_IMMs.  Clang quite happily does this:

r0 <- 0
if (...)
r0 <- 1
exit


I see. Indeed, then you'd need the verifier to walk all paths
to make sure the return values are constant.


I think this would still not cover the cases where you'd fetch
a return value/verdict from a map, but this should be ignored/
rejected for now, also since the majority of programs are not
written in such a way.


If you only need yes/no check then such info can probably be
collected unconditionally during initial program load.
Like prog->cb_access flag.


One other comment wrt the header: when you move these things
there, it would be good to prefix them with bpf_* so that they
don't clash with other header files in the future.


Re: [PATCH net-next] tcp: add tcp_add_backlog()

2016-08-29 Thread Marcelo Ricardo Leitner
On Mon, Aug 29, 2016 at 12:22:37PM -0700, Eric Dumazet wrote:
> On Mon, 2016-08-29 at 15:51 -0300, Marcelo Ricardo Leitner wrote:
> > skb->truesize = SKB_TRUESIZE(skb_end_offset(skb));
> > 
> > Shouldn't __pskb_pull_tail() already fix this? As it seems to be the
> > expected behavior, and it would have a more global effect then. For
> > drivers not using copybreak, that's needed here anyway, but maybe this
> > helps other protocols/situations too.
> 
> That would be difficult, because some callers do their own truesize
> tracking (skb might be attached/charged to a socket, so changing
> skb->truesize would need to adjust the amount that was charged)
> 
> This is why pskb_expand_head() is not allowed to mess with skb->truesize
> (but in the opposite way, since we probably are increasing
> skb->truesize)
> 

Ok, makes sense.

> Not sure it is worth the pain in fast path, where packets are consumed
> so fast that their skb->truesize being slightly over estimated is not an
> issue.

Good point, thanks.



Re: [net-next RFC v2 4/9] bpf, security: Add Checmate security LSM and BPF program type

2016-08-29 Thread Tejun Heo
Hello, Sargun.

On Mon, Aug 29, 2016 at 11:49:07AM -0700, Sargun Dhillon wrote:
> It would be a separate hook per LSM hook. Why wouldn't we want a separate bpf 
> hook per lsm hook? I think if one program has to handle them all, the first 
> program would be looking up the hook program in a bpf prog array. If you 
> think 
> it's better to have this logic in the BPF program, that makes sense. 
> 
> I had a version of this patch that allowed you to attach a prog array 
> instead, 
> but I think that it's cleaner attaching a program per lsm hook. In addition, 
> there's a performance impact that comes from these hooks, so I wouldn't want 
> to 
> execute unneccessary code if it's avoidable.

Hmm... it doesn't really matter what the backend part looks like, and if
we need to implement per-call hooks to lower runtime overhead, sure.
I was mostly worried about the approach propagating through the
userland visible interface.

> The prog array approach also makes stacking filters difficult. If people want 
> multiple filters per hook, the orchestrator would have to rewrite the 
> existing 
> filters to be cooperative.

I'm not really sure "stacking" on the kernel side is a good idea.
Please see below.

> > I'm not convinced about the approach.  It's an approach which pretty
> > much requires future extensions while being rigid.  Not a good
> > combination.
>
> Do you have an alternative recommendation? Maybe just a set of 5 u64s
> as the context object along with the hook ID?

cgroup fs doesn't seem like the right interface for this but if it
were I'd go for named hook IDs instead of opaque numbers.

> > Unless this is properly delegatable, IOW, it's safe to fully delegate
> > to a lesser security domain for all operations including program
> > loading and assignment (I can't see how that'd be the case), making it
> > an explicit controller doens't work in terms of userland interface.
> > It's fine for bpf / lsm / whatever to attach to cgroups by extending
> > struct cgroup itself or implementing an implicit controller but to be
> > visible as an explicit controller it must be able to follow cgroup
> > interface rules including delegation.  If not, it's best to come
> > through the interface which enforces the required permission checks
> > and then talk to cgroup from there.  This was also an issue with
> > network cgroup bpf programs that Daniel Mack is working on.  Please
> > chat with him.
>
> Program assignment is possible by lesser security domains. Program loading is
> limited to CAP_SYS_ADMIN in init_user_ns. We could make it accessible to
> CAP_SYS_ADMIN in any userns, but the reasoning behind this is that Checmate
> BPF programs can leak kernel pointers.

That doesn't make much sense to me.  Delegation doesn't mean much if a
delegatee can't load its own program (and I don't see how one can
delegate kernel pointer access to !root).  Also, unless there's
per-program fine control on who can load it, it seems pretty dangerous
to let anyone load any program.

> Could we potentially restrict it to only CAP_MAC_OVERRIDE, while still 
> meeting 
> cgroup delegation requirements?

Wouldn't it make far more sense to pass cgroup fd to bpf syscall so
that "load this program" and "attach this program to the cgroup
identified by this fd" through the same interface and permission
checks?  cgroup participating in bpf operations is all fine but
splitting the userland interface across two domains seems like a bad
idea.

> Filters which are higher up in the hierarchy will still be enforced during
> delegation. This was an explicit design choice, as the "Orchestrator in
> Orchestrator" use case needs to be supported.

Given that program loading is restricted to root, wouldn't it be a lot
more efficient to let userland multiplex multiple programs?  Walking the
tree executing bpf programs each time one of these operations runs can
be pretty expensive.  Imagine a tree like the following.

A - B - C
  \ D

Let's say program is currently loaded on D.  If someone wants to add a
program on B, userland can load the program on B, combine B's and D's
program and compile them into a single program and load it on D.  The
only thing kernel would need to do in terms of hierarchy is finding
what's the closest program to execute.  In the above example, C would
need to use B's program and that can be determined on program
assignment time rather than on each operation.

Thanks.

-- 
tejun


Re: [PATCH net-next] tcp: add tcp_add_backlog()

2016-08-29 Thread Eric Dumazet
On Mon, 2016-08-29 at 15:51 -0300, Marcelo Ricardo Leitner wrote:
>   skb->truesize = SKB_TRUESIZE(skb_end_offset(skb));
> 
> Shouldn't __pskb_pull_tail() already fix this? As it seems to be the
> expected behavior, and it would have a more global effect then. For
> drivers not using copybreak, that's needed here anyway, but maybe this
> helps other protocols/situations too.

That would be difficult, because some callers do their own truesize
tracking (skb might be attached/charged to a socket, so changing
skb->truesize would need to adjust the amount that was charged)

This is why pskb_expand_head() is not allowed to mess with skb->truesize
(but in the opposite way, since we probably are increasing
skb->truesize)

Not sure it is worth the pain in fast path, where packets are consumed
so fast that their skb->truesize being slightly over estimated is not an
issue.





pull-request: wireless-drivers 2016-08-29

2016-08-29 Thread Kalle Valo
Hi Dave,

I'm quite backlogged after coming back from my vacation but luckily it
has been pretty quiet. Here is the first batch of patches for 4.8, quite
simple actually and not really anything special to mention. More to come
later, most probably next week. Please let me know if there are any
problems.

Kalle

The following changes since commit 184ca823481c99dadd7d946e5afd4bb921eab30d:

  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net (2016-08-17 17:26:58 -0700)

are available in the git repository at:


  git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers.git tags/wireless-drivers-for-davem-2016-08-29

for you to fetch changes up to bb87f02b7e4ccdb614a83cbf840524de81e9b321:

  Merge ath-current from ath.git (2016-08-29 21:39:04 +0300)



wireless-drivers fixes for 4.8

ath9k

* fix regression in client mode beacon configuration
* fix a station pointer which resulted in spurious crashes

mwifiex

* fix large amsdu packets causing firmware hang

brcmfmac

* fix deadlock when removing interface
* fix use of mutex in atomic context


Cathy Luo (1):
  mwifiex: fix large amsdu packets causing firmware hang

Felix Fietkau (2):
  ath9k: fix client mode beacon configuration
  ath9k: fix using sta->drv_priv before initializing it

Kalle Valo (1):
  Merge ath-current from ath.git

mhira...@kernel.org (2):
  brcmfmac: Check rtnl_lock is locked when removing interface
  brcmfmac: Change vif_event_lock to spinlock

 drivers/net/wireless/ath/ath9k/main.c  |9 ---
 .../broadcom/brcm80211/brcmfmac/cfg80211.c |   26 ++--
 .../broadcom/brcm80211/brcmfmac/cfg80211.h |2 +-
 .../wireless/broadcom/brcm80211/brcmfmac/core.c|2 +-
 .../net/wireless/broadcom/brcm80211/brcmfmac/p2p.c |8 +++---
 .../net/wireless/broadcom/brcm80211/brcmfmac/p2p.h |2 +-
 drivers/net/wireless/marvell/mwifiex/11n_aggr.c|3 ++-
 7 files changed, 28 insertions(+), 24 deletions(-)


[patch iproute2 v2 0/2] Add matchall support to tc

2016-08-29 Thread Jiri Pirko
From: Jiri Pirko 

Yotam says:
Add matchall classifier support to tc, and add the corresponding man pages.

---
v1->v2:
-fixed a couple of checkpatch issues

Yotam Gigi (2):
  tc: Add support for the matchall traffic classifier.
  tc: man: Add man entry for the matchall classifier.

 man/man8/Makefile  |   2 +-
 man/man8/tc-matchall.8 |  76 ++
 man/man8/tc.8  |   5 ++
 tc/Makefile|   1 +
 tc/f_matchall.c| 143 +
 5 files changed, 226 insertions(+), 1 deletion(-)
 create mode 100644 man/man8/tc-matchall.8
 create mode 100644 tc/f_matchall.c

-- 
2.5.5



[patch iproute2 v2 2/2] tc: man: Add man entry for the matchall classifier.

2016-08-29 Thread Jiri Pirko
From: Yotam Gigi 

In addition to providing information about the matchall filter and its
configuration, the man entry contains examples for creating port
mirroring entries.

Signed-off-by: Yotam Gigi 
Signed-off-by: Jiri Pirko 
---
 man/man8/Makefile  |  2 +-
 man/man8/tc-matchall.8 | 76 ++
 man/man8/tc.8  |  5 
 3 files changed, 82 insertions(+), 1 deletion(-)
 create mode 100644 man/man8/tc-matchall.8

diff --git a/man/man8/Makefile b/man/man8/Makefile
index 9badbed..9213769 100644
--- a/man/man8/Makefile
+++ b/man/man8/Makefile
@@ -14,7 +14,7 @@ MAN8PAGES = $(TARGETS) ip.8 arpd.8 lnstat.8 routel.8 rtacct.8 rtmon.8 rtpr.8 ss.
tipc.8 tipc-bearer.8 tipc-link.8 tipc-media.8 tipc-nametable.8 \
tipc-node.8 tipc-socket.8 \
tc-basic.8 tc-cgroup.8 tc-flow.8 tc-flower.8 tc-fw.8 tc-route.8 \
-   tc-tcindex.8 tc-u32.8 \
+   tc-tcindex.8 tc-u32.8 tc-matchall.8 \
tc-connmark.8 tc-csum.8 tc-mirred.8 tc-nat.8 tc-pedit.8 tc-police.8 \
tc-simple.8 tc-skbedit.8 tc-vlan.8 tc-xt.8 \
devlink.8 devlink-dev.8 devlink-monitor.8 devlink-port.8 devlink-sb.8
diff --git a/man/man8/tc-matchall.8 b/man/man8/tc-matchall.8
new file mode 100644
index 000..b927784
--- /dev/null
+++ b/man/man8/tc-matchall.8
@@ -0,0 +1,76 @@
+.TH "Match-all classifier in tc" 8 "21 Oct 2015" "iproute2" "Linux"
+
+.SH NAME
+matchall \- traffic control filter that matches every packet
+.SH SYNOPSIS
+.in +8
+.ti -8
+.BR tc " " filter " ... " matchall " [ "
+.BR skip_sw " | " skip_hw
+.R " ] [ "
+.B action
+.IR ACTION_SPEC " ] [ "
+.B classid
+.IR CLASSID " ]"
+.SH DESCRIPTION
+The
+.B matchall
+filter allows you to classify every packet that flows on the port and run an
+action on it.
+.SH OPTIONS
+.TP
+.BI action " ACTION_SPEC"
+Apply an action from the generic actions framework on matching packets.
+.TP
+.BI classid " CLASSID"
+Push matching packets into the class identified by
+.IR CLASSID .
+.TP
+.BI skip_sw
+Do not process filter by software. If hardware has no offload support for this
+filter, or TC offload is not enabled for the interface, operation will fail.
+.TP
+.BI skip_hw
+Do not process filter by hardware.
+.SH EXAMPLES
+To create ingress mirroring from port eth1 to port eth2:
+.RS
+.EX
+
+tc qdisc add dev eth1 handle ffff: ingress
+tc filter add dev eth1 parent ffff: \\
+   matchall skip_sw \\
+   action mirred egress mirror \\
+   dev eth2
+.EE
+.RE
+
+The first command creates an ingress qdisc with handle
+.BR ffff:
+on device
+.BR eth1
+where the second command attaches a matchall filter to it that mirrors the
+packets to device eth2.
+
+To create egress mirroring from port eth1 to port eth2:
+.RS
+.EX
+
+tc qdisc add dev eth1 handle 1: root prio
+tc filter add dev eth1 parent 1: \\
+   matchall skip_sw \\
+   action mirred egress mirror \\
+   dev eth2
+.EE
+.RE
+
+The first command creates an egress qdisc with handle
+.BR 1:
+that replaces the root qdisc on device
+.BR eth1
+where the second command attaches a matchall filter to it that mirrors the
+packets to device eth2.
+
+.SH SEE ALSO
+.BR tc (8),
diff --git a/man/man8/tc.8 b/man/man8/tc.8
index 4e99dca..7ee1c9c 100644
--- a/man/man8/tc.8
+++ b/man/man8/tc.8
@@ -187,6 +187,11 @@ u32
 Generic filtering on arbitrary packet data, assisted by syntax to abstract common operations. See
 .BR tc-u32 (8)
 for details.
+.TP
+matchall
+Traffic control filter that matches every packet. See
+.BR tc-matchall (8)
+for details.
 
 .SH CLASSLESS QDISCS
 The classless qdiscs are:
-- 
2.5.5



[patch iproute2 v2 1/2] tc: Add support for the matchall traffic classifier.

2016-08-29 Thread Jiri Pirko
From: Yotam Gigi 

The matchall classifier matches every packet and allows the user to apply
actions on it. In addition, it supports the skip_sw and skip_hw flags (as can
be found on the u32 and flower filters) that direct the kernel to skip the
software/hardware processing of the actions.

This filter is very useful in use cases where every packet should be
matched. For example, packet mirroring (SPAN) can be set up very easily
using that filter.

Signed-off-by: Yotam Gigi 
Signed-off-by: Jiri Pirko 
---
 tc/Makefile |   1 +
 tc/f_matchall.c | 143 
 2 files changed, 144 insertions(+)
 create mode 100644 tc/f_matchall.c

diff --git a/tc/Makefile b/tc/Makefile
index 42747c5..8917eaf 100644
--- a/tc/Makefile
+++ b/tc/Makefile
@@ -67,6 +67,7 @@ TCMODULES += q_pie.o
 TCMODULES += q_hhf.o
 TCMODULES += q_clsact.o
 TCMODULES += e_bpf.o
+TCMODULES += f_matchall.o
 
 ifeq ($(TC_CONFIG_IPSET), y)
   ifeq ($(TC_CONFIG_XT), y)
diff --git a/tc/f_matchall.c b/tc/f_matchall.c
new file mode 100644
index 000..04e524e
--- /dev/null
+++ b/tc/f_matchall.c
@@ -0,0 +1,143 @@
+/*
+ * f_matchall.c		Match-all Classifier
+ *
+ * This program is free software; you can distribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ *
+ * Authors:Jiri Pirko , Yotam Gigi 
+ *
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "utils.h"
+#include "tc_util.h"
+
+static void explain(void)
+{
+	fprintf(stderr, "Usage: ... matchall [skip_sw | skip_hw]\n");
+	fprintf(stderr, "                 [ action ACTION_SPEC ] [ classid CLASSID ]\n");
+	fprintf(stderr, "\n");
+	fprintf(stderr, "Where: SELECTOR := SAMPLE SAMPLE ...\n");
+	fprintf(stderr, "       FILTERID := X:Y:Z\n");
+	fprintf(stderr, "       ACTION_SPEC := ... look at individual actions\n");
+	fprintf(stderr, "\nNOTE: CLASSID is parsed as hexadecimal input.\n");
+}
+
+static int matchall_parse_opt(struct filter_util *qu, char *handle,
+  int argc, char **argv, struct nlmsghdr *n)
+{
+   struct tcmsg *t = NLMSG_DATA(n);
+   struct rtattr *tail;
+   __u32 flags = 0;
+   long h = 0;
+
+   if (handle) {
+   h = strtol(handle, NULL, 0);
+   if (h == LONG_MIN || h == LONG_MAX) {
+			fprintf(stderr, "Illegal handle \"%s\", must be numeric.\n",
+				handle);
+   return -1;
+   }
+   }
+   t->tcm_handle = h;
+
+   if (argc == 0)
+   return 0;
+
+   tail = (struct rtattr *)(((void *)n)+NLMSG_ALIGN(n->nlmsg_len));
+   addattr_l(n, MAX_MSG, TCA_OPTIONS, NULL, 0);
+
+   while (argc > 0) {
+   if (matches(*argv, "classid") == 0 ||
+  strcmp(*argv, "flowid") == 0) {
+   unsigned int handle;
+
+   NEXT_ARG();
+			if (get_tc_classid(&handle, *argv)) {
+   fprintf(stderr, "Illegal \"classid\"\n");
+   return -1;
+   }
+			addattr_l(n, MAX_MSG, TCA_MATCHALL_CLASSID, &handle, 4);
+   } else if (matches(*argv, "action") == 0) {
+   NEXT_ARG();
+			if (parse_action(&argc, &argv, TCA_MATCHALL_ACT, n)) {
+   fprintf(stderr, "Illegal \"action\"\n");
+   return -1;
+   }
+   continue;
+
+   } else if (strcmp(*argv, "skip_hw") == 0) {
+   NEXT_ARG();
+   flags |= TCA_CLS_FLAGS_SKIP_HW;
+   continue;
+   } else if (strcmp(*argv, "skip_sw") == 0) {
+   NEXT_ARG();
+   flags |= TCA_CLS_FLAGS_SKIP_SW;
+   continue;
+   } else if (strcmp(*argv, "help") == 0) {
+   explain();
+   return -1;
+   } else {
+   fprintf(stderr, "What is \"%s\"?\n", *argv);
+   explain();
+   return -1;
+   }
+   argc--; argv++;
+   }
+
+   if (flags) {
+   if (!(flags ^ (TCA_CLS_FLAGS_SKIP_HW |
+  TCA_CLS_FLAGS_SKIP_SW))) {
+   fprintf(stderr,
+   "skip_hw and skip_sw are mutually exclusive\n");
+   return -1;
+   }
+   addattr_l(n, MAX_MSG, 

Re: [PATCH net-next] tcp: add tcp_add_backlog()

2016-08-29 Thread Marcelo Ricardo Leitner
On Sat, Aug 27, 2016 at 07:37:54AM -0700, Eric Dumazet wrote:
> From: Eric Dumazet 
> 
> When TCP operates in lossy environments (between 1 and 10 % packet
> losses), many SACK blocks can be exchanged, and I noticed we could
> drop them on busy senders, if these SACK blocks have to be queued
> into the socket backlog.
> 
> While the main cause is the poor performance of RACK/SACK processing,
> we can try to avoid these drops of valuable information that can lead to
> spurious timeouts and retransmits.
> 
> Cause of the drops is the skb->truesize overestimation caused by :
> 
> - drivers allocating ~2048 (or more) bytes as a fragment to hold an
>   Ethernet frame.
> 
> - various pskb_may_pull() calls bringing the headers into skb->head
>   might have pulled all the frame content, but skb->truesize could
>   not be lowered, as the stack has no idea of each fragment truesize.
> 
> The backlog drops are also more visible on bidirectional flows, since
> their sk_rmem_alloc can be quite big.
> 
> Let's add some room for the backlog, as only the socket owner
> can selectively take action to lower memory needs, like collapsing
> receive queues or partial ofo pruning.
> 
> Signed-off-by: Eric Dumazet 
> Cc: Yuchung Cheng 
> Cc: Neal Cardwell 
> ---
>  include/net/tcp.h   |1 +
>  net/ipv4/tcp_ipv4.c |   33 +
>  net/ipv6/tcp_ipv6.c |5 +
>  3 files changed, 31 insertions(+), 8 deletions(-)
> 
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index 25d64f6de69e1f639ed1531bf2d2df3f00fd76a2..5f5f09f6e019682ef29c864d2f43a8f247fcdd9a 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -1163,6 +1163,7 @@ static inline void tcp_prequeue_init(struct tcp_sock *tp)
>  }
>  
>  bool tcp_prequeue(struct sock *sk, struct sk_buff *skb);
> +bool tcp_add_backlog(struct sock *sk, struct sk_buff *skb);
>  
>  #undef STATE_TRACE
>  
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index ad41e8ecf796bba1bd6d9ed155ca4a57ced96844..53e80cd004b6ce401c3acbb4b243b243c5c3c4a3 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -1532,6 +1532,34 @@ bool tcp_prequeue(struct sock *sk, struct sk_buff *skb)
>  }
>  EXPORT_SYMBOL(tcp_prequeue);
>  
> +bool tcp_add_backlog(struct sock *sk, struct sk_buff *skb)
> +{
> + u32 limit = sk->sk_rcvbuf + sk->sk_sndbuf;
> +
> + /* Only socket owner can try to collapse/prune rx queues
> +  * to reduce memory overhead, so add a little headroom here.
> +  * Few sockets backlog are possibly concurrently non empty.
> +  */
> + limit += 64*1024;
> +
> + /* In case all data was pulled from skb frags (in __pskb_pull_tail()),
> +  * we can fix skb->truesize to its real value to avoid future drops.
> +  * This is valid because skb is not yet charged to the socket.
> +  * It has been noticed pure SACK packets were sometimes dropped
> +  * (if cooked by drivers without copybreak feature).
> +  */
> + if (!skb->data_len)
> + skb->truesize = SKB_TRUESIZE(skb_end_offset(skb));

Shouldn't __pskb_pull_tail() already fix this? As it seems to be the
expected behavior, and it would have a more global effect then. For
drivers not using copybreak, that's needed here anyway, but maybe this
helps other protocols/situations too.

Thanks,
  Marcelo


Re: [net-next RFC v2 4/9] bpf, security: Add Checmate security LSM and BPF program type

2016-08-29 Thread Sargun Dhillon
On Mon, Aug 29, 2016 at 01:01:18PM -0400, Tejun Heo wrote:
> Hello,
> 
> On Mon, Aug 29, 2016 at 04:47:07AM -0700, Sargun Dhillon wrote:
> > This patch adds a minor LSM, Checmate. Checmate is a flexible, programmable,
> > extensible minor LSM that's coupled with cgroups and BPF. It is designed to
> > enforce container-specific policies. It is also a cgroupv2 controller. By
> > itself, it doesn't do very much, but in conjunction with an orchestrator
> > complex policies can be installed on the cgroup hierarchy.
> > 
> > These cgroup programs are tied to the kernel ABI version. If one tries
> > to load a BPF program compiled against a different kernel version,
> > an error will be thrown.
> 
> First of all, please talk with people working on network cgroup bpf
> and landlock.  I don't think it's a good idea to have N different ways
> to implement cgroup-aware bpf mechanism.  There can be multiple
> consumers but there gotta be a common mechanism instead of several
> independent controllers.
>
I've talked to Daniel Mack and Alexei. I agree with you that it makes
sense not to have an infinite number of these cgroup + bpf + lsm
subsystems in the kernel. I think that making sure we don't sacrifice
capability is important.

> > diff --git a/include/linux/checmate.h b/include/linux/checmate.h
> > new file mode 100644
> > index 000..4c4db4a
> > --- /dev/null
> > +++ b/include/linux/checmate.h
> > @@ -0,0 +1,108 @@
> > +#ifndef _LINUX_CHECMATE_H_
> > +#define _LINUX_CHECMATE_H_ 1
> > +#include 
> > +
> > +enum checmate_hook_num {
> > +   /* CONFIG_SECURITY_NET hooks */
> > +   CHECMATE_HOOK_UNIX_STREAM_CONNECT,
> > +   CHECMATE_HOOK_UNIX_MAY_SEND,
> > +   CHECMATE_HOOK_SOCKET_CREATE,
> > +   CHECMATE_HOOK_SOCKET_POST_CREATE,
> > +   CHECMATE_HOOK_SOCKET_BIND,
> > +   CHECMATE_HOOK_SOCKET_CONNECT,
> > +   CHECMATE_HOOK_SOCKET_LISTEN,
> > +   CHECMATE_HOOK_SOCKET_ACCEPT,
> > +   CHECMATE_HOOK_SOCKET_SENDMSG,
> > +   CHECMATE_HOOK_SOCKET_RECVMSG,
> > +   CHECMATE_HOOK_SOCKET_GETSOCKNAME,
> > +   CHECMATE_HOOK_SOCKET_GETPEERNAME,
> > +   CHECMATE_HOOK_SOCKET_GETSOCKOPT,
> > +   CHECMATE_HOOK_SOCKET_SETSOCKOPT,
> > +   CHECMATE_HOOK_SOCKET_SHUTDOWN,
> > +   CHECMATE_HOOK_SOCKET_SOCK_RCV_SKB,
> > +   CHECMATE_HOOK_SK_FREE_SECURITY,
> > +   __CHECMATE_HOOK_MAX,
> > +};
> 
> Do we really want a separate hook for each call?  A logical extension
> of this would be having a separate hook per syscall which feels kinda
> horrible.
> 
It would be a separate hook per LSM hook. Why wouldn't we want a separate bpf
hook per lsm hook? I think if one program has to handle them all, the first
program would be looking up the hook program in a bpf prog array. If you think
it's better to have this logic in the BPF program, that makes sense.

I had a version of this patch that allowed you to attach a prog array instead,
but I think that it's cleaner attaching a program per lsm hook. In addition,
there's a performance impact that comes from these hooks, so I wouldn't want
to execute unnecessary code if it's avoidable.

The prog array approach also makes stacking filters difficult. If people want 
multiple filters per hook, the orchestrator would have to rewrite the existing 
filters to be cooperative.
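
For context, the prog array variant mentioned above would dispatch
roughly like this (a hedged sketch, not the posted patches; the SEC()
name and the ctx->hook field are assumptions for illustration):

struct bpf_map_def SEC("maps") hook_progs = {
	.type		= BPF_MAP_TYPE_PROG_ARRAY,
	.key_size	= sizeof(u32),
	.value_size	= sizeof(u32),
	.max_entries	= __CHECMATE_HOOK_MAX,
};

SEC("checmate") int dispatch(struct checmate_ctx *ctx)
{
	/* Jump to the per-hook program; falls through if none is set. */
	bpf_tail_call(ctx, &hook_progs, ctx->hook);
	return 0;
}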

> > +/* CONFIG_SECURITY_NET contexts */
> > +struct checmate_unix_stream_connect_ctx {
> > +   struct sock *sock;
> > +   struct sock *other;
> > +   struct sock *newsk;
> > +};
> ...
> > +struct checmate_sk_free_security_ctx {
> > +   struct sock *sk;
> > +};
> ...
> > +struct checmate_ctx {
> > +   int hook;
> > +   union {
> > +/* CONFIG_SECURITY_NET contexts */
> > +   struct checmate_unix_stream_connect_ctx unix_stream_connect;
> > +   struct checmate_unix_may_send_ctx   unix_may_send;
> > +   struct checmate_socket_create_ctx   socket_create;
> > +   struct checmate_socket_bind_ctx socket_bind;
> > +   struct checmate_socket_connect_ctx  socket_connect;
> > +   struct checmate_socket_listen_ctx   socket_listen;
> > +   struct checmate_socket_accept_ctx   socket_accept;
> > +   struct checmate_socket_sendmsg_ctx  socket_sendmsg;
> > +   struct checmate_socket_recvmsg_ctx  socket_recvmsg;
> > +   struct checmate_socket_sock_rcv_skb_ctx socket_sock_rcv_skb;
> > +   struct checmate_sk_free_security_ctxsk_free_security;
> > +   };
> > +};
> 
> I'm not convinced about the approach.  It's an approach which pretty
> much requires future extensions while being rigid.  Not a good
> combination.
> 
Do you have an alternative recommendation? Maybe just a set of 5 u64s
as the context object along with the hook ID?

> > +/*
> 
> Please use /** for function comments.
> 
> > + * checmate_instance_cleanup_rcu - Cleans up a Checmate program instance
> > + * @rp: rcu_head pointer to a Checmate instance
> > + */
> > +static void checmate_instance_cleanup_rcu(struct rcu_head *rp)
> > +{
> > +   struct 

Re: [patch iproute2 1/2] tc: Add support for the matchall traffic classifier.

2016-08-29 Thread Jiri Pirko
Mon, Aug 29, 2016 at 08:03:21PM CEST, step...@networkplumber.org wrote:
>On Mon, 22 Aug 2016 13:56:25 +0200
>Jiri Pirko  wrote:
>
>> From: Yotam Gigi 
>> 
>> The matchall classifier matches every packet and allows the user to apply
>> actions on it. In addition, it supports the skip_sw and skip_hw flags (as can
>> be found on the u32 and flower filters) that direct the kernel to skip the
>> software/hardware processing of the actions.
>> 
>> This filter is very useful in use cases where every packet should be
>> matched. For example, packet mirroring (SPAN) can be set up very easily
>> using that filter.
>> 
>> Signed-off-by: Yotam Gigi 
>> Signed-off-by: Jiri Pirko 
>
>I am not a checkpatch purist, but these two should be addressed.
>
>
>WARNING: unnecessary whitespace before a quoted newline
>#81: FILE: tc/f_matchall.c:29:
>+  fprintf(stderr, "Usage: ... matchall [skip_sw | skip_hw] \n");
>
>WARNING: braces {} are not necessary for single statement blocks
>#185: FILE: tc/f_matchall.c:133:
>+  if (tb[TCA_MATCHALL_ACT]) {
>+  tc_print_action(f, tb[TCA_MATCHALL_ACT]);
>+  }

will fix and send v2. Thanks.


Re: [PATCH net-next] sky2: use napi_complete_done

2016-08-29 Thread Eric Dumazet
On Mon, 2016-08-29 at 10:16 -0700, Stephen Hemminger wrote:
> Update the sky2 driver to pass the number of packets done to NAPI.
> The driver was never updated when napi_complete_done was added.
> 
> Signed-off-by: Stephen Hemminger 
> ---

Acked-by: Eric Dumazet 




Re: [PATCH v3 net-next 1/1] net_sched: Introduce skbmod action

2016-08-29 Thread Eric Dumazet
On Mon, 2016-08-29 at 07:35 -0400, Jamal Hadi Salim wrote:
> 	flags = READ_ONCE(d->flags);
> 	rcu_read_lock();
> 	if (flags & SKBMOD_F_DMAC)
> 		ether_addr_copy(eth_hdr(skb)->h_dest, d->eth_dst);
> 	if (flags & SKBMOD_F_SMAC)
> 		ether_addr_copy(eth_hdr(skb)->h_source, d->eth_src);
> 	if (flags & SKBMOD_F_ETYPE)
> 		eth_hdr(skb)->h_proto = d->eth_type;
> 	if (flags & SKBMOD_F_SWAPMAC) {
> 		u8 tmpaddr[ETH_ALEN];
> 		/* XXX: I am sure we can come up with something more
> 		 * efficient
> 		 */
> 		ether_addr_copy(tmpaddr, eth_hdr(skb)->h_dest);
> 		ether_addr_copy(eth_hdr(skb)->h_dest, eth_hdr(skb)->h_source);
> 		ether_addr_copy(eth_hdr(skb)->h_source, tmpaddr);
> 	}
> 	rcu_read_unlock();

You would need to store everything in an object, managed by rcu.

struct my_rcu_safe_struct {
 u32 flags;
 u8   eth_dst[ETH_ALEN];
 u8   eth_src[ETH_ALEN];
 __be16 eth_type;
};

And then allocate a new one when you need to update the info (from
tcf_skbmod_init()).

RCU: Read, Copy, Update.

Then in the reader you would use:

rcu_read_lock();
myptr = rcu_dereference(d->ptr);
if (myptr) {
	if (myptr->flags & SKBMOD_F_DMAC)
		ether_addr_copy(eth_hdr(skb)->h_dest, myptr->eth_dst);
	if (myptr->flags & SKBMOD_F_SMAC)
		ether_addr_copy(eth_hdr(skb)->h_source, myptr->eth_src);
	if (myptr->flags & SKBMOD_F_ETYPE)
		eth_hdr(skb)->h_proto = myptr->eth_type;
}
rcu_read_unlock();
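
And the update side would be (a minimal sketch; it assumes updates are
serialized, e.g. under RTNL, and that my_rcu_safe_struct also gains a
struct rcu_head member named 'rcu' so kfree_rcu() can be used):

/* From tcf_skbmod_init(): build a new object, publish it with
 * rcu_assign_pointer(), and free the old one after a grace period.
 */
struct my_rcu_safe_struct *new, *old;

new = kzalloc(sizeof(*new), GFP_KERNEL);
if (!new)
	return -ENOMEM;
new->flags = flags;
ether_addr_copy(new->eth_dst, eth_dst);
ether_addr_copy(new->eth_src, eth_src);
new->eth_type = eth_type;

old = rtnl_dereference(d->ptr);
rcu_assign_pointer(d->ptr, new);
if (old)
	kfree_rcu(old, rcu);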





Re: [PATCH net iproute2] devlink: Add e-switch support

2016-08-29 Thread Stephen Hemminger
On Sun, 28 Aug 2016 16:35:21 +0300
Or Gerlitz  wrote:

> Implement kernel devlink e-switch interface. Currently we allow
> to get and set the device e-switch mode.
> 
> Signed-off-by: Or Gerlitz 
> Signed-off-by: Roi Dayan 

Applied to master branch.



Re: [PATCH iproute2 1/2] tc: flower: Introduce vlan support

2016-08-29 Thread Stephen Hemminger
On Mon, 22 Aug 2016 13:24:53 +0300
Hadar Hen Zion  wrote:

> diff --git a/include/linux/tc_act/tc_vlan.h b/include/linux/tc_act/tc_vlan.h
> index 31151ff..26ae695 100644
> --- a/include/linux/tc_act/tc_vlan.h
> +++ b/include/linux/tc_act/tc_vlan.h
> @@ -16,6 +16,9 @@
>  
>  #define TCA_VLAN_ACT_POP 1
>  #define TCA_VLAN_ACT_PUSH	2
> +#define VLAN_PRIO_MASK   0x7
> +#define VLAN_VID_MASK	0x0fff
> +
>  
>  struct tc_vlan {
>   tc_gen;

I was going to apply this, it looks fine. But there is an issue
you need to address first.

All headers in iproute2 come from sanitized version of the kernel headers.
This header does not match the version in net-next.

net-next$ grep VLAN_PRIO_MASK usr/include/linux/tc_act/tc_vlan.h

Please update the kernel header first, then resubmit the patch.


Re: [patch iproute2 1/2] tc: Add support for the matchall traffic classifier.

2016-08-29 Thread Stephen Hemminger
On Mon, 22 Aug 2016 13:56:25 +0200
Jiri Pirko  wrote:

> From: Yotam Gigi 
> 
> The matchall classifier matches every packet and allows the user to apply
> actions on it. In addition, it supports the skip_sw and skip_hw flags (as can
> be found on the u32 and flower filters) that direct the kernel to skip the
> software/hardware processing of the actions.
> 
> This filter is very useful in use cases where every packet should be
> matched. For example, packet mirroring (SPAN) can be set up very easily
> using that filter.
> 
> Signed-off-by: Yotam Gigi 
> Signed-off-by: Jiri Pirko 

I am not a checkpatch purist, but these two should be addressed.


WARNING: unnecessary whitespace before a quoted newline
#81: FILE: tc/f_matchall.c:29:
+   fprintf(stderr, "Usage: ... matchall [skip_sw | skip_hw] \n");

WARNING: braces {} are not necessary for single statement blocks
#185: FILE: tc/f_matchall.c:133:
+   if (tb[TCA_MATCHALL_ACT]) {
+   tc_print_action(f, tb[TCA_MATCHALL_ACT]);
+   }


Re: [PATCH iproute2 net-next v3] bridge: vlan: add support to display per-vlan statistics

2016-08-29 Thread Stephen Hemminger
On Thu, 25 Aug 2016 14:28:55 +0200
Nikolay Aleksandrov  wrote:

> This patch adds support for the stats argument to the bridge
> vlan command which will display the per-vlan statistics and the device
> each vlan belongs to with its flags. The supported command filtering
> options are dev and vid. Also the man page is updated to explain the new
> option.
> The patch uses the new RTM_GETSTATS interface with a filter_mask to dump
> all bridges and ports vlans. Later we can add support for using the
> per-device dump and filter it in the kernel instead.
> 
> Example:
> $ bridge -s vlan show
> port vlan id
> br0   1 Egress Untagged
> RX: 2536 bytes 20 packets
> TX: 2536 bytes 20 packets
>   101
> RX: 43158 bytes 50 packets
> TX: 43158 bytes 50 packets
> eth1  1 Egress Untagged
> RX: 2536 bytes 20 packets
> TX: 2536 bytes 20 packets
>   100
> RX: 0 bytes 0 packets
> TX: 0 bytes 0 packets
>   101
> RX: 43158 bytes 50 packets
> TX: 43158 bytes 50 packets
>   102
> RX: 16897 bytes 93 packets
> TX: 0 bytes 0 packets
> 
> The format is the same as bridge vlan show but with stats, even though
> under the hood the calls done to the kernel are different.
> 
> Signed-off-by: Nikolay Aleksandrov 
> ---

Applied to net-next branch


Re: [PATCH] iproute: disallow ip rule del without parameters

2016-08-29 Thread Stephen Hemminger
On Wed, 24 Aug 2016 23:43:00 +0300
"Andrey Jr. Melnikov"  wrote:

> Disallow run `ip rule del` without any parameter to avoid delete any first
> rule from table.
> 
> Signed-off-by: Andrey Jr. Melnikov 
> ---
> 
> diff --git a/ip/iprule.c b/ip/iprule.c
> index 8f24206..70562c5 100644
> --- a/ip/iprule.c
> +++ b/ip/iprule.c
> @@ -346,6 +346,11 @@ static int iprule_modify(int cmd, int argc, char **argv)
>   req.r.rtm_type = RTN_UNICAST;
>   }
>  
> + if (cmd == RTM_DELRULE && argc == 0) {
> + fprintf(stderr, "\"ip rule del\" requires arguments.\n");
> + return -1;
> + }
> +
>   while (argc > 0) {
>   if (strcmp(*argv, "not") == 0) {
>   req.r.rtm_flags |= FIB_RULE_INVERT;


Actually ip rule delete without arguments deletes all rules.
Which could be a bug or feature depending on the user.
I can imagine somebody is doing something like deleting all rules
and putting in new ones for PBR.


Re: [PATCH] net/mlx4_en: protect ring->xdp_prog with rcu_read_lock

2016-08-29 Thread Tom Herbert
On Mon, Aug 29, 2016 at 8:55 AM, Brenden Blanco  wrote:
> On Mon, Aug 29, 2016 at 05:59:26PM +0300, Tariq Toukan wrote:
>> Hi Brenden,
>>
>> The solution direction should be XDP specific that does not hurt the
>> regular flow.
> An rcu_read_lock is _already_ taken for _every_ packet. This is 1/64th of
> that.
>>
>> On 26/08/2016 11:38 PM, Brenden Blanco wrote:
>> >Depending on the preempt mode, the bpf_prog stored in xdp_prog may be
>> >freed despite the use of call_rcu inside bpf_prog_put. The situation is
>> >possible when running in PREEMPT_RCU=y mode, for instance, since the rcu
>> >callback for destroying the bpf prog can run even during the bh handling
>> >in the mlx4 rx path.
>> >
>> >Several options were considered before this patch was settled on:
>> >
>> >Add a napi_synchronize loop in mlx4_xdp_set, which would occur after all
>> >of the rings are updated with the new program.
>> >This approach has the disadvantage that as the number of rings
>> >increases, the speed of update will slow down significantly due to
>> >napi_synchronize's msleep(1).
>> I prefer this option as it doesn't hurt the data path. A delay in a
>> control command can be tolerated.
>> >Add a new rcu_head in bpf_prog_aux, to be used by a new bpf_prog_put_bh.
>> >The action of the bpf_prog_put_bh would be to then call bpf_prog_put
>> >later. Those drivers that consume a bpf prog in a bh context (like mlx4)
>> >would then use the bpf_prog_put_bh instead when the ring is up. This has
>> >the problem of complexity, in maintaining proper refcnts and rcu lists,
>> >and would likely be harder to review. In addition, this approach to
>> >freeing must be exclusive with other frees of the bpf prog, for instance
>> >a _bh prog must not be referenced from a prog array that is consumed by
>> >a non-_bh prog.
>> >
>> >The placement of rcu_read_lock in this patch is functionally the same as
>> >putting an rcu_read_lock in napi_poll. Actually doing so could be a
>> >potentially controversial change, but would bring the implementation in
>> >line with sk_busy_loop (though of course the nature of those two paths
>> >is substantially different), and would also avoid future copy/paste
>> >problems with future supporters of XDP. Still, this patch does not take
>> >that opinionated option.
>> So you decided to add a lock for all non-XDP flows, which are 99% of
>> the cases.
>> We should avoid this.
> The whole point of rcu_read_lock architecture is to be taken in the fast
> path. There won't be a performance impact from this patch.

+1, this is nothing at all like a spinlock; really, this should be
just like any other RCU-protected access.

Brenden, tracking down how the structure is freed took a few steps,
so please make sure the RCU requirements are well documented. Also, I'm
still not a fan of using xchg to set the program; it seems that a lock
could be used in that path.
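
For reference, the reader-side pattern being discussed is the usual RCU
one (a sketch; the helper name is made up and the field names are
illustrative rather than the exact mlx4 code):

static u32 mlx4_run_xdp(struct mlx4_en_rx_ring *ring, struct xdp_buff *xdp)
{
	const struct bpf_prog *prog;
	u32 act = XDP_PASS;

	rcu_read_lock();
	prog = rcu_dereference(ring->xdp_prog);
	if (prog)
		act = bpf_prog_run_xdp(prog, xdp);	/* prog cannot be freed here */
	rcu_read_unlock();

	return act;
}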

Thanks,
Tom


Re: [iproute PATCH] ip-route: Prevent some double spaces in output

2016-08-29 Thread Stephen Hemminger
On Tue, 23 Aug 2016 11:52:45 +0200
Phil Sutter  wrote:

> The code is a bit messy, as it starts with space after text and at some
> point switches to space before text. But either way, printing space
> before *and* after text almost certainly leads to printing more
> whitespace than necessary.
> 
> Signed-off-by: Phil Sutter 
> ---

Applied


Re: [PATCH v2 net-next] documentation: Document issues with VMs and XPS and drivers enabling it on their own

2016-08-29 Thread Tom Herbert
On Mon, Aug 29, 2016 at 9:26 AM, Rick Jones  wrote:
> On 08/27/2016 12:41 PM, Tom Herbert wrote:
>>
>> On Fri, Aug 26, 2016 at 9:35 PM, David Miller  wrote:
>>>
>>> From: Tom Herbert 
>>> Date: Thu, 25 Aug 2016 16:43:35 -0700
>>>
 This seems like it will only confuse users even more. You've clearly
 identified an issue, let's figure out how to fix it.
>>>
>>>
>>> I kinda feel the same way about this situation.
>>
>>
>> I'm working on XFS (as the transmit analogue to RFS). We'll track
>> flows enough so that we should know when it's safe to move them.
>
>
> Is the XFS you are working on going to subsume XPS or will the two continue
> to exist in parallel a la RPS and RFS?
>
XPS selects the queue; XFS prevents changing queues when OOO could
occur. I am thinking that XFS is only applicable when we don't have a
socket tracking OOO. The idea is to have a flow table indexed by packet
hash that gives the current TX queue for matching flows. The TX queue
comes from doing XPS. We only change the queue used by a flow if that
won't result in OOO, as determined by tracking the queue similarly to
how we do RFS.
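
A conceptual sketch of such a flow table entry (all names are
hypothetical; this illustrates the idea, it is not merged code):

struct xfs_flow {
	u16		txq;		/* TX queue currently used by this flow */
	unsigned long	last_used;	/* jiffies of last transmit on txq */
};

static u16 xfs_select_queue(struct xfs_flow *f, u16 xps_queue)
{
	/* Move the flow to the XPS-preferred queue only when the old
	 * queue has likely drained, so packets cannot be reordered.
	 */
	if (f->txq != xps_queue &&
	    time_after(jiffies, f->last_used + msecs_to_jiffies(10)))
		f->txq = xps_queue;

	f->last_used = jiffies;
	return f->txq;
}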

Tom

> rick jones
>


Re: [PATCH iproute2] tipc: add peer remove functionality

2016-08-29 Thread Stephen Hemminger
On Mon, 22 Aug 2016 10:18:29 +0200
Richard Alpe  wrote:

> This enables a user to remove an offline peer from the kernel data
> structures. This could for example be useful when deliberately scaling
> in peer nodes in a cloud environment.
> 
> This functionality was first merged in:
> f9dec657e4 (Richard Alpe tipc: add peer remove functionality)
> 
> And later backed out (as the kernel counterpart was held up) in:
> 385caeb13b (Stephen Hemminger Revert "tipc: add peer remove functionality")
> 
> Signed-off-by: Richard Alpe 
> Reviewed-by: Jon Maloy 
> Reviewed-by: Ying Xue 

Applied to net-next branch



Re: [PATCH net-next] amd-xgbe: Reset running devices after resume from hibernate

2016-08-29 Thread Tom Lendacky
On 08/26/2016 03:21 AM, James Morse wrote:
> After resume from hibernate on arm64, any amd-xgbe devices that were
> running when we hibernated are reported as down, even when they are not.
> 
> Re-plugging the cables does not cause the interface to come back; the
> link must be marked as down then up via 'ip link set' using the serial
> console.
> 
> This happens because the device has been power-cycled and possibly
> re-initialised by firmware, whereas the driver's memory structures have
> been restored from the hibernate image and the two do not agree.
> 
> Schedule a restart of the device after powerup in case the world changed
> while we were asleep.
> 
> Signed-off-by: James Morse 

Acked-by: Tom Lendacky 

> ---
>  drivers/net/ethernet/amd/xgbe/xgbe-main.c | 8 +++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/amd/xgbe/xgbe-main.c b/drivers/net/ethernet/amd/xgbe/xgbe-main.c
> index 3eee3201b58f..9de078819aa6 100644
> --- a/drivers/net/ethernet/amd/xgbe/xgbe-main.c
> +++ b/drivers/net/ethernet/amd/xgbe/xgbe-main.c
> @@ -861,9 +861,15 @@ static int xgbe_resume(struct device *dev)
>   pdata->lpm_ctrl &= ~MDIO_CTRL1_LPOWER;
>   XMDIO_WRITE(pdata, MDIO_MMD_PCS, MDIO_CTRL1, pdata->lpm_ctrl);
>  
> - if (netif_running(netdev))
> + if (netif_running(netdev)) {
>   ret = xgbe_powerup(netdev, XGMAC_DRIVER_CONTEXT);
>  
> + /* Schedule a restart in case the link or phy state changed
> +  * while we were powered down.
> +  */
> +		schedule_work(&pdata->restart_work);
> + }
> +
>   DBGPR("<--xgbe_resume\n");
>  
>   return ret;
> 


[PATCH net-next] sky2: use napi_complete_done

2016-08-29 Thread Stephen Hemminger
Update the sky2 driver to pass the number of packets done to NAPI.
The driver was never updated when napi_complete_done was added.

Signed-off-by: Stephen Hemminger 
---
 drivers/net/ethernet/marvell/sky2.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/marvell/sky2.c b/drivers/net/ethernet/marvell/sky2.c
index 467138b..f05ea56 100644
--- a/drivers/net/ethernet/marvell/sky2.c
+++ b/drivers/net/ethernet/marvell/sky2.c
@@ -3070,7 +3070,7 @@ static int sky2_poll(struct napi_struct *napi, int work_limit)
goto done;
}
 
-   napi_complete(napi);
+   napi_complete_done(napi, work_done);
sky2_read32(hw, B0_Y2_SP_LISR);
 done:
 
-- 
2.9.3



Re: [net-next RFC v2 9/9] doc: Add LSM / BPF Checmate docs

2016-08-29 Thread Randy Dunlap
On 08/29/16 04:47, Sargun Dhillon wrote:
> This adds documentation on how to operate, and develop against the
> Checmate LSM and Cgroup controller.
> 
> Signed-off-by: Sargun Dhillon 
> ---
>  Documentation/security/Checmate.txt | 54 +
>  1 file changed, 54 insertions(+)
>  create mode 100644 Documentation/security/Checmate.txt
> 
> diff --git a/Documentation/security/Checmate.txt b/Documentation/security/Checmate.txt
> new file mode 100644
> index 000..d409785
> --- /dev/null
> +++ b/Documentation/security/Checmate.txt
> @@ -0,0 +1,54 @@
> +--- What is Checmate? ---
> +
> +Checmate is a flexible, programmable, extensible minor LSM that's coupled with
> +cgroups and BPF. It is designed to enforce container-specific policies. By
> +default, it does not enforce any policies. It is selectable at build time
> +with CONFIG_SECURITY_CHECMATE, and it is controlled through the unified cgroups
> +controller hierarchy.
> +
> +# How to use Checmate
> +In order to use Checmate, you have to enable the controller on the cgroup2
> +hierarchy. In order to prevent a centralized configuration daemon from mounting
> +Checmate on the V1 hierarchy you may want to add 'cgroup_no_v1=checmate' to your
> +boot command line.
> +
> +Enabling the controller:
> + mount -t cgroup2 none $MOUNT_POINT
> + cd $MOUNT_POINT
> + echo +checmate > cgroup.subtree_control
> +
> +Once you do this, immediate children of this node on the hierarchy will have a
> +number of control files that begin with 'checmate.'. Each of these is mapped
> +to an LSM hook by the same name. If you read the file, it will return the
> +number of filters attached to that given hook. Details of the hooks can be
> +found in lsm_hooks.h.
> +
> +All tasks which are members of a cgroup will have no only the checmate filters

s/no/not/

> +at that level enforced, but all levels above as well. If there is a need
> +to exempt a specific sub-cgroup, a program can use current_task_under_cgroup
> +along with a bpf map.
> +
> +## Adding filters:
> +If you would like to add a filter, you must compile a BPF_PROG_TYPE_CHECMATE BPF
> +program. You can then write the '%d\n' formatted version of the BPF program
> +file descriptor to the relevant control file.
> +
> +## Removing filters:
> +If you would like to remove a specific filter, you can write the negative file
> +descriptor of the BPF program to the control file (a la '-%d\n'). If you would
> +like to do this, then it is recommended that you pin your programs.
> +
> +If you would like to remove all filters from a specific hook, simply write '0'
> +to the control file. During normal operation, you shouldn't have the bpf syscall
> +return '0' for a given program, please take proper precautions to work around
> +this.
> +
> +# Caveats
> +## Hook Limit:
> +Each hook is limited to having MAX_CHECMATE_INSTANCES (32) hooks per level
> +in the hierarchy. The write call will return ENOSPC if you hit this condition.
> +
> +## CGroup v2 interaction with CGroup v1:
> +Because the cgroups subsystem is in transition, using the net_prio or the
> +net_classid v1 cgroups will render Checmate inoperable on all network
> +hooks that inspect sockets.
> \ No newline at end of file



-- 
~Randy


[PATCH] ath10k: Spelling and miscellaneous neatening

2016-08-29 Thread Joe Perches
Correct some trivial comment typos.
Remove unnecessary parentheses in a long line.
Convert a return; before the end of a void function definition to just ;

Signed-off-by: Joe Perches 
---
 drivers/net/wireless/ath/ath10k/ce.c|  2 +-
 drivers/net/wireless/ath/ath10k/core.c  |  2 +-
 drivers/net/wireless/ath/ath10k/htt.h   |  8 
 drivers/net/wireless/ath/ath10k/hw.c|  2 +-
 drivers/net/wireless/ath/ath10k/hw.h|  2 +-
 drivers/net/wireless/ath/ath10k/targaddrs.h |  2 +-
 drivers/net/wireless/ath/ath10k/wmi.h   | 24 
 7 files changed, 21 insertions(+), 21 deletions(-)

diff --git a/drivers/net/wireless/ath/ath10k/ce.c b/drivers/net/wireless/ath/ath10k/ce.c
index 9fb8d74..cea9684 100644
--- a/drivers/net/wireless/ath/ath10k/ce.c
+++ b/drivers/net/wireless/ath/ath10k/ce.c
@@ -39,7 +39,7 @@
  * chooses what to send (buffer address, length). The destination
  * side keeps a supply of "anonymous receive buffers" available and
  * it handles incoming data as it arrives (when the destination
- * recieves an interrupt).
+ * receives an interrupt).
  *
  * The sender may send a simple buffer (address/length) or it may
  * send a small list of buffers.  When a small list is sent, hardware
diff --git a/drivers/net/wireless/ath/ath10k/core.c b/drivers/net/wireless/ath/ath10k/core.c
index e889829..1e8147a 100644
--- a/drivers/net/wireless/ath/ath10k/core.c
+++ b/drivers/net/wireless/ath/ath10k/core.c
@@ -2118,7 +2118,7 @@ err:
/* TODO: It's probably a good idea to release device from the driver
 * but calling device_release_driver() here will cause a deadlock.
 */
-   return;
+   ;
 }
 
 int ath10k_core_register(struct ath10k *ar, u32 chip_id)
diff --git a/drivers/net/wireless/ath/ath10k/htt.h b/drivers/net/wireless/ath/ath10k/htt.h
index 430a83e..cedb4e8 100644
--- a/drivers/net/wireless/ath/ath10k/htt.h
+++ b/drivers/net/wireless/ath/ath10k/htt.h
@@ -595,7 +595,7 @@ enum htt_rx_mpdu_status {
/* only accept EAPOL frames */
HTT_RX_IND_MPDU_STATUS_UNAUTH_PEER,
HTT_RX_IND_MPDU_STATUS_OUT_OF_SYNC,
-   /* Non-data in promiscous mode */
+   /* Non-data in promiscuous mode */
HTT_RX_IND_MPDU_STATUS_MGMT_CTRL,
HTT_RX_IND_MPDU_STATUS_TKIP_MIC_ERR,
HTT_RX_IND_MPDU_STATUS_DECRYPT_ERR,
@@ -900,7 +900,7 @@ struct htt_rx_in_ord_ind {
  * Purpose: indicate how many 32-bit integers follow the message header
  *   - NUM_CHARS
  * Bits 31:16
- * Purpose: indicate how many 8-bit charaters follow the series of integers
+ * Purpose: indicate how many 8-bit characters follow the series of integers
  */
 struct htt_rx_test {
u8 num_ints;
@@ -1042,10 +1042,10 @@ struct htt_dbg_stats_wal_tx_stats {
/* illegal rate phy errors  */
__le32 illgl_rate_phy_err;
 
-   /* wal pdev continous xretry */
+   /* wal pdev continuous xretry */
__le32 pdev_cont_xretry;
 
-   /* wal pdev continous xretry */
+   /* wal pdev continuous xretry */
__le32 pdev_tx_timeout;
 
/* wal pdev resets  */
diff --git a/drivers/net/wireless/ath/ath10k/hw.c b/drivers/net/wireless/ath/ath10k/hw.c
index f903d46..ed917cb 100644
--- a/drivers/net/wireless/ath/ath10k/hw.c
+++ b/drivers/net/wireless/ath/ath10k/hw.c
@@ -85,7 +85,7 @@ const struct ath10k_hw_regs qca99x0_regs = {
.ce7_base_address   = 0x0004bc00,
/* Note: qca99x0 supports upto 12 Copy Engines. Other than address of
 * CE0 and CE1 no other copy engine is directly referred in the code.
-* It is not really neccessary to assign address for newly supported
+* It is not really necessary to assign address for newly supported
 * CEs in this address table.
 *  Copy Engine Address
 *  CE8 0x0004c000
diff --git a/drivers/net/wireless/ath/ath10k/hw.h b/drivers/net/wireless/ath/ath10k/hw.h
index e014cd7..93ca128 100644
--- a/drivers/net/wireless/ath/ath10k/hw.h
+++ b/drivers/net/wireless/ath/ath10k/hw.h
@@ -284,7 +284,7 @@ void ath10k_hw_fill_survey_time(struct ath10k *ar, struct survey_info *survey,
 #define QCA_REV_9377(ar) ((ar)->hw_rev == ATH10K_HW_QCA9377)
 #define QCA_REV_40XX(ar) ((ar)->hw_rev == ATH10K_HW_QCA4019)
 
-/* Known pecularities:
+/* Known peculiarities:
  *  - raw appears in nwifi decap, raw and nwifi appear in ethernet decap
  *  - raw have FCS, nwifi doesn't
  *  - ethernet frames have 802.11 header decapped and parts (base hdr, cipher
diff --git a/drivers/net/wireless/ath/ath10k/targaddrs.h b/drivers/net/wireless/ath/ath10k/targaddrs.h
index aaf53a8..a47cab4 100644
--- a/drivers/net/wireless/ath/ath10k/targaddrs.h
+++ b/drivers/net/wireless/ath/ath10k/targaddrs.h
@@ -405,7 +405,7 @@ Fw Mode/SubMode Mask
  * 1. target firmware would check magic number and if it's a match, firmware
  *would consider the bits[0:15] are valid and 

Re: [net-next RFC v2 4/9] bpf, security: Add Checmate security LSM and BPF program type

2016-08-29 Thread Tejun Heo
Hello,

On Mon, Aug 29, 2016 at 04:47:07AM -0700, Sargun Dhillon wrote:
> This patch adds a minor LSM, Checmate. Checmate is a flexible, programmable,
> extensible minor LSM that's coupled with cgroups and BPF. It is designed to
> enforce container-specific policies. It is also a cgroupv2 controller. By
> itself, it doesn't do very much, but in conjunction with an orchestrator
> complex policies can be installed on the cgroup hierarchy.
> 
> These cgroup programs are tied to the kernel ABI version. If one tries
> to load a BPF program compiled against a different kernel version,
> an error will be thrown.

First of all, please talk with people working on network cgroup bpf
and landlock.  I don't think it's a good idea to have N different ways
to implement cgroup-aware bpf mechanism.  There can be multiple
consumers but there gotta be a common mechanism instead of several
independent controllers.

> diff --git a/include/linux/checmate.h b/include/linux/checmate.h
> new file mode 100644
> index 000..4c4db4a
> --- /dev/null
> +++ b/include/linux/checmate.h
> @@ -0,0 +1,108 @@
> +#ifndef _LINUX_CHECMATE_H_
> +#define _LINUX_CHECMATE_H_ 1
> +#include 
> +
> +enum checmate_hook_num {
> + /* CONFIG_SECURITY_NET hooks */
> + CHECMATE_HOOK_UNIX_STREAM_CONNECT,
> + CHECMATE_HOOK_UNIX_MAY_SEND,
> + CHECMATE_HOOK_SOCKET_CREATE,
> + CHECMATE_HOOK_SOCKET_POST_CREATE,
> + CHECMATE_HOOK_SOCKET_BIND,
> + CHECMATE_HOOK_SOCKET_CONNECT,
> + CHECMATE_HOOK_SOCKET_LISTEN,
> + CHECMATE_HOOK_SOCKET_ACCEPT,
> + CHECMATE_HOOK_SOCKET_SENDMSG,
> + CHECMATE_HOOK_SOCKET_RECVMSG,
> + CHECMATE_HOOK_SOCKET_GETSOCKNAME,
> + CHECMATE_HOOK_SOCKET_GETPEERNAME,
> + CHECMATE_HOOK_SOCKET_GETSOCKOPT,
> + CHECMATE_HOOK_SOCKET_SETSOCKOPT,
> + CHECMATE_HOOK_SOCKET_SHUTDOWN,
> + CHECMATE_HOOK_SOCKET_SOCK_RCV_SKB,
> + CHECMATE_HOOK_SK_FREE_SECURITY,
> + __CHECMATE_HOOK_MAX,
> +};

Do we really want a separate hook for each call?  A logical extension
of this would be having a separate hook per syscall which feels kinda
horrible.

> +/* CONFIG_SECURITY_NET contexts */
> +struct checmate_unix_stream_connect_ctx {
> + struct sock *sock;
> + struct sock *other;
> + struct sock *newsk;
> +};
...
> +struct checmate_sk_free_security_ctx {
> + struct sock *sk;
> +};
...
> +struct checmate_ctx {
> + int hook;
> + union {
> +/* CONFIG_SECURITY_NET contexts */
> + struct checmate_unix_stream_connect_ctx unix_stream_connect;
> + struct checmate_unix_may_send_ctx   unix_may_send;
> + struct checmate_socket_create_ctx   socket_create;
> + struct checmate_socket_bind_ctx socket_bind;
> + struct checmate_socket_connect_ctx  socket_connect;
> + struct checmate_socket_listen_ctx   socket_listen;
> + struct checmate_socket_accept_ctx   socket_accept;
> + struct checmate_socket_sendmsg_ctx  socket_sendmsg;
> + struct checmate_socket_recvmsg_ctx  socket_recvmsg;
> + struct checmate_socket_sock_rcv_skb_ctx socket_sock_rcv_skb;
> + struct checmate_sk_free_security_ctxsk_free_security;
> + };
> +};

I'm not convinced about the approach.  It's an approach which pretty
much requires future extensions while being rigid.  Not a good
combination.

> +/*

Please use /** for function comments.

> + * checmate_instance_cleanup_rcu - Cleans up a Checmate program instance
> + * @rp: rcu_head pointer to a Checmate instance
> + */
> +static void checmate_instance_cleanup_rcu(struct rcu_head *rp)
> +{
> + struct checmate_instance *instance;
> +
> + instance = container_of(rp, struct checmate_instance, rcu);
> + bpf_prog_put(instance->prog);
> + kfree(instance);
> +}
> +static struct cftype checmate_files[] = {
> +#ifdef CONFIG_SECURITY_NETWORK
> + CHECMATE_CFTYPE(CHECMATE_HOOK_UNIX_STREAM_CONNECT,
> + "unix_stream_connect"),
> + CHECMATE_CFTYPE(CHECMATE_HOOK_UNIX_MAY_SEND,
> + "unix_may_send"),
> + CHECMATE_CFTYPE(CHECMATE_HOOK_SOCKET_CREATE, "socket_create"),
> + CHECMATE_CFTYPE(CHECMATE_HOOK_SOCKET_BIND, "socket_bind"),
> + CHECMATE_CFTYPE(CHECMATE_HOOK_SOCKET_CONNECT, "socket_connect"),
> + CHECMATE_CFTYPE(CHECMATE_HOOK_SOCKET_LISTEN, "socket_listen"),
> + CHECMATE_CFTYPE(CHECMATE_HOOK_SOCKET_ACCEPT, "socket_accept"),
> + CHECMATE_CFTYPE(CHECMATE_HOOK_SOCKET_SENDMSG, "socket_sendmsg"),
> + CHECMATE_CFTYPE(CHECMATE_HOOK_SOCKET_RECVMSG, "socket_recvmsg"),
> + CHECMATE_CFTYPE(CHECMATE_HOOK_SOCKET_SHUTDOWN, "socket_shutdown"),
> + CHECMATE_CFTYPE(CHECMATE_HOOK_SOCKET_SOCK_RCV_SKB,
> + "socket_sock_rcv_skb"),
> + CHECMATE_CFTYPE(CHECMATE_HOOK_SK_FREE_SECURITY, "sk_free_security"),
> +#endif /* CONFIG_SECURITY_NETWORK */
> + {}
> +};
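Reading the array above, userspace attachment presumably boils down to writing
a program fd into a per-hook file in the cgroup directory. A rough sketch,
where the "checmate." file prefix and the decimal-fd write format are
assumptions rather than anything the patch confirms:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static int checmate_attach(const char *cgroup_dir, const char *hook,
			   int prog_fd)
{
	char path[256], buf[16];
	int fd, len, ret = 0;

	/* assumed file name: <cgroup>/checmate.<hook> */
	snprintf(path, sizeof(path), "%s/checmate.%s", cgroup_dir, hook);
	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -1;
	len = snprintf(buf, sizeof(buf), "%d", prog_fd);
	if (write(fd, buf, len) != len)
		ret = -1;
	close(fd);
	return ret;
}

Usage would then be something like
checmate_attach("/sys/fs/cgroup/mygrp", "socket_bind", prog_fd);
with the path purely illustrative.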

I don't think this is a good interface approach.

> +struct 

Re: [PATCH net-next] tcp: add tcp_add_backlog()

2016-08-29 Thread Yuchung Cheng
On Sat, Aug 27, 2016 at 9:25 AM, Eric Dumazet  wrote:
>
> On Sat, 2016-08-27 at 09:13 -0700, Yuchung Cheng wrote:
> > On Sat, Aug 27, 2016 at 7:37 AM, Eric Dumazet  
> > wrote:
> > >
>
> > > +   /* Only socket owner can try to collapse/prune rx queues
> > > +* to reduce memory overhead, so add a little headroom here.
> > > +* Few sockets backlog are possibly concurrently non empty.
> > > +*/
> > > +   limit += 64*1024;
> > Just a thought: only add the headroom if the ofo queue exists (e.g., signs
> > of losses or recovery).
>
> Testing the ofo would add a cache line miss, and likely slow down the
> other cpu processing the other packets for this flow.
>
> Also, even if the ofo does not exist, the sk_rcvbuf budget can be
> consumed by regular receive queue.
>
> We still need to be able to process incoming ACK, if both send and
> receive queues are 'full'.
>
> >
> > btw is the added headroom subject to the memory pressure check?
>
> Remember that the backlog check here is mostly to avoid the kind of DOS
> attacks that we had in the past.
> 
> While we should definitely prevent DOS attacks, we should also not drop
> legitimate traffic.
>
> Here, number of backlogged sockets is limited by the number of cpus in
> the host (if CONFIG_PREEMPT is disabled), or number of threads blocked
> during a sendmsg()/recvmsg() (if CONFIG_PREEMPT is enabled)
>
> So we do not need to be ultra precise, just have a safe guard.
>
> The pressure check will be done at the time skbs will be added into a
> receive/ofo queue in the very near future.
Good to know. Thanks.
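To make the headroom point concrete, the pattern being discussed looks roughly
like this sketch (simplified; the real limit composition in the patch differs):

static bool sketch_tcp_add_backlog(struct sock *sk, struct sk_buff *skb)
{
	u32 limit = sk->sk_rcvbuf + sk->sk_sndbuf;

	/* Only the socket owner can collapse/prune rx queues, so keep
	 * a little headroom instead of dropping right at the edge. */
	limit += 64 * 1024;

	if (unlikely(sk_add_backlog(sk, skb, limit))) {
		bh_unlock_sock(sk);
		__NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPBACKLOGDROP);
		return true;	/* caller drops the skb */
	}
	return false;
}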

>
>
> Thanks !
>
>
>


Re: [PATCH net 1/2] ipv6: add missing netconf notif when 'all' is updated

2016-08-29 Thread Sergei Shtylyov

On 08/29/2016 05:13 PM, 吉藤英明 wrote:


The 'default' value was not advertised.

Fixes: f3a1bfb11ccb ("rtnl/ipv6: use netconf msg to advertise forwarding
status")
Signed-off-by: Nicolas Dichtel 
---
 net/ipv6/addrconf.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index f418d2eaeddd..299f0656e87f 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -778,7 +778,14 @@ static int addrconf_fixup_forwarding(struct ctl_table
*table, int *p, int newf)
}

if (p == &net->ipv6.devconf_all->forwarding) {
+   int old_dftl = net->ipv6.devconf_dflt->forwarding;
+
net->ipv6.devconf_dflt->forwarding = newf;
+   if ((!newf) ^ (!old_dftl))



   IIUC, !'s are not necessary here (and more so the parens around them).
And perhaps ^ can be changed to != for clarity...


No, it will break.


   Well, if those variables are actually bit masks, the !'s do seem to be
needed indeed. But ^ can be replaced by != anyway.
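A tiny standalone illustration of the point: the !'s normalize arbitrary
non-zero flag values to 0/1, after which ^ and != agree, while a plain !=
on the raw values would not:

#include <stdio.h>

int main(void)
{
	int newf = 2, old = 1;	/* both logically true, different values */

	printf("%d\n", (!newf) ^ (!old));	/* 0: no transition */
	printf("%d\n", (!newf) != (!old));	/* 0: same result */
	printf("%d\n", newf != old);		/* 1: would wrongly notify */
	return 0;
}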



--yoshfuji


MBR, Sergei



Fwd: ixgbe: Can't get link / carrier on embedded NIC of a D-1518 CPU

2016-08-29 Thread Sylvain Munaut
Hi,


I'm having trouble getting a link up on an ixgbe card.

The hardware is a SuperMicro SYS-5018D-FN8T (
https://www.supermicro.nl/products/system/1U/5018/SYS-5018D-FN8T.cfm )
It includes an Intel® Xeon® processor D-1518 with embedded 10G Intel
NICs and cs4227 phys to SFP+ ports.

The optic in it is an official Intel one, FTLX8571D3BCVIT1, but I
tried with some direct attach cables, some active optical cables,
and other optics as well, with the same results.

I tried both the latest stable 4.4.x and the 4.8.0-rc4 kernels.

There are no errors; the device just stays at NO-CARRIER. Even loading
the module with debug=16 doesn't show anything more.

Any idea what I can try?


Cheers,

Sylvain



Debug output :

8: eth6: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state
DOWN mode DEFAULT group default qlen 1000
link/ether 00:25:90:5d:f8:b2 brd ff:ff:ff:ff:ff:ff


bash# ethtool eth6
Settings for eth6:
Supported ports: [ FIBRE ]
Supported link modes:   1000baseT/Full
                        10000baseT/Full
Supported pause frame use: No
Supports auto-negotiation: No
Advertised link modes:  10000baseT/Full
Advertised pause frame use: No
Advertised auto-negotiation: No
Speed: Unknown!
Duplex: Unknown! (255)
Port: FIBRE
PHYAD: 0
Transceiver: external
Auto-negotiation: off
Supports Wake-on: d
Wake-on: d
Current message level: 0x7fff (32767)
   drv probe link timer ifdown ifup rx_err tx_err
tx_queued intr tx_done rx_status pktdata hw wol
Link detected: no

bash# ethtool -m eth6
Identifier: 0x03 (SFP)
Extended identifier   : 0x04 (GBIC/SFP defined
by 2-wire interface ID)
Connector : 0x07 (LC)
Transceiver codes : 0x10 0x00 0x00 0x01
0x00 0x00 0x00 0x00
Transceiver type  : 10G Ethernet: 10G Base-SR
Transceiver type  : Ethernet: 1000BASE-SX
Encoding  : 0x06 (64B/66B)
BR, Nominal   : 10300MBd
Rate identifier   : 0x02 (8/4/2G Rx
Rate_Select only)
Length (SMF,km)   : 0km
Length (SMF)  : 0m
Length (50um) : 80m
Length (62.5um)   : 30m
Length (Copper)   : 0m
Length (OM3)  : 300m
Laser wavelength  : 850nm
Vendor name   : Intel Corp
Vendor OUI: 00:1b:21
Vendor PN : FTLX8571D3BCVIT1
Vendor rev: A
Optical diagnostics support   : Yes
Laser bias current: 21.410 mA
Laser output power: 0.6035 mW / -2.19 dBm
Receiver signal average optical power : 0.7226 mW / -1.41 dBm
Module temperature: 41.82 degrees C /
107.27 degrees F
Module voltage: 3.3114 V
Alarm/warning flags implemented   : Yes
Laser bias current high alarm : Off
Laser bias current low alarm  : Off
Laser bias current high warning   : Off
Laser bias current low warning: Off
Laser output power high alarm : Off
Laser output power low alarm  : Off
Laser output power high warning   : Off
Laser output power low warning: Off
Module temperature high alarm : Off
Module temperature low alarm  : Off
Module temperature high warning   : Off
Module temperature low warning: Off
Module voltage high alarm : Off
Module voltage low alarm  : Off
Module voltage high warning   : Off
Module voltage low warning: Off
Laser rx power high alarm : Off
Laser rx power low alarm  : Off
Laser rx power high warning   : Off
Laser rx power low warning: Off
Laser bias current high alarm threshold   : 13.200 mA
Laser bias current low alarm threshold: 2.000 mA
Laser bias current high warning threshold : 12.600 mA
Laser bias current low warning threshold  : 3.000 mA
Laser output power high alarm threshold   : 1.0000 mW / 0.00 dBm
Laser output power low alarm threshold: 0.1585 mW / -8.00 dBm
Laser output power high warning threshold : 0.7943 mW / -1.00 dBm
Laser output power low warning threshold  : 0.1995 mW / -7.00 dBm
Module temperature high alarm threshold   : 78.00 degrees C /
172.40 degrees F
Module temperature low alarm threshold: -13.00 degrees C /
8.60 

Re: [RFC PATCH 1/6] proc: Reduce cache miss in {snmp,netstat}_seq_show

2016-08-29 Thread Eric Dumazet
On Tue, 2016-08-30 at 00:03 +0800, Jia He wrote:
> This patch exchanges the two loops for collecting the percpu statistics
> data. This can aggregate the data by going through all the items of each
> cpu sequentially. In snmp_seq_show, just use one buff copy to display the
> Udp and UdpLite data because they are the same.

This is obviously not true.

On my laptop it seems it handled no UdpLite frames, but got plenty of
Udp ones.

$ grep Udp /proc/net/snmp
Udp: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors 
InCsumErrors
Udp: 1555889 108318 0 3740780 0 0 0
UdpLite: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors 
InCsumErrors
UdpLite: 0 0 0 0 0 0 0





Re: [PATCH v4] brcmfmac: add missing header dependencies

2016-08-29 Thread Rafał Miłecki
On 29 August 2016 at 17:42, Baoyou Xie  wrote:
> On 29 August 2016 at 23:31, Rafał Miłecki  wrote:
>>
>> On 29 August 2016 at 14:39, Baoyou Xie  wrote:
>> > We get 1 warning when build kernel with W=1:
>> > drivers/net/wireless/broadcom/brcm80211/brcmfmac/tracepoint.c:23:6:
>> > warning: no previous prototype for '__brcmf_err' [-Wmissing-prototypes]
>>
building? I'm not a native English speaker, but I think so.
>>
>>
>> > In fact, this function is declared in brcmfmac/debug.h, so this patch
>> > add missing header dependencies.
>>
>> adds
>>
>>
>> > Signed-off-by: Baoyou Xie 
>> > Acked-by: Arnd Bergmann 
>>
Please don't resend patches just to add tags like that. This only
increases noise, and patchwork handles this just fine; see:
>> https://patchwork.kernel.org/patch/9303285/
>> https://patchwork.kernel.org/patch/9303285/mbox/
>
>
It fixes a typo: biuld/build.

Please send your replies to all, not privately.

OK, so you should also include a changelog in the comments part of the diff, e.g.:
V4: Fix typo in "biuld"


Re: [PATCH net v2 9/9] net: ethernet: mediatek: fix error handling inside mtk_mdio_init

2016-08-29 Thread Andrew Lunn
On Mon, Aug 29, 2016 at 01:03:23PM +0800, sean.w...@mediatek.com wrote:
> From: Sean Wang 
> 
> return -ENODEV if no child is found in MDIO bus.

The "no child" is wrong here, and got me confused. What the code is
actually doing is of_device_is_available() which is looking to see if
there is a status = "okay". So what the change log should say is:

Return -ENODEV if the MDIO bus is disabled in the device tree.

Once that has been corrected:

Reviewed-by: Andrew Lunn 

Andrew


> 
> Signed-off-by: Sean Wang 
> Acked-by: John Crispin 
> ---
>  drivers/net/ethernet/mediatek/mtk_eth_soc.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.c 
> b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
> index f741c6a..e48b2a4 100644
> --- a/drivers/net/ethernet/mediatek/mtk_eth_soc.c
> +++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
> @@ -304,7 +304,7 @@ static int mtk_mdio_init(struct mtk_eth *eth)
>   }
>  
>   if (!of_device_is_available(mii_np)) {
> - ret = 0;
> + ret = -ENODEV;
>   goto err_put_node;
>   }
>  
> -- 
> 1.9.1
> 


Re: [PATCH v2 net-next] documentation: Document issues with VMs and XPS and drivers enabling it on their own

2016-08-29 Thread Rick Jones

On 08/27/2016 12:41 PM, Tom Herbert wrote:

On Fri, Aug 26, 2016 at 9:35 PM, David Miller  wrote:

From: Tom Herbert 
Date: Thu, 25 Aug 2016 16:43:35 -0700


This seems like it will only confuse users even more. You've clearly
identified an issue, let's figure out how to fix it.


I kinda feel the same way about this situation.


I'm working on XFS (as the transmit analogue to RFS). We'll track
flows enough so that we should know when it's safe to move them.


Is the XFS you are working on going to subsume XPS or will the two 
continue to exist in parallel a la RPS and RFS?


rick jones



[RESEND PATCH v6 1/3] net: ethernet: dwmac: add Ethernet glue logic for stm32 chip

2016-08-29 Thread Alexandre TORGUE
From: Alexandre TORGUE 

stm32f4xx family chips support the Synopsys MAC 3.510 IP.
This patch adds settings for the logical glue logic:
- clocks
- mode selection: MII or RMII.

Reviewed-by: Joachim Eastwood 
Acked-by: Giuseppe Cavallaro 
Tested-by: Maxime Coquelin 
Signed-off-by: Alexandre TORGUE 

diff --git a/drivers/net/ethernet/stmicro/stmmac/Kconfig 
b/drivers/net/ethernet/stmicro/stmmac/Kconfig
index 8f06a66..c732b8c 100644
--- a/drivers/net/ethernet/stmicro/stmmac/Kconfig
+++ b/drivers/net/ethernet/stmicro/stmmac/Kconfig
@@ -104,6 +104,18 @@ config DWMAC_STI
  device driver. This driver is used on for the STi series
  SOCs GMAC ethernet controller.
 
+config DWMAC_STM32
+   tristate "STM32 DWMAC support"
+   default ARCH_STM32
+   depends on OF && HAS_IOMEM
+   select MFD_SYSCON
+   ---help---
+ Support for ethernet controller on STM32 SOCs.
+
+ This selects STM32 SoC glue layer support for the stmmac
+ device driver. This driver is used on for the STM32 series
+ SOCs GMAC ethernet controller.
+
 config DWMAC_SUNXI
tristate "Allwinner GMAC support"
default ARCH_SUNXI
diff --git a/drivers/net/ethernet/stmicro/stmmac/Makefile 
b/drivers/net/ethernet/stmicro/stmmac/Makefile
index 44b630c..f0c9396 100644
--- a/drivers/net/ethernet/stmicro/stmmac/Makefile
+++ b/drivers/net/ethernet/stmicro/stmmac/Makefile
@@ -13,6 +13,7 @@ obj-$(CONFIG_DWMAC_MESON) += dwmac-meson.o
 obj-$(CONFIG_DWMAC_ROCKCHIP)   += dwmac-rk.o
 obj-$(CONFIG_DWMAC_SOCFPGA)+= dwmac-altr-socfpga.o
 obj-$(CONFIG_DWMAC_STI)+= dwmac-sti.o
+obj-$(CONFIG_DWMAC_STM32)  += dwmac-stm32.o
 obj-$(CONFIG_DWMAC_SUNXI)  += dwmac-sunxi.o
 obj-$(CONFIG_DWMAC_GENERIC)+= dwmac-generic.o
 stmmac-platform-objs:= stmmac_platform.o
diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac-stm32.c 
b/drivers/net/ethernet/stmicro/stmmac/dwmac-stm32.c
new file mode 100644
index 000..79d8b92
--- /dev/null
+++ b/drivers/net/ethernet/stmicro/stmmac/dwmac-stm32.c
@@ -0,0 +1,193 @@
+/*
+ * dwmac-stm32.c - DWMAC Specific Glue layer for STM32 MCU
+ *
+ * Copyright (C) Alexandre Torgue 2015
+ * Author:  Alexandre Torgue 
+ * License terms:  GNU General Public License (GPL), version 2
+ *
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "stmmac_platform.h"
+
+#define MII_PHY_SEL_MASK   BIT(23)
+
+struct stm32_dwmac {
+   struct clk *clk_tx;
+   struct clk *clk_rx;
+   u32 mode_reg;   /* MAC glue-logic mode register */
+   struct regmap *regmap;
+   u32 speed;
+};
+
+static int stm32_dwmac_init(struct plat_stmmacenet_data *plat_dat)
+{
+   struct stm32_dwmac *dwmac = plat_dat->bsp_priv;
+   u32 reg = dwmac->mode_reg;
+   u32 val;
+   int ret;
+
+   val = (plat_dat->interface == PHY_INTERFACE_MODE_MII) ? 0 : 1;
+   ret = regmap_update_bits(dwmac->regmap, reg, MII_PHY_SEL_MASK, val);
+   if (ret)
+   return ret;
+
+   ret = clk_prepare_enable(dwmac->clk_tx);
+   if (ret)
+   return ret;
+
+   ret = clk_prepare_enable(dwmac->clk_rx);
+   if (ret)
+   clk_disable_unprepare(dwmac->clk_tx);
+
+   return ret;
+}
+
+static void stm32_dwmac_clk_disable(struct stm32_dwmac *dwmac)
+{
+   clk_disable_unprepare(dwmac->clk_tx);
+   clk_disable_unprepare(dwmac->clk_rx);
+}
+
+static int stm32_dwmac_parse_data(struct stm32_dwmac *dwmac,
+ struct device *dev)
+{
+   struct device_node *np = dev->of_node;
+   int err;
+
+   /*  Get TX/RX clocks */
+   dwmac->clk_tx = devm_clk_get(dev, "mac-clk-tx");
+   if (IS_ERR(dwmac->clk_tx)) {
+   dev_err(dev, "No tx clock provided...\n");
+   return PTR_ERR(dwmac->clk_tx);
+   }
+   dwmac->clk_rx = devm_clk_get(dev, "mac-clk-rx");
+   if (IS_ERR(dwmac->clk_rx)) {
+   dev_err(dev, "No rx clock provided...\n");
+   return PTR_ERR(dwmac->clk_rx);
+   }
+
+   /* Get mode register */
+   dwmac->regmap = syscon_regmap_lookup_by_phandle(np, "st,syscon");
+   if (IS_ERR(dwmac->regmap))
+   return PTR_ERR(dwmac->regmap);
+
+   err = of_property_read_u32_index(np, "st,syscon", 1, &dwmac->mode_reg);
+   if (err)
+   dev_err(dev, "Can't get sysconfig mode offset (%d)\n", err);
+
+   return err;
+}
+
+static int stm32_dwmac_probe(struct platform_device *pdev)
+{
+   struct plat_stmmacenet_data *plat_dat;
+   struct stmmac_resources stmmac_res;
+   struct stm32_dwmac *dwmac;
+   int ret;
+
+   ret = stmmac_get_platform_resources(pdev, &stmmac_res);
+   if (ret)
+   return ret;
+
+   plat_dat = 

[RESEND PATCH v6 3/3] net: ethernet: stmmac: add support of Synopsys 3.50a MAC IP

2016-08-29 Thread Alexandre TORGUE
From: Alexandre TORGUE 

Adds support of Synopsys 3.50a MAC IP in stmmac driver.

Acked-by: Giuseppe Cavallaro 
Tested-by: Maxime Coquelin 
Signed-off-by: Alexandre TORGUE 

diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c 
b/drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c
index 756bb54..0a0d6a8 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c
@@ -265,6 +265,7 @@ stmmac_probe_config_dt(struct platform_device *pdev, const 
char **mac)
 * once needed on other platforms.
 */
if (of_device_is_compatible(np, "st,spear600-gmac") ||
+   of_device_is_compatible(np, "snps,dwmac-3.50a") ||
of_device_is_compatible(np, "snps,dwmac-3.70a") ||
of_device_is_compatible(np, "snps,dwmac")) {
/* Note that the max-frame-size parameter as defined in the
-- 
1.9.1



[RESEND PATCH v6 0/3] Add Ethernet support on STM32F429

2016-08-29 Thread Alexandre TORGUE
STM32F429 Chip embeds a Synopsys 3.50a MAC IP.

This series enhances the current stmmac driver to control it (code already
available) and adds basic glue for the STM32F429 chip.

Changes since v5:
 -Fix typo in bindings documentation patch.
 -Change clocks names in stm32-dwmac glue driver / Documentation.
 -After rebase, the stm32 ethernet node is now available. It has to be updated
according to the new clock names.

Changes since v4:
 -Fix dirty copy/past in bindings documentation patch.

Changes since v3:
 -Fix "tx-clk" and "rx-clk" as required clocks. Driver and bindings are
modified.

Changes since v2:
 -Fix alphabetic order in Kconfig and Makefile.
 -Improve code according to Joachim review.
 -Binding: remove useless entry.

Changes since v1:
 -Fix Kbuild issue in Kconfig.
 -Remove init/exit callbacks. Suspend/Resume and remove driver is no more
driven in stmmac_pltfr but directly in dwmac-stm32 glue driver.
 -Take into account Joachim review.

Regards.

Alexandre.

Alexandre TORGUE (3):
  net: ethernet: dwmac: add Ethernet glue logic for stm32 chip
  Documentation: Bindings: Add STM32 DWMAC glue
  net: ethernet: stmmac: add support of Synopsys 3.50a MAC IP

 .../devicetree/bindings/net/stm32-dwmac.txt|  32 
 drivers/net/ethernet/stmicro/stmmac/Kconfig|  12 ++
 drivers/net/ethernet/stmicro/stmmac/Makefile   |   1 +
 drivers/net/ethernet/stmicro/stmmac/dwmac-stm32.c  | 193 +
 .../net/ethernet/stmicro/stmmac/stmmac_platform.c  |   1 +
 5 files changed, 239 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/net/stm32-dwmac.txt
 create mode 100644 drivers/net/ethernet/stmicro/stmmac/dwmac-stm32.c

-- 
1.9.1



[RESEND PATCH v6 2/3] Documentation: Bindings: Add STM32 DWMAC glue

2016-08-29 Thread Alexandre TORGUE
From: Alexandre TORGUE 

Acked-by: Rob Herring 
Signed-off-by: Alexandre TORGUE 

diff --git a/Documentation/devicetree/bindings/net/stm32-dwmac.txt 
b/Documentation/devicetree/bindings/net/stm32-dwmac.txt
new file mode 100644
index 000..c35afb7
--- /dev/null
+++ b/Documentation/devicetree/bindings/net/stm32-dwmac.txt
@@ -0,0 +1,32 @@
+STMicroelectronics STM32 / MCU DWMAC glue layer controller
+
+This file documents the platform glue layer for stmmac.
+Please see stmmac.txt for the other unchanged properties.
+
+The device node has the following properties.
+
+Required properties:
+- compatible:  Should be "st,stm32-dwmac" to select glue, and
+  "snps,dwmac-3.50a" to select IP version.
+- clocks: Must contain a phandle for each entry in clock-names.
+- clock-names: Should be "stmmaceth" for the host clock.
+  Should be "mac-clk-tx" for the MAC TX clock.
+  Should be "mac-clk-rx" for the MAC RX clock.
+- st,syscon : Should be phandle/offset pair. The phandle to the syscon node which
+ encompasses the glue register, and the offset of the control register.
+Example:
+
+   ethernet@40028000 {
+   compatible = "st,stm32-dwmac", "snps,dwmac-3.50a";
+   status = "disabled";
+   reg = <0x40028000 0x8000>;
+   reg-names = "stmmaceth";
+   interrupts = <0 61 0>, <0 62 0>;
+   interrupt-names = "macirq", "eth_wake_irq";
+   clock-names = "stmmaceth", "mac-clk-tx", "mac-clk-rx";
+   clocks = <&rcc 0 25>, <&rcc 0 26>, <&rcc 0 27>;
+   st,syscon = <&syscfg 0x4>;
+   snps,pbl = <8>;
+   snps,mixed-burst;
+   dma-ranges;
+   };
-- 
1.9.1



Re: [PATCH net v2 9/9] net: ethernet: mediatek: fix error handling inside mtk_mdio_init

2016-08-29 Thread Sean Wang
Date: Mon, 29 Aug 2016 15:15:58 +0200, Andrew Lunn wrote:
>On Mon, Aug 29, 2016 at 01:03:23PM +0800, sean.w...@mediatek.com wrote:
>> From: Sean Wang 
>> 
>> return -ENODEV if no child is found in MDIO bus.
>
>Hi Sean
>
>Why is it an error not to have any children on the bus?
>
>Say i have a fibre optical module connected to the MAC. It is unlikely
>to have an MII interface, so i would not list it on the bus. With this
>change, if i have the mdio-bus node in my device tree, i don't get a
>working system. Without this change, it simply does not instantiate
>the MDIO device, and returns without an error.
>
>I think this patch should be dropped, or maybe a comment adding, why
>the current code returns 0 at this point.
>
>Andrew
>
Hi Andrew,

Sorry, I didn't comment the patch well enough for you to see
what I was doing.

With this patch I just want the driver to find out, at an earlier
time, whether the device tree is well defined,

although the original driver still works without this patch.

The original logic in the driver is:

1) return -ENODEV if no mdio_bus is defined in the device tree.

2) return -ENODEV inside the ndo_init callback if no phy_dev is detected on the bus.

--
With the patch, the logic becomes:

1) return -ENODEV if no mdio_bus is defined in the device tree.

2) return -ENODEV if no phy_dev is defined in the device tree
a. this is used to check whether the phy_dev part of the device tree is well defined
b. and it is aligned with what 1) above is doing

3) return -ENODEV inside the ndo_init callback if no phy device is detected on the
bus
(this is used to test whether the phy_device is workable or hits real phy problems).

So adding 2) helps distinguish, at an earlier time, between a device
tree definition problem and a real phy problem.

That is my whole line of thought.

thanks,
Sean

>> 
>> Signed-off-by: Sean Wang 
>> Acked-by: John Crispin 
>> ---
>>  drivers/net/ethernet/mediatek/mtk_eth_soc.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>> 
>> diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.c 
>> b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
>> index f741c6a..e48b2a4 100644
>> --- a/drivers/net/ethernet/mediatek/mtk_eth_soc.c
>> +++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
>> @@ -304,7 +304,7 @@ static int mtk_mdio_init(struct mtk_eth *eth)
>>  }
>>  
>>  if (!of_device_is_available(mii_np)) {
>> -ret = 0;
>> +ret = -ENODEV;
>>  goto err_put_node;
>>  }
>>  
>> -- 
>> 1.9.1
>


[RFC PATCH 5/6] ipv6: Remove useless parameter in __snmp6_fill_statsdev

2016-08-29 Thread Jia He
The parameter items (always ICMP6_MIB_MAX) is useless for __snmp6_fill_statsdev.

Signed-off-by: Jia He 
---
 net/ipv6/addrconf.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index df8425f..0fa5aa1 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -4944,18 +4944,18 @@ static inline size_t inet6_if_nlmsg_size(void)
 }
 
 static inline void __snmp6_fill_statsdev(u64 *stats, atomic_long_t *mib,
- int items, int bytes)
+   int bytes)
 {
int i;
-   int pad = bytes - sizeof(u64) * items;
+   int pad = bytes - sizeof(u64) * ICMP6_MIB_MAX;
BUG_ON(pad < 0);
 
/* Use put_unaligned() because stats may not be aligned for u64. */
-   put_unaligned(items, &stats[0]);
-   for (i = 1; i < items; i++)
+   put_unaligned(ICMP6_MIB_MAX, &stats[0]);
+   for (i = 1; i < ICMP6_MIB_MAX; i++)
put_unaligned(atomic_long_read(&mib[i]), &stats[i]);
 
-   memset(&stats[items], 0, pad);
+   memset(&stats[ICMP6_MIB_MAX], 0, pad);
 }
 
 static inline void __snmp6_fill_stats64(u64 *stats, void __percpu *mib,
@@ -4988,7 +4988,7 @@ static void snmp6_fill_stats(u64 *stats, struct inet6_dev 
*idev, int attrtype,
 offsetof(struct ipstats_mib, syncp));
break;
case IFLA_INET6_ICMP6STATS:
-   __snmp6_fill_statsdev(stats, idev->stats.icmpv6dev->mibs, 
ICMP6_MIB_MAX, bytes);
+   __snmp6_fill_statsdev(stats, idev->stats.icmpv6dev->mibs, 
bytes);
break;
}
 }
-- 
1.8.3.1



[RFC PATCH 2/6] proc: Reduce cache miss in snmp6_seq_show

2016-08-29 Thread Jia He
This patch exchanges the two loops for collecting the percpu
statistics data. This can reduce cache misses by going through
all the items of each cpu sequentially.

Signed-off-by: Jia He 
---
 net/ipv6/proc.c | 47 ---
 1 file changed, 36 insertions(+), 11 deletions(-)

diff --git a/net/ipv6/proc.c b/net/ipv6/proc.c
index 679253d0..c834646 100644
--- a/net/ipv6/proc.c
+++ b/net/ipv6/proc.c
@@ -30,6 +30,13 @@
 #include 
 #include 
 
+#define MAX(a, b) ((u32)(a) >= (u32)(b) ? (a) : (b))
+
+#define MAX4(a, b, c, d) \
+   MAX(MAX(a, b), MAX(c, d))
+#define SNMP_MIB_MAX MAX4(UDP_MIB_MAX, TCP_MIB_MAX, \
+   IPSTATS_MIB_MAX, ICMP_MIB_MAX)
+
 static int sockstat6_seq_show(struct seq_file *seq, void *v)
 {
struct net *net = seq->private;
@@ -191,25 +198,43 @@ static void snmp6_seq_show_item(struct seq_file *seq, 
void __percpu *pcpumib,
atomic_long_t *smib,
const struct snmp_mib *itemlist)
 {
-   int i;
-   unsigned long val;
-
-   for (i = 0; itemlist[i].name; i++) {
-   val = pcpumib ?
-   snmp_fold_field(pcpumib, itemlist[i].entry) :
-   atomic_long_read(smib + itemlist[i].entry);
-   seq_printf(seq, "%-32s\t%lu\n", itemlist[i].name, val);
+   int i, c;
+   unsigned long buff[SNMP_MIB_MAX];
+
+   memset(buff, 0, sizeof(unsigned long) * SNMP_MIB_MAX);
+
+   if (pcpumib) {
+   for_each_possible_cpu(c)
+   for (i = 0; itemlist[i].name; i++)
+   buff[i] += snmp_get_cpu_field(pcpumib, c,
+   itemlist[i].entry);
+   for (i = 0; itemlist[i].name; i++)
+   seq_printf(seq, "%-32s\t%lu\n",
+  itemlist[i].name, buff[i]);
+   } else {
+   for (i = 0; itemlist[i].name; i++)
+   seq_printf(seq, "%-32s\t%lu\n", itemlist[i].name,
+  atomic_long_read(smib + itemlist[i].entry));
}
 }
 
 static void snmp6_seq_show_item64(struct seq_file *seq, void __percpu *mib,
  const struct snmp_mib *itemlist, size_t 
syncpoff)
 {
-   int i;
+   int i, c;
+   u64 buff[SNMP_MIB_MAX];
+
+   memset(buff, 0, sizeof(u64) * SNMP_MIB_MAX);
 
+   for_each_possible_cpu(c) {
+   for (i = 0; itemlist[i].name; i++) {
+   buff[i] += snmp_get_cpu_field64(mib, c,
+   itemlist[i].entry,
+   syncpoff);
+   }
+   }
for (i = 0; itemlist[i].name; i++)
-   seq_printf(seq, "%-32s\t%llu\n", itemlist[i].name,
-  snmp_fold_field64(mib, itemlist[i].entry, syncpoff));
+   seq_printf(seq, "%-32s\t%llu\n", itemlist[i].name, buff[i]);
 }
 
 static int snmp6_seq_show(struct seq_file *seq, void *v)
-- 
1.8.3.1



[RFC PATCH 1/6] proc: Reduce cache miss in {snmp,netstat}_seq_show

2016-08-29 Thread Jia He
This patch exchanges the two loops for collecting the percpu statistics
data. This can aggregate the data by going through all the items of each
cpu sequentially. In snmp_seq_show, just use one buff copy to display the
Udp and UdpLite data because they are the same.

Signed-off-by: Jia He 
---
 net/ipv4/proc.c | 78 -
 1 file changed, 50 insertions(+), 28 deletions(-)

diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c
index 9f665b6..ed09566 100644
--- a/net/ipv4/proc.c
+++ b/net/ipv4/proc.c
@@ -380,11 +380,15 @@ static void icmp_put(struct seq_file *seq)
  */
 static int snmp_seq_show(struct seq_file *seq, void *v)
 {
-   int i;
+   int i, c;
+   unsigned long buff[TCP_MIB_MAX];
+   u64 buff64[IPSTATS_MIB_MAX];
struct net *net = seq->private;
 
-   seq_puts(seq, "Ip: Forwarding DefaultTTL");
+   memset(buff, 0, TCP_MIB_MAX * sizeof(unsigned long));
+   memset(buff64, 0, IPSTATS_MIB_MAX * sizeof(u64));
 
+   seq_puts(seq, "Ip: Forwarding DefaultTTL");
for (i = 0; snmp4_ipstats_list[i].name != NULL; i++)
seq_printf(seq, " %s", snmp4_ipstats_list[i].name);
 
@@ -393,11 +397,16 @@ static int snmp_seq_show(struct seq_file *seq, void *v)
   net->ipv4.sysctl_ip_default_ttl);
 
BUILD_BUG_ON(offsetof(struct ipstats_mib, mibs) != 0);
+
+   for_each_possible_cpu(c) {
+   for (i = 0; snmp4_ipstats_list[i].name != NULL; i++)
+   buff64[i] += snmp_get_cpu_field64(
+   net->mib.ip_statistics,
+   c, snmp4_ipstats_list[i].entry,
+   offsetof(struct ipstats_mib, syncp));
+   }
for (i = 0; snmp4_ipstats_list[i].name != NULL; i++)
-   seq_printf(seq, " %llu",
-  snmp_fold_field64(net->mib.ip_statistics,
-snmp4_ipstats_list[i].entry,
-offsetof(struct ipstats_mib, 
syncp)));
+   seq_printf(seq, " %llu", buff64[i]);
 
icmp_put(seq);  /* RFC 2011 compatibility */
icmpmsg_put(seq);
@@ -407,38 +416,41 @@ static int snmp_seq_show(struct seq_file *seq, void *v)
seq_printf(seq, " %s", snmp4_tcp_list[i].name);
 
seq_puts(seq, "\nTcp:");
+   for_each_possible_cpu(c) {
+   for (i = 0; snmp4_tcp_list[i].name != NULL; i++)
+   buff[i] += snmp_get_cpu_field(net->mib.tcp_statistics,
+   c, snmp4_tcp_list[i].entry);
+   }
+
for (i = 0; snmp4_tcp_list[i].name != NULL; i++) {
/* MaxConn field is signed, RFC 2012 */
if (snmp4_tcp_list[i].entry == TCP_MIB_MAXCONN)
-   seq_printf(seq, " %ld",
-  snmp_fold_field(net->mib.tcp_statistics,
-  snmp4_tcp_list[i].entry));
+   seq_printf(seq, " %ld", buff[i]);
else
-   seq_printf(seq, " %lu",
-  snmp_fold_field(net->mib.tcp_statistics,
-  snmp4_tcp_list[i].entry));
+   seq_printf(seq, " %lu", buff[i]);
}
 
+   memset(buff, 0, TCP_MIB_MAX * sizeof(unsigned long));
+
+   for_each_possible_cpu(c) {
+   for (i = 0; snmp4_udp_list[i].name != NULL; i++)
+   buff[i] += snmp_get_cpu_field(net->mib.udp_statistics,
+   c, snmp4_udp_list[i].entry);
+   }
seq_puts(seq, "\nUdp:");
for (i = 0; snmp4_udp_list[i].name != NULL; i++)
seq_printf(seq, " %s", snmp4_udp_list[i].name);
-
seq_puts(seq, "\nUdp:");
for (i = 0; snmp4_udp_list[i].name != NULL; i++)
-   seq_printf(seq, " %lu",
-  snmp_fold_field(net->mib.udp_statistics,
-  snmp4_udp_list[i].entry));
+   seq_printf(seq, " %lu", buff[i]);
 
/* the UDP and UDP-Lite MIBs are the same */
seq_puts(seq, "\nUdpLite:");
for (i = 0; snmp4_udp_list[i].name != NULL; i++)
seq_printf(seq, " %s", snmp4_udp_list[i].name);
-
seq_puts(seq, "\nUdpLite:");
for (i = 0; snmp4_udp_list[i].name != NULL; i++)
-   seq_printf(seq, " %lu",
-  snmp_fold_field(net->mib.udplite_statistics,
-  snmp4_udp_list[i].entry));
+   seq_printf(seq, " %lu", buff[i]);
 
seq_putc(seq, '\n');
return 0;
@@ -464,29 +476,39 @@ static const struct file_operations snmp_seq_fops = {
  */
 static int netstat_seq_show(struct seq_file *seq, void *v)
 {
-   int i;
+   int 

[RFC PATCH 3/6] proc: Reduce cache miss in sctp_snmp_seq_show

2016-08-29 Thread Jia He
This patch exchanges the two loops for collecting the percpu
statistics data. This can reduce cache misses by going through
all the items of each cpu sequentially.

Signed-off-by: Jia He 
---
 net/sctp/proc.c | 13 ++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/net/sctp/proc.c b/net/sctp/proc.c
index ef8ba77..085fb95 100644
--- a/net/sctp/proc.c
+++ b/net/sctp/proc.c
@@ -74,12 +74,19 @@ static const struct snmp_mib sctp_snmp_list[] = {
 static int sctp_snmp_seq_show(struct seq_file *seq, void *v)
 {
struct net *net = seq->private;
-   int i;
+   int i, c;
+   unsigned long buff[SCTP_MIB_MAX];
 
+   memset(buff, 0, sizeof(unsigned long) * SCTP_MIB_MAX);
+
+   for_each_possible_cpu(c)
+   for (i = 0; sctp_snmp_list[i].name != NULL; i++)
+   buff[i] += snmp_get_cpu_field(
+   net->sctp.sctp_statistics,
+   c, sctp_snmp_list[i].entry);
for (i = 0; sctp_snmp_list[i].name != NULL; i++)
seq_printf(seq, "%-32s\t%ld\n", sctp_snmp_list[i].name,
-  snmp_fold_field(net->sctp.sctp_statistics,
- sctp_snmp_list[i].entry));
+   buff[i]);
 
return 0;
 }
-- 
1.8.3.1



[RFC PATCH 0/6] Reduce cache miss for snmp_fold_field

2016-08-29 Thread Jia He
In a PowerPC server with a large cpu count (160), besides commit
a3a773726c9f ("net: Optimize snmp stat aggregation by walking all
the percpu data at once"), I noticed several other snmp_fold_field
call sites which cause a high cache miss rate.

#My simple test case, which reads from the procfs items endlessly:
/***/
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#define LINELEN  2560
int main(int argc, char **argv)
{
int i;
int fd = -1 ;
int rdsize = 0;
char buf[LINELEN+1];

buf[LINELEN] = 0;
memset(buf,0,LINELEN);

if(1 >= argc) {
printf("file name empty\n");
return -1;
}

fd = open(argv[1], O_RDWR, 0644);
if(0 > fd){
printf("open error\n");
return -2;
}

for(i=0;i<0xffffffff;i++) {
while(0 < (rdsize = read(fd,buf,LINELEN))){
//nothing here
}

lseek(fd, 0, SEEK_SET);
}

close(fd);
return 0;
}
/**/

#compile and run:
gcc test.c -o test

perf stat -d -e cache-misses ./test /proc/net/snmp
perf stat -d -e cache-misses ./test /proc/net/snmp6
perf stat -d -e cache-misses ./test /proc/net/netstat
perf stat -d -e cache-misses ./test /proc/net/sctp/snmp
perf stat -d -e cache-misses ./test /proc/net/xfrm_stat

#test results
First, I tested the correctness of the statistics data.

As for performance, before the patch set:

 Performance counter stats for 'system wide':

  355911097  cache-misses                                                 [40.08%]
 2356829300  L1-dcache-loads                                              [60.04%]
  355642645  L1-dcache-load-misses     #   15.09% of all L1-dcache hits   [60.02%]
  346544541  LLC-loads                                                    [59.97%]
     389763  LLC-load-misses           #    0.11% of all LL-cache hits    [40.02%]

   6.245162638 seconds time elapsed

After the patch set:
===
 Performance counter stats for 'system wide':

  194992476  cache-misses                                                 [40.03%]
 6718051877  L1-dcache-loads                                              [60.07%]
  194871921  L1-dcache-load-misses     #    2.90% of all L1-dcache hits   [60.11%]
  187632232  LLC-loads                                                    [60.04%]
     464466  LLC-load-misses           #    0.25% of all LL-cache hits    [39.89%]

   6.868422769 seconds time elapsed
The cache-miss rate can be reduced from 15% to 2.9%.
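The transformation applied throughout the series is the same everywhere; in
miniature it looks like the sketch below (the function and the names mib, val
and n are illustrative, not the kernel's actual code):

static void sketch_fold(void __percpu *mib, unsigned long *val, int n)
{
	int i, c;

	/* before: snmp_fold_field() walks every cpu once per field,
	 * touching each per-cpu area n times in total */
	for (i = 0; i < n; i++)
		val[i] = snmp_fold_field(mib, i);

	/* after: walk each cpu once, accumulating every field, so each
	 * per-cpu area is traversed sequentially and stays cache-hot */
	memset(val, 0, n * sizeof(*val));
	for_each_possible_cpu(c)
		for (i = 0; i < n; i++)
			val[i] += snmp_get_cpu_field(mib, c, i);
}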

Jia He (6):
  proc: Reduce cache miss in {snmp,netstat}_seq_show
  proc: Reduce cache miss in snmp6_seq_show
  proc: Reduce cache miss in sctp_snmp_seq_show
  proc: Reduce cache miss in xfrm_statistics_seq_show
  ipv6: Remove useless parameter in __snmp6_fill_statsdev
  net: Suppress the "Comparison to NULL could be written" warning

 net/ipv4/proc.c  | 110 ++-
 net/ipv6/addrconf.c  |  12 +++---
 net/ipv6/proc.c  |  47 --
 net/sctp/proc.c  |  15 +--
 net/xfrm/xfrm_proc.c |  15 +--
 5 files changed, 131 insertions(+), 68 deletions(-)

-- 
1.8.3.1



[RFC PATCH 4/6] proc: Reduce cache miss in xfrm_statistics_seq_show

2016-08-29 Thread Jia He
This patch exchanges the two loops for collecting the percpu
statistics data. This can reduce cache misses by going through
all the items of each cpu sequentially.

Signed-off-by: Jia He 
---
 net/xfrm/xfrm_proc.c | 17 +
 1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/net/xfrm/xfrm_proc.c b/net/xfrm/xfrm_proc.c
index 9c4fbd8..c9df546 100644
--- a/net/xfrm/xfrm_proc.c
+++ b/net/xfrm/xfrm_proc.c
@@ -51,11 +51,20 @@ static const struct snmp_mib xfrm_mib_list[] = {
 static int xfrm_statistics_seq_show(struct seq_file *seq, void *v)
 {
struct net *net = seq->private;
-   int i;
-   for (i = 0; xfrm_mib_list[i].name; i++)
+   int i, c;
+   unsigned long buff[LINUX_MIB_XFRMMAX];
+
+   memset(buff, 0, sizeof(unsigned long) * LINUX_MIB_XFRMMAX);
+
+   for_each_possible_cpu(c)
+   for (i = 0; xfrm_mib_list[i].name != NULL; i++)
+   buff[i] += snmp_get_cpu_field(
+   net->mib.xfrm_statistics,
+   c, xfrm_mib_list[i].entry);
+   for (i = 0; xfrm_mib_list[i].name != NULL; i++)
seq_printf(seq, "%-24s\t%lu\n", xfrm_mib_list[i].name,
-  snmp_fold_field(net->mib.xfrm_statistics,
-  xfrm_mib_list[i].entry));
+   buff[i]);
+
return 0;
 }
 
-- 
1.8.3.1



[RFC PATCH 6/6] net: Suppress the "Comparison to NULL could be written" warning

2016-08-29 Thread Jia He
This is to suppress the checkpatch.pl warning "Comparison to NULL
could be written". No functional changes here. 

Signed-off-by: Jia He 
---
 net/ipv4/proc.c  | 42 +-
 net/sctp/proc.c  |  4 ++--
 net/xfrm/xfrm_proc.c |  4 ++--
 3 files changed, 25 insertions(+), 25 deletions(-)

diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c
index ed09566..8cbf3e0 100644
--- a/net/ipv4/proc.c
+++ b/net/ipv4/proc.c
@@ -355,22 +355,22 @@ static void icmp_put(struct seq_file *seq)
atomic_long_t *ptr = net->mib.icmpmsg_statistics->mibs;
 
seq_puts(seq, "\nIcmp: InMsgs InErrors InCsumErrors");
-   for (i = 0; icmpmibmap[i].name != NULL; i++)
+   for (i = 0; icmpmibmap[i].name; i++)
seq_printf(seq, " In%s", icmpmibmap[i].name);
seq_puts(seq, " OutMsgs OutErrors");
-   for (i = 0; icmpmibmap[i].name != NULL; i++)
+   for (i = 0; icmpmibmap[i].name; i++)
seq_printf(seq, " Out%s", icmpmibmap[i].name);
seq_printf(seq, "\nIcmp: %lu %lu %lu",
snmp_fold_field(net->mib.icmp_statistics, ICMP_MIB_INMSGS),
snmp_fold_field(net->mib.icmp_statistics, ICMP_MIB_INERRORS),
snmp_fold_field(net->mib.icmp_statistics, ICMP_MIB_CSUMERRORS));
-   for (i = 0; icmpmibmap[i].name != NULL; i++)
+   for (i = 0; icmpmibmap[i].name; i++)
seq_printf(seq, " %lu",
   atomic_long_read(ptr + icmpmibmap[i].index));
seq_printf(seq, " %lu %lu",
snmp_fold_field(net->mib.icmp_statistics, ICMP_MIB_OUTMSGS),
snmp_fold_field(net->mib.icmp_statistics, ICMP_MIB_OUTERRORS));
-   for (i = 0; icmpmibmap[i].name != NULL; i++)
+   for (i = 0; icmpmibmap[i].name; i++)
seq_printf(seq, " %lu",
   atomic_long_read(ptr + (icmpmibmap[i].index | 
0x100)));
 }
@@ -389,7 +389,7 @@ static int snmp_seq_show(struct seq_file *seq, void *v)
memset(buff64, 0, IPSTATS_MIB_MAX * sizeof(u64));
 
seq_puts(seq, "Ip: Forwarding DefaultTTL");
-   for (i = 0; snmp4_ipstats_list[i].name != NULL; i++)
+   for (i = 0; snmp4_ipstats_list[i].name; i++)
seq_printf(seq, " %s", snmp4_ipstats_list[i].name);
 
seq_printf(seq, "\nIp: %d %d",
@@ -399,30 +399,30 @@ static int snmp_seq_show(struct seq_file *seq, void *v)
BUILD_BUG_ON(offsetof(struct ipstats_mib, mibs) != 0);
 
for_each_possible_cpu(c) {
-   for (i = 0; snmp4_ipstats_list[i].name != NULL; i++)
+   for (i = 0; snmp4_ipstats_list[i].name; i++)
buff64[i] += snmp_get_cpu_field64(
net->mib.ip_statistics,
c, snmp4_ipstats_list[i].entry,
offsetof(struct ipstats_mib, syncp));
}
-   for (i = 0; snmp4_ipstats_list[i].name != NULL; i++)
+   for (i = 0; snmp4_ipstats_list[i].name; i++)
seq_printf(seq, " %llu", buff64[i]);
 
icmp_put(seq);  /* RFC 2011 compatibility */
icmpmsg_put(seq);
 
seq_puts(seq, "\nTcp:");
-   for (i = 0; snmp4_tcp_list[i].name != NULL; i++)
+   for (i = 0; snmp4_tcp_list[i].name; i++)
seq_printf(seq, " %s", snmp4_tcp_list[i].name);
 
seq_puts(seq, "\nTcp:");
for_each_possible_cpu(c) {
-   for (i = 0; snmp4_tcp_list[i].name != NULL; i++)
+   for (i = 0; snmp4_tcp_list[i].name; i++)
buff[i] += snmp_get_cpu_field(net->mib.tcp_statistics,
c, snmp4_tcp_list[i].entry);
}
 
-   for (i = 0; snmp4_tcp_list[i].name != NULL; i++) {
+   for (i = 0; snmp4_tcp_list[i].name; i++) {
/* MaxConn field is signed, RFC 2012 */
if (snmp4_tcp_list[i].entry == TCP_MIB_MAXCONN)
seq_printf(seq, " %ld", buff[i]);
@@ -433,23 +433,23 @@ static int snmp_seq_show(struct seq_file *seq, void *v)
memset(buff, 0, TCP_MIB_MAX * sizeof(unsigned long));
 
for_each_possible_cpu(c) {
-   for (i = 0; snmp4_udp_list[i].name != NULL; i++)
+   for (i = 0; snmp4_udp_list[i].name; i++)
buff[i] += snmp_get_cpu_field(net->mib.udp_statistics,
c, snmp4_udp_list[i].entry);
}
seq_puts(seq, "\nUdp:");
-   for (i = 0; snmp4_udp_list[i].name != NULL; i++)
+   for (i = 0; snmp4_udp_list[i].name; i++)
seq_printf(seq, " %s", snmp4_udp_list[i].name);
seq_puts(seq, "\nUdp:");
-   for (i = 0; snmp4_udp_list[i].name != NULL; i++)
+   for (i = 0; snmp4_udp_list[i].name; i++)
seq_printf(seq, " %lu", buff[i]);
 
/* the UDP and UDP-Lite MIBs are the same */
seq_puts(seq, "\nUdpLite:");
- 

Re: [PATCH] net/mlx4_en: protect ring->xdp_prog with rcu_read_lock

2016-08-29 Thread Brenden Blanco
On Mon, Aug 29, 2016 at 05:59:26PM +0300, Tariq Toukan wrote:
> Hi Brenden,
> 
> The solution direction should be XDP specific that does not hurt the
> regular flow.
An rcu_read_lock is _already_ taken for _every_ packet. This is 1/64th of
that.
> 
> On 26/08/2016 11:38 PM, Brenden Blanco wrote:
> >Depending on the preempt mode, the bpf_prog stored in xdp_prog may be
> >freed despite the use of call_rcu inside bpf_prog_put. The situation is
> >possible when running in PREEMPT_RCU=y mode, for instance, since the rcu
> >callback for destroying the bpf prog can run even during the bh handling
> >in the mlx4 rx path.
> >
> >Several options were considered before this patch was settled on:
> >
> >Add a napi_synchronize loop in mlx4_xdp_set, which would occur after all
> >of the rings are updated with the new program.
> >This approach has the disadvantage that as the number of rings
> >increases, the speed of update will slow down significantly due to
> >napi_synchronize's msleep(1).
> I prefer this option as it doesn't hurt the data path. A delay in a
> control command can be tolerated.
> >Add a new rcu_head in bpf_prog_aux, to be used by a new bpf_prog_put_bh.
> >The action of the bpf_prog_put_bh would be to then call bpf_prog_put
> >later. Those drivers that consume a bpf prog in a bh context (like mlx4)
> >would then use the bpf_prog_put_bh instead when the ring is up. This has
> >the problem of complexity, in maintaining proper refcnts and rcu lists,
> >and would likely be harder to review. In addition, this approach to
> >freeing must be exclusive with other frees of the bpf prog, for instance
> >a _bh prog must not be referenced from a prog array that is consumed by
> >a non-_bh prog.
> >
> >The placement of rcu_read_lock in this patch is functionally the same as
> >putting an rcu_read_lock in napi_poll. Actually doing so could be a
> >potentially controversial change, but would bring the implementation in
> >line with sk_busy_loop (though of course the nature of those two paths
> >is substantially different), and would also avoid future copy/paste
> >problems with future supporters of XDP. Still, this patch does not take
> >that opinionated option.
> So you decided to add a lock for all non-XDP flows, which are 99% of
> the cases.
> We should avoid this.
The whole point of rcu_read_lock architecture is to be taken in the fast
path. There won't be a performance impact from this patch.
> >
> >Testing was done with kernels in either PREEMPT_RCU=y or
> >CONFIG_PREEMPT_VOLUNTARY=y+PREEMPT_RCU=n modes, with neither exhibiting
> >any drawback. With PREEMPT_RCU=n, the extra call to rcu_read_lock did
> >not show up in the perf report whatsoever, and with PREEMPT_RCU=y the
> >overhead of rcu_read_lock (according to perf) was the same before/after.
> >In the rx path, rcu_read_lock is eventually called for every packet
> >from netif_receive_skb_internal, so the napi poll call's rcu_read_lock
> >is easily amortized.
> For now, I don't agree with this fix.
> Let me think about the options you suggested.
> I also need to do my perf tests.
> 
> Regards,
> Tariq
> 
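For reference, the pattern being debated looks roughly like this in the rx
path (a sketch, assuming ring->xdp_prog is annotated __rcu as the patch
proposes; mlx4's actual code is structured differently):

static u32 sketch_run_xdp(struct mlx4_en_rx_ring *ring, struct xdp_buff *xdp)
{
	struct bpf_prog *prog;
	u32 act = XDP_PASS;

	rcu_read_lock();
	prog = rcu_dereference(ring->xdp_prog);
	if (prog)
		act = BPF_PROG_RUN(prog, xdp);	/* cannot be freed here */
	rcu_read_unlock();

	return act;
}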


Re: [net-next PATCH] e1000: add initial XDP support

2016-08-29 Thread Jesper Dangaard Brouer

Hi Jamal,

I'm adding a drop-a-specific-UDP-port option to my script... but it does
not match/drop the packets; the command below does apply, but it does not
work in practice:

$ ./tc_ingress_drop.sh --verbose --dev mlx5p2 --port 9
tc qdisc del dev mlx5p2 ingress
tc qdisc add dev mlx5p2 ingress
tc filter add dev mlx5p2 parent ffff: prio 4 protocol ip u32 match ip protocol 
17 0xff match udp dst 9 0xffff flowid 1:1 action drop

(Use-case is obviously to drop pktgen UDP packets.)

I also tried with:

 tc filter add dev mlx5p2 parent ffff: prio 4 protocol ip \
  u32 \
  match udp dst 9 0xffff \
  match ip protocol 17 0xff flowid 1:1 action drop

--Jesper
(top post)

On Mon, 29 Aug 2016 15:39:05 +0200 Jesper Dangaard Brouer  
wrote:

> On Mon, 29 Aug 2016 06:53:53 -0400
> Jamal Hadi Salim  wrote:
> 
> > On 16-08-29 04:30 AM, Jesper Dangaard Brouer wrote:
> >   
> > > Hi Jamal,
> > >
> > > Can you please provide a simple "tc" command that implements "tc drop"?
> > >
> > > Then, I'll add this to the series of tests I'm using for (what I call)
> > > "zoom-in" benchmarking.
> > >
> > 
> > Thanks Jesper.  
> 
> I've created a script called tc_ingress_drop.sh[1] which uses the
> commands you provided below.  Now people can easily use this script to
> perform the benchmark you were requesting ;-)
> 
> [1] 
> https://github.com/netoptimizer/network-testing/blob/master/bin/tc_ingress_drop.sh
> 
> Example to enable dropping:
> 
>  $ ./tc_ingress_drop.sh --dev mlx5p2 --verbose
>  # (Not root, running with sudo)
>  # Flush existing ingress qdisc on device: mlx5p2
>  tc qdisc del dev mlx5p2 ingress
>  tc qdisc add dev mlx5p2 ingress
>  # Simply drop all ingress packets on device: mlx5p2
>  tc filter add dev mlx5p2 parent ffff: prio 2 protocol ip u32 match u32 0 0 
> flowid 1:1 action drop
> 
> Example to disable again:
>  ./tc_ingress_drop.sh --dev mlx5p2 --flush
> 
>  
> > Something simple since this is done in ingress; lets say drop icmp
> > packets:
> > 
> > export ETH=eth0
> > export TC=/sbin/tc
> > #delete existing ingress qdisc - flushes all filters/actions
> > sudo $TC qdisc del dev $ETH ingress
> > #re-add ingress
> > sudo $TC qdisc add dev $ETH ingress
> > #
> > #simple rule to drop all icmp
> > sudo $TC filter add dev $ETH parent ffff: prio 4 protocol ip \
> > u32 match ip protocol 1 0xff flowid 1:1 \
> > action drop
> > 
> > # other type of filters if you want to compare instead of above
> > #
> > # a)drop all
> > sudo $TC filter add dev $ETH parent ffff: prio 2 protocol ip \
> > u32 match u32 0 0 flowid 1:1 \
> > action drop
> > #b) drop if src is XXX
> > sudo $TC filter add dev $ETH parent ffff: prio 2 protocol ip \
> > u32 match ip src 192.168.100.1 flowid 1:1 \
> > action drop
> >   



-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer


Re: [PATCH net-next 0/3] strp: Generalize stream parser to work with other socket types

2016-08-29 Thread Tom Herbert
On Sun, Aug 28, 2016 at 8:34 PM, David Miller  wrote:
> From: Tom Herbert 
> Date: Sun, 28 Aug 2016 14:43:16 -0700
>
>> Add a read_sock protocol operation function that allows something like
>> tcp_read_sock to be called for other protocol types.
>>
>> Specific changes in this patch set:
>>   - Add read_sock function to proto_ops. This has the same signature as
>> tcp_read_sock. sk_read_actor_t is also defined in net.h.
>>   - Set peek_len and read_sock proto_op functions for TCPv4 and TCPv6
>> stream ops.
>>   - Remove references to tcp in strparser.
>>   - Call peek_len and read_sock operations from strparser instead of
>> calling TCP specific functions.
>
> I'll apply this, but I want you to shore up these new ops.
>
> A check has to happen somewhere to make sure the proto_ops in
> question have a non-NULL read_sock and peek_len method before
> starting to use it.

Is the check in strp_init not enough?...

@@ -424,9 +429,14 @@ static void strp_rx_msg_timeout(unsigned long arg)
 int strp_init(struct strparser *strp, struct sock *csk,
  struct strp_callbacks *cb)
 {
+   struct socket *sock = csk->sk_socket;
+
if (!cb || !cb->rcv_msg || !cb->parse_msg)
return -EINVAL;

+   if (!sock->ops->read_sock || !sock->ops->peek_len)
+   return -EAFNOSUPPORT;
+
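For completeness, wiring the two new ops for a stream protocol would look
roughly like this (a sketch; tcp_peek_len is the helper the series
description implies for TCP):

static const struct proto_ops sketch_stream_ops = {
	.family    = PF_INET,
	/* ... the usual inet stream handlers ... */
	.read_sock = tcp_read_sock,
	.peek_len  = tcp_peek_len,
};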


Re: [RFCv2 01/16] add basic register-field manipulation macros

2016-08-29 Thread Daniel Borkmann

On 08/29/2016 05:07 PM, Jakub Kicinski wrote:

On Mon, 29 Aug 2016 16:34:25 +0200, Daniel Borkmann wrote:

On 08/26/2016 08:06 PM, Jakub Kicinski wrote:

Common approach to accessing register fields is to define
structures or sets of macros containing mask and shift pair.
Operations on the register are then performed as follows:

   field = (reg >> shift) & mask;

   reg &= ~(mask << shift);
   reg |= (field & mask) << shift;

Defining shift and mask separately is tedious.  Ivo van Doorn
came up with an idea of computing them at compilation time
based on a single shifted mask (later refined by Felix) which
can be used like this:

   #define REG_FIELD 0x000ff000

   field = FIELD_GET(REG_FIELD, reg);

   reg &= ~REG_FIELD;
   reg |= FIELD_PREP(REG_FIELD, field);

FIELD_{GET,PREP} macros take care of finding out what the
appropriate shift is based on compilation time ffs operation.

GENMASK can be used to define registers (which is usually
less error-prone and easier to match with datasheets).

This approach is the most convenient I've seen so to limit code
multiplication let's move the macros to a global header file.
Attempts to use static inlines instead of macros failed due
to false positive triggering of BUILD_BUG_ON()s, especially with
GCC < 6.0.

Signed-off-by: Jakub Kicinski 

[...]

+ * Bitfield access macros
+ *
+ * FIELD_{GET,PREP} macros take as first parameter shifted mask
+ * from which they extract the base mask and shift amount.
+ * Mask must be a compilation time constant.
+ *
+ * Example:
+ *
+ *  #define REG_FIELD_A  GENMASK(6, 0)
+ *  #define REG_FIELD_B  BIT(7)
+ *  #define REG_FIELD_C  GENMASK(15, 8)
+ *  #define REG_FIELD_D  GENMASK(31, 16)
+ *
+ * Get:
+ *  a = FIELD_GET(REG_FIELD_A, reg);
+ *  b = FIELD_GET(REG_FIELD_B, reg);
+ *
+ * Set:
+ *  reg = FIELD_PREP(REG_FIELD_A, 1) |
+ *   FIELD_PREP(REG_FIELD_B, 0) |
+ *   FIELD_PREP(REG_FIELD_C, c) |
+ *   FIELD_PREP(REG_FIELD_D, 0x40);
+ *
+ * Modify:
+ *  reg &= ~REG_FIELD_C;
+ *  reg |= FIELD_PREP(REG_FIELD_C, c);
+ */
+
+#define _bf_shf(x) (__builtin_ffsll(x) - 1)
+
+#define _BF_FIELD_CHECK(_mask, _reg, _val, _pfx)   \


Nit: if possible, please always use "__" instead of "_" as prefix, which is
more common coding style in the kernel.


I went with single underscore, because my understanding was:
  - no underscore - safe, "user-facing" API;
  - two underscores - internal, make sure you know how to use it;
  - single underscore - library internals, shouldn't be touched.


That convention would be new to me, at least I haven't seen it much (see
also recent comment on the act_tunnel set). Still think two underscores
is generally preferred (unless this is somewhere documented otherwise).


I don't expect anyone to invoke those macros, the underscore is
there to avoid collisions.


+   ({  \
+   BUILD_BUG_ON_MSG(!__builtin_constant_p(_mask),  \
+_pfx "mask is not constant");\
+   BUILD_BUG_ON_MSG(!(_mask), _pfx "mask is zero");  \
+   BUILD_BUG_ON_MSG(__builtin_constant_p(_val) ?   \
+~((_mask) >> _bf_shf(_mask)) & (_val) : 0, \
+_pfx "value too large for the field"); \
+   BUILD_BUG_ON_MSG((_mask) > (typeof(_reg))~0ull,  \
+_pfx "type of reg too small for mask"); \
+   __BUILD_BUG_ON_NOT_POWER_OF_2((_mask) + \
+ (1ULL << _bf_shf(_mask))); \
+   })
+
+/**
+ * FIELD_PREP() - prepare a bitfield element
+ * @_mask: shifted mask defining the field's length and position
+ * @_val:  value to put in the field
+ *
+ * FIELD_PREP() masks and shifts up the value.  The result should
+ * be combined with other fields of the bitfield using logical OR.
+ */
+#define FIELD_PREP(_mask, _val)
\
+   ({  \
+   _BF_FIELD_CHECK(_mask, 0ULL, _val, "FIELD_PREP: ");   \
+   ((typeof(_mask))(_val) << _bf_shf(_mask)) & (_mask);  \
+   })
+
+/**
+ * FIELD_GET() - extract a bitfield element
+ * @_mask: shifted mask defining the field's length and position
+ * @_reg:  32bit value of entire bitfield
+ *
+ * FIELD_GET() extracts the field specified by @_mask from the
+ * bitfield passed in as @_reg by masking and shifting it down.
+ */
+#define FIELD_GET(_mask, _reg) \
+   ({  \
+   _BF_FIELD_CHECK(_mask, _reg, 0U, "FIELD_GET: ");  \
+   (typeof(_mask))(((_reg) & (_mask)) >> _bf_shf(_mask));\
+   })
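To see what _BF_FIELD_CHECK() buys in practice, a quick sketch with
hypothetical register fields; the two calls in the trailing comment would
fail at build time:

#define REG_SPEED	GENMASK(3, 0)
#define REG_DUPLEX	BIT(4)

static void sketch(void)
{
	u32 reg, speed;

	reg = FIELD_PREP(REG_SPEED, 7) | FIELD_PREP(REG_DUPLEX, 1);
	speed = FIELD_GET(REG_SPEED, reg);	/* == 7 */

	/* Both of these would trip BUILD_BUG_ON_MSG() at compile time:
	 *   FIELD_PREP(REG_SPEED, 0x1f);   value too large for the field
	 *   FIELD_PREP(mask_var, 1);       mask is not constant
	 */
}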


No strong opinion, but FIELD_PREP() sounds a bit weird. Maybe rather a

Re: [PATCH v4] brcmfmac: add missing header dependencies

2016-08-29 Thread Rafał Miłecki
On 29 August 2016 at 14:39, Baoyou Xie  wrote:
> We get 1 warning when build kernel with W=1:
> drivers/net/wireless/broadcom/brcm80211/brcmfmac/tracepoint.c:23:6: warning: 
> no previous prototype for '__brcmf_err' [-Wmissing-prototypes]

building? I'm not native English, but I think so.


> In fact, this function is declared in brcmfmac/debug.h, so this patch
> add missing header dependencies.

adds


> Signed-off-by: Baoyou Xie 
> Acked-by: Arnd Bergmann 

Please don't resend patches just to add tags like that. This only
increases noise, and patchwork handles this just fine; see:
https://patchwork.kernel.org/patch/9303285/
https://patchwork.kernel.org/patch/9303285/mbox/


Re: [RFC 2/9] ipv6: sr: add code base for control plane support of SR-IPv6

2016-08-29 Thread Roopa Prabhu
On 8/26/16, 8:52 AM, David Lebrun wrote:
> This patch adds the necessary hooks and structures to provide support
> for SR-IPv6 control plane, essentially the Generic Netlink commands
> that will be used for userspace control over the Segment Routing
> kernel structures.
>
> The genetlink commands provide control over two different structures:
> tunnel source and HMAC data. The tunnel source is the source address
> that will be used by default when encapsulating packets into an
> outer IPv6 header + SRH. If the tunnel source is set to :: then an
> address of the outgoing interface will be selected as the source.
>
> The HMAC commands currently just return ENOTSUPP and will be implemented
> in a future patch.
>
> Signed-off-by: David Lebrun 
>
This looks fine. But I am just trying to see if this could be rtnetlink.
Have you considered it already?
We would like to keep the API consistent or abstracted to accommodate SR-MPLS
in the future too, so any abstraction there will help.

What control-plane software is using this?
Quagga or any other routing daemon?

Thanks,
Roopa
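For readers following along, the command surface described above is roughly of
this shape (the names here are hypothetical placeholders, not the RFC's actual
identifiers):

enum sketch_seg6_genl_cmd {
	SKETCH_SEG6_CMD_SET_TUNSRC,	/* default source for encap */
	SKETCH_SEG6_CMD_GET_TUNSRC,
	SKETCH_SEG6_CMD_SETHMAC,	/* returns -ENOTSUPP for now */
	SKETCH_SEG6_CMD_GETHMAC,
	__SKETCH_SEG6_CMD_MAX,
};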


