Re: [PATCH v2 08/13] ASoC: pxa: remove the dmaengine compat need

2018-05-25 Thread Daniel Mack

On Thursday, May 24, 2018 09:06 AM, Robert Jarzmik wrote:

As the pxa architecture switched towards the dmaengine slave map, the
old compatibility mechanism to acquire the dma requestor line number and
priority is not needed anymore.

This patch simplifies the dma resource acquisition, using the more
generic function dma_request_slave_channel().
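
For context, a minimal sketch of the call pattern this converts to; the
device pointer and the error handling below are illustrative only, not
taken from the patch:

#include <linux/dmaengine.h>

/* Sketch only: with the requestor line and priority now described in the
 * architecture's dma_slave_map, a client driver asks for a channel by
 * device and name instead of passing a pxad_param through a filter. */
static int example_request_chan(struct device *dev)
{
	struct dma_chan *chan;

	chan = dma_request_slave_channel(dev, "pcm_pcm_stereo_out");
	if (!chan)
		return -ENODEV;	/* no slave-map entry or DT "dmas" phandle matched */

	dma_release_channel(chan);
	return 0;
}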

Signed-off-by: Robert Jarzmik <robert.jarz...@free.fr>


Reviewed-by: Daniel Mack <dan...@zonque.org>


---
  sound/arm/pxa2xx-ac97.c | 14 ++
  sound/arm/pxa2xx-pcm-lib.c  |  6 +++---
  sound/soc/pxa/pxa2xx-ac97.c | 32 +---
  sound/soc/pxa/pxa2xx-i2s.c  |  6 ++
  4 files changed, 12 insertions(+), 46 deletions(-)

diff --git a/sound/arm/pxa2xx-ac97.c b/sound/arm/pxa2xx-ac97.c
index 4bc244c40f80..236a63cdaf9f 100644
--- a/sound/arm/pxa2xx-ac97.c
+++ b/sound/arm/pxa2xx-ac97.c
@@ -63,28 +63,18 @@ static struct snd_ac97_bus_ops pxa2xx_ac97_ops = {
.reset  = pxa2xx_ac97_legacy_reset,
  };
  
-static struct pxad_param pxa2xx_ac97_pcm_out_req = {
-   .prio = PXAD_PRIO_LOWEST,
-   .drcmr = 12,
-};
-
  static struct snd_dmaengine_dai_dma_data pxa2xx_ac97_pcm_out = {
.addr   = __PREG(PCDR),
.addr_width = DMA_SLAVE_BUSWIDTH_4_BYTES,
+   .chan_name  = "pcm_pcm_stereo_out",
.maxburst   = 32,
-   .filter_data    = &pxa2xx_ac97_pcm_out_req,
-};
-
-static struct pxad_param pxa2xx_ac97_pcm_in_req = {
-   .prio = PXAD_PRIO_LOWEST,
-   .drcmr = 11,
  };
  
  static struct snd_dmaengine_dai_dma_data pxa2xx_ac97_pcm_in = {
.addr   = __PREG(PCDR),
.addr_width = DMA_SLAVE_BUSWIDTH_4_BYTES,
+   .chan_name  = "pcm_pcm_stereo_in",
.maxburst   = 32,
-   .filter_data    = &pxa2xx_ac97_pcm_in_req,
  };
  
  static struct snd_pcm *pxa2xx_ac97_pcm;

diff --git a/sound/arm/pxa2xx-pcm-lib.c b/sound/arm/pxa2xx-pcm-lib.c
index e8da3b8ee721..dcbe7ecc1835 100644
--- a/sound/arm/pxa2xx-pcm-lib.c
+++ b/sound/arm/pxa2xx-pcm-lib.c
@@ -125,9 +125,9 @@ int __pxa2xx_pcm_open(struct snd_pcm_substream *substream)
if (ret < 0)
return ret;
  
-	return snd_dmaengine_pcm_open_request_chan(substream,
-   pxad_filter_fn,
-   dma_params->filter_data);
+   return snd_dmaengine_pcm_open(
+   substream, dma_request_slave_channel(rtd->cpu_dai->dev,
+dma_params->chan_name));
  }
  EXPORT_SYMBOL(__pxa2xx_pcm_open);
  
diff --git a/sound/soc/pxa/pxa2xx-ac97.c b/sound/soc/pxa/pxa2xx-ac97.c

index 803818aabee9..1b41c0f2a8fb 100644
--- a/sound/soc/pxa/pxa2xx-ac97.c
+++ b/sound/soc/pxa/pxa2xx-ac97.c
@@ -68,61 +68,39 @@ static struct snd_ac97_bus_ops pxa2xx_ac97_ops = {
.reset  = pxa2xx_ac97_cold_reset,
  };
  
-static struct pxad_param pxa2xx_ac97_pcm_stereo_in_req = {
-   .prio = PXAD_PRIO_LOWEST,
-   .drcmr = 11,
-};
-
  static struct snd_dmaengine_dai_dma_data pxa2xx_ac97_pcm_stereo_in = {
.addr   = __PREG(PCDR),
.addr_width = DMA_SLAVE_BUSWIDTH_4_BYTES,
+   .chan_name  = "pcm_pcm_stereo_in",
.maxburst   = 32,
-   .filter_data    = &pxa2xx_ac97_pcm_stereo_in_req,
-};
-
-static struct pxad_param pxa2xx_ac97_pcm_stereo_out_req = {
-   .prio = PXAD_PRIO_LOWEST,
-   .drcmr = 12,
  };
  
  static struct snd_dmaengine_dai_dma_data pxa2xx_ac97_pcm_stereo_out = {
.addr   = __PREG(PCDR),
.addr_width = DMA_SLAVE_BUSWIDTH_4_BYTES,
+   .chan_name  = "pcm_pcm_stereo_out",
.maxburst   = 32,
-   .filter_data    = &pxa2xx_ac97_pcm_stereo_out_req,
  };
  
-static struct pxad_param pxa2xx_ac97_pcm_aux_mono_out_req = {
-   .prio = PXAD_PRIO_LOWEST,
-   .drcmr = 10,
-};
  static struct snd_dmaengine_dai_dma_data pxa2xx_ac97_pcm_aux_mono_out = {
.addr   = __PREG(MODR),
.addr_width = DMA_SLAVE_BUSWIDTH_2_BYTES,
+   .chan_name  = "pcm_aux_mono_out",
.maxburst   = 16,
-   .filter_data    = &pxa2xx_ac97_pcm_aux_mono_out_req,
  };
  
-static struct pxad_param pxa2xx_ac97_pcm_aux_mono_in_req = {
-   .prio = PXAD_PRIO_LOWEST,
-   .drcmr = 9,
-};
  static struct snd_dmaengine_dai_dma_data pxa2xx_ac97_pcm_aux_mono_in = {
.addr   = __PREG(MODR),
.addr_width = DMA_SLAVE_BUSWIDTH_2_BYTES,
+   .chan_name  = "pcm_aux_mono_in",
.maxburst   = 16,
-   .filter_data    = &pxa2xx_ac97_pcm_aux_mono_in_req,
  };
  
-static struct pxad_param pxa2xx_ac97_pcm_aux_mic_mono_req = {
-   .prio = PXAD_PRIO_LOWEST,
-   .drcmr = 8,
-};
  static struct snd_dmaengine_dai_dma_data pxa2xx_ac97_pcm_mic_mono_in = {
.addr   = __PREG(MCDR),
.addr_width = DMA_SLAVE_BUSWIDTH_2_BYTES,
+   .

Re: [PATCH v2 13/13] ARM: pxa: change SSP DMA channels allocation

2018-05-25 Thread Daniel Mack

On Thursday, May 24, 2018 09:07 AM, Robert Jarzmik wrote:

Now the dma_slave_map is available for PXA architecture, switch the SSP
device to it.

This specifically means that:
- for platform data based machines, the DMA requestor channels are
   extracted from the slave map, where pxa-ssp-dai. is a 1-1 match to
   ssp., and the channels are either "rx" or "tx".

- for device tree platforms, the dma node should be hooked into the
   pxa2xx-ac97 or pxa-ssp-dai node.
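
A hedged sketch of what one such slave-map entry amounts to on the
platform-data side; the device index and the request-line encoding are
made up for illustration, not copied from the actual PXA tables:

#include <linux/dmaengine.h>

/* Illustrative only: each entry binds (device name, channel name) to the
 * driver-private parameter that previously travelled via filter_data.
 * On PXA this parameter encodes the DRCMR requestor line and priority. */
static const struct dma_slave_map example_pxa_ssp_map[] = {
	/* "pxa-ssp-dai.<n>" is a 1-1 match to "ssp.<n>"; channels are "rx"/"tx" */
	{ "pxa-ssp-dai.0", "rx", (void *)(uintptr_t)13 },	/* made-up line number */
	{ "pxa-ssp-dai.0", "tx", (void *)(uintptr_t)14 },	/* made-up line number */
};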

Signed-off-by: Robert Jarzmik <robert.jarz...@free.fr>


Acked-by: Daniel Mack <dan...@zonque.org>


We should, however, merge what's left of this management glue code into 
the users of it, so the dma related properties can be put in the right 
devicetree node.


I'll prepare a patch for that for 4.18. This is a good preparation for 
this round though.



Thanks,
Daniel



---
Since v1: Removed channel names from platform_data
---
  arch/arm/plat-pxa/ssp.c| 47 --
  include/linux/pxa2xx_ssp.h |  2 --
  sound/soc/pxa/pxa-ssp.c|  5 ++---
  3 files changed, 2 insertions(+), 52 deletions(-)

diff --git a/arch/arm/plat-pxa/ssp.c b/arch/arm/plat-pxa/ssp.c
index ba13f793fbce..ed36dcab80f1 100644
--- a/arch/arm/plat-pxa/ssp.c
+++ b/arch/arm/plat-pxa/ssp.c
@@ -127,53 +127,6 @@ static int pxa_ssp_probe(struct platform_device *pdev)
if (IS_ERR(ssp->clk))
return PTR_ERR(ssp->clk);
  
-	if (dev->of_node) {
-   struct of_phandle_args dma_spec;
-   struct device_node *np = dev->of_node;
-   int ret;
-
-   /*
-* FIXME: we should allocate the DMA channel from this
-* context and pass the channel down to the ssp users.
-* For now, we lookup the rx and tx indices manually
-*/
-
-   /* rx */
-   ret = of_parse_phandle_with_args(np, "dmas", "#dma-cells",
-                                        0, &dma_spec);
-
-   if (ret) {
-   dev_err(dev, "Can't parse dmas property\n");
-   return -ENODEV;
-   }
-   ssp->drcmr_rx = dma_spec.args[0];
-   of_node_put(dma_spec.np);
-
-   /* tx */
-   ret = of_parse_phandle_with_args(np, "dmas", "#dma-cells",
-                                        1, &dma_spec);
-   if (ret) {
-   dev_err(dev, "Can't parse dmas property\n");
-   return -ENODEV;
-   }
-   ssp->drcmr_tx = dma_spec.args[0];
-   of_node_put(dma_spec.np);
-   } else {
-   res = platform_get_resource(pdev, IORESOURCE_DMA, 0);
-   if (res == NULL) {
-   dev_err(dev, "no SSP RX DRCMR defined\n");
-   return -ENODEV;
-   }
-   ssp->drcmr_rx = res->start;
-
-   res = platform_get_resource(pdev, IORESOURCE_DMA, 1);
-   if (res == NULL) {
-   dev_err(dev, "no SSP TX DRCMR defined\n");
-   return -ENODEV;
-   }
-   ssp->drcmr_tx = res->start;
-   }
-
res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
if (res == NULL) {
dev_err(dev, "no memory resource defined\n");
diff --git a/include/linux/pxa2xx_ssp.h b/include/linux/pxa2xx_ssp.h
index 8461b18e4608..03a7ca46735b 100644
--- a/include/linux/pxa2xx_ssp.h
+++ b/include/linux/pxa2xx_ssp.h
@@ -212,8 +212,6 @@ struct ssp_device {
int type;
int use_count;
int irq;
-   int drcmr_rx;
-   int drcmr_tx;
  
  	struct device_node	*of_node;
  };
diff --git a/sound/soc/pxa/pxa-ssp.c b/sound/soc/pxa/pxa-ssp.c
index 0291c7cb64eb..e09368d89bbc 100644
--- a/sound/soc/pxa/pxa-ssp.c
+++ b/sound/soc/pxa/pxa-ssp.c
@@ -104,9 +104,8 @@ static int pxa_ssp_startup(struct snd_pcm_substream 
*substream,
dma = kzalloc(sizeof(struct snd_dmaengine_dai_dma_data), GFP_KERNEL);
if (!dma)
return -ENOMEM;
-
-   dma->filter_data = substream->stream == SNDRV_PCM_STREAM_PLAYBACK ?
-   &ssp->drcmr_tx : &ssp->drcmr_rx;
+   dma->chan_name = substream->stream == SNDRV_PCM_STREAM_PLAYBACK ?
+   "tx" : "rx";
  
  	snd_soc_dai_set_dma_data(cpu_dai, substream, dma);
  





Re: [PATCH 05/15] mtd: nand: pxa3xx: remove the dmaengine compat need

2018-05-23 Thread Daniel Mack

Hi Robert,

Please refer to the attached patch instead of the one I sent earlier. I 
forgot to also remove the platform_get_resource(IORESOURCE_DMA) call.



Thanks,
Daniel


On Friday, May 18, 2018 11:31 PM, Daniel Mack wrote:

Hi Robert,

Thanks for this series.

On Monday, April 02, 2018 04:26 PM, Robert Jarzmik wrote:

From: Robert Jarzmik <robert.jarz...@renault.com>

As the pxa architecture switched towards the dmaengine slave map, the
old compatibility mechanism to acquire the dma requestor line number and
priority is not needed anymore.

This patch simplifies the dma resource acquisition, using the more
generic function dma_request_slave_channel().

Signed-off-by: Robert Jarzmik <robert.jarz...@free.fr>
---
   drivers/mtd/nand/pxa3xx_nand.c | 10 +-


This driver was replaced by drivers/mtd/nand/raw/marvell_nand.c
recently, so this patch can be dropped. I attached a version for the new
driver which you can pick instead.


Thanks,
Daniel



From 72a306157dedb21f8c3289f0f7a288fc4542bd96 Mon Sep 17 00:00:00 2001
From: Daniel Mack <dan...@zonque.org>
Date: Sat, 12 May 2018 21:50:13 +0200
Subject: [PATCH] mtd: rawnand: marvell: remove dmaengine compat code

As the pxa architecture switched towards the dmaengine slave map, the
old compatibility mechanism to acquire the dma requestor line number and
priority is not needed anymore.

This patch simplifies the dma resource acquisition, using the more
generic function dma_request_slave_channel().

Signed-off-by: Daniel Mack <dan...@zonque.org>
---
 drivers/mtd/nand/raw/marvell_nand.c | 17 +
 1 file changed, 1 insertion(+), 16 deletions(-)

diff --git a/drivers/mtd/nand/raw/marvell_nand.c b/drivers/mtd/nand/raw/marvell_nand.c
index ebb1d141b900..319fea77daf1 100644
--- a/drivers/mtd/nand/raw/marvell_nand.c
+++ b/drivers/mtd/nand/raw/marvell_nand.c
@@ -2612,8 +2612,6 @@ static int marvell_nfc_init_dma(struct marvell_nfc *nfc)
 		dev);
 	struct dma_slave_config config = {};
 	struct resource *r;
-	dma_cap_mask_t mask;
-	struct pxad_param param;
 	int ret;
 
 	if (!IS_ENABLED(CONFIG_PXA_DMA)) {
@@ -2626,20 +2624,7 @@ static int marvell_nfc_init_dma(struct marvell_nfc *nfc)
 	if (ret)
 		return ret;
 
-	r = platform_get_resource(pdev, IORESOURCE_DMA, 0);
-	if (!r) {
-		dev_err(nfc->dev, "No resource defined for data DMA\n");
-		return -ENXIO;
-	}
-
-	param.drcmr = r->start;
-	param.prio = PXAD_PRIO_LOWEST;
-	dma_cap_zero(mask);
-	dma_cap_set(DMA_SLAVE, mask);
-	nfc->dma_chan =
-		dma_request_slave_channel_compat(mask, pxad_filter_fn,
-		 &param, nfc->dev,
-		 "data");
+	nfc->dma_chan = dma_request_slave_channel(nfc->dev, "data");
 	if (!nfc->dma_chan) {
 		dev_err(nfc->dev,
 			"Unable to request data DMA channel\n");
-- 
2.14.3



Re: [PATCH 05/15] mtd: nand: pxa3xx: remove the dmaengine compat need

2018-05-18 Thread Daniel Mack

Hi Robert,

Thanks for this series.

On Monday, April 02, 2018 04:26 PM, Robert Jarzmik wrote:

From: Robert Jarzmik <robert.jarz...@renault.com>

As the pxa architecture switched towards the dmaengine slave map, the
old compatibility mechanism to acquire the dma requestor line number and
priority is not needed anymore.

This patch simplifies the dma resource acquisition, using the more
generic function dma_request_slave_channel().

Signed-off-by: Robert Jarzmik <robert.jarz...@free.fr>
---
  drivers/mtd/nand/pxa3xx_nand.c | 10 +-


This driver was replaced by drivers/mtd/nand/raw/marvell_nand.c 
recently, so this patch can be dropped. I attached a version for the new 
driver which you can pick instead.



Thanks,
Daniel
From c63bc40bdfe2d596e42919235840109a2f1b2776 Mon Sep 17 00:00:00 2001
From: Daniel Mack <dan...@zonque.org>
Date: Sat, 12 May 2018 21:50:13 +0200
Subject: [PATCH] mtd: rawnand: marvell: remove dmaengine compat code

As the pxa architecture switched towards the dmaengine slave map, the
old compatibility mechanism to acquire the dma requestor line number and
priority is not needed anymore.

This patch simplifies the dma resource acquisition, using the more
generic function dma_request_slave_channel().

Signed-off-by: Daniel Mack <dan...@zonque.org>
---
 drivers/mtd/nand/raw/marvell_nand.c | 11 +--
 1 file changed, 1 insertion(+), 10 deletions(-)

diff --git a/drivers/mtd/nand/raw/marvell_nand.c b/drivers/mtd/nand/raw/marvell_nand.c
index ebb1d141b900..30017cd7d91c 100644
--- a/drivers/mtd/nand/raw/marvell_nand.c
+++ b/drivers/mtd/nand/raw/marvell_nand.c
@@ -2612,8 +2612,6 @@ static int marvell_nfc_init_dma(struct marvell_nfc *nfc)
 		dev);
 	struct dma_slave_config config = {};
 	struct resource *r;
-	dma_cap_mask_t mask;
-	struct pxad_param param;
 	int ret;
 
 	if (!IS_ENABLED(CONFIG_PXA_DMA)) {
@@ -2632,14 +2630,7 @@ static int marvell_nfc_init_dma(struct marvell_nfc *nfc)
 		return -ENXIO;
 	}
 
-	param.drcmr = r->start;
-	param.prio = PXAD_PRIO_LOWEST;
-	dma_cap_zero(mask);
-	dma_cap_set(DMA_SLAVE, mask);
-	nfc->dma_chan =
-		dma_request_slave_channel_compat(mask, pxad_filter_fn,
-		 &param, nfc->dev,
-		 "data");
+	nfc->dma_chan = dma_request_slave_channel(nfc->dev, "data");
 	if (!nfc->dma_chan) {
 		dev_err(nfc->dev,
 			"Unable to request data DMA channel\n");
-- 
2.14.3



Re: [PATCH net-next] bpf: Optimize lpm trie delete

2017-09-20 Thread Daniel Mack
On 09/20/2017 08:51 PM, Craig Gallek wrote:
> On Wed, Sep 20, 2017 at 12:51 PM, Daniel Mack <dan...@zonque.org> wrote:
>> Hi Craig,
>>
>> Thanks, this looks much cleaner already :)
>>
>> On 09/20/2017 06:22 PM, Craig Gallek wrote:
>>> diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
>>> index 9d58a576b2ae..b5a7d70ec8b5 100644
>>> --- a/kernel/bpf/lpm_trie.c
>>> +++ b/kernel/bpf/lpm_trie.c
>>> @@ -397,7 +397,7 @@ static int trie_delete_elem(struct bpf_map *map, void 
>>> *_key)
>>>   struct lpm_trie_node __rcu **trim;
>>>   struct lpm_trie_node *node;
>>>   unsigned long irq_flags;
>>> - unsigned int next_bit;
>>> + unsigned int next_bit = 0;
>>
>> This default assignment seems wrong, and I guess you only added it to
>> squelch a compiler warning?
> Yes, this variable is only initialized after the lookup iterations
> below (meaning it will never be initialized the the root-removal
> case).

Right, and once set, it's only updated in case we don't have an exact
match and try to drill down further.

>> [...]
>>
>>> + /* If the node has one child, we may be able to collapse the tree
>>> +  * while removing this node if the node's child is in the same
>>> +  * 'next bit' slot as this node was in its parent or if the node
>>> +  * itself is the root.
>>> +  */
>>> + if (trim == &trie->root) {
>>> + next_bit = node->child[0] ? 0 : 1;
>>> + rcu_assign_pointer(trie->root, node->child[next_bit]);
>>> + kfree_rcu(node, rcu);
>>
>> I don't think you should treat this 'root' case special.
>>
>> Instead, move the 'next_bit' assignment outside of the condition ...
> I'm not quite sure I follow.  Are you saying do something like this:
> 
> if (trim == &trie->root) {
>   next_bit = node->child[0] ? 0 : 1;
> }
> if (rcu_access_pointer(node->child[next_bit])) {
> ...
> 
> This would save a couple lines of code, but I think the as-is
> implementation is slightly easier to understand.  I don't have a
> strong opinion either way, though.

Me neither :)

My idea was to set

  next_bit = node->child[0] ? 0 : 1;

unconditionally, because it should result in the same in both cases.

It might be a bit of bike shedding, but I dislike this default
assignment, and I believe that not relying on next_bit to be set as a
side effect of the lookup loop makes the code a bit more readable.

WDYT?
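
A sketch of that variant, based on the hunk quoted above (untested):

	/* Pick the surviving child up front; the root case then needs no
	 * special branch, because trim already points at &trie->root. */
	next_bit = node->child[0] ? 0 : 1;

	if (rcu_access_pointer(node->child[next_bit])) {
		rcu_assign_pointer(*trim, node->child[next_bit]);
		kfree_rcu(node, rcu);
		goto out;
	}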


Thanks,
Daniel



Re: [PATCH net-next] bpf: Optimize lpm trie delete

2017-09-20 Thread Daniel Mack
Hi Craig,

Thanks, this looks much cleaner already :)

On 09/20/2017 06:22 PM, Craig Gallek wrote:
> diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
> index 9d58a576b2ae..b5a7d70ec8b5 100644
> --- a/kernel/bpf/lpm_trie.c
> +++ b/kernel/bpf/lpm_trie.c
> @@ -397,7 +397,7 @@ static int trie_delete_elem(struct bpf_map *map, void 
> *_key)
>   struct lpm_trie_node __rcu **trim;
>   struct lpm_trie_node *node;
>   unsigned long irq_flags;
> - unsigned int next_bit;
> + unsigned int next_bit = 0;

This default assignment seems wrong, and I guess you only added it to
squelch a compiler warning?

[...]

> + /* If the node has one child, we may be able to collapse the tree
> +  * while removing this node if the node's child is in the same
> +  * 'next bit' slot as this node was in its parent or if the node
> +  * itself is the root.
> +  */
> + if (trim == &trie->root) {
> + next_bit = node->child[0] ? 0 : 1;
> + rcu_assign_pointer(trie->root, node->child[next_bit]);
> + kfree_rcu(node, rcu);

I don't think you should treat this 'root' case special.

Instead, move the 'next_bit' assignment outside of the condition ...

> + } else if (rcu_access_pointer(node->child[next_bit])) {
> + rcu_assign_pointer(*trim, node->child[next_bit]);
> + kfree_rcu(node, rcu);

... and then this branch would handle the case just fine. Correct?

Otherwise, looks good to me!



Thanks,
Daniel


Re: [PATCH net-next 0/3] Implement delete for BPF LPM trie

2017-09-19 Thread Daniel Mack
On 09/19/2017 11:29 PM, David Miller wrote:
> From: Craig Gallek <kraigatg...@gmail.com>
> Date: Tue, 19 Sep 2017 17:16:13 -0400
> 
>> On Tue, Sep 19, 2017 at 5:13 PM, Daniel Mack <dan...@zonque.org> wrote:
>>> On 09/19/2017 10:55 PM, David Miller wrote:
>>>> From: Craig Gallek <kraigatg...@gmail.com>
>>>> Date: Mon, 18 Sep 2017 15:30:54 -0400
>>>>
>>>>> This was previously left as a TODO.  Add the implementation and
>>>>> extend the test to cover it.
>>>>
>>>> Series applied, thanks.
>>>>
>>>
>>> Hmm, I think these patches need some more discussion regarding the IM
>>> nodes handling, see the reply I sent an hour ago. Could you wait for
>>> that before pushing your tree?
>>
>> I can follow up with a patch to implement your suggestion.  It's
>> really just an efficiency improvement, though, so I think it's ok to
>> handle independently. (Sorry, I haven't had a chance to play with the
>> implementation details yet).
> 
> Sorry, I thought the core implementation had been agreed upon and the
> series was OK.  All that was asked for were simplifications and/or
> optimization which could be done via follow-up patches.
> 
> It's already pushed out to my tree, so I would need to do a real
> revert.
> 
> I hope that won't be necessary.
> 

Nah, it's okay I guess. I trust Craig to send follow-up patches. After
all, efficiency is what this whole exercise is all about, so I think it
should be done correctly :)



Thanks,
Daniel


Re: [PATCH net-next 0/3] Implement delete for BPF LPM trie

2017-09-19 Thread Daniel Mack
On 09/19/2017 10:55 PM, David Miller wrote:
> From: Craig Gallek 
> Date: Mon, 18 Sep 2017 15:30:54 -0400
> 
>> This was previously left as a TODO.  Add the implementation and
>> extend the test to cover it.
> 
> Series applied, thanks.
> 

Hmm, I think these patches need some more discussion regarding the IM
nodes handling, see the reply I sent an hour ago. Could you wait for
that before pushing your tree?


Thanks,
Daniel



Re: [PATCH net-next 1/3] bpf: Implement map_delete_elem for BPF_MAP_TYPE_LPM_TRIE

2017-09-19 Thread Daniel Mack
Hi,

Thanks for working on this, Craig!

On 09/19/2017 06:12 PM, Daniel Borkmann wrote:
> On 09/19/2017 05:08 PM, Craig Gallek wrote:
>> On Mon, Sep 18, 2017 at 6:53 PM, Alexei Starovoitov  wrote:
>>> On 9/18/17 12:30 PM, Craig Gallek wrote:
> [...]
 +
 +   next_bit = extract_bit(key->data, node->prefixlen);
 +   /* If we hit a node that has more than one child or is a
 valid
 +* prefix itself, do not remove it. Reset the root of the
 trim
 +* path to its descendant on our path.
 +*/
 +   if (!(node->flags & LPM_TREE_NODE_FLAG_IM) ||
 +   (node->child[0] && node->child[1]))
 +   trim = &node->child[next_bit];
 +   node = rcu_dereference_protected(
 +   node->child[next_bit],
 lockdep_is_held(&trie->lock));
 +   }
 +
 +   if (!node || node->prefixlen != key->prefixlen ||
 +   (node->flags & LPM_TREE_NODE_FLAG_IM)) {
 +   ret = -ENOENT;
 +   goto out;
 +   }
 +
 +   trie->n_entries--;
 +
 +   /* If the node we are removing is not a leaf node, simply mark it
 +* as intermediate and we are done.
 +*/
 +   if (rcu_access_pointer(node->child[0]) ||
 +   rcu_access_pointer(node->child[1])) {
 +   node->flags |= LPM_TREE_NODE_FLAG_IM;
 +   goto out;
 +   }
 +
 +   /* trim should now point to the slot holding the start of a path
 from
 +* zero or more intermediate nodes to our leaf node for deletion.
 +*/
 +   while ((node = rcu_dereference_protected(
 +   *trim, lockdep_is_held(&trie->lock)))) {
 +   RCU_INIT_POINTER(*trim, NULL);
 +   trim = rcu_access_pointer(node->child[0]) ?
 +   &node->child[0] :
 +   &node->child[1];
 +   kfree_rcu(node, rcu);
>>>
>>> can it be that some of the nodes this loop walks have
>>> both child[0] and [1] ?
>> No, the loop above will push trim down the walk every time it
>> encounters a node with two children.  The only other trim assignment
>> is the initial trim = &trie->root.  But the only time we would skip
>> the assignment in the loop is if the node being removed is the root.
>> If the root had multiple children and is being removed, it would be
>> handled by the case that turns the node into an intermediate node
>> rather than walking the trim path freeing things.
> 
> Looks good to me. We should probably still merge nodes once we turn
> a real node into an im which just has a single child attached to it;
> parent can be im or real node. Thus, we don't need to traverse this
> extra one on lookup.

Right, but only if the parent of the node allows us to do that,
because the 'next bit' in the lookup key has to match the slot index.

To illustrate, consider the following trie with no IM nodes:

              +----------------+
              |   (1)      (R) |
              | 192.168.0.0/16 |
              |    value: 1    |
              |   [0]    [1]   |
              +----------------+
                   |      |
    +----------------+  +------------------+
    |       (2)      |  |       (3)        |
    | 192.168.0.0/23 |  | 192.168.128.0/24 |
    |    value: 2    |  |     value: 3     |
    |   [0]    [1]   |  |   [0]    [1]     |
    +----------------+  +------------------+
                  |
    +----------------+
    |       (4)      |
    | 192.168.1.0/24 |
    |    value: 4    |
    |   [0]    [1]   |
    +----------------+

If you now try to delete (2), the node has to stay around because (3)
and (4) share the same value in bit 17 (1). If, however, (4) had a
prefix of 192.168.0.0/24, then (2) should be removed completely, and (4)
should be directly attached to (1) as child[0].

With this implementation, a situation in which multiple IM nodes appear
in a chain cannot emerge. And that again should make your trimming
algorithm simpler, because you only need to find an exact match, and
then handle three distinct cases:

a) the node has 0 children: simply remove it and nullify the pointer in
the parent

b) the node has 1 child: apply logic I described above

c) the node has 2 children: turn the node into an IM


Makes sense?
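
Roughly, with the merging in place, the delete path then reduces to the
following sketch (names as in the patch quoted above; untested):

	/* After the lookup has found an exact, non-intermediate match in
	 * 'node', with '*trim' being the parent slot pointing at it: */
	if (!rcu_access_pointer(node->child[0]) &&
	    !rcu_access_pointer(node->child[1])) {
		/* a) no children: unlink the leaf and free it */
		rcu_assign_pointer(*trim, NULL);
		kfree_rcu(node, rcu);
	} else if (!rcu_access_pointer(node->child[0]) ||
		   !rcu_access_pointer(node->child[1])) {
		/* b) one child: splice it into the parent slot, provided the
		 * parent's branch bit still selects it (or node is the root) */
		next_bit = rcu_access_pointer(node->child[0]) ? 0 : 1;
		rcu_assign_pointer(*trim, node->child[next_bit]);
		kfree_rcu(node, rcu);
	} else {
		/* c) two children: keep the node, demote it to intermediate */
		node->flags |= LPM_TREE_NODE_FLAG_IM;
	}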


Thanks,
Daniel


Re: [PATCH v2 net] bpf: introduce BPF_F_ALLOW_OVERRIDE flag

2017-02-12 Thread Daniel Mack
On 02/11/2017 05:28 AM, Alexei Starovoitov wrote:
> If BPF_F_ALLOW_OVERRIDE flag is used in BPF_PROG_ATTACH command
> to the given cgroup the descendent cgroup will be able to override
> effective bpf program that was inherited from this cgroup.
> By default it's not passed, therefore override is disallowed.
> 
> Examples:
> 1.
> prog X attached to /A with default
> prog Y fails to attach to /A/B and /A/B/C
> Everything under /A runs prog X
> 
> 2.
> prog X attached to /A with allow_override.
> prog Y fails to attach to /A/B with default (non-override)
> prog M attached to /A/B with allow_override.
> Everything under /A/B runs prog M only.
> 
> 3.
> prog X attached to /A with allow_override.
> prog Y fails to attach to /A with default.
> The user has to detach first to switch the mode.
> 
> In the future this behavior may be extended with a chain of
> non-overridable programs.
> 
> Also fix the bug where detach from cgroup where nothing is attached
> was not throwing error. Return ENOENT in such case.
> 
> Add several testcases and adjust libbpf.
> 
> Fixes: 3007098494be ("cgroup: add support for eBPF programs")
> Signed-off-by: Alexei Starovoitov <a...@kernel.org>

Looks good to me.

Acked-by: Daniel Mack <dan...@zonque.org>

Let's get this into 4.10!
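
For reference, a minimal userspace sketch of an attach with the new flag
(raw syscall form; the attach type and the fds are placeholders):

#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/bpf.h>

/* Sketch: attach prog_fd to a cgroup and allow descendants to override it.
 * Leaving attach_flags at 0 keeps the default non-overridable behaviour. */
static int attach_overridable(int cgroup_fd, int prog_fd)
{
	union bpf_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.target_fd     = cgroup_fd;
	attr.attach_bpf_fd = prog_fd;
	attr.attach_type   = BPF_CGROUP_INET_INGRESS;
	attr.attach_flags  = BPF_F_ALLOW_OVERRIDE;

	return syscall(__NR_bpf, BPF_PROG_ATTACH, &attr, sizeof(attr));
}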


Thanks,
Daniel



> ---
> v1->v2: disallowed overridable->non_override transition as suggested by Andy
> added tests and fixed double detach bug
> 
> Andy, Daniel,
> please review and ack quickly, so it can land into 4.10.
> ---
>  include/linux/bpf-cgroup.h   | 13 
>  include/uapi/linux/bpf.h |  7 +
>  kernel/bpf/cgroup.c  | 59 +++---
>  kernel/bpf/syscall.c | 20 
>  kernel/cgroup.c  |  9 +++---
>  samples/bpf/test_cgrp2_attach.c  |  2 +-
>  samples/bpf/test_cgrp2_attach2.c | 68 
> +---
>  samples/bpf/test_cgrp2_sock.c|  2 +-
>  samples/bpf/test_cgrp2_sock2.c   |  2 +-
>  tools/lib/bpf/bpf.c  |  4 ++-
>  tools/lib/bpf/bpf.h  |  3 +-
>  11 files changed, 151 insertions(+), 38 deletions(-)
> 
> diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
> index 92bc89ae7e20..c970a25d2a49 100644
> --- a/include/linux/bpf-cgroup.h
> +++ b/include/linux/bpf-cgroup.h
> @@ -21,20 +21,19 @@ struct cgroup_bpf {
>*/
>   struct bpf_prog *prog[MAX_BPF_ATTACH_TYPE];
>   struct bpf_prog __rcu *effective[MAX_BPF_ATTACH_TYPE];
> + bool disallow_override[MAX_BPF_ATTACH_TYPE];
>  };
>  
>  void cgroup_bpf_put(struct cgroup *cgrp);
>  void cgroup_bpf_inherit(struct cgroup *cgrp, struct cgroup *parent);
>  
> -void __cgroup_bpf_update(struct cgroup *cgrp,
> -  struct cgroup *parent,
> -  struct bpf_prog *prog,
> -  enum bpf_attach_type type);
> +int __cgroup_bpf_update(struct cgroup *cgrp, struct cgroup *parent,
> + struct bpf_prog *prog, enum bpf_attach_type type,
> + bool overridable);
>  
>  /* Wrapper for __cgroup_bpf_update() protected by cgroup_mutex */
> -void cgroup_bpf_update(struct cgroup *cgrp,
> -struct bpf_prog *prog,
> -enum bpf_attach_type type);
> +int cgroup_bpf_update(struct cgroup *cgrp, struct bpf_prog *prog,
> +   enum bpf_attach_type type, bool overridable);
>  
>  int __cgroup_bpf_run_filter_skb(struct sock *sk,
>   struct sk_buff *skb,
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index e5b8cf16cbaf..69f65b710b10 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -116,6 +116,12 @@ enum bpf_attach_type {
>  
>  #define MAX_BPF_ATTACH_TYPE __MAX_BPF_ATTACH_TYPE
>  
> +/* If BPF_F_ALLOW_OVERRIDE flag is used in BPF_PROG_ATTACH command
> + * to the given target_fd cgroup the descendent cgroup will be able to
> + * override effective bpf program that was inherited from this cgroup
> + */
> +#define BPF_F_ALLOW_OVERRIDE (1U << 0)
> +
>  #define BPF_PSEUDO_MAP_FD1
>  
>  /* flags for BPF_MAP_UPDATE_ELEM command */
> @@ -171,6 +177,7 @@ union bpf_attr {
>   __u32   target_fd;  /* container object to attach 
> to */
>   __u32   attach_bpf_fd;  /* eBPF program to attach */
>   __u32   attach_type;
> + __u32   attach_flags;
>   };
>  } __attribute__((aligned(8)));
>  
> diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
> index a515f7b007c6..da0f53690295 100644
> --- a

Re: [PATCH v4 1/3] bpf: add a longest prefix match trie map implementation

2017-01-23 Thread Daniel Mack
On 01/23/2017 05:39 PM, Daniel Borkmann wrote:
> On 01/21/2017 05:26 PM, Daniel Mack wrote:
> [...]
>> +/* Called from syscall or from eBPF program */
>> +static int trie_update_elem(struct bpf_map *map,
>> +void *_key, void *value, u64 flags)
>> +{
>> +struct lpm_trie *trie = container_of(map, struct lpm_trie, map);
>> +struct lpm_trie_node *node, *im_node, *new_node = NULL;
> 
> im_node is uninitialized here ...
> 
>> +struct lpm_trie_node __rcu **slot;
>> +struct bpf_lpm_trie_key *key = _key;
>> +unsigned long irq_flags;
>> +unsigned int next_bit;
>> +size_t matchlen = 0;
>> +int ret = 0;
>> +
>> +if (unlikely(flags > BPF_EXIST))
>> +return -EINVAL;
>> +
>> +if (key->prefixlen > trie->max_prefixlen)
>> +return -EINVAL;
>> +
>> +raw_spin_lock_irqsave(&trie->lock, irq_flags);
>> +
>> +/* Allocate and fill a new node */
>> +
>> +if (trie->n_entries == trie->map.max_entries) {
>> +ret = -ENOSPC;
>> +goto out;
> 
> ... and here we go to out path with ret as non-zero ...
> 
>> +}
>> +
>> +new_node = lpm_trie_node_alloc(trie, value);
>> +if (!new_node) {
>> +ret = -ENOMEM;
>> +goto out;
>> +}
> [...]
>> +
>> +out:
>> +if (ret) {
>> +if (new_node)
>> +trie->n_entries--;
>> +
>> +kfree(new_node);
>> +kfree(im_node);
> 
> ... which does kfree() in im_node here.

Oops. Nice catch! gcc was too stupid to recognize that :)

Thanks, I'll repost a v5 with Alexei's Acked-by later today.
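
For the archive, the fix amounts to initializing the pointer so the common
error path may kfree() it unconditionally (sketch; the actual v5 hunk may
differ):

	struct lpm_trie_node *node, *im_node = NULL, *new_node = NULL;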


Daniel



[PATCH v4 1/3] bpf: add a longest prefix match trie map implementation

2017-01-21 Thread Daniel Mack
This trie implements a longest prefix match algorithm that can be used
to match IP addresses to a stored set of ranges.

Internally, data is stored in an unbalanced trie of nodes that has a
maximum height of n, where n is the prefixlen the trie was created
with.

Tries may be created with prefix lengths that are multiples of 8, in
the range from 8 to 2048. The key used for lookup and update operations
is a struct bpf_lpm_trie_key, and the value is a uint64_t.

The code carries more information about the internal implementation.
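
A short userspace sketch of the intended usage (map creation elided; the
wrapper signatures below are assumed, not part of this patch):

#include <assert.h>
#include <stdint.h>
#include <string.h>

/* assumed to behave like the helpers used by the selftests */
int bpf_map_update_elem(int fd, const void *key, const void *value, uint64_t flags);
int bpf_map_lookup_elem(int fd, const void *key, void *value);

/* Same layout as struct bpf_lpm_trie_key with 4 bytes of data (IPv4). */
struct lpm_key4 {
	uint32_t prefixlen;
	uint8_t  data[4];
};

static void example(int map_fd)
{
	struct lpm_key4 key = { .prefixlen = 16, .data = { 192, 168, 0, 0 } };
	uint64_t value = 1;

	/* insert 192.168.0.0/16 -> 1 */
	bpf_map_update_elem(map_fd, &key, &value, 0);

	/* look up the host 192.168.1.42; the key carries the full length */
	key.prefixlen = 32;
	memcpy(key.data, (uint8_t[]){ 192, 168, 1, 42 }, 4);

	if (bpf_map_lookup_elem(map_fd, &key, &value) == 0)
		assert(value == 1);	/* longest match is 192.168.0.0/16 */
}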

Signed-off-by: Daniel Mack <dan...@zonque.org>
Reviewed-by: David Herrmann <dh.herrm...@gmail.com>
---
 include/uapi/linux/bpf.h |   7 +
 kernel/bpf/Makefile  |   2 +-
 kernel/bpf/lpm_trie.c| 503 +++
 3 files changed, 511 insertions(+), 1 deletion(-)
 create mode 100644 kernel/bpf/lpm_trie.c

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 0eb0e87..d564277 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -63,6 +63,12 @@ struct bpf_insn {
__s32   imm;/* signed immediate constant */
 };
 
+/* Key of a BPF_MAP_TYPE_LPM_TRIE entry */
+struct bpf_lpm_trie_key {
+   __u32   prefixlen;  /* up to 32 for AF_INET, 128 for AF_INET6 */
+   __u8data[0];/* Arbitrary size */
+};
+
 /* BPF syscall commands, see bpf(2) man-page for details. */
 enum bpf_cmd {
BPF_MAP_CREATE,
@@ -89,6 +95,7 @@ enum bpf_map_type {
BPF_MAP_TYPE_CGROUP_ARRAY,
BPF_MAP_TYPE_LRU_HASH,
BPF_MAP_TYPE_LRU_PERCPU_HASH,
+   BPF_MAP_TYPE_LPM_TRIE,
 };
 
 enum bpf_prog_type {
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index 1276474..e1ce4f4 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -1,7 +1,7 @@
 obj-y := core.o
 
 obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o
-obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o 
bpf_lru_list.o
+obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o 
bpf_lru_list.o lpm_trie.o
 ifeq ($(CONFIG_PERF_EVENTS),y)
 obj-$(CONFIG_BPF_SYSCALL) += stackmap.o
 endif
diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
new file mode 100644
index 000..ba19241d
--- /dev/null
+++ b/kernel/bpf/lpm_trie.c
@@ -0,0 +1,503 @@
+/*
+ * Longest prefix match list implementation
+ *
+ * Copyright (c) 2016,2017 Daniel Mack
+ * Copyright (c) 2016 David Herrmann
+ *
+ * This file is subject to the terms and conditions of version 2 of the GNU
+ * General Public License.  See the file COPYING in the main directory of the
+ * Linux distribution for more details.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+/* Intermediate node */
+#define LPM_TREE_NODE_FLAG_IM BIT(0)
+
+struct lpm_trie_node;
+
+struct lpm_trie_node {
+   struct rcu_head rcu;
+   struct lpm_trie_node __rcu  *child[2];
+   u32 prefixlen;
+   u32 flags;
+   u8  data[0];
+};
+
+struct lpm_trie {
+   struct bpf_map  map;
+   struct lpm_trie_node __rcu  *root;
+   size_t  n_entries;
+   size_t  max_prefixlen;
+   size_t  data_size;
+   raw_spinlock_t  lock;
+};
+
+/* This trie implements a longest prefix match algorithm that can be used to
+ * match IP addresses to a stored set of ranges.
+ *
+ * Data stored in @data of struct bpf_lpm_key and struct lpm_trie_node is
+ * interpreted as big endian, so data[0] stores the most significant byte.
+ *
+ * Match ranges are internally stored in instances of struct lpm_trie_node
+ * which each contain their prefix length as well as two pointers that may
+ * lead to more nodes containing more specific matches. Each node also stores
+ * a value that is defined by and returned to userspace via the update_elem
+ * and lookup functions.
+ *
+ * For instance, let's start with a trie that was created with a prefix length
+ * of 32, so it can be used for IPv4 addresses, and one single element that
+ * matches 192.168.0.0/16. The data array would hence contain
+ * [0xc0, 0xa8, 0x00, 0x00] in big-endian notation. This documentation will
+ * stick to IP-address notation for readability though.
+ *
+ * As the trie is empty initially, the new node (1) will be placed as root
+ * node, denoted as (R) in the example below. As there are no other nodes, both
+ * child pointers are %NULL.
+ *
+ *              +----------------+
+ *              |       (1)  (R) |
+ *              | 192.168.0.0/16 |
+ *              |    value: 1    |
+ *              |   [0]    [1]   |
+ *              +----------------+
+ *
+ * Next, let's add a new node (2) matching 192.168.0.0/24. As there is already
+ * a node with the same data and a smaller prefix (ie, a less specific one),
+ * node (2) will become a child of (

[PATCH v4 2/3] bpf: Add tests for the lpm trie map

2017-01-21 Thread Daniel Mack
From: David Herrmann <dh.herrm...@gmail.com>

The first part of this program runs randomized tests against the
lpm-bpf-map. It implements a "Trivial Longest Prefix Match" (tlpm)
based on simple, linear, single linked lists. The implementation
should be pretty straightforward.

Based on tlpm, this inserts randomized data into bpf-lpm-maps and
verifies the trie-based bpf-map implementation behaves the same way
as tlpm.

The second part uses 'real world' IPv4 and IPv6 addresses and tests
the trie with those.

Signed-off-by: David Herrmann <dh.herrm...@gmail.com>
Signed-off-by: Daniel Mack <dan...@zonque.org>
---
 tools/testing/selftests/bpf/.gitignore |   1 +
 tools/testing/selftests/bpf/Makefile   |   4 +-
 tools/testing/selftests/bpf/test_lpm_map.c | 358 +
 3 files changed, 361 insertions(+), 2 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/test_lpm_map.c

diff --git a/tools/testing/selftests/bpf/.gitignore 
b/tools/testing/selftests/bpf/.gitignore
index 071431b..d3b1c9b 100644
--- a/tools/testing/selftests/bpf/.gitignore
+++ b/tools/testing/selftests/bpf/.gitignore
@@ -1,3 +1,4 @@
 test_verifier
 test_maps
 test_lru_map
+test_lpm_map
diff --git a/tools/testing/selftests/bpf/Makefile 
b/tools/testing/selftests/bpf/Makefile
index 7a5f245..064a3e5 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -1,8 +1,8 @@
 CFLAGS += -Wall -O2 -I../../../../usr/include
 
-test_objs = test_verifier test_maps test_lru_map
+test_objs = test_verifier test_maps test_lru_map test_lpm_map
 
-TEST_PROGS := test_verifier test_maps test_lru_map test_kmod.sh
+TEST_PROGS := test_verifier test_maps test_lru_map test_lpm_map test_kmod.sh
 TEST_FILES := $(test_objs)
 
 all: $(test_objs)
diff --git a/tools/testing/selftests/bpf/test_lpm_map.c 
b/tools/testing/selftests/bpf/test_lpm_map.c
new file mode 100644
index 000..26775c0
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_lpm_map.c
@@ -0,0 +1,358 @@
+/*
+ * Randomized tests for eBPF longest-prefix-match maps
+ *
+ * This program runs randomized tests against the lpm-bpf-map. It implements a
+ * "Trivial Longest Prefix Match" (tlpm) based on simple, linear, singly linked
+ * lists. The implementation should be pretty straightforward.
+ *
+ * Based on tlpm, this inserts randomized data into bpf-lpm-maps and verifies
+ * the trie-based bpf-map implementation behaves the same way as tlpm.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "bpf_sys.h"
+#include "bpf_util.h"
+
+struct tlpm_node {
+   struct tlpm_node *next;
+   size_t n_bits;
+   uint8_t key[];
+};
+
+static struct tlpm_node *tlpm_add(struct tlpm_node *list,
+ const uint8_t *key,
+ size_t n_bits)
+{
+   struct tlpm_node *node;
+   size_t n;
+
+   /* add new entry with @key/@n_bits to @list and return new head */
+
+   n = (n_bits + 7) / 8;
+   node = malloc(sizeof(*node) + n);
+   assert(node);
+
+   node->next = list;
+   node->n_bits = n_bits;
+   memcpy(node->key, key, n);
+
+   return node;
+}
+
+static void tlpm_clear(struct tlpm_node *list)
+{
+   struct tlpm_node *node;
+
+   /* free all entries in @list */
+
+   while ((node = list)) {
+   list = list->next;
+   free(node);
+   }
+}
+
+static struct tlpm_node *tlpm_match(struct tlpm_node *list,
+   const uint8_t *key,
+   size_t n_bits)
+{
+   struct tlpm_node *best = NULL;
+   size_t i;
+
+   /* Perform longest prefix-match on @key/@n_bits. That is, iterate all
+* entries and match each prefix against @key. Remember the "best"
+* entry we find (i.e., the longest prefix that matches) and return it
+* to the caller when done.
+*/
+
+   for ( ; list; list = list->next) {
+   for (i = 0; i < n_bits && i < list->n_bits; ++i) {
+   if ((key[i / 8] & (1 << (7 - i % 8))) !=
+   (list->key[i / 8] & (1 << (7 - i % 8
+   break;
+   }
+
+   if (i >= list->n_bits) {
+   if (!best || i > best->n_bits)
+   best = list;
+   }
+   }
+
+   return best;
+}
+
+static void test_lpm_basic(void)
+{
+   struct tlpm_node *list = NULL, *t1, *t2;
+
+   /* very basic, static tests to verify tlpm works as expected */
+
+   assert(!tlpm_match(list, (uint8_t[]){ 0xff }, 8));
+
+   t1 = list = tlpm_add(list, (uint8_t[]){ 0xff }, 8);
+   assert(t1 == tlpm_match(list, (uint8_t[]){ 0xff }, 8));
+  

[PATCH v4 0/3] bpf: add longest prefix match map

2017-01-21 Thread Daniel Mack
This patch set adds a longest prefix match algorithm that can be used
to match IP addresses to a stored set of ranges. It is exposed as a
bpf map type.
   
Internally, data is stored in an unbalanced tree of nodes that has a
maximum height of n, where n is the prefixlen the trie was created
with.
 
Note that this has nothing to do with fib or fib6 and is in no way meant
to replace or share code with it. It's rather a much simpler
implementation that is specifically written with bpf maps in mind.
 
Patch 1/2 adds the implementation, 2/2 an extensive test suite and 3/3
has benchmarking code for the new trie type.

Feedback is much appreciated.
 
 
Thanks,
Daniel

Changelog:

v3 -> v4:
* David added a 3rd patch that augments map_perf_test for
  LPM trie benchmarks
* Limit allocation of maps of this new type to CAP_SYS_ADMIN
  for now, as requested by Alexei
* Add a stub .map_delete_elem so the core does not stumble
  over a NULL pointer when the syscall is invoked
* Tests for non-power-of-2 prefix lengths were added
* More comment style fixes

v2 -> v3:
* Store both the key match data and the caller provided
  value in the same byte array attached to a node. This
  avoids double allocations
* Bring back node->flags to distinguish between 'real'
  and intermediate nodes
* Fix comment style and some typos

v1 -> v2:
* Turn spin lock into raw spinlock
* Lock with irqsave options during trie_update_elem()
* Return -ENOMEM properly from trie_alloc()
* Force attr->flags == BPF_F_NO_PREALLOC during creation
* Set trie->map.pages after creation to account for map memory
* Allow arbitrary value sizes
* Removed node->flags and denode intermediate nodes through
  node->value == NULL instead

rfc -> v1:
* Add __rcu pointer annotations to make sparse happy
* Fold _lpm_trie_find_target_node() into its only caller
* Fix some minor documentation issues

Daniel Mack (1):
  bpf: add a longest prefix match trie map implementation

David Herrmann (2):
  bpf: Add tests for the lpm trie map
  samples/bpf: add lpm-trie benchmark

 include/uapi/linux/bpf.h   |   7 +
 kernel/bpf/Makefile|   2 +-
 kernel/bpf/lpm_trie.c  | 503 +
 samples/bpf/map_perf_test_kern.c   |  30 ++
 samples/bpf/map_perf_test_user.c   |  49 +++
 tools/testing/selftests/bpf/.gitignore |   1 +
 tools/testing/selftests/bpf/Makefile   |   4 +-
 tools/testing/selftests/bpf/test_lpm_map.c | 358 
 8 files changed, 951 insertions(+), 3 deletions(-)
 create mode 100644 kernel/bpf/lpm_trie.c
 create mode 100644 tools/testing/selftests/bpf/test_lpm_map.c

-- 
2.9.3



[PATCH v4 3/3] samples/bpf: add lpm-trie benchmark

2017-01-21 Thread Daniel Mack
From: David Herrmann <dh.herrm...@gmail.com>

Extend the map_perf_test_{user,kern}.c infrastructure to stress test
lpm-trie lookups. We hook into the kprobe on sys_gettid() and measure
the latency depending on trie size and lookup count.

On my Intel Haswell i7-6400U, a single gettid() syscall with an empty
bpf program takes roughly 6.5us on my system. Lookups in empty tries
take ~1.8us on first try, ~0.9us on retries. Lookups in tries with 8192
entries take ~7.1us (on the first _and_ any subsequent try).

Signed-off-by: David Herrmann <dh.herrm...@gmail.com>
Reviewed-by: Daniel Mack <dan...@zonque.org>
---
 samples/bpf/map_perf_test_kern.c | 30 
 samples/bpf/map_perf_test_user.c | 49 
 2 files changed, 79 insertions(+)

diff --git a/samples/bpf/map_perf_test_kern.c b/samples/bpf/map_perf_test_kern.c
index 7ee1574..a91872a 100644
--- a/samples/bpf/map_perf_test_kern.c
+++ b/samples/bpf/map_perf_test_kern.c
@@ -57,6 +57,14 @@ struct bpf_map_def SEC("maps") percpu_hash_map_alloc = {
.map_flags = BPF_F_NO_PREALLOC,
 };
 
+struct bpf_map_def SEC("maps") lpm_trie_map_alloc = {
+   .type = BPF_MAP_TYPE_LPM_TRIE,
+   .key_size = 8,
+   .value_size = sizeof(long),
+   .max_entries = 1,
+   .map_flags = BPF_F_NO_PREALLOC,
+};
+
 SEC("kprobe/sys_getuid")
 int stress_hmap(struct pt_regs *ctx)
 {
@@ -135,5 +143,27 @@ int stress_percpu_lru_hmap_alloc(struct pt_regs *ctx)
return 0;
 }
 
+SEC("kprobe/sys_gettid")
+int stress_lpm_trie_map_alloc(struct pt_regs *ctx)
+{
+   union {
+   u32 b32[2];
+   u8 b8[8];
+   } key;
+   unsigned int i;
+
+   key.b32[0] = 32;
+   key.b8[4] = 192;
+   key.b8[5] = 168;
+   key.b8[6] = 0;
+   key.b8[7] = 1;
+
+#pragma clang loop unroll(full)
+   for (i = 0; i < 32; ++i)
+   bpf_map_lookup_elem(&lpm_trie_map_alloc, &key);
+
+   return 0;
+}
+
 char _license[] SEC("license") = "GPL";
 u32 _version SEC("version") = LINUX_VERSION_CODE;
diff --git a/samples/bpf/map_perf_test_user.c b/samples/bpf/map_perf_test_user.c
index 9505b4d..680260a 100644
--- a/samples/bpf/map_perf_test_user.c
+++ b/samples/bpf/map_perf_test_user.c
@@ -37,6 +37,7 @@ static __u64 time_get_ns(void)
 #define PERCPU_HASH_KMALLOC(1 << 3)
 #define LRU_HASH_PREALLOC  (1 << 4)
 #define PERCPU_LRU_HASH_PREALLOC   (1 << 5)
+#define LPM_KMALLOC(1 << 6)
 
 static int test_flags = ~0;
 
@@ -112,6 +113,18 @@ static void test_percpu_hash_kmalloc(int cpu)
   cpu, MAX_CNT * 10ll / (time_get_ns() - start_time));
 }
 
+static void test_lpm_kmalloc(int cpu)
+{
+   __u64 start_time;
+   int i;
+
+   start_time = time_get_ns();
+   for (i = 0; i < MAX_CNT; i++)
+   syscall(__NR_gettid);
+   printf("%d:lpm_perf kmalloc %lld events per sec\n",
+  cpu, MAX_CNT * 10ll / (time_get_ns() - start_time));
+}
+
 static void loop(int cpu)
 {
cpu_set_t cpuset;
@@ -137,6 +150,9 @@ static void loop(int cpu)
 
if (test_flags & PERCPU_LRU_HASH_PREALLOC)
test_percpu_lru_hash_prealloc(cpu);
+
+   if (test_flags & LPM_KMALLOC)
+   test_lpm_kmalloc(cpu);
 }
 
 static void run_perf_test(int tasks)
@@ -162,6 +178,37 @@ static void run_perf_test(int tasks)
}
 }
 
+static void fill_lpm_trie(void)
+{
+   struct bpf_lpm_trie_key *key;
+   unsigned long value = 0;
+   unsigned int i;
+   int r;
+
+   key = alloca(sizeof(*key) + 4);
+   key->prefixlen = 32;
+
+   for (i = 0; i < 512; ++i) {
+   key->prefixlen = rand() % 33;
+   key->data[0] = rand() & 0xff;
+   key->data[1] = rand() & 0xff;
+   key->data[2] = rand() & 0xff;
+   key->data[3] = rand() & 0xff;
+   r = bpf_map_update_elem(map_fd[6], key, &value, 0);
+   assert(!r);
+   }
+
+   key->prefixlen = 32;
+   key->data[0] = 192;
+   key->data[1] = 168;
+   key->data[2] = 0;
+   key->data[3] = 1;
+   value = 128;
+
+   r = bpf_map_update_elem(map_fd[6], key, &value, 0);
+   assert(!r);
+}
+
 int main(int argc, char **argv)
 {
struct rlimit r = {RLIM_INFINITY, RLIM_INFINITY};
@@ -182,6 +229,8 @@ int main(int argc, char **argv)
return 1;
}
 
+   fill_lpm_trie();
+
run_perf_test(num_cpu);
 
return 0;
-- 
2.9.3



[PATCH v3 2/2] bpf: Add tests for the lpm trie map

2017-01-14 Thread Daniel Mack
From: David Herrmann <dh.herrm...@gmail.com>

The first part of this program runs randomized tests against the
lpm-bpf-map. It implements a "Trivial Longest Prefix Match" (tlpm)
based on simple, linear, single linked lists. The implementation
should be pretty straightforward.

Based on tlpm, this inserts randomized data into bpf-lpm-maps and
verifies the trie-based bpf-map implementation behaves the same way
as tlpm.

The second part uses 'real world' IPv4 and IPv6 addresses and tests
the trie with those.

Signed-off-by: David Herrmann <dh.herrm...@gmail.com>
Signed-off-by: Daniel Mack <dan...@zonque.org>
---
 tools/testing/selftests/bpf/.gitignore |   1 +
 tools/testing/selftests/bpf/Makefile   |   4 +-
 tools/testing/selftests/bpf/test_lpm_map.c | 358 +
 3 files changed, 361 insertions(+), 2 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/test_lpm_map.c

diff --git a/tools/testing/selftests/bpf/.gitignore 
b/tools/testing/selftests/bpf/.gitignore
index 071431b..d3b1c9b 100644
--- a/tools/testing/selftests/bpf/.gitignore
+++ b/tools/testing/selftests/bpf/.gitignore
@@ -1,3 +1,4 @@
 test_verifier
 test_maps
 test_lru_map
+test_lpm_map
diff --git a/tools/testing/selftests/bpf/Makefile 
b/tools/testing/selftests/bpf/Makefile
index 7a5f245..064a3e5 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -1,8 +1,8 @@
 CFLAGS += -Wall -O2 -I../../../../usr/include
 
-test_objs = test_verifier test_maps test_lru_map
+test_objs = test_verifier test_maps test_lru_map test_lpm_map
 
-TEST_PROGS := test_verifier test_maps test_lru_map test_kmod.sh
+TEST_PROGS := test_verifier test_maps test_lru_map test_lpm_map test_kmod.sh
 TEST_FILES := $(test_objs)
 
 all: $(test_objs)
diff --git a/tools/testing/selftests/bpf/test_lpm_map.c 
b/tools/testing/selftests/bpf/test_lpm_map.c
new file mode 100644
index 000..dd83f0b
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_lpm_map.c
@@ -0,0 +1,358 @@
+/*
+ * Randomized tests for eBPF longest-prefix-match maps
+ *
+ * This program runs randomized tests against the lpm-bpf-map. It implements a
+ * "Trivial Longest Prefix Match" (tlpm) based on simple, linear, singly linked
+ * lists. The implementation should be pretty straightforward.
+ *
+ * Based on tlpm, this inserts randomized data into bpf-lpm-maps and verifies
+ * the trie-based bpf-map implementation behaves the same way as tlpm.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "bpf_sys.h"
+#include "bpf_util.h"
+
+struct tlpm_node {
+   struct tlpm_node *next;
+   size_t n_bits;
+   uint8_t key[];
+};
+
+static struct tlpm_node *tlpm_add(struct tlpm_node *list,
+ const uint8_t *key,
+ size_t n_bits)
+{
+   struct tlpm_node *node;
+   size_t n;
+
+   /* add new entry with @key/@n_bits to @list and return new head */
+
+   n = (n_bits + 7) / 8;
+   node = malloc(sizeof(*node) + n);
+   assert(node);
+
+   node->next = list;
+   node->n_bits = n_bits;
+   memcpy(node->key, key, n);
+
+   return node;
+}
+
+static void tlpm_clear(struct tlpm_node *list)
+{
+   struct tlpm_node *node;
+
+   /* free all entries in @list */
+
+   while ((node = list)) {
+   list = list->next;
+   free(node);
+   }
+}
+
+static struct tlpm_node *tlpm_match(struct tlpm_node *list,
+   const uint8_t *key,
+   size_t n_bits)
+{
+   struct tlpm_node *best = NULL;
+   size_t i;
+
+   /*
+* Perform longest prefix-match on @key/@n_bits. That is, iterate all
+* entries and match each prefix against @key. Remember the "best"
+* entry we find (i.e., the longest prefix that matches) and return it
+* to the caller when done.
+*/
+
+   for ( ; list; list = list->next) {
+   for (i = 0; i < n_bits && i < list->n_bits; ++i) {
+   if ((key[i / 8] & (1 << (7 - i % 8))) !=
+   (list->key[i / 8] & (1 << (7 - i % 8
+   break;
+   }
+
+   if (i >= list->n_bits) {
+   if (!best || i > best->n_bits)
+   best = list;
+   }
+   }
+
+   return best;
+}
+
+static void test_lpm_basic(void)
+{
+   struct tlpm_node *list = NULL, *t1, *t2;
+
+   /* very basic, static tests to verify tlpm works as expected */
+
+   assert(!tlpm_match(list, (uint8_t[]){ 0xff }, 8));
+
+   t1 = list = tlpm_add(list, (uint8_t[]){ 0xff }, 8);
+   assert(t1 == tlpm_match(list, (uint8_t[]){ 0xff }, 8)

[PATCH v3 0/2] bpf: add longest prefix match map

2017-01-14 Thread Daniel Mack
This patch set adds a longest prefix match algorithm that can be used
to match IP addresses to a stored set of ranges. It is exposed as a
bpf map type.
   
Internally, data is stored in an unbalanced tree of nodes that has a
maximum height of n, where n is the prefixlen the trie was created
with.
 
Note that this has nothing to do with fib or fib6 and is in no way meant
to replace or share code with it. It's rather a much simpler
implementation that is specifically written with bpf maps in mind.
 
Patch 1/2 adds the implementation, and 2/2 an extensive test suite.

Feedback is much appreciated.
 
 
Thanks,
Daniel

Changelog:

v2 -> v3:
* Store both the key match data and the caller provided
  value in the same byte array attached to a node. This
  avoids double allocations
* Bring back node->flags to distinguish between 'real'
  and intermediate nodes
* Fix comment style and some typos

v1 -> v2:
* Turn spin lock into raw spinlock
* Lock with irqsave options during trie_update_elem()
* Return -ENOMEM properly from trie_alloc()
* Force attr->flags == BPF_F_NO_PREALLOC during creation
* Set trie->map.pages after creation to account for map memory
* Allow arbitrary value sizes
* Removed node->flags and denode intermediate nodes through
  node->value == NULL instead

rfc -> v1:
* Add __rcu pointer annotations to make sparse happy
* Fold _lpm_trie_find_target_node() into its only caller
* Fix some minor documentation issues


Daniel Mack (1):
  bpf: add a longest prefix match trie map implementation

David Herrmann (1):
  bpf: Add tests for the lpm trie map

 include/uapi/linux/bpf.h   |   7 +
 kernel/bpf/Makefile|   2 +-
 kernel/bpf/lpm_trie.c  | 493 +
 tools/testing/selftests/bpf/.gitignore |   1 +
 tools/testing/selftests/bpf/Makefile   |   4 +-
 tools/testing/selftests/bpf/test_lpm_map.c | 358 +
 6 files changed, 862 insertions(+), 3 deletions(-)
 create mode 100644 kernel/bpf/lpm_trie.c
 create mode 100644 tools/testing/selftests/bpf/test_lpm_map.c

-- 
2.9.3



[PATCH v3 1/2] bpf: add a longest prefix match trie map implementation

2017-01-14 Thread Daniel Mack
This trie implements a longest prefix match algorithm that can be used
to match IP addresses to a stored set of ranges.

Internally, data is stored in an unbalanced trie of nodes that has a
maximum height of n, where n is the prefixlen the trie was created
with.

Tries may be created with prefix lengths that are multiples of 8, in
the range from 8 to 2048. The key used for lookup and update operations
is a struct bpf_lpm_trie_key, and the value is a uint64_t.

The code carries more information about the internal implementation.

Signed-off-by: Daniel Mack <dan...@zonque.org>
Reviewed-by: David Herrmann <dh.herrm...@gmail.com>
---
 include/uapi/linux/bpf.h |   7 +
 kernel/bpf/Makefile  |   2 +-
 kernel/bpf/lpm_trie.c| 493 +++
 3 files changed, 501 insertions(+), 1 deletion(-)
 create mode 100644 kernel/bpf/lpm_trie.c

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 0eb0e87..d564277 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -63,6 +63,12 @@ struct bpf_insn {
__s32   imm;/* signed immediate constant */
 };
 
+/* Key of a BPF_MAP_TYPE_LPM_TRIE entry */
+struct bpf_lpm_trie_key {
+   __u32   prefixlen;  /* up to 32 for AF_INET, 128 for AF_INET6 */
+   __u8data[0];/* Arbitrary size */
+};
+
 /* BPF syscall commands, see bpf(2) man-page for details. */
 enum bpf_cmd {
BPF_MAP_CREATE,
@@ -89,6 +95,7 @@ enum bpf_map_type {
BPF_MAP_TYPE_CGROUP_ARRAY,
BPF_MAP_TYPE_LRU_HASH,
BPF_MAP_TYPE_LRU_PERCPU_HASH,
+   BPF_MAP_TYPE_LPM_TRIE,
 };
 
 enum bpf_prog_type {
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index 1276474..e1ce4f4 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -1,7 +1,7 @@
 obj-y := core.o
 
 obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o
-obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o 
bpf_lru_list.o
+obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o 
bpf_lru_list.o lpm_trie.o
 ifeq ($(CONFIG_PERF_EVENTS),y)
 obj-$(CONFIG_BPF_SYSCALL) += stackmap.o
 endif
diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
new file mode 100644
index 000..1c1ad27
--- /dev/null
+++ b/kernel/bpf/lpm_trie.c
@@ -0,0 +1,493 @@
+/*
+ * Longest prefix match list implementation
+ *
+ * Copyright (c) 2016,2017 Daniel Mack
+ * Copyright (c) 2016 David Herrmann
+ *
+ * This file is subject to the terms and conditions of version 2 of the GNU
+ * General Public License.  See the file COPYING in the main directory of the
+ * Linux distribution for more details.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+/* Intermediate node */
+#define LPM_TREE_NODE_FLAG_IM BIT(0)
+
+struct lpm_trie_node;
+
+struct lpm_trie_node {
+   struct rcu_head rcu;
+   struct lpm_trie_node __rcu  *child[2];
+   u32 prefixlen;
+   u32 flags;
+   u8  data[0];
+};
+
+struct lpm_trie {
+   struct bpf_map  map;
+   struct lpm_trie_node __rcu  *root;
+   size_t  n_entries;
+   size_t  max_prefixlen;
+   size_t  data_size;
+   raw_spinlock_t  lock;
+};
+
+/* This trie implements a longest prefix match algorithm that can be used to
+ * match IP addresses to a stored set of ranges.
+ *
+ * Data stored in @data of struct bpf_lpm_key and struct lpm_trie_node is
+ * interpreted as big endian, so data[0] stores the most significant byte.
+ *
+ * Match ranges are internally stored in instances of struct lpm_trie_node
+ * which each contain their prefix length as well as two pointers that may
+ * lead to more nodes containing more specific matches. Each node also stores
+ * a value that is defined by and returned to userspace via the update_elem
+ * and lookup functions.
+ *
+ * For instance, let's start with a trie that was created with a prefix length
+ * of 32, so it can be used for IPv4 addresses, and one single element that
+ * matches 192.168.0.0/16. The data array would hence contain
+ * [0xc0, 0xa8, 0x00, 0x00] in big-endian notation. This documentation will
+ * stick to IP-address notation for readability though.
+ *
+ * As the trie is empty initially, the new node (1) will be placed as root
+ * node, denoted as (R) in the example below. As there are no other nodes, both
+ * child pointers are %NULL.
+ *
+ *              +----------------+
+ *              |       (1)  (R) |
+ *              | 192.168.0.0/16 |
+ *              |    value: 1    |
+ *              |   [0]    [1]   |
+ *              +----------------+
+ *
+ * Next, let's add a new node (2) matching 192.168.0.0/24. As there is already
+ * a node with the same data and a smaller prefix (ie, a less specific one),
+ * node (2) will become a child of (

Re: [PATCH v2 1/2] bpf: add a longest prefix match trie map implementation

2017-01-14 Thread Daniel Mack
On 01/13/2017 07:01 PM, Alexei Starovoitov wrote:
> On Thu, Jan 12, 2017 at 06:29:21PM +0100, Daniel Mack wrote:
>> This trie implements a longest prefix match algorithm that can be used
>> to match IP addresses to a stored set of ranges.
>>
>> Internally, data is stored in an unbalanced trie of nodes that has a
>> maximum height of n, where n is the prefixlen the trie was created
>> with.
>>
>> Tries may be created with prefix lengths that are multiples of 8, in
>> the range from 8 to 2048. The key used for lookup and update operations
>> is a struct bpf_lpm_trie_key, and the value is a uint64_t.
>>
>> The code carries more information about the internal implementation.
>>
>> Signed-off-by: Daniel Mack <dan...@zonque.org>
>> Reviewed-by: David Herrmann <dh.herrm...@gmail.com>
>> ---
>>  include/uapi/linux/bpf.h |   7 +
>>  kernel/bpf/Makefile  |   2 +-
>>  kernel/bpf/lpm_trie.c| 499 
>> +++
>>  3 files changed, 507 insertions(+), 1 deletion(-)
>>  create mode 100644 kernel/bpf/lpm_trie.c

...

Thanks for spotting my typos! :)

>> +static struct lpm_trie_node *lpm_trie_node_alloc(const struct lpm_trie 
>> *trie,
>> + const void *value)
>> +{
>> +struct lpm_trie_node *node;
>> +gfp_t gfp = GFP_ATOMIC | __GFP_NOWARN;
>> +
>> +node = kmalloc(sizeof(struct lpm_trie_node) + trie->data_size, gfp);
>> +if (!node)
>> +return ERR_PTR(-ENOMEM);
>> +
>> +if (value) {
>> +node->value = kmemdup(value, trie->map.value_size, gfp);
> 
> can you make value to be part of the node? similar to how hash map is done?
> that will help avoid 2nd allocation, will speedup insertion and will
> help converting this code to user pre-allocated elements.
> I suspect the concern was that for many inner nodes that value is null ?
> But in your use case the value_size will be == 0 eventually,
> so by embedding it when can save memory too, since 'value' pointer will
> be replaced with boolean present flag ?
> So potentially less memory and less cache misses?

Yes, that's a good idea. Implemented that now.

> Overall algorithm is indeed straightforward and simple which is great,
> but I would still like to see some performance numbers.

I'm not sure yet how to implement such a test in a meaningful way, tbh.
Given that the lookups have to be done one by one, I expect the syscall
overhead to be quite significant.
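
One crude way to get numbers anyway is to time a large batch of lookups from
userspace and average them, accepting that the per-call syscall overhead is
included; a minimal sketch of such a loop (plain bpf(2) syscall, not the
selftest helpers):

#include <linux/bpf.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>

/* Hypothetical micro-benchmark: average latency of n lookups on map_fd. */
static void bench_lookups(int map_fd, void *key, void *value, unsigned long n)
{
	struct timespec t0, t1;
	union bpf_attr attr;
	unsigned long i;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < n; i++) {
		memset(&attr, 0, sizeof(attr));
		attr.map_fd = map_fd;
		attr.key = (__u64)(unsigned long)key;
		attr.value = (__u64)(unsigned long)value;
		syscall(__NR_bpf, BPF_MAP_LOOKUP_ELEM, &attr, sizeof(attr));
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);

	printf("%.1f ns per lookup (including syscall overhead)\n",
	       ((t1.tv_sec - t0.tv_sec) * 1e9 +
		(t1.tv_nsec - t0.tv_nsec)) / n);
}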

> Looks like
> the best case for single 32-bit element it needs 4 xors and compares
> which is fine. For mostly populate trie it's 4xors * 32 depth
> which is pretty good too, but cache misses on pointer walks may
> kill performance unless we're hitting the same path all the time.
> I think it's all acceptable due to simplicity of the implementation
> which we may improve later if it turns out to be a bottle neck for
> some use cases. We just need a baseline to have realistic expectations.

Yes, the maximum height of the trie is the number of bits in the prefix,
so for n bits, the iteration would at most take n steps to finish. For
each step, an xor and compare for n/8 bytes are needed.
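
Expressed in plain C, the per-node step amounts to roughly the following (an
illustrative sketch of the comparison, not the code from the patch itself):

#include <stddef.h>
#include <stdint.h>

/* Sketch: number of leading prefix bits of @node_data (prefix length
 * @node_prefixlen) that match @key; both arrays are big endian and
 * @data_size bytes long.
 */
static size_t prefix_match_bits(const uint8_t *node_data, uint32_t node_prefixlen,
				const uint8_t *key, size_t data_size)
{
	size_t i, matched = 0;

	for (i = 0; i < data_size && matched < node_prefixlen; i++) {
		uint8_t diff = node_data[i] ^ key[i];

		if (!diff) {
			matched += 8;
			continue;
		}
		/* first differing bit, counted from the most significant bit */
		matched += __builtin_clz(diff) - 24;
		break;
	}

	return matched < node_prefixlen ? matched : node_prefixlen;
}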

As you say, the implementation could be improved under the hood if
someone spots a bottleneck somewhere.

I'll post a v3 with your comments addressed for further discussion.


Thanks,
Daniel





[PATCH v2 2/2] bpf: Add tests for the lpm trie map

2017-01-12 Thread Daniel Mack
From: David Herrmann <dh.herrm...@gmail.com>

The first part of this program runs randomized tests against the
lpm-bpf-map. It implements a "Trivial Longest Prefix Match" (tlpm)
based on simple, linear, singly linked lists. The implementation
should be pretty straightforward.

Based on tlpm, this inserts randomized data into bpf-lpm-maps and
verifies the trie-based bpf-map implementation behaves the same way
as tlpm.

The second part uses 'real world' IPv4 and IPv6 addresses and tests
the trie with those.

Signed-off-by: David Herrmann <dh.herrm...@gmail.com>
Signed-off-by: Daniel Mack <dan...@zonque.org>
---
 tools/testing/selftests/bpf/.gitignore |   1 +
 tools/testing/selftests/bpf/Makefile   |   4 +-
 tools/testing/selftests/bpf/test_lpm_map.c | 358 +
 3 files changed, 361 insertions(+), 2 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/test_lpm_map.c

diff --git a/tools/testing/selftests/bpf/.gitignore 
b/tools/testing/selftests/bpf/.gitignore
index 071431b..d3b1c9b 100644
--- a/tools/testing/selftests/bpf/.gitignore
+++ b/tools/testing/selftests/bpf/.gitignore
@@ -1,3 +1,4 @@
 test_verifier
 test_maps
 test_lru_map
+test_lpm_map
diff --git a/tools/testing/selftests/bpf/Makefile 
b/tools/testing/selftests/bpf/Makefile
index 7a5f245..064a3e5 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -1,8 +1,8 @@
 CFLAGS += -Wall -O2 -I../../../../usr/include
 
-test_objs = test_verifier test_maps test_lru_map
+test_objs = test_verifier test_maps test_lru_map test_lpm_map
 
-TEST_PROGS := test_verifier test_maps test_lru_map test_kmod.sh
+TEST_PROGS := test_verifier test_maps test_lru_map test_lpm_map test_kmod.sh
 TEST_FILES := $(test_objs)
 
 all: $(test_objs)
diff --git a/tools/testing/selftests/bpf/test_lpm_map.c 
b/tools/testing/selftests/bpf/test_lpm_map.c
new file mode 100644
index 000..dd83f0b
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_lpm_map.c
@@ -0,0 +1,358 @@
+/*
+ * Randomized tests for eBPF longest-prefix-match maps
+ *
+ * This program runs randomized tests against the lpm-bpf-map. It implements a
+ * "Trivial Longest Prefix Match" (tlpm) based on simple, linear, singly linked
+ * lists. The implementation should be pretty straightforward.
+ *
+ * Based on tlpm, this inserts randomized data into bpf-lpm-maps and verifies
+ * the trie-based bpf-map implementation behaves the same way as tlpm.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "bpf_sys.h"
+#include "bpf_util.h"
+
+struct tlpm_node {
+   struct tlpm_node *next;
+   size_t n_bits;
+   uint8_t key[];
+};
+
+static struct tlpm_node *tlpm_add(struct tlpm_node *list,
+ const uint8_t *key,
+ size_t n_bits)
+{
+   struct tlpm_node *node;
+   size_t n;
+
+   /* add new entry with @key/@n_bits to @list and return new head */
+
+   n = (n_bits + 7) / 8;
+   node = malloc(sizeof(*node) + n);
+   assert(node);
+
+   node->next = list;
+   node->n_bits = n_bits;
+   memcpy(node->key, key, n);
+
+   return node;
+}
+
+static void tlpm_clear(struct tlpm_node *list)
+{
+   struct tlpm_node *node;
+
+   /* free all entries in @list */
+
+   while ((node = list)) {
+   list = list->next;
+   free(node);
+   }
+}
+
+static struct tlpm_node *tlpm_match(struct tlpm_node *list,
+   const uint8_t *key,
+   size_t n_bits)
+{
+   struct tlpm_node *best = NULL;
+   size_t i;
+
+   /*
+* Perform longest prefix-match on @key/@n_bits. That is, iterate all
+* entries and match each prefix against @key. Remember the "best"
+* entry we find (i.e., the longest prefix that matches) and return it
+* to the caller when done.
+*/
+
+   for ( ; list; list = list->next) {
+   for (i = 0; i < n_bits && i < list->n_bits; ++i) {
+   if ((key[i / 8] & (1 << (7 - i % 8))) !=
+   (list->key[i / 8] & (1 << (7 - i % 8))))
+   break;
+   }
+
+   if (i >= list->n_bits) {
+   if (!best || i > best->n_bits)
+   best = list;
+   }
+   }
+
+   return best;
+}
+
+static void test_lpm_basic(void)
+{
+   struct tlpm_node *list = NULL, *t1, *t2;
+
+   /* very basic, static tests to verify tlpm works as expected */
+
+   assert(!tlpm_match(list, (uint8_t[]){ 0xff }, 8));
+
+   t1 = list = tlpm_add(list, (uint8_t[]){ 0xff }, 8);
+   assert(t1 == tlpm_match(list, (uint8_t[]){ 0xff }, 8)

[PATCH v2 1/2] bpf: add a longest prefix match trie map implementation

2017-01-12 Thread Daniel Mack
This trie implements a longest prefix match algorithm that can be used
to match IP addresses to a stored set of ranges.

Internally, data is stored in an unbalanced trie of nodes that has a
maximum height of n, where n is the prefixlen the trie was created
with.

Tries may be created with prefix lengths that are multiples of 8, in
the range from 8 to 2048. The key used for lookup and update operations
is a struct bpf_lpm_trie_key, and the value is a uint64_t.

The code carries more information about the internal implementation.

Signed-off-by: Daniel Mack <dan...@zonque.org>
Reviewed-by: David Herrmann <dh.herrm...@gmail.com>
---
 include/uapi/linux/bpf.h |   7 +
 kernel/bpf/Makefile  |   2 +-
 kernel/bpf/lpm_trie.c| 499 +++
 3 files changed, 507 insertions(+), 1 deletion(-)
 create mode 100644 kernel/bpf/lpm_trie.c

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 0eb0e87..d564277 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -63,6 +63,12 @@ struct bpf_insn {
__s32   imm;/* signed immediate constant */
 };
 
+/* Key of a BPF_MAP_TYPE_LPM_TRIE entry */
+struct bpf_lpm_trie_key {
+   __u32   prefixlen;  /* up to 32 for AF_INET, 128 for AF_INET6 */
+   __u8data[0];/* Arbitrary size */
+};
+
 /* BPF syscall commands, see bpf(2) man-page for details. */
 enum bpf_cmd {
BPF_MAP_CREATE,
@@ -89,6 +95,7 @@ enum bpf_map_type {
BPF_MAP_TYPE_CGROUP_ARRAY,
BPF_MAP_TYPE_LRU_HASH,
BPF_MAP_TYPE_LRU_PERCPU_HASH,
+   BPF_MAP_TYPE_LPM_TRIE,
 };
 
 enum bpf_prog_type {
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index 1276474..e1ce4f4 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -1,7 +1,7 @@
 obj-y := core.o
 
 obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o
-obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o 
bpf_lru_list.o
+obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o 
bpf_lru_list.o lpm_trie.o
 ifeq ($(CONFIG_PERF_EVENTS),y)
 obj-$(CONFIG_BPF_SYSCALL) += stackmap.o
 endif
diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
new file mode 100644
index 000..7f6d47e
--- /dev/null
+++ b/kernel/bpf/lpm_trie.c
@@ -0,0 +1,499 @@
+/*
+ * Longest prefix match list implementation
+ *
+ * Copyright (c) 2016,2017 Daniel Mack
+ * Copyright (c) 2016 David Herrmann
+ *
+ * This file is subject to the terms and conditions of version 2 of the GNU
+ * General Public License.  See the file COPYING in the main directory of the
+ * Linux distribution for more details.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+struct lpm_trie_node;
+
+struct lpm_trie_node {
+   struct rcu_head rcu;
+   struct lpm_trie_node __rcu  *child[2];
+   void*value;
+   u32 prefixlen;
+   u8  data[0];
+};
+
+struct lpm_trie {
+   struct bpf_map  map;
+   struct lpm_trie_node __rcu  *root;
+   size_t  n_entries;
+   size_t  max_prefixlen;
+   size_t  data_size;
+   raw_spinlock_t  lock;
+};
+
+/*
+ * This trie implements a longest prefix match algorithm that can be used to
+ * match IP addresses to a stored set of ranges.
+ *
+ * Data stored in @data of struct bpf_lpm_key and struct lpm_trie_node is
+ * interpreted as big endian, so data[0] stores the most significant byte.
+ *
+ * Match ranges are internally stored in instances of struct lpm_trie_node
+ * which each contain their prefix length as well as two pointers that may
+ * lead to more nodes containing more specific matches. Each node also stores
+ * a value that is defined by and returned to userspace via the update_elem
+ * and lookup functions.
+ *
+ * For instance, let's start with a trie that was created with a prefix length
+ * of 32, so it can be used for IPv4 addresses, and one single element that
+ * matches 192.168.0.0/16. The data array would hence contain
+ * [0xc0, 0xa8, 0x00, 0x00] in big-endian notation. This documentation will
+ * stick to IP-address notation for readability though.
+ *
+ * As the trie is empty initially, the new node (1) will be placed as root
+ * node, denoted as (R) in the example below. As there are no other nodes, both
+ * child pointers are %NULL.
+ *
+ *                 +----------------+
+ *                 |       (1)  (R) |
+ *                 | 192.168.0.0/16 |
+ *                 |    value: 1    |
+ *                 |   [0]    [1]   |
+ *                 +----------------+
+ *
+ * Next, let's add a new node (2) matching 192.168.0.0/24. As there is already
+ * a node with the same data and a smaller prefix (ie, a less specific one),
+ * node (2) will become a child of (1). The child index depends on the next bit
+ * that is outsid

[PATCH v2 0/2] bpf: add longest prefix match map

2017-01-12 Thread Daniel Mack
This patch set adds a longest prefix match algorithm that can be used
to match IP addresses to a stored set of ranges. It is exposed as a
bpf map type.
   
Internally, data is stored in an unbalanced tree of nodes that has a
maximum height of n, where n is the prefixlen the trie was created
with.
 
Note that this has nothing to do with fib or fib6 and is in no way meant
to replace or share code with it. It's rather a much simpler
implementation that is specifically written with bpf maps in mind.
 
Patch 1/2 adds the implementation, and 2/2 an extensive test suite.

We didn't yet get around to augmenting the tests for non-2^n bit depths
and benchmarks. We'll add that later.
 
Feedback is much appreciated.
 
 
Thanks,
Daniel

Changelog:

v1 -> v2:
* Turn spin lock into raw spinlock
* Lock with irqsave options during trie_update_elem()
* Return -ENOMEM properly from trie_alloc()
* Force attr->flags == BPF_F_NO_PREALLOC during creation
* Set trie->map.pages after creation to account for map memory
* Allow arbitrary value sizes
* Removed node->flags and denote intermediate nodes through
  node->value == NULL instead

rfc -> v1:
* Add __rcu pointer annotations to make sparse happy
* Fold _lpm_trie_find_target_node() into its only caller
* Fix some minor documentation issues


Daniel Mack (1):
  bpf: add a longest prefix match trie map implementation

David Herrmann (1):
  bpf: Add tests for the lpm trie map

 include/uapi/linux/bpf.h   |   7 +
 kernel/bpf/Makefile|   2 +-
 kernel/bpf/lpm_trie.c  | 499 +
 tools/testing/selftests/bpf/.gitignore |   1 +
 tools/testing/selftests/bpf/Makefile   |   4 +-
 tools/testing/selftests/bpf/test_lpm_map.c | 358 +
 6 files changed, 868 insertions(+), 3 deletions(-)
 create mode 100644 kernel/bpf/lpm_trie.c
 create mode 100644 tools/testing/selftests/bpf/test_lpm_map.c

-- 
2.9.3



Re: [PATCH v1 1/2] bpf: add a longest prefix match trie map implementation

2017-01-05 Thread Daniel Mack
Hi,

On 01/05/2017 09:01 PM, Daniel Borkmann wrote:
> On 01/05/2017 05:25 PM, Daniel Borkmann wrote:
>> On 12/29/2016 06:28 PM, Daniel Mack wrote:

> [...]
>>> +static struct bpf_map *trie_alloc(union bpf_attr *attr)
>>> +{
>>> +struct lpm_trie *trie;
>>> +
>>> +/* check sanity of attributes */
>>> +if (attr->max_entries == 0 || attr->map_flags ||
>>> +attr->key_size < sizeof(struct bpf_lpm_trie_key) + 1   ||
>>> +attr->key_size > sizeof(struct bpf_lpm_trie_key) + 256 ||
>>> +attr->value_size != sizeof(u64))
>>> +return ERR_PTR(-EINVAL);
> 
> One more question on this regarding value size as u64 (perhaps I
> missed it along the way): reason this was chosen was because for
> keeping stats? Why not making user choose a size as in other maps,
> so also custom structs could be stored there?

In my use case, the actual value of a node is in fact ignored, all that
matters is whether a node exists in a trie or not. The test code uses
u64 for its tests.

I can change it around so that the value size can be defined by
userspace, but ideally it would also support 0-byte lengths then. The
bpf map syscall handler should handle the latter just fine if I read the
code correctly?
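
For context, creating and filling such a trie from userspace only takes a
couple of bpf(2) calls; a minimal sketch assuming the uapi additions from this
series (error handling omitted, map_flags per the revision in question):

#include <linux/bpf.h>        /* bpf_lpm_trie_key etc., with this series applied */
#include <arpa/inet.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>

/* Sketch: create an IPv4 LPM trie and insert 192.168.0.0/16 -> 1. */
static int lpm_ipv4_example(void)
{
	size_t key_size = sizeof(struct bpf_lpm_trie_key) + 4;
	struct bpf_lpm_trie_key *key = malloc(key_size);
	union bpf_attr attr;
	__u64 value = 1;
	int map_fd;

	memset(&attr, 0, sizeof(attr));
	attr.map_type    = BPF_MAP_TYPE_LPM_TRIE;
	attr.key_size    = key_size;      /* prefixlen + 4 address bytes */
	attr.value_size  = sizeof(value);
	attr.max_entries = 128;
	attr.map_flags   = 0;             /* v2 of this series wants BPF_F_NO_PREALLOC here */
	map_fd = syscall(__NR_bpf, BPF_MAP_CREATE, &attr, sizeof(attr));

	key->prefixlen = 16;
	inet_pton(AF_INET, "192.168.0.0", key->data);

	memset(&attr, 0, sizeof(attr));
	attr.map_fd = map_fd;
	attr.key    = (__u64)(unsigned long)key;
	attr.value  = (__u64)(unsigned long)&value;
	syscall(__NR_bpf, BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr));

	free(key);
	return map_fd;
}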


Thanks,
Daniel


Re: [PATCH v1 1/2] bpf: add a longest prefix match trie map implementation

2017-01-05 Thread Daniel Mack
Hi Daniel,

Thanks for your feedback! I agree on all points. Two questions below.

On 01/05/2017 05:25 PM, Daniel Borkmann wrote:
> On 12/29/2016 06:28 PM, Daniel Mack wrote:

>> diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
>> new file mode 100644
>> index 000..8b6a61d
>> --- /dev/null
>> +++ b/kernel/bpf/lpm_trie.c

[..]

>> +static struct bpf_map *trie_alloc(union bpf_attr *attr)
>> +{
>> +struct lpm_trie *trie;
>> +
>> +/* check sanity of attributes */
>> +if (attr->max_entries == 0 || attr->map_flags ||
>> +attr->key_size < sizeof(struct bpf_lpm_trie_key) + 1   ||
>> +attr->key_size > sizeof(struct bpf_lpm_trie_key) + 256 ||
>> +attr->value_size != sizeof(u64))
>> +return ERR_PTR(-EINVAL);
> 
> The correct attr->map_flags test here would need to be ...
> 
>attr->map_flags != BPF_F_NO_PREALLOC
> 
> ... since in this case we don't have any prealloc pool, and
> should that come one day that test could be relaxed again.
> 
>> +trie = kzalloc(sizeof(*trie), GFP_USER | __GFP_NOWARN);
>> +if (!trie)
>> +return NULL;
>> +
>> +/* copy mandatory map attributes */
>> +trie->map.map_type = attr->map_type;
>> +trie->map.key_size = attr->key_size;
>> +trie->map.value_size = attr->value_size;
>> +trie->map.max_entries = attr->max_entries;
> 
> You also need to fill in trie->map.pages as that is eventually
> used to charge memory against in bpf_map_charge_memlock(), right
> now that would remain as 0 meaning the map is not accounted for.

Hmm, okay. The nodes are, however, allocated dynamically at runtime in
this case. That means that we have to update trie->map.pages on each allocation,
right?
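
One way to avoid per-allocation accounting is to charge a worst-case estimate
once at map creation; a sketch of that idea (not necessarily what a later
revision ends up doing):

/* Sketch: worst-case memlock charge computed once in trie_alloc(), so
 * map.pages does not have to be updated on every node allocation.
 */
static u32 lpm_trie_worst_case_pages(const union bpf_attr *attr,
				     size_t data_size)
{
	u64 cost_per_node = sizeof(struct lpm_trie_node) +
			    attr->value_size + data_size;
	u64 cost = sizeof(struct lpm_trie) +
		   (u64)attr->max_entries * cost_per_node;

	return round_up(cost, PAGE_SIZE) >> PAGE_SHIFT;
}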

>> +static void trie_free(struct bpf_map *map)
>> +{
>> +struct lpm_trie_node __rcu **slot;
>> +struct lpm_trie_node *node;
>> +struct lpm_trie *trie =
>> +container_of(map, struct lpm_trie, map);
>> +
>> +spin_lock(&trie->lock);
>> +
>> +/*
>> + * Always start at the root and walk down to a node that has no
>> + * children. Then free that node, nullify its pointer in the parent,
>> + * then start over.
>> + */
>> +
>> +for (;;) {
>> +slot = &trie->root;
>> +
>> +for (;;) {
>> +node = rcu_dereference_protected(*slot,
>> +lockdep_is_held(&trie->lock));
>> +if (!node)
>> +goto out;
>> +
>> +if (node->child[0]) {
> 
> rcu_access_pointer(node->child[0]) (at least to keep sparse happy?)

Done, but sparse does not actually complain here.



Thanks,
Daniel



[PATCH v1 1/2] bpf: add a longest prefix match trie map implementation

2016-12-29 Thread Daniel Mack
This trie implements a longest prefix match algorithm that can be used
to match IP addresses to a stored set of ranges.

Internally, data is stored in an unbalanced trie of nodes that has a
maximum height of n, where n is the prefixlen the trie was created
with.

Tries may be created with prefix lengths that are multiples of 8, in
the range from 8 to 2048. The key used for lookup and update operations
is a struct bpf_lpm_trie_key, and the value is a uint64_t.

The code carries more information about the internal implementation.

Signed-off-by: Daniel Mack <dan...@zonque.org>
Reviewed-by: David Herrmann <dh.herrm...@gmail.com>
---
 include/uapi/linux/bpf.h |   7 +
 kernel/bpf/Makefile  |   2 +-
 kernel/bpf/lpm_trie.c| 468 +++
 3 files changed, 476 insertions(+), 1 deletion(-)
 create mode 100644 kernel/bpf/lpm_trie.c

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 0eb0e87..d564277 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -63,6 +63,12 @@ struct bpf_insn {
__s32   imm;/* signed immediate constant */
 };
 
+/* Key of a BPF_MAP_TYPE_LPM_TRIE entry */
+struct bpf_lpm_trie_key {
+   __u32   prefixlen;  /* up to 32 for AF_INET, 128 for AF_INET6 */
+   __u8data[0];/* Arbitrary size */
+};
+
 /* BPF syscall commands, see bpf(2) man-page for details. */
 enum bpf_cmd {
BPF_MAP_CREATE,
@@ -89,6 +95,7 @@ enum bpf_map_type {
BPF_MAP_TYPE_CGROUP_ARRAY,
BPF_MAP_TYPE_LRU_HASH,
BPF_MAP_TYPE_LRU_PERCPU_HASH,
+   BPF_MAP_TYPE_LPM_TRIE,
 };
 
 enum bpf_prog_type {
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index 1276474..e1ce4f4 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -1,7 +1,7 @@
 obj-y := core.o
 
 obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o
-obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o 
bpf_lru_list.o
+obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o 
bpf_lru_list.o lpm_trie.o
 ifeq ($(CONFIG_PERF_EVENTS),y)
 obj-$(CONFIG_BPF_SYSCALL) += stackmap.o
 endif
diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
new file mode 100644
index 000..8b6a61d
--- /dev/null
+++ b/kernel/bpf/lpm_trie.c
@@ -0,0 +1,468 @@
+/*
+ * Longest prefix match list implementation
+ *
+ * Copyright (c) 2016 Daniel Mack
+ * Copyright (c) 2016 David Herrmann
+ *
+ * This file is subject to the terms and conditions of version 2 of the GNU
+ * General Public License.  See the file COPYING in the main directory of the
+ * Linux distribution for more details.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+/* Intermediate node */
+#define LPM_TREE_NODE_FLAG_IM BIT(0)
+
+struct lpm_trie_node;
+
+struct lpm_trie_node {
+   struct rcu_head rcu;
+   struct lpm_trie_node __rcu  *child[2];
+   u32 prefixlen;
+   u32 flags;
+   u64 value;
+   u8  data[0];
+};
+
+struct lpm_trie {
+   struct bpf_map  map;
+   struct lpm_trie_node __rcu  *root;
+   size_t  n_entries;
+   size_t  max_prefixlen;
+   size_t  data_size;
+   spinlock_t  lock;
+};
+
+/*
+ * This trie implements a longest prefix match algorithm that can be used to
+ * match IP addresses to a stored set of ranges.
+ *
+ * Data stored in @data of struct bpf_lpm_key and struct lpm_trie_node is
+ * interpreted as big endian, so data[0] stores the most significant byte.
+ *
+ * Match ranges are internally stored in instances of struct lpm_trie_node
+ * which each contain their prefix length as well as two pointers that may
+ * lead to more nodes containing more specific matches. Each node also stores
+ * a value that is defined by and returned to userspace via the update_elem
+ * and lookup functions.
+ *
+ * For instance, let's start with a trie that was created with a prefix length
+ * of 32, so it can be used for IPv4 addresses, and one single element that
+ * matches 192.168.0.0/16. The data array would hence contain
+ * [0xc0, 0xa8, 0x00, 0x00] in big-endian notation. This documentation will
+ * stick to IP-address notation for readability though.
+ *
+ * As the trie is empty initially, the new node (1) will be placed as root
+ * node, denoted as (R) in the example below. As there are no other nodes, both
+ * child pointers are %NULL.
+ *
+ *                 +----------------+
+ *                 |       (1)  (R) |
+ *                 | 192.168.0.0/16 |
+ *                 |    value: 1    |
+ *                 |   [0]    [1]   |
+ *                 +----------------+
+ *
+ * Next, let's add a new node (2) matching 192.168.0.0/24. As there is already
+ * a node with the same data and a smaller prefix (ie, a less 

[PATCH v1 2/2] bpf: Add tests for the lpm trie map

2016-12-29 Thread Daniel Mack
From: David Herrmann <dh.herrm...@gmail.com>

The first part of this program runs randomized tests against the
lpm-bpf-map. It implements a "Trivial Longest Prefix Match" (tlpm)
based on simple, linear, singly linked lists. The implementation
should be pretty straightforward.

Based on tlpm, this inserts randomized data into bpf-lpm-maps and
verifies the trie-based bpf-map implementation behaves the same way
as tlpm.

The second part uses 'real world' IPv4 and IPv6 addresses and tests
the trie with those.

Signed-off-by: David Herrmann <dh.herrm...@gmail.com>
Signed-off-by: Daniel Mack <dan...@zonque.org>
---
 tools/testing/selftests/bpf/.gitignore |   1 +
 tools/testing/selftests/bpf/Makefile   |   4 +-
 tools/testing/selftests/bpf/test_lpm_map.c | 348 +
 3 files changed, 351 insertions(+), 2 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/test_lpm_map.c

diff --git a/tools/testing/selftests/bpf/.gitignore 
b/tools/testing/selftests/bpf/.gitignore
index 071431b..d3b1c9b 100644
--- a/tools/testing/selftests/bpf/.gitignore
+++ b/tools/testing/selftests/bpf/.gitignore
@@ -1,3 +1,4 @@
 test_verifier
 test_maps
 test_lru_map
+test_lpm_map
diff --git a/tools/testing/selftests/bpf/Makefile 
b/tools/testing/selftests/bpf/Makefile
index 7a5f245..064a3e5 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -1,8 +1,8 @@
 CFLAGS += -Wall -O2 -I../../../../usr/include
 
-test_objs = test_verifier test_maps test_lru_map
+test_objs = test_verifier test_maps test_lru_map test_lpm_map
 
-TEST_PROGS := test_verifier test_maps test_lru_map test_kmod.sh
+TEST_PROGS := test_verifier test_maps test_lru_map test_lpm_map test_kmod.sh
 TEST_FILES := $(test_objs)
 
 all: $(test_objs)
diff --git a/tools/testing/selftests/bpf/test_lpm_map.c 
b/tools/testing/selftests/bpf/test_lpm_map.c
new file mode 100644
index 000..08db750
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_lpm_map.c
@@ -0,0 +1,348 @@
+/*
+ * Randomized tests for eBPF longest-prefix-match maps
+ *
+ * This program runs randomized tests against the lpm-bpf-map. It implements a
+ * "Trivial Longest Prefix Match" (tlpm) based on simple, linear, singly linked
+ * lists. The implementation should be pretty straightforward.
+ *
+ * Based on tlpm, this inserts randomized data into bpf-lpm-maps and verifies
+ * the trie-based bpf-map implementation behaves the same way as tlpm.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "bpf_sys.h"
+#include "bpf_util.h"
+
+struct tlpm_node {
+   struct tlpm_node *next;
+   size_t n_bits;
+   uint8_t key[];
+};
+
+static struct tlpm_node *tlpm_add(struct tlpm_node *list,
+ const uint8_t *key,
+ size_t n_bits)
+{
+   struct tlpm_node *node;
+   size_t n;
+
+   /* add new entry with @key/@n_bits to @list and return new head */
+
+   n = (n_bits + 7) / 8;
+   node = malloc(sizeof(*node) + n);
+   assert(node);
+
+   node->next = list;
+   node->n_bits = n_bits;
+   memcpy(node->key, key, n);
+
+   return node;
+}
+
+static void tlpm_clear(struct tlpm_node *list)
+{
+   struct tlpm_node *node;
+
+   /* free all entries in @list */
+
+   while ((node = list)) {
+   list = list->next;
+   free(node);
+   }
+}
+
+static struct tlpm_node *tlpm_match(struct tlpm_node *list,
+   const uint8_t *key,
+   size_t n_bits)
+{
+   struct tlpm_node *best = NULL;
+   size_t i;
+
+   /*
+* Perform longest prefix-match on @key/@n_bits. That is, iterate all
+* entries and match each prefix against @key. Remember the "best"
+* entry we find (i.e., the longest prefix that matches) and return it
+* to the caller when done.
+*/
+
+   for ( ; list; list = list->next) {
+   for (i = 0; i < n_bits && i < list->n_bits; ++i) {
+   if ((key[i / 8] & (1 << (7 - i % 8))) !=
+   (list->key[i / 8] & (1 << (7 - i % 8))))
+   break;
+   }
+
+   if (i >= list->n_bits) {
+   if (!best || i > best->n_bits)
+   best = list;
+   }
+   }
+
+   return best;
+}
+
+static void test_lpm_basic(void)
+{
+   struct tlpm_node *list = NULL, *t1, *t2;
+
+   /* very basic, static tests to verify tlpm works as expected */
+
+   assert(!tlpm_match(list, (uint8_t[]){ 0xff }, 8));
+
+   t1 = list = tlpm_add(list, (uint8_t[]){ 0xff }, 8);
+   assert(t1 == tlpm_match(list, (uint8_t[]){ 0xff }, 8));
+   assert(t1 == 

[PATCH v1 0/2] bpf: add longest prefix match map

2016-12-29 Thread Daniel Mack
This patch set adds a longest prefix match algorithm that can be used
to match IP addresses to a stored set of ranges. It is exposed as a
bpf map type.
   
Internally, data is stored in an unbalanced tree of nodes that has a
maximum height of n, where n is the prefixlen the trie was created
with.
 
Note that this has nothing to do with fib or fib6 and is in no way meant
to replace or share code with it. It's rather a much simpler
implementation that is specifically written with bpf maps in mind.
 
Patch 1/2 adds the implementation, and 2/2 an extensive test suite.
 
Feedback is much appreciated.
 
 
Thanks,
Daniel

Changelog:

rfc -> v1:
* Add __rcu pointer annotations to make sparse happy
* Fold _lpm_trie_find_target_node() into its only caller
* Fix some minor documentation issues


Daniel Mack (1):
  bpf: add a longest prefix match trie map implementation

David Herrmann (1):
  bpf: Add tests for the lpm trie map

 include/uapi/linux/bpf.h   |   7 +
 kernel/bpf/Makefile|   2 +-
 kernel/bpf/lpm_trie.c  | 468 +
 tools/testing/selftests/bpf/.gitignore |   1 +
 tools/testing/selftests/bpf/Makefile   |   4 +-
 tools/testing/selftests/bpf/test_lpm_map.c | 348 +
 6 files changed, 827 insertions(+), 3 deletions(-)
 create mode 100644 kernel/bpf/lpm_trie.c
 create mode 100644 tools/testing/selftests/bpf/test_lpm_map.c

-- 
2.9.3



Re: Potential issues (security and otherwise) with the current cgroup-bpf API

2016-12-20 Thread Daniel Mack
Hi,

On 12/20/2016 06:23 PM, Andy Lutomirski wrote:
> On Tue, Dec 20, 2016 at 2:21 AM, Daniel Mack <dan...@zonque.org> wrote:

> To clarify, since this thread has gotten excessively long and twisted,
> I think it's important that, for hooks attached to a cgroup, you be
> able to tell in a generic way whether something is plugged into the
> hook.  The natural way to see a cgroup's configuration is to read from
> cgroupfs, so I think that reading from cgroupfs should show you that a
> BPF program is attached and also give enough information that, once
> bpf programs become dumpable, you can dump the program (using the
> bpf() syscall or whatever).

[...]

> There isn't a big semantic difference between
> 'open("/cgroup/NAME/some.control.file", O_WRONLY); ioctl(...,
> CGROUP_ATTACH_BPF, ...)' and 'open("/cgroup/NAME/some.control.file",
> O_WRONLY); bpf(BPF_PROG_ATTACH, ...);'.  There is, however, a semantic
> difference when you do open("/cgroup/NAME", O_RDONLY | O_DIRECTORY)
> because the permission check is much weaker.

Okay, if you have such a control file, you can of course do something
like that. When we discussed things back then with Tejun however, we
concluded that a controller that is not completely controllable through
control knobs that can be written and read via cat is meaningless.
That's why this has become a 'hidden' cgroup feature.

With your proposed API, you'd first go to the bpf(2) syscall in order to
get a prog fd, and then come back to some sort of cgroup API to put the
fd in there. That's quite a mix and match, which is why we considered
the API cleaner in its current form, as everything that is related to
bpf is encapsulated behind a single syscall.
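
Concretely, the attach side as proposed boils down to this kind of sketch
(prog_fd comes from an earlier BPF_PROG_LOAD; error handling omitted):

#include <linux/bpf.h>      /* BPF_PROG_ATTACH etc., with this series applied */
#include <fcntl.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>

/* Sketch: attach an already loaded BPF_PROG_TYPE_CGROUP_SKB program to
 * the ingress hook of a cgroup v2 directory.
 */
static int attach_ingress(int prog_fd, const char *cgroup_path)
{
	union bpf_attr attr;
	int cg_fd = open(cgroup_path, O_RDONLY | O_DIRECTORY);

	memset(&attr, 0, sizeof(attr));
	attr.target_fd     = cg_fd;
	attr.attach_bpf_fd = prog_fd;
	attr.attach_type   = BPF_CGROUP_INET_INGRESS;

	return syscall(__NR_bpf, BPF_PROG_ATTACH, &attr, sizeof(attr));
}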

> My preference would be to do an ioctl on a new
> /cgroup/NAME/network_hooks.inet_ingress file.  Reading that file tells
> you whether something is attached and hopefully also gives enough
> information (a hash of the BPF program, perhaps) to dump the actual
> program using future bpf() interfaces.  write() and ioctl() can be
> used to configure it as appropriate.

So am I reading this right? You're proposing to add ioctl() hooks to
kernfs/cgroupfs? That would open more possibilities of course, but I'm
not sure where that rabbit hole leads us eventually.

> Another option that I like less would be to have a
> /cgroup/NAME/cgroup.bpf that lists all the active hooks along with
> their contents.  You would do an ioctl() on that to program a hook and
> you could read it to see what's there.

Yes, read() could, in theory, give you similar information to ioctl(),
but in human-readable form.

> FWIW, everywhere I say ioctl(), the bpf() syscall would be okay, too.
> It doesn't make a semantic difference, except that I dislike
> BPF_PROG_DETACH because that particular command isn't BPF-specific at
> all.

Well, I think it is; it pops the bpf program from a target and drops the
reference on it. It's not much code, but it's certainly bpf-specific.

>>> So if I set up a cgroup that's monitored and call it /cgroup/a and
>>> enable delegation and if the program running there wants to do its own
>>> monitoring in /cgroup/a/b (via delegation), then you really want the
>>> outer monitor to silently drop events coming from /cgroup/a/b?
>>
>> That's a fair point, and we've discussed it as well. The issue is, as
>> Alexei already pointed out, that we do not want to traverse the tree up
>> to the root for nested cgroups due to the runtime costs in the
>> networking fast-path. After all, we're running the bpf program for each
>> packet in flight. Hence, we opted for the approach to only look at the
>> leaf node for now, with the ability to open it up further in the future
>> using flags during attach etc.
> 
> Careful here!  You don't look only at the leaf node for now.  You do a
> fancy traversal and choose the nearest node that has a hook set up.

But we do the 'complex' operation at attach time or when a cgroup is
created, both of which are slow-path operations. In the fast-path, we
only look at the leaf, which may or may not have an effective program
installed. And that's of course much cheaper than doing the traversal
for each packet.
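
A simplified sketch of that slow-path propagation (not the verbatim code from
the series; parent_effective() is only shorthand for looking up the parent's
effective program):

	struct cgroup_subsys_state *pos;
	struct bpf_prog *effective;

	/* New effective program for @cgrp: the freshly attached one, or,
	 * on detach, whatever the parent provides.
	 */
	effective = prog ? prog : parent_effective(cgrp, type);
	rcu_assign_pointer(cgrp->bpf.effective[type], effective);

	/* Push it down once; subtrees that pinned their own program keep it. */
	css_for_each_descendant_pre(pos, &cgrp->self) {
		struct cgroup *desc = container_of(pos, struct cgroup, self);

		if (desc == cgrp)
			continue;
		if (desc->bpf.prog[type])
			pos = css_rightmost_descendant(pos);
		else
			rcu_assign_pointer(desc->bpf.effective[type], effective);
	}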

> mkdir /cgroup/foo
> BPF_PROG_ATTACH(some program to foo)
> mkdir /cgroup/foo/bar
> chown -R some_user /cgroup/foo/bar
> 
> If the kernel only looked at the leaf, then the program that did the
> above would not expect that the program would constrain
> /cgroup/foo/bar's activity.  But, as it stands, the program *would*
> expect /cgroup/foo/bar to be constrained, except that, whenever the
> capable() check changes to ns_capable() (which will happen eventually
> one way or another), then the bad guy can create /cgroup/foo/bar/baz,
> install a new no-op hook there, and br

[PATCH] bpf: cgroup: annotate pointers in struct cgroup_bpf with __rcu

2016-12-15 Thread Daniel Mack
The member 'effective' in 'struct cgroup_bpf' is protected by RCU.
Annotate it accordingly to squelch a sparse warning.

Signed-off-by: Daniel Mack <dan...@zonque.org>
---
 include/linux/bpf-cgroup.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
index 7b6e5d1..92bc89a 100644
--- a/include/linux/bpf-cgroup.h
+++ b/include/linux/bpf-cgroup.h
@@ -20,7 +20,7 @@ struct cgroup_bpf {
 * when this cgroup is accessed.
 */
struct bpf_prog *prog[MAX_BPF_ATTACH_TYPE];
-   struct bpf_prog *effective[MAX_BPF_ATTACH_TYPE];
+   struct bpf_prog __rcu *effective[MAX_BPF_ATTACH_TYPE];
 };
 
 void cgroup_bpf_put(struct cgroup *cgrp);
-- 
2.9.3



[PATCH RFC 2/2] bpf: Add tests for the lpm trie map

2016-12-14 Thread Daniel Mack
From: David Herrmann <dh.herrm...@gmail.com>

The first part of this program runs randomized tests against the
lpm-bpf-map. It implements a "Trivial Longest Prefix Match" (tlpm)
based on simple, linear, singly linked lists. The implementation
should be pretty straightforward.

Based on tlpm, this inserts randomized data into bpf-lpm-maps and
verifies the trie-based bpf-map implementation behaves the same way
as tlpm.

The second part uses 'real world' IPv4 and IPv6 addresses and tests
the trie with those.

Signed-off-by: David Herrmann <dh.herrm...@gmail.com>
Signed-off-by: Daniel Mack <dan...@zonque.org>
---
 tools/testing/selftests/bpf/.gitignore |   1 +
 tools/testing/selftests/bpf/Makefile   |   4 +-
 tools/testing/selftests/bpf/test_lpm_map.c | 348 +
 3 files changed, 351 insertions(+), 2 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/test_lpm_map.c

diff --git a/tools/testing/selftests/bpf/.gitignore 
b/tools/testing/selftests/bpf/.gitignore
index 071431b..d3b1c9b 100644
--- a/tools/testing/selftests/bpf/.gitignore
+++ b/tools/testing/selftests/bpf/.gitignore
@@ -1,3 +1,4 @@
 test_verifier
 test_maps
 test_lru_map
+test_lpm_map
diff --git a/tools/testing/selftests/bpf/Makefile 
b/tools/testing/selftests/bpf/Makefile
index 7a5f245..064a3e5 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -1,8 +1,8 @@
 CFLAGS += -Wall -O2 -I../../../../usr/include
 
-test_objs = test_verifier test_maps test_lru_map
+test_objs = test_verifier test_maps test_lru_map test_lpm_map
 
-TEST_PROGS := test_verifier test_maps test_lru_map test_kmod.sh
+TEST_PROGS := test_verifier test_maps test_lru_map test_lpm_map test_kmod.sh
 TEST_FILES := $(test_objs)
 
 all: $(test_objs)
diff --git a/tools/testing/selftests/bpf/test_lpm_map.c 
b/tools/testing/selftests/bpf/test_lpm_map.c
new file mode 100644
index 000..08db750
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_lpm_map.c
@@ -0,0 +1,348 @@
+/*
+ * Randomized tests for eBPF longest-prefix-match maps
+ *
+ * This program runs randomized tests against the lpm-bpf-map. It implements a
+ * "Trivial Longest Prefix Match" (tlpm) based on simple, linear, singly linked
+ * lists. The implementation should be pretty straightforward.
+ *
+ * Based on tlpm, this inserts randomized data into bpf-lpm-maps and verifies
+ * the trie-based bpf-map implementation behaves the same way as tlpm.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "bpf_sys.h"
+#include "bpf_util.h"
+
+struct tlpm_node {
+   struct tlpm_node *next;
+   size_t n_bits;
+   uint8_t key[];
+};
+
+static struct tlpm_node *tlpm_add(struct tlpm_node *list,
+ const uint8_t *key,
+ size_t n_bits)
+{
+   struct tlpm_node *node;
+   size_t n;
+
+   /* add new entry with @key/@n_bits to @list and return new head */
+
+   n = (n_bits + 7) / 8;
+   node = malloc(sizeof(*node) + n);
+   assert(node);
+
+   node->next = list;
+   node->n_bits = n_bits;
+   memcpy(node->key, key, n);
+
+   return node;
+}
+
+static void tlpm_clear(struct tlpm_node *list)
+{
+   struct tlpm_node *node;
+
+   /* free all entries in @list */
+
+   while ((node = list)) {
+   list = list->next;
+   free(node);
+   }
+}
+
+static struct tlpm_node *tlpm_match(struct tlpm_node *list,
+   const uint8_t *key,
+   size_t n_bits)
+{
+   struct tlpm_node *best = NULL;
+   size_t i;
+
+   /*
+* Perform longest prefix-match on @key/@n_bits. That is, iterate all
+* entries and match each prefix against @key. Remember the "best"
+* entry we find (i.e., the longest prefix that matches) and return it
+* to the caller when done.
+*/
+
+   for ( ; list; list = list->next) {
+   for (i = 0; i < n_bits && i < list->n_bits; ++i) {
+   if ((key[i / 8] & (1 << (7 - i % 8))) !=
+   (list->key[i / 8] & (1 << (7 - i % 8))))
+   break;
+   }
+
+   if (i >= list->n_bits) {
+   if (!best || i > best->n_bits)
+   best = list;
+   }
+   }
+
+   return best;
+}
+
+static void test_lpm_basic(void)
+{
+   struct tlpm_node *list = NULL, *t1, *t2;
+
+   /* very basic, static tests to verify tlpm works as expected */
+
+   assert(!tlpm_match(list, (uint8_t[]){ 0xff }, 8));
+
+   t1 = list = tlpm_add(list, (uint8_t[]){ 0xff }, 8);
+   assert(t1 == tlpm_match(list, (uint8_t[]){ 0xff }, 8));
+   assert(t1 == 

[PATCH RFC 0/2] bpf: add longest prefix match map

2016-12-14 Thread Daniel Mack
This patch set adds a longest prefix match algorithm that can be used to
match IP addresses to a stored set of ranges. It is exposed as a bpf
map type.
   
Internally, data is stored in an unbalanced tree of nodes that has a
maximum height of n, where n is the prefixlen the trie was created
with.
 
Note that this has nothing to do with fib or fib6 and is in no way meant
to replace or share code with it. It's rather a much simpler
implementation that is specifically written with bpf maps in mind.
 
Patch 1/2 adds the implementation, and 2/2 an extensive test suite.
 
Feedback is much appreciated.
 
 
Thanks,
Daniel

Daniel Mack (1):
  bpf: add a longest prefix match trie map implementation

David Herrmann (1):
  bpf: Add tests for the lpm trie map

 include/uapi/linux/bpf.h   |   7 +
 kernel/bpf/Makefile|   2 +-
 kernel/bpf/lpm_trie.c  | 491 +
 tools/testing/selftests/bpf/.gitignore |   1 +
 tools/testing/selftests/bpf/Makefile   |   4 +-
 tools/testing/selftests/bpf/test_lpm_map.c | 348 
 6 files changed, 850 insertions(+), 3 deletions(-)
 create mode 100644 kernel/bpf/lpm_trie.c
 create mode 100644 tools/testing/selftests/bpf/test_lpm_map.c

-- 
2.9.3



[PATCH RFC 1/2] bpf: add a longest prefix match trie map implementation

2016-12-14 Thread Daniel Mack
This trie implements a longest prefix match algorithm that can be used
to match IP addresses to a stored set of ranges.

Internally, data is stored in an unbalanced trie of nodes that has a
maximum height of n, where n is the prefixlen the trie was created
with.

Tries may be created with prefix lengths that are multiples of 8, in
the range from 8 to 2048. The key used for lookup and update operations
is a struct bpf_lpm_trie_key, and the value is a uint64_t.

The code carries more information about the internal implementation.

Signed-off-by: Daniel Mack <dan...@zonque.org>
Reviewed-by: David Herrmann <dh.herrm...@gmail.com>
---
 include/uapi/linux/bpf.h |   7 +
 kernel/bpf/Makefile  |   2 +-
 kernel/bpf/lpm_trie.c| 491 +++
 3 files changed, 499 insertions(+), 1 deletion(-)
 create mode 100644 kernel/bpf/lpm_trie.c

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 0eb0e87..d564277 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -63,6 +63,12 @@ struct bpf_insn {
__s32   imm;/* signed immediate constant */
 };
 
+/* Key of a BPF_MAP_TYPE_LPM_TRIE entry */
+struct bpf_lpm_trie_key {
+   __u32   prefixlen;  /* up to 32 for AF_INET, 128 for AF_INET6 */
+   __u8data[0];/* Arbitrary size */
+};
+
 /* BPF syscall commands, see bpf(2) man-page for details. */
 enum bpf_cmd {
BPF_MAP_CREATE,
@@ -89,6 +95,7 @@ enum bpf_map_type {
BPF_MAP_TYPE_CGROUP_ARRAY,
BPF_MAP_TYPE_LRU_HASH,
BPF_MAP_TYPE_LRU_PERCPU_HASH,
+   BPF_MAP_TYPE_LPM_TRIE,
 };
 
 enum bpf_prog_type {
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index 1276474..e1ce4f4 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -1,7 +1,7 @@
 obj-y := core.o
 
 obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o
-obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o 
bpf_lru_list.o
+obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o 
bpf_lru_list.o lpm_trie.o
 ifeq ($(CONFIG_PERF_EVENTS),y)
 obj-$(CONFIG_BPF_SYSCALL) += stackmap.o
 endif
diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
new file mode 100644
index 000..cae759d
--- /dev/null
+++ b/kernel/bpf/lpm_trie.c
@@ -0,0 +1,491 @@
+/*
+ * Longest prefix match list implementation
+ *
+ * Copyright (c) 2016 Daniel Mack
+ * Copyright (c) 2016 David Herrmann
+ *
+ * This file is subject to the terms and conditions of version 2 of the GNU
+ * General Public License.  See the file COPYING in the main directory of the
+ * Linux distribution for more details.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+/* Intermediate node */
+#define LPM_TREE_NODE_FLAG_IM BIT(0)
+
+struct lpm_trie_node;
+
+struct lpm_trie_node {
+   struct rcu_head rcu;
+   struct lpm_trie_node*child[2];
+   u32 prefixlen;
+   u32 flags;
+   u64 value;
+   u8  data[0];
+};
+
+struct lpm_trie {
+   struct bpf_map  map;
+   struct lpm_trie_node*root;
+   size_t  n_entries;
+   size_t  max_prefixlen;
+   size_t  data_size;
+   spinlock_t  lock;
+};
+
+/*
+ * This trie implements a longest prefix match algorithm that can be used to
+ * match IP addresses to a stored set of ranges.
+ *
+ * Data stored in @data of struct bpf_lpm_key and struct lpm_trie_node is
+ * interpreted as big endian, so data[0] stores the most significant byte.
+ *
+ * Match ranges are internally stored in instances of struct lpm_trie_node
+ * which each contain their prefix length as well as two pointers that may
+ * lead to more nodes containing more specific matches. Each node also stores
+ * a value that is defined by and returned to userspace via the update_elem
+ * and lookup functions.
+ *
+ * For instance, let's start with a trie that was created with a prefix length
+ * of 32, so it can be used for IPv4 addresses, and one single element that
+ * matches 192.168.0.0/16. The data array would hence contain
+ * [0xc0, 0xa8, 0x00, 0x00] in big-endian notation. This documentation will
+ * stick to IP-address notation for readability though.
+ *
+ * As the trie is empty initially, the new node (1) will be placed as root
+ * node, denoted as (R) in the example below. As there are no other nodes, both
+ * child pointers are %NULL.
+ *
+ *                 +----------------+
+ *                 |       (1)  (R) |
+ *                 | 192.168.0.0/16 |
+ *                 |    value: 1    |
+ *                 |   [0]    [1]   |
+ *                 +----------------+
+ *
+ * Next, let's add a new node (2) matching 192.168.0.0/24. As there is already
+ * a node with the same data and a smaller prefix (ie, a less specific one),
+ * node (2) will become a child of (1). The child index depends on 

Re: [PATCH net-next] cgroup, bpf: remove unnecessary #include

2016-11-29 Thread Daniel Mack
On 11/29/2016 11:48 AM, Daniel Borkmann wrote:
> On 11/26/2016 08:23 AM, Alexei Starovoitov wrote:
>> this #include is unnecessary and brings whole set of
>> other headers into cgroup-defs.h. Remove it.
>>
>> Fixes: 3007098494be ("cgroup: add support for eBPF programs")
>> Signed-off-by: Alexei Starovoitov <a...@kernel.org>
> 
> This fixes many build errors in samples/bpf/ due to wrong helper
> redefinitions (originating from kernel includes conflicting with
> samples' helper declarations).
> 
> I don't see it pushed out to net-next yet, so:
> 
> Acked-by: Daniel Borkmann <dan...@iogearbox.net>
> 

FWIW:

Acked-by: Daniel Mack <dan...@zonque.org>



Re: [PATCH] bpf: cgroup: fix documentation of __cgroup_bpf_update()

2016-11-28 Thread Daniel Mack
On 11/28/2016 02:03 PM, Daniel Borkmann wrote:
> On 11/28/2016 12:04 PM, Daniel Mack wrote:
>> There's a 'not' missing in one paragraph. Add it.
>>
>> Signed-off-by: Daniel Mack <dan...@zonque.org>
>> Reported-by: Rami Rosen <roszenr...@gmail.com>
>> Fixes: 3007098494be ("cgroup: add support for eBPF programs")
> 
> Small nit in subject: s/[PATCH]/[PATCH net-next]/
> 
>>   kernel/bpf/cgroup.c | 6 +++---
>>   1 file changed, 3 insertions(+), 3 deletions(-)
>>
>> diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
>> index a0ab43f..b708e6e 100644
>> --- a/kernel/bpf/cgroup.c
>> +++ b/kernel/bpf/cgroup.c
>> @@ -70,9 +70,9 @@ void cgroup_bpf_inherit(struct cgroup *cgrp, struct cgroup 
>> *parent)
>>* releases the one that is currently attached, if any. @prog is then made
>>* the effective program of type @type in that cgroup.
>>*
>> - * If @prog is %NULL, the currently attached program of type @type is 
>> released,
>> - * and the effective program of the parent cgroup (if any) is inherited to
>> - * @cgrp.
>> + * If @prog is not %NULL, the currently attached program of type @type is
>> + * released, and the effective program of the parent cgroup (if any) is
>> + * inherited to @cgrp.
> 
> Both paragraphs for __cgroup_bpf_update() currently say:
> 
> [...]
>   * If @prog is %NULL, this function attaches a new program to the cgroup and
>   * releases the one that is currently attached, if any. @prog is then made
>   * the effective program of type @type in that cgroup.
>   *
>   * If @prog is %NULL, the currently attached program of type @type is 
> released,
>   * and the effective program of the parent cgroup (if any) is inherited to
>   * @cgrp.
> [...]
> 
> It looks to me that you are 'fixing' the wrong location. First paragraph is
> actually missing a "not", which would then also align with what the code does.
> 

Argh, sorry. Will resend.


[PATCH net-next v2] bpf: cgroup: fix documentation of __cgroup_bpf_update()

2016-11-28 Thread Daniel Mack
There's a 'not' missing in one paragraph. Add it.

Signed-off-by: Daniel Mack <dan...@zonque.org>
Reported-by: Rami Rosen <roszenr...@gmail.com>
Fixes: 3007098494be ("cgroup: add support for eBPF programs")
---
 kernel/bpf/cgroup.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
index a0ab43f..8c784f8 100644
--- a/kernel/bpf/cgroup.c
+++ b/kernel/bpf/cgroup.c
@@ -66,8 +66,8 @@ void cgroup_bpf_inherit(struct cgroup *cgrp, struct cgroup 
*parent)
  * Each cgroup has a set of two pointers for bpf programs; one for eBPF
  * programs it owns, and which is effective for execution.
  *
- * If @prog is %NULL, this function attaches a new program to the cgroup and
- * releases the one that is currently attached, if any. @prog is then made
+ * If @prog is not %NULL, this function attaches a new program to the cgroup
+ * and releases the one that is currently attached, if any. @prog is then made
  * the effective program of type @type in that cgroup.
  *
  * If @prog is %NULL, the currently attached program of type @type is released,
-- 
2.7.4



[PATCH] bpf: cgroup: fix documentation of __cgroup_bpf_update()

2016-11-28 Thread Daniel Mack
There's a 'not' missing in one paragraph. Add it.

Signed-off-by: Daniel Mack <dan...@zonque.org>
Reported-by: Rami Rosen <roszenr...@gmail.com>
Fixes: 3007098494be ("cgroup: add support for eBPF programs")
---
 kernel/bpf/cgroup.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
index a0ab43f..b708e6e 100644
--- a/kernel/bpf/cgroup.c
+++ b/kernel/bpf/cgroup.c
@@ -70,9 +70,9 @@ void cgroup_bpf_inherit(struct cgroup *cgrp, struct cgroup 
*parent)
  * releases the one that is currently attached, if any. @prog is then made
  * the effective program of type @type in that cgroup.
  *
- * If @prog is %NULL, the currently attached program of type @type is released,
- * and the effective program of the parent cgroup (if any) is inherited to
- * @cgrp.
+ * If @prog is not %NULL, the currently attached program of type @type is
+ * released, and the effective program of the parent cgroup (if any) is
+ * inherited to @cgrp.
  *
  * Then, the descendants of @cgrp are walked and the effective program for
  * each of them is set to the effective program of @cgrp unless the
-- 
2.7.4



Re: [PATCH v9 2/6] cgroup: add support for eBPF programs

2016-11-24 Thread Daniel Mack
Hi Rami,

On 11/23/2016 11:46 PM, Rami Rosen wrote:
> A minor comment:
> 
>> +/**
>> + * __cgroup_bpf_update() - Update the pinned program of a cgroup, and
>> + * propagate the change to descendants
>> + * @cgrp: The cgroup which descendants to traverse
>> + * @parent: The parent of @cgrp, or %NULL if @cgrp is the root
>> + * @prog: A new program to pin
>> + * @type: Type of pinning operation (ingress/egress)
>> + *
>> + * Each cgroup has a set of two pointers for bpf programs; one for eBPF
>> + * programs it owns, and which is effective for execution.
>> + *
> You have in the following section twice identical checks, for If @prog
> is %NULL".
> Shouldn't it be here (in the first case only) "If @prog is not %NULL"
> instead "If @prog is %NULL"?

Yes, you're right, thanks for spotting.

If possible, I would like to not send a v10 just for this one missing
word in the comments though, but rather fix that up in a separate patch
afterwards.


Thanks,
Daniel


> 
>> + * If @prog is %NULL, this function attaches a new program to the cgroup and
>> + * releases the one that is currently attached, if any. @prog is then made
>> + * the effective program of type @type in that cgroup.
>> + *
>> + * If @prog is %NULL, the currently attached program of type @type is 
>> released,
>> + * and the effective program of the parent cgroup (if any) is inherited to
>> + * @cgrp.
>> + *
> 
> 
> Regard,
> Rami Rosen
> 



[PATCH v9 2/6] cgroup: add support for eBPF programs

2016-11-23 Thread Daniel Mack
This patch adds two sets of eBPF program pointers to struct cgroup.
One for such that are directly pinned to a cgroup, and one for such
that are effective for it.

To illustrate the logic behind that, assume the following example
cgroup hierarchy.

  A - B - C
       \ D - E

If only B has a program attached, it will be effective for B, C, D
and E. If D then attaches a program itself, that will be effective for
both D and E, and the program in B will only affect B and C. Only one
program of a given type is effective for a cgroup.

Attaching and detaching programs will be done through the bpf(2)
syscall. For now, ingress and egress inet socket filtering are the
only supported use-cases.

Signed-off-by: Daniel Mack <dan...@zonque.org>
Acked-by: Alexei Starovoitov <a...@kernel.org>
---
 include/linux/bpf-cgroup.h  |  79 +
 include/linux/cgroup-defs.h |   4 ++
 init/Kconfig|  12 
 kernel/bpf/Makefile |   1 +
 kernel/bpf/cgroup.c | 167 
 kernel/cgroup.c |  18 +
 6 files changed, 281 insertions(+)
 create mode 100644 include/linux/bpf-cgroup.h
 create mode 100644 kernel/bpf/cgroup.c

diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
new file mode 100644
index 000..ec80d0c
--- /dev/null
+++ b/include/linux/bpf-cgroup.h
@@ -0,0 +1,79 @@
+#ifndef _BPF_CGROUP_H
+#define _BPF_CGROUP_H
+
+#include 
+#include 
+#include 
+
+struct sock;
+struct cgroup;
+struct sk_buff;
+
+#ifdef CONFIG_CGROUP_BPF
+
+extern struct static_key_false cgroup_bpf_enabled_key;
+#define cgroup_bpf_enabled static_branch_unlikely(&cgroup_bpf_enabled_key)
+
+struct cgroup_bpf {
+   /*
+* Store two sets of bpf_prog pointers, one for programs that are
+* pinned directly to this cgroup, and one for those that are effective
+* when this cgroup is accessed.
+*/
+   struct bpf_prog *prog[MAX_BPF_ATTACH_TYPE];
+   struct bpf_prog *effective[MAX_BPF_ATTACH_TYPE];
+};
+
+void cgroup_bpf_put(struct cgroup *cgrp);
+void cgroup_bpf_inherit(struct cgroup *cgrp, struct cgroup *parent);
+
+void __cgroup_bpf_update(struct cgroup *cgrp,
+struct cgroup *parent,
+struct bpf_prog *prog,
+enum bpf_attach_type type);
+
+/* Wrapper for __cgroup_bpf_update() protected by cgroup_mutex */
+void cgroup_bpf_update(struct cgroup *cgrp,
+  struct bpf_prog *prog,
+  enum bpf_attach_type type);
+
+int __cgroup_bpf_run_filter(struct sock *sk,
+   struct sk_buff *skb,
+   enum bpf_attach_type type);
+
+/* Wrappers for __cgroup_bpf_run_filter() guarded by cgroup_bpf_enabled. */
+#define BPF_CGROUP_RUN_PROG_INET_INGRESS(sk,skb)   \
+({ \
+   int __ret = 0;  \
+   if (cgroup_bpf_enabled) \
+   __ret = __cgroup_bpf_run_filter(sk, skb,\
+   BPF_CGROUP_INET_INGRESS); \
+   \
+   __ret;  \
+})
+
+#define BPF_CGROUP_RUN_PROG_INET_EGRESS(sk,skb)
\
+({ \
+   int __ret = 0;  \
+   if (cgroup_bpf_enabled && sk && sk == skb->sk) {\
+   typeof(sk) __sk = sk_to_full_sk(sk);\
+   if (sk_fullsock(__sk))  \
+   __ret = __cgroup_bpf_run_filter(__sk, skb,  \
+   BPF_CGROUP_INET_EGRESS); \
+   }   \
+   __ret;  \
+})
+
+#else
+
+struct cgroup_bpf {};
+static inline void cgroup_bpf_put(struct cgroup *cgrp) {}
+static inline void cgroup_bpf_inherit(struct cgroup *cgrp,
+ struct cgroup *parent) {}
+
+#define BPF_CGROUP_RUN_PROG_INET_INGRESS(sk,skb) ({ 0; })
+#define BPF_CGROUP_RUN_PROG_INET_EGRESS(sk,skb) ({ 0; })
+
+#endif /* CONFIG_CGROUP_BPF */
+
+#endif /* _BPF_CGROUP_H */
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 5b17de6..861b467 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -16,6 +16,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #ifdef CONFIG_CGROUPS
 
@@ -300,6 +301,9 @@ struct cgroup {
/* used to schedule release agent */
struct work_struct release_agent_work;
 
+   /* used to store eBPF programs *

[PATCH v9 4/6] net: filter: run cgroup eBPF ingress programs

2016-11-23 Thread Daniel Mack
If the cgroup associated with the receiving socket has an eBPF
programs installed, run them from sk_filter_trim_cap().

eBPF programs used in this context are expected to either return 1 to
let the packet pass, or != 1 to drop them. The programs have access to
the skb through bpf_skb_load_bytes(), and the payload starts at the
network headers (L3).

Note that cgroup_bpf_run_filter() is stubbed out as static inline nop
for !CONFIG_CGROUP_BPF, and is otherwise guarded by a static key if
the feature is unused.
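
For illustration only (not part of this patch), a minimal program of this type
could look like the sketch below, assuming the SEC()/bpf_helpers.h conventions
from samples/bpf:

#include <uapi/linux/bpf.h>
#include <uapi/linux/in.h>
#include "bpf_helpers.h"

/* Sketch of a BPF_PROG_TYPE_CGROUP_SKB filter: pass everything except
 * IPv4 ICMP. Return 1 to accept the packet, anything else to drop it.
 * bpf_skb_load_bytes() sees the packet starting at the network header,
 * so offset 9 is the protocol field of the IPv4 header.
 */
SEC("cgroup/skb")
int drop_icmp(struct __sk_buff *skb)
{
	__u8 proto;

	if (bpf_skb_load_bytes(skb, 9, &proto, sizeof(proto)) < 0)
		return 1;	/* cannot parse, let it pass */

	return proto == IPPROTO_ICMP ? 0 : 1;
}

char _license[] SEC("license") = "GPL";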

Signed-off-by: Daniel Mack <dan...@zonque.org>
Acked-by: Alexei Starovoitov <a...@kernel.org>
---
 net/core/filter.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/net/core/filter.c b/net/core/filter.c
index e3813d6..474b486 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -78,6 +78,10 @@ int sk_filter_trim_cap(struct sock *sk, struct sk_buff *skb, 
unsigned int cap)
if (skb_pfmemalloc(skb) && !sock_flag(sk, SOCK_MEMALLOC))
return -ENOMEM;
 
+   err = BPF_CGROUP_RUN_PROG_INET_INGRESS(sk, skb);
+   if (err)
+   return err;
+
err = security_sock_rcv_skb(sk, skb);
if (err)
return err;
-- 
2.7.4



[PATCH v9 1/6] bpf: add new prog type for cgroup socket filtering

2016-11-23 Thread Daniel Mack
This program type is similar to BPF_PROG_TYPE_SOCKET_FILTER, except that
it does not allow BPF_LD_[ABS|IND] instructions and hooks up the
bpf_skb_load_bytes() helper.

Programs of this type will be attached to cgroups for network filtering
and accounting.

Signed-off-by: Daniel Mack <dan...@zonque.org>
Acked-by: Alexei Starovoitov <a...@kernel.org>
---
 include/uapi/linux/bpf.h |  9 +
 net/core/filter.c| 23 +++
 2 files changed, 32 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index f09c70b..1f3e6f1 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -96,8 +96,17 @@ enum bpf_prog_type {
BPF_PROG_TYPE_TRACEPOINT,
BPF_PROG_TYPE_XDP,
BPF_PROG_TYPE_PERF_EVENT,
+   BPF_PROG_TYPE_CGROUP_SKB,
 };
 
+enum bpf_attach_type {
+   BPF_CGROUP_INET_INGRESS,
+   BPF_CGROUP_INET_EGRESS,
+   __MAX_BPF_ATTACH_TYPE
+};
+
+#define MAX_BPF_ATTACH_TYPE __MAX_BPF_ATTACH_TYPE
+
 #define BPF_PSEUDO_MAP_FD  1
 
 /* flags for BPF_MAP_UPDATE_ELEM command */
diff --git a/net/core/filter.c b/net/core/filter.c
index 00351cd..e3813d6 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2576,6 +2576,17 @@ xdp_func_proto(enum bpf_func_id func_id)
}
 }
 
+static const struct bpf_func_proto *
+cg_skb_func_proto(enum bpf_func_id func_id)
+{
+   switch (func_id) {
+   case BPF_FUNC_skb_load_bytes:
+   return &bpf_skb_load_bytes_proto;
+   default:
+   return sk_filter_func_proto(func_id);
+   }
+}
+
 static bool __is_valid_access(int off, int size, enum bpf_access_type type)
 {
if (off < 0 || off >= sizeof(struct __sk_buff))
@@ -2938,6 +2949,12 @@ static const struct bpf_verifier_ops xdp_ops = {
.convert_ctx_access = xdp_convert_ctx_access,
 };
 
+static const struct bpf_verifier_ops cg_skb_ops = {
+   .get_func_proto = cg_skb_func_proto,
+   .is_valid_access= sk_filter_is_valid_access,
+   .convert_ctx_access = sk_filter_convert_ctx_access,
+};
+
 static struct bpf_prog_type_list sk_filter_type __read_mostly = {
.ops= &sk_filter_ops,
.type   = BPF_PROG_TYPE_SOCKET_FILTER,
@@ -2958,12 +2975,18 @@ static struct bpf_prog_type_list xdp_type __read_mostly 
= {
.type   = BPF_PROG_TYPE_XDP,
 };
 
+static struct bpf_prog_type_list cg_skb_type __read_mostly = {
+   .ops    = &cg_skb_ops,
+   .type   = BPF_PROG_TYPE_CGROUP_SKB,
+};
+
 static int __init register_sk_filter_ops(void)
 {
bpf_register_prog_type(&sk_filter_type);
bpf_register_prog_type(&sched_cls_type);
bpf_register_prog_type(&sched_act_type);
bpf_register_prog_type(&xdp_type);
+   bpf_register_prog_type(&cg_skb_type);
 
return 0;
 }
-- 
2.7.4



[PATCH v9 3/6] bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH commands

2016-11-23 Thread Daniel Mack
Extend the bpf(2) syscall by two new commands, BPF_PROG_ATTACH and
BPF_PROG_DETACH which allow attaching and detaching eBPF programs
to a target.

On the API level, the target could be anything that has an fd in
userspace, hence the field in union bpf_attr is called
'target_fd'.

When called with BPF_ATTACH_TYPE_CGROUP_INET_{E,IN}GRESS, the target is
expected to be a valid file descriptor of a cgroup v2 directory which
has the bpf controller enabled. These are the only use-cases
implemented by this patch at this point, but more can be added.

If a program of the given type already exists in the given cgroup,
the program is swapped atomically, so userspace does not have to drop
an existing program first before installing a new one, which would
otherwise leave a gap in which no program is attached.

For more information on the propagation logic to subcgroups, please
refer to the bpf cgroup controller implementation.

The API is guarded by CAP_NET_ADMIN.
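
From userspace, the intended flow is roughly the following (a sketch using
the wrapper functions added to samples/bpf in patch 6/6; the cgroup path is
just an example and error handling is abbreviated):

    int cg_fd = open("/sys/fs/cgroup/foo", O_DIRECTORY | O_RDONLY);

    /* Attaching replaces a previously installed program of the same
     * type atomically; no explicit detach is needed first. */
    if (bpf_prog_attach(prog_fd, cg_fd, BPF_CGROUP_INET_INGRESS))
            perror("bpf_prog_attach");

    /* ... later: remove the program from the cgroup again ... */
    if (bpf_prog_detach(cg_fd, BPF_CGROUP_INET_INGRESS))
            perror("bpf_prog_detach");

    close(cg_fd);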

Signed-off-by: Daniel Mack <dan...@zonque.org>
Acked-by: Alexei Starovoitov <a...@kernel.org>
---
 include/uapi/linux/bpf.h |  8 +
 kernel/bpf/syscall.c | 81 
 2 files changed, 89 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 1f3e6f1..f31b655 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -73,6 +73,8 @@ enum bpf_cmd {
BPF_PROG_LOAD,
BPF_OBJ_PIN,
BPF_OBJ_GET,
+   BPF_PROG_ATTACH,
+   BPF_PROG_DETACH,
 };
 
 enum bpf_map_type {
@@ -150,6 +152,12 @@ union bpf_attr {
__aligned_u64   pathname;
__u32   bpf_fd;
};
+
+   struct { /* anonymous struct used by BPF_PROG_ATTACH/DETACH commands */
+   __u32   target_fd;  /* container object to attach 
to */
+   __u32   attach_bpf_fd;  /* eBPF program to attach */
+   __u32   attach_type;
+   };
 } __attribute__((aligned(8)));
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 228f962..1814c01 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -822,6 +822,77 @@ static int bpf_obj_get(const union bpf_attr *attr)
return bpf_obj_get_user(u64_to_ptr(attr->pathname));
 }
 
+#ifdef CONFIG_CGROUP_BPF
+
+#define BPF_PROG_ATTACH_LAST_FIELD attach_type
+
+static int bpf_prog_attach(const union bpf_attr *attr)
+{
+   struct bpf_prog *prog;
+   struct cgroup *cgrp;
+
+   if (!capable(CAP_NET_ADMIN))
+   return -EPERM;
+
+   if (CHECK_ATTR(BPF_PROG_ATTACH))
+   return -EINVAL;
+
+   switch (attr->attach_type) {
+   case BPF_CGROUP_INET_INGRESS:
+   case BPF_CGROUP_INET_EGRESS:
+   prog = bpf_prog_get_type(attr->attach_bpf_fd,
+BPF_PROG_TYPE_CGROUP_SKB);
+   if (IS_ERR(prog))
+   return PTR_ERR(prog);
+
+   cgrp = cgroup_get_from_fd(attr->target_fd);
+   if (IS_ERR(cgrp)) {
+   bpf_prog_put(prog);
+   return PTR_ERR(cgrp);
+   }
+
+   cgroup_bpf_update(cgrp, prog, attr->attach_type);
+   cgroup_put(cgrp);
+   break;
+
+   default:
+   return -EINVAL;
+   }
+
+   return 0;
+}
+
+#define BPF_PROG_DETACH_LAST_FIELD attach_type
+
+static int bpf_prog_detach(const union bpf_attr *attr)
+{
+   struct cgroup *cgrp;
+
+   if (!capable(CAP_NET_ADMIN))
+   return -EPERM;
+
+   if (CHECK_ATTR(BPF_PROG_DETACH))
+   return -EINVAL;
+
+   switch (attr->attach_type) {
+   case BPF_CGROUP_INET_INGRESS:
+   case BPF_CGROUP_INET_EGRESS:
+   cgrp = cgroup_get_from_fd(attr->target_fd);
+   if (IS_ERR(cgrp))
+   return PTR_ERR(cgrp);
+
+   cgroup_bpf_update(cgrp, NULL, attr->attach_type);
+   cgroup_put(cgrp);
+   break;
+
+   default:
+   return -EINVAL;
+   }
+
+   return 0;
+}
+#endif /* CONFIG_CGROUP_BPF */
+
 SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, 
size)
 {
union bpf_attr attr = {};
@@ -888,6 +959,16 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, 
uattr, unsigned int, siz
case BPF_OBJ_GET:
err = bpf_obj_get(&attr);
break;
+
+#ifdef CONFIG_CGROUP_BPF
+   case BPF_PROG_ATTACH:
+   err = bpf_prog_attach(&attr);
+   break;
+   case BPF_PROG_DETACH:
+   err = bpf_prog_detach(&attr);
+   break;
+#endif
+
default:
err = -EINVAL;
break;
-- 
2.7.4



[PATCH v9 5/6] net: ipv4, ipv6: run cgroup eBPF egress programs

2016-11-23 Thread Daniel Mack
If the cgroup associated with the receiving socket has eBPF
programs installed, run them from ip_output(), ip6_output() and
ip_mc_output(). From mentioned functions we have two socket contexts
as per 7026b1ddb6b8 ("netfilter: Pass socket pointer down through
okfn()."). We explicitly need to use sk instead of skb->sk here,
since otherwise the same program would run multiple times on egress
when encap devices are involved, which is not desired in our case.

eBPF programs used in this context are expected to either return 1 to
let the packet pass, or != 1 to drop them. The programs have access to
the skb through bpf_skb_load_bytes(), and the payload starts at the
network headers (L3).

Note that cgroup_bpf_run_filter() is stubbed out as static inline nop
for !CONFIG_CGROUP_BPF, and is otherwise guarded by a static key if
the feature is unused.

Signed-off-by: Daniel Mack <dan...@zonque.org>
Acked-by: Alexei Starovoitov <a...@kernel.org>
---
 net/ipv4/ip_output.c  | 26 --
 net/ipv6/ip6_output.c |  9 +
 2 files changed, 33 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 03e7f73..ed0c276 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -74,6 +74,7 @@
 #include 
 #include 
 #include 
+#include <linux/bpf-cgroup.h>
 #include 
 #include 
 #include 
@@ -281,6 +282,13 @@ static int ip_finish_output_gso(struct net *net, struct 
sock *sk,
 static int ip_finish_output(struct net *net, struct sock *sk, struct sk_buff 
*skb)
 {
unsigned int mtu;
+   int ret;
+
+   ret = BPF_CGROUP_RUN_PROG_INET_EGRESS(sk, skb);
+   if (ret) {
+   kfree_skb(skb);
+   return ret;
+   }
 
 #if defined(CONFIG_NETFILTER) && defined(CONFIG_XFRM)
/* Policy lookup after SNAT yielded a new policy */
@@ -299,6 +307,20 @@ static int ip_finish_output(struct net *net, struct sock 
*sk, struct sk_buff *sk
return ip_finish_output2(net, sk, skb);
 }
 
+static int ip_mc_finish_output(struct net *net, struct sock *sk,
+  struct sk_buff *skb)
+{
+   int ret;
+
+   ret = BPF_CGROUP_RUN_PROG_INET_EGRESS(sk, skb);
+   if (ret) {
+   kfree_skb(skb);
+   return ret;
+   }
+
+   return dev_loopback_xmit(net, sk, skb);
+}
+
 int ip_mc_output(struct net *net, struct sock *sk, struct sk_buff *skb)
 {
struct rtable *rt = skb_rtable(skb);
@@ -336,7 +358,7 @@ int ip_mc_output(struct net *net, struct sock *sk, struct 
sk_buff *skb)
if (newskb)
NF_HOOK(NFPROTO_IPV4, NF_INET_POST_ROUTING,
net, sk, newskb, NULL, newskb->dev,
-   dev_loopback_xmit);
+   ip_mc_finish_output);
}
 
/* Multicasts with ttl 0 must not go beyond the host */
@@ -352,7 +374,7 @@ int ip_mc_output(struct net *net, struct sock *sk, struct 
sk_buff *skb)
if (newskb)
NF_HOOK(NFPROTO_IPV4, NF_INET_POST_ROUTING,
net, sk, newskb, NULL, newskb->dev,
-   dev_loopback_xmit);
+   ip_mc_finish_output);
}
 
return NF_HOOK_COND(NFPROTO_IPV4, NF_INET_POST_ROUTING,
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 6001e78..ddeb41e 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -39,6 +39,7 @@
 #include 
 #include 
 
+#include <linux/bpf-cgroup.h>
 #include 
 #include 
 
@@ -131,6 +132,14 @@ static int ip6_finish_output2(struct net *net, struct sock 
*sk, struct sk_buff *
 
 static int ip6_finish_output(struct net *net, struct sock *sk, struct sk_buff 
*skb)
 {
+   int ret;
+
+   ret = BPF_CGROUP_RUN_PROG_INET_EGRESS(sk, skb);
+   if (ret) {
+   kfree_skb(skb);
+   return ret;
+   }
+
if ((skb->len > ip6_skb_dst_mtu(skb) && !skb_is_gso(skb)) ||
dst_allfrag(skb_dst(skb)) ||
(IP6CB(skb)->frag_max_size && skb->len > IP6CB(skb)->frag_max_size))
-- 
2.7.4



[PATCH v9 0/6] Add eBPF hooks for cgroups

2016-11-23 Thread Daniel Mack


Daniel Mack (6):
  bpf: add new prog type for cgroup socket filtering
  cgroup: add support for eBPF programs
  bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH commands
  net: filter: run cgroup eBPF ingress programs
  net: ipv4, ipv6: run cgroup eBPF egress programs
  samples: bpf: add userspace example for attaching eBPF programs to
cgroups

 include/linux/bpf-cgroup.h  |  79 +++
 include/linux/cgroup-defs.h |   4 +
 include/uapi/linux/bpf.h|  17 
 init/Kconfig|  12 +++
 kernel/bpf/Makefile |   1 +
 kernel/bpf/cgroup.c | 167 
 kernel/bpf/syscall.c|  81 +++
 kernel/cgroup.c |  18 +
 net/core/filter.c   |  27 +++
 net/ipv4/ip_output.c|  26 ++-
 net/ipv6/ip6_output.c   |   9 +++
 samples/bpf/Makefile|   2 +
 samples/bpf/libbpf.c|  21 +
 samples/bpf/libbpf.h|   3 +
 samples/bpf/test_cgrp2_attach.c | 147 +++
 15 files changed, 612 insertions(+), 2 deletions(-)
 create mode 100644 include/linux/bpf-cgroup.h
 create mode 100644 kernel/bpf/cgroup.c
 create mode 100644 samples/bpf/test_cgrp2_attach.c

-- 
2.7.4



[PATCH v9 6/6] samples: bpf: add userspace example for attaching eBPF programs to cgroups

2016-11-23 Thread Daniel Mack
Add a simple userspace program to demonstrate the new API to attach eBPF
programs to cgroups. This is what it does:

 * Create arraymap in kernel with 4 byte keys and 8 byte values

 * Load eBPF program

   The eBPF program accesses the map passed in to store two pieces of
   information. The number of invocations of the program, which maps
   to the number of packets received, is stored to key 0. Key 1 is
   incremented on each iteration by the number of bytes stored in
   the skb.

 * Detach any eBPF program previously attached to the cgroup

 * Attach the new program to the cgroup using BPF_PROG_ATTACH

 * Once a second, read map[0] and map[1] to see how many bytes and
   packets were seen on any socket of tasks in the given cgroup.

The program takes a cgroup path as 1st argument, and either "ingress"
or "egress" as 2nd. Optionally, "drop" can be passed as 3rd argument,
which will make the generated eBPF program return 0 instead of 1, so
the kernel will drop the packet.

libbpf gained two new wrappers for the new syscall commands.
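
The per-second readout the sample does boils down to something like the
following (sketch only; it assumes the array map was created through the
samples/bpf bpf_create_map() wrapper and uses the MAP_KEY_* enum visible in
the diff further down):

    while (1) {
            __u32 key;
            __u64 pkts = 0, bytes = 0;

            key = MAP_KEY_PACKETS;
            assert(bpf_lookup_elem(map_fd, &key, &pkts) == 0);

            key = MAP_KEY_BYTES;
            assert(bpf_lookup_elem(map_fd, &key, &bytes) == 0);

            printf("cgroup received %llu packets, %llu bytes\n",
                   (unsigned long long) pkts, (unsigned long long) bytes);
            sleep(1);
    }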

Signed-off-by: Daniel Mack <dan...@zonque.org>
Acked-by: Alexei Starovoitov <a...@kernel.org>
---
 samples/bpf/Makefile|   2 +
 samples/bpf/libbpf.c|  21 ++
 samples/bpf/libbpf.h|   3 +
 samples/bpf/test_cgrp2_attach.c | 147 
 4 files changed, 173 insertions(+)
 create mode 100644 samples/bpf/test_cgrp2_attach.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 12b7304..e4cdc74 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -22,6 +22,7 @@ hostprogs-y += spintest
 hostprogs-y += map_perf_test
 hostprogs-y += test_overhead
 hostprogs-y += test_cgrp2_array_pin
+hostprogs-y += test_cgrp2_attach
 hostprogs-y += xdp1
 hostprogs-y += xdp2
 hostprogs-y += test_current_task_under_cgroup
@@ -49,6 +50,7 @@ spintest-objs := bpf_load.o libbpf.o spintest_user.o
 map_perf_test-objs := bpf_load.o libbpf.o map_perf_test_user.o
 test_overhead-objs := bpf_load.o libbpf.o test_overhead_user.o
 test_cgrp2_array_pin-objs := libbpf.o test_cgrp2_array_pin.o
+test_cgrp2_attach-objs := libbpf.o test_cgrp2_attach.o
 xdp1-objs := bpf_load.o libbpf.o xdp1_user.o
 # reuse xdp1 source intentionally
 xdp2-objs := bpf_load.o libbpf.o xdp1_user.o
diff --git a/samples/bpf/libbpf.c b/samples/bpf/libbpf.c
index 9969e35..9ce707b 100644
--- a/samples/bpf/libbpf.c
+++ b/samples/bpf/libbpf.c
@@ -104,6 +104,27 @@ int bpf_prog_load(enum bpf_prog_type prog_type,
return syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));
 }
 
+int bpf_prog_attach(int prog_fd, int target_fd, enum bpf_attach_type type)
+{
+   union bpf_attr attr = {
+   .target_fd = target_fd,
+   .attach_bpf_fd = prog_fd,
+   .attach_type = type,
+   };
+
+   return syscall(__NR_bpf, BPF_PROG_ATTACH, &attr, sizeof(attr));
+}
+
+int bpf_prog_detach(int target_fd, enum bpf_attach_type type)
+{
+   union bpf_attr attr = {
+   .target_fd = target_fd,
+   .attach_type = type,
+   };
+
+   return syscall(__NR_bpf, BPF_PROG_DETACH, &attr, sizeof(attr));
+}
+
 int bpf_obj_pin(int fd, const char *pathname)
 {
union bpf_attr attr = {
diff --git a/samples/bpf/libbpf.h b/samples/bpf/libbpf.h
index ac6edb6..d0a799a 100644
--- a/samples/bpf/libbpf.h
+++ b/samples/bpf/libbpf.h
@@ -15,6 +15,9 @@ int bpf_prog_load(enum bpf_prog_type prog_type,
  const struct bpf_insn *insns, int insn_len,
  const char *license, int kern_version);
 
+int bpf_prog_attach(int prog_fd, int attachable_fd, enum bpf_attach_type type);
+int bpf_prog_detach(int attachable_fd, enum bpf_attach_type type);
+
 int bpf_obj_pin(int fd, const char *pathname);
 int bpf_obj_get(const char *pathname);
 
diff --git a/samples/bpf/test_cgrp2_attach.c b/samples/bpf/test_cgrp2_attach.c
new file mode 100644
index 000..63ef208
--- /dev/null
+++ b/samples/bpf/test_cgrp2_attach.c
@@ -0,0 +1,147 @@
+/* eBPF example program:
+ *
+ * - Creates arraymap in kernel with 4 bytes keys and 8 byte values
+ *
+ * - Loads eBPF program
+ *
+ *   The eBPF program accesses the map passed in to store two pieces of
+ *   information. The number of invocations of the program, which maps
+ *   to the number of packets received, is stored to key 0. Key 1 is
+ *   incremented on each iteration by the number of bytes stored in
+ *   the skb.
+ *
+ * - Detaches any eBPF program previously attached to the cgroup
+ *
+ * - Attaches the new program to a cgroup using BPF_PROG_ATTACH
+ *
+ * - Every second, reads map[0] and map[1] to see how many bytes and
+ *   packets were seen on any socket of tasks in the given cgroup.
+ */
+
+#define _GNU_SOURCE
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <stddef.h>
+#include <string.h>
+#include <unistd.h>
+#include <assert.h>
+#include <errno.h>
+#include <fcntl.h>
+
+#include <linux/bpf.h>
+
+#include "libbpf.h"
+
+enum {
+   MAP_KEY_PACKETS,
+   MAP_KEY_BYTES,
+};
+
+static int prog_load

[PATCH v8 0/6] Add eBPF hooks for cgroups

2016-11-17 Thread Daniel Mack
This is v8 of the patch set to allow eBPF programs for network
filtering and accounting to be attached to cgroups, so that they apply
to all sockets of all tasks placed in that cgroup. The logic can
also be extended to other cgroup-based eBPF use-cases.

Again, only minor details are updated in this version.


Thanks,
Daniel


Changes from v7:

* Replace the static inline function cgroup_bpf_run_filter() with
  two specific macros for ingress and egress.  This addresses David
  Miller's concern regarding skb->sk vs. sk in the egress path.
  Thanks a lot to Daniel Borkmann and Alexei Starovoitov for the
  suggestions.


Changes from v6:

* Rebased to 4.9-rc2

* Add EXPORT_SYMBOL(__cgroup_bpf_run_filter). The kbuild test robot
  now succeeds in building this version of the patch set.

* Switch from bpf_prog_run_save_cb() to bpf_prog_run_clear_cb() to not
  tamper with the contents of skb->cb[]. Pointed out by Daniel
  Borkmann.

* Use sk_to_full_sk() in the egress path, as suggested by Daniel
  Borkmann.

* Renamed BPF_PROG_TYPE_CGROUP_SOCKET to BPF_PROG_TYPE_CGROUP_SKB, as
  requested by David Ahern.

* Added Alexei's Acked-by tags.


Changes from v5:

* The eBPF programs now operate on L3 rather than on L2 of the packets,
  and the egress hooks were moved from __dev_queue_xmit() to
  ip*_output().

* For BPF_PROG_TYPE_CGROUP_SOCKET, disallow direct access to the skb
  through BPF_LD_[ABS|IND] instructions, but hook up the
  bpf_skb_load_bytes() access helper instead. Thanks to Daniel Borkmann
  for the help.


Changes from v4:

* Plug an skb leak when dropping packets due to eBPF verdicts in
  __dev_queue_xmit(). Spotted by Daniel Borkmann.

* Check for sk_fullsock(sk) in __cgroup_bpf_run_filter() so we don't
  operate on timewait or request sockets. Suggested by Daniel Borkmann.

* Add missing @parent parameter in kerneldoc of __cgroup_bpf_update().
  Spotted by Rami Rosen.

* Include linux/jump_label.h from bpf-cgroup.h to fix a kbuild error.


Changes from v3:

* Dropped the _FILTER suffix from BPF_PROG_TYPE_CGROUP_SOCKET_FILTER,
  renamed BPF_ATTACH_TYPE_CGROUP_INET_{E,IN}GRESS to
  BPF_CGROUP_INET_{IN,E}GRESS and alias BPF_MAX_ATTACH_TYPE to
  __BPF_MAX_ATTACH_TYPE, as suggested by Daniel Borkmann.

* Dropped the attach_flags member from the anonymous struct for BPF
  attach operations in union bpf_attr. They can be added later on via
  CHECK_ATTR. Requested by Daniel Borkmann and Alexei.

* Release old_prog at the end of __cgroup_bpf_update rather that at
  the beginning to fix a race gap between program updates and their
  users. Spotted by Daniel Borkmann.

* Plugged an skb leak when dropping packets on the egress path.
  Spotted by Daniel Borkmann.

* Add cgro...@vger.kernel.org to the loop, as suggested by Rami Rosen.

* Some minor coding style adaptations not worth mentioning in particular.


Changes from v2:

* Fixed the RCU locking details Tejun pointed out.

* Assert bpf_attr.flags == 0 in BPF_PROG_DETACH syscall handler.


Changes from v1:

* Moved all bpf specific cgroup code into its own file, and stub
  out related functions for !CONFIG_CGROUP_BPF as static inline nops.
  This way, the call sites are not cluttered with #ifdef guards while
  the feature remains compile-time configurable.

* Implemented the new scheme proposed by Tejun. Per cgroup, store one
  set of pointers that are pinned to the cgroup, and one for the
  programs that are effective. When a program is attached or detached,
  the change is propagated to all the cgroup's descendants. If a
  subcgroup has its own pinned program, skip the whole subbranch in
  order to allow delegation models.

* The hookup for egress packets is now done from __dev_queue_xmit().

* A static key is now used in both the ingress and egress fast paths
  to keep performance penalties close to zero if the feature is
  not in use.

* Overall cleanup to make the accessors use the program arrays.
  This should make it much easier to add new program types, which
  will then automatically follow the pinned vs. effective logic.

* Fixed locking issues, as pointed out by Eric Dumazet and Alexei
  Starovoitov. Changes to the program array are now done with
  xchg() and are protected by cgroup_mutex.

* eBPF programs are now expected to return 1 to let the packet pass,
  not >= 0. Pointed out by Alexei.

* Operation is now limited to INET sockets, so local AF_UNIX sockets
  are not affected. The enum members are renamed accordingly. In case
  other socket families should be supported, this can be extended in
  the future.

* The sample program learned to support both ingress and egress, and
  can now optionally make the eBPF program drop packets by making it
  return 0.


Daniel Mack (6):
  bpf: add new prog type for cgroup socket filtering
  cgroup: add support for eBPF programs
  bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH commands
  net: filter: run cgroup eBPF ingress programs
  net: ipv4, ipv6: run cgroup eBPF egress programs
  samples:

[PATCH v8 2/6] cgroup: add support for eBPF programs

2016-11-17 Thread Daniel Mack
This patch adds two sets of eBPF program pointers to struct cgroup.
One for such that are directly pinned to a cgroup, and one for such
that are effective for it.

To illustrate the logic behind that, assume the following example
cgroup hierarchy.

  A - B - C
        \ D - E

If only B has a program attached, it will be effective for B, C, D
and E. If D then attaches a program itself, that will be effective for
both D and E, and the program in B will only affect B and C. Only one
program of a given type is effective for a cgroup.

Attaching and detaching programs will be done through the bpf(2)
syscall. For now, ingress and egress inet socket filtering are the
only supported use-cases.
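
The pinned vs. effective split described above boils down to roughly the
following propagation logic (a pseudo-C sketch of the idea only, not the
actual kernel/bpf/cgroup.c code; for_each_child() is a hypothetical
iterator):

    /* Recompute the effective program for @cgrp's subtree after the
     * parent's effective program (of @type) changed. */
    static void propagate_effective(struct cgroup *cgrp,
                                    struct bpf_prog *parent_effective,
                                    enum bpf_attach_type type)
    {
            struct cgroup *child;

            if (cgrp->bpf.prog[type])
                    return;         /* own pin: skip this whole subbranch */

            cgrp->bpf.effective[type] = parent_effective;

            for_each_child(child, cgrp)     /* hypothetical iterator */
                    propagate_effective(child, parent_effective, type);
    }

An attach on a given cgroup would then set that cgroup's own prog[] and
effective[] pointers and call this on each of its children.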

Signed-off-by: Daniel Mack <dan...@zonque.org>
Acked-by: Alexei Starovoitov <a...@kernel.org>
---
 include/linux/bpf-cgroup.h  |  79 +
 include/linux/cgroup-defs.h |   4 ++
 init/Kconfig|  12 
 kernel/bpf/Makefile |   1 +
 kernel/bpf/cgroup.c | 167 
 kernel/cgroup.c |  18 +
 6 files changed, 281 insertions(+)
 create mode 100644 include/linux/bpf-cgroup.h
 create mode 100644 kernel/bpf/cgroup.c

diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
new file mode 100644
index 000..ec80d0c
--- /dev/null
+++ b/include/linux/bpf-cgroup.h
@@ -0,0 +1,79 @@
+#ifndef _BPF_CGROUP_H
+#define _BPF_CGROUP_H
+
+#include 
+#include 
+#include 
+
+struct sock;
+struct cgroup;
+struct sk_buff;
+
+#ifdef CONFIG_CGROUP_BPF
+
+extern struct static_key_false cgroup_bpf_enabled_key;
+#define cgroup_bpf_enabled static_branch_unlikely(&cgroup_bpf_enabled_key)
+
+struct cgroup_bpf {
+   /*
+* Store two sets of bpf_prog pointers, one for programs that are
+* pinned directly to this cgroup, and one for those that are effective
+* when this cgroup is accessed.
+*/
+   struct bpf_prog *prog[MAX_BPF_ATTACH_TYPE];
+   struct bpf_prog *effective[MAX_BPF_ATTACH_TYPE];
+};
+
+void cgroup_bpf_put(struct cgroup *cgrp);
+void cgroup_bpf_inherit(struct cgroup *cgrp, struct cgroup *parent);
+
+void __cgroup_bpf_update(struct cgroup *cgrp,
+struct cgroup *parent,
+struct bpf_prog *prog,
+enum bpf_attach_type type);
+
+/* Wrapper for __cgroup_bpf_update() protected by cgroup_mutex */
+void cgroup_bpf_update(struct cgroup *cgrp,
+  struct bpf_prog *prog,
+  enum bpf_attach_type type);
+
+int __cgroup_bpf_run_filter(struct sock *sk,
+   struct sk_buff *skb,
+   enum bpf_attach_type type);
+
+/* Wrappers for __cgroup_bpf_run_filter() guarded by cgroup_bpf_enabled. */
+#define BPF_CGROUP_RUN_PROG_INET_INGRESS(sk,skb)   \
+({ \
+   int __ret = 0;  \
+   if (cgroup_bpf_enabled) \
+   __ret = __cgroup_bpf_run_filter(sk, skb,\
+   BPF_CGROUP_INET_INGRESS); \
+   \
+   __ret;  \
+})
+
+#define BPF_CGROUP_RUN_PROG_INET_EGRESS(sk,skb)
\
+({ \
+   int __ret = 0;  \
+   if (cgroup_bpf_enabled && sk && sk == skb->sk) {\
+   typeof(sk) __sk = sk_to_full_sk(sk);\
+   if (sk_fullsock(__sk))  \
+   __ret = __cgroup_bpf_run_filter(__sk, skb,  \
+   BPF_CGROUP_INET_EGRESS); \
+   }   \
+   __ret;  \
+})
+
+#else
+
+struct cgroup_bpf {};
+static inline void cgroup_bpf_put(struct cgroup *cgrp) {}
+static inline void cgroup_bpf_inherit(struct cgroup *cgrp,
+ struct cgroup *parent) {}
+
+#define BPF_CGROUP_RUN_PROG_INET_INGRESS(sk,skb) ({ 0; })
+#define BPF_CGROUP_RUN_PROG_INET_EGRESS(sk,skb) ({ 0; })
+
+#endif /* CONFIG_CGROUP_BPF */
+
+#endif /* _BPF_CGROUP_H */
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 5b17de6..861b467 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -16,6 +16,7 @@
 #include 
 #include 
 #include 
+#include <linux/bpf-cgroup.h>
 
 #ifdef CONFIG_CGROUPS
 
@@ -300,6 +301,9 @@ struct cgroup {
/* used to schedule release agent */
struct work_struct release_agent_work;
 
+   /* used to store eBPF programs */

[PATCH v8 4/6] net: filter: run cgroup eBPF ingress programs

2016-11-17 Thread Daniel Mack
If the cgroup associated with the receiving socket has eBPF
programs installed, run them from sk_filter_trim_cap().

eBPF programs used in this context are expected to either return 1 to
let the packet pass, or != 1 to drop them. The programs have access to
the skb through bpf_skb_load_bytes(), and the payload starts at the
network headers (L3).

Note that cgroup_bpf_run_filter() is stubbed out as static inline nop
for !CONFIG_CGROUP_BPF, and is otherwise guarded by a static key if
the feature is unused.

Signed-off-by: Daniel Mack <dan...@zonque.org>
Acked-by: Alexei Starovoitov <a...@kernel.org>
---
 net/core/filter.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/net/core/filter.c b/net/core/filter.c
index e3813d6..474b486 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -78,6 +78,10 @@ int sk_filter_trim_cap(struct sock *sk, struct sk_buff *skb, 
unsigned int cap)
if (skb_pfmemalloc(skb) && !sock_flag(sk, SOCK_MEMALLOC))
return -ENOMEM;
 
+   err = BPF_CGROUP_RUN_PROG_INET_INGRESS(sk, skb);
+   if (err)
+   return err;
+
err = security_sock_rcv_skb(sk, skb);
if (err)
return err;
-- 
2.7.4



[PATCH v8 3/6] bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH commands

2016-11-17 Thread Daniel Mack
Extend the bpf(2) syscall by two new commands, BPF_PROG_ATTACH and
BPF_PROG_DETACH which allow attaching and detaching eBPF programs
to a target.

On the API level, the target could be anything that has an fd in
userspace, hence the field in union bpf_attr is called
'target_fd'.

When called with BPF_ATTACH_TYPE_CGROUP_INET_{E,IN}GRESS, the target is
expected to be a valid file descriptor of a cgroup v2 directory which
has the bpf controller enabled. These are the only use-cases
implemented by this patch at this point, but more can be added.

If a program of the given type already exists in the given cgroup,
the program is swapped atomically, so userspace does not have to drop
an existing program first before installing a new one, which would
otherwise leave a gap in which no program is attached.

For more information on the propagation logic to subcgroups, please
refer to the bpf cgroup controller implementation.

The API is guarded by CAP_NET_ADMIN.

Signed-off-by: Daniel Mack <dan...@zonque.org>
Acked-by: Alexei Starovoitov <a...@kernel.org>
---
 include/uapi/linux/bpf.h |  8 +
 kernel/bpf/syscall.c | 81 
 2 files changed, 89 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 1f3e6f1..f31b655 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -73,6 +73,8 @@ enum bpf_cmd {
BPF_PROG_LOAD,
BPF_OBJ_PIN,
BPF_OBJ_GET,
+   BPF_PROG_ATTACH,
+   BPF_PROG_DETACH,
 };
 
 enum bpf_map_type {
@@ -150,6 +152,12 @@ union bpf_attr {
__aligned_u64   pathname;
__u32   bpf_fd;
};
+
+   struct { /* anonymous struct used by BPF_PROG_ATTACH/DETACH commands */
+   __u32   target_fd;  /* container object to attach 
to */
+   __u32   attach_bpf_fd;  /* eBPF program to attach */
+   __u32   attach_type;
+   };
 } __attribute__((aligned(8)));
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 228f962..1814c01 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -822,6 +822,77 @@ static int bpf_obj_get(const union bpf_attr *attr)
return bpf_obj_get_user(u64_to_ptr(attr->pathname));
 }
 
+#ifdef CONFIG_CGROUP_BPF
+
+#define BPF_PROG_ATTACH_LAST_FIELD attach_type
+
+static int bpf_prog_attach(const union bpf_attr *attr)
+{
+   struct bpf_prog *prog;
+   struct cgroup *cgrp;
+
+   if (!capable(CAP_NET_ADMIN))
+   return -EPERM;
+
+   if (CHECK_ATTR(BPF_PROG_ATTACH))
+   return -EINVAL;
+
+   switch (attr->attach_type) {
+   case BPF_CGROUP_INET_INGRESS:
+   case BPF_CGROUP_INET_EGRESS:
+   prog = bpf_prog_get_type(attr->attach_bpf_fd,
+BPF_PROG_TYPE_CGROUP_SKB);
+   if (IS_ERR(prog))
+   return PTR_ERR(prog);
+
+   cgrp = cgroup_get_from_fd(attr->target_fd);
+   if (IS_ERR(cgrp)) {
+   bpf_prog_put(prog);
+   return PTR_ERR(cgrp);
+   }
+
+   cgroup_bpf_update(cgrp, prog, attr->attach_type);
+   cgroup_put(cgrp);
+   break;
+
+   default:
+   return -EINVAL;
+   }
+
+   return 0;
+}
+
+#define BPF_PROG_DETACH_LAST_FIELD attach_type
+
+static int bpf_prog_detach(const union bpf_attr *attr)
+{
+   struct cgroup *cgrp;
+
+   if (!capable(CAP_NET_ADMIN))
+   return -EPERM;
+
+   if (CHECK_ATTR(BPF_PROG_DETACH))
+   return -EINVAL;
+
+   switch (attr->attach_type) {
+   case BPF_CGROUP_INET_INGRESS:
+   case BPF_CGROUP_INET_EGRESS:
+   cgrp = cgroup_get_from_fd(attr->target_fd);
+   if (IS_ERR(cgrp))
+   return PTR_ERR(cgrp);
+
+   cgroup_bpf_update(cgrp, NULL, attr->attach_type);
+   cgroup_put(cgrp);
+   break;
+
+   default:
+   return -EINVAL;
+   }
+
+   return 0;
+}
+#endif /* CONFIG_CGROUP_BPF */
+
 SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, 
size)
 {
union bpf_attr attr = {};
@@ -888,6 +959,16 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, 
uattr, unsigned int, siz
case BPF_OBJ_GET:
err = bpf_obj_get(&attr);
break;
+
+#ifdef CONFIG_CGROUP_BPF
+   case BPF_PROG_ATTACH:
+   err = bpf_prog_attach(&attr);
+   break;
+   case BPF_PROG_DETACH:
+   err = bpf_prog_detach(&attr);
+   break;
+#endif
+
default:
err = -EINVAL;
break;
-- 
2.7.4



[PATCH v8 6/6] samples: bpf: add userspace example for attaching eBPF programs to cgroups

2016-11-17 Thread Daniel Mack
Add a simple userspace program to demonstrate the new API to attach eBPF
programs to cgroups. This is what it does:

 * Create arraymap in kernel with 4 byte keys and 8 byte values

 * Load eBPF program

   The eBPF program accesses the map passed in to store two pieces of
   information. The number of invocations of the program, which maps
   to the number of packets received, is stored to key 0. Key 1 is
   incremented on each iteration by the number of bytes stored in
   the skb.

 * Detach any eBPF program previously attached to the cgroup

 * Attach the new program to the cgroup using BPF_PROG_ATTACH

 * Once a second, read map[0] and map[1] to see how many bytes and
   packets were seen on any socket of tasks in the given cgroup.

The program takes a cgroup path as 1st argument, and either "ingress"
or "egress" as 2nd. Optionally, "drop" can be passed as 3rd argument,
which will make the generated eBPF program return 0 instead of 1, so
the kernel will drop the packet.

libbpf gained two new wrappers for the new syscall commands.

Signed-off-by: Daniel Mack <dan...@zonque.org>
Acked-by: Alexei Starovoitov <a...@kernel.org>
---
 samples/bpf/Makefile|   2 +
 samples/bpf/libbpf.c|  21 ++
 samples/bpf/libbpf.h|   3 +
 samples/bpf/test_cgrp2_attach.c | 147 
 4 files changed, 173 insertions(+)
 create mode 100644 samples/bpf/test_cgrp2_attach.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 12b7304..e4cdc74 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -22,6 +22,7 @@ hostprogs-y += spintest
 hostprogs-y += map_perf_test
 hostprogs-y += test_overhead
 hostprogs-y += test_cgrp2_array_pin
+hostprogs-y += test_cgrp2_attach
 hostprogs-y += xdp1
 hostprogs-y += xdp2
 hostprogs-y += test_current_task_under_cgroup
@@ -49,6 +50,7 @@ spintest-objs := bpf_load.o libbpf.o spintest_user.o
 map_perf_test-objs := bpf_load.o libbpf.o map_perf_test_user.o
 test_overhead-objs := bpf_load.o libbpf.o test_overhead_user.o
 test_cgrp2_array_pin-objs := libbpf.o test_cgrp2_array_pin.o
+test_cgrp2_attach-objs := libbpf.o test_cgrp2_attach.o
 xdp1-objs := bpf_load.o libbpf.o xdp1_user.o
 # reuse xdp1 source intentionally
 xdp2-objs := bpf_load.o libbpf.o xdp1_user.o
diff --git a/samples/bpf/libbpf.c b/samples/bpf/libbpf.c
index 9969e35..9ce707b 100644
--- a/samples/bpf/libbpf.c
+++ b/samples/bpf/libbpf.c
@@ -104,6 +104,27 @@ int bpf_prog_load(enum bpf_prog_type prog_type,
return syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));
 }
 
+int bpf_prog_attach(int prog_fd, int target_fd, enum bpf_attach_type type)
+{
+   union bpf_attr attr = {
+   .target_fd = target_fd,
+   .attach_bpf_fd = prog_fd,
+   .attach_type = type,
+   };
+
+   return syscall(__NR_bpf, BPF_PROG_ATTACH, &attr, sizeof(attr));
+}
+
+int bpf_prog_detach(int target_fd, enum bpf_attach_type type)
+{
+   union bpf_attr attr = {
+   .target_fd = target_fd,
+   .attach_type = type,
+   };
+
+   return syscall(__NR_bpf, BPF_PROG_DETACH, &attr, sizeof(attr));
+}
+
 int bpf_obj_pin(int fd, const char *pathname)
 {
union bpf_attr attr = {
diff --git a/samples/bpf/libbpf.h b/samples/bpf/libbpf.h
index ac6edb6..d0a799a 100644
--- a/samples/bpf/libbpf.h
+++ b/samples/bpf/libbpf.h
@@ -15,6 +15,9 @@ int bpf_prog_load(enum bpf_prog_type prog_type,
  const struct bpf_insn *insns, int insn_len,
  const char *license, int kern_version);
 
+int bpf_prog_attach(int prog_fd, int attachable_fd, enum bpf_attach_type type);
+int bpf_prog_detach(int attachable_fd, enum bpf_attach_type type);
+
 int bpf_obj_pin(int fd, const char *pathname);
 int bpf_obj_get(const char *pathname);
 
diff --git a/samples/bpf/test_cgrp2_attach.c b/samples/bpf/test_cgrp2_attach.c
new file mode 100644
index 000..63ef208
--- /dev/null
+++ b/samples/bpf/test_cgrp2_attach.c
@@ -0,0 +1,147 @@
+/* eBPF example program:
+ *
+ * - Creates arraymap in kernel with 4 bytes keys and 8 byte values
+ *
+ * - Loads eBPF program
+ *
+ *   The eBPF program accesses the map passed in to store two pieces of
+ *   information. The number of invocations of the program, which maps
+ *   to the number of packets received, is stored to key 0. Key 1 is
+ *   incremented on each iteration by the number of bytes stored in
+ *   the skb.
+ *
+ * - Detaches any eBPF program previously attached to the cgroup
+ *
+ * - Attaches the new program to a cgroup using BPF_PROG_ATTACH
+ *
+ * - Every second, reads map[0] and map[1] to see how many bytes and
+ *   packets were seen on any socket of tasks in the given cgroup.
+ */
+
+#define _GNU_SOURCE
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <stddef.h>
+#include <string.h>
+#include <unistd.h>
+#include <assert.h>
+#include <errno.h>
+#include <fcntl.h>
+
+#include <linux/bpf.h>
+
+#include "libbpf.h"
+
+enum {
+   MAP_KEY_PACKETS,
+   MAP_KEY_BYTES,
+};
+
+static int prog_load

[PATCH v8 1/6] bpf: add new prog type for cgroup socket filtering

2016-11-17 Thread Daniel Mack
This program type is similar to BPF_PROG_TYPE_SOCKET_FILTER, except that
it does not allow BPF_LD_[ABS|IND] instructions and hooks up the
bpf_skb_load_bytes() helper.

Programs of this type will be attached to cgroups for network filtering
and accounting.

Signed-off-by: Daniel Mack <dan...@zonque.org>
Acked-by: Alexei Starovoitov <a...@kernel.org>
---
 include/uapi/linux/bpf.h |  9 +
 net/core/filter.c| 23 +++
 2 files changed, 32 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index f09c70b..1f3e6f1 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -96,8 +96,17 @@ enum bpf_prog_type {
BPF_PROG_TYPE_TRACEPOINT,
BPF_PROG_TYPE_XDP,
BPF_PROG_TYPE_PERF_EVENT,
+   BPF_PROG_TYPE_CGROUP_SKB,
 };
 
+enum bpf_attach_type {
+   BPF_CGROUP_INET_INGRESS,
+   BPF_CGROUP_INET_EGRESS,
+   __MAX_BPF_ATTACH_TYPE
+};
+
+#define MAX_BPF_ATTACH_TYPE __MAX_BPF_ATTACH_TYPE
+
 #define BPF_PSEUDO_MAP_FD  1
 
 /* flags for BPF_MAP_UPDATE_ELEM command */
diff --git a/net/core/filter.c b/net/core/filter.c
index 00351cd..e3813d6 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2576,6 +2576,17 @@ xdp_func_proto(enum bpf_func_id func_id)
}
 }
 
+static const struct bpf_func_proto *
+cg_skb_func_proto(enum bpf_func_id func_id)
+{
+   switch (func_id) {
+   case BPF_FUNC_skb_load_bytes:
+   return &bpf_skb_load_bytes_proto;
+   default:
+   return sk_filter_func_proto(func_id);
+   }
+}
+
 static bool __is_valid_access(int off, int size, enum bpf_access_type type)
 {
if (off < 0 || off >= sizeof(struct __sk_buff))
@@ -2938,6 +2949,12 @@ static const struct bpf_verifier_ops xdp_ops = {
.convert_ctx_access = xdp_convert_ctx_access,
 };
 
+static const struct bpf_verifier_ops cg_skb_ops = {
+   .get_func_proto = cg_skb_func_proto,
+   .is_valid_access= sk_filter_is_valid_access,
+   .convert_ctx_access = sk_filter_convert_ctx_access,
+};
+
 static struct bpf_prog_type_list sk_filter_type __read_mostly = {
.ops    = &sk_filter_ops,
.type   = BPF_PROG_TYPE_SOCKET_FILTER,
@@ -2958,12 +2975,18 @@ static struct bpf_prog_type_list xdp_type __read_mostly 
= {
.type   = BPF_PROG_TYPE_XDP,
 };
 
+static struct bpf_prog_type_list cg_skb_type __read_mostly = {
+   .ops    = &cg_skb_ops,
+   .type   = BPF_PROG_TYPE_CGROUP_SKB,
+};
+
 static int __init register_sk_filter_ops(void)
 {
bpf_register_prog_type(&sk_filter_type);
bpf_register_prog_type(&sched_cls_type);
bpf_register_prog_type(&sched_act_type);
bpf_register_prog_type(&xdp_type);
+   bpf_register_prog_type(&cg_skb_type);
 
return 0;
 }
-- 
2.7.4



[PATCH v8 5/6] net: ipv4, ipv6: run cgroup eBPF egress programs

2016-11-17 Thread Daniel Mack
If the cgroup associated with the receiving socket has eBPF
programs installed, run them from ip_output(), ip6_output() and
ip_mc_output(). From mentioned functions we have two socket contexts
as per 7026b1ddb6b8 ("netfilter: Pass socket pointer down through
okfn()."). We explicitly need to use sk instead of skb->sk here,
since otherwise the same program would run multiple times on egress
when encap devices are involved, which is not desired in our case.

eBPF programs used in this context are expected to either return 1 to
let the packet pass, or != 1 to drop them. The programs have access to
the skb through bpf_skb_load_bytes(), and the payload starts at the
network headers (L3).

Note that cgroup_bpf_run_filter() is stubbed out as static inline nop
for !CONFIG_CGROUP_BPF, and is otherwise guarded by a static key if
the feature is unused.

Signed-off-by: Daniel Mack <dan...@zonque.org>
Acked-by: Alexei Starovoitov <a...@kernel.org>
---
 net/ipv4/ip_output.c  | 15 +++
 net/ipv6/ip6_output.c |  8 
 2 files changed, 23 insertions(+)

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 03e7f73..5914006 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -74,6 +74,7 @@
 #include 
 #include 
 #include 
+#include <linux/bpf-cgroup.h>
 #include 
 #include 
 #include 
@@ -303,6 +304,7 @@ int ip_mc_output(struct net *net, struct sock *sk, struct 
sk_buff *skb)
 {
struct rtable *rt = skb_rtable(skb);
struct net_device *dev = rt->dst.dev;
+   int ret;
 
/*
 *  If the indicated interface is up and running, send the packet.
@@ -312,6 +314,12 @@ int ip_mc_output(struct net *net, struct sock *sk, struct 
sk_buff *skb)
skb->dev = dev;
skb->protocol = htons(ETH_P_IP);
 
+   ret = BPF_CGROUP_RUN_PROG_INET_EGRESS(sk, skb);
+   if (ret) {
+   kfree_skb(skb);
+   return ret;
+   }
+
/*
 *  Multicasts are looped back for other local users
 */
@@ -364,12 +372,19 @@ int ip_mc_output(struct net *net, struct sock *sk, struct 
sk_buff *skb)
 int ip_output(struct net *net, struct sock *sk, struct sk_buff *skb)
 {
struct net_device *dev = skb_dst(skb)->dev;
+   int ret;
 
IP_UPD_PO_STATS(net, IPSTATS_MIB_OUT, skb->len);
 
skb->dev = dev;
skb->protocol = htons(ETH_P_IP);
 
+   ret = BPF_CGROUP_RUN_PROG_INET_EGRESS(sk, skb);
+   if (ret) {
+   kfree_skb(skb);
+   return ret;
+   }
+
return NF_HOOK_COND(NFPROTO_IPV4, NF_INET_POST_ROUTING,
net, sk, skb, NULL, dev,
ip_finish_output,
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 6001e78..483f91b 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -39,6 +39,7 @@
 #include 
 #include 
 
+#include <linux/bpf-cgroup.h>
 #include 
 #include 
 
@@ -143,6 +144,7 @@ int ip6_output(struct net *net, struct sock *sk, struct 
sk_buff *skb)
 {
struct net_device *dev = skb_dst(skb)->dev;
struct inet6_dev *idev = ip6_dst_idev(skb_dst(skb));
+   int ret;
 
if (unlikely(idev->cnf.disable_ipv6)) {
IP6_INC_STATS(net, idev, IPSTATS_MIB_OUTDISCARDS);
@@ -150,6 +152,12 @@ int ip6_output(struct net *net, struct sock *sk, struct 
sk_buff *skb)
return 0;
}
 
+   ret = BPF_CGROUP_RUN_PROG_INET_EGRESS(sk, skb);
+   if (ret) {
+   kfree_skb(skb);
+   return ret;
+   }
+
return NF_HOOK_COND(NFPROTO_IPV6, NF_INET_POST_ROUTING,
net, sk, skb, NULL, dev,
ip6_finish_output,
-- 
2.7.4



Re: [PATCH nf-next,RFC] netfilter: nft_meta: add cgroup version 2 support

2016-11-14 Thread Daniel Mack
Hi Pablo,

On 11/14/2016 10:12 AM, Pablo Neira Ayuso wrote:
> Add cgroup version 2 support to nf_tables.
> 
> This extension allows us to fetch the cgroup i-node number from the
> cgroup socket data, place it in a register, then match it against any
> value specified by user. This approach scales up nicely since it
> integrates well in the existing nf_tables map infrastructure.
> 
> Contrary to what iptables cgroup v2 match does, this patch doesn't use
> cgroup_is_descendant() because this call cannot guarantee that the cgroup
> hierarchy is honored in any way given that the cgroup v2 field becomes yet
> another packet selector that you can use to build your filtering policy.
> 
> Actually, using the i-node approach, it should be easy to build a policy
> that honors the hierarchy if you need this, eg.
> 
>   meta cgroup2 vmap { "/A/B" : jump b-cgroup-chain,
>   "/A/C" : jump c-cgroup-chain,
>   "/A" : jump a-cgroup-chain }
> 
> then, the b-cgroup-chain looks like:
> 
>   jump a-cgroup-chain
>   ... # specific policy b-cgroup-chain goes here
> 
> similarly, the c-cgroup-chain looks like:
> 
>   jump a-cgroup-chain
>   ... # specific policy c-cgroup-chain goes here
> 
> So both B and C would evaluate A's ruleset. Note that cgroup A would
> also jump to the root cgroup chain policy.
> 
> Anyway, this cgroup i-node approach provides way more flexibility since
> it is up to the sysadmin to decide if he wants to honor the hierarchy or
> simply define a fast path to skip any further classification.

I don't think this can work. The problem is that inodes in cgroupfs are
dynamically allocated when a cgroup is created, so the sysadmin cannot
install the jump rules before that. Worse yet, inode numbers in pseudo
filesystems are recycled, so if a cgroup goes away and a new one is
created, the latter may well end up having the same inode as the old
one. As cgroupfs is decoupled from netfilter tables, this will lead to
major chaos in the field.

Note that this was different with the netclass controller in v1 that
would assign a user-controlled numeric value to each cgroup, so both
sides were in the control of the sysadmin. It is also different with the
path matching logic for v2 which does a full path string comparison.
That's potentially expensive, but it does lead to predictable runtime behavior.

One way forward here would be to assign an atomically increasing 64-bit
sequence number to each cgroup and expose that. I've recently talked to
Tejun about that. While that won't solve the predictability issue, it
would at least make it practically impossible to have re-used IDs.

Anyway - I think it would be great to have an alternative to the v2 path
matching here, but of course this patch does not solve the ingress issue
we've been discussing. It is still impossible to reliably determine the
cgroup of a local receiver at the time when the netfilter rules are
processed, even for unicast packets.



Thanks,
Daniel


> 
> Signed-off-by: Pablo Neira Ayuso 
> ---
>  include/uapi/linux/netfilter/nf_tables.h |  2 ++
>  net/netfilter/nft_meta.c | 15 +++
>  2 files changed, 17 insertions(+)
> 
> diff --git a/include/uapi/linux/netfilter/nf_tables.h 
> b/include/uapi/linux/netfilter/nf_tables.h
> index 0da7ccf65511..5d4d08367a87 100644
> --- a/include/uapi/linux/netfilter/nf_tables.h
> +++ b/include/uapi/linux/netfilter/nf_tables.h
> @@ -729,6 +729,7 @@ enum nft_exthdr_attributes {
>   * @NFT_META_OIFGROUP: packet output interface group
>   * @NFT_META_CGROUP: socket control group (skb->sk->sk_classid)
>   * @NFT_META_PRANDOM: a 32bit pseudo-random number
> + * @NFT_META_CGROUP2: socket control group v2 (skb->sk->sk_cgrp_data)
>   */
>  enum nft_meta_keys {
>   NFT_META_LEN,
> @@ -756,6 +757,7 @@ enum nft_meta_keys {
>   NFT_META_OIFGROUP,
>   NFT_META_CGROUP,
>   NFT_META_PRANDOM,
> + NFT_META_CGROUP2,
>  };
>  
>  /**
> diff --git a/net/netfilter/nft_meta.c b/net/netfilter/nft_meta.c
> index 6c1e0246706e..1e793e133903 100644
> --- a/net/netfilter/nft_meta.c
> +++ b/net/netfilter/nft_meta.c
> @@ -190,6 +190,18 @@ void nft_meta_get_eval(const struct nft_expr *expr,
>   *dest = prandom_u32_state(state);
>   break;
>   }
> +#ifdef CONFIG_SOCK_CGROUP_DATA
> + case NFT_META_CGROUP2: {
> + struct cgroup *cgrp;
> +
> + if (!skb->sk || !sk_fullsock(skb->sk))
> + goto err;
> +
> + cgrp = sock_cgroup_ptr(&skb->sk->sk_cgrp_data);
> + *dest = cgrp->kn->ino;
> + break;
> + }
> +#endif
>   default:
>   WARN_ON(1);
>   goto err;
> @@ -273,6 +285,9 @@ int nft_meta_get_init(const struct nft_ctx *ctx,
>  #ifdef CONFIG_CGROUP_NET_CLASSID
>   case NFT_META_CGROUP:
>  #endif
> +#ifdef CONFIG_SOCK_CGROUP_DATA
> + case NFT_META_CGROUP2:
> +#endif
>   len = 

Re: [PATCH v2 net-next 1/5] bpf: Refactor cgroups code in prep for new type

2016-10-31 Thread Daniel Mack
On 10/31/2016 06:05 PM, David Ahern wrote:
> On 10/31/16 11:00 AM, Daniel Mack wrote:
>> On 10/31/2016 05:58 PM, David Miller wrote:
>>> From: David Ahern <d...@cumulusnetworks.com> Date: Wed, 26 Oct
>>> 2016 17:58:38 -0700
>>> 
>>>> diff --git a/include/uapi/linux/bpf.h
>>>> b/include/uapi/linux/bpf.h index 6b62ee9a2f78..73da296c2125
>>>> 100644 --- a/include/uapi/linux/bpf.h +++
>>>> b/include/uapi/linux/bpf.h @@ -98,7 +98,7 @@ enum bpf_prog_type
>>>> { BPF_PROG_TYPE_TRACEPOINT, BPF_PROG_TYPE_XDP, 
>>>> BPF_PROG_TYPE_PERF_EVENT, -BPF_PROG_TYPE_CGROUP_SKB, +
>>>> BPF_PROG_TYPE_CGROUP, };
>>>> 
>>>> enum bpf_attach_type {
>>> 
>>> If we do this then the cgroup-bpf series should use this value
>>> rather than changing it after-the-fact in your series here.
>>> 
>> 
>> Yeah, I'm confused too. I changed that name in my v7 from 
>> BPF_PROG_TYPE_CGROUP_SOCK to BPF_PROG_TYPE_CGROUP_SKB on David's
>> (Ahern) request. Why is it now renamed again?
> 
> Thomas pushed back on adding another program type in favor of using
> subtypes. So this makes the program type generic to CGROUP and patch
> 2 in this v2 set added Mickaël's subtype patch with the socket
> mangling done that way in patch 3.
> 

Fine for me. I can change it around again.


Thanks,
Daniel


Re: [PATCH v2 net-next 1/5] bpf: Refactor cgroups code in prep for new type

2016-10-31 Thread Daniel Mack
On 10/31/2016 05:58 PM, David Miller wrote:
> From: David Ahern 
> Date: Wed, 26 Oct 2016 17:58:38 -0700
> 
>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>> index 6b62ee9a2f78..73da296c2125 100644
>> --- a/include/uapi/linux/bpf.h
>> +++ b/include/uapi/linux/bpf.h
>> @@ -98,7 +98,7 @@ enum bpf_prog_type {
>>  BPF_PROG_TYPE_TRACEPOINT,
>>  BPF_PROG_TYPE_XDP,
>>  BPF_PROG_TYPE_PERF_EVENT,
>> -BPF_PROG_TYPE_CGROUP_SKB,
>> +BPF_PROG_TYPE_CGROUP,
>>  };
>>  
>>  enum bpf_attach_type {
> 
> If we do this then the cgroup-bpf series should use this value rather than
> changing it after-the-fact in your series here.
> 

Yeah, I'm confused too. I changed that name in my v7 from
BPF_PROG_TYPE_CGROUP_SOCK to BPF_PROG_TYPE_CGROUP_SKB on David's (Ahern)
request. Why is it now renamed again?


Thanks,
Daniel



Re: [PATCH v7 0/6] Add eBPF hooks for cgroups

2016-10-28 Thread Daniel Mack
On 10/28/2016 01:53 PM, Pablo Neira Ayuso wrote:
> On Thu, Oct 27, 2016 at 10:40:14AM +0200, Daniel Mack wrote:

>> It's not anything new. These hooks live on the very same level as
>> SO_ATTACH_FILTER. The only differences are that the BPF programs are
>> stored in the cgroup, and not in the socket, and that they exist for
>> egress as well.
> 
> Can we agree this is going further than SO_ATTACH_FILTER?

It's the same level. Only the way of setting the program(s) is different.

>> Adding it there would mean we need to early-demux *every* packet as soon
>> as there is *any* such rule installed, and that renders many
>> optimizations in the kernel to drop traffic that has no local receiver
>> useless.
> 
> I think such concern applies to doing early demux unconditionally in
> all possible scenarios (such as UDP broadcast/multicast), that implies
> wasted cycles for people not requiring this.

If you have a rule that acts on a condition based on a local receiver
detail such as a cgroup membership, then the INPUT filter *must* know
the local receiver for *all* packets passing by, otherwise it cannot act
upon it. And that means that you have to early-demux in any case as long
as at least one such rule exists.

> If we can do what demuxing in an optional way, ie. only when socket
> filtering is required, then only those that need it would pay that
> price. Actually, if we can do this demux very early, from ingress,
> performance numbers would be also good to perform any socket-based
> filtering.

For multicast, rules have to be executed for each receiver, which is
another reason why the INPUT path is the wrong place to solve the problem.

You actually convinced me yourself about these details, but you seem to
constantly change your opinion about all this. Why is this such a
whack-a-mole game?

> I guess you're using an old kernel and refering to iptables, this is
> not true for some time, so we don't have any impact now with loaded
> iptables modules.

My point is that the performance decrease introduced by my patch set is
not really measurable, even if you pipe all the wire-saturating test
traffic through the example program. At least not with my setup here. If
a local receiver has no applicable bpf in its cgroup, the logic bails
out way earlier, leading a lot less overhead even. And if no cgroup has
any program attached, the code is basically no-op thanks to the static
branch. I really see no reason to block this patch set due to unfounded
claims of bad performance.


Thanks,
Daniel



Re: [PATCH v7 0/6] Add eBPF hooks for cgroups

2016-10-27 Thread Daniel Mack
On 10/26/2016 09:59 PM, Pablo Neira Ayuso wrote:
> On Tue, Oct 25, 2016 at 12:14:08PM +0200, Daniel Mack wrote:
> [...]
>>   Dumping programs once they are installed is problematic because of
>>   the internal optimizations done to the eBPF program during its
>>   lifetime. Also, the references to maps etc. would need to be
>>   restored during the dump.
>>
>>   Just exposing whether or not a program is attached would be
>>   trivial to do, however, most easily through another bpf(2)
>>   command. That can be added later on though.
> 
> I don't know if anyone told you, but during last netconf, this topic
> took a bit of time of discussion and it was controversial, I would say
> 1/3 of netdev hackers there showed their concerns, and that's
> something that should not be skipped IMO.
> 
> While xdp is pushing bpf programs at the very early packet path, not
> interfering with the stack, before even entering the generic ingress
> path. But this is adding hooks to push bpf programs in the middle of
> our generic stack, this is way different domain.

It's not anything new. These hooks live on the very same level as
SO_ATTACH_FILTER. The only differences are that the BPF programs are
stored in the cgroup, and not in the socket, and that they exist for
egress as well.

> I would really like to explore way earlier filtering, by extending
> socket lookup facilities. So far the problem seems to be that we need
> to lookup for broadcast/multicast UDP sockets and those cannot be
> attach via the usual skb->sk.

We've been there. We've discussed all that. And we concluded that doing
early demux in the input filter path is not the right approach. That was
my very first take on that issue back in June 2015 (!), and it was
rightfully turned down for good reasons.

Adding it there would mean we need to early-demux *every* packet as soon
as there is *any* such rule installed, and that renders many
optimizations in the kernel to drop traffic that has no local receiver
useless.

> I think it would be possible to wrap
> around this socket code in functions so we can invoke it. I guess
> filtering of UDP and TCP should be good for you at this stage. This
> would require more work though, but this would come with no hooks in
> the stack and packets will not have to consume *lots of cycles* just
> to be dropped before entering the socket queue.
>
> How useful can be to drop lots of unwanted traffic at such a late
> stage? How would the performance numbers to drop packets would look
> like? Extremely bad, I predict.

I fear I'm repeating myself here, but this is unfounded. I'm not sure
why you keep bringing it up. As I said weeks ago - just loading the
netfilter modules without any rules deployed has more impact than
running the example program in 6/6 on every packet in the test traffic.
Please give it a shot yourself.

Also, the eBPF programs can well be used in combination with existing
netfilter setups. There is no reason to not combine the two levels of
filtering. Both have their right to exist, and nobody is taking anything
away.


Thanks,
Daniel



[PATCH v7 2/6] cgroup: add support for eBPF programs

2016-10-25 Thread Daniel Mack
This patch adds two sets of eBPF program pointers to struct cgroup.
One for such that are directly pinned to a cgroup, and one for such
that are effective for it.

To illustrate the logic behind that, assume the following example
cgroup hierarchy.

  A - B - C
        \ D - E

If only B has a program attached, it will be effective for B, C, D
and E. If D then attaches a program itself, that will be effective for
both D and E, and the program in B will only affect B and C. Only one
program of a given type is effective for a cgroup.

Attaching and detaching programs will be done through the bpf(2)
syscall. For now, ingress and egress inet socket filtering are the
only supported use-cases.

Signed-off-by: Daniel Mack <dan...@zonque.org>
Acked-by: Alexei Starovoitov <a...@kernel.org>
---
 include/linux/bpf-cgroup.h  |  71 +++
 include/linux/cgroup-defs.h |   4 ++
 init/Kconfig|  12 
 kernel/bpf/Makefile |   1 +
 kernel/bpf/cgroup.c | 167 
 kernel/cgroup.c |  18 +
 6 files changed, 273 insertions(+)
 create mode 100644 include/linux/bpf-cgroup.h
 create mode 100644 kernel/bpf/cgroup.c

diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
new file mode 100644
index 000..fc076de
--- /dev/null
+++ b/include/linux/bpf-cgroup.h
@@ -0,0 +1,71 @@
+#ifndef _BPF_CGROUP_H
+#define _BPF_CGROUP_H
+
+#include 
+#include 
+#include 
+
+struct sock;
+struct cgroup;
+struct sk_buff;
+
+#ifdef CONFIG_CGROUP_BPF
+
+extern struct static_key_false cgroup_bpf_enabled_key;
+#define cgroup_bpf_enabled static_branch_unlikely(&cgroup_bpf_enabled_key)
+
+struct cgroup_bpf {
+   /*
+* Store two sets of bpf_prog pointers, one for programs that are
+* pinned directly to this cgroup, and one for those that are effective
+* when this cgroup is accessed.
+*/
+   struct bpf_prog *prog[MAX_BPF_ATTACH_TYPE];
+   struct bpf_prog *effective[MAX_BPF_ATTACH_TYPE];
+};
+
+void cgroup_bpf_put(struct cgroup *cgrp);
+void cgroup_bpf_inherit(struct cgroup *cgrp, struct cgroup *parent);
+
+void __cgroup_bpf_update(struct cgroup *cgrp,
+struct cgroup *parent,
+struct bpf_prog *prog,
+enum bpf_attach_type type);
+
+/* Wrapper for __cgroup_bpf_update() protected by cgroup_mutex */
+void cgroup_bpf_update(struct cgroup *cgrp,
+  struct bpf_prog *prog,
+  enum bpf_attach_type type);
+
+int __cgroup_bpf_run_filter(struct sock *sk,
+   struct sk_buff *skb,
+   enum bpf_attach_type type);
+
+/* Wrapper for __cgroup_bpf_run_filter() guarded by cgroup_bpf_enabled */
+static inline int cgroup_bpf_run_filter(struct sock *sk,
+   struct sk_buff *skb,
+   enum bpf_attach_type type)
+{
+   if (cgroup_bpf_enabled)
+   return __cgroup_bpf_run_filter(sk, skb, type);
+
+   return 0;
+}
+
+#else
+
+struct cgroup_bpf {};
+static inline void cgroup_bpf_put(struct cgroup *cgrp) {}
+static inline void cgroup_bpf_inherit(struct cgroup *cgrp,
+ struct cgroup *parent) {}
+
+static inline int cgroup_bpf_run_filter(struct sock *sk,
+   struct sk_buff *skb,
+   enum bpf_attach_type type)
+{
+   return 0;
+}
+
+#endif /* CONFIG_CGROUP_BPF */
+
+#endif /* _BPF_CGROUP_H */
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 5b17de6..861b467 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -16,6 +16,7 @@
 #include 
 #include 
 #include 
+#include <linux/bpf-cgroup.h>
 
 #ifdef CONFIG_CGROUPS
 
@@ -300,6 +301,9 @@ struct cgroup {
/* used to schedule release agent */
struct work_struct release_agent_work;
 
+   /* used to store eBPF programs */
+   struct cgroup_bpf bpf;
+
/* ids of the ancestors at each level including self */
int ancestor_ids[];
 };
diff --git a/init/Kconfig b/init/Kconfig
index 34407f1..405120b 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1154,6 +1154,18 @@ config CGROUP_PERF
 
  Say N if unsure.
 
+config CGROUP_BPF
+   bool "Support for eBPF programs attached to cgroups"
+   depends on BPF_SYSCALL && SOCK_CGROUP_DATA
+   help
+ Allow attaching eBPF programs to a cgroup using the bpf(2)
+ syscall command BPF_PROG_ATTACH.
+
+ In which context these programs are accessed depends on the type
+ of attachment. For instance, programs that are attached using
+ BPF_CGROUP_INET_INGRESS will be executed on the ingress path of
+ inet sockets.
+
 config CGROUP_DEBUG
bool "Example controller"
default n
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
in

[PATCH v7 3/6] bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH commands

2016-10-25 Thread Daniel Mack
Extend the bpf(2) syscall by two new commands, BPF_PROG_ATTACH and
BPF_PROG_DETACH which allow attaching and detaching eBPF programs
to a target.

On the API level, the target could be anything that has an fd in
userspace, hence the name of the field in union bpf_attr is called
'target_fd'.

When called with BPF_ATTACH_TYPE_CGROUP_INET_{E,IN}GRESS, the target is
expected to be a valid file descriptor of a cgroup v2 directory which
has the bpf controller enabled. These are the only use-cases
implemented by this patch at this point, but more can be added.

If a program of the given type already exists in the given cgroup,
the program is swapped atomically, so userspace does not have to drop
an existing program first before installing a new one, which would
otherwise leave a gap in which no program is attached.

For more information on the propagation logic to subcgroups, please
refer to the bpf cgroup controller implementation.

The API is guarded by CAP_NET_ADMIN.
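
As a rough illustration (not part of this patch, and assuming userspace
headers that already carry the BPF_PROG_ATTACH command and the bpf_attr
fields shown below), the new command could be driven from userspace like
this; the cgroup path and function name are made up for the example:

	#include <fcntl.h>
	#include <unistd.h>
	#include <sys/syscall.h>
	#include <linux/bpf.h>

	/* Attach an already-loaded eBPF program (prog_fd) to the cgroup v2
	 * directory at 'path'. */
	static int attach_prog_to_cgroup(int prog_fd, const char *path,
					 enum bpf_attach_type type)
	{
		union bpf_attr attr = {};
		int ret, cg_fd;

		cg_fd = open(path, O_DIRECTORY | O_RDONLY);
		if (cg_fd < 0)
			return -1;

		attr.target_fd = cg_fd;
		attr.attach_bpf_fd = prog_fd;
		attr.attach_type = type;

		ret = syscall(__NR_bpf, BPF_PROG_ATTACH, &attr, sizeof(attr));
		close(cg_fd);
		return ret;
	}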

Signed-off-by: Daniel Mack <dan...@zonque.org>
Acked-by: Alexei Starovoitov <a...@kernel.org>
---
 include/uapi/linux/bpf.h |  8 +
 kernel/bpf/syscall.c | 81 
 2 files changed, 89 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 1f3e6f1..f31b655 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -73,6 +73,8 @@ enum bpf_cmd {
BPF_PROG_LOAD,
BPF_OBJ_PIN,
BPF_OBJ_GET,
+   BPF_PROG_ATTACH,
+   BPF_PROG_DETACH,
 };
 
 enum bpf_map_type {
@@ -150,6 +152,12 @@ union bpf_attr {
__aligned_u64   pathname;
__u32   bpf_fd;
};
+
+   struct { /* anonymous struct used by BPF_PROG_ATTACH/DETACH commands */
+   __u32   target_fd;  /* container object to attach 
to */
+   __u32   attach_bpf_fd;  /* eBPF program to attach */
+   __u32   attach_type;
+   };
 } __attribute__((aligned(8)));
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 228f962..1814c01 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -822,6 +822,77 @@ static int bpf_obj_get(const union bpf_attr *attr)
return bpf_obj_get_user(u64_to_ptr(attr->pathname));
 }
 
+#ifdef CONFIG_CGROUP_BPF
+
+#define BPF_PROG_ATTACH_LAST_FIELD attach_type
+
+static int bpf_prog_attach(const union bpf_attr *attr)
+{
+   struct bpf_prog *prog;
+   struct cgroup *cgrp;
+
+   if (!capable(CAP_NET_ADMIN))
+   return -EPERM;
+
+   if (CHECK_ATTR(BPF_PROG_ATTACH))
+   return -EINVAL;
+
+   switch (attr->attach_type) {
+   case BPF_CGROUP_INET_INGRESS:
+   case BPF_CGROUP_INET_EGRESS:
+   prog = bpf_prog_get_type(attr->attach_bpf_fd,
+BPF_PROG_TYPE_CGROUP_SKB);
+   if (IS_ERR(prog))
+   return PTR_ERR(prog);
+
+   cgrp = cgroup_get_from_fd(attr->target_fd);
+   if (IS_ERR(cgrp)) {
+   bpf_prog_put(prog);
+   return PTR_ERR(cgrp);
+   }
+
+   cgroup_bpf_update(cgrp, prog, attr->attach_type);
+   cgroup_put(cgrp);
+   break;
+
+   default:
+   return -EINVAL;
+   }
+
+   return 0;
+}
+
+#define BPF_PROG_DETACH_LAST_FIELD attach_type
+
+static int bpf_prog_detach(const union bpf_attr *attr)
+{
+   struct cgroup *cgrp;
+
+   if (!capable(CAP_NET_ADMIN))
+   return -EPERM;
+
+   if (CHECK_ATTR(BPF_PROG_DETACH))
+   return -EINVAL;
+
+   switch (attr->attach_type) {
+   case BPF_CGROUP_INET_INGRESS:
+   case BPF_CGROUP_INET_EGRESS:
+   cgrp = cgroup_get_from_fd(attr->target_fd);
+   if (IS_ERR(cgrp))
+   return PTR_ERR(cgrp);
+
+   cgroup_bpf_update(cgrp, NULL, attr->attach_type);
+   cgroup_put(cgrp);
+   break;
+
+   default:
+   return -EINVAL;
+   }
+
+   return 0;
+}
+#endif /* CONFIG_CGROUP_BPF */
+
 SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, 
size)
 {
union bpf_attr attr = {};
@@ -888,6 +959,16 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, 
uattr, unsigned int, siz
case BPF_OBJ_GET:
err = bpf_obj_get(&attr);
break;
+
+#ifdef CONFIG_CGROUP_BPF
+   case BPF_PROG_ATTACH:
+   err = bpf_prog_attach(&attr);
+   break;
+   case BPF_PROG_DETACH:
+   err = bpf_prog_detach(&attr);
+   break;
+#endif
+
default:
err = -EINVAL;
break;
-- 
2.7.4



[PATCH v7 6/6] samples: bpf: add userspace example for attaching eBPF programs to cgroups

2016-10-25 Thread Daniel Mack
Add a simple userspace program to demonstrate the new API to attach eBPF
programs to cgroups. This is what it does:

 * Create arraymap in kernel with 4 byte keys and 8 byte values

 * Load eBPF program

   The eBPF program accesses the map passed in to store two pieces of
   information. The number of invocations of the program, which maps
   to the number of packets received, is stored to key 0. Key 1 is
   incremented on each iteration by the number of bytes stored in
   the skb.

 * Detach any eBPF program previously attached to the cgroup

 * Attach the new program to the cgroup using BPF_PROG_ATTACH

 * Once a second, read map[0] and map[1] to see how many bytes and
   packets were seen on any socket of tasks in the given cgroup.

The program takes a cgroup path as 1st argument, and either "ingress"
or "egress" as 2nd. Optionally, "drop" can be passed as 3rd argument,
which will make the generated eBPF program return 0 instead of 1, so
the kernel will drop the packet.
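
For illustration only (the path below is an example, nothing in this
patch mandates it): on a system with the cgroup v2 hierarchy mounted at
/sys/fs/cgroup, one could create a group, move a shell into it by
writing its PID to cgroup.procs, and then run
"./test_cgrp2_attach /sys/fs/cgroup/foo egress" to watch the per-second
byte and packet counters for that shell's traffic.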

libbpf gained two new wrappers for the new syscall commands.

Signed-off-by: Daniel Mack <dan...@zonque.org>
Acked-by: Alexei Starovoitov <a...@kernel.org>
---
 samples/bpf/Makefile|   2 +
 samples/bpf/libbpf.c|  21 ++
 samples/bpf/libbpf.h|   3 +
 samples/bpf/test_cgrp2_attach.c | 147 
 4 files changed, 173 insertions(+)
 create mode 100644 samples/bpf/test_cgrp2_attach.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 12b7304..e4cdc74 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -22,6 +22,7 @@ hostprogs-y += spintest
 hostprogs-y += map_perf_test
 hostprogs-y += test_overhead
 hostprogs-y += test_cgrp2_array_pin
+hostprogs-y += test_cgrp2_attach
 hostprogs-y += xdp1
 hostprogs-y += xdp2
 hostprogs-y += test_current_task_under_cgroup
@@ -49,6 +50,7 @@ spintest-objs := bpf_load.o libbpf.o spintest_user.o
 map_perf_test-objs := bpf_load.o libbpf.o map_perf_test_user.o
 test_overhead-objs := bpf_load.o libbpf.o test_overhead_user.o
 test_cgrp2_array_pin-objs := libbpf.o test_cgrp2_array_pin.o
+test_cgrp2_attach-objs := libbpf.o test_cgrp2_attach.o
 xdp1-objs := bpf_load.o libbpf.o xdp1_user.o
 # reuse xdp1 source intentionally
 xdp2-objs := bpf_load.o libbpf.o xdp1_user.o
diff --git a/samples/bpf/libbpf.c b/samples/bpf/libbpf.c
index 9969e35..9ce707b 100644
--- a/samples/bpf/libbpf.c
+++ b/samples/bpf/libbpf.c
@@ -104,6 +104,27 @@ int bpf_prog_load(enum bpf_prog_type prog_type,
return syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));
 }
 
+int bpf_prog_attach(int prog_fd, int target_fd, enum bpf_attach_type type)
+{
+   union bpf_attr attr = {
+   .target_fd = target_fd,
+   .attach_bpf_fd = prog_fd,
+   .attach_type = type,
+   };
+
+   return syscall(__NR_bpf, BPF_PROG_ATTACH, &attr, sizeof(attr));
+}
+
+int bpf_prog_detach(int target_fd, enum bpf_attach_type type)
+{
+   union bpf_attr attr = {
+   .target_fd = target_fd,
+   .attach_type = type,
+   };
+
+   return syscall(__NR_bpf, BPF_PROG_DETACH, &attr, sizeof(attr));
+}
+
 int bpf_obj_pin(int fd, const char *pathname)
 {
union bpf_attr attr = {
diff --git a/samples/bpf/libbpf.h b/samples/bpf/libbpf.h
index ac6edb6..d0a799a 100644
--- a/samples/bpf/libbpf.h
+++ b/samples/bpf/libbpf.h
@@ -15,6 +15,9 @@ int bpf_prog_load(enum bpf_prog_type prog_type,
  const struct bpf_insn *insns, int insn_len,
  const char *license, int kern_version);
 
+int bpf_prog_attach(int prog_fd, int attachable_fd, enum bpf_attach_type type);
+int bpf_prog_detach(int attachable_fd, enum bpf_attach_type type);
+
 int bpf_obj_pin(int fd, const char *pathname);
 int bpf_obj_get(const char *pathname);
 
diff --git a/samples/bpf/test_cgrp2_attach.c b/samples/bpf/test_cgrp2_attach.c
new file mode 100644
index 000..63ef208
--- /dev/null
+++ b/samples/bpf/test_cgrp2_attach.c
@@ -0,0 +1,147 @@
+/* eBPF example program:
+ *
+ * - Creates arraymap in kernel with 4 bytes keys and 8 byte values
+ *
+ * - Loads eBPF program
+ *
+ *   The eBPF program accesses the map passed in to store two pieces of
+ *   information. The number of invocations of the program, which maps
+ *   to the number of packets received, is stored to key 0. Key 1 is
+ *   incremented on each iteration by the number of bytes stored in
+ *   the skb.
+ *
+ * - Detaches any eBPF program previously attached to the cgroup
+ *
+ * - Attaches the new program to a cgroup using BPF_PROG_ATTACH
+ *
+ * - Every second, reads map[0] and map[1] to see how many bytes and
+ *   packets were seen on any socket of tasks in the given cgroup.
+ */
+
+#define _GNU_SOURCE
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+
+#include "libbpf.h"
+
+enum {
+   MAP_KEY_PACKETS,
+   MAP_KEY_BYTES,
+};
+
+static int prog_load

[PATCH v7 0/6] Add eBPF hooks for cgroups

2016-10-25 Thread Daniel Mack
ich
  will then automatically follow the pinned vs. effective logic.

* Fixed locking issues, as pointed out by Eric Dumazet and Alexei
  Starovoitov. Changes to the program array are now done with
  xchg() and are protected by cgroup_mutex.

* eBPF programs are now expected to return 1 to let the packet pass,
  not >= 0. Pointed out by Alexei.

* Operation is now limited to INET sockets, so local AF_UNIX sockets
  are not affected. The enum members are renamed accordingly. In case
  other socket families should be supported, this can be extended in
  the future.

* The sample program learned to support both ingress and egress, and
  can now optionally make the eBPF program drop packets by making it
  return 0.


Daniel Mack (6):
  bpf: add new prog type for cgroup socket filtering
  cgroup: add support for eBPF programs
  bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH commands
  net: filter: run cgroup eBPF ingress programs
  net: ipv4, ipv6: run cgroup eBPF egress programs
  samples: bpf: add userspace example for attaching eBPF programs to
cgroups

 include/linux/bpf-cgroup.h  |  71 +
 include/linux/cgroup-defs.h |   4 +
 include/uapi/linux/bpf.h|  17 
 init/Kconfig|  12 +++
 kernel/bpf/Makefile |   1 +
 kernel/bpf/cgroup.c | 167 
 kernel/bpf/syscall.c|  81 +++
 kernel/cgroup.c |  18 +
 net/core/filter.c   |  27 +++
 net/ipv4/ip_output.c|  17 
 net/ipv6/ip6_output.c   |   9 +++
 samples/bpf/Makefile|   2 +
 samples/bpf/libbpf.c|  21 +
 samples/bpf/libbpf.h|   3 +
 samples/bpf/test_cgrp2_attach.c | 147 +++
 15 files changed, 597 insertions(+)
 create mode 100644 include/linux/bpf-cgroup.h
 create mode 100644 kernel/bpf/cgroup.c
 create mode 100644 samples/bpf/test_cgrp2_attach.c

-- 
2.7.4



[PATCH v7 5/6] net: ipv4, ipv6: run cgroup eBPF egress programs

2016-10-25 Thread Daniel Mack
If the cgroup associated with the receiving socket has eBPF
programs installed, run them from ip_output(), ip6_output() and
ip_mc_output().

eBPF programs used in this context are expected to either return 1 to
let the packet pass, or != 1 to drop them. The programs have access to
the skb through bpf_skb_load_bytes(), and the payload starts at the
network headers (L3).
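
As a hedged sketch (not taken from this series), a minimal program of
this type written in restricted C and built with clang -target bpf
could look as follows; the section names are only conventions of
whatever loader is used and are an assumption here:

	#include <linux/bpf.h>

	/* Minimal BPF_PROG_TYPE_CGROUP_SKB program: let every packet pass.
	 * Returning anything other than 1 would drop the packet. */
	__attribute__((section("cgroup_skb"), used))
	int pass_all(struct __sk_buff *skb)
	{
		return 1;
	}

	char _license[] __attribute__((section("license"), used)) = "GPL";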

Note that cgroup_bpf_run_filter() is stubbed out as static inline nop
for !CONFIG_CGROUP_BPF, and is otherwise guarded by a static key if
the feature is unused.

Signed-off-by: Daniel Mack <dan...@zonque.org>
Acked-by: Alexei Starovoitov <a...@kernel.org>
---
 net/ipv4/ip_output.c  | 17 +
 net/ipv6/ip6_output.c |  9 +
 2 files changed, 26 insertions(+)

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 05d1058..ee4b249 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -74,6 +74,7 @@
 #include 
 #include 
 #include 
+#include <linux/bpf-cgroup.h>
 #include 
 #include 
 #include 
@@ -303,6 +304,7 @@ int ip_mc_output(struct net *net, struct sock *sk, struct 
sk_buff *skb)
 {
struct rtable *rt = skb_rtable(skb);
struct net_device *dev = rt->dst.dev;
+   int ret;
 
/*
 *  If the indicated interface is up and running, send the packet.
@@ -312,6 +314,13 @@ int ip_mc_output(struct net *net, struct sock *sk, struct 
sk_buff *skb)
skb->dev = dev;
skb->protocol = htons(ETH_P_IP);
 
+   ret = cgroup_bpf_run_filter(sk_to_full_sk(sk), skb,
+   BPF_CGROUP_INET_EGRESS);
+   if (ret) {
+   kfree_skb(skb);
+   return ret;
+   }
+
/*
 *  Multicasts are looped back for other local users
 */
@@ -364,12 +373,20 @@ int ip_mc_output(struct net *net, struct sock *sk, struct 
sk_buff *skb)
 int ip_output(struct net *net, struct sock *sk, struct sk_buff *skb)
 {
struct net_device *dev = skb_dst(skb)->dev;
+   int ret;
 
IP_UPD_PO_STATS(net, IPSTATS_MIB_OUT, skb->len);
 
skb->dev = dev;
skb->protocol = htons(ETH_P_IP);
 
+   ret = cgroup_bpf_run_filter(sk_to_full_sk(sk), skb,
+   BPF_CGROUP_INET_EGRESS);
+   if (ret) {
+   kfree_skb(skb);
+   return ret;
+   }
+
return NF_HOOK_COND(NFPROTO_IPV4, NF_INET_POST_ROUTING,
net, sk, skb, NULL, dev,
ip_finish_output,
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 6001e78..1947026 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -39,6 +39,7 @@
 #include 
 #include 
 
+#include <linux/bpf-cgroup.h>
 #include 
 #include 
 
@@ -143,6 +144,7 @@ int ip6_output(struct net *net, struct sock *sk, struct 
sk_buff *skb)
 {
struct net_device *dev = skb_dst(skb)->dev;
struct inet6_dev *idev = ip6_dst_idev(skb_dst(skb));
+   int ret;
 
if (unlikely(idev->cnf.disable_ipv6)) {
IP6_INC_STATS(net, idev, IPSTATS_MIB_OUTDISCARDS);
@@ -150,6 +152,13 @@ int ip6_output(struct net *net, struct sock *sk, struct 
sk_buff *skb)
return 0;
}
 
+   ret = cgroup_bpf_run_filter(sk_to_full_sk(sk), skb,
+   BPF_CGROUP_INET_EGRESS);
+   if (ret) {
+   kfree_skb(skb);
+   return ret;
+   }
+
return NF_HOOK_COND(NFPROTO_IPV6, NF_INET_POST_ROUTING,
net, sk, skb, NULL, dev,
ip6_finish_output,
-- 
2.7.4



[PATCH v7 1/6] bpf: add new prog type for cgroup socket filtering

2016-10-25 Thread Daniel Mack
This program type is similar to BPF_PROG_TYPE_SOCKET_FILTER, except that
it does not allow BPF_LD_[ABS|IND] instructions and hooks up the
bpf_skb_load_bytes() helper.

Programs of this type will be attached to cgroups for network filtering
and accounting.

Signed-off-by: Daniel Mack <dan...@zonque.org>
Acked-by: Alexei Starovoitov <a...@kernel.org>
---
 include/uapi/linux/bpf.h |  9 +
 net/core/filter.c| 23 +++
 2 files changed, 32 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index f09c70b..1f3e6f1 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -96,8 +96,17 @@ enum bpf_prog_type {
BPF_PROG_TYPE_TRACEPOINT,
BPF_PROG_TYPE_XDP,
BPF_PROG_TYPE_PERF_EVENT,
+   BPF_PROG_TYPE_CGROUP_SKB,
 };
 
+enum bpf_attach_type {
+   BPF_CGROUP_INET_INGRESS,
+   BPF_CGROUP_INET_EGRESS,
+   __MAX_BPF_ATTACH_TYPE
+};
+
+#define MAX_BPF_ATTACH_TYPE __MAX_BPF_ATTACH_TYPE
+
 #define BPF_PSEUDO_MAP_FD  1
 
 /* flags for BPF_MAP_UPDATE_ELEM command */
diff --git a/net/core/filter.c b/net/core/filter.c
index 00351cd..e3813d6 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2576,6 +2576,17 @@ xdp_func_proto(enum bpf_func_id func_id)
}
 }
 
+static const struct bpf_func_proto *
+cg_skb_func_proto(enum bpf_func_id func_id)
+{
+   switch (func_id) {
+   case BPF_FUNC_skb_load_bytes:
+   return &bpf_skb_load_bytes_proto;
+   default:
+   return sk_filter_func_proto(func_id);
+   }
+}
+
 static bool __is_valid_access(int off, int size, enum bpf_access_type type)
 {
if (off < 0 || off >= sizeof(struct __sk_buff))
@@ -2938,6 +2949,12 @@ static const struct bpf_verifier_ops xdp_ops = {
.convert_ctx_access = xdp_convert_ctx_access,
 };
 
+static const struct bpf_verifier_ops cg_skb_ops = {
+   .get_func_proto = cg_skb_func_proto,
+   .is_valid_access= sk_filter_is_valid_access,
+   .convert_ctx_access = sk_filter_convert_ctx_access,
+};
+
 static struct bpf_prog_type_list sk_filter_type __read_mostly = {
.ops= &sk_filter_ops,
.type   = BPF_PROG_TYPE_SOCKET_FILTER,
@@ -2958,12 +2975,18 @@ static struct bpf_prog_type_list xdp_type __read_mostly 
= {
.type   = BPF_PROG_TYPE_XDP,
 };
 
+static struct bpf_prog_type_list cg_skb_type __read_mostly = {
+   .ops= &cg_skb_ops,
+   .type   = BPF_PROG_TYPE_CGROUP_SKB,
+};
+
 static int __init register_sk_filter_ops(void)
 {
bpf_register_prog_type(&sk_filter_type);
bpf_register_prog_type(&sched_cls_type);
bpf_register_prog_type(&sched_act_type);
bpf_register_prog_type(&xdp_type);
+   bpf_register_prog_type(&cg_skb_type);
 
return 0;
 }
-- 
2.7.4



[PATCH v7 4/6] net: filter: run cgroup eBPF ingress programs

2016-10-25 Thread Daniel Mack
If the cgroup associated with the receiving socket has eBPF
programs installed, run them from sk_filter_trim_cap().

eBPF programs used in this context are expected to either return 1 to
let the packet pass, or != 1 to drop them. The programs have access to
the skb through bpf_skb_load_bytes(), and the payload starts at the
network headers (L3).

Note that cgroup_bpf_run_filter() is stubbed out as static inline nop
for !CONFIG_CGROUP_BPF, and is otherwise guarded by a static key if
the feature is unused.

Signed-off-by: Daniel Mack <dan...@zonque.org>
Acked-by: Alexei Starovoitov <a...@kernel.org>
---
 net/core/filter.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/net/core/filter.c b/net/core/filter.c
index e3813d6..bd6eebe 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -78,6 +78,10 @@ int sk_filter_trim_cap(struct sock *sk, struct sk_buff *skb, 
unsigned int cap)
if (skb_pfmemalloc(skb) && !sock_flag(sk, SOCK_MEMALLOC))
return -ENOMEM;
 
+   err = cgroup_bpf_run_filter(sk, skb, BPF_CGROUP_INET_INGRESS);
+   if (err)
+   return err;
+
err = security_sock_rcv_skb(sk, skb);
if (err)
return err;
-- 
2.7.4



Re: [PATCH v6 5/6] net: ipv4, ipv6: run cgroup eBPF egress programs

2016-09-22 Thread Daniel Mack
On 09/22/2016 05:12 PM, Daniel Borkmann wrote:
> On 09/22/2016 02:05 PM, Pablo Neira Ayuso wrote:

>> Benefits are, rewording previous email:
>>
>> * You get access to all of the existing netfilter hooks in one go
>>to run bpf programs. No need for specific redundant hooks. This
>>provides raw access to the netfilter hook, you define the little
>>code that your hook runs before you bpf run invocation. So there
>>is *no need to bloat the stack with more hooks, we use what we
>>have.*
> 
> But also this doesn't really address the fundamental underlying problem
> that is discussed here. nft doesn't even have cgroups v2 support, only
> xt_cgroups has it so far, but even if it would have it, then it's still
> a scalability issue that this model has over what is being proposed by
> Daniel, since you still need to test linearly wrt cgroups v2 membership,
> whereas in the set that is proposed it's integral part of cgroups and can
> be extended further, also for non-networking users to use this facility.
> Or would the idea be that the current netfilter hooks would be redone in
> a way that they are generic enough so that any other user could make use
> of it independent of netfilter?

Yes, that part I don't understand either.

Pablo, could you outline in more detail (in terms of syscalls, commands,
resulting nftables layout etc) how your proposed model would support
having per-cgroup byte and packet counters for both ingress and egress,
and filtering at least for ingress?

And how would that mitigate the race gaps you have been worried about,
between cgroup creation and filters taking effect for a task?


Thanks,
Daniel


Re: [PATCH v6 5/6] net: ipv4, ipv6: run cgroup eBPF egress programs

2016-09-20 Thread Daniel Mack
Hi Pablo,

On 09/20/2016 04:29 PM, Pablo Neira Ayuso wrote:
> On Mon, Sep 19, 2016 at 10:56:14PM +0200, Daniel Mack wrote:
> [...]
>> Why would we artificially limit the use-cases of this implementation if
>> the way it stands, both filtering and introspection are possible?
> 
> Why should we place infrastructure in the kernel to filter packets so
> late, and why at postrouting btw, when we can do this way earlier
> before any packet is actually sent?

The point is that from an application's perspective, restricting the
ability to bind a port and dropping packets that are being sent is a
very different thing. Applications will start to behave differently if
they can't bind to a port, and that's something we do not want to happen.

Looking at packets and making a verdict on them is the only way to
implement what we have in mind. Given that's in line with what netfilter
does, it can't be all that wrong, can it?

> No performance impact, no need for
> skbuff allocation and *no cycles wasted to evaluate if every packet is
> wanted or not*.

Hmm, not sure why this keeps coming up. As I said - for accounting,
there is no other option than to look at every packet and its size.

Regarding the performance concerns, are you saying a netfilter based
implementation that uses counters for that purpose would be more
efficient? Because in my tests, just loading the netfilter modules with
no rules in place at all has more impact than running the code from 6/6
on every packet.

As stated before, I see no reason why we shouldn't have a netfilter
based implementation that can achieve the same, function-wise. And I
would also like to compare their throughput.


Thanks,
Daniel


Re: [PATCH v5 0/6] Add eBPF hooks for cgroups

2016-09-20 Thread Daniel Mack
On 09/19/2016 11:53 PM, Sargun Dhillon wrote:
> On Mon, Sep 19, 2016 at 06:34:28PM +0200, Daniel Mack wrote:
>> On 09/16/2016 09:57 PM, Sargun Dhillon wrote:

>>> Now, with this patch, we don't have that, but I think we can reasonably add 
>>> some 
>>> flag like "no override" when applying policies, or alternatively something 
>>> like 
>>> "no new privileges", to prevent children from applying policies that 
>>> override 
>>> top-level policy.
>>
>> Yes, but the API is already guarded by CAP_NET_ADMIN. Take that
>> capability away from your children, and they can't tamper with the
>> policy. Does that work for you?
>
> No. This can be addressed in a follow-on patch, but the use-case is that I 
> have 
> a container orchestrator (Docker, or Mesos), and systemd. The sysadmin 
> controls 
> systemd, and Docker is controlled by devs. Typically, the system owner wants 
> some system level statistics, and filtering, and then we want to do 
> per-container filtering.
> 
> We really want to be able to do nesting with userspace tools that are 
> oblivious, 
> and we want to delegate a level of the cgroup hierarchy to the tool that 
> created 
> it. I do not see Docker integrating with systemd any time soon, and that's 
> really the only other alternative.

Then we'd need to find out whether you want to block other users from
installing (thus overriding) an existing eBPF program, or if you want to
allow that but execute them all. Both is possible.

[...]

>>> It would be nice to be able to see whether or not a filter is attached to a 
>>> cgroup, but given this is going through syscalls, at least introspection
>>> is possible as opposed to something like netlink.
>>
>> Sure, there are many ways. I implemented the bpf cgroup logic using an
>> own cgroup controller once, which made it possible to read out the
>> status. But as we agreed on attaching programs through the bpf(2) system
>> call, I moved back to the implementation that directly stores the
>> pointers in the cgroup.
>>
>> First enabling the controller through the fs-backed cgroup interface,
>> then come back through the bpf(2) syscall and then go back to the fs
>> interface to read out status values is a bit weird.
>>
> Hrm, that makes sense. with the BPF syscall, would there be a way to get
> file descriptor of the currently attached BPF program?

A file descriptor is local to a task, so we would need to install a new
fd and return its number. But I'm not sure what we'd gain from that.


Thanks,
Daniel



Re: [PATCH v6 5/6] net: ipv4, ipv6: run cgroup eBPF egress programs

2016-09-19 Thread Daniel Mack
On 09/19/2016 10:35 PM, Pablo Neira Ayuso wrote:
> On Mon, Sep 19, 2016 at 09:30:02PM +0200, Daniel Mack wrote:
>> On 09/19/2016 09:19 PM, Pablo Neira Ayuso wrote:

>>> Actually, did you look at Google's approach to this problem?  They
>>> want to control this at socket level, so you restrict what the process
>>> can actually bind. That is enforcing the policy way before you even
>>> send packets. On top of that, what they submitted is infrastructured
>>> so any process with CAP_NET_ADMIN can access that policy that is being
>>> applied and fetch a readable policy through kernel interface.
>>
>> Yes, I've seen what they propose, but I want this approach to support
>> accounting, and so the code has to look at each and every packet in
>> order to count bytes and packets. Do you know of any better place to put
>> the hook then?
> 
> Accounting is part of the usecase that fits into the "network
> introspection" idea that has been mentioned here, so you can achieve
> this by adding a hook that returns no verdict, so this becomes similar
> to the tracing infrastructure.

Why would we artificially limit the use-cases of this implementation if
the way it stands, both filtering and introspection are possible?

> Filtering packets with cgroups is braindead.

Filtering is done via eBPF, and cgroups are just the containers. I don't
see what's brain-dead in that approach. After all, accessing the cgroup
once we have a local socket is really fast, so the idea is kinda obvious.

> You have the means to ensure that processes send no packets via
> restricting port binding, there is no reason to do this any later for
> locally generated traffic.

Yes, restricting port binding can be done on top, if people are worried
about the performance overhead of a per-packet program.



Thanks,
Daniel


Re: [PATCH v6 5/6] net: ipv4, ipv6: run cgroup eBPF egress programs

2016-09-19 Thread Daniel Mack
On 09/19/2016 09:19 PM, Pablo Neira Ayuso wrote:
> On Mon, Sep 19, 2016 at 06:44:00PM +0200, Daniel Mack wrote:
>> diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
>> index 6001e78..5dc90aa 100644
>> --- a/net/ipv6/ip6_output.c
>> +++ b/net/ipv6/ip6_output.c
>> @@ -39,6 +39,7 @@
>>  #include 
>>  #include 
>>  
>> +#include 
>>  #include 
>>  #include 
>>  
>> @@ -143,6 +144,7 @@ int ip6_output(struct net *net, struct sock *sk, struct 
>> sk_buff *skb)
>>  {
>>  struct net_device *dev = skb_dst(skb)->dev;
>>  struct inet6_dev *idev = ip6_dst_idev(skb_dst(skb));
>> +int ret;
>>  
>>  if (unlikely(idev->cnf.disable_ipv6)) {
>>  IP6_INC_STATS(net, idev, IPSTATS_MIB_OUTDISCARDS);
>> @@ -150,6 +152,12 @@ int ip6_output(struct net *net, struct sock *sk, struct 
>> sk_buff *skb)
>>  return 0;
>>  }
>>  
>> +ret = cgroup_bpf_run_filter(sk, skb, BPF_CGROUP_INET_EGRESS);
>> +if (ret) {
>> +kfree_skb(skb);
>> +return ret;
>> +}
> 
> 1) If your goal is to filter packets, why so late? The sooner you
>enforce your policy, the less cycles you waste.
> 
> Actually, did you look at Google's approach to this problem?  They
> want to control this at socket level, so you restrict what the process
> can actually bind. That is enforcing the policy way before you even
> send packets. On top of that, what they submitted is infrastructured
> so any process with CAP_NET_ADMIN can access that policy that is being
> applied and fetch a readable policy through kernel interface.

Yes, I've seen what they propose, but I want this approach to support
accounting, and so the code has to look at each and every packet in
order to count bytes and packets. Do you know of any better place to put
the hook then?

That said, I can well imagine more hooks types that also operate at port
bind time. That would be easy to add on top.

> 2) This will turn the stack into a nightmare to debug I predict. If
>any process with CAP_NET_ADMIN can potentially attach bpf blobs
>via these hooks, we will have to include in the network stack
>traveling documentation something like: "Probably you have to check
>that your orchestrator is not dropping your packets for some
>reason". So I wonder how users will debug this and how the policy that
>your orchestrator applies will be exposed to userspace.

Sure, every new limitation mechanism adds another knob to look at if
things don't work. But apart from taking care at userspace level to make
the behavior as obvious as possible, I'm open to suggestions of how to
improve the transparency of attached eBPF programs on the kernel side.


Thanks,
Daniel



[PATCH v6 2/6] cgroup: add support for eBPF programs

2016-09-19 Thread Daniel Mack
This patch adds two sets of eBPF program pointers to struct cgroup:
one for programs that are pinned directly to the cgroup, and one for
programs that are effective for it.

To illustrate the logic behind that, assume the following example
cgroup hierarchy.

  A - B - C
      \ D - E

If only B has a program attached, it will be effective for B, C, D
and E. If D then attaches a program itself, that will be effective for
both D and E, and the program in B will only affect B and C. Only one
program of a given type is effective for a cgroup.
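
To make that concrete, here is a small userspace model (an illustration
only, not kernel code; it looks the answer up on demand instead of
caching effective pointers the way the patch does) of which program
ends up effective for each node of the hierarchy above: a node uses its
own pinned program if it has one, otherwise the nearest ancestor's.

	#include <stdio.h>

	struct node {
		const char *name;
		int pinned;		/* 0 = nothing pinned here */
		struct node *parent;
	};

	static int effective(const struct node *n)
	{
		for (; n; n = n->parent)
			if (n->pinned)
				return n->pinned;
		return 0;
	}

	int main(void)
	{
		struct node a = { "A", 0, NULL };
		struct node b = { "B", 1, &a };	/* program 1 pinned at B */
		struct node c = { "C", 0, &b };
		struct node d = { "D", 2, &b };	/* program 2 pinned at D */
		struct node e = { "E", 0, &d };

		/* prints: effective for C: 1, for E: 2 */
		printf("effective for C: %d, for E: %d\n",
		       effective(&c), effective(&e));
		return 0;
	}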

Attaching and detaching programs will be done through the bpf(2)
syscall. For now, ingress and egress inet socket filtering are the
only supported use-cases.

Signed-off-by: Daniel Mack <dan...@zonque.org>
---
 include/linux/bpf-cgroup.h  |  71 +++
 include/linux/cgroup-defs.h |   4 ++
 init/Kconfig|  12 
 kernel/bpf/Makefile |   1 +
 kernel/bpf/cgroup.c | 166 
 kernel/cgroup.c |  18 +
 6 files changed, 272 insertions(+)
 create mode 100644 include/linux/bpf-cgroup.h
 create mode 100644 kernel/bpf/cgroup.c

diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
new file mode 100644
index 000..fc076de
--- /dev/null
+++ b/include/linux/bpf-cgroup.h
@@ -0,0 +1,71 @@
+#ifndef _BPF_CGROUP_H
+#define _BPF_CGROUP_H
+
+#include 
+#include 
+#include 
+
+struct sock;
+struct cgroup;
+struct sk_buff;
+
+#ifdef CONFIG_CGROUP_BPF
+
+extern struct static_key_false cgroup_bpf_enabled_key;
+#define cgroup_bpf_enabled static_branch_unlikely(&cgroup_bpf_enabled_key)
+
+struct cgroup_bpf {
+   /*
+* Store two sets of bpf_prog pointers, one for programs that are
+* pinned directly to this cgroup, and one for those that are effective
+* when this cgroup is accessed.
+*/
+   struct bpf_prog *prog[MAX_BPF_ATTACH_TYPE];
+   struct bpf_prog *effective[MAX_BPF_ATTACH_TYPE];
+};
+
+void cgroup_bpf_put(struct cgroup *cgrp);
+void cgroup_bpf_inherit(struct cgroup *cgrp, struct cgroup *parent);
+
+void __cgroup_bpf_update(struct cgroup *cgrp,
+struct cgroup *parent,
+struct bpf_prog *prog,
+enum bpf_attach_type type);
+
+/* Wrapper for __cgroup_bpf_update() protected by cgroup_mutex */
+void cgroup_bpf_update(struct cgroup *cgrp,
+  struct bpf_prog *prog,
+  enum bpf_attach_type type);
+
+int __cgroup_bpf_run_filter(struct sock *sk,
+   struct sk_buff *skb,
+   enum bpf_attach_type type);
+
+/* Wrapper for __cgroup_bpf_run_filter() guarded by cgroup_bpf_enabled */
+static inline int cgroup_bpf_run_filter(struct sock *sk,
+   struct sk_buff *skb,
+   enum bpf_attach_type type)
+{
+   if (cgroup_bpf_enabled)
+   return __cgroup_bpf_run_filter(sk, skb, type);
+
+   return 0;
+}
+
+#else
+
+struct cgroup_bpf {};
+static inline void cgroup_bpf_put(struct cgroup *cgrp) {}
+static inline void cgroup_bpf_inherit(struct cgroup *cgrp,
+ struct cgroup *parent) {}
+
+static inline int cgroup_bpf_run_filter(struct sock *sk,
+   struct sk_buff *skb,
+   enum bpf_attach_type type)
+{
+   return 0;
+}
+
+#endif /* CONFIG_CGROUP_BPF */
+
+#endif /* _BPF_CGROUP_H */
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 5b17de6..861b467 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -16,6 +16,7 @@
 #include 
 #include 
 #include 
+#include <linux/bpf-cgroup.h>
 
 #ifdef CONFIG_CGROUPS
 
@@ -300,6 +301,9 @@ struct cgroup {
/* used to schedule release agent */
struct work_struct release_agent_work;
 
+   /* used to store eBPF programs */
+   struct cgroup_bpf bpf;
+
/* ids of the ancestors at each level including self */
int ancestor_ids[];
 };
diff --git a/init/Kconfig b/init/Kconfig
index cac3f09..71c71b0 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1144,6 +1144,18 @@ config CGROUP_PERF
 
  Say N if unsure.
 
+config CGROUP_BPF
+   bool "Support for eBPF programs attached to cgroups"
+   depends on BPF_SYSCALL && SOCK_CGROUP_DATA
+   help
+ Allow attaching eBPF programs to a cgroup using the bpf(2)
+ syscall command BPF_PROG_ATTACH.
+
+ In which context these programs are accessed depends on the type
+ of attachment. For instance, programs that are attached using
+ BPF_CGROUP_INET_INGRESS will be executed on the ingress path of
+ inet sockets.
+
 config CGROUP_DEBUG
bool "Example controller"
default n
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index eed911d..b22256b 100644
--- a/kernel/bpf/Makefile

[PATCH v6 0/6] Add eBPF hooks for cgroups

2016-09-19 Thread Daniel Mack
This is v6 of the patch set to allow eBPF programs for network
filtering and accounting to be attached to cgroups, so that they apply
to all sockets of all tasks placed in that cgroup. The logic also
allows to be extended for other cgroup based eBPF logic.


Changes from v5:

* The eBPF programs now operate on L3 rather than on L2 of the packets,
  and the egress hooks were moved from __dev_queue_xmit() to
  ip*_output().

* For BPF_PROG_TYPE_CGROUP_SOCKET, disallow direct access to the skb
  through BPF_LD_[ABS|IND] instructions, but hook up the
  bpf_skb_load_bytes() access helper instead. Thanks to Daniel Borkmann
  for the help.


Changes from v4:

* Plug an skb leak when dropping packets due to eBPF verdicts in
  __dev_queue_xmit(). Spotted by Daniel Borkmann.

* Check for sk_fullsock(sk) in __cgroup_bpf_run_filter() so we don't
  operate on timewait or request sockets. Suggested by Daniel Borkmann.

* Add missing @parent parameter in kerneldoc of __cgroup_bpf_update().
  Spotted by Rami Rosen.

* Include linux/jump_label.h from bpf-cgroup.h to fix a kbuild error.


Changes from v3:

* Dropped the _FILTER suffix from BPF_PROG_TYPE_CGROUP_SOCKET_FILTER,
  renamed BPF_ATTACH_TYPE_CGROUP_INET_{E,IN}GRESS to
  BPF_CGROUP_INET_{IN,E}GRESS and alias BPF_MAX_ATTACH_TYPE to
  __BPF_MAX_ATTACH_TYPE, as suggested by Daniel Borkmann.

* Dropped the attach_flags member from the anonymous struct for BPF
  attach operations in union bpf_attr. They can be added later on via
  CHECK_ATTR. Requested by Daniel Borkmann and Alexei.

* Release old_prog at the end of __cgroup_bpf_update rather that at
  the beginning to fix a race gap between program updates and their
  users. Spotted by Daniel Borkmann.

* Plugged an skb leak when dropping packets on the egress path.
  Spotted by Daniel Borkmann.

* Add cgro...@vger.kernel.org to the loop, as suggested by Rami Rosen.

* Some minor coding style adoptions not worth mentioning in particular.


Changes from v2:

* Fixed the RCU locking details Tejun pointed out.

* Assert bpf_attr.flags == 0 in BPF_PROG_DETACH syscall handler.


Changes from v1:

* Moved all bpf specific cgroup code into its own file, and stub
  out related functions for !CONFIG_CGROUP_BPF as static inline nops.
  This way, the call sites are not cluttered with #ifdef guards while
  the feature remains compile-time configurable.

* Implemented the new scheme proposed by Tejun. Per cgroup, store one
  set of pointers that are pinned to the cgroup, and one for the
  programs that are effective. When a program is attached or detached,
  the change is propagated to all the cgroup's descendants. If a
  subcgroup has its own pinned program, skip the whole subbranch in
  order to allow delegation models.

* The hookup for egress packets is now done from __dev_queue_xmit().

* A static key is now used in both the ingress and egress fast paths
  to keep performance penalties close to zero if the feature is
  not in use.

* Overall cleanup to make the accessors use the program arrays.
  This should make it much easier to add new program types, which
  will then automatically follow the pinned vs. effective logic.

* Fixed locking issues, as pointed out by Eric Dumazet and Alexei
  Starovoitov. Changes to the program array are now done with
  xchg() and are protected by cgroup_mutex.

* eBPF programs are now expected to return 1 to let the packet pass,
  not >= 0. Pointed out by Alexei.

* Operation is now limited to INET sockets, so local AF_UNIX sockets
  are not affected. The enum members are renamed accordingly. In case
  other socket families should be supported, this can be extended in
  the future.

* The sample program learned to support both ingress and egress, and
  can now optionally make the eBPF program drop packets by making it
  return 0.


As always, feedback is much appreciated.

Thanks,
Daniel


Daniel Mack (6):
  bpf: add new prog type for cgroup socket filtering
  cgroup: add support for eBPF programs
  bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH commands
  net: filter: run cgroup eBPF ingress programs
  net: ipv4, ipv6: run cgroup eBPF egress programs
  samples: bpf: add userspace example for attaching eBPF programs to
cgroups

 include/linux/bpf-cgroup.h  |  71 +
 include/linux/cgroup-defs.h |   4 +
 include/uapi/linux/bpf.h|  17 
 init/Kconfig|  12 +++
 kernel/bpf/Makefile |   1 +
 kernel/bpf/cgroup.c | 166 
 kernel/bpf/syscall.c|  81 
 kernel/cgroup.c |  18 +
 net/core/filter.c   |  27 +++
 net/ipv4/ip_output.c|  15 
 net/ipv6/ip6_output.c   |   8 ++
 samples/bpf/Makefile|   2 +
 samples/bpf/libbpf.c|  21 +
 samples/bpf/libbpf.h|   3 +
 samples/bpf/test_cgrp2_attach.c | 147 +++
 15 files changed, 

[PATCH v6 1/6] bpf: add new prog type for cgroup socket filtering

2016-09-19 Thread Daniel Mack
This program type is similar to BPF_PROG_TYPE_SOCKET_FILTER, except that
it does not allow BPF_LD_[ABS|IND] instructions and hooks up the
bpf_skb_load_bytes() helper.

Programs of this type will be attached to cgroups for network filtering
and accounting.

Signed-off-by: Daniel Mack <dan...@zonque.org>
---
 include/uapi/linux/bpf.h |  9 +
 net/core/filter.c| 23 +++
 2 files changed, 32 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index f896dfa..55f815e 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -96,8 +96,17 @@ enum bpf_prog_type {
BPF_PROG_TYPE_TRACEPOINT,
BPF_PROG_TYPE_XDP,
BPF_PROG_TYPE_PERF_EVENT,
+   BPF_PROG_TYPE_CGROUP_SOCKET,
 };
 
+enum bpf_attach_type {
+   BPF_CGROUP_INET_INGRESS,
+   BPF_CGROUP_INET_EGRESS,
+   __MAX_BPF_ATTACH_TYPE
+};
+
+#define MAX_BPF_ATTACH_TYPE __MAX_BPF_ATTACH_TYPE
+
 #define BPF_PSEUDO_MAP_FD  1
 
 /* flags for BPF_MAP_UPDATE_ELEM command */
diff --git a/net/core/filter.c b/net/core/filter.c
index 298b146..e46c98e 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2496,6 +2496,17 @@ xdp_func_proto(enum bpf_func_id func_id)
}
 }
 
+static const struct bpf_func_proto *
+cg_sk_func_proto(enum bpf_func_id func_id)
+{
+   switch (func_id) {
+   case BPF_FUNC_skb_load_bytes:
+   return &bpf_skb_load_bytes_proto;
+   default:
+   return sk_filter_func_proto(func_id);
+   }
+}
+
 static bool __is_valid_access(int off, int size, enum bpf_access_type type)
 {
if (off < 0 || off >= sizeof(struct __sk_buff))
@@ -2818,6 +2829,12 @@ static const struct bpf_verifier_ops xdp_ops = {
.convert_ctx_access = xdp_convert_ctx_access,
 };
 
+static const struct bpf_verifier_ops cg_sk_ops = {
+   .get_func_proto = cg_sk_func_proto,
+   .is_valid_access= sk_filter_is_valid_access,
+   .convert_ctx_access = sk_filter_convert_ctx_access,
+};
+
 static struct bpf_prog_type_list sk_filter_type __read_mostly = {
.ops= &sk_filter_ops,
.type   = BPF_PROG_TYPE_SOCKET_FILTER,
@@ -2838,12 +2855,18 @@ static struct bpf_prog_type_list xdp_type __read_mostly 
= {
.type   = BPF_PROG_TYPE_XDP,
 };
 
+static struct bpf_prog_type_list cg_sk_type __read_mostly = {
+   .ops= &cg_sk_ops,
+   .type   = BPF_PROG_TYPE_CGROUP_SOCKET,
+};
+
 static int __init register_sk_filter_ops(void)
 {
bpf_register_prog_type(&sk_filter_type);
bpf_register_prog_type(&sched_cls_type);
bpf_register_prog_type(&sched_act_type);
bpf_register_prog_type(&xdp_type);
+   bpf_register_prog_type(&cg_sk_type);
 
return 0;
 }
-- 
2.5.5



[PATCH v6 3/6] bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH commands

2016-09-19 Thread Daniel Mack
Extend the bpf(2) syscall by two new commands, BPF_PROG_ATTACH and
BPF_PROG_DETACH which allow attaching and detaching eBPF programs
to a target.

On the API level, the target could be anything that has an fd in
userspace, hence the name of the field in union bpf_attr is called
'target_fd'.

When called with BPF_ATTACH_TYPE_CGROUP_INET_{E,IN}GRESS, the target is
expected to be a valid file descriptor of a cgroup v2 directory which
has the bpf controller enabled. These are the only use-cases
implemented by this patch at this point, but more can be added.

If a program of the given type already exists in the given cgroup,
the program is swapped atomically, so userspace does not have to drop
an existing program first before installing a new one, which would
otherwise leave a gap in which no program is attached.

For more information on the propagation logic to subcgroups, please
refer to the bpf cgroup controller implementation.

The API is guarded by CAP_NET_ADMIN.

Signed-off-by: Daniel Mack <dan...@zonque.org>
---
 include/uapi/linux/bpf.h |  8 +
 kernel/bpf/syscall.c | 81 
 2 files changed, 89 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 55f815e..7cd3616 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -73,6 +73,8 @@ enum bpf_cmd {
BPF_PROG_LOAD,
BPF_OBJ_PIN,
BPF_OBJ_GET,
+   BPF_PROG_ATTACH,
+   BPF_PROG_DETACH,
 };
 
 enum bpf_map_type {
@@ -150,6 +152,12 @@ union bpf_attr {
__aligned_u64   pathname;
__u32   bpf_fd;
};
+
+   struct { /* anonymous struct used by BPF_PROG_ATTACH/DETACH commands */
+   __u32   target_fd;  /* container object to attach 
to */
+   __u32   attach_bpf_fd;  /* eBPF program to attach */
+   __u32   attach_type;
+   };
 } __attribute__((aligned(8)));
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 228f962..1a8592a 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -822,6 +822,77 @@ static int bpf_obj_get(const union bpf_attr *attr)
return bpf_obj_get_user(u64_to_ptr(attr->pathname));
 }
 
+#ifdef CONFIG_CGROUP_BPF
+
+#define BPF_PROG_ATTACH_LAST_FIELD attach_type
+
+static int bpf_prog_attach(const union bpf_attr *attr)
+{
+   struct bpf_prog *prog;
+   struct cgroup *cgrp;
+
+   if (!capable(CAP_NET_ADMIN))
+   return -EPERM;
+
+   if (CHECK_ATTR(BPF_PROG_ATTACH))
+   return -EINVAL;
+
+   switch (attr->attach_type) {
+   case BPF_CGROUP_INET_INGRESS:
+   case BPF_CGROUP_INET_EGRESS:
+   prog = bpf_prog_get_type(attr->attach_bpf_fd,
+BPF_PROG_TYPE_CGROUP_SOCKET);
+   if (IS_ERR(prog))
+   return PTR_ERR(prog);
+
+   cgrp = cgroup_get_from_fd(attr->target_fd);
+   if (IS_ERR(cgrp)) {
+   bpf_prog_put(prog);
+   return PTR_ERR(cgrp);
+   }
+
+   cgroup_bpf_update(cgrp, prog, attr->attach_type);
+   cgroup_put(cgrp);
+   break;
+
+   default:
+   return -EINVAL;
+   }
+
+   return 0;
+}
+
+#define BPF_PROG_DETACH_LAST_FIELD attach_type
+
+static int bpf_prog_detach(const union bpf_attr *attr)
+{
+   struct cgroup *cgrp;
+
+   if (!capable(CAP_NET_ADMIN))
+   return -EPERM;
+
+   if (CHECK_ATTR(BPF_PROG_DETACH))
+   return -EINVAL;
+
+   switch (attr->attach_type) {
+   case BPF_CGROUP_INET_INGRESS:
+   case BPF_CGROUP_INET_EGRESS:
+   cgrp = cgroup_get_from_fd(attr->target_fd);
+   if (IS_ERR(cgrp))
+   return PTR_ERR(cgrp);
+
+   cgroup_bpf_update(cgrp, NULL, attr->attach_type);
+   cgroup_put(cgrp);
+   break;
+
+   default:
+   return -EINVAL;
+   }
+
+   return 0;
+}
+#endif /* CONFIG_CGROUP_BPF */
+
 SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, 
size)
 {
union bpf_attr attr = {};
@@ -888,6 +959,16 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, 
uattr, unsigned int, siz
case BPF_OBJ_GET:
err = bpf_obj_get(&attr);
break;
+
+#ifdef CONFIG_CGROUP_BPF
+   case BPF_PROG_ATTACH:
+   err = bpf_prog_attach(&attr);
+   break;
+   case BPF_PROG_DETACH:
+   err = bpf_prog_detach(&attr);
+   break;
+#endif
+
default:
err = -EINVAL;
break;
-- 
2.5.5



[PATCH v6 4/6] net: filter: run cgroup eBPF ingress programs

2016-09-19 Thread Daniel Mack
If the cgroup associated with the receiving socket has eBPF
programs installed, run them from sk_filter_trim_cap().

eBPF programs used in this context are expected to either return 1 to
let the packet pass, or != 1 to drop them. The programs have access to
the skb through bpf_skb_load_bytes(), and the payload starts at the
network headers (L3).

Note that cgroup_bpf_run_filter() is stubbed out as static inline nop
for !CONFIG_CGROUP_BPF, and is otherwise guarded by a static key if
the feature is unused.

Signed-off-by: Daniel Mack <dan...@zonque.org>
---
 net/core/filter.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/net/core/filter.c b/net/core/filter.c
index e46c98e..ce6e527 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -78,6 +78,10 @@ int sk_filter_trim_cap(struct sock *sk, struct sk_buff *skb, 
unsigned int cap)
if (skb_pfmemalloc(skb) && !sock_flag(sk, SOCK_MEMALLOC))
return -ENOMEM;
 
+   err = cgroup_bpf_run_filter(sk, skb, BPF_CGROUP_INET_INGRESS);
+   if (err)
+   return err;
+
err = security_sock_rcv_skb(sk, skb);
if (err)
return err;
-- 
2.5.5



[PATCH v6 6/6] samples: bpf: add userspace example for attaching eBPF programs to cgroups

2016-09-19 Thread Daniel Mack
Add a simple userspace program to demonstrate the new API to attach eBPF
programs to cgroups. This is what it does:

 * Create arraymap in kernel with 4 byte keys and 8 byte values

 * Load eBPF program

   The eBPF program accesses the map passed in to store two pieces of
   information. The number of invocations of the program, which maps
   to the number of packets received, is stored to key 0. Key 1 is
   incremented on each iteration by the number of bytes stored in
   the skb.

 * Detach any eBPF program previously attached to the cgroup

 * Attach the new program to the cgroup using BPF_PROG_ATTACH

 * Once a second, read map[0] and map[1] to see how many bytes and
   packets were seen on any socket of tasks in the given cgroup.

The program takes a cgroup path as 1st argument, and either "ingress"
or "egress" as 2nd. Optionally, "drop" can be passed as 3rd argument,
which will make the generated eBPF program return 0 instead of 1, so
the kernel will drop the packet.

libbpf gained two new wrappers for the new syscall commands.

Signed-off-by: Daniel Mack <dan...@zonque.org>
---
 samples/bpf/Makefile|   2 +
 samples/bpf/libbpf.c|  21 ++
 samples/bpf/libbpf.h|   3 +
 samples/bpf/test_cgrp2_attach.c | 147 
 4 files changed, 173 insertions(+)
 create mode 100644 samples/bpf/test_cgrp2_attach.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 12b7304..e4cdc74 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -22,6 +22,7 @@ hostprogs-y += spintest
 hostprogs-y += map_perf_test
 hostprogs-y += test_overhead
 hostprogs-y += test_cgrp2_array_pin
+hostprogs-y += test_cgrp2_attach
 hostprogs-y += xdp1
 hostprogs-y += xdp2
 hostprogs-y += test_current_task_under_cgroup
@@ -49,6 +50,7 @@ spintest-objs := bpf_load.o libbpf.o spintest_user.o
 map_perf_test-objs := bpf_load.o libbpf.o map_perf_test_user.o
 test_overhead-objs := bpf_load.o libbpf.o test_overhead_user.o
 test_cgrp2_array_pin-objs := libbpf.o test_cgrp2_array_pin.o
+test_cgrp2_attach-objs := libbpf.o test_cgrp2_attach.o
 xdp1-objs := bpf_load.o libbpf.o xdp1_user.o
 # reuse xdp1 source intentionally
 xdp2-objs := bpf_load.o libbpf.o xdp1_user.o
diff --git a/samples/bpf/libbpf.c b/samples/bpf/libbpf.c
index 9969e35..9ce707b 100644
--- a/samples/bpf/libbpf.c
+++ b/samples/bpf/libbpf.c
@@ -104,6 +104,27 @@ int bpf_prog_load(enum bpf_prog_type prog_type,
return syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));
 }
 
+int bpf_prog_attach(int prog_fd, int target_fd, enum bpf_attach_type type)
+{
+   union bpf_attr attr = {
+   .target_fd = target_fd,
+   .attach_bpf_fd = prog_fd,
+   .attach_type = type,
+   };
+
+   return syscall(__NR_bpf, BPF_PROG_ATTACH, &attr, sizeof(attr));
+}
+
+int bpf_prog_detach(int target_fd, enum bpf_attach_type type)
+{
+   union bpf_attr attr = {
+   .target_fd = target_fd,
+   .attach_type = type,
+   };
+
+   return syscall(__NR_bpf, BPF_PROG_DETACH, &attr, sizeof(attr));
+}
+
 int bpf_obj_pin(int fd, const char *pathname)
 {
union bpf_attr attr = {
diff --git a/samples/bpf/libbpf.h b/samples/bpf/libbpf.h
index 364582b..f973241 100644
--- a/samples/bpf/libbpf.h
+++ b/samples/bpf/libbpf.h
@@ -15,6 +15,9 @@ int bpf_prog_load(enum bpf_prog_type prog_type,
  const struct bpf_insn *insns, int insn_len,
  const char *license, int kern_version);
 
+int bpf_prog_attach(int prog_fd, int attachable_fd, enum bpf_attach_type type);
+int bpf_prog_detach(int attachable_fd, enum bpf_attach_type type);
+
 int bpf_obj_pin(int fd, const char *pathname);
 int bpf_obj_get(const char *pathname);
 
diff --git a/samples/bpf/test_cgrp2_attach.c b/samples/bpf/test_cgrp2_attach.c
new file mode 100644
index 000..19e4ec0
--- /dev/null
+++ b/samples/bpf/test_cgrp2_attach.c
@@ -0,0 +1,147 @@
+/* eBPF example program:
+ *
+ * - Creates arraymap in kernel with 4 bytes keys and 8 byte values
+ *
+ * - Loads eBPF program
+ *
+ *   The eBPF program accesses the map passed in to store two pieces of
+ *   information. The number of invocations of the program, which maps
+ *   to the number of packets received, is stored to key 0. Key 1 is
+ *   incremented on each iteration by the number of bytes stored in
+ *   the skb.
+ *
+ * - Detaches any eBPF program previously attached to the cgroup
+ *
+ * - Attaches the new program to a cgroup using BPF_PROG_ATTACH
+ *
+ * - Every second, reads map[0] and map[1] to see how many bytes and
+ *   packets were seen on any socket of tasks in the given cgroup.
+ */
+
+#define _GNU_SOURCE
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+
+#include "libbpf.h"
+
+enum {
+   MAP_KEY_PACKETS,
+   MAP_KEY_BYTES,
+};
+
+static int prog_load(int map_fd, int verdict)
+{
+   struct bpf

[PATCH v6 5/6] net: ipv4, ipv6: run cgroup eBPF egress programs

2016-09-19 Thread Daniel Mack
If the cgroup associated with the receiving socket has eBPF
programs installed, run them from ip_output(), ip6_output() and
ip_mc_output().

eBPF programs used in this context are expected to either return 1 to
let the packet pass, or != 1 to drop them. The programs have access to
the skb through bpf_skb_load_bytes(), and the payload starts at the
network headers (L3).

Note that cgroup_bpf_run_filter() is stubbed out as static inline nop
for !CONFIG_CGROUP_BPF, and is otherwise guarded by a static key if
the feature is unused.

Signed-off-by: Daniel Mack <dan...@zonque.org>
---
 net/ipv4/ip_output.c  | 15 +++
 net/ipv6/ip6_output.c |  8 
 2 files changed, 23 insertions(+)

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 05d1058..3ca3d7a 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -74,6 +74,7 @@
 #include 
 #include 
 #include 
+#include <linux/bpf-cgroup.h>
 #include 
 #include 
 #include 
@@ -303,6 +304,7 @@ int ip_mc_output(struct net *net, struct sock *sk, struct 
sk_buff *skb)
 {
struct rtable *rt = skb_rtable(skb);
struct net_device *dev = rt->dst.dev;
+   int ret;
 
/*
 *  If the indicated interface is up and running, send the packet.
@@ -312,6 +314,12 @@ int ip_mc_output(struct net *net, struct sock *sk, struct 
sk_buff *skb)
skb->dev = dev;
skb->protocol = htons(ETH_P_IP);
 
+   ret = cgroup_bpf_run_filter(sk, skb, BPF_CGROUP_INET_EGRESS);
+   if (ret) {
+   kfree_skb(skb);
+   return ret;
+   }
+
/*
 *  Multicasts are looped back for other local users
 */
@@ -364,12 +372,19 @@ int ip_mc_output(struct net *net, struct sock *sk, struct 
sk_buff *skb)
 int ip_output(struct net *net, struct sock *sk, struct sk_buff *skb)
 {
struct net_device *dev = skb_dst(skb)->dev;
+   int ret;
 
IP_UPD_PO_STATS(net, IPSTATS_MIB_OUT, skb->len);
 
skb->dev = dev;
skb->protocol = htons(ETH_P_IP);
 
+   ret = cgroup_bpf_run_filter(sk, skb, BPF_CGROUP_INET_EGRESS);
+   if (ret) {
+   kfree_skb(skb);
+   return ret;
+   }
+
return NF_HOOK_COND(NFPROTO_IPV4, NF_INET_POST_ROUTING,
net, sk, skb, NULL, dev,
ip_finish_output,
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 6001e78..5dc90aa 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -39,6 +39,7 @@
 #include 
 #include 
 
+#include <linux/bpf-cgroup.h>
 #include 
 #include 
 
@@ -143,6 +144,7 @@ int ip6_output(struct net *net, struct sock *sk, struct 
sk_buff *skb)
 {
struct net_device *dev = skb_dst(skb)->dev;
struct inet6_dev *idev = ip6_dst_idev(skb_dst(skb));
+   int ret;
 
if (unlikely(idev->cnf.disable_ipv6)) {
IP6_INC_STATS(net, idev, IPSTATS_MIB_OUTDISCARDS);
@@ -150,6 +152,12 @@ int ip6_output(struct net *net, struct sock *sk, struct 
sk_buff *skb)
return 0;
}
 
+   ret = cgroup_bpf_run_filter(sk, skb, BPF_CGROUP_INET_EGRESS);
+   if (ret) {
+   kfree_skb(skb);
+   return ret;
+   }
+
return NF_HOOK_COND(NFPROTO_IPV6, NF_INET_POST_ROUTING,
net, sk, skb, NULL, dev,
ip6_finish_output,
-- 
2.5.5



Re: [PATCH v5 0/6] Add eBPF hooks for cgroups

2016-09-19 Thread Daniel Mack
Hi,

On 09/16/2016 09:57 PM, Sargun Dhillon wrote:
> On Wed, Sep 14, 2016 at 01:13:16PM +0200, Daniel Mack wrote:

>> I have no idea what makes you think this is limited to systemd. As I
>> said, I provided an example for userspace that works from the command
>> line. The same limitation apply as for all other users of cgroups.
>>
> So, at least in my work, we have Mesos, but on nearly every machine that 
> Mesos 
> runs, people also have systemd. Now, there's recently become a bit of a 
> battle 
> of ownership of things like cgroups on these machines. We can usually solve 
> it 
> by nesting under systemd cgroups, and thus so far we've avoided making too 
> many 
> systemd-specific concessions.
> 
> The reason this works (mostly), is because everything we touch has a sense of 
> nesting, where we can apply policy at a place lower in the hierarchy, and yet 
> systemd's monitoring and policy still stays in place. 
> 
> Now, with this patch, we don't have that, but I think we can reasonably add 
> some 
> flag like "no override" when applying policies, or alternatively something 
> like 
> "no new privileges", to prevent children from applying policies that override 
> top-level policy.

Yes, but the API is already guarded by CAP_NET_ADMIN. Take that
capability away from your children, and they can't tamper with the
policy. Does that work for you?

> I realize there is a speed concern as well, but I think for 
> people who want nested policy, we're willing to make the tradeoff. The cost
> of traversing a few extra pointers still outweighs the overhead of network
> namespaces, iptables, etc.. for many of us. 

Not sure. Have you tried it?

> What do you think Daniel?

I think we should look at an implementation once we really need it, and
then revisit the performance impact. In any case, this can be changed
under the hood, without touching the userspace API (except for adding
flags if we need them).

>> Not necessarily. You can as well do it the inetd way, and pass the
>> socket to a process that is launched on demand, but do SO_ATTACH_FILTER
>> + SO_LOCK_FILTER  in the middle. What happens with payload on the socket
>> is not transparent to the launched binary at all. The proposed cgroup
>> eBPF solution implements a very similar behavior in that regard.
>
> It would be nice to be able to see whether or not a filter is attached to a 
> cgroup, but given this is going through syscalls, at least introspection
> is possible as opposed to something like netlink.

Sure, there are many ways. I implemented the bpf cgroup logic using an
own cgroup controller once, which made it possible to read out the
status. But as we agreed on attaching programs through the bpf(2) system
call, I moved back to the implementation that directly stores the
pointers in the cgroup.

First enabling the controller through the fs-backed cgroup interface,
then come back through the bpf(2) syscall and then go back to the fs
interface to read out status values is a bit weird.

>> And FWIW, I agree with Thomas - there is nothing wrong with having
>> multiple options to use for such use-cases.
>
> Right now, for containers, we have netfilter and network namespaces.
> There's a lot of performance overhead that comes with this.

Out of curiosity: Could you express that in numbers? And how exactly are
you testing?

> Not only
> that, but iptables doesn't really have a simple way of usage by
> automated infrastructure. We (firewalld, systemd, dockerd, mesos)
> end up fighting with one another for ownership over firewall rules.

Yes, that's a common problem.

> Although I have problems with this approach, I think that it's
> a good baseline where we can have the top level owned by systemd,
> docker underneath that, and Mesos underneath that. We can add
> additional hooks for things like Checmate and Landlock, and
> with a little more work, we can do composition, solving
> all of our problems.

It is supposed to be just a baseline, yes.


Thanks for your feedback,
Daniel



Re: [PATCH v5 0/6] Add eBPF hooks for cgroups

2016-09-15 Thread Daniel Mack
On 09/15/2016 08:36 AM, Vincent Bernat wrote:
>  ❦ 12 septembre 2016 18:12 CEST, Daniel Mack <dan...@zonque.org> :
> 
>> * The sample program learned to support both ingress and egress, and
>>   can now optionally make the eBPF program drop packets by making it
>>   return 0.
> 
> Ability to lock the eBPF program to avoid modification from a later
> program or in a subcgroup would be pretty interesting from a security
> perspective.

For now, you can achieve that by dropping CAP_NET_ADMIN after installing
a program, between fork and exec. I think that should suffice for a first
version. Flags to further limit that could be added later.
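
Untested sketch, just to illustrate what I mean (the launcher attaches the
cgroup program while still privileged, then drops the capability before
exec'ing the workload; error handling trimmed, needs -lcap):

#include <stdio.h>
#include <unistd.h>
#include <sys/prctl.h>
#include <sys/capability.h>

static void drop_net_admin(void)
{
	cap_value_t v = CAP_NET_ADMIN;
	cap_t caps;

	/* remove it from the bounding set so it cannot come back via execve() */
	if (prctl(PR_CAPBSET_DROP, CAP_NET_ADMIN, 0, 0, 0))
		perror("prctl");

	/* and clear it from the effective, permitted and inheritable sets */
	caps = cap_get_proc();
	cap_set_flag(caps, CAP_EFFECTIVE, 1, &v, CAP_CLEAR);
	cap_set_flag(caps, CAP_PERMITTED, 1, &v, CAP_CLEAR);
	cap_set_flag(caps, CAP_INHERITABLE, 1, &v, CAP_CLEAR);
	if (cap_set_proc(caps))
		perror("cap_set_proc");
	cap_free(caps);
}

int main(int argc, char **argv)
{
	if (argc < 2)
		return 1;

	/* ... attach the eBPF program to the cgroup here, while privileged ... */

	drop_net_admin();
	execv(argv[1], &argv[1]);	/* child can no longer detach the program */
	perror("execv");
	return 1;
}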


Thanks,
Daniel


Re: [PATCH v5 0/6] Add eBPF hooks for cgroups

2016-09-14 Thread Daniel Mack
Hi Pablo,

On 09/13/2016 07:24 PM, Pablo Neira Ayuso wrote:
> On Tue, Sep 13, 2016 at 03:31:20PM +0200, Daniel Mack wrote:
>> On 09/13/2016 01:56 PM, Pablo Neira Ayuso wrote:
>>> On Mon, Sep 12, 2016 at 06:12:09PM +0200, Daniel Mack wrote:
>>>> This is v5 of the patch set to allow eBPF programs for network
>>>> filtering and accounting to be attached to cgroups, so that they apply
>>>> to all sockets of all tasks placed in that cgroup. The logic also
>>>> allows it to be extended for other cgroup-based eBPF logic.
>>>
>>> 1) This infrastructure can only be useful to systemd, or any similar
>>>orchestration daemon. Look, you can only apply filtering policies
>>>to processes that are launched by systemd, so this only works
>>>for server processes.
>>
>> Sorry, but both statements aren't true. The eBPF policies apply to every
>> process that is placed in a cgroup, and my example program in 6/6 shows
>> how that can be done from the command line.
> 
> Then you have to explain to me how anyone other than systemd can use this
> infrastructure?

I have no idea what makes you think this is limited to systemd. As I
said, I provided an example for userspace that works from the command
line. The same limitations apply as for all other users of cgroups.

> My main point is that those processes *need* to be launched by the
> orchestrator, which is what I was referring to as 'server processes'.

Yes, that's right. But as I said, this rule applies to many other kernel
concepts, so I don't see any real issue.

>> That's a limitation that applies to many more control mechanisms in the
>> kernel, and it's something that can easily be solved with fork+exec.
> 
> As long as you have control to launch the processes yes, but this
> will not work in other scenarios. Just like cgroup net_cls and friends
> are broken for filtering for things that you have no control to
> fork+exec.

Probably, but that's only solvable with rules that store the full cgroup
path then, and do a string comparison (!) for each packet flying by.

>> That's just as transparent as SO_ATTACH_FILTER. What kind of
>> introspection mechanism do you have in mind?
> 
> SO_ATTACH_FILTER is called from the process itself, so this is a local
> filtering policy that you apply to your own process.

Not necessarily. You can as well do it the inetd way, and pass the
socket to a process that is launched on demand, but do SO_ATTACH_FILTER
+ SO_LOCK_FILTER  in the middle. What happens with payload on the socket
is not transparent to the launched binary at all. The proposed cgroup
eBPF solution implements a very similar behavior in that regard.
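
To make that concrete, here is a rough sketch of what such an inetd-style
launcher does before handing over the socket (the filter contents and the
fallback defines are illustrative only; a real launcher would install an
actual policy):

#include <linux/filter.h>
#include <sys/socket.h>

#ifndef SO_ATTACH_FILTER
#define SO_ATTACH_FILTER 26	/* asm-generic value */
#endif
#ifndef SO_LOCK_FILTER
#define SO_LOCK_FILTER 44	/* asm-generic value */
#endif

static int attach_and_lock(int fd)
{
	/* placeholder "accept everything" program; a real policy goes here */
	struct sock_filter insns[] = {
		{ BPF_RET | BPF_K, 0, 0, 0x0000ffff },
	};
	struct sock_fprog prog = {
		.len = sizeof(insns) / sizeof(insns[0]),
		.filter = insns,
	};
	int one = 1;

	if (setsockopt(fd, SOL_SOCKET, SO_ATTACH_FILTER, &prog, sizeof(prog)))
		return -1;

	/* from here on, the exec'ed service cannot detach or replace it */
	return setsockopt(fd, SOL_SOCKET, SO_LOCK_FILTER, &one, sizeof(one));
}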

>> It's about filtering outgoing network packets of applications, and
>> providing them with L2 information for filtering purposes. I don't think
>> that's a very specific use-case.
>>
>> When the feature is not used at all, the added costs on the output path
>> are close to zero, due to the use of static branches.
> 
> *You're proposing a socket filtering facility that hooks layer 2
> output path*!

As I said, I'm open to discussing that. In order to make it work for L3,
the LL_OFF issues need to be solved, as Daniel explained. Daniel,
Alexei, any idea how much work that would be?

> That is only a rough ~30 lines kernel patchset to support this in
> netfilter and only one extra input hook, with potential access to
> conntrack and better integration with other existing subsystems.

Care to share the patches for that? I'd really like to have a look.

And FWIW, I agree with Thomas - there is nothing wrong with having
multiple options to use for such use-cases.


Thanks,
Daniel



Re: [PATCH v5 0/6] Add eBPF hooks for cgroups

2016-09-13 Thread Daniel Mack
Hi,

On 09/13/2016 01:56 PM, Pablo Neira Ayuso wrote:
> On Mon, Sep 12, 2016 at 06:12:09PM +0200, Daniel Mack wrote:
>> This is v5 of the patch set to allow eBPF programs for network
>> filtering and accounting to be attached to cgroups, so that they apply
>> to all sockets of all tasks placed in that cgroup. The logic also
>> allows it to be extended for other cgroup-based eBPF logic.
> 
> 1) This infrastructure can only be useful to systemd, or any similar
>orchestration daemon. Look, you can only apply filtering policies
>to processes that are launched by systemd, so this only works
>for server processes.

Sorry, but both statements aren't true. The eBPF policies apply to every
process that is placed in a cgroup, and my example program in 6/6 shows
how that can be done from the command line. Also, systemd is able to
control userspace processes just fine, and it not limited to 'server
processes'.

> For client processes this infrastructure is
>*racy*, you have to add new processes in runtime to the cgroup,
>thus there will be some small window of time where no filtering policy
>will be applied. For quality of service, this may be an acceptable
>race, but this is aiming to deploy a filtering policy.

That's a limitation that applies to many more control mechanisms in the
kernel, and it's something that can easily be solved with fork+exec.
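
I.e., something like this in the launcher, before the workload ever runs
(the cgroup path is just an example, error handling trimmed):

#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	FILE *f;

	if (argc < 2)
		return 1;

	/* join the cgroup the filter is (or will be) attached to ... */
	f = fopen("/sys/fs/cgroup/foo/cgroup.procs", "w");
	if (!f)
		return 1;
	fprintf(f, "%d\n", getpid());
	fclose(f);

	/* ... and only then become the untrusted workload */
	execv(argv[1], &argv[1]);
	perror("execv");
	return 1;
}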

> 2) This approach looks uninfrastructured to me. This provides a hook
>to push a bpf blob at a place in the stack that deploys a filtering
>policy that is not visible to others.

That's just as transparent as SO_ATTACH_FILTER. What kind of
introspection mechanism do you have in mind?

> We have interfaces that allow
>us to dump the filtering policy that is being applied, report events
>to enable cooperation between several processes with similar
>capabilities and so on.

Well, in practice, for netfilter, there can only be one instance in the
system that acts as the central authority, otherwise you'll end up with
orphaned entries or with situations where some client deletes rules
behind the back of the one that originally installed them. So I really
think there is nothing wrong with demanding a single, privileged
controller to manage things.

>> After chatting with Daniel Borkmann and Alexei off-list, we concluded
>> that __dev_queue_xmit() is the place where the egress hooks should live
>> when eBPF programs need access to the L2 bits of the skb.
> 
> 3) This egress hook is coming very late, the only reason I find to
>place it at __dev_queue_xmit() is that bpf naturally works with
>layer 2 information in place. But this new hook is placed in
>_everyone's output path_ that only works for the very specific
>use-case I exposed above.

It's about filtering outgoing network packets of applications, and
providing them with L2 information for filtering purposes. I don't think
that's a very specific use-case.

When the feature is not used at all, the added costs on the output path
are close to zero, due to the use of static branches. If used somewhere
in the system but not for the packet in flight, costs are slightly
higher but acceptable. In fact, it's not even measurable in my tests
here. How is that different from the netfilter OUTPUT hook, btw?

That said, limiting it to L3 is still an option. It's just that we need
ingress and egress to be in sync, so both would be L3 then. So far, the
possible advantages for future use-cases having access to L2 outweighed
the concerns of putting the hook to dev_queue_xmit(), but I'm open to
discussing that.

> The main concern during the workshop was that a hook only for cgroups
> is too specific, but this is actually even more specific than that.

This patch set merely implements an infrastructure that can accommodate
many more things in the future. We could, in theory, even add hooks for
forwarded packets specifically, or run other kinds of eBPF programs that
are not about network filtering at all.

> I have nothing against systemd or the needs for more
> programmability/flexibility in the stack, but I think this needs to
> fulfill some requirements to fit into the infrastructure that we have
> in the right way.

Well, as I explained already, this patch set results from endless
discussions that went nowhere, about how such a thing can be achieved
with netfilter.


Thanks,
Daniel


[PATCH v5 2/6] cgroup: add support for eBPF programs

2016-09-12 Thread Daniel Mack
This patch adds two sets of eBPF program pointers to struct cgroup.
One for such that are directly pinned to a cgroup, and one for such
that are effective for it.

To illustrate the logic behind that, assume the following example
cgroup hierarchy.

  A - B - C
       \ D - E

If only B has a program attached, it will be effective for B, C, D
and E. If D then attaches a program itself, that will be effective for
both D and E, and the program in B will only affect B and C. Only one
program of a given type is effective for a cgroup.
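
In toy code (for illustration only; this is not the kernel implementation,
which lives in kernel/bpf/cgroup.c, and the struct and iteration here are
made up), the propagation boils down to:

struct toy_cgroup {
	struct toy_cgroup *children[8];	/* NULL-terminated for the toy */
	void *pinned[2];		/* stands in for struct bpf_prog * */
	void *effective[2];
};

/* A cgroup's effective program is its own pinned one if present, otherwise
 * the one inherited from the parent; updates are pushed down the tree. */
static void toy_update(struct toy_cgroup *cg, void *inherited, int type)
{
	int i;

	cg->effective[type] = cg->pinned[type] ? cg->pinned[type] : inherited;

	for (i = 0; i < 8 && cg->children[i]; i++)
		toy_update(cg->children[i], cg->effective[type], type);
}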

Attaching and detaching programs will be done through the bpf(2)
syscall. For now, ingress and egress inet socket filtering are the
only supported use-cases.

Signed-off-by: Daniel Mack <dan...@zonque.org>
---
 include/linux/bpf-cgroup.h  |  71 +++
 include/linux/cgroup-defs.h |   4 ++
 init/Kconfig|  12 
 kernel/bpf/Makefile |   1 +
 kernel/bpf/cgroup.c | 166 
 kernel/cgroup.c |  18 +
 6 files changed, 272 insertions(+)
 create mode 100644 include/linux/bpf-cgroup.h
 create mode 100644 kernel/bpf/cgroup.c

diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
new file mode 100644
index 000..fc076de
--- /dev/null
+++ b/include/linux/bpf-cgroup.h
@@ -0,0 +1,71 @@
+#ifndef _BPF_CGROUP_H
+#define _BPF_CGROUP_H
+
+#include 
+#include 
+#include 
+
+struct sock;
+struct cgroup;
+struct sk_buff;
+
+#ifdef CONFIG_CGROUP_BPF
+
+extern struct static_key_false cgroup_bpf_enabled_key;
+#define cgroup_bpf_enabled static_branch_unlikely(&cgroup_bpf_enabled_key)
+
+struct cgroup_bpf {
+   /*
+* Store two sets of bpf_prog pointers, one for programs that are
+* pinned directly to this cgroup, and one for those that are effective
+* when this cgroup is accessed.
+*/
+   struct bpf_prog *prog[MAX_BPF_ATTACH_TYPE];
+   struct bpf_prog *effective[MAX_BPF_ATTACH_TYPE];
+};
+
+void cgroup_bpf_put(struct cgroup *cgrp);
+void cgroup_bpf_inherit(struct cgroup *cgrp, struct cgroup *parent);
+
+void __cgroup_bpf_update(struct cgroup *cgrp,
+struct cgroup *parent,
+struct bpf_prog *prog,
+enum bpf_attach_type type);
+
+/* Wrapper for __cgroup_bpf_update() protected by cgroup_mutex */
+void cgroup_bpf_update(struct cgroup *cgrp,
+  struct bpf_prog *prog,
+  enum bpf_attach_type type);
+
+int __cgroup_bpf_run_filter(struct sock *sk,
+   struct sk_buff *skb,
+   enum bpf_attach_type type);
+
+/* Wrapper for __cgroup_bpf_run_filter() guarded by cgroup_bpf_enabled */
+static inline int cgroup_bpf_run_filter(struct sock *sk,
+   struct sk_buff *skb,
+   enum bpf_attach_type type)
+{
+   if (cgroup_bpf_enabled)
+   return __cgroup_bpf_run_filter(sk, skb, type);
+
+   return 0;
+}
+
+#else
+
+struct cgroup_bpf {};
+static inline void cgroup_bpf_put(struct cgroup *cgrp) {}
+static inline void cgroup_bpf_inherit(struct cgroup *cgrp,
+ struct cgroup *parent) {}
+
+static inline int cgroup_bpf_run_filter(struct sock *sk,
+   struct sk_buff *skb,
+   enum bpf_attach_type type)
+{
+   return 0;
+}
+
+#endif /* CONFIG_CGROUP_BPF */
+
+#endif /* _BPF_CGROUP_H */
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 5b17de6..861b467 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -16,6 +16,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #ifdef CONFIG_CGROUPS
 
@@ -300,6 +301,9 @@ struct cgroup {
/* used to schedule release agent */
struct work_struct release_agent_work;
 
+   /* used to store eBPF programs */
+   struct cgroup_bpf bpf;
+
/* ids of the ancestors at each level including self */
int ancestor_ids[];
 };
diff --git a/init/Kconfig b/init/Kconfig
index cac3f09..71c71b0 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1144,6 +1144,18 @@ config CGROUP_PERF
 
  Say N if unsure.
 
+config CGROUP_BPF
+   bool "Support for eBPF programs attached to cgroups"
+   depends on BPF_SYSCALL && SOCK_CGROUP_DATA
+   help
+ Allow attaching eBPF programs to a cgroup using the bpf(2)
+ syscall command BPF_PROG_ATTACH.
+
+ In which context these programs are accessed depends on the type
+ of attachment. For instance, programs that are attached using
+ BPF_CGROUP_INET_INGRESS will be executed on the ingress path of
+ inet sockets.
+
 config CGROUP_DEBUG
bool "Example controller"
default n
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index eed911d..b22256b 100644
--- a/kernel/bpf/Makefile

[PATCH v5 3/6] bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH commands

2016-09-12 Thread Daniel Mack
Extend the bpf(2) syscall by two new commands, BPF_PROG_ATTACH and
BPF_PROG_DETACH which allow attaching and detaching eBPF programs
to a target.

On the API level, the target could be anything that has an fd in
userspace, hence the name of the field in union bpf_attr is called
'target_fd'.

When called with BPF_ATTACH_TYPE_CGROUP_INET_{E,IN}GRESS, the target is
expected to be a valid file descriptor of a cgroup v2 directory which
has the bpf controller enabled. These are the only use-cases
implemented by this patch at this point, but more can be added.

If a program of the given type already exists in the given cgroup,
the program is swapped atomically, so userspace does not have to drop
an existing program first before installing a new one, which would
otherwise leave a gap in which no program is attached.

For more information on the propagation logic to subcgroups, please
refer to the bpf cgroup controller implementation.

The API is guarded by CAP_NET_ADMIN.
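
From userspace, usage then looks roughly like this (sketch only; it assumes
headers updated by this series, and error handling is trimmed):

#include <fcntl.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/bpf.h>

static int attach_to_cgroup(int prog_fd, const char *cgroup_path,
			    enum bpf_attach_type type)
{
	union bpf_attr attr;
	int cg_fd;

	cg_fd = open(cgroup_path, O_DIRECTORY | O_RDONLY);
	if (cg_fd < 0)
		return -1;

	memset(&attr, 0, sizeof(attr));
	attr.target_fd = cg_fd;
	attr.attach_bpf_fd = prog_fd;
	attr.attach_type = type;

	return syscall(__NR_bpf, BPF_PROG_ATTACH, &attr, sizeof(attr));
}

Detaching works the same way, just without attach_bpf_fd.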

Signed-off-by: Daniel Mack <dan...@zonque.org>
---
 include/uapi/linux/bpf.h |  8 +
 kernel/bpf/syscall.c | 81 
 2 files changed, 89 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 55f815e..7cd3616 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -73,6 +73,8 @@ enum bpf_cmd {
BPF_PROG_LOAD,
BPF_OBJ_PIN,
BPF_OBJ_GET,
+   BPF_PROG_ATTACH,
+   BPF_PROG_DETACH,
 };
 
 enum bpf_map_type {
@@ -150,6 +152,12 @@ union bpf_attr {
__aligned_u64   pathname;
__u32   bpf_fd;
};
+
+   struct { /* anonymous struct used by BPF_PROG_ATTACH/DETACH commands */
+   __u32   target_fd;  /* container object to attach to */
+   __u32   attach_bpf_fd;  /* eBPF program to attach */
+   __u32   attach_type;
+   };
 } __attribute__((aligned(8)));
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 228f962..1a8592a 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -822,6 +822,77 @@ static int bpf_obj_get(const union bpf_attr *attr)
return bpf_obj_get_user(u64_to_ptr(attr->pathname));
 }
 
+#ifdef CONFIG_CGROUP_BPF
+
+#define BPF_PROG_ATTACH_LAST_FIELD attach_type
+
+static int bpf_prog_attach(const union bpf_attr *attr)
+{
+   struct bpf_prog *prog;
+   struct cgroup *cgrp;
+
+   if (!capable(CAP_NET_ADMIN))
+   return -EPERM;
+
+   if (CHECK_ATTR(BPF_PROG_ATTACH))
+   return -EINVAL;
+
+   switch (attr->attach_type) {
+   case BPF_CGROUP_INET_INGRESS:
+   case BPF_CGROUP_INET_EGRESS:
+   prog = bpf_prog_get_type(attr->attach_bpf_fd,
+BPF_PROG_TYPE_CGROUP_SOCKET);
+   if (IS_ERR(prog))
+   return PTR_ERR(prog);
+
+   cgrp = cgroup_get_from_fd(attr->target_fd);
+   if (IS_ERR(cgrp)) {
+   bpf_prog_put(prog);
+   return PTR_ERR(cgrp);
+   }
+
+   cgroup_bpf_update(cgrp, prog, attr->attach_type);
+   cgroup_put(cgrp);
+   break;
+
+   default:
+   return -EINVAL;
+   }
+
+   return 0;
+}
+
+#define BPF_PROG_DETACH_LAST_FIELD attach_type
+
+static int bpf_prog_detach(const union bpf_attr *attr)
+{
+   struct cgroup *cgrp;
+
+   if (!capable(CAP_NET_ADMIN))
+   return -EPERM;
+
+   if (CHECK_ATTR(BPF_PROG_DETACH))
+   return -EINVAL;
+
+   switch (attr->attach_type) {
+   case BPF_CGROUP_INET_INGRESS:
+   case BPF_CGROUP_INET_EGRESS:
+   cgrp = cgroup_get_from_fd(attr->target_fd);
+   if (IS_ERR(cgrp))
+   return PTR_ERR(cgrp);
+
+   cgroup_bpf_update(cgrp, NULL, attr->attach_type);
+   cgroup_put(cgrp);
+   break;
+
+   default:
+   return -EINVAL;
+   }
+
+   return 0;
+}
+#endif /* CONFIG_CGROUP_BPF */
+
 SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, 
size)
 {
union bpf_attr attr = {};
@@ -888,6 +959,16 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, 
uattr, unsigned int, siz
case BPF_OBJ_GET:
err = bpf_obj_get(&attr);
break;
+
+#ifdef CONFIG_CGROUP_BPF
+   case BPF_PROG_ATTACH:
+   err = bpf_prog_attach(&attr);
+   break;
+   case BPF_PROG_DETACH:
+   err = bpf_prog_detach(&attr);
+   break;
+#endif
+
default:
err = -EINVAL;
break;
-- 
2.5.5



[PATCH v5 6/6] samples: bpf: add userspace example for attaching eBPF programs to cgroups

2016-09-12 Thread Daniel Mack
Add a simple userspace program to demonstrate the new API to attach eBPF
programs to cgroups. This is what it does:

 * Create arraymap in kernel with 4 byte keys and 8 byte values

 * Load eBPF program

   The eBPF program accesses the map passed in to store two pieces of
   information. The number of invocations of the program, which maps
   to the number of packets received, is stored to key 0. Key 1 is
   incremented on each iteration by the number of bytes stored in
   the skb.

 * Detach any eBPF program previously attached to the cgroup

 * Attach the new program to the cgroup using BPF_PROG_ATTACH

 * Once a second, read map[0] and map[1] to see how many bytes and
   packets were seen on any socket of tasks in the given cgroup.

The program takes a cgroup path as 1st argument, and either "ingress"
or "egress" as 2nd. Optionally, "drop" can be passed as 3rd argument,
which will make the generated eBPF program return 0 instead of 1, so
the kernel will drop the packet.
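
For reference, what the generated instructions do is roughly the following,
written as restricted C in the style of the other samples (illustration
only; the names and section annotation are made up, and this sample
actually emits the raw instructions directly rather than using clang):

#include <uapi/linux/bpf.h>
#include "bpf_helpers.h"

struct bpf_map_def SEC("maps") count_map = {
	.type		= BPF_MAP_TYPE_ARRAY,
	.key_size	= sizeof(__u32),
	.value_size	= sizeof(__u64),
	.max_entries	= 2,
};

SEC("cgroup_skb")
int count_traffic(struct __sk_buff *skb)
{
	__u32 key = 0;			/* MAP_KEY_PACKETS */
	__u64 *val = bpf_map_lookup_elem(&count_map, &key);

	if (val)
		__sync_fetch_and_add(val, 1);

	key = 1;			/* MAP_KEY_BYTES */
	val = bpf_map_lookup_elem(&count_map, &key);
	if (val)
		__sync_fetch_and_add(val, skb->len);

	return 1;			/* pass; 0 would drop */
}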

libbpf gained two new wrappers for the new syscall commands.

Signed-off-by: Daniel Mack <dan...@zonque.org>
---
 samples/bpf/Makefile|   2 +
 samples/bpf/libbpf.c|  21 ++
 samples/bpf/libbpf.h|   3 +
 samples/bpf/test_cgrp2_attach.c | 147 
 4 files changed, 173 insertions(+)
 create mode 100644 samples/bpf/test_cgrp2_attach.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 12b7304..e4cdc74 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -22,6 +22,7 @@ hostprogs-y += spintest
 hostprogs-y += map_perf_test
 hostprogs-y += test_overhead
 hostprogs-y += test_cgrp2_array_pin
+hostprogs-y += test_cgrp2_attach
 hostprogs-y += xdp1
 hostprogs-y += xdp2
 hostprogs-y += test_current_task_under_cgroup
@@ -49,6 +50,7 @@ spintest-objs := bpf_load.o libbpf.o spintest_user.o
 map_perf_test-objs := bpf_load.o libbpf.o map_perf_test_user.o
 test_overhead-objs := bpf_load.o libbpf.o test_overhead_user.o
 test_cgrp2_array_pin-objs := libbpf.o test_cgrp2_array_pin.o
+test_cgrp2_attach-objs := libbpf.o test_cgrp2_attach.o
 xdp1-objs := bpf_load.o libbpf.o xdp1_user.o
 # reuse xdp1 source intentionally
 xdp2-objs := bpf_load.o libbpf.o xdp1_user.o
diff --git a/samples/bpf/libbpf.c b/samples/bpf/libbpf.c
index 9969e35..9ce707b 100644
--- a/samples/bpf/libbpf.c
+++ b/samples/bpf/libbpf.c
@@ -104,6 +104,27 @@ int bpf_prog_load(enum bpf_prog_type prog_type,
return syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));
 }
 
+int bpf_prog_attach(int prog_fd, int target_fd, enum bpf_attach_type type)
+{
+   union bpf_attr attr = {
+   .target_fd = target_fd,
+   .attach_bpf_fd = prog_fd,
+   .attach_type = type,
+   };
+
+   return syscall(__NR_bpf, BPF_PROG_ATTACH, &attr, sizeof(attr));
+}
+
+int bpf_prog_detach(int target_fd, enum bpf_attach_type type)
+{
+   union bpf_attr attr = {
+   .target_fd = target_fd,
+   .attach_type = type,
+   };
+
+   return syscall(__NR_bpf, BPF_PROG_DETACH, &attr, sizeof(attr));
+}
+
 int bpf_obj_pin(int fd, const char *pathname)
 {
union bpf_attr attr = {
diff --git a/samples/bpf/libbpf.h b/samples/bpf/libbpf.h
index 364582b..f973241 100644
--- a/samples/bpf/libbpf.h
+++ b/samples/bpf/libbpf.h
@@ -15,6 +15,9 @@ int bpf_prog_load(enum bpf_prog_type prog_type,
  const struct bpf_insn *insns, int insn_len,
  const char *license, int kern_version);
 
+int bpf_prog_attach(int prog_fd, int attachable_fd, enum bpf_attach_type type);
+int bpf_prog_detach(int attachable_fd, enum bpf_attach_type type);
+
 int bpf_obj_pin(int fd, const char *pathname);
 int bpf_obj_get(const char *pathname);
 
diff --git a/samples/bpf/test_cgrp2_attach.c b/samples/bpf/test_cgrp2_attach.c
new file mode 100644
index 000..19e4ec0
--- /dev/null
+++ b/samples/bpf/test_cgrp2_attach.c
@@ -0,0 +1,147 @@
+/* eBPF example program:
+ *
+ * - Creates arraymap in kernel with 4 byte keys and 8 byte values
+ *
+ * - Loads eBPF program
+ *
+ *   The eBPF program accesses the map passed in to store two pieces of
+ *   information. The number of invocations of the program, which maps
+ *   to the number of packets received, is stored to key 0. Key 1 is
+ *   incremented on each iteration by the number of bytes stored in
+ *   the skb.
+ *
+ * - Detaches any eBPF program previously attached to the cgroup
+ *
+ * - Attaches the new program to a cgroup using BPF_PROG_ATTACH
+ *
+ * - Every second, reads map[0] and map[1] to see how many bytes and
+ *   packets were seen on any socket of tasks in the given cgroup.
+ */
+
+#define _GNU_SOURCE
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+
+#include "libbpf.h"
+
+enum {
+   MAP_KEY_PACKETS,
+   MAP_KEY_BYTES,
+};
+
+static int prog_load(int map_fd, int verdict)
+{
+   struct bpf

[PATCH v5 0/6] Add eBPF hooks for cgroups

2016-09-12 Thread Daniel Mack
This is v5 of the patch set to allow eBPF programs for network
filtering and accounting to be attached to cgroups, so that they apply
to all sockets of all tasks placed in that cgroup. The logic also
allows it to be extended for other cgroup-based eBPF logic.

After chatting with Daniel Borkmann and Alexei off-list, we concluded
that __dev_queue_xmit() is the place where the egress hooks should live
when eBPF programs need access to the L2 bits of the skb.


Changes from v4:

* Plug an skb leak when dropping packets due to eBPF verdicts in
  __dev_queue_xmit(). Spotted by Daniel Borkmann.

* Check for sk_fullsock(sk) in __cgroup_bpf_run_filter() so we don't
  operate on timewait or request sockets. Suggested by Daniel Borkmann.

* Add missing @parent parameter in kerneldoc of __cgroup_bpf_update().
  Spotted by Rami Rosen.

* Include linux/jump_label.h from bpf-cgroup.h to fix a kbuild error.


Changes from v3:

* Dropped the _FILTER suffix from BPF_PROG_TYPE_CGROUP_SOCKET_FILTER,
  renamed BPF_ATTACH_TYPE_CGROUP_INET_{E,IN}GRESS to
  BPF_CGROUP_INET_{IN,E}GRESS and alias BPF_MAX_ATTACH_TYPE to
  __BPF_MAX_ATTACH_TYPE, as suggested by Daniel Borkmann.

* Dropped the attach_flags member from the anonymous struct for BPF
  attach operations in union bpf_attr. They can be added later on via
  CHECK_ATTR. Requested by Daniel Borkmann and Alexei.

* Release old_prog at the end of __cgroup_bpf_update rather that at
  the beginning to fix a race gap between program updates and their
  users. Spotted by Daniel Borkmann.

* Plugged an skb leak when dropping packets on the egress path.
  Spotted by Daniel Borkmann.

* Add cgro...@vger.kernel.org to the loop, as suggested by Rami Rosen.

* Some minor coding style adoptions not worth mentioning in particular.


Changes from v2:

* Fixed the RCU locking details Tejun pointed out.

* Assert bpf_attr.flags == 0 in BPF_PROG_DETACH syscall handler.


Changes from v1:

* Moved all bpf specific cgroup code into its own file, and stub
  out related functions for !CONFIG_CGROUP_BPF as static inline nops.
  This way, the call sites are not cluttered with #ifdef guards while
  the feature remains compile-time configurable.

* Implemented the new scheme proposed by Tejun. Per cgroup, store one
  set of pointers that are pinned to the cgroup, and one for the
  programs that are effective. When a program is attached or detached,
  the change is propagated to all the cgroup's descendants. If a
  subcgroup has its own pinned program, skip the whole subbranch in
  order to allow delegation models.

* The hookup for egress packets is now done from __dev_queue_xmit().

* A static key is now used in both the ingress and egress fast paths
  to keep performance penalties close to zero if the feature is
  not in use.

* Overall cleanup to make the accessors use the program arrays.
  This should make it much easier to add new program types, which
  will then automatically follow the pinned vs. effective logic.

* Fixed locking issues, as pointed out by Eric Dumazet and Alexei
  Starovoitov. Changes to the program array are now done with
  xchg() and are protected by cgroup_mutex.

* eBPF programs are now expected to return 1 to let the packet pass,
  not >= 0. Pointed out by Alexei.

* Operation is now limited to INET sockets, so local AF_UNIX sockets
  are not affected. The enum members are renamed accordingly. In case
  other socket families should be supported, this can be extended in
  the future.

* The sample program learned to support both ingress and egress, and
  can now optionally make the eBPF program drop packets by making it
  return 0.


As always, feedback is much appreciated.

Thanks,
Daniel


Daniel Mack (6):
  bpf: add new prog type for cgroup socket filtering
  cgroup: add support for eBPF programs
  bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH commands
  net: filter: run cgroup eBPF ingress programs
  net: core: run cgroup eBPF egress programs
  samples: bpf: add userspace example for attaching eBPF programs to
cgroups

 include/linux/bpf-cgroup.h  |  71 +
 include/linux/cgroup-defs.h |   4 +
 include/uapi/linux/bpf.h|  17 
 init/Kconfig|  12 +++
 kernel/bpf/Makefile |   1 +
 kernel/bpf/cgroup.c | 166 
 kernel/bpf/syscall.c|  81 
 kernel/bpf/verifier.c   |   1 +
 kernel/cgroup.c |  18 +
 net/core/dev.c  |   6 ++
 net/core/filter.c   |  10 +++
 samples/bpf/Makefile|   2 +
 samples/bpf/libbpf.c|  21 +
 samples/bpf/libbpf.h|   3 +
 samples/bpf/test_cgrp2_attach.c | 147 +++
 15 files changed, 560 insertions(+)
 create mode 100644 include/linux/bpf-cgroup.h
 create mode 100644 kernel/bpf/cgroup.c
 create mode 100644 samples/bpf/test_cgrp2_attach.c

-- 
2.5.5



[PATCH v5 1/6] bpf: add new prog type for cgroup socket filtering

2016-09-12 Thread Daniel Mack
For now, this program type is equivalent to BPF_PROG_TYPE_SOCKET_FILTER in
terms of checks during the verification process. It may access the skb as
well.

Programs of this type will be attached to cgroups for network filtering
and accounting.

Signed-off-by: Daniel Mack <dan...@zonque.org>
---
 include/uapi/linux/bpf.h | 9 +
 kernel/bpf/verifier.c| 1 +
 net/core/filter.c| 6 ++
 3 files changed, 16 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index f896dfa..55f815e 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -96,8 +96,17 @@ enum bpf_prog_type {
BPF_PROG_TYPE_TRACEPOINT,
BPF_PROG_TYPE_XDP,
BPF_PROG_TYPE_PERF_EVENT,
+   BPF_PROG_TYPE_CGROUP_SOCKET,
 };
 
+enum bpf_attach_type {
+   BPF_CGROUP_INET_INGRESS,
+   BPF_CGROUP_INET_EGRESS,
+   __MAX_BPF_ATTACH_TYPE
+};
+
+#define MAX_BPF_ATTACH_TYPE __MAX_BPF_ATTACH_TYPE
+
 #define BPF_PSEUDO_MAP_FD  1
 
 /* flags for BPF_MAP_UPDATE_ELEM command */
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 90493a6..d5d2875 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -1830,6 +1830,7 @@ static bool may_access_skb(enum bpf_prog_type type)
case BPF_PROG_TYPE_SOCKET_FILTER:
case BPF_PROG_TYPE_SCHED_CLS:
case BPF_PROG_TYPE_SCHED_ACT:
+   case BPF_PROG_TYPE_CGROUP_SOCKET:
return true;
default:
return false;
diff --git a/net/core/filter.c b/net/core/filter.c
index a83766b..176b6f2 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2848,12 +2848,18 @@ static struct bpf_prog_type_list xdp_type __read_mostly 
= {
.type   = BPF_PROG_TYPE_XDP,
 };
 
+static struct bpf_prog_type_list cg_sk_type __read_mostly = {
+   .ops= &sk_filter_ops,
+   .type   = BPF_PROG_TYPE_CGROUP_SOCKET,
+};
+
 static int __init register_sk_filter_ops(void)
 {
bpf_register_prog_type(&sk_filter_type);
bpf_register_prog_type(&sched_cls_type);
bpf_register_prog_type(&sched_act_type);
bpf_register_prog_type(&xdp_type);
+   bpf_register_prog_type(&cg_sk_type);
 
return 0;
 }
-- 
2.5.5



[PATCH v5 5/6] net: core: run cgroup eBPF egress programs

2016-09-12 Thread Daniel Mack
If the cgroup associated with the sending socket has eBPF
programs installed, run them from __dev_queue_xmit().

eBPF programs used in this context are expected to either return 1 to
let the packet pass, or != 1 to drop them. The programs have access to
the full skb, including the MAC headers.

Note that cgroup_bpf_run_filter() is stubbed out as static inline nop
for !CONFIG_CGROUP_BPF, and is otherwise guarded by a static key if
the feature is unused.

Signed-off-by: Daniel Mack <dan...@zonque.org>
---
 net/core/dev.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/net/core/dev.c b/net/core/dev.c
index 34b5322..f951db2 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -141,6 +141,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "net-sysfs.h"
 
@@ -3329,6 +3330,10 @@ static int __dev_queue_xmit(struct sk_buff *skb, void 
*accel_priv)
if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_SCHED_TSTAMP))
__skb_tstamp_tx(skb, NULL, skb->sk, SCM_TSTAMP_SCHED);
 
+   rc = cgroup_bpf_run_filter(skb->sk, skb, BPF_CGROUP_INET_EGRESS);
+   if (rc)
+   goto free_skb_list;
+
/* Disable soft irqs for various locks below. Also
 * stops preemption for RCU.
 */
@@ -3416,6 +3421,7 @@ recursion_alert:
rcu_read_unlock_bh();
 
atomic_long_inc(&dev->tx_dropped);
+free_skb_list:
kfree_skb_list(skb);
return rc;
 out:
-- 
2.5.5



[PATCH v5 4/6] net: filter: run cgroup eBPF ingress programs

2016-09-12 Thread Daniel Mack
If the cgroup associated with the receiving socket has eBPF
programs installed, run them from sk_filter_trim_cap().

eBPF programs used in this context are expected to either return 1 to
let the packet pass, or != 1 to drop them. The programs have access to
the full skb, including the MAC headers.

Note that cgroup_bpf_run_filter() is stubbed out as static inline nop
for !CONFIG_CGROUP_BPF, and is otherwise guarded by a static key if
the feature is unused.

Signed-off-by: Daniel Mack <dan...@zonque.org>
---
 net/core/filter.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/net/core/filter.c b/net/core/filter.c
index 176b6f2..3662c1a 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -78,6 +78,10 @@ int sk_filter_trim_cap(struct sock *sk, struct sk_buff *skb, 
unsigned int cap)
if (skb_pfmemalloc(skb) && !sock_flag(sk, SOCK_MEMALLOC))
return -ENOMEM;
 
+   err = cgroup_bpf_run_filter(sk, skb, BPF_CGROUP_INET_INGRESS);
+   if (err)
+   return err;
+
err = security_sock_rcv_skb(sk, skb);
if (err)
return err;
-- 
2.5.5



Re: [PATCH v4 2/6] cgroup: add support for eBPF programs

2016-09-07 Thread Daniel Mack
On 09/06/2016 07:18 PM, Daniel Borkmann wrote:
> On 09/06/2016 03:46 PM, Daniel Mack wrote:
>> This patch adds two sets of eBPF program pointers to struct cgroup.
>> One for such that are directly pinned to a cgroup, and one for such
>> that are effective for it.
>>
>> To illustrate the logic behind that, assume the following example
>> cgroup hierarchy.
>>
>>    A - B - C
>>         \ D - E
>>
>> If only B has a program attached, it will be effective for B, C, D
>> and E. If D then attaches a program itself, that will be effective for
>> both D and E, and the program in B will only affect B and C. Only one
>> program of a given type is effective for a cgroup.
>>
>> Attaching and detaching programs will be done through the bpf(2)
>> syscall. For now, ingress and egress inet socket filtering are the
>> only supported use-cases.
>>
>> Signed-off-by: Daniel Mack <dan...@zonque.org>
> [...]
>> +/**
>> + * __cgroup_bpf_run_filter() - Run a program for packet filtering
>> + * @sk: The socket sending or receiving traffic
>> + * @skb: The skb that is being sent or received
>> + * @type: The type of program to be executed
>> + *
>> + * If no socket is passed, or the socket is not of type INET or INET6,
>> + * this function does nothing and returns 0.
>> + *
>> + * The program type passed in via @type must be suitable for network
>> + * filtering. No further check is performed to assert that.
>> + *
>> + * This function will return %-EPERM if an attached program was found
>> + * and if it returned != 1 during execution. In all other cases, 0 is
>> + * returned.
>> + */
>> +int __cgroup_bpf_run_filter(struct sock *sk,
>> +struct sk_buff *skb,
>> +enum bpf_attach_type type)
>> +{
>> +struct bpf_prog *prog;
>> +struct cgroup *cgrp;
>> +int ret = 0;
>> +
>> +if (!sk)
>> +return 0;
> 
> Doesn't this also need to check || !sk_fullsock(sk)?

Ah, yes. We should limit it to full sockets. Thanks!


Daniel



[PATCH v4 1/6] bpf: add new prog type for cgroup socket filtering

2016-09-06 Thread Daniel Mack
For now, this program type is equivalent to BPF_PROG_TYPE_SOCKET_FILTER in
terms of checks during the verification process. It may access the skb as
well.

Programs of this type will be attached to cgroups for network filtering
and accounting.

Signed-off-by: Daniel Mack <dan...@zonque.org>
---
 include/uapi/linux/bpf.h | 9 +
 kernel/bpf/verifier.c| 1 +
 net/core/filter.c| 6 ++
 3 files changed, 16 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index f896dfa..55f815e 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -96,8 +96,17 @@ enum bpf_prog_type {
BPF_PROG_TYPE_TRACEPOINT,
BPF_PROG_TYPE_XDP,
BPF_PROG_TYPE_PERF_EVENT,
+   BPF_PROG_TYPE_CGROUP_SOCKET,
 };
 
+enum bpf_attach_type {
+   BPF_CGROUP_INET_INGRESS,
+   BPF_CGROUP_INET_EGRESS,
+   __MAX_BPF_ATTACH_TYPE
+};
+
+#define MAX_BPF_ATTACH_TYPE __MAX_BPF_ATTACH_TYPE
+
 #define BPF_PSEUDO_MAP_FD  1
 
 /* flags for BPF_MAP_UPDATE_ELEM command */
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 48c2705..1b8a871 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -1805,6 +1805,7 @@ static bool may_access_skb(enum bpf_prog_type type)
case BPF_PROG_TYPE_SOCKET_FILTER:
case BPF_PROG_TYPE_SCHED_CLS:
case BPF_PROG_TYPE_SCHED_ACT:
+   case BPF_PROG_TYPE_CGROUP_SOCKET:
return true;
default:
return false;
diff --git a/net/core/filter.c b/net/core/filter.c
index a83766b..176b6f2 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2848,12 +2848,18 @@ static struct bpf_prog_type_list xdp_type __read_mostly 
= {
.type   = BPF_PROG_TYPE_XDP,
 };
 
+static struct bpf_prog_type_list cg_sk_type __read_mostly = {
+   .ops= &sk_filter_ops,
+   .type   = BPF_PROG_TYPE_CGROUP_SOCKET,
+};
+
 static int __init register_sk_filter_ops(void)
 {
bpf_register_prog_type(&sk_filter_type);
bpf_register_prog_type(&sched_cls_type);
bpf_register_prog_type(&sched_act_type);
bpf_register_prog_type(&xdp_type);
+   bpf_register_prog_type(&cg_sk_type);
 
return 0;
 }
-- 
2.5.5



[PATCH v4 4/6] net: filter: run cgroup eBPF ingress programs

2016-09-06 Thread Daniel Mack
If the cgroup associated with the receiving socket has eBPF
programs installed, run them from sk_filter_trim_cap().

eBPF programs used in this context are expected to either return 1 to
let the packet pass, or != 1 to drop them. The programs have access to
the full skb, including the MAC headers.

Note that cgroup_bpf_run_filter() is stubbed out as static inline nop
for !CONFIG_CGROUP_BPF, and is otherwise guarded by a static key if
the feature is unused.

Signed-off-by: Daniel Mack <dan...@zonque.org>
---
 net/core/filter.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/net/core/filter.c b/net/core/filter.c
index 176b6f2..3662c1a 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -78,6 +78,10 @@ int sk_filter_trim_cap(struct sock *sk, struct sk_buff *skb, 
unsigned int cap)
if (skb_pfmemalloc(skb) && !sock_flag(sk, SOCK_MEMALLOC))
return -ENOMEM;
 
+   err = cgroup_bpf_run_filter(sk, skb, BPF_CGROUP_INET_INGRESS);
+   if (err)
+   return err;
+
err = security_sock_rcv_skb(sk, skb);
if (err)
return err;
-- 
2.5.5



[PATCH v4 5/6] net: core: run cgroup eBPF egress programs

2016-09-06 Thread Daniel Mack
If the cgroup associated with the sending socket has eBPF
programs installed, run them from __dev_queue_xmit().

eBPF programs used in this context are expected to either return 1 to
let the packet pass, or != 1 to drop them. The programs have access to
the full skb, including the MAC headers.

Note that cgroup_bpf_run_filter() is stubbed out as static inline nop
for !CONFIG_CGROUP_BPF, and is otherwise guarded by a static key if
the feature is unused.

Signed-off-by: Daniel Mack <dan...@zonque.org>
---
 net/core/dev.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 34b5322..eb2bd20 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -141,6 +141,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "net-sysfs.h"
 
@@ -3329,6 +3330,10 @@ static int __dev_queue_xmit(struct sk_buff *skb, void 
*accel_priv)
if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_SCHED_TSTAMP))
__skb_tstamp_tx(skb, NULL, skb->sk, SCM_TSTAMP_SCHED);
 
+   rc = cgroup_bpf_run_filter(skb->sk, skb, BPF_CGROUP_INET_EGRESS);
+   if (rc)
+   goto free_skb_list;
+
/* Disable soft irqs for various locks below. Also
 * stops preemption for RCU.
 */
@@ -3414,8 +3419,8 @@ recursion_alert:
 
rc = -ENETDOWN;
rcu_read_unlock_bh();
-
atomic_long_inc(&dev->tx_dropped);
+free_skb_list:
kfree_skb_list(skb);
return rc;
 out:
-- 
2.5.5



[PATCH v4 2/6] cgroup: add support for eBPF programs

2016-09-06 Thread Daniel Mack
This patch adds two sets of eBPF program pointers to struct cgroup.
One for such that are directly pinned to a cgroup, and one for such
that are effective for it.

To illustrate the logic behind that, assume the following example
cgroup hierarchy.

  A - B - C
       \ D - E

If only B has a program attached, it will be effective for B, C, D
and E. If D then attaches a program itself, that will be effective for
both D and E, and the program in B will only affect B and C. Only one
program of a given type is effective for a cgroup.

Attaching and detaching programs will be done through the bpf(2)
syscall. For now, ingress and egress inet socket filtering are the
only supported use-cases.

Signed-off-by: Daniel Mack <dan...@zonque.org>
---
 include/linux/bpf-cgroup.h  |  70 +++
 include/linux/cgroup-defs.h |   4 ++
 init/Kconfig|  12 
 kernel/bpf/Makefile |   1 +
 kernel/bpf/cgroup.c | 165 
 kernel/cgroup.c |  18 +
 6 files changed, 270 insertions(+)
 create mode 100644 include/linux/bpf-cgroup.h
 create mode 100644 kernel/bpf/cgroup.c

diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
new file mode 100644
index 000..eac0957
--- /dev/null
+++ b/include/linux/bpf-cgroup.h
@@ -0,0 +1,70 @@
+#ifndef _BPF_CGROUP_H
+#define _BPF_CGROUP_H
+
+#include 
+#include 
+
+struct sock;
+struct cgroup;
+struct sk_buff;
+
+#ifdef CONFIG_CGROUP_BPF
+
+extern struct static_key_false cgroup_bpf_enabled_key;
+#define cgroup_bpf_enabled static_branch_unlikely(&cgroup_bpf_enabled_key)
+
+struct cgroup_bpf {
+   /*
+* Store two sets of bpf_prog pointers, one for programs that are
+* pinned directly to this cgroup, and one for those that are effective
+* when this cgroup is accessed.
+*/
+   struct bpf_prog *prog[MAX_BPF_ATTACH_TYPE];
+   struct bpf_prog *effective[MAX_BPF_ATTACH_TYPE];
+};
+
+void cgroup_bpf_put(struct cgroup *cgrp);
+void cgroup_bpf_inherit(struct cgroup *cgrp, struct cgroup *parent);
+
+void __cgroup_bpf_update(struct cgroup *cgrp,
+struct cgroup *parent,
+struct bpf_prog *prog,
+enum bpf_attach_type type);
+
+/* Wrapper for __cgroup_bpf_update() protected by cgroup_mutex */
+void cgroup_bpf_update(struct cgroup *cgrp,
+  struct bpf_prog *prog,
+  enum bpf_attach_type type);
+
+int __cgroup_bpf_run_filter(struct sock *sk,
+   struct sk_buff *skb,
+   enum bpf_attach_type type);
+
+/* Wrapper for __cgroup_bpf_run_filter() guarded by cgroup_bpf_enabled */
+static inline int cgroup_bpf_run_filter(struct sock *sk,
+   struct sk_buff *skb,
+   enum bpf_attach_type type)
+{
+   if (cgroup_bpf_enabled)
+   return __cgroup_bpf_run_filter(sk, skb, type);
+
+   return 0;
+}
+
+#else
+
+struct cgroup_bpf {};
+static inline void cgroup_bpf_put(struct cgroup *cgrp) {}
+static inline void cgroup_bpf_inherit(struct cgroup *cgrp,
+ struct cgroup *parent) {}
+
+static inline int cgroup_bpf_run_filter(struct sock *sk,
+   struct sk_buff *skb,
+   enum bpf_attach_type type)
+{
+   return 0;
+}
+
+#endif /* CONFIG_CGROUP_BPF */
+
+#endif /* _BPF_CGROUP_H */
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 5b17de6..861b467 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -16,6 +16,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #ifdef CONFIG_CGROUPS
 
@@ -300,6 +301,9 @@ struct cgroup {
/* used to schedule release agent */
struct work_struct release_agent_work;
 
+   /* used to store eBPF programs */
+   struct cgroup_bpf bpf;
+
/* ids of the ancestors at each level including self */
int ancestor_ids[];
 };
diff --git a/init/Kconfig b/init/Kconfig
index cac3f09..71c71b0 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1144,6 +1144,18 @@ config CGROUP_PERF
 
  Say N if unsure.
 
+config CGROUP_BPF
+   bool "Support for eBPF programs attached to cgroups"
+   depends on BPF_SYSCALL && SOCK_CGROUP_DATA
+   help
+ Allow attaching eBPF programs to a cgroup using the bpf(2)
+ syscall command BPF_PROG_ATTACH.
+
+ In which context these programs are accessed depends on the type
+ of attachment. For instance, programs that are attached using
+ BPF_CGROUP_INET_INGRESS will be executed on the ingress path of
+ inet sockets.
+
 config CGROUP_DEBUG
bool "Example controller"
default n
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index eed911d..b22256b 100644
--- a/kernel/bpf/Makefile
+++ b/kerne

[PATCH v4 3/6] bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH commands

2016-09-06 Thread Daniel Mack
Extend the bpf(2) syscall by two new commands, BPF_PROG_ATTACH and
BPF_PROG_DETACH which allow attaching and detaching eBPF programs
to a target.

On the API level, the target could be anything that has an fd in
userspace, hence the name of the field in union bpf_attr is called
'target_fd'.

When called with BPF_ATTACH_TYPE_CGROUP_INET_{E,IN}GRESS, the target is
expected to be a valid file descriptor of a cgroup v2 directory which
has the bpf controller enabled. These are the only use-cases
implemented by this patch at this point, but more can be added.

If a program of the given type already exists in the given cgroup,
the program is swapped atomically, so userspace does not have to drop
an existing program first before installing a new one, which would
otherwise leave a gap in which no program is attached.

For more information on the propagation logic to subcgroups, please
refer to the bpf cgroup controller implementation.

The API is guarded by CAP_NET_ADMIN.

Signed-off-by: Daniel Mack <dan...@zonque.org>
---
 include/uapi/linux/bpf.h |  8 +
 kernel/bpf/syscall.c | 81 
 2 files changed, 89 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 55f815e..7cd3616 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -73,6 +73,8 @@ enum bpf_cmd {
BPF_PROG_LOAD,
BPF_OBJ_PIN,
BPF_OBJ_GET,
+   BPF_PROG_ATTACH,
+   BPF_PROG_DETACH,
 };
 
 enum bpf_map_type {
@@ -150,6 +152,12 @@ union bpf_attr {
__aligned_u64   pathname;
__u32   bpf_fd;
};
+
+   struct { /* anonymous struct used by BPF_PROG_ATTACH/DETACH commands */
+   __u32   target_fd;  /* container object to attach to */
+   __u32   attach_bpf_fd;  /* eBPF program to attach */
+   __u32   attach_type;
+   };
 } __attribute__((aligned(8)));
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 228f962..1a8592a 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -822,6 +822,77 @@ static int bpf_obj_get(const union bpf_attr *attr)
return bpf_obj_get_user(u64_to_ptr(attr->pathname));
 }
 
+#ifdef CONFIG_CGROUP_BPF
+
+#define BPF_PROG_ATTACH_LAST_FIELD attach_type
+
+static int bpf_prog_attach(const union bpf_attr *attr)
+{
+   struct bpf_prog *prog;
+   struct cgroup *cgrp;
+
+   if (!capable(CAP_NET_ADMIN))
+   return -EPERM;
+
+   if (CHECK_ATTR(BPF_PROG_ATTACH))
+   return -EINVAL;
+
+   switch (attr->attach_type) {
+   case BPF_CGROUP_INET_INGRESS:
+   case BPF_CGROUP_INET_EGRESS:
+   prog = bpf_prog_get_type(attr->attach_bpf_fd,
+BPF_PROG_TYPE_CGROUP_SOCKET);
+   if (IS_ERR(prog))
+   return PTR_ERR(prog);
+
+   cgrp = cgroup_get_from_fd(attr->target_fd);
+   if (IS_ERR(cgrp)) {
+   bpf_prog_put(prog);
+   return PTR_ERR(cgrp);
+   }
+
+   cgroup_bpf_update(cgrp, prog, attr->attach_type);
+   cgroup_put(cgrp);
+   break;
+
+   default:
+   return -EINVAL;
+   }
+
+   return 0;
+}
+
+#define BPF_PROG_DETACH_LAST_FIELD attach_type
+
+static int bpf_prog_detach(const union bpf_attr *attr)
+{
+   struct cgroup *cgrp;
+
+   if (!capable(CAP_NET_ADMIN))
+   return -EPERM;
+
+   if (CHECK_ATTR(BPF_PROG_DETACH))
+   return -EINVAL;
+
+   switch (attr->attach_type) {
+   case BPF_CGROUP_INET_INGRESS:
+   case BPF_CGROUP_INET_EGRESS:
+   cgrp = cgroup_get_from_fd(attr->target_fd);
+   if (IS_ERR(cgrp))
+   return PTR_ERR(cgrp);
+
+   cgroup_bpf_update(cgrp, NULL, attr->attach_type);
+   cgroup_put(cgrp);
+   break;
+
+   default:
+   return -EINVAL;
+   }
+
+   return 0;
+}
+#endif /* CONFIG_CGROUP_BPF */
+
 SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, 
size)
 {
union bpf_attr attr = {};
@@ -888,6 +959,16 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, 
uattr, unsigned int, siz
case BPF_OBJ_GET:
err = bpf_obj_get(&attr);
break;
+
+#ifdef CONFIG_CGROUP_BPF
+   case BPF_PROG_ATTACH:
+   err = bpf_prog_attach(&attr);
+   break;
+   case BPF_PROG_DETACH:
+   err = bpf_prog_detach(&attr);
+   break;
+#endif
+
default:
err = -EINVAL;
break;
-- 
2.5.5



[PATCH v4 6/6] samples: bpf: add userspace example for attaching eBPF programs to cgroups

2016-09-06 Thread Daniel Mack
Add a simple userspace program to demonstrate the new API to attach eBPF
programs to cgroups. This is what it does:

 * Create arraymap in kernel with 4 byte keys and 8 byte values

 * Load eBPF program

   The eBPF program accesses the map passed in to store two pieces of
   information. The number of invocations of the program, which maps
   to the number of packets received, is stored to key 0. Key 1 is
   incremented on each iteration by the number of bytes stored in
   the skb.

 * Detach any eBPF program previously attached to the cgroup

 * Attach the new program to the cgroup using BPF_PROG_ATTACH

 * Once a second, read map[0] and map[1] to see how many bytes and
   packets were seen on any socket of tasks in the given cgroup.

The program takes a cgroup path as 1st argument, and either "ingress"
or "egress" as 2nd. Optionally, "drop" can be passed as 3rd argument,
which will make the generated eBPF program return 0 instead of 1, so
the kernel will drop the packet.

libbpf gained two new wrappers for the new syscall commands.

Signed-off-by: Daniel Mack <dan...@zonque.org>
---
 samples/bpf/Makefile|   2 +
 samples/bpf/libbpf.c|  21 ++
 samples/bpf/libbpf.h|   3 +
 samples/bpf/test_cgrp2_attach.c | 147 
 4 files changed, 173 insertions(+)
 create mode 100644 samples/bpf/test_cgrp2_attach.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 12b7304..e4cdc74 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -22,6 +22,7 @@ hostprogs-y += spintest
 hostprogs-y += map_perf_test
 hostprogs-y += test_overhead
 hostprogs-y += test_cgrp2_array_pin
+hostprogs-y += test_cgrp2_attach
 hostprogs-y += xdp1
 hostprogs-y += xdp2
 hostprogs-y += test_current_task_under_cgroup
@@ -49,6 +50,7 @@ spintest-objs := bpf_load.o libbpf.o spintest_user.o
 map_perf_test-objs := bpf_load.o libbpf.o map_perf_test_user.o
 test_overhead-objs := bpf_load.o libbpf.o test_overhead_user.o
 test_cgrp2_array_pin-objs := libbpf.o test_cgrp2_array_pin.o
+test_cgrp2_attach-objs := libbpf.o test_cgrp2_attach.o
 xdp1-objs := bpf_load.o libbpf.o xdp1_user.o
 # reuse xdp1 source intentionally
 xdp2-objs := bpf_load.o libbpf.o xdp1_user.o
diff --git a/samples/bpf/libbpf.c b/samples/bpf/libbpf.c
index 9969e35..9ce707b 100644
--- a/samples/bpf/libbpf.c
+++ b/samples/bpf/libbpf.c
@@ -104,6 +104,27 @@ int bpf_prog_load(enum bpf_prog_type prog_type,
return syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));
 }
 
+int bpf_prog_attach(int prog_fd, int target_fd, enum bpf_attach_type type)
+{
+   union bpf_attr attr = {
+   .target_fd = target_fd,
+   .attach_bpf_fd = prog_fd,
+   .attach_type = type,
+   };
+
+   return syscall(__NR_bpf, BPF_PROG_ATTACH, &attr, sizeof(attr));
+}
+
+int bpf_prog_detach(int target_fd, enum bpf_attach_type type)
+{
+   union bpf_attr attr = {
+   .target_fd = target_fd,
+   .attach_type = type,
+   };
+
+   return syscall(__NR_bpf, BPF_PROG_DETACH, &attr, sizeof(attr));
+}
+
 int bpf_obj_pin(int fd, const char *pathname)
 {
union bpf_attr attr = {
diff --git a/samples/bpf/libbpf.h b/samples/bpf/libbpf.h
index 364582b..f973241 100644
--- a/samples/bpf/libbpf.h
+++ b/samples/bpf/libbpf.h
@@ -15,6 +15,9 @@ int bpf_prog_load(enum bpf_prog_type prog_type,
  const struct bpf_insn *insns, int insn_len,
  const char *license, int kern_version);
 
+int bpf_prog_attach(int prog_fd, int attachable_fd, enum bpf_attach_type type);
+int bpf_prog_detach(int attachable_fd, enum bpf_attach_type type);
+
 int bpf_obj_pin(int fd, const char *pathname);
 int bpf_obj_get(const char *pathname);
 
diff --git a/samples/bpf/test_cgrp2_attach.c b/samples/bpf/test_cgrp2_attach.c
new file mode 100644
index 000..19e4ec0
--- /dev/null
+++ b/samples/bpf/test_cgrp2_attach.c
@@ -0,0 +1,147 @@
+/* eBPF example program:
+ *
+ * - Creates arraymap in kernel with 4 byte keys and 8 byte values
+ *
+ * - Loads eBPF program
+ *
+ *   The eBPF program accesses the map passed in to store two pieces of
+ *   information. The number of invocations of the program, which maps
+ *   to the number of packets received, is stored to key 0. Key 1 is
+ *   incremented on each iteration by the number of bytes stored in
+ *   the skb.
+ *
+ * - Detaches any eBPF program previously attached to the cgroup
+ *
+ * - Attaches the new program to a cgroup using BPF_PROG_ATTACH
+ *
+ * - Every second, reads map[0] and map[1] to see how many bytes and
+ *   packets were seen on any socket of tasks in the given cgroup.
+ */
+
+#define _GNU_SOURCE
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+
+#include "libbpf.h"
+
+enum {
+   MAP_KEY_PACKETS,
+   MAP_KEY_BYTES,
+};
+
+static int prog_load(int map_fd, int verdict)
+{
+   struct bpf

[PATCH v4 0/6] Add eBPF hooks for cgroups

2016-09-06 Thread Daniel Mack
This is v4 of the patch set to allow eBPF programs for network
filtering and accounting to be attached to cgroups, so that they apply
to all sockets of all tasks placed in that cgroup. The logic also
allows it to be extended for other cgroup-based eBPF logic.

All the comments I got since v3 were addressed. FWIW, I left the
egress hook in __dev_queue_xmit() for now, as I don't currently see
any better place to put it. If we find one, we can still move the
hook around, and relax the !sk and sk->sk_family checks.


Changes from v3:

* Dropped the _FILTER suffix from BPF_PROG_TYPE_CGROUP_SOCKET_FILTER,
  renamed BPF_ATTACH_TYPE_CGROUP_INET_{E,IN}GRESS to
  BPF_CGROUP_INET_{IN,E}GRESS and alias BPF_MAX_ATTACH_TYPE to
  __BPF_MAX_ATTACH_TYPE, as suggested by Daniel Borkmann.

* Dropped the attach_flags member from the anonymous struct for BPF
  attach operations in union bpf_attr. They can be added later on via
  CHECK_ATTR. Requested by Daniel Borkmann and Alexei.

* Release old_prog at the end of __cgroup_bpf_update rather that at
  the beginning to fix a race gap between program updates and their
  users. Spotted by Daniel Borkmann.

* Plugged an skb leak when dropping packets on the egress path.
  Spotted by Daniel Borkmann.

* Add cgro...@vger.kernel.org to the loop, as suggested by Rami Rosen.

* Some minor coding style adoptions not worth mentioning in particular.


Changes from v2:

* Fixed the RCU locking details Tejun pointed out.

* Assert bpf_attr.flags == 0 in BPF_PROG_DETACH syscall handler.


Changes from v1:

* Moved all bpf specific cgroup code into its own file, and stub
  out related functions for !CONFIG_CGROUP_BPF as static inline nops.
  This way, the call sites are not cluttered with #ifdef guards while
  the feature remains compile-time configurable.

* Implemented the new scheme proposed by Tejun. Per cgroup, store one
  set of pointers that are pinned to the cgroup, and one for the
  programs that are effective. When a program is attached or detached,
  the change is propagated to all the cgroup's descendants. If a
  subcgroup has its own pinned program, skip the whole subbranch in
  order to allow delegation models.

* The hookup for egress packets is now done from __dev_queue_xmit().

* A static key is now used in both the ingress and egress fast paths
  to keep performance penalties close to zero if the feature is
  not in use.

* Overall cleanup to make the accessors use the program arrays.
  This should make it much easier to add new program types, which
  will then automatically follow the pinned vs. effective logic.

* Fixed locking issues, as pointed out by Eric Dumazet and Alexei
  Starovoitov. Changes to the program array are now done with
  xchg() and are protected by cgroup_mutex.

* eBPF programs are now expected to return 1 to let the packet pass,
  not >= 0. Pointed out by Alexei.

* Operation is now limited to INET sockets, so local AF_UNIX sockets
  are not affected. The enum members are renamed accordingly. In case
  other socket families should be supported, this can be extended in
  the future.

* The sample program learned to support both ingress and egress, and
  can now optionally make the eBPF program drop packets by making it
  return 0.


As always, feedback is much appreciated.

Thanks,
Daniel

Daniel Mack (6):
  bpf: add new prog type for cgroup socket filtering
  cgroup: add support for eBPF programs
  bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH commands
  net: filter: run cgroup eBPF ingress programs
  net: core: run cgroup eBPF egress programs
  samples: bpf: add userspace example for attaching eBPF programs to
cgroups

 include/linux/bpf-cgroup.h  |  70 +
 include/linux/cgroup-defs.h |   4 +
 include/uapi/linux/bpf.h|  17 +
 init/Kconfig|  12 +++
 kernel/bpf/Makefile |   1 +
 kernel/bpf/cgroup.c | 165 
 kernel/bpf/syscall.c|  81 
 kernel/bpf/verifier.c   |   1 +
 kernel/cgroup.c |  18 +
 net/core/dev.c  |   7 +-
 net/core/filter.c   |  10 +++
 samples/bpf/Makefile|   2 +
 samples/bpf/libbpf.c|  21 +
 samples/bpf/libbpf.h|   3 +
 samples/bpf/test_cgrp2_attach.c | 147 +++
 15 files changed, 558 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/bpf-cgroup.h
 create mode 100644 kernel/bpf/cgroup.c
 create mode 100644 samples/bpf/test_cgrp2_attach.c

-- 
2.5.5



Re: [PATCH v3 3/6] bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH commands

2016-09-05 Thread Daniel Mack
On 09/05/2016 08:32 PM, Alexei Starovoitov wrote:
> On 9/5/16 10:09 AM, Daniel Borkmann wrote:
>> On 09/05/2016 04:09 PM, Daniel Mack wrote:

>>> I really don't think it's worth sparing 8 bytes here and then do the
>>> binary compat dance after flags are added, for no real gain.
>>
>> Sure, but there's not much of a dance needed, see for example how map_flags
>> were added some time ago. So, iff there's really no foreseeable use-case in
>> sight and since we have this flexibility in place already, then I don't quite
>> follow why it's needed, if there's zero pain to add it later on. I would
>> understand it of course, if it cannot be handled later on anymore.
> 
> I agree with Daniel B. Since the flags are completely unused right now,
> there is no plan to use them for anything in the coming months, and,
> even worse, they make an annoying hole in the struct, so let's not
> add them. We can safely do that later; CHECK_ATTR() allows us to do it
> easily. It's not like the syscall itself, where flags are a must-have
> because we cannot add them later. Here it can be done trivially.

Okay then. If you both agree, I won't interfere :)


Daniel



Re: [PATCH v3 3/6] bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH commands

2016-09-05 Thread Daniel Mack
On 09/05/2016 05:30 PM, David Laight wrote:
> From: Daniel Mack
>>>> +
>>>> +  struct { /* anonymous struct used by BPF_PROG_ATTACH/DETACH commands */
>>>> +  __u32   target_fd;  /* container object to attach to */
>>>> +  __u32   attach_bpf_fd;  /* eBPF program to attach */
>>>> +  __u32   attach_type;    /* BPF_ATTACH_TYPE_* */
>>>> +  __u64   attach_flags;
>>>> +  };
>>>
>>> there is a 4 byte hole in this struct. Can we pack it differently?
>>
>> Okay - I swapped "type" and "flags" to repair this.
> 
> That just moves the pad to the end of the structure.
> Still likely to cause a problem for 32bit apps on a 64bit kernel.

What kind of problem do you have in mind? Again, this is embedded in a
union of much bigger total size, and the API is not used in any kind of
hot-path.

> If you can't think of any flags, why 64 of them?

I can't think of them right now, but this is about defining an API that
can be used in other contexts as well. Also, it doesn't matter at all;
they do no harm. IMO, it's just better to have them right away than to
do a binary compat dance once someone needs them.


Thanks,
Daniel



Re: [PATCH v3 2/6] cgroup: add support for eBPF programs

2016-09-05 Thread Daniel Mack
Hi,

On 08/30/2016 01:04 AM, Sargun Dhillon wrote:
> On Fri, Aug 26, 2016 at 09:58:48PM +0200, Daniel Mack wrote:
>> This patch adds two sets of eBPF program pointers to struct cgroup.
>> One for such that are directly pinned to a cgroup, and one for such
>> that are effective for it.
>>
>> To illustrate the logic behind that, assume the following example
>> cgroup hierarchy.
>>
>>   A - B - C
>>        \ D - E
>>
>> If only B has a program attached, it will be effective for B, C, D
>> and E. If D then attaches a program itself, that will be effective for
>> both D and E, and the program in B will only affect B and C. Only one
>> program of a given type is effective for a cgroup.
>>
> How does this work when running an orchestrator within an orchestrator? The
> Docker in Docker / Mesos in Mesos use case, where the top-level orchestrator
> is observing the traffic, and there is an orchestrator within that which also
> needs to run it.
> 
> In this case, I'd like to run E's filter, then if it returns 0, D's, and B's, 
> and so on.

Running multiple programs was an idea I had in one of my earlier drafts,
but after some discussion, I refrained from it again because potentially
walking the cgroup hierarchy on every packet is just too expensive.

> Is it possible to allow this, either by flattening out the
> datastructure (copy a ref to the bpf programs to C and E) or
> something similar?

That would mean we carry a list of eBPF program pointers of dynamic
size. IOW, the deeper inside the cgroup hierarchy, the bigger the list,
so it can store a reference to all programs of all of its ancestors.

While I think that would be possible, even at some later point, I'd
really like to avoid it for the sake of simplicity.
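
To make the pinned vs. effective distinction a bit more concrete, here
is a rough attach-time sketch. The names and the struct layout are
hypothetical, so this is not the code from this series; it only shows
why the per-packet path never needs to walk the hierarchy:

    #include <linux/bpf.h>
    #include <linux/cgroup.h>

    /* sketch only: called with cgroup_mutex held, RCU details omitted */
    static void sketch_update_effective(struct cgroup *cgrp,
                                        struct bpf_prog *prog,
                                        int type)
    {
            struct cgroup_subsys_state *pos;

            cgrp->bpf.pinned[type] = prog;

            /* pre-order walk: parents are visited before their children */
            css_for_each_descendant_pre(pos, &cgrp->self) {
                    struct cgroup *desc = container_of(pos, struct cgroup, self);
                    struct cgroup *parent = cgroup_parent(desc);

                    if (desc->bpf.pinned[type])
                            /* an own pinned program rules this sub-branch */
                            desc->bpf.effective[type] = desc->bpf.pinned[type];
                    else if (parent)
                            /* otherwise inherit what is effective above */
                            desc->bpf.effective[type] = parent->bpf.effective[type];
            }
    }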

Is there any reason why this can't be done in userspace? Compile a
program X for A, and overload it with Y, with Y doing the same as X
but adding some extra checks? Note that all users of the bpf(2) syscall
API will need CAP_NET_ADMIN anyway, so there is no delegation to
unprivileged sub-orchestrators or anything like that, really.


Thanks,
Daniel



Re: [PATCH v3 5/6] net: core: run cgroup eBPF egress programs

2016-09-05 Thread Daniel Mack
On 08/30/2016 12:03 AM, Daniel Borkmann wrote:
> On 08/26/2016 09:58 PM, Daniel Mack wrote:

>> diff --git a/net/core/dev.c b/net/core/dev.c
>> index a75df86..17484e6 100644
>> --- a/net/core/dev.c
>> +++ b/net/core/dev.c
>> @@ -141,6 +141,7 @@
>>   #include 
>>   #include 
>>   #include 
>> +#include 
>>
>>   #include "net-sysfs.h"
>>
>> @@ -3329,6 +3330,11 @@ static int __dev_queue_xmit(struct sk_buff *skb, void *accel_priv)
>>      if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_SCHED_TSTAMP))
>>              __skb_tstamp_tx(skb, NULL, skb->sk, SCM_TSTAMP_SCHED);
>>
>> +        rc = cgroup_bpf_run_filter(skb->sk, skb,
>> +                                   BPF_ATTACH_TYPE_CGROUP_INET_EGRESS);
>> +        if (rc)
>> +                return rc;
> 
> This would leak the whole skb by the way.

Ah, right.
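
For the record, the fix amounts to freeing the skb before returning the
verdict. A rough sketch, assuming the hook stays in __dev_queue_xmit()
as in the quoted hunk (the return value is kept as in the original and
may still need adjusting):

            rc = cgroup_bpf_run_filter(skb->sk, skb,
                                       BPF_ATTACH_TYPE_CGROUP_INET_EGRESS);
            if (rc) {
                    /* drop verdict: free the skb instead of leaking it */
                    kfree_skb(skb);
                    return rc;
            }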

> Apart from that, could this be modeled w/o affecting the forwarding path (at
> some local output point where we know to have a valid socket)? Then you could
> also drop the !sk and sk->sk_family tests, and we wouldn't need to replicate
> parts of what clsact is doing as well. Hmm, maybe access to src/dst mac could
> be handled to be just zeroes since not available at that point?

Hmm, I wonder where this hook could be put instead, then. When placed in
ip_output() and ip6_output(), the MAC headers cannot be pushed before
running the program, so the eBPF program would see bogus skb data.

Also, if I read the code correctly, ip[6]_output is not called for
multicast packets.

Any other ideas?


Thanks,
Daniel



Re: [PATCH v3 3/6] bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH commands

2016-09-05 Thread Daniel Mack
On 09/05/2016 03:56 PM, Daniel Borkmann wrote:
> On 09/05/2016 02:54 PM, Daniel Mack wrote:
>> On 08/30/2016 01:00 AM, Daniel Borkmann wrote:
>>> On 08/26/2016 09:58 PM, Daniel Mack wrote:
>>
>>>>enum bpf_map_type {
>>>> @@ -147,6 +149,13 @@ union bpf_attr {
>>>>__aligned_u64   pathname;
>>>>__u32   bpf_fd;
>>>>};
>>>> +
>>>> +  struct { /* anonymous struct used by BPF_PROG_ATTACH/DETACH commands */
>>>> +  __u32   target_fd;  /* container object to attach to */
>>>> +  __u32   attach_bpf_fd;  /* eBPF program to attach */
>>>> +  __u32   attach_type;    /* BPF_ATTACH_TYPE_* */
>>>> +  __u64   attach_flags;
>>>
>>> Could we just do ...
>>>
>>> __u32 dst_fd;
>>> __u32 src_fd;
>>> __u32 attach_type;
>>>
>>> ... and leave flags out, since unused anyway? Also see below.
>>
>> I'd really like to keep the flags, even if they're unused right now.
>> This only adds 8 bytes during the syscall operation, so it doesn't harm.
>> However, we cannot change the userspace API after the fact, and who
>> knows what this (rather generic) interface will be used for later on.
> 
> With the below suggestion added, the flags don't need to be added
> now, as that can be done safely at a later point in time while
> respecting old binaries. See also the syscall handling code
> in kernel/bpf/syscall.c +825 and the CHECK_ATTR() macro. The
> underlying idea of this was taken from the perf_event_open() syscall
> back then, see [1] for a summary.
> 
>[1] https://lkml.org/lkml/2014/8/26/116

Yes, I know that's possible, and I like the idea, but I don't think any
new interface should come without flags really, as flags are something
that will most certainly be needed at some point anyway. I didn't have
them in my first shot, but Alexei pointed out that they should be added,
and I agree.

Also, this optimization wouldn't make the transported struct payload any
smaller anyway, because the member of that union used by BPF_PROG_LOAD
is still by far the biggest.

I really don't think it's worth sparing 8 bytes here and then do the
binary compat dance after flags are added, for no real gain.
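
For readers not familiar with the CHECK_ATTR() mechanism referenced
above: the idea is that a command only "knows" the fields up to its last
defined member, and the kernel rejects the call if any bytes after that
are non-zero, which is what makes appending a flags field later a safe,
compatible change. A simplified illustration, not the actual macro from
kernel/bpf/syscall.c, with a made-up helper name:

    #include <linux/bpf.h>
    #include <linux/kernel.h>
    #include <linux/string.h>

    /* true if userspace set bytes this kernel does not know about;
     * e.g. called with offsetofend(union bpf_attr, attach_type) */
    static bool attr_has_unknown_bytes(const union bpf_attr *attr,
                                       size_t last_field_end)
    {
            return memchr_inv((const char *)attr + last_field_end, 0,
                              sizeof(*attr) - last_field_end) != NULL;
    }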



Thanks,
Daniel



Re: [PATCH v3 3/6] bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH commands

2016-09-05 Thread Daniel Mack
On 08/27/2016 02:08 AM, Alexei Starovoitov wrote:
> On Fri, Aug 26, 2016 at 09:58:49PM +0200, Daniel Mack wrote:

>> +
>> +  struct { /* anonymous struct used by BPF_PROG_ATTACH/DETACH commands */
>> +          __u32   target_fd;      /* container object to attach to */
>> +          __u32   attach_bpf_fd;  /* eBPF program to attach */
>> +          __u32   attach_type;    /* BPF_ATTACH_TYPE_* */
>> +          __u64   attach_flags;
>> +  };
> 
> there is a 4 byte hole in this struct. Can we pack it differently?

Okay - I swapped "type" and "flags" to repair this.
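
For reference, the repacked layout then looks roughly like this (sketch
only; with the __u64 moved onto an 8-byte boundary the interior hole is
gone, although, as David Laight notes elsewhere in the thread, 4 bytes
of padding now sit at the end of the struct):

    struct { /* anonymous struct used by BPF_PROG_ATTACH/DETACH commands */
            __u32   target_fd;      /* container object to attach to */
            __u32   attach_bpf_fd;  /* eBPF program to attach */
            __u64   attach_flags;
            __u32   attach_type;    /* BPF_ATTACH_TYPE_* */
    };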

>> +  switch (attr->attach_type) {
>> +  case BPF_ATTACH_TYPE_CGROUP_INET_INGRESS:
>> +  case BPF_ATTACH_TYPE_CGROUP_INET_EGRESS: {
>> +          struct cgroup *cgrp;
>> +
>> +          prog = bpf_prog_get_type(attr->attach_bpf_fd,
>> +                                   BPF_PROG_TYPE_CGROUP_SOCKET_FILTER);
>> +          if (IS_ERR(prog))
>> +                  return PTR_ERR(prog);
>> +
>> +          cgrp = cgroup_get_from_fd(attr->target_fd);
>> +          if (IS_ERR(cgrp)) {
>> +                  bpf_prog_put(prog);
>> +                  return PTR_ERR(cgrp);
>> +          }
>> +
>> +          cgroup_bpf_update(cgrp, prog, attr->attach_type);
>> +          cgroup_put(cgrp);
>> +
>> +          break;
>> +  }
> 
> this } formatting style is confusing. The above } looks
> like it matches 'switch () {'.
> Maybe move 'struct cgroup *cgrp' to the top to avoid that?

I kept it local to its users, but you're right, it's not worth it. Will
change.


Thanks,
Daniel



