Re: [PATCH bpf-next] rethook: Remove warning messages printed for finding return address of a frame.

2024-04-03 Thread Daniel Borkmann

On 4/2/24 6:58 PM, Andrii Nakryiko wrote:

On Mon, Apr 1, 2024 at 12:16 PM Kui-Feng Lee  wrote:


rethook_find_ret_addr() prints a warning message and returns 0 when the
target task is running and is not the "current" task, in order to prevent
returning an incorrect return address. However, this check is incomplete:
the target task can still transition to the running state while the return
address is being looked up, and doing so is safe under RCU.

The issue we encounter is that the kernel frequently prints these warning
messages when BPF profiling programs call bpf_get_task_stack() on running
tasks.

The callers should be aware of, and willing to accept, the risk of receiving
an incorrect return address from a task that is currently running and is not
the "current" one. A warning is not needed here, since the callers knowingly
opt into that behavior.
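
(For illustration only, not part of the patch: a profiler-style task iterator
along the lines of the sketch below is the kind of caller meant here; the map,
program, and value-struct names are made up.)

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

#define MAX_DEPTH 32

struct stack_val {
	__u64 ip[MAX_DEPTH];
};

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 1024);
	__type(key, __u32);
	__type(value, struct stack_val);
} stacks SEC(".maps");

SEC("iter/task")
int dump_task_stack(struct bpf_iter__task *ctx)
{
	struct task_struct *task = ctx->task;
	struct stack_val val = {};
	__u32 pid;

	if (!task)
		return 0;
	pid = task->pid;
	/* task may be running on another CPU; with the warning removed this
	 * simply yields a best-effort (possibly empty) stack
	 */
	bpf_get_task_stack(task, val.ip, sizeof(val.ip), 0);
	bpf_map_update_elem(&stacks, &pid, &val, BPF_ANY);
	return 0;
}

char _license[] SEC("license") = "GPL";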

Signed-off-by: Kui-Feng Lee 
---
  kernel/trace/rethook.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/trace/rethook.c b/kernel/trace/rethook.c
index fa03094e9e69..4297a132a7ae 100644
--- a/kernel/trace/rethook.c
+++ b/kernel/trace/rethook.c
@@ -248,7 +248,7 @@ unsigned long rethook_find_ret_addr(struct task_struct 
*tsk, unsigned long frame
 if (WARN_ON_ONCE(!cur))
 return 0;

-   if (WARN_ON_ONCE(tsk != current && task_is_running(tsk)))
+   if (tsk != current && task_is_running(tsk))
 return 0;



This should probably go through Masami's tree, but the change makes
sense to me, given this is an expected condition.

Acked-by: Andrii Nakryiko 


Masami, I assume you'll pick this up?

Thanks,
Daniel



Re: [PATCH net-next 16/24] net: netkit, veth, tun, virt*: Use nested-BH locking for XDP redirect.

2023-12-18 Thread Daniel Borkmann

Hi Sebastian,

On 12/15/23 6:07 PM, Sebastian Andrzej Siewior wrote:

The per-CPU variables used during bpf_prog_run_xdp() invocation and
later during xdp_do_redirect() rely on disabled BH for their protection.
Without locking in local_bh_disable() on PREEMPT_RT, these data structures
require explicit locking.

This is a follow-up on the previous change which introduced
bpf_run_lock.redirect_lock and uses it now within drivers.

The simple way is to acquire the lock before bpf_prog_run_xdp() is
invoked and hold it until the end of the function.
This does not always work because some drivers (cpsw, atlantic) invoke
xdp_do_flush() in the same context.
Acquiring the lock in bpf_prog_run_xdp() and dropping it in
xdp_do_redirect() (without touching drivers) does not work either, because
not all drivers that use bpf_prog_run_xdp() support XDP_REDIRECT (and
invoke xdp_do_redirect()).

Ideally the minimal locking scope would be bpf_prog_run_xdp() +
xdp_do_redirect() and everything else (error recovery, DMA unmapping,
free/alloc of memory, …) would happen outside of the locked section.
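
(Editorial sketch, not from this series: roughly what that minimal scope would
look like in a hypothetical driver; example_ring and the function itself are
made up, while bpf_run_lock.redirect_lock is the lock introduced by the
previous change.)

static u32 example_run_xdp(struct example_ring *ring, struct xdp_buff *xdp)
{
	struct bpf_prog *prog = READ_ONCE(ring->xdp_prog);
	u32 act;

	/* lock only around the program run plus the redirect ... */
	scoped_guard(local_lock_nested_bh, &bpf_run_lock.redirect_lock) {
		act = bpf_prog_run_xdp(prog, xdp);
		if (act == XDP_REDIRECT &&
		    xdp_do_redirect(ring->netdev, xdp, prog) < 0)
			act = XDP_DROP;
	}
	/* ... while error recovery, DMA unmapping and free/alloc of memory
	 * stay outside of the locked section
	 */
	return act;
}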

[...]


  drivers/net/hyperv/netvsc_bpf.c |  1 +
  drivers/net/netkit.c| 13 +++
  drivers/net/tun.c   | 28 +--
  drivers/net/veth.c  | 40 -
  drivers/net/virtio_net.c|  1 +
  drivers/net/xen-netfront.c  |  1 +
  6 files changed, 52 insertions(+), 32 deletions(-)

[...]

Please exclude netkit from this set given it does not support XDP, but
instead only accepts tc BPF typed programs.

Thanks,
Daniel


diff --git a/drivers/net/netkit.c b/drivers/net/netkit.c
index 39171380ccf29..fbcf78477bda8 100644
--- a/drivers/net/netkit.c
+++ b/drivers/net/netkit.c
@@ -80,8 +80,15 @@ static netdev_tx_t netkit_xmit(struct sk_buff *skb, struct 
net_device *dev)
netkit_prep_forward(skb, !net_eq(dev_net(dev), dev_net(peer)));
skb->dev = peer;
entry = rcu_dereference(nk->active);
-   if (entry)
-   ret = netkit_run(entry, skb, ret);
+   if (entry) {
+   scoped_guard(local_lock_nested_bh, &bpf_run_lock.redirect_lock) {
+   ret = netkit_run(entry, skb, ret);
+   if (ret == NETKIT_REDIRECT) {
+   dev_sw_netstats_tx_add(dev, 1, len);
+   skb_do_redirect(skb);
+   }
+   }
+   }
switch (ret) {
case NETKIT_NEXT:
case NETKIT_PASS:
@@ -95,8 +102,6 @@ static netdev_tx_t netkit_xmit(struct sk_buff *skb, struct 
net_device *dev)
}
break;
case NETKIT_REDIRECT:
-   dev_sw_netstats_tx_add(dev, 1, len);
-   skb_do_redirect(skb);
break;
case NETKIT_DROP:
default:




Re: [PATCH net] bpf: test_run: fix WARNING in format_decode

2023-11-27 Thread Daniel Borkmann

On 11/22/23 6:28 AM, Yonghong Song wrote:

On 11/21/23 7:50 PM, Edward Adam Davis wrote:

Confirm that skb->len is not 0 to ensure that skb length is valid.

Fixes: 114039b34201 ("bpf: Move skb->len == 0 checks into __bpf_redirect")
Reported-by: syzbot+e2c932aec5c8a6e1d...@syzkaller.appspotmail.com
Signed-off-by: Edward Adam Davis 


Stan, Could you take a look at this patch?


I think this only papers over the bug... also BPF selftests seem to break
with this change.

Looking again at the syzkaller trace:

  [...]
  Please remove unsupported %\0 in format string
  WARNING: CPU: 0 PID: 5068 at lib/vsprintf.c:2675 format_decode+0xa03/0xba0 
lib/vsprintf.c:2675
  [...]

We need to fix bpf_bprintf_prepare() instead to reject invalid fmts such
as %0 and similar.
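
(Sketch only, to make the intent concrete; this is not the actual
bpf_bprintf_prepare() code, and the accepted specifier list below is purely
illustrative.)

static int check_fmt_specifiers(const char *fmt, size_t fmt_size)
{
	size_t i;

	for (i = 0; i < fmt_size; i++) {
		if (fmt[i] != '%')
			continue;
		if (i + 1 >= fmt_size)
			return -EINVAL;		/* trailing '%' */
		i++;
		switch (fmt[i]) {
		case '%': case 'd': case 'i': case 'u':
		case 'x': case 'X': case 's': case 'p':
			break;			/* supported conversion */
		default:
			return -EINVAL;		/* e.g. the "%\0" from the trace */
		}
	}
	return 0;
}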


  net/bpf/test_run.c | 3 +++
  1 file changed, 3 insertions(+)

diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c
index c9fdcc5cdce1..78258a822a5c 100644
--- a/net/bpf/test_run.c
+++ b/net/bpf/test_run.c
@@ -845,6 +845,9 @@ static int convert___skb_to_skb(struct sk_buff *skb, struct 
__sk_buff *__skb)
  {
  struct qdisc_skb_cb *cb = (struct qdisc_skb_cb *)skb->cb;
+    if (!skb->len)
+    return -EINVAL;
+
  if (!__skb)
  return 0;





Re: [PATCH bpf-next v2 3/4] libbpf: add low level TC-BPF API

2021-04-19 Thread Daniel Borkmann

On 4/19/21 11:43 PM, Toke Høiland-Jørgensen wrote:

Daniel Borkmann  writes:

On 4/19/21 2:18 PM, Kumar Kartikeya Dwivedi wrote:

This adds functions that wrap the netlink API used for adding,
manipulating, and removing traffic control filters. These functions
operate directly on the loaded prog's fd, and return a handle to the
filter using an out parameter named id.

The basic featureset is covered to allow for attaching, manipulation of
properties, and removal of filters. Some additional features like
TCA_BPF_POLICE and TCA_RATE for tc_cls have been omitted. These can be
added on top later by extending the bpf_tc_cls_opts struct.

Support for binding actions directly to a classifier by passing them in
during filter creation has also been omitted for now. These actions have
an auto-cleanup property because their lifetime is bound to the filter
they are attached to. This can be added later, but was omitted for now
as direct-action mode, which is enabled by default, is a better
alternative to it.

An API summary:

bpf_tc_act_{attach, change, replace} may be used to attach, change, and


typo on bpf_tc_act_{...} ?
 ^^^

replace SCHED_CLS bpf classifier. The protocol field can be set as 0, in
which case it is substituted as ETH_P_ALL by default.


Do you have an actual user that needs anything other than ETH_P_ALL? Why is it
even needed? Why not stick to just ETH_P_ALL?


The behavior of the three functions is as follows:

attach = create filter if it does not exist, fail otherwise
change = change properties of the classifier of existing filter
replace = create filter, and replace any existing filter


This touches on tc oddities quite a bit. Why do we need to expose them? Can't we
simplify/abstract this e.g. i) create or update instance, ii) delete instance,
iii) get instance ? What concrete use case do you have that you need those three
above?
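
(Purely to illustrate the shape being asked for; hypothetical prototypes, not
an agreed-upon API.)

int bpf_tc_attach(int prog_fd, int ifindex, enum bpf_tc_attach_point where);   /* i) create or update */
int bpf_tc_detach(int ifindex, enum bpf_tc_attach_point where);                /* ii) delete */
int bpf_tc_query(int ifindex, enum bpf_tc_attach_point where, __u32 *prog_id); /* iii) get */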


bpf_tc_cls_detach may be used to detach an existing SCHED_CLS
filter. The bpf_tc_cls_attach_id object filled in during attach,
change, or replace must be passed in to the detach functions for them to
remove the filter and its attached classifier correctly.

bpf_tc_cls_get_info is a helper that can be used to obtain attributes
for the filter and classifier. The opts structure may be used to
choose the granularity of search, such that info for a specific filter
corresponding to the same loaded bpf program can be obtained. By
default, the first match is returned to the user.

Examples:

struct bpf_tc_cls_attach_id id = {};
struct bpf_object *obj;
struct bpf_program *p;
int fd, r;

obj = bpf_object_open("foo.o");
if (IS_ERR_OR_NULL(obj))
return PTR_ERR(obj);

p = bpf_object__find_program_by_title(obj, "classifier");
if (IS_ERR_OR_NULL(p))
return PTR_ERR(p);

if (bpf_object__load(obj) < 0)
return -1;

fd = bpf_program__fd(p);

r = bpf_tc_cls_attach(fd, if_nametoindex("lo"),
  BPF_TC_CLSACT_INGRESS,
  NULL, &id);
if (r < 0)
return r;

... which is roughly equivalent to (after clsact qdisc setup):
# tc filter add dev lo ingress bpf obj foo.o sec classifier da

... as direct action mode is always enabled.

If a user wishes to modify existing options on an attached classifier,
bpf_tc_cls_change API may be used.

Only the class_id parameter can be modified; the rest are filled in to
identify the correct filter. protocol can be left out if it was not
chosen explicitly (defaulting to ETH_P_ALL).

Example:

/* Optional parameters necessary to select the right filter */
DECLARE_LIBBPF_OPTS(bpf_tc_cls_opts, opts,
.handle = id.handle,
.priority = id.priority,
.chain_index = id.chain_index);


Why do we need chain_index as part of the basic API?


opts.class_id = TC_H_MAKE(1UL << 16, 12);
r = bpf_tc_cls_change(fd, if_nametoindex("lo"),
  BPF_TC_CLSACT_INGRESS,
  &opts, &id);


Also, I'm not sure whether the prefix should even be named bpf_tc_cls_*() tbh,
yes, despite being "low level" api. I think in the context of bpf we should stop
regarding this as 'classifier' and 'action' objects since it's really just a
single entity and not separate ones. It's weird enough to explain this concept
to new users and if a libbpf based api could cleanly abstract it, I would be all
for it. I don't think we need to map 1:1 the old tc legacy even in the low level
api, tbh, as it feels backwards. I think the 'handle' & 'priority' bits are okay,
but I would remove the others.


Hmm, I'm OK with dropping the TC oddities (including the cls_ in the
name), but I think we should be documenting it so that users that do
come from TC will not be completel

Re: [PATCH bpf-next v2 3/4] libbpf: add low level TC-BPF API

2021-04-19 Thread Daniel Borkmann

On 4/19/21 2:18 PM, Kumar Kartikeya Dwivedi wrote:

This adds functions that wrap the netlink API used for adding,
manipulating, and removing traffic control filters. These functions
operate directly on the loaded prog's fd, and return a handle to the
filter using an out parameter named id.

The basic featureset is covered to allow for attaching, manipulation of
properties, and removal of filters. Some additional features like
TCA_BPF_POLICE and TCA_RATE for tc_cls have been omitted. These can be
added on top later by extending the bpf_tc_cls_opts struct.

Support for binding actions directly to a classifier by passing them in
during filter creation has also been omitted for now. These actions have
an auto-cleanup property because their lifetime is bound to the filter
they are attached to. This can be added later, but was omitted for now
as direct-action mode, which is enabled by default, is a better
alternative to it.

An API summary:

bpf_tc_act_{attach, change, replace} may be used to attach, change, and


typo on bpf_tc_act_{...} ?
   ^^^

replace SCHED_CLS bpf classifier. The protocol field can be set as 0, in
which case it is substituted as ETH_P_ALL by default.


Do you have an actual user that needs anything other than ETH_P_ALL? Why is it
even needed? Why not stick to just ETH_P_ALL?


The behavior of the three functions is as follows:

attach = create filter if it does not exist, fail otherwise
change = change properties of the classifier of existing filter
replace = create filter, and replace any existing filter


This touches on tc oddities quite a bit. Why do we need to expose them? Can't we
simplify/abstract this e.g. i) create or update instance, ii) delete instance,
iii) get instance ? What concrete use case do you have that you need those three
above?


bpf_tc_cls_detach may be used to detach an existing SCHED_CLS
filter. The bpf_tc_cls_attach_id object filled in during attach,
change, or replace must be passed in to the detach functions for them to
remove the filter and its attached classifier correctly.

bpf_tc_cls_get_info is a helper that can be used to obtain attributes
for the filter and classifier. The opts structure may be used to
choose the granularity of search, such that info for a specific filter
corresponding to the same loaded bpf program can be obtained. By
default, the first match is returned to the user.

Examples:

struct bpf_tc_cls_attach_id id = {};
struct bpf_object *obj;
struct bpf_program *p;
int fd, r;

obj = bpf_object_open("foo.o");
if (IS_ERR_OR_NULL(obj))
return PTR_ERR(obj);

p = bpf_object__find_program_by_title(obj, "classifier");
if (IS_ERR_OR_NULL(p))
return PTR_ERR(p);

if (bpf_object__load(obj) < 0)
return -1;

fd = bpf_program__fd(p);

r = bpf_tc_cls_attach(fd, if_nametoindex("lo"),
  BPF_TC_CLSACT_INGRESS,
  NULL, &id);
if (r < 0)
return r;

... which is roughly equivalent to (after clsact qdisc setup):
   # tc filter add dev lo ingress bpf obj foo.o sec classifier da

... as direct action mode is always enabled.

If a user wishes to modify existing options on an attached classifier,
bpf_tc_cls_change API may be used.

Only the class_id parameter can be modified; the rest are filled in to
identify the correct filter. protocol can be left out if it was not
chosen explicitly (defaulting to ETH_P_ALL).

Example:

/* Optional parameters necessary to select the right filter */
DECLARE_LIBBPF_OPTS(bpf_tc_cls_opts, opts,
.handle = id.handle,
.priority = id.priority,
.chain_index = id.chain_index);


Why do we need chain_index as part of the basic API?


opts.class_id = TC_H_MAKE(1UL << 16, 12);
r = bpf_tc_cls_change(fd, if_nametoindex("lo"),
  BPF_TC_CLSACT_INGRESS,
  &opts, &id);


Also, I'm not sure whether the prefix should even be named bpf_tc_cls_*() tbh,
yes, despite being "low level" api. I think in the context of bpf we should stop
regarding this as 'classifier' and 'action' objects since it's really just a
single entity and not separate ones. It's weird enough to explain this concept
to new users and if a libbpf based api could cleanly abstract it, I would be all
for it. I don't think we need to map 1:1 the old tc legacy even in the low level
api, tbh, as it feels backwards. I think the 'handle' & 'priority' bits are okay,
but I would remove the others.


if (r < 0)
return r;

struct bpf_tc_cls_info info = {};
r = bpf_tc_cls_get_info(fd, if_nametoindex("lo"),
BPF_TC_CLSACT_INGRESS,
&opts, &info);
if (r < 0)
return r;

assert(info.class_id == 

Re: [PATCH bpf-next 3/5] libbpf: add low level TC-BPF API

2021-04-15 Thread Daniel Borkmann

On 4/16/21 12:22 AM, Andrii Nakryiko wrote:

On Thu, Apr 15, 2021 at 3:10 PM Daniel Borkmann  wrote:

On 4/15/21 1:58 AM, Andrii Nakryiko wrote:

On Wed, Apr 14, 2021 at 4:32 PM Daniel Borkmann  wrote:

On 4/15/21 1:19 AM, Andrii Nakryiko wrote:

On Wed, Apr 14, 2021 at 3:51 PM Toke Høiland-Jørgensen  wrote:

Andrii Nakryiko  writes:

On Wed, Apr 14, 2021 at 3:58 AM Toke Høiland-Jørgensen  wrote:

Andrii Nakryiko  writes:

On Tue, Apr 6, 2021 at 3:06 AM Toke Høiland-Jørgensen  wrote:

Andrii Nakryiko  writes:

On Sat, Apr 3, 2021 at 10:47 AM Alexei Starovoitov
 wrote:

On Sat, Apr 03, 2021 at 12:38:06AM +0530, Kumar Kartikeya Dwivedi wrote:

On Sat, Apr 03, 2021 at 12:02:14AM IST, Alexei Starovoitov wrote:

On Fri, Apr 2, 2021 at 8:27 AM Kumar Kartikeya Dwivedi  wrote:

[...]


All of these things are messy because of tc legacy. bpf tried to follow tc style
with cls and act distinction and it didn't quite work. cls with
direct-action is the only
thing that became mainstream while tc style attach wasn't really addressed.
There were several incidents where tc had tens of thousands of progs attached
because of this attach/query/index weirdness described above.
I think the only way to address this properly is to introduce bpf_link style of
attaching to tc. Such bpf_link would support ingress/egress only.
direct-action will be implied. There won't be any index and query
will be obvious.


Note that we already have bpf_link support working (without support for pinning
of course) in a limited way. The ifindex, protocol, parent_id, priority, handle,
chain_index tuple uniquely identifies a filter, so we stash this in the bpf_link
and are able to operate on the exact filter during release.


Except they're not unique. The library can stash them, but something else
doing detach via iproute2 or their own netlink calls will detach the prog.
This other app can attach to the same spot a different prog and now
bpf_link__destroy will be detaching somebody else's prog.


So I would like to propose to take this patch set a step further from
what Daniel said:
int bpf_tc_attach(prog_fd, ifindex, {INGRESS,EGRESS}):
and make this proposed api to return FD.
To detach from tc ingress/egress just close(fd).


You mean adding an fd-based TC API to the kernel?


yes.


I'm totally for bpf_link-based TC attachment.

But I think *also* having "legacy" netlink-based APIs will allow
applications to handle older kernels in a much nicer way without extra
dependency on iproute2. We have a similar situation with kprobe, where
currently libbpf only supports "modern" fd-based attachment, but users
periodically ask questions and struggle to figure out issues on older
kernels that don't support new APIs.


+1; I am OK with adding a new bpf_link-based way to attach TC programs,
but we still need to support the netlink API in libbpf.


So I think we'd have to support legacy TC APIs, but I agree with
Alexei and Daniel that we should keep it to the simplest and most
straightforward API of supporting direction-action attachments and
setting up qdisc transparently (if I'm getting all the terminology
right, after reading Quentin's blog post). That coincidentally should
probably match how bpf_link-based TC API will look like, so all that
can be abstracted behind a single bpf_link__attach_tc() API as well,
right? That's the plan for dealing with kprobe right now, btw. Libbpf
will detect the best available API and transparently fall back (maybe
with some warning for awareness, due to inherent downsides of legacy
APIs: no auto-cleanup being the most prominent one).
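
(Very rough sketch of that detect-and-fallback idea; both attach_tc_* helpers
below are hypothetical placeholders, not existing libbpf functions.)

struct bpf_link *bpf_program__attach_tc(struct bpf_program *prog,
					int ifindex, bool ingress)
{
	struct bpf_link *link;

	/* try the modern, kernel-assisted attachment first */
	link = attach_tc_bpf_link(prog, ifindex, ingress);
	if (link)
		return link;

	/* fall back to legacy netlink attach, warning about the downsides */
	pr_warn("tc: falling back to netlink attach, no auto-cleanup on exit\n");
	return attach_tc_netlink(prog, ifindex, ingress);
}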


Yup, SGTM: Expose both in the low-level API (in bpf.c), and make the
high-level API auto-detect. That way users can also still use the
netlink attach function if they don't want the fd-based auto-close
behaviour of bpf_link.


So I thought a bit more about this, and it feels like the right move
would be to expose only higher-level TC BPF API behind bpf_link. It
will keep the API complexity and amount of APIs that libbpf will have
to support to the minimum, and will keep the API itself simple:
direct-attach with the minimum amount of input arguments. By not
exposing low-level APIs we also table the whole bpf_tc_cls_attach_id
design discussion, as we now can keep as much info as needed inside
bpf_link_tc (which will embed bpf_link internally as well) to support
detachment and possibly some additional querying, if needed.


But then there would be no way for the caller to explicitly select a
mechanism? I.e., if I write a BPF program using this mechanism targeting
a 5.12 kernel, I'll get netlink attachment, which can stick around when
I do bpf_link__disconnect(). But then if the kernel gets upgraded to
support bpf_link for TC programs I'll suddenly transparently get
bpf_link and the attachments will go away unless I pin them. This
seems... less than ideal?


That's what we are doing with bpf_program__attach_kprobe(), though.
And so far I've only seen people (privately) sayi

Re: [PATCH bpf-next 3/5] libbpf: add low level TC-BPF API

2021-04-15 Thread Daniel Borkmann

On 4/15/21 1:58 AM, Andrii Nakryiko wrote:

On Wed, Apr 14, 2021 at 4:32 PM Daniel Borkmann  wrote:

On 4/15/21 1:19 AM, Andrii Nakryiko wrote:

On Wed, Apr 14, 2021 at 3:51 PM Toke Høiland-Jørgensen  wrote:

Andrii Nakryiko  writes:

On Wed, Apr 14, 2021 at 3:58 AM Toke Høiland-Jørgensen  wrote:

Andrii Nakryiko  writes:

On Tue, Apr 6, 2021 at 3:06 AM Toke Høiland-Jørgensen  wrote:

Andrii Nakryiko  writes:

On Sat, Apr 3, 2021 at 10:47 AM Alexei Starovoitov
 wrote:

On Sat, Apr 03, 2021 at 12:38:06AM +0530, Kumar Kartikeya Dwivedi wrote:

On Sat, Apr 03, 2021 at 12:02:14AM IST, Alexei Starovoitov wrote:

On Fri, Apr 2, 2021 at 8:27 AM Kumar Kartikeya Dwivedi  wrote:

[...]


All of these things are messy because of tc legacy. bpf tried to follow tc style
with cls and act distinction and it didn't quite work. cls with
direct-action is the only
thing that became mainstream while tc style attach wasn't really addressed.
There were several incidents where tc had tens of thousands of progs attached
because of this attach/query/index weirdness described above.
I think the only way to address this properly is to introduce bpf_link style of
attaching to tc. Such bpf_link would support ingress/egress only.
direct-action will be implied. There won't be any index and query
will be obvious.


Note that we already have bpf_link support working (without support for pinning
of course) in a limited way. The ifindex, protocol, parent_id, priority, handle,
chain_index tuple uniquely identifies a filter, so we stash this in the bpf_link
and are able to operate on the exact filter during release.


Except they're not unique. The library can stash them, but something else
doing detach via iproute2 or their own netlink calls will detach the prog.
This other app can attach to the same spot a different prog and now
bpf_link__destroy will be detaching somebody else's prog.


So I would like to propose to take this patch set a step further from
what Daniel said:
int bpf_tc_attach(prog_fd, ifindex, {INGRESS,EGRESS}):
and make this proposed api to return FD.
To detach from tc ingress/egress just close(fd).


You mean adding an fd-based TC API to the kernel?


yes.


I'm totally for bpf_link-based TC attachment.

But I think *also* having "legacy" netlink-based APIs will allow
applications to handle older kernels in a much nicer way without extra
dependency on iproute2. We have a similar situation with kprobe, where
currently libbpf only supports "modern" fd-based attachment, but users
periodically ask questions and struggle to figure out issues on older
kernels that don't support new APIs.


+1; I am OK with adding a new bpf_link-based way to attach TC programs,
but we still need to support the netlink API in libbpf.


So I think we'd have to support legacy TC APIs, but I agree with
Alexei and Daniel that we should keep it to the simplest and most
straightforward API of supporting direction-action attachments and
setting up qdisc transparently (if I'm getting all the terminology
right, after reading Quentin's blog post). That coincidentally should
probably match how bpf_link-based TC API will look like, so all that
can be abstracted behind a single bpf_link__attach_tc() API as well,
right? That's the plan for dealing with kprobe right now, btw. Libbpf
will detect the best available API and transparently fall back (maybe
with some warning for awareness, due to inherent downsides of legacy
APIs: no auto-cleanup being the most prominent one).


Yup, SGTM: Expose both in the low-level API (in bpf.c), and make the
high-level API auto-detect. That way users can also still use the
netlink attach function if they don't want the fd-based auto-close
behaviour of bpf_link.


So I thought a bit more about this, and it feels like the right move
would be to expose only higher-level TC BPF API behind bpf_link. It
will keep the API complexity and amount of APIs that libbpf will have
to support to the minimum, and will keep the API itself simple:
direct-attach with the minimum amount of input arguments. By not
exposing low-level APIs we also table the whole bpf_tc_cls_attach_id
design discussion, as we now can keep as much info as needed inside
bpf_link_tc (which will embed bpf_link internally as well) to support
detachment and possibly some additional querying, if needed.


But then there would be no way for the caller to explicitly select a
mechanism? I.e., if I write a BPF program using this mechanism targeting
a 5.12 kernel, I'll get netlink attachment, which can stick around when
I do bpf_link__disconnect(). But then if the kernel gets upgraded to
support bpf_link for TC programs I'll suddenly transparently get
bpf_link and the attachments will go away unless I pin them. This
seems... less than ideal?


That's what we are doing with bpf_program__attach_kprobe(), though.
And so far I've only seen people (privately) saying how good it would
be to have bpf_link-based TC APIs, doesn't seem like anyone with a
realis

Re: [PATCH bpf-next 1/2] bpf: Remove bpf_jit_enable=2 debugging mode

2021-04-15 Thread Daniel Borkmann

On 4/15/21 11:32 AM, Jianlin Lv wrote:

For debugging JITs, dumping the JITed image to the kernel log is discouraged;
"bpftool prog dump jited" is a much better way to examine JITed dumps.
This patch gets rid of the code related to the bpf_jit_enable=2 mode, updates
the proc handler of bpf_jit_enable, and adds auxiliary information explaining
how to use the bpf_jit_disasm tool after this change.

Signed-off-by: Jianlin Lv 

[...]

diff --git a/arch/x86/net/bpf_jit_comp32.c b/arch/x86/net/bpf_jit_comp32.c
index 0a7a2870f111..8d36b4658076 100644
--- a/arch/x86/net/bpf_jit_comp32.c
+++ b/arch/x86/net/bpf_jit_comp32.c
@@ -2566,9 +2566,6 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog 
*prog)
cond_resched();
}
  
-	if (bpf_jit_enable > 1)
-		bpf_jit_dump(prog->len, proglen, pass + 1, image);
-
if (image) {
bpf_jit_binary_lock_ro(header);
prog->bpf_func = (void *)image;
diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
index c8496c1142c9..990b1720c7a4 100644
--- a/net/core/sysctl_net_core.c
+++ b/net/core/sysctl_net_core.c
@@ -273,16 +273,8 @@ static int proc_dointvec_minmax_bpf_enable(struct 
ctl_table *table, int write,
  
 	tmp.data = &jit_enable;
 	ret = proc_dointvec_minmax(&tmp, write, buffer, lenp, ppos);
-   if (write && !ret) {
-   if (jit_enable < 2 ||
-	    (jit_enable == 2 && bpf_dump_raw_ok(current_cred()))) {
-   *(int *)table->data = jit_enable;
-   if (jit_enable == 2)
-			pr_warn("bpf_jit_enable = 2 was set! NEVER use this in production, only for JIT debugging!\n");
-   } else {
-   ret = -EPERM;
-   }
-   }
+   if (write && !ret)
+   *(int *)table->data = jit_enable;
return ret;
  }
  
@@ -389,7 +381,7 @@ static struct ctl_table net_core_table[] = {

.extra2 = SYSCTL_ONE,
  # else
.extra1 = SYSCTL_ZERO,
-		.extra2 = &two,
+   .extra2 = SYSCTL_ONE,
  # endif
},
  # ifdef CONFIG_HAVE_EBPF_JIT
diff --git a/tools/bpf/bpf_jit_disasm.c b/tools/bpf/bpf_jit_disasm.c
index c8ae95804728..efa4b17ae016 100644
--- a/tools/bpf/bpf_jit_disasm.c
+++ b/tools/bpf/bpf_jit_disasm.c
@@ -7,7 +7,7 @@
   *
   * To get the disassembly of the JIT code, do the following:
   *
- *  1) `echo 2 > /proc/sys/net/core/bpf_jit_enable`
+ *  1) Insert bpf_jit_dump() and recompile the kernel to output JITed image into log


Hmm, if we remove bpf_jit_dump(), the next drive-by cleanup patch will be thrown
at bpf@vger stating that bpf_jit_dump() has no in-tree users and should be removed.
Maybe we should be removing bpf_jit_disasm.c along with it as well as bpf_jit_dump()
itself ... I guess if it's ever needed in those rare occasions for JIT debugging we
can resurrect it from old kernels just locally. But yeah, bpftool's jit dump should
suffice for the vast majority of use cases.

There was a recent set for the ppc32 jit which was merged into the ppc tree and will
create a merge conflict with this one [0]. So we would need a rebase and maybe take
it during the merge window once the ppc32 changes have landed.

  [0] https://lore.kernel.org/bpf/cover.1616430991.git.christophe.le...@csgroup.eu/


   *  2) Load a BPF filter (e.g. `tcpdump -p -n -s 0 -i eth1 host 192.168.20.0/24`)
   *  3) Run e.g. `bpf_jit_disasm -o` to read out the last JIT code
   *
diff --git a/tools/bpf/bpftool/feature.c b/tools/bpf/bpftool/feature.c
index 40a88df275f9..98c7eec2923f 100644
--- a/tools/bpf/bpftool/feature.c
+++ b/tools/bpf/bpftool/feature.c
@@ -203,9 +203,6 @@ static void probe_jit_enable(void)
case 1:
printf("JIT compiler is enabled\n");
break;
-   case 2:
-	printf("JIT compiler is enabled with debugging traces in kernel logs\n");
-   break;


This would still need to be there for older kernels ...


case -1:
printf("Unable to retrieve JIT-compiler status\n");
break;





Re: [PATCH bpf-next 3/5] libbpf: add low level TC-BPF API

2021-04-14 Thread Daniel Borkmann

On 4/15/21 1:19 AM, Andrii Nakryiko wrote:

On Wed, Apr 14, 2021 at 3:51 PM Toke Høiland-Jørgensen  wrote:

Andrii Nakryiko  writes:

On Wed, Apr 14, 2021 at 3:58 AM Toke Høiland-Jørgensen  wrote:

Andrii Nakryiko  writes:

On Tue, Apr 6, 2021 at 3:06 AM Toke Høiland-Jørgensen  wrote:

Andrii Nakryiko  writes:

On Sat, Apr 3, 2021 at 10:47 AM Alexei Starovoitov
 wrote:

On Sat, Apr 03, 2021 at 12:38:06AM +0530, Kumar Kartikeya Dwivedi wrote:

On Sat, Apr 03, 2021 at 12:02:14AM IST, Alexei Starovoitov wrote:

On Fri, Apr 2, 2021 at 8:27 AM Kumar Kartikeya Dwivedi  wrote:

[...]


All of these things are messy because of tc legacy. bpf tried to follow tc style
with cls and act distinction and it didn't quite work. cls with
direct-action is the only
thing that became mainstream while tc style attach wasn't really addressed.
There were several incidents where tc had tens of thousands of progs attached
because of this attach/query/index weirdness described above.
I think the only way to address this properly is to introduce bpf_link style of
attaching to tc. Such bpf_link would support ingress/egress only.
direct-action will be implied. There won't be any index and query
will be obvious.


Note that we already have bpf_link support working (without support for pinning
of course) in a limited way. The ifindex, protocol, parent_id, priority, handle,
chain_index tuple uniquely identifies a filter, so we stash this in the bpf_link
and are able to operate on the exact filter during release.


Except they're not unique. The library can stash them, but something else
doing detach via iproute2 or their own netlink calls will detach the prog.
This other app can attach to the same spot a different prog and now
bpf_link__destroy will be detaching somebody else's prog.


So I would like to propose to take this patch set a step further from
what Daniel said:
int bpf_tc_attach(prog_fd, ifindex, {INGRESS,EGRESS}):
and make this proposed api to return FD.
To detach from tc ingress/egress just close(fd).


You mean adding an fd-based TC API to the kernel?


yes.


I'm totally for bpf_link-based TC attachment.

But I think *also* having "legacy" netlink-based APIs will allow
applications to handle older kernels in a much nicer way without extra
dependency on iproute2. We have a similar situation with kprobe, where
currently libbpf only supports "modern" fd-based attachment, but users
periodically ask questions and struggle to figure out issues on older
kernels that don't support new APIs.


+1; I am OK with adding a new bpf_link-based way to attach TC programs,
but we still need to support the netlink API in libbpf.


So I think we'd have to support legacy TC APIs, but I agree with
Alexei and Daniel that we should keep it to the simplest and most
straightforward API of supporting direction-action attachments and
setting up qdisc transparently (if I'm getting all the terminology
right, after reading Quentin's blog post). That coincidentally should
probably match how bpf_link-based TC API will look like, so all that
can be abstracted behind a single bpf_link__attach_tc() API as well,
right? That's the plan for dealing with kprobe right now, btw. Libbpf
will detect the best available API and transparently fall back (maybe
with some warning for awareness, due to inherent downsides of legacy
APIs: no auto-cleanup being the most prominent one).


Yup, SGTM: Expose both in the low-level API (in bpf.c), and make the
high-level API auto-detect. That way users can also still use the
netlink attach function if they don't want the fd-based auto-close
behaviour of bpf_link.


So I thought a bit more about this, and it feels like the right move
would be to expose only higher-level TC BPF API behind bpf_link. It
will keep the API complexity and amount of APIs that libbpf will have
to support to the minimum, and will keep the API itself simple:
direct-attach with the minimum amount of input arguments. By not
exposing low-level APIs we also table the whole bpf_tc_cls_attach_id
design discussion, as we now can keep as much info as needed inside
bpf_link_tc (which will embed bpf_link internally as well) to support
detachment and possibly some additional querying, if needed.


But then there would be no way for the caller to explicitly select a
mechanism? I.e., if I write a BPF program using this mechanism targeting
a 5.12 kernel, I'll get netlink attachment, which can stick around when
I do bpf_link__disconnect(). But then if the kernel gets upgraded to
support bpf_link for TC programs I'll suddenly transparently get
bpf_link and the attachments will go away unless I pin them. This
seems... less than ideal?


That's what we are doing with bpf_program__attach_kprobe(), though.
And so far I've only seen people (privately) saying how good it would
be to have bpf_link-based TC APIs, doesn't seem like anyone with a
realistic use case prefers the current APIs. So I suspect it's not
going to be a problem in practice. But at least I'd start there and

Re: [PATCH 5.10 39/41] bpf, x86: Validate computation of branch displacements for x86-32

2021-04-09 Thread Daniel Borkmann

On 4/9/21 9:51 PM, Sudip Mukherjee wrote:

On Fri, Apr 09, 2021 at 11:54:01AM +0200, Greg Kroah-Hartman wrote:

From: Piotr Krysiuk 

commit 26f55a59dc65ff77cd1c4b37991e26497fc68049 upstream.


I am not finding this in Linus's tree and even not seeing this change in
master branch also. Am I missing something?


Both are in the -net tree at this point, thus the commit sha won't change anymore.
David or Jakub will likely send their -net PR to Linus today or tomorrow for landing
in mainline. For stable, things had to move a bit quicker given the announcement in
[0] yesterday. Timing was a bit unfortunate here as I would have preferred for things
to land in stable the regular way first (aka merge to mainline, cherry-picking to
stable, minor stable release, then announcement).

Thanks,
Daniel

  [0] https://www.openwall.com/lists/oss-security/2021/04/08/1


Re: [PATCH bpf-next 3/5] libbpf: add low level TC-BPF API

2021-04-01 Thread Daniel Borkmann

On 3/31/21 11:44 AM, Kumar Kartikeya Dwivedi wrote:

On Wed, Mar 31, 2021 at 02:55:47AM IST, Daniel Borkmann wrote:

Do we even need the _block variant? I would rather prefer to take the chance
and make it as simple as possible, and only iff really needed extend with
other APIs, for example:


The block variant can be dropped, I'll use the TC_BLOCK/TC_DEV alternative which
sets parent_id/ifindex properly.


   bpf_tc_attach(prog_fd, ifindex, {INGRESS,EGRESS});

Internally, this will create the sch_clsact qdisc & cls_bpf filter instance
iff not present yet, and attach to a default prio 1 handle 1, and _always_ in
direct-action mode. This is /as simple as it gets/ and we don't need to bother
users with more complex tc/cls_bpf internals unless desired. For example,
extended APIs could add prio/parent so that multi-prog can be attached to a
single cls_bpf instance, but even that could be a second step, imho.


I am not opposed to clsact qdisc setup if INGRESS/EGRESS is supplied (not sure
how others feel about it).


What speaks against it? It would be 100% clear from API side where the prog is
being attached. Same as with tc cmdline where you specify 'ingress'/'egress'.


We could make direct_action mode default, and similarly choose prio


To be honest, I wouldn't even support a mode from the lib/API side where direct_action
is not set. It should always be forced to true. Everything else is rather broken
setup-wise, imho, since it won't scale. We added direct_action a bit later to the
kernel than the original cls_bpf, but if I would do it again today, I'd make it the
only available option. I don't see a reasonable use-case where you have it set to
false.


as 1 by default instead of letting the kernel do it. Then you can just pass in
NULL for bpf_tc_cls_opts and be close to what you're proposing. For protocol we
can choose ETH_P_ALL by default too if the user doesn't set it.


Same here with ETH_P_ALL, I'm not sure anyone uses anything other than ETH_P_ALL,
so yes, that should be default.


With these modifications, the equivalent would look like
bpf_tc_cls_attach(prog_fd, TC_DEV(ifindex, INGRESS), NULL, &id);


Few things compared to bpf_tc_attach(prog_fd, ifindex, {INGRESS,EGRESS}):

1) nit, but why even 'cls' in the name. I think we shouldn't expose such old-days
   tc semantics to a user. Just bpf_tc_attach() is cleaner/simpler to understand.
2) What's the 'TC_DEV(ifindex, INGRESS)' macro doing exactly? Looks unnecessary,
   why not regular args to the API?
3) Exposed bpf_tc_attach() API could internally call a bpf_tc_attach_opts() API
   with preset defaults, and the latter could have all the custom bits if the user
   needs to go beyond the simple API, so from your bpf_tc_cls_attach() I'd also
   drop the NULL.
4) For the simple API I'd likely also drop the id (you could have a query API if
   needed).
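
(Pulling points 1-4 together into a rough sketch; all of these signatures are
hypothetical, just to make the two-level surface concrete.)

/* simple API, presets: direct-action, prio 1, handle 1, ETH_P_ALL */
int bpf_tc_attach(int prog_fd, int ifindex, enum bpf_tc_attach_point where);

/* extended API for users who need to go beyond the defaults */
struct bpf_tc_opts {
	size_t sz;		/* for forward/backward compatibility */
	__u32 handle;
	__u32 priority;
};

int bpf_tc_attach_opts(int prog_fd, int ifindex,
		       enum bpf_tc_attach_point where,
		       const struct bpf_tc_opts *opts);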


So as long as the user doesn't care about other details, they can just pass opts
as NULL.




Re: [PATCH bpf-next 3/5] libbpf: add low level TC-BPF API

2021-03-30 Thread Daniel Borkmann

On 3/30/21 10:39 PM, Andrii Nakryiko wrote:

On Sun, Mar 28, 2021 at 1:11 AM Kumar Kartikeya Dwivedi
 wrote:

On Sun, Mar 28, 2021 at 10:12:40AM IST, Andrii Nakryiko wrote:

Is there some succinct but complete enough documentation/tutorial/etc
that I can reasonably read to understand kernel APIs provided by TC
(w.r.t. BPF, of course). I'm trying to wrap my head around this and
whether API makes sense or not. Please share links, if you have some.


Hi Andrii,

Unfortunately for the kernel API part, I couldn't find any when I was working
on this. So I had to read the iproute2 tc code (tc_filter.c, f_bpf.c,
m_action.c, m_bpf.c) and the kernel side bits (cls_api.c, cls_bpf.c, act_api.c,
act_bpf.c) to grok anything I didn't understand. There's also similar code in
libnl (lib/route/{act,cls}.c).

Other than that, these resources were useful (perhaps you already went through
some/all of them):

https://docs.cilium.io/en/latest/bpf/#tc-traffic-control
https://qmonnet.github.io/whirl-offload/2020/04/11/tc-bpf-direct-action/
tc(8), and tc-bpf(8) man pages

I hope this is helpful!


Thanks! I'll take a look. Sorry, I'm a bit behind with all the stuff,
trying to catch up.

I was just wondering if it would be more natural instead of having
_dev _block variants and having to specify __u32 ifindex, __u32
parent_id, __u32 protocol, to have some struct specifying TC
"destination"? Maybe not, but I thought I'd bring this up early. So
you'd have just bpf_tc_cls_attach(), and you'd so something like

bpf_tc_cls_attach(prog_fd, TC_DEV(ifindex, parent_id, protocol))

or

bpf_tc_cls_attach(prog_fd, TC_BLOCK(block_idx, protocol))

? Or it's taking it too far?

But even if not, I think detaching can be unified between _dev and
_block, can't it?


Do we even need the _block variant? I would rather prefer to take the chance
and make it as simple as possible, and only iff really needed extend with
other APIs, for example:

  bpf_tc_attach(prog_fd, ifindex, {INGRESS,EGRESS});

Internally, this will create the sch_clsact qdisc & cls_bpf filter instance
iff not present yet, and attach to a default prio 1 handle 1, and _always_ in
direct-action mode. This is /as simple as it gets/ and we don't need to bother
users with more complex tc/cls_bpf internals unless desired. For example,
extended APIs could add prio/parent so that multi-prog can be attached to a
single cls_bpf instance, but even that could be a second step, imho.

Thanks,
Daniel


Re: [PATCH v3] bpf: Fix memory leak in copy_process()

2021-03-19 Thread Daniel Borkmann

On 3/17/21 4:09 AM, qiang.zh...@windriver.com wrote:

From: Zqiang 

The syzbot report a memleak follow:
BUG: memory leak
unreferenced object 0x888101b41d00 (size 120):
   comm "kworker/u4:0", pid 8, jiffies 4294944270 (age 12.780s)
   backtrace:
 [] alloc_pid+0x66/0x560
 [] copy_process+0x1465/0x25e0
 [] kernel_clone+0xf3/0x670
 [] kernel_thread+0x61/0x80
 [] call_usermodehelper_exec_work
 [] call_usermodehelper_exec_work+0xc4/0x120
 [] process_one_work+0x2c9/0x600
 [] worker_thread+0x59/0x5d0
 [] kthread+0x178/0x1b0
 [] ret_from_fork+0x1f/0x30

unreferenced object 0x888110ef5c00 (size 232):
   comm "kworker/u4:0", pid 8414, jiffies 4294944270 (age 12.780s)
   backtrace:
 [] kmem_cache_zalloc
 [] __alloc_file+0x1f/0xf0
 [] alloc_empty_file+0x69/0x120
 [] alloc_file+0x33/0x1b0
 [] alloc_file_pseudo+0xb2/0x140
 [] create_pipe_files+0x138/0x2e0
 [] umd_setup+0x33/0x220
 [] call_usermodehelper_exec_async+0xb4/0x1b0
 [] ret_from_fork+0x1f/0x30

After the UMD process exits, the pipe_to_umh/pipe_from_umh and tgid
need to be released.

Fixes: d71fa5c9763c ("bpf: Add kernel module with user mode driver that populates 
bpffs.")
Reported-by: syzbot+44908bb56d2bfe56b...@syzkaller.appspotmail.com
Signed-off-by: Zqiang 


Applied to bpf, thanks (also did minor style fixups to fix kernel style)!


Re: linux-next: manual merge of the net-next tree with the net tree

2021-03-19 Thread Daniel Borkmann

On 3/19/21 4:33 PM, Alexei Starovoitov wrote:

On Fri, Mar 19, 2021 at 8:17 AM Yonghong Song  wrote:

On 3/19/21 12:21 AM, Daniel Borkmann wrote:

On 3/19/21 3:11 AM, Piotr Krysiuk wrote:

Hi Daniel,

On Fri, Mar 19, 2021 at 12:16 AM Stephen Rothwell 
wrote:


diff --cc kernel/bpf/verifier.c
index 44e4ec1640f1,f9096b049cd6..
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@@ -5876,10 -6056,22 +6060,23 @@@ static int
retrieve_ptr_limit(const str
  if (mask_to_left)
  *ptr_limit = MAX_BPF_STACK + off;
  else
   -  *ptr_limit = -off;
   -  return 0;
   +  *ptr_limit = -off - 1;
   +  return *ptr_limit >= max ? -ERANGE : 0;
+   case PTR_TO_MAP_KEY:
+   /* Currently, this code is not exercised as the only use
+* is bpf_for_each_map_elem() helper which requires
+* bpf_capable. The code has been tested manually for
+* future use.
+*/
+   if (mask_to_left) {
+   *ptr_limit = ptr_reg->umax_value + ptr_reg->off;
+   } else {
+   off = ptr_reg->smin_value + ptr_reg->off;
+   *ptr_limit = ptr_reg->map_ptr->key_size - off;
+   }
+   return 0;


PTR_TO_MAP_VALUE logic above looks like copy-paste of old
PTR_TO_MAP_VALUE
code from before "bpf: Fix off-by-one for area size in creating mask to
left" and is apparently affected by the same off-by-one, except this time
on "key_size" area and not "value_size".

This needs to be fixed in the same way as we did with PTR_TO_MAP_VALUE.
What is the best way to proceed?


Hm, not sure why PTR_TO_MAP_KEY was added by 69c087ba6225 in the first place, I
presume no one expects this to be used from unprivileged as the comment says.
Resolution should be to remove the PTR_TO_MAP_KEY case entirely from that switch
until we have an actual user.


Alexei suggested so that we don't forget it in the future if
bpf_capable() requirement is removed.
 https://lore.kernel.org/bpf/c837ae55-2487-2f39-47f6-a18781dc6...@fb.com/

I am okay with either way, fix it or remove it.


I prefer to fix it.


If the bpf_capable() requirement is removed, the verifier would bail out on
PTR_TO_MAP_KEY if not covered in the switch, given the recent fixes we did. I can
fix it up after the merge if we think bpf_for_each_map_elem() will be used by
unpriv in the future.


Re: linux-next: manual merge of the net-next tree with the net tree

2021-03-19 Thread Daniel Borkmann

On 3/19/21 3:11 AM, Piotr Krysiuk wrote:

Hi Daniel,

On Fri, Mar 19, 2021 at 12:16 AM Stephen Rothwell 
wrote:


diff --cc kernel/bpf/verifier.c
index 44e4ec1640f1,f9096b049cd6..
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@@ -5876,10 -6056,22 +6060,23 @@@ static int retrieve_ptr_limit(const str
 if (mask_to_left)
 *ptr_limit = MAX_BPF_STACK + off;
 else
  -  *ptr_limit = -off;
  -  return 0;
  +  *ptr_limit = -off - 1;
  +  return *ptr_limit >= max ? -ERANGE : 0;
+   case PTR_TO_MAP_KEY:
+   /* Currently, this code is not exercised as the only use
+* is bpf_for_each_map_elem() helper which requires
+* bpf_capable. The code has been tested manually for
+* future use.
+*/
+   if (mask_to_left) {
+   *ptr_limit = ptr_reg->umax_value + ptr_reg->off;
+   } else {
+   off = ptr_reg->smin_value + ptr_reg->off;
+   *ptr_limit = ptr_reg->map_ptr->key_size - off;
+   }
+   return 0;



PTR_TO_MAP_VALUE logic above looks like copy-paste of old PTR_TO_MAP_VALUE
code from before "bpf: Fix off-by-one for area size in creating mask to
left" and is apparently affected by the same off-by-one, except this time
on "key_size" area and not "value_size".

This needs to be fixed in the same way as we did with PTR_TO_MAP_VALUE.
What is the best way to proceed?
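
(For reference only: mirroring the PTR_TO_MAP_VALUE fix quoted above on the
PTR_TO_MAP_KEY branch would roughly mean the following; note the reply below
argues for dropping the case instead.)

	} else {
		off = ptr_reg->smin_value + ptr_reg->off;
		*ptr_limit = ptr_reg->map_ptr->key_size - off - 1;
	}
	return *ptr_limit >= max ? -ERANGE : 0;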


Hm, not sure why PTR_TO_MAP_KEY was added by 69c087ba6225 in the first place, I
presume no one expects this to be used from unprivileged as the comment says.
Resolution should be to remove the PTR_TO_MAP_KEY case entirely from that switch
until we have an actual user.

Thanks,
Daniel


Re: [PATCH] selftests/bpf: fix warning comparing pointer to 0

2021-03-18 Thread Daniel Borkmann

On 3/18/21 2:55 AM, Jiapeng Chong wrote:

Fix the following coccicheck warning:

./tools/testing/selftests/bpf/progs/fentry_test.c:76:15-16: WARNING
comparing pointer to 0.

Reported-by: Abaci Robot 
Signed-off-by: Jiapeng Chong 
---
  tools/testing/selftests/bpf/progs/fentry_test.c | 4 ++--
  1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/bpf/progs/fentry_test.c 
b/tools/testing/selftests/bpf/progs/fentry_test.c
index 5f645fd..d4247d6 100644
--- a/tools/testing/selftests/bpf/progs/fentry_test.c
+++ b/tools/testing/selftests/bpf/progs/fentry_test.c
@@ -64,7 +64,7 @@ struct bpf_fentry_test_t {
  SEC("fentry/bpf_fentry_test7")
  int BPF_PROG(test7, struct bpf_fentry_test_t *arg)
  {
-   if (arg == 0)
+   if (!arg)
test7_result = 1;
return 0;
  }
@@ -73,7 +73,7 @@ int BPF_PROG(test7, struct bpf_fentry_test_t *arg)
  SEC("fentry/bpf_fentry_test8")
  int BPF_PROG(test8, struct bpf_fentry_test_t *arg)
  {
-   if (arg->a == 0)
+   if (!arg->a)
test8_result = 1;
return 0;
  }



This doesn't apply. Please rebase against bpf-next tree, and also make sure to
squash any other such patches into a single one.


Re: [PATCH bpf-next v2] bpf: Simplify expression for identify bpf mem type

2021-03-18 Thread Daniel Borkmann

On 3/18/21 7:36 AM, Jianlin Lv wrote:

Add a BPF_LD_ST_SIZE_MASK macro as a mask of the size modifier, which helps
reduce the evaluation of expressions in if statements,
and remove BPF_SIZE_MASK from the netronome driver.

Signed-off-by: Jianlin Lv 
---
v2: Move the BPF_LD_ST_SIZE_MASK macro definition to include/linux/bpf.h
---
  drivers/net/ethernet/netronome/nfp/bpf/main.h |  8 +++-
  include/linux/bpf.h   |  1 +
  kernel/bpf/verifier.c | 12 
  3 files changed, 8 insertions(+), 13 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/bpf/main.h 
b/drivers/net/ethernet/netronome/nfp/bpf/main.h
index d0e17eebddd9..e90981e69763 100644
--- a/drivers/net/ethernet/netronome/nfp/bpf/main.h
+++ b/drivers/net/ethernet/netronome/nfp/bpf/main.h
@@ -346,8 +346,6 @@ struct nfp_insn_meta {
struct list_head l;
  };
  
-#define BPF_SIZE_MASK	0x18

-
  static inline u8 mbpf_class(const struct nfp_insn_meta *meta)
  {
return BPF_CLASS(meta->insn.code);
@@ -375,7 +373,7 @@ static inline bool is_mbpf_alu(const struct nfp_insn_meta 
*meta)
  
  static inline bool is_mbpf_load(const struct nfp_insn_meta *meta)

  {
-   return (meta->insn.code & ~BPF_SIZE_MASK) == (BPF_LDX | BPF_MEM);
+   return (meta->insn.code & ~BPF_LD_ST_SIZE_MASK) == (BPF_LDX | BPF_MEM);
  }
  
  static inline bool is_mbpf_jmp32(const struct nfp_insn_meta *meta)

@@ -395,7 +393,7 @@ static inline bool is_mbpf_jmp(const struct nfp_insn_meta 
*meta)
  
  static inline bool is_mbpf_store(const struct nfp_insn_meta *meta)

  {
-   return (meta->insn.code & ~BPF_SIZE_MASK) == (BPF_STX | BPF_MEM);
+   return (meta->insn.code & ~BPF_LD_ST_SIZE_MASK) == (BPF_STX | BPF_MEM);
  }
  
  static inline bool is_mbpf_load_pkt(const struct nfp_insn_meta *meta)

@@ -430,7 +428,7 @@ static inline bool is_mbpf_classic_store_pkt(const struct 
nfp_insn_meta *meta)
  
  static inline bool is_mbpf_atomic(const struct nfp_insn_meta *meta)

  {
-   return (meta->insn.code & ~BPF_SIZE_MASK) == (BPF_STX | BPF_ATOMIC);
+   return (meta->insn.code & ~BPF_LD_ST_SIZE_MASK) == (BPF_STX | BPF_ATOMIC);
  }
  
  static inline bool is_mbpf_mul(const struct nfp_insn_meta *meta)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index a25730eaa148..e85924719c65 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -995,6 +995,7 @@ struct bpf_array {
 BPF_F_RDONLY_PROG |\
 BPF_F_WRONLY | \
 BPF_F_WRONLY_PROG)
+#define BPF_LD_ST_SIZE_MASK	0x18	/* mask of size modifier */
  
  #define BPF_MAP_CAN_READ	BIT(0)

  #define BPF_MAP_CAN_WRITE BIT(1)
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index f9096b049cd6..29fdfdb8abfa 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -11384,15 +11384,11 @@ static int convert_ctx_accesses(struct 
bpf_verifier_env *env)
for (i = 0; i < insn_cnt; i++, insn++) {
bpf_convert_ctx_access_t convert_ctx_access;
  
-		if (insn->code == (BPF_LDX | BPF_MEM | BPF_B) ||

-   insn->code == (BPF_LDX | BPF_MEM | BPF_H) ||
-   insn->code == (BPF_LDX | BPF_MEM | BPF_W) ||
-   insn->code == (BPF_LDX | BPF_MEM | BPF_DW))
+   /* opcode: BPF_MEM | <size> | BPF_LDX */
+   if ((insn->code & ~BPF_LD_ST_SIZE_MASK) == (BPF_LDX | BPF_MEM))
type = BPF_READ;
-   else if (insn->code == (BPF_STX | BPF_MEM | BPF_B) ||
-insn->code == (BPF_STX | BPF_MEM | BPF_H) ||
-insn->code == (BPF_STX | BPF_MEM | BPF_W) ||
-insn->code == (BPF_STX | BPF_MEM | BPF_DW))
+   /* opcode: BPF_MEM | <size> | BPF_STX */
+   else if ((insn->code & ~BPF_LD_ST_SIZE_MASK) == (BPF_STX | BPF_MEM))
type = BPF_WRITE;
else
continue;



To me this cleanup makes the code harder to read, in particular on the verifier side;
I don't think it's worth it, especially given it's not in (highly) performance
critical code.

Thanks,
Daniel


Re: [PATCH] MIPS/bpf: Enable bpf_probe_read{, str}() on MIPS again

2021-03-17 Thread Daniel Borkmann

On 3/17/21 8:15 AM, Tiezhu Yang wrote:

After commit 0ebeea8ca8a4 ("bpf: Restrict bpf_probe_read{, str}() only to
archs where they work"), the bpf_probe_read{, str}() functions were no longer
available on MIPS, so there are errors when running a bpf program:

root@linux:/home/loongson/bcc# python examples/tracing/task_switch.py
bpf: Failed to load program: Invalid argument
[...]
11: (85) call bpf_probe_read#4
unknown func bpf_probe_read#4
[...]
Exception: Failed to load BPF program count_sched: Invalid argument

So select ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE in arch/mips/Kconfig,
otherwise the old bpf helper bpf_probe_read() will not be available.

This is similar to commit d195b1d1d1196 ("powerpc/bpf: Enable
bpf_probe_read{, str}() on powerpc again").

Fixes: 0ebeea8ca8a4 ("bpf: Restrict bpf_probe_read{, str}() only to archs where they 
work")
Signed-off-by: Tiezhu Yang 


Thomas, I presume you pick this up via mips tree (with typos fixed)? Or do you
want us to route the fix via bpf with your ACK? (I'm fine either way.)

Thanks,
Daniel


Re: [PATCH] libbpf: avoid inline hint definition from 'linux/stddef.h'

2021-03-16 Thread Daniel Borkmann

On 3/16/21 10:34 PM, Andrii Nakryiko wrote:

On Tue, Mar 16, 2021 at 2:01 PM Daniel Borkmann  wrote:

On 3/14/21 6:38 PM, Pedro Tammela wrote:

Linux headers might pull 'linux/stddef.h' which defines
'__always_inline' as the following:

 #ifndef __always_inline
 #define __always_inline __inline__
 #endif

This becomes an issue if the program picks up the 'linux/stddef.h'
definition as the macro now just hints inline to clang.


How did the program end up including linux/stddef.h ? Would be good to
also have some more details on how we got here for the commit desc.


It's an UAPI header, so why not? Is there anything special about
stddef.h that makes it unsuitable to be included?


Hm, fair enough, looks like linux/types.h already pulls it in, so no. We
defined our own stddef.h a long time ago, so it looks like we never ran into
this issue.


This change now enforces the proper definition for BPF programs
regardless of the include order.

Signed-off-by: Pedro Tammela 
---
   tools/lib/bpf/bpf_helpers.h | 7 +--
   1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/tools/lib/bpf/bpf_helpers.h b/tools/lib/bpf/bpf_helpers.h
index ae6c975e0b87..5fa483c0b508 100644
--- a/tools/lib/bpf/bpf_helpers.h
+++ b/tools/lib/bpf/bpf_helpers.h
@@ -29,9 +29,12 @@
*/
   #define SEC(NAME) __attribute__((section(NAME), used))

-#ifndef __always_inline
+/*
+ * Avoid 'linux/stddef.h' definition of '__always_inline'.
+ */


I think the comment should have more details on 'why' we undef it, as in a few
months, looking at it again, the next question to dig into would be
what was wrong with linux/stddef.h. Providing a better rationale would
be nice for readers here.


So for whatever reason commit bot didn't send notification, but I've
already landed this yesterday. To me, with #undef + #define it's
pretty clear that we "force-define" __always_inline exactly how we
want it, but we can certainly add clarifying comment in the follow up,
if you think it's needed.


Up to you, but given you applied it it's probably not worth the trouble;
missed it earlier given I didn't see the patchbot message in the thread
initially. :/


+#undef __always_inline
   #define __always_inline inline __attribute__((always_inline))
-#endif
+
   #ifndef __noinline
   #define __noinline __attribute__((noinline))
   #endif







Re: [PATCH] libbpf: avoid inline hint definition from 'linux/stddef.h'

2021-03-16 Thread Daniel Borkmann

On 3/14/21 6:38 PM, Pedro Tammela wrote:

Linux headers might pull 'linux/stddef.h' which defines
'__always_inline' as the following:

#ifndef __always_inline
#define __always_inline __inline__
#endif

This becomes an issue if the program picks up the 'linux/stddef.h'
definition as the macro now just hints inline to clang.
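
(Hypothetical reproducer of the include-order issue described above, not taken
from the patch.)

#include <linux/stddef.h>	/* defines __always_inline as plain __inline__ */
#include <bpf/bpf_helpers.h>	/* old #ifndef guard keeps that weak definition */

static __always_inline int pkt_in_range(void *data, void *data_end)
{
	/* with the weak definition, clang is merely hinted to inline this,
	 * so it may remain a separate function in the emitted BPF object
	 */
	return data + 1 <= data_end;
}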


How did the program end up including linux/stddef.h ? Would be good to
also have some more details on how we got here for the commit desc.


This change now enforces the proper definition for BPF programs
regardless of the include order.

Signed-off-by: Pedro Tammela 
---
  tools/lib/bpf/bpf_helpers.h | 7 +--
  1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/tools/lib/bpf/bpf_helpers.h b/tools/lib/bpf/bpf_helpers.h
index ae6c975e0b87..5fa483c0b508 100644
--- a/tools/lib/bpf/bpf_helpers.h
+++ b/tools/lib/bpf/bpf_helpers.h
@@ -29,9 +29,12 @@
   */
  #define SEC(NAME) __attribute__((section(NAME), used))
  
-#ifndef __always_inline

+/*
+ * Avoid 'linux/stddef.h' definition of '__always_inline'.
+ */


I think the comment should have more details on 'why' we undef it, as in a few
months, looking at it again, the next question to dig into would be
what was wrong with linux/stddef.h. Providing a better rationale would
be nice for readers here.


+#undef __always_inline
  #define __always_inline inline __attribute__((always_inline))
-#endif
+
  #ifndef __noinline
  #define __noinline __attribute__((noinline))
  #endif





Re: [PATCH v2] bpf: Fix memory leak in copy_process()

2021-03-16 Thread Daniel Borkmann

On 3/15/21 9:58 AM, qiang.zh...@windriver.com wrote:

From: Zqiang 


nit: I presume it should be s/Zqiang/Qiang Zhang/ as real name for 'From'
instead of abbreviation?


The syzbot report a memleak follow:
BUG: memory leak
unreferenced object 0x888101b41d00 (size 120):
   comm "kworker/u4:0", pid 8, jiffies 4294944270 (age 12.780s)
   backtrace:
 [] alloc_pid+0x66/0x560
 [] copy_process+0x1465/0x25e0
 [] kernel_clone+0xf3/0x670
 [] kernel_thread+0x61/0x80
 [] call_usermodehelper_exec_work
 [] call_usermodehelper_exec_work+0xc4/0x120
 [] process_one_work+0x2c9/0x600
 [] worker_thread+0x59/0x5d0
 [] kthread+0x178/0x1b0
 [] ret_from_fork+0x1f/0x30

unreferenced object 0x888110ef5c00 (size 232):
   comm "kworker/u4:0", pid 8414, jiffies 4294944270 (age 12.780s)
   backtrace:
 [] kmem_cache_zalloc
 [] __alloc_file+0x1f/0xf0
 [] alloc_empty_file+0x69/0x120
 [] alloc_file+0x33/0x1b0
 [] alloc_file_pseudo+0xb2/0x140
 [] create_pipe_files+0x138/0x2e0
 [] umd_setup+0x33/0x220
 [] call_usermodehelper_exec_async+0xb4/0x1b0
 [] ret_from_fork+0x1f/0x30

after the UMD process exits, the pipe_to_umh/pipe_from_umh and tgid
need to be released.

Fixes: d71fa5c9763c ("bpf: Add kernel module with user mode driver that populates 
bpffs.")
Reported-by: syzbot+44908bb56d2bfe56b...@syzkaller.appspotmail.com
Signed-off-by: Zqiang 


nit: Ditto


---
  v1->v2:
  Judge whether the pointer variable tgid is valid.

  kernel/bpf/preload/bpf_preload_kern.c | 24 
  1 file changed, 20 insertions(+), 4 deletions(-)

diff --git a/kernel/bpf/preload/bpf_preload_kern.c 
b/kernel/bpf/preload/bpf_preload_kern.c
index 79c5772465f1..5009875f01d3 100644
--- a/kernel/bpf/preload/bpf_preload_kern.c
+++ b/kernel/bpf/preload/bpf_preload_kern.c
@@ -4,6 +4,7 @@
  #include 
  #include 
  #include 
+#include 
  #include 
  #include "bpf_preload.h"
  
@@ -20,6 +21,14 @@ static struct bpf_preload_ops umd_ops = {

.owner = THIS_MODULE,
  };
  
+static void bpf_preload_umh_cleanup(struct umd_info *info)

+{
+   fput(info->pipe_to_umh);
+   fput(info->pipe_from_umh);
+   put_pid(info->tgid);
+   info->tgid = NULL;
+}


The above is pretty much a reimplementation of ...

static void umd_cleanup(struct subprocess_info *info)
{
struct umd_info *umd_info = info->data;

/* cleanup if umh_setup() was successful but exec failed */
if (info->retval) {
fput(umd_info->pipe_to_umh);
fput(umd_info->pipe_from_umh);
put_pid(umd_info->tgid);
umd_info->tgid = NULL;
}
}

... so if there are ever changes to umd_cleanup() for additional resource
cleanup, we'd be missing those easily in bpf_preload_umh_cleanup(). I'd
suggest to refactor a common helper inside kernel/usermode_driver.c that
is then exported as symbol which the driver here can use.
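
A minimal sketch of that refactor could look as follows (the helper name and
exact placement are illustrative, not necessarily what gets merged):

  /* kernel/usermode_driver.c: shared teardown, exported so the bpf_preload
   * driver can reuse it instead of duplicating the logic.
   */
  void umd_cleanup_helper(struct umd_info *info)
  {
  	fput(info->pipe_to_umh);
  	fput(info->pipe_from_umh);
  	put_pid(info->tgid);
  	info->tgid = NULL;
  }
  EXPORT_SYMBOL_GPL(umd_cleanup_helper);

  static void umd_cleanup(struct subprocess_info *info)
  {
  	struct umd_info *umd_info = info->data;

  	/* cleanup if umh_setup() was successful but exec failed */
  	if (info->retval)
  		umd_cleanup_helper(umd_info);
  }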


  static int preload(struct bpf_preload_info *obj)
  {
int magic = BPF_PRELOAD_START;
@@ -61,8 +70,10 @@ static int finish(void)
if (n != sizeof(magic))
return -EPIPE;
tgid = umd_ops.info.tgid;
-   wait_event(tgid->wait_pidfd, thread_group_exited(tgid));
-   umd_ops.info.tgid = NULL;
+   if (tgid) {
+   wait_event(tgid->wait_pidfd, thread_group_exited(tgid));
+   bpf_preload_umh_cleanup(&umd_ops.info);
+   }
return 0;
  }
  
@@ -80,10 +91,15 @@ static int __init load_umd(void)
  
  static void __exit fini_umd(void)

  {
+   struct pid *tgid;
bpf_preload_ops = NULL;
/* kill UMD in case it's still there due to earlier error */
-   kill_pid(umd_ops.info.tgid, SIGKILL, 1);
-   umd_ops.info.tgid = NULL;
+   tgid = umd_ops.info.tgid;
+   if (tgid) {
+   kill_pid(tgid, SIGKILL, 1);
+   wait_event(tgid->wait_pidfd, thread_group_exited(tgid));
+   bpf_preload_umh_cleanup(&umd_ops.info);
+   }
umd_unload_blob(_ops.info);
  }
  late_initcall(load_umd);





Re: [PATCH v2] bpf: Fix memory leak in copy_process()

2021-03-15 Thread Daniel Borkmann

On 3/15/21 9:18 AM, qiang.zh...@windriver.com wrote:

From: Zqiang 


Hello Zqiang, please resend this patch with b...@vger.kernel.org in Cc, so it
actually reaches the rest of BPF community for review, thanks!


The syzbot report a memleak follow:
BUG: memory leak
unreferenced object 0x888101b41d00 (size 120):
   comm "kworker/u4:0", pid 8, jiffies 4294944270 (age 12.780s)
   backtrace:
 [] alloc_pid+0x66/0x560
 [] copy_process+0x1465/0x25e0
 [] kernel_clone+0xf3/0x670
 [] kernel_thread+0x61/0x80
 [] call_usermodehelper_exec_work
 [] call_usermodehelper_exec_work+0xc4/0x120
 [] process_one_work+0x2c9/0x600
 [] worker_thread+0x59/0x5d0
 [] kthread+0x178/0x1b0
 [] ret_from_fork+0x1f/0x30

unreferenced object 0x888110ef5c00 (size 232):
   comm "kworker/u4:0", pid 8414, jiffies 4294944270 (age 12.780s)
   backtrace:
 [] kmem_cache_zalloc
 [] __alloc_file+0x1f/0xf0
 [] alloc_empty_file+0x69/0x120
 [] alloc_file+0x33/0x1b0
 [] alloc_file_pseudo+0xb2/0x140
 [] create_pipe_files+0x138/0x2e0
 [] umd_setup+0x33/0x220
 [] call_usermodehelper_exec_async+0xb4/0x1b0
 [] ret_from_fork+0x1f/0x30

after the UMD process exits, the pipe_to_umh/pipe_from_umh and tgid
need to be released.

Fixes: d71fa5c9763c ("bpf: Add kernel module with user mode driver that populates 
bpffs.")
Reported-by: syzbot+44908bb56d2bfe56b...@syzkaller.appspotmail.com
Signed-off-by: Zqiang 
---
  v1->v2:
  Judge whether the pointer variable tgid is valid.


Re: [PATCH] tools include: Add __sum16 and __wsum definitions.

2021-03-08 Thread Daniel Borkmann

On 3/7/21 11:30 PM, Ian Rogers wrote:

This adds definitions available in the uapi version.

Explanation:
In the kernel include of types.h the uapi version is included.
In tools the uapi/linux/types.h and linux/types.h are distinct.
For BPF programs a definition of __wsum is needed by the generated
bpf_helpers.h. The definition comes either from a generated vmlinux.h or
from  that may be transitively included from bpf.h. The
perf build prefers linux/types.h over uapi/linux/types.h for
*. To allow tools/perf/util/bpf_skel/bpf_prog_profiler.bpf.c
to compile with the same include path used for perf then these
definitions are necessary.

There is likely a wider conversation about exactly how types.h should be
specified and the include order used by the perf build - it is somewhat
confusing that tools/include/uapi/linux/bpf.h is using the non-uapi
types.h.

*see tools/perf/Makefile.config:
...
INC_FLAGS += -I$(srctree)/tools/include/
INC_FLAGS += -I$(srctree)/tools/arch/$(SRCARCH)/include/uapi
INC_FLAGS += -I$(srctree)/tools/include/uapi
...
The include directories are scanned from left-to-right:
https://gcc.gnu.org/onlinedocs/gcc/Directory-Options.html
As tools/include/linux/types.h appears before
tools/include/uapi/linux/types.h then I say it is preferred.

Signed-off-by: Ian Rogers 


Given more related to perf build infra, I presume Arnaldo would pick
this one up?

Thanks,
Daniel


Re: [PATCH bpf-next] selftests_bpf: extend test_tc_tunnel test with vxlan

2021-03-05 Thread Daniel Borkmann

On 3/5/21 5:15 PM, Willem de Bruijn wrote:

On Fri, Mar 5, 2021 at 11:10 AM Daniel Borkmann  wrote:


On 3/5/21 4:08 PM, Willem de Bruijn wrote:

On Fri, Mar 5, 2021 at 7:34 AM Xuesen Huang  wrote:


From: Xuesen Huang 

Add BPF_F_ADJ_ROOM_ENCAP_L2_ETH flag to the existing tests which
encapsulates the ethernet as the inner l2 header.

Update a vxlan encapsulation test case.

Signed-off-by: Xuesen Huang 
Signed-off-by: Li Wang 
Signed-off-by: Willem de Bruijn 


Please don't add my signed off by without asking.


Agree, I can remove it if you prefer while applying and only keep the
ack instead.


That would be great. Thanks, Daniel!


Done & applied, thanks everyone!


Re: [PATCH bpf-next] selftests_bpf: extend test_tc_tunnel test with vxlan

2021-03-05 Thread Daniel Borkmann

On 3/5/21 4:08 PM, Willem de Bruijn wrote:

On Fri, Mar 5, 2021 at 7:34 AM Xuesen Huang  wrote:


From: Xuesen Huang 

Add BPF_F_ADJ_ROOM_ENCAP_L2_ETH flag to the existing tests which
encapsulates the ethernet as the inner l2 header.

Update a vxlan encapsulation test case.

Signed-off-by: Xuesen Huang 
Signed-off-by: Li Wang 
Signed-off-by: Willem de Bruijn 


Please don't add my signed off by without asking.


Agree, I can remove it if you prefer while applying and only keep the
ack instead.


That said,

Acked-by: Willem de Bruijn 


Re: [PATCH v8 bpf-next 0/5] xsk: build skb by page (aka generic zerocopy xmit)

2021-02-24 Thread Daniel Borkmann

On 2/18/21 9:49 PM, Alexander Lobakin wrote:

This series introduces XSK generic zerocopy xmit by adding XSK umem
pages as skb frags instead of copying data to linear space.
The only requirement for this for drivers is to be able to xmit skbs
with skb_headlen(skb) == 0, i.e. all data including hard headers
starts from frag 0.
To indicate whether a particular driver supports this, a new netdev
priv flag, IFF_TX_SKB_NO_LINEAR, is added (and declared in virtio_net
as it's already capable of doing it). So consider implementing this
in your drivers to greatly speed-up generic XSK xmit.

[...]

Applied, thanks!
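
As an aside, opting in from a driver's perspective is a one-liner at setup
time plus an xmit path that tolerates an empty linear area; a rough sketch
(flag placement as described in the series, everything else illustrative):

  /* probe/setup: this device can xmit skbs with all data in frags */
  dev->priv_flags |= IFF_TX_SKB_NO_LINEAR;

  /* the xmit path must then cope with skb_headlen(skb) == 0, i.e. even the
   * hard headers start in frag 0 rather than in the linear area.
   */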


Re: [PATCH] bpf: fix a warning message in mark_ptr_not_null_reg()

2021-02-16 Thread Daniel Borkmann

On 2/16/21 10:10 PM, KP Singh wrote:

On Tue, Feb 16, 2021 at 8:37 PM Dan Carpenter  wrote:


The WARN_ON() argument is a condition, and it generates a stack trace
but it doesn't print the warning.

Fixes: 4ddb74165ae5 ("bpf: Extract nullable reg type conversion into a helper 
function")
Signed-off-by: Dan Carpenter 
---
  kernel/bpf/verifier.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 056df6be3e30..bd4d1dfca73c 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -1120,7 +1120,7 @@ static void mark_ptr_not_null_reg(struct bpf_reg_state 
*reg)
 reg->type = PTR_TO_RDWR_BUF;
 break;
 default:
-   WARN_ON("unknown nullable register type");
+   WARN(1, "unknown nullable register type");


Should we use WARN_ONCE here? Also, I think the fix should be targeted for
bpf-next as the patch that introduced this hasn't made it to bpf yet.

[...]


Usually we have something like `verbose(env, "kernel subsystem misconfigured
verifier\n")`, but in this case we'd need to drag env all the way to here. :/
I agree with WARN_ONCE().


Re: [PATCH bpf] devmap: Use GFP_KERNEL for xdp bulk queue allocation

2021-02-12 Thread Daniel Borkmann

On 2/9/21 9:24 AM, NOMURA JUNICHI(野村 淳一) wrote:

The devmap bulk queue is allocated with GFP_ATOMIC and the allocation may
fail if there is no available space in existing percpu pool.

Since commit 75ccae62cb8d42 ("xdp: Move devmap bulk queue into struct 
net_device")
moved the bulk queue allocation to NETDEV_REGISTER callback, whose context
is allowed to sleep, use GFP_KERNEL instead of GFP_ATOMIC to let percpu
allocator extend the pool when needed and avoid possible failure of netdev
registration.

As the required alignment is natural, we can simply use alloc_percpu().

Signed-off-by: Jun'ichi Nomura 


Applied, thanks!
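
For context, the resulting allocation in the NETDEV_REGISTER handler boils
down to something like the following (a sketch based on the description
above, not a quote of the final patch):

  /* process context, may sleep: plain alloc_percpu() (GFP_KERNEL, natural
   * alignment) lets the percpu allocator extend its pool when needed.
   */
  netdev->xdp_bulkq = alloc_percpu(struct xdp_dev_bulk_queue);
  if (!netdev->xdp_bulkq)
  	return NOTIFY_BAD;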


Re: [PATCH/v2] bpf: add bpf_skb_adjust_room flag BPF_F_ADJ_ROOM_ENCAP_L2_ETH

2021-02-11 Thread Daniel Borkmann

On 2/10/21 3:50 PM, Willem de Bruijn wrote:

On Wed, Feb 10, 2021 at 1:59 AM huangxuesen  wrote:


From: huangxuesen 

bpf_skb_adjust_room sets the inner_protocol as skb->protocol for packets
encapsulation. But that is not appropriate when pushing Ethernet header.

Add an option to further specify encap L2 type and set the inner_protocol
as ETH_P_TEB.

Suggested-by: Willem de Bruijn 
Signed-off-by: huangxuesen 
Signed-off-by: chengzhiyong 
Signed-off-by: wangli 


Thanks, this is exactly what I meant.

Acked-by: Willem de Bruijn 

One small point regarding Signed-off-by: It is customary to capitalize
family and given names.


+1, huangxuesen, would be great if you could resubmit with capitalized names in
your SoB as well as From (both seem affected).

Thanks,
Daniel
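
For context, a tc-BPF encap snippet using the new flag could look roughly
like the fragment below (assumes the usual uapi includes; sizes and return
codes are illustrative):

  	__u64 flags = BPF_F_ADJ_ROOM_FIXED_GSO | BPF_F_ADJ_ROOM_ENCAP_L3_IPV4 |
  		      BPF_F_ADJ_ROOM_ENCAP_L4_UDP |
  		      BPF_F_ADJ_ROOM_ENCAP_L2(ETH_HLEN) |
  		      BPF_F_ADJ_ROOM_ENCAP_L2_ETH;
  	/* outer IPv4 + UDP + 8 byte vxlan header + inner Ethernet header */
  	__u32 olen = sizeof(struct iphdr) + sizeof(struct udphdr) + 8 + ETH_HLEN;

  	if (bpf_skb_adjust_room(skb, olen, BPF_ADJ_ROOM_MAC, flags))
  		return TC_ACT_SHOT;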


Re: [PATCH bpf-next v6 2/5] bpf: Expose bpf_get_socket_cookie to tracing programs

2021-02-01 Thread Daniel Borkmann

On 1/30/21 12:45 PM, Florent Revest wrote:

On Fri, Jan 29, 2021 at 1:49 PM Daniel Borkmann  wrote:

On 1/29/21 11:57 AM, Daniel Borkmann wrote:

On 1/27/21 10:01 PM, Andrii Nakryiko wrote:

On Tue, Jan 26, 2021 at 10:36 AM Florent Revest  wrote:


This needs a new helper that:
- can work in a sleepable context (using sock_gen_cookie)
- takes a struct sock pointer and checks that it's not NULL

Signed-off-by: Florent Revest 
Acked-by: KP Singh 
---
   include/linux/bpf.h|  1 +
   include/uapi/linux/bpf.h   |  8 
   kernel/trace/bpf_trace.c   |  2 ++
   net/core/filter.c  | 12 
   tools/include/uapi/linux/bpf.h |  8 
   5 files changed, 31 insertions(+)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 1aac2af12fed..26219465e1f7 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1874,6 +1874,7 @@ extern const struct bpf_func_proto bpf_per_cpu_ptr_proto;
   extern const struct bpf_func_proto bpf_this_cpu_ptr_proto;
   extern const struct bpf_func_proto bpf_ktime_get_coarse_ns_proto;
   extern const struct bpf_func_proto bpf_sock_from_file_proto;
+extern const struct bpf_func_proto bpf_get_socket_ptr_cookie_proto;

   const struct bpf_func_proto *bpf_tracing_func_proto(
  enum bpf_func_id func_id, const struct bpf_prog *prog);
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 0b735c2729b2..5855c398d685 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1673,6 +1673,14 @@ union bpf_attr {
* Return
* A 8-byte long unique number.
*
+ * u64 bpf_get_socket_cookie(void *sk)


should the type be `struct sock *` then?


Checking libbpf's generated bpf_helper_defs.h it generates:

/*
   * bpf_get_socket_cookie
   *
   *  If the **struct sk_buff** pointed by *skb* has a known socket,
   *  retrieve the cookie (generated by the kernel) of this socket.
   *  If no cookie has been set yet, generate a new cookie. Once
   *  generated, the socket cookie remains stable for the life of the
   *  socket. This helper can be useful for monitoring per socket
   *  networking traffic statistics as it provides a global socket
   *  identifier that can be assumed unique.
   *
   * Returns
   *  A 8-byte long non-decreasing number on success, or 0 if the
   *  socket field is missing inside *skb*.
   */
static __u64 (*bpf_get_socket_cookie)(void *ctx) = (void *) 46;

So in terms of helper comment it's picking up the description from the
`u64 bpf_get_socket_cookie(struct sk_buff *skb)` signature. With that
in mind it would likely make sense to add the actual `struct sock *` type
to the comment to make it more clear in here.


One thought that still came to mind when looking over the series again, do
we need to blacklist certain functions from bpf_get_socket_cookie() under
tracing e.g. when attaching to, say fexit? For example, if sk_prot_free()
would be temporary uninlined/exported for testing and bpf_get_socket_cookie()
was invoked from a prog upon fexit where sock was already passed back to
allocator, I presume there's risk of mem corruption, no?


Mh, this is interesting. I can try to add a deny list in v7 but I'm
not sure whether I'll be able to catch them all. I'm assuming that
__sk_destruct, sk_destruct, __sk_free, sk_free would be other
problematic functions but potentially there would be more.


I was just looking at bpf_skb_output() from a7658e1a4164 ("bpf: Check types of
arguments passed into helpers") which afaiu has a similar issue; back at the
time this was introduced there was no fentry/fexit but rather raw tp progs, so
you could still safely dump skb this way including for the kfree_skb() tp.
Presumably if you bpf_skb_output() at 'fexit/kfree_skb' you might be able to
similarly crash your kernel, which I don't think is intentional (also given we
go above and beyond in other areas to avoid crashing or destabilizing, e.g. [0]
to mention one). Maybe these should really only be for BPF_TRACE_RAW_TP (or
BPF_PROG_TYPE_LSM) where it can be audited that it's safe to use as per
a7658e1a4164's original intention ... or have some sort of function annotation
like __acquires/__releases but for tracing, e.g. __frees(skb), where use would
then be blocked (not sure if feasible).

  [0] https://lore.kernel.org/bpf/20210126001219.845816-1-...@fb.com/
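
One possible shape for such a deny list, using the existing BTF ID set
machinery (the set name and enforcement point are hypothetical; the functions
are the ones mentioned above):

  BTF_SET_START(btf_sock_freeing_fns)
  BTF_ID(func, __sk_destruct)
  BTF_ID(func, sk_destruct)
  BTF_ID(func, __sk_free)
  BTF_ID(func, sk_free)
  BTF_SET_END(btf_sock_freeing_fns)

  /* ... and at attach time, reject fentry/fexit programs whose target's
   * BTF ID is in the set, e.g. via btf_id_set_contains().
   */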


Re: [PATCH] bpf: Fix integer overflow in argument calculation for bpf_map_area_alloc

2021-01-27 Thread Daniel Borkmann

On 1/27/21 5:23 AM, Bui Quang Minh wrote:

On Tue, Jan 26, 2021 at 09:36:57AM +, Lorenz Bauer wrote:

On Tue, 26 Jan 2021 at 08:26, Bui Quang Minh  wrote:


On 32-bit architectures, the result of sizeof() is a 32-bit integer, so the
expression becomes a multiplication of two 32-bit integers, which can
potentially overflow. As a result, bpf_map_area_alloc() allocates less memory
than needed.

Fix this by casting 1 operand to u64.


Some quick thoughts:
* Should this have a Fixes tag?


Ok, I will add Fixes tag in later version patch.


* Seems like there are quite a few similar calls scattered around
(cpumap, etc.). Did you audit these as well?



[...]

In cpumap,

static struct bpf_map *cpu_map_alloc(union bpf_attr *attr)
{
cmap->cpu_map = bpf_map_area_alloc(cmap->map.max_entries *
   sizeof(struct 
bpf_cpu_map_entry *),
   cmap->map.numa_node);
}

I think this is safe because max_entries is not permitted to be larger than 
NR_CPUS.


Yes.


In stackmap, there is a place that I'm not very sure about

static int prealloc_elems_and_freelist(struct bpf_stack_map *smap)
{
u32 elem_size = sizeof(struct stack_map_bucket) + 
smap->map.value_size;
smap->elems = bpf_map_area_alloc(elem_size * 
smap->map.max_entries,
 smap->map.numa_node);
}

This is called after another bpf_map_area_alloc() in stack_map_alloc(). In
the first bpf_map_area_alloc() the argument is calculated in a u64 variable;
so if there is an integer overflow in the second one, the first one must have
been called with size > 4GB. I think the first one will probably fail (I am
not sure about the actual limit of vmalloc()), so the second one might not be
called.


I would sanity check this as well. Looks like k*alloc()/v*alloc() call sites
typically use array_size() which returns SIZE_MAX on overflow, 610b15c50e86
("overflow.h: Add allocation size calculation helpers").

Thanks,
Daniel
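
For illustration, the stackmap call quoted above could be made overflow-proof
along these lines (sketch only):

  	/* array_size() from linux/overflow.h saturates to SIZE_MAX on
  	 * overflow, so an overflowing elem_size * max_entries turns into an
  	 * allocation that fails instead of silently under-allocating.
  	 */
  	smap->elems = bpf_map_area_alloc(array_size(elem_size,
  						    smap->map.max_entries),
  					 smap->map.numa_node);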


Re: [PATCH bpf-next] samples/bpf: Add include dir for MIPS Loongson64 to fix build errors

2021-01-26 Thread Daniel Borkmann

On 1/26/21 3:05 PM, Tiezhu Yang wrote:

There exists many build errors when make M=samples/bpf on the Loongson
platform, this issue is MIPS related, x86 compiles just fine.

Here are some errors:

[...]


So we can do the similar things in samples/bpf/Makefile, just add
platform specific and generic include dir for MIPS Loongson64 to
fix the build errors.


Your patch from [0] said ...

  There exists many build warnings when make M=samples/bpf on the Loongson
  platform, this issue is MIPS related, x86 compiles just fine.

  Here are some warnings:
  [...]

  With #ifndef __SANE_USERSPACE_TYPES__  in tools/include/linux/types.h,
  the above error has gone and this ifndef change does not hurt other
  compilations.

... which gave the impression that all the issues were fixed. What else
is needed aside from this patch here? More samples/bpf fixes coming? If
yes, please submit them all as a series instead of individual ones.

 [0] 
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/commit/?id=190d1c921ad0862da14807e1670f54020f48e889


Signed-off-by: Tiezhu Yang 
---
  samples/bpf/Makefile | 4 
  1 file changed, 4 insertions(+)

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 362f314..45ceca4 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -185,6 +185,10 @@ endif
  
  ifeq ($(ARCH), mips)

  TPROGS_CFLAGS += -D__SANE_USERSPACE_TYPES__
+ifdef CONFIG_MACH_LOONGSON64
+BPF_EXTRA_CFLAGS += -I$(srctree)/arch/mips/include/asm/mach-loongson64
+BPF_EXTRA_CFLAGS += -I$(srctree)/arch/mips/include/asm/mach-generic
+endif
  endif
  
  TPROGS_CFLAGS += -Wall -O2






Re: [PATCH bpf 1/2] bpf: support PTR_TO_MEM{,_OR_NULL} register spilling

2021-01-12 Thread Daniel Borkmann

On 1/12/21 8:46 PM, Andrii Nakryiko wrote:

On Tue, Jan 12, 2021 at 1:14 AM Gilad Reti  wrote:


Add support for pointer to mem register spilling, to allow the verifier
to track pointer to valid memory addresses. Such pointers are returned
for example by a successful call of the bpf_ringbuf_reserve helper.

This patch was suggested as a solution by Yonghong Song.

The patch was partially contributed by CyberArk Software, Inc.

Fixes: 457f44363a88 ("bpf: Implement BPF ring buffer and verifier
support for it")


Surprised no one mentioned this yet. Fixes tag should always be on a
single unwrapped line, however long it is, please fix.


Especially for first-time contributors there is usually some luxury that we
would have fixed this up on the fly while applying. ;) But given a v2 is going
to be sent anyway it's good to point it out for future reference, agree.

Thanks,
Daniel


Re: [PATCH 2/2] selftests/bpf: add verifier test for PTR_TO_MEM spill

2021-01-12 Thread Daniel Borkmann

On 1/12/21 4:35 PM, Gilad Reti wrote:

On Tue, Jan 12, 2021 at 4:56 PM KP Singh  wrote:

On Tue, Jan 12, 2021 at 10:16 AM Gilad Reti  wrote:


Add test to check that the verifier is able to recognize spilling of
PTR_TO_MEM registers.


It would be nice to have some explanation of what the test does to
recognize the spilling of the PTR_TO_MEM registers in the commit
log as well.

Would it be possible to augment an existing test_progs
program like tools/testing/selftests/bpf/progs/test_ringbuf.c to test
this functionality?


How would you guarantee that LLVM generates the spill/fill, via inline asm?


It may be possible, but from what I understood from Daniel's comment here

https://lore.kernel.org/bpf/17629073-4fab-a922-ecc3-25b019960...@iogearbox.net/

the test should be a part of the verifier tests (which is reasonable
to me since it is
a verifier bugfix)


Yeah, the test_verifier case as you have is definitely the most straight
forward way to add coverage in this case.


Re: [PATCH] Signed-off-by: giladreti

2021-01-11 Thread Daniel Borkmann

Hello Gilad,

On 1/11/21 4:31 PM, giladreti wrote:

Added support for pointer to mem register spilling, to allow the verifier
to track pointer to valid memory addresses. Such pointers are returned
for example by a successful call of the bpf_ringbuf_reserve helper.

This patch was suggested as a solution by Yonghong Song.


The SoB should not be in subject line but as part of the commit message instead
and with proper name, e.g.

Signed-off-by: Gilad Reti 

For subject line, please use a short summary that fits the patch prefixed with
the subsystem "bpf: [...]", see also [0] as an example. Thanks.

It would be good if you could also add a BPF selftest for this [1].

  [0] 
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/commit/?id=e22d7f05e445165e58feddb4e40cc9c0f94453bc
  [1] 
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/testing/selftests/bpf/
  
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/testing/selftests/bpf/verifier/spill_fill.c


---
  kernel/bpf/verifier.c | 2 ++
  1 file changed, 2 insertions(+)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 17270b8404f1..36af69fac591 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -2217,6 +2217,8 @@ static bool is_spillable_regtype(enum bpf_reg_type type)
case PTR_TO_RDWR_BUF:
case PTR_TO_RDWR_BUF_OR_NULL:
case PTR_TO_PERCPU_BTF_ID:
+   case PTR_TO_MEM:
+   case PTR_TO_MEM_OR_NULL:
return true;
default:
return false;





Re: [PATCH net v2] net: fix use-after-free when UDP GRO with shared fraglist

2021-01-08 Thread Daniel Borkmann

On 1/7/21 3:44 PM, Willem de Bruijn wrote:

On Thu, Jan 7, 2021 at 8:33 AM Daniel Borkmann  wrote:

On 1/7/21 2:05 PM, Willem de Bruijn wrote:

On Thu, Jan 7, 2021 at 7:52 AM Daniel Borkmann  wrote:

On 1/7/21 12:40 PM, Dongseok Yi wrote:

On 2021-01-07 20:05, Daniel Borkmann wrote:

On 1/7/21 1:39 AM, Dongseok Yi wrote:

[...]

PF_PACKET handles untouched fraglist. To modify the payload only
for udp_rcv_segment, skb_unclone is necessary.


I don't parse this last sentence here, please elaborate in more detail on why
it is necessary.

For example, if tc layer would modify mark on the skb, then __copy_skb_header()
in skb_segment_list() will propagate it. If tc layer would modify payload, the
skb_ensure_writable() will take care of that internally and if needed pull in
parts from fraglist into linear section to make it private. The purpose of the
skb_clone() above iff shared is to make the struct itself private (to safely
modify its struct members). What am I missing?


If tc writes, it will call skb_make_writable and thus pskb_expand_head
to create a private linear section for the head_skb.

skb_segment_list overwrites part of the skb linear section of each
fragment itself. Even after skb_clone, the frag_skbs share their
linear section with their clone in pf_packet, so we need a
pskb_expand_head here, too.


Ok, got it, thanks for the explanation. Would be great to have it in the v3
commit log as well as that was more clear than the above. Too bad in that case
(otoh the pf_packet situation could be considered a corner case ...); either
way, fix makes sense then.


Thanks for double checking the tricky logic. Pf_packet + BPF is indeed
a peculiar corner case. But there are perhaps more, like raw sockets,
and other BPF hooks that can trigger an skb_make_writable.

Come to think of it, the no touching shared data rule is also violated
without a BPF program? Then there is nothing in the frag_skbs
themselves signifying that they are shared, but the head_skb is
cloned, so its data may still not be modified.


The skb_ensure_writable() is used in plenty of places from bpf, ovs, netfilter
to core stack (e.g. vlan, mpls, icmp), but either way there should be nothing
wrong with that as it's making sure to pull the data into its linear section
before modification. Uncareful use of skb_store_bits() with offset into 
skb_frags
for example could defeat that, but I don't see any in-tree occurrence that would
be problematic at this point..


This modification happens to be safe in practice, as the pf_packet
clones don't access the bytes before skb->data where this path inserts
headers. But still.


Re: [PATCH net v3] net: fix use-after-free when UDP GRO with shared fraglist

2021-01-08 Thread Daniel Borkmann

On 1/8/21 3:28 AM, Dongseok Yi wrote:

skbs in fraglist could be shared by a BPF filter loaded at TC. If TC
writes, it will call skb_ensure_writable -> pskb_expand_head to create
a private linear section for the head_skb. And then call
skb_clone_fraglist -> skb_get on each skb in the fraglist.

skb_segment_list overwrites part of the skb linear section of each
fragment itself. Even after skb_clone, the frag_skbs share their
linear section with their clone in PF_PACKET.

Both sk_receive_queue of PF_PACKET and PF_INET (or PF_INET6) can have
a link for the same frag_skbs chain. If a new skb (not frags) is
queued to one of the sk_receive_queue, multiple ptypes can see and
release this. It causes use-after-free.

[ 4443.426215] [ cut here ]
[ 4443.426222] refcount_t: underflow; use-after-free.
[ 4443.426291] WARNING: CPU: 7 PID: 28161 at lib/refcount.c:190
refcount_dec_and_test_checked+0xa4/0xc8
[ 4443.426726] pstate: 6045 (nZCv daif +PAN -UAO)
[ 4443.426732] pc : refcount_dec_and_test_checked+0xa4/0xc8
[ 4443.426737] lr : refcount_dec_and_test_checked+0xa0/0xc8
[ 4443.426808] Call trace:
[ 4443.426813]  refcount_dec_and_test_checked+0xa4/0xc8
[ 4443.426823]  skb_release_data+0x144/0x264
[ 4443.426828]  kfree_skb+0x58/0xc4
[ 4443.426832]  skb_queue_purge+0x64/0x9c
[ 4443.426844]  packet_set_ring+0x5f0/0x820
[ 4443.426849]  packet_setsockopt+0x5a4/0xcd0
[ 4443.426853]  __sys_setsockopt+0x188/0x278
[ 4443.426858]  __arm64_sys_setsockopt+0x28/0x38
[ 4443.426869]  el0_svc_common+0xf0/0x1d0
[ 4443.426873]  el0_svc_handler+0x74/0x98
[ 4443.426880]  el0_svc+0x8/0xc

Fixes: 3a1296a38d0c (net: Support GRO/GSO fraglist chaining.)
Signed-off-by: Dongseok Yi 
Acked-by: Willem de Bruijn 


Acked-by: Daniel Borkmann 


Re: [PATCH net v2] net: fix use-after-free when UDP GRO with shared fraglist

2021-01-07 Thread Daniel Borkmann

On 1/7/21 2:05 PM, Willem de Bruijn wrote:

On Thu, Jan 7, 2021 at 7:52 AM Daniel Borkmann  wrote:

On 1/7/21 12:40 PM, Dongseok Yi wrote:

On 2021-01-07 20:05, Daniel Borkmann wrote:

On 1/7/21 1:39 AM, Dongseok Yi wrote:

skbs in fraglist could be shared by a BPF filter loaded at TC. It
triggers skb_ensure_writable -> pskb_expand_head ->
skb_clone_fraglist -> skb_get on each skb in the fraglist.

While tcpdump, sk_receive_queue of PF_PACKET has the original fraglist.
But the same fraglist is queued to PF_INET (or PF_INET6) as the fraglist
chain made by skb_segment_list.

If the new skb (not fraglist) is queued to one of the sk_receive_queue,
multiple ptypes can see this. The skb could be released by ptypes and
it causes use-after-free.

[ 4443.426215] [ cut here ]
[ 4443.426222] refcount_t: underflow; use-after-free.
[ 4443.426291] WARNING: CPU: 7 PID: 28161 at lib/refcount.c:190
refcount_dec_and_test_checked+0xa4/0xc8
[ 4443.426726] pstate: 6045 (nZCv daif +PAN -UAO)
[ 4443.426732] pc : refcount_dec_and_test_checked+0xa4/0xc8
[ 4443.426737] lr : refcount_dec_and_test_checked+0xa0/0xc8
[ 4443.426808] Call trace:
[ 4443.426813]  refcount_dec_and_test_checked+0xa4/0xc8
[ 4443.426823]  skb_release_data+0x144/0x264
[ 4443.426828]  kfree_skb+0x58/0xc4
[ 4443.426832]  skb_queue_purge+0x64/0x9c
[ 4443.426844]  packet_set_ring+0x5f0/0x820
[ 4443.426849]  packet_setsockopt+0x5a4/0xcd0
[ 4443.426853]  __sys_setsockopt+0x188/0x278
[ 4443.426858]  __arm64_sys_setsockopt+0x28/0x38
[ 4443.426869]  el0_svc_common+0xf0/0x1d0
[ 4443.426873]  el0_svc_handler+0x74/0x98
[ 4443.426880]  el0_svc+0x8/0xc

Fixes: 3a1296a38d0c (net: Support GRO/GSO fraglist chaining.)
Signed-off-by: Dongseok Yi 
Acked-by: Willem de Bruijn 
---
net/core/skbuff.c | 20 +++-
1 file changed, 19 insertions(+), 1 deletion(-)

v2: Expand the commit message to clarify a BPF filter loaded

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index f62cae3..1dcbda8 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -3655,7 +3655,8 @@ struct sk_buff *skb_segment_list(struct sk_buff *skb,
 unsigned int delta_truesize = 0;
 unsigned int delta_len = 0;
 struct sk_buff *tail = NULL;
-   struct sk_buff *nskb;
+   struct sk_buff *nskb, *tmp;
+   int err;

 skb_push(skb, -skb_network_offset(skb) + offset);

@@ -3665,11 +3666,28 @@ struct sk_buff *skb_segment_list(struct sk_buff *skb,
 nskb = list_skb;
 list_skb = list_skb->next;

+   err = 0;
+   if (skb_shared(nskb)) {
+   tmp = skb_clone(nskb, GFP_ATOMIC);
+   if (tmp) {
+   kfree_skb(nskb);


Should use consume_skb() to not trigger skb:kfree_skb tracepoint when looking
for drops in the stack.


I will use to consume_skb() on the next version.


+   nskb = tmp;
+   err = skb_unclone(nskb, GFP_ATOMIC);


Could you elaborate why you also need to unclone? This looks odd here. tc layer
(independent of BPF) from ingress & egress side generally assumes unshared skb,
so above clone + dropping ref of nskb looks okay to make the main skb struct 
private
for mangling attributes (e.g. mark) & should suffice. What is the exact purpose 
of
the additional skb_unclone() in this context?


Willem de Bruijn said:
udp_rcv_segment later converts the udp-gro-list skb to a list of
regular packets to pass these one-by-one to udp_queue_rcv_one_skb.
Now all the frags are fully fledged packets, with headers pushed
before the payload.


Yes.


PF_PACKET handles untouched fraglist. To modify the payload only
for udp_rcv_segment, skb_unclone is necessary.


I don't parse this last sentence here, please elaborate in more detail on why
it is necessary.

For example, if tc layer would modify mark on the skb, then __copy_skb_header()
in skb_segment_list() will propagate it. If tc layer would modify payload, the
skb_ensure_writable() will take care of that internally and if needed pull in
parts from fraglist into linear section to make it private. The purpose of the
skb_clone() above iff shared is to make the struct itself private (to safely
modify its struct members). What am I missing?


If tc writes, it will call skb_make_writable and thus pskb_expand_head
to create a private linear section for the head_skb.

skb_segment_list overwrites part of the skb linear section of each
fragment itself. Even after skb_clone, the frag_skbs share their
linear section with their clone in pf_packet, so we need a
pskb_expand_head here, too.


Ok, got it, thanks for the explanation. Would be great to have it in the v3
commit log as well as that was more clear than the above. Too bad in that case
(otoh the pf_packet situation could be considered a corner case ...); either
way, fix makes sense then.


Re: [PATCH net v2] net: fix use-after-free when UDP GRO with shared fraglist

2021-01-07 Thread Daniel Borkmann

On 1/7/21 12:40 PM, Dongseok Yi wrote:

On 2021-01-07 20:05, Daniel Borkmann wrote:

On 1/7/21 1:39 AM, Dongseok Yi wrote:

skbs in fraglist could be shared by a BPF filter loaded at TC. It
triggers skb_ensure_writable -> pskb_expand_head ->
skb_clone_fraglist -> skb_get on each skb in the fraglist.

While tcpdump, sk_receive_queue of PF_PACKET has the original fraglist.
But the same fraglist is queued to PF_INET (or PF_INET6) as the fraglist
chain made by skb_segment_list.

If the new skb (not fraglist) is queued to one of the sk_receive_queue,
multiple ptypes can see this. The skb could be released by ptypes and
it causes use-after-free.

[ 4443.426215] [ cut here ]
[ 4443.426222] refcount_t: underflow; use-after-free.
[ 4443.426291] WARNING: CPU: 7 PID: 28161 at lib/refcount.c:190
refcount_dec_and_test_checked+0xa4/0xc8
[ 4443.426726] pstate: 6045 (nZCv daif +PAN -UAO)
[ 4443.426732] pc : refcount_dec_and_test_checked+0xa4/0xc8
[ 4443.426737] lr : refcount_dec_and_test_checked+0xa0/0xc8
[ 4443.426808] Call trace:
[ 4443.426813]  refcount_dec_and_test_checked+0xa4/0xc8
[ 4443.426823]  skb_release_data+0x144/0x264
[ 4443.426828]  kfree_skb+0x58/0xc4
[ 4443.426832]  skb_queue_purge+0x64/0x9c
[ 4443.426844]  packet_set_ring+0x5f0/0x820
[ 4443.426849]  packet_setsockopt+0x5a4/0xcd0
[ 4443.426853]  __sys_setsockopt+0x188/0x278
[ 4443.426858]  __arm64_sys_setsockopt+0x28/0x38
[ 4443.426869]  el0_svc_common+0xf0/0x1d0
[ 4443.426873]  el0_svc_handler+0x74/0x98
[ 4443.426880]  el0_svc+0x8/0xc

Fixes: 3a1296a38d0c (net: Support GRO/GSO fraglist chaining.)
Signed-off-by: Dongseok Yi 
Acked-by: Willem de Bruijn 
---
   net/core/skbuff.c | 20 +++-
   1 file changed, 19 insertions(+), 1 deletion(-)

v2: Expand the commit message to clarify a BPF filter loaded

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index f62cae3..1dcbda8 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -3655,7 +3655,8 @@ struct sk_buff *skb_segment_list(struct sk_buff *skb,
unsigned int delta_truesize = 0;
unsigned int delta_len = 0;
struct sk_buff *tail = NULL;
-   struct sk_buff *nskb;
+   struct sk_buff *nskb, *tmp;
+   int err;

skb_push(skb, -skb_network_offset(skb) + offset);

@@ -3665,11 +3666,28 @@ struct sk_buff *skb_segment_list(struct sk_buff *skb,
nskb = list_skb;
list_skb = list_skb->next;

+   err = 0;
+   if (skb_shared(nskb)) {
+   tmp = skb_clone(nskb, GFP_ATOMIC);
+   if (tmp) {
+   kfree_skb(nskb);


Should use consume_skb() to not trigger skb:kfree_skb tracepoint when looking
for drops in the stack.


I will use to consume_skb() on the next version.


+   nskb = tmp;
+   err = skb_unclone(nskb, GFP_ATOMIC);


Could you elaborate why you also need to unclone? This looks odd here. tc layer
(independent of BPF) from ingress & egress side generally assumes unshared skb,
so above clone + dropping ref of nskb looks okay to make the main skb struct 
private
for mangling attributes (e.g. mark) & should suffice. What is the exact purpose 
of
the additional skb_unclone() in this context?


Willem de Bruijn said:
udp_rcv_segment later converts the udp-gro-list skb to a list of
regular packets to pass these one-by-one to udp_queue_rcv_one_skb.
Now all the frags are fully fledged packets, with headers pushed
before the payload.


Yes.


PF_PACKET handles untouched fraglist. To modify the payload only
for udp_rcv_segment, skb_unclone is necessary.


I don't parse this last sentence here, please elaborate in more detail on why
it is necessary.

For example, if tc layer would modify mark on the skb, then __copy_skb_header()
in skb_segment_list() will propagate it. If tc layer would modify payload, the
skb_ensure_writable() will take care of that internally and if needed pull in
parts from fraglist into linear section to make it private. The purpose of the
skb_clone() above iff shared is to make the struct itself private (to safely
modify its struct members). What am I missing?


+   } else {
+   err = -ENOMEM;
+   }
+   }
+
if (!tail)
skb->next = nskb;
else
tail->next = nskb;

+   if (unlikely(err)) {
+   nskb->next = list_skb;
+   goto err_linearize;
+   }
+
tail = nskb;

delta_len += nskb->len;








Re: [PATCH net v2] net: fix use-after-free when UDP GRO with shared fraglist

2021-01-07 Thread Daniel Borkmann

On 1/7/21 1:39 AM, Dongseok Yi wrote:

skbs in fraglist could be shared by a BPF filter loaded at TC. It
triggers skb_ensure_writable -> pskb_expand_head ->
skb_clone_fraglist -> skb_get on each skb in the fraglist.

While tcpdump, sk_receive_queue of PF_PACKET has the original fraglist.
But the same fraglist is queued to PF_INET (or PF_INET6) as the fraglist
chain made by skb_segment_list.

If the new skb (not fraglist) is queued to one of the sk_receive_queue,
multiple ptypes can see this. The skb could be released by ptypes and
it causes use-after-free.

[ 4443.426215] [ cut here ]
[ 4443.426222] refcount_t: underflow; use-after-free.
[ 4443.426291] WARNING: CPU: 7 PID: 28161 at lib/refcount.c:190
refcount_dec_and_test_checked+0xa4/0xc8
[ 4443.426726] pstate: 6045 (nZCv daif +PAN -UAO)
[ 4443.426732] pc : refcount_dec_and_test_checked+0xa4/0xc8
[ 4443.426737] lr : refcount_dec_and_test_checked+0xa0/0xc8
[ 4443.426808] Call trace:
[ 4443.426813]  refcount_dec_and_test_checked+0xa4/0xc8
[ 4443.426823]  skb_release_data+0x144/0x264
[ 4443.426828]  kfree_skb+0x58/0xc4
[ 4443.426832]  skb_queue_purge+0x64/0x9c
[ 4443.426844]  packet_set_ring+0x5f0/0x820
[ 4443.426849]  packet_setsockopt+0x5a4/0xcd0
[ 4443.426853]  __sys_setsockopt+0x188/0x278
[ 4443.426858]  __arm64_sys_setsockopt+0x28/0x38
[ 4443.426869]  el0_svc_common+0xf0/0x1d0
[ 4443.426873]  el0_svc_handler+0x74/0x98
[ 4443.426880]  el0_svc+0x8/0xc

Fixes: 3a1296a38d0c (net: Support GRO/GSO fraglist chaining.)
Signed-off-by: Dongseok Yi 
Acked-by: Willem de Bruijn 
---
  net/core/skbuff.c | 20 +++-
  1 file changed, 19 insertions(+), 1 deletion(-)

v2: Expand the commit message to clarify a BPF filter loaded

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index f62cae3..1dcbda8 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -3655,7 +3655,8 @@ struct sk_buff *skb_segment_list(struct sk_buff *skb,
unsigned int delta_truesize = 0;
unsigned int delta_len = 0;
struct sk_buff *tail = NULL;
-   struct sk_buff *nskb;
+   struct sk_buff *nskb, *tmp;
+   int err;
  
  	skb_push(skb, -skb_network_offset(skb) + offset);
  
@@ -3665,11 +3666,28 @@ struct sk_buff *skb_segment_list(struct sk_buff *skb,

nskb = list_skb;
list_skb = list_skb->next;
  
+		err = 0;

+   if (skb_shared(nskb)) {
+   tmp = skb_clone(nskb, GFP_ATOMIC);
+   if (tmp) {
+   kfree_skb(nskb);


Should use consume_skb() to not trigger skb:kfree_skb tracepoint when looking
for drops in the stack.


+   nskb = tmp;
+   err = skb_unclone(nskb, GFP_ATOMIC);


Could you elaborate why you also need to unclone? This looks odd here. tc layer
(independent of BPF) from ingress & egress side generally assumes unshared skb,
so above clone + dropping ref of nskb looks okay to make the main skb struct 
private
for mangling attributes (e.g. mark) & should suffice. What is the exact purpose 
of
the additional skb_unclone() in this context?


+   } else {
+   err = -ENOMEM;
+   }
+   }
+
if (!tail)
skb->next = nskb;
else
tail->next = nskb;
  
+		if (unlikely(err)) {

+   nskb->next = list_skb;
+   goto err_linearize;
+   }
+
tail = nskb;
  
  		delta_len += nskb->len;






Re: [PATCH 03/15] perf: Add build id data in mmap2 event

2020-12-15 Thread Daniel Borkmann

Hey Arnaldo,

On 12/15/20 4:52 PM, Arnaldo Carvalho de Melo wrote:

Em Mon, Dec 14, 2020 at 11:54:45AM +0100, Jiri Olsa escreveu:

Adding support to carry build id data in mmap2 event.

The build id data replaces maj/min/ino/ino_generation
fields, which are also used to identify map's binary,
so it's ok to replace them with build id data:

   union {
   struct {
   u32   maj;
   u32   min;
   u64   ino;
   u64   ino_generation;
   };
   struct {
   u8build_id_size;
   u8__reserved_1;
   u16   __reserved_2;
   u8build_id[20];
   };
   };


Alexei/Daniel, this one depends on BPFs build id routines to be exported
for use by the perf kernel subsys, PeterZ already acked this, so can you
guys consider getting the first three patches in this series via the bpf
tree?

The BPF bits were acked by Song.


All the net-next and therefore also bpf-next bits for 5.11 were just merged
by Linus into his tree. If you need the first 3 from [0] to land for this merge
window, it's probably easiest if you take them in and send them via perf tree
directly in case you didn't send out a pull-req yet.. (alternatively I'll ping
David/Jakub if they plan to make a 2nd net-next pull-req end of this week).

Thanks,
Daniel

 [0] https://lore.kernel.org/lkml/20201214105457.543111-1-jo...@kernel.org/


Re: [PATCH bpf-next v2] libbpf: Expose libbpf ringbufer epoll_fd

2020-12-14 Thread Daniel Borkmann

On 12/14/20 12:38 PM, Brendan Jackman wrote:

This provides a convenient perf ringbuf -> libbpf ringbuf migration
path for users of external polling systems. It is analogous to
perf_buffer__epoll_fd.

Signed-off-by: Brendan Jackman 
---
Difference from v1: Added entry to libbpf.map.

  tools/lib/bpf/libbpf.h   | 1 +
  tools/lib/bpf/libbpf.map | 1 +
  tools/lib/bpf/ringbuf.c  | 6 ++
  3 files changed, 8 insertions(+)

diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
index 6909ee81113a..cde07f64771e 100644
--- a/tools/lib/bpf/libbpf.h
+++ b/tools/lib/bpf/libbpf.h
@@ -536,6 +536,7 @@ LIBBPF_API int ring_buffer__add(struct ring_buffer *rb, int 
map_fd,
ring_buffer_sample_fn sample_cb, void *ctx);
  LIBBPF_API int ring_buffer__poll(struct ring_buffer *rb, int timeout_ms);
  LIBBPF_API int ring_buffer__consume(struct ring_buffer *rb);
+LIBBPF_API int ring_buffer__epoll_fd(struct ring_buffer *rb);

  /* Perf buffer APIs */
  struct perf_buffer;
diff --git a/tools/lib/bpf/libbpf.map b/tools/lib/bpf/libbpf.map
index 7c4126542e2b..7be850271be6 100644
--- a/tools/lib/bpf/libbpf.map
+++ b/tools/lib/bpf/libbpf.map
@@ -348,4 +348,5 @@ LIBBPF_0.3.0 {
btf__new_split;
xsk_setup_xdp_prog;
xsk_socket__update_xskmap;
+ring_buffer__epoll_fd;


Fyi, this had a whitespace issue, Andrii fixed it up while applying.


  } LIBBPF_0.2.0;
diff --git a/tools/lib/bpf/ringbuf.c b/tools/lib/bpf/ringbuf.c
index 5c6522c89af1..45a36648b403 100644
--- a/tools/lib/bpf/ringbuf.c
+++ b/tools/lib/bpf/ringbuf.c
@@ -282,3 +282,9 @@ int ring_buffer__poll(struct ring_buffer *rb, int 
timeout_ms)
}
return cnt < 0 ? -errno : res;
  }
+
+/* Get an fd that can be used to sleep until data is available in the ring(s) 
*/
+int ring_buffer__epoll_fd(struct ring_buffer *rb)
+{
+   return rb->epoll_fd;
+}

base-commit: b4fe9fec51ef48011f11c2da4099f0b530449c92
--
2.29.2.576.ga3fc446d84-goog





Re: [PATCH bpf-next v4 2/4] bpf: Expose bpf_get_socket_cookie to tracing programs

2020-12-09 Thread Daniel Borkmann

On 12/9/20 2:26 PM, Florent Revest wrote:

This needs two new helpers, one that works in a sleepable context (using
sock_gen_cookie which disables/enables preemption) and one that does not
(for performance reasons). Both take a struct sock pointer and need to
check it for NULLness.

This helper could also be useful to other BPF program types such as LSM.


Looks like this commit description is now stale and needs to be updated
since we only really add one helper?


Signed-off-by: Florent Revest 
---
  include/linux/bpf.h|  1 +
  include/uapi/linux/bpf.h   |  7 +++
  kernel/trace/bpf_trace.c   |  2 ++
  net/core/filter.c  | 12 
  tools/include/uapi/linux/bpf.h |  7 +++
  5 files changed, 29 insertions(+)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 07cb5d15e743..5a858e8c3f1a 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1860,6 +1860,7 @@ extern const struct bpf_func_proto bpf_per_cpu_ptr_proto;
  extern const struct bpf_func_proto bpf_this_cpu_ptr_proto;
  extern const struct bpf_func_proto bpf_ktime_get_coarse_ns_proto;
  extern const struct bpf_func_proto bpf_sock_from_file_proto;
+extern const struct bpf_func_proto bpf_get_socket_ptr_cookie_proto;
  
  const struct bpf_func_proto *bpf_tracing_func_proto(

enum bpf_func_id func_id, const struct bpf_prog *prog);
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index ba59309f4d18..9ac66cf25959 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1667,6 +1667,13 @@ union bpf_attr {
   *Return
   *A 8-byte long unique number.
   *
+ * u64 bpf_get_socket_cookie(void *sk)
+ * Description
+ * Equivalent to **bpf_get_socket_cookie**\ () helper that accepts
+ * *sk*, but gets socket from a BTF **struct sock**.


Maybe add a small comment that this one also works for sleepable [tracing] 
progs?


+ * Return
+ * A 8-byte long unique number.


... or 0 if *sk* is NULL.
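
Putting both points together, the description could read roughly as follows
(final wording up to the author):

   * u64 bpf_get_socket_cookie(void *sk)
   * 	Description
   * 		Equivalent to **bpf_get_socket_cookie**\ () helper that accepts
   * 		*sk*, but gets socket from a BTF **struct sock**. This helper
   * 		also works for sleepable programs.
   * 	Return
   * 		A 8-byte long unique number or 0 if *sk* is NULL.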


   * u32 bpf_get_socket_uid(struct sk_buff *skb)
   *Return
   *The owner UID of the socket associated to *skb*. If the socket
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 52ddd217d6a1..be5e96de306d 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -1760,6 +1760,8 @@ tracing_prog_func_proto(enum bpf_func_id func_id, const 
struct bpf_prog *prog)
return &bpf_sk_storage_delete_tracing_proto;
case BPF_FUNC_sock_from_file:
return &bpf_sock_from_file_proto;
+   case BPF_FUNC_get_socket_cookie:
+   return &bpf_get_socket_ptr_cookie_proto;
  #endif
case BPF_FUNC_seq_printf:
return prog->expected_attach_type == BPF_TRACE_ITER ?
diff --git a/net/core/filter.c b/net/core/filter.c
index 255aeee72402..13ad9a64f04f 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4631,6 +4631,18 @@ static const struct bpf_func_proto 
bpf_get_socket_cookie_sock_proto = {
.arg1_type  = ARG_PTR_TO_CTX,
  };
  
+BPF_CALL_1(bpf_get_socket_ptr_cookie, struct sock *, sk)

+{
+   return sk ? sock_gen_cookie(sk) : 0;
+}
+
+const struct bpf_func_proto bpf_get_socket_ptr_cookie_proto = {
+   .func   = bpf_get_socket_ptr_cookie,
+   .gpl_only   = false,
+   .ret_type   = RET_INTEGER,
+   .arg1_type  = ARG_PTR_TO_BTF_ID_SOCK_COMMON,
+};
+
  BPF_CALL_1(bpf_get_socket_cookie_sock_ops, struct bpf_sock_ops_kern *, ctx)
  {
return __sock_gen_cookie(ctx->sk);
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index ba59309f4d18..9ac66cf25959 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -1667,6 +1667,13 @@ union bpf_attr {
   *Return
   *A 8-byte long unique number.
   *
+ * u64 bpf_get_socket_cookie(void *sk)
+ * Description
+ * Equivalent to **bpf_get_socket_cookie**\ () helper that accepts
+ * *sk*, but gets socket from a BTF **struct sock**.
+ * Return
+ * A 8-byte long unique number.
+ *
   * u32 bpf_get_socket_uid(struct sk_buff *skb)
   *Return
   *The owner UID of the socket associated to *skb*. If the socket





Re: [PATCH bpf-next v2 1/3] bpf: Expose bpf_get_socket_cookie to tracing programs

2020-12-09 Thread Daniel Borkmann

On 12/8/20 8:30 PM, Florent Revest wrote:

On Fri, 2020-12-04 at 20:03 +0100, Daniel Borkmann wrote:

On 12/4/20 7:56 PM, Daniel Borkmann wrote:

On 12/3/20 10:33 PM, Florent Revest wrote:

This creates a new helper proto because the existing
bpf_get_socket_cookie_sock_proto has a ARG_PTR_TO_CTX argument
and only
works for BPF programs where the context is a sock.

This helper could also be useful to other BPF program types such
as LSM.

Signed-off-by: Florent Revest 
---
   include/uapi/linux/bpf.h   | 7 +++
   kernel/trace/bpf_trace.c   | 4 
   net/core/filter.c  | 7 +++
   tools/include/uapi/linux/bpf.h | 7 +++
   4 files changed, 25 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index c3458ec1f30a..3e0e33c43998 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1662,6 +1662,13 @@ union bpf_attr {
* Return
* A 8-byte long non-decreasing number.
*
+ * u64 bpf_get_socket_cookie(void *sk)
+ * Description
+ * Equivalent to **bpf_get_socket_cookie**\ () helper
that accepts
+ * *sk*, but gets socket from a BTF **struct sock**.
+ * Return
+ * A 8-byte long non-decreasing number.


I would not mention this here since it's not fully correct and we
should avoid users taking non-decreasing for granted in their progs.
The only assumption you can make is that it can be considered a
unique number. See also [0] with reverse counter..

[0]
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=92acdc58ab11af66fcaef485433fde61b5e32fac


Ah this is a good point, thank you! I will send a v3 with an extra
patch that s/non-decreasing/unique/ in the other descriptions. I had
not given it any extra thought, I just stupidly copied/pasted existing
descriptions. :)


One more thought, in case you plan to use this from sleepable
context, you would need to use sock_gen_cookie() variant in the BPF
helper instead.


Out of curiosity, why don't we just always call sock_gen_cookie? Is it
to avoid the performance impact of increasing the preempt counter and
introducing a memory barriers ?


Yes, all the other contexts where the existing helpers are used already have
preemption disabled, so the extra preempt_{disable,enable}() is unnecessary
overhead given we want the cookie generation to be efficient.

Thanks,
Daniel
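
For reference, the sleepable-safe variant is essentially just the lockless
cookie generation wrapped in a preemption-safe section, roughly:

  static inline u64 sock_gen_cookie(struct sock *sk)
  {
  	u64 cookie;

  	preempt_disable();
  	cookie = __sock_gen_cookie(sk);
  	preempt_enable();

  	return cookie;
  }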


Re: [PATCH bpf-next v9 00/34] bpf: switch to memcg-based memory accounting

2020-12-04 Thread Daniel Borkmann

On 12/3/20 4:26 AM, Roman Gushchin wrote:

On Wed, Dec 02, 2020 at 06:54:46PM -0800, Alexei Starovoitov wrote:

On Tue, Dec 1, 2020 at 1:59 PM Roman Gushchin  wrote:


5) Cryptic -EPERM is returned on exceeding the limit. Libbpf even had
a function to "explain" this case for users.

...

v9:
   - always charge the saved memory cgroup, by Daniel, Toke and Alexei
   - added bpf_map_kzalloc()
   - rebase and minor fixes


This looks great. Applied.


Thanks!


Please follow up with a change to libbpf's pr_perm_msg().
That helpful warning should stay for old kernels, but it would be
misleading for new kernels.
libbpf probably needs a feature check to make this warning conditional.


I think we've discussed it several months ago and at that time we didn't
find a good way to check this feature. I'll think again, but if somebody
has any ideas here, I'll appreciate a lot.


Hm, bit tricky, agree .. given we only throw the warning in pr_perm_msg() for
non-root and thus probing options are also limited, otherwise just probing for
a helper that was added in this same cycle would have been good enough as a
simple heuristic. I wonder if it would make sense to add some hint inside the
bpf_{prog,map}_show_fdinfo() to indicate that accounting with memcg is enabled
for the prog/map one way or another? Not just for the sake of pr_perm_msg(), but
in general for apps to stop messing with rlimit at this point. Maybe also
bpftool feature probe could be extended to indicate that as well (e.g. the
json output can be fed into Go natively).
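
One conceivable userspace heuristic (not necessarily what libbpf will end up
doing): with RLIMIT_MEMLOCK forced to 0, creating even a tiny map only
succeeds when the kernel charges BPF memory to the memcg rather than against
memlock. Sketch only, destructive for simplicity; a real probe would
save/restore the rlimit or fork:

  static bool probe_memcg_based_accounting(void)
  {
  	struct rlimit zero = {};
  	int fd;

  	if (setrlimit(RLIMIT_MEMLOCK, &zero))
  		return false;
  	fd = bpf_create_map(BPF_MAP_TYPE_ARRAY, sizeof(int), sizeof(int), 1, 0);
  	if (fd < 0)
  		return false;
  	close(fd);
  	return true;
  }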


Re: [PATCH bpf-next v2 1/3] bpf: Expose bpf_get_socket_cookie to tracing programs

2020-12-04 Thread Daniel Borkmann

On 12/4/20 7:56 PM, Daniel Borkmann wrote:

On 12/3/20 10:33 PM, Florent Revest wrote:

This creates a new helper proto because the existing
bpf_get_socket_cookie_sock_proto has a ARG_PTR_TO_CTX argument and only
works for BPF programs where the context is a sock.

This helper could also be useful to other BPF program types such as LSM.

Signed-off-by: Florent Revest 
---
  include/uapi/linux/bpf.h   | 7 +++
  kernel/trace/bpf_trace.c   | 4 
  net/core/filter.c  | 7 +++
  tools/include/uapi/linux/bpf.h | 7 +++
  4 files changed, 25 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index c3458ec1f30a..3e0e33c43998 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1662,6 +1662,13 @@ union bpf_attr {
   * Return
   * A 8-byte long non-decreasing number.
   *
+ * u64 bpf_get_socket_cookie(void *sk)
+ * Description
+ * Equivalent to **bpf_get_socket_cookie**\ () helper that accepts
+ * *sk*, but gets socket from a BTF **struct sock**.
+ * Return
+ * A 8-byte long non-decreasing number.


I would not mention this here since it's not fully correct and we should
avoid users taking non-decreasing for granted in their progs. The only
assumption you can make is that it can be considered a unique number. See
also [0] with reverse counter..

   [0] 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=92acdc58ab11af66fcaef485433fde61b5e32fac


One more thought, in case you plan to use this from sleepable context, you would
need to use sock_gen_cookie() variant in the BPF helper instead.


Re: [PATCH bpf-next v2 1/3] bpf: Expose bpf_get_socket_cookie to tracing programs

2020-12-04 Thread Daniel Borkmann

On 12/3/20 10:33 PM, Florent Revest wrote:

This creates a new helper proto because the existing
bpf_get_socket_cookie_sock_proto has a ARG_PTR_TO_CTX argument and only
works for BPF programs where the context is a sock.

This helper could also be useful to other BPF program types such as LSM.

Signed-off-by: Florent Revest 
---
  include/uapi/linux/bpf.h   | 7 +++
  kernel/trace/bpf_trace.c   | 4 
  net/core/filter.c  | 7 +++
  tools/include/uapi/linux/bpf.h | 7 +++
  4 files changed, 25 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index c3458ec1f30a..3e0e33c43998 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1662,6 +1662,13 @@ union bpf_attr {
   *Return
   *A 8-byte long non-decreasing number.
   *
+ * u64 bpf_get_socket_cookie(void *sk)
+ * Description
+ * Equivalent to **bpf_get_socket_cookie**\ () helper that accepts
+ * *sk*, but gets socket from a BTF **struct sock**.
+ * Return
+ * A 8-byte long non-decreasing number.


I would not mention this here since it's not fully correct and we should
avoid users taking non-decreasing for granted in their progs. The only
assumption you can make is that it can be considered a unique number. See
also [0] with reverse counter..

  [0] 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=92acdc58ab11af66fcaef485433fde61b5e32fac


+ *
   * u32 bpf_get_socket_uid(struct sk_buff *skb)
   *Return
   *The owner UID of the socket associated to *skb*. If the socket
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index d255bc9b2bfa..14ad96579813 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -1725,6 +1725,8 @@ raw_tp_prog_func_proto(enum bpf_func_id func_id, const 
struct bpf_prog *prog)
}
  }
  
+extern const struct bpf_func_proto bpf_get_socket_cookie_sock_tracing_proto;

+
  const struct bpf_func_proto *
  tracing_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
  {
@@ -1748,6 +1750,8 @@ tracing_prog_func_proto(enum bpf_func_id func_id, const 
struct bpf_prog *prog)
return &bpf_sk_storage_get_tracing_proto;
case BPF_FUNC_sk_storage_delete:
return &bpf_sk_storage_delete_tracing_proto;
+   case BPF_FUNC_get_socket_cookie:
+   return &bpf_get_socket_cookie_sock_tracing_proto;
  #endif
case BPF_FUNC_seq_printf:
return prog->expected_attach_type == BPF_TRACE_ITER ?
diff --git a/net/core/filter.c b/net/core/filter.c
index 2ca5eecebacf..177c4e5e529d 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4631,6 +4631,13 @@ static const struct bpf_func_proto 
bpf_get_socket_cookie_sock_proto = {
.arg1_type  = ARG_PTR_TO_CTX,
  };
  
+const struct bpf_func_proto bpf_get_socket_cookie_sock_tracing_proto = {

+   .func   = bpf_get_socket_cookie_sock,
+   .gpl_only   = false,
+   .ret_type   = RET_INTEGER,
+   .arg1_type  = ARG_PTR_TO_BTF_ID_SOCK_COMMON,
+};
+


Re: linux-next: build failure after merge of the bpf-next tree

2020-12-01 Thread Daniel Borkmann

On 12/1/20 9:07 AM, Stephen Rothwell wrote:

Hi all,

After merging the bpf-next tree, today's linux-next build (x86_64
allnoconfig) failed like this:

In file included from fs/select.c:32:
include/net/busy_poll.h: In function 'sk_mark_napi_id_once':
include/net/busy_poll.h:150:36: error: 'const struct sk_buff' has no member 
named 'napi_id'
   150 |  __sk_mark_napi_id_once_xdp(sk, skb->napi_id);
   |^~

Caused by commit

   b02e5a0ebb17 ("xsk: Propagate napi_id to XDP socket Rx path")



Fixed it up in bpf-next, thanks for reporting!


Re: [PATCH] bpf: remove trailing semicolon in macro definition

2020-11-27 Thread Daniel Borkmann

On 11/27/20 8:27 PM, t...@redhat.com wrote:

From: Tom Rix 

The macro use will already have a semicolon.

Signed-off-by: Tom Rix 
---
  include/trace/events/xdp.h | 4 ++--
  1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/trace/events/xdp.h b/include/trace/events/xdp.h
index cd24e8a59529..65ffedf8386f 100644
--- a/include/trace/events/xdp.h
+++ b/include/trace/events/xdp.h
@@ -146,13 +146,13 @@ DEFINE_EVENT(xdp_redirect_template, xdp_redirect_err,
  );
  
  #define _trace_xdp_redirect(dev, xdp, to)		\

-trace_xdp_redirect(dev, xdp, NULL, 0, NULL, to);
+trace_xdp_redirect(dev, xdp, NULL, 0, NULL, to)
  
  #define _trace_xdp_redirect_err(dev, xdp, to, err)	\

 trace_xdp_redirect_err(dev, xdp, NULL, err, NULL, to);
  
  #define _trace_xdp_redirect_map(dev, xdp, to, map, index)		\

-trace_xdp_redirect(dev, xdp, to, 0, map, index);
+trace_xdp_redirect(dev, xdp, to, 0, map, index)
  
  #define _trace_xdp_redirect_map_err(dev, xdp, to, map, index, err)	\

 trace_xdp_redirect_err(dev, xdp, to, err, map, index);



This looks random; why those but not other locations?

Thanks,
Daniel


Re: [PATCH bpf v2 2/2] xsk: change the tx writeable condition

2020-11-27 Thread Daniel Borkmann

On 11/25/20 7:48 AM, Xuan Zhuo wrote:

Modify the tx writeable condition from the queue is not full to the
number of present tx queues is less than the half of the total number
of queues. Because the tx queue not full is a very short time, this will
cause a large number of EPOLLOUT events, and cause a large number of
process wake up.

Signed-off-by: Xuan Zhuo 


This one doesn't apply cleanly against bpf tree, please rebase. Small comment
inline while looking over the patch:


---
  net/xdp/xsk.c   | 16 +---
  net/xdp/xsk_queue.h |  6 ++
  2 files changed, 19 insertions(+), 3 deletions(-)

diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 0df8651..22e35e9 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -211,6 +211,14 @@ static int __xsk_rcv(struct xdp_sock *xs, struct xdp_buff 
*xdp, u32 len,
return 0;
  }
  
+static bool xsk_tx_writeable(struct xdp_sock *xs)

+{
+   if (xskq_cons_present_entries(xs->tx) > xs->tx->nentries / 2)
+   return false;
+
+   return true;
+}
+
  static bool xsk_is_bound(struct xdp_sock *xs)
  {
if (READ_ONCE(xs->state) == XSK_BOUND) {
@@ -296,7 +304,8 @@ void xsk_tx_release(struct xsk_buff_pool *pool)
rcu_read_lock();
	list_for_each_entry_rcu(xs, &pool->xsk_tx_list, tx_list) {
		__xskq_cons_release(xs->tx);
-		xs->sk.sk_write_space(&xs->sk);
+		if (xsk_tx_writeable(xs))
+			xs->sk.sk_write_space(&xs->sk);
}
rcu_read_unlock();
  }
@@ -499,7 +508,8 @@ static int xsk_generic_xmit(struct sock *sk)
  
  out:

if (sent_frame)
-   sk->sk_write_space(sk);
+   if (xsk_tx_writeable(xs))
+   sk->sk_write_space(sk);
  
	mutex_unlock(&xs->mutex);

return err;
@@ -556,7 +566,7 @@ static __poll_t xsk_poll(struct file *file, struct socket 
*sock,
  
  	if (xs->rx && !xskq_prod_is_empty(xs->rx))

mask |= EPOLLIN | EPOLLRDNORM;
-   if (xs->tx && !xskq_cons_is_full(xs->tx))
+   if (xs->tx && xsk_tx_writeable(xs))
mask |= EPOLLOUT | EPOLLWRNORM;
  
  	return mask;

diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
index b936c46..b655004 100644
--- a/net/xdp/xsk_queue.h
+++ b/net/xdp/xsk_queue.h
@@ -307,6 +307,12 @@ static inline bool xskq_cons_is_full(struct xsk_queue *q)
q->nentries;
  }
  
+static inline __u64 xskq_cons_present_entries(struct xsk_queue *q)


Types prefixed with __ are mainly for user-space facing things like uapi headers,
so in-kernel this should be u64. Is there a reason this is not done as u32 (and
thus the same type as producer and consumer)?


+{
+   /* No barriers needed since data is not accessed */
+   return READ_ONCE(q->ring->producer) - READ_ONCE(q->ring->consumer);
+}
+
  /* Functions for producers */
  
  static inline u32 xskq_prod_nb_free(struct xsk_queue *q, u32 max)
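
I.e. given producer and consumer are u32 ring indices, a u32 variant along these
lines should be sufficient (sketch only, same body otherwise); the unsigned
subtraction also handles index wraparound:

static inline u32 xskq_cons_present_entries(struct xsk_queue *q)
{
	/* No barriers needed since data is not accessed */
	return READ_ONCE(q->ring->producer) - READ_ONCE(q->ring->consumer);
}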






Re: [PATCH bpf-next v8 06/34] bpf: prepare for memcg-based memory accounting for bpf maps

2020-11-25 Thread Daniel Borkmann

On 11/25/20 4:00 AM, Roman Gushchin wrote:

In the absolute majority of cases if a process is making a kernel
allocation, it's memory cgroup is getting charged.

Bpf maps can be updated from an interrupt context and in such
case there is no process which can be charged. It makes the memory
accounting of bpf maps non-trivial.

Fortunately, after commit 4127c6504f25 ("mm: kmem: enable kernel
memcg accounting from interrupt contexts") and b87d8cefe43c
("mm, memcg: rework remote charging API to support nesting")
it's finally possible.

To do it, a pointer to the memory cgroup of the process, which created
the map, is saved, and this cgroup can be charged for all allocations
made from an interrupt context. This commit introduces 2 helpers:
bpf_map_kmalloc_node() and bpf_map_alloc_percpu(). They can be used in
the bpf code for accounted memory allocations, both in the process and
interrupt contexts. In the interrupt context they're using the saved
memory cgroup, otherwise the current cgroup is getting charged.

Signed-off-by: Roman Gushchin 


Thanks for updating the cover letter; replying in this series instead
on one more item that came to mind:

[...]

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index f3fe9f53f93c..4154c616788c 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -31,6 +31,8 @@
  #include 
  #include 
  #include 
+#include 
+#include 
  
  #define IS_FD_ARRAY(map) ((map)->map_type == BPF_MAP_TYPE_PERF_EVENT_ARRAY || \

  (map)->map_type == BPF_MAP_TYPE_CGROUP_ARRAY || \
@@ -456,6 +458,77 @@ void bpf_map_free_id(struct bpf_map *map, bool do_idr_lock)
__release(_idr_lock);
  }
  
+#ifdef CONFIG_MEMCG_KMEM

+static void bpf_map_save_memcg(struct bpf_map *map)
+{
+   map->memcg = get_mem_cgroup_from_mm(current->mm);
+}
+
+static void bpf_map_release_memcg(struct bpf_map *map)
+{
+   mem_cgroup_put(map->memcg);
+}
+
+void *bpf_map_kmalloc_node(const struct bpf_map *map, size_t size, gfp_t flags,
+  int node)
+{
+   struct mem_cgroup *old_memcg;
+   bool in_interrupt;
+   void *ptr;
+
+   /*
+* If the memory allocation is performed from an interrupt context,
+* the memory cgroup to charge can't be determined from the context
+* of the current task. Instead, we charge the memory cgroup, which
+* contained the process created the map.
+*/
+   in_interrupt = in_interrupt();
+   if (in_interrupt)
+   old_memcg = set_active_memcg(map->memcg);
+
+   ptr = kmalloc_node(size, flags, node);
+
+   if (in_interrupt)
+   set_active_memcg(old_memcg);
+
+   return ptr;
+}
+
+void __percpu *bpf_map_alloc_percpu(const struct bpf_map *map, size_t size,
+   size_t align, gfp_t gfp)
+{
+   struct mem_cgroup *old_memcg;
+   bool in_interrupt;
+   void *ptr;
+
+   /*
+* If the memory allocation is performed from an interrupt context,
+* the memory cgroup to charge can't be determined from the context
+* of the current task. Instead, we charge the memory cgroup, which
+* contained the process created the map.
+*/
+   in_interrupt = in_interrupt();
+   if (in_interrupt)
+   old_memcg = set_active_memcg(map->memcg);
+
+   ptr = __alloc_percpu_gfp(size, align, gfp);
+
+   if (in_interrupt)
+   set_active_memcg(old_memcg);


For this and above bpf_map_kmalloc_node() one, wouldn't it make more sense to
perform the temporary memcg unconditionally?

old_memcg = set_active_memcg(map->memcg);
ptr = kmalloc_node(size, flags, node);
set_active_memcg(old_memcg);

I think the semantics are otherwise a bit weird and the charging unpredictable;
this way it would /always/ be accounted against the prog in the memcg that
originally created the map.

E.g. maps could be shared between progs attached to, say, XDP/tc where in_interrupt()
holds true, and progs attached to skb-cgroup/egress where we're still in process
context. So some part of the memory is charged against the original map's memcg
and
some other part against the current process' memcg which seems odd, no? Or, for 
example,
if we start to run a tracing BPF prog which updates state in a BPF map ... that 
tracing
prog now interferes with processes in other memcgs which may not be intentional 
& could
lead to potential failures there as opposed when the tracing prog is not run. 
My concern
is that the semantics are not quite clear and behavior unpredictable compared 
to always
charging against map->memcg.

Similarly, what if an orchestration prog creates dedicated memcg(s) for maps 
with
individual limits ... the assumed behavior (imho) would be that whatever memory 
is
accounted on the map it can be accurately retrieved from there & similarly 
limits
enforced, no? It seems that would not be the case currently.

Thoughts?


+   

Re: [PATCH bpf-next v3 1/3] ima: Implement ima_inode_hash

2020-11-25 Thread Daniel Borkmann

On 11/25/20 1:04 PM, KP Singh wrote:

On Tue, Nov 24, 2020 at 6:35 PM Yonghong Song  wrote:

On 11/24/20 7:12 AM, KP Singh wrote:

From: KP Singh 

This is in preparation to add a helper for BPF LSM programs to use
IMA hashes when attached to LSM hooks. There are LSM hooks like
inode_unlink which do not have a struct file * argument and cannot
use the existing ima_file_hash API.

An inode based API is, therefore, useful in LSM based detections like an
executable trying to delete itself which rely on the inode_unlink LSM
hook.

Moreover, the ima_file_hash function does nothing with the struct file
pointer apart from calling file_inode on it and converting it to an
inode.

Signed-off-by: KP Singh 


There is no change for this patch compared to previous version,
so you can carry my Ack.

Acked-by: Yonghong Song 


I am guessing:

*  We need an Ack from Mimi/James.


Yes.


* As regards to which tree, I guess bpf-next would be better since the
BPF helper and the selftest depends on it


Yep, bpf-next is my preference as otherwise we're running into unnecessary
merge conflicts.

Thanks,
Daniel


Re: [PATCH bpf-next v7 00/34] bpf: switch to memcg-based memory accounting

2020-11-23 Thread Daniel Borkmann

On 11/19/20 6:37 PM, Roman Gushchin wrote:

Currently bpf is using the memlock rlimit for the memory accounting.
This approach has its downsides and over time has created a significant
amount of problems:

1) The limit is per-user, but because most bpf operations are performed
as root, the limit has a little value.

2) It's hard to come up with a specific maximum value. Especially because
the counter is shared with non-bpf users (e.g. memlock() users).
Any specific value is either too low and creates false failures
or too high and useless.

3) Charging is not connected to the actual memory allocation. Bpf code
should manually calculate the estimated cost and precharge the counter,
and then take care of uncharging, including all fail paths.
It adds to the code complexity and makes it easy to leak a charge.

4) There is no simple way of getting the current value of the counter.
We've used drgn for it, but it's far from being convenient.

5) Cryptic -EPERM is returned on exceeding the limit. Libbpf even had
a function to "explain" this case for users.

In order to overcome these problems let's switch to the memcg-based
memory accounting of bpf objects. With the recent addition of the percpu
memory accounting, now it's possible to provide a comprehensive accounting
of the memory used by bpf programs and maps.

This approach has the following advantages:
1) The limit is per-cgroup and hierarchical. It's way more flexible and allows
a better control over memory usage by different workloads. Of course, it
requires enabled cgroups and kernel memory accounting and properly 
configured
cgroup tree, but it's a default configuration for a modern Linux system.

2) The actual memory consumption is taken into account. It happens automatically
on the allocation time if __GFP_ACCOUNT flags is passed. Uncharging is also
performed automatically on releasing the memory. So the code on the bpf side
becomes simpler and safer.

3) There is a simple way to get the current value and statistics.

In general, if a process performs a bpf operation (e.g. creates or updates
a map), it's memory cgroup is charged. However map updates performed from
an interrupt context are charged to the memory cgroup which contained
the process, which created the map.

Providing a 1:1 replacement for the rlimit-based memory accounting is
a non-goal of this patchset. Users and memory cgroups are completely
orthogonal, so it's not possible even in theory.
Memcg-based memory accounting requires a properly configured cgroup tree
to be actually useful. However, it's the way how the memory is managed
on a modern Linux system.


The cover letter here only describes the advantages of this series, but leaves
out discussion of the disadvantages. They definitely must be part of the series
to provide a clear description of the semantic changes to readers. Last time we
discussed them, they were i) no mem limits in general on unprivileged users when
memory cgroups were not configured in the kernel, and ii) no mem limits by default
if not configured in the cgroup specifically. Did we make any progress on these
in the meantime? How do we want to address them? What is the concrete 
justification
to not address them?

Also I wonder what the risk of regressions is here, for example, if an existing
orchestrator has configured memory cgroup limits that are tailored to the 
application's
needs.. now, with kernel upgrade BPF will start to interfere, e.g. if a BPF 
program
attached to cgroups (e.g. connect/sendmsg/recvmsg or general cgroup skb egress 
hook)
starts charging to the process' memcg due to map updates?

  [0] 
https://lore.kernel.org/bpf/20200803190639.gd1020...@carbon.dhcp.thefacebook.com/


The patchset consists of the following parts:
1) 4 mm patches, which are already in the mm tree, but are required
to avoid a regression (otherwise vmallocs cannot be mapped to userspace).
2) memcg-based accounting for various bpf objects: progs and maps
3) removal of the rlimit-based accounting
4) removal of rlimit adjustments in userspace samples

First 4 patches are not supposed to be merged via the bpf tree. I'm including
them to make sure bpf tests will pass.

v7:
   - introduced bpf_map_kmalloc_node() and bpf_map_alloc_percpu(), by Alexei
   - switched allocations made from an interrupt context to new helpers,
 by Daniel
   - rebase and minor fixes


Re: [PATCH] bpf: Check the return value of dev_get_by_index_rcu()

2020-11-20 Thread Daniel Borkmann

On 11/20/20 4:19 PM, David Ahern wrote:

On 11/20/20 8:13 AM, Daniel Borkmann wrote:

[ +David ]

On 11/19/20 8:04 AM, xiakaixu1...@gmail.com wrote:

From: Kaixu Xia 

The return value of dev_get_by_index_rcu() can be NULL, so here it
is need to check the return value and return error code if it is NULL.

Signed-off-by: Kaixu Xia 
---
   net/core/filter.c | 2 ++
   1 file changed, 2 insertions(+)

diff --git a/net/core/filter.c b/net/core/filter.c
index 2ca5eecebacf..1263fe07170a 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5573,6 +5573,8 @@ BPF_CALL_4(bpf_skb_fib_lookup, struct sk_buff *,
skb,
   struct net_device *dev;
     dev = dev_get_by_index_rcu(net, params->ifindex);
+    if (unlikely(!dev))
+    return -EINVAL;
   if (!is_skb_forwardable(dev, skb))
   rc = BPF_FIB_LKUP_RET_FRAG_NEEDED;


rcu lock is held right? It is impossible for dev to return NULL here.


Yes, we're under RCU read side. Was wondering whether we could unlink it
from the list but not yet free it, but also that seems not possible since
we'd first need to close it which already has a synchronize_net(). So not
an issue what Kaixu describes in the commit msg, agree.


Re: [PATCH] bpf: Check the return value of dev_get_by_index_rcu()

2020-11-20 Thread Daniel Borkmann

[ +David ]

On 11/19/20 8:04 AM, xiakaixu1...@gmail.com wrote:

From: Kaixu Xia 

The return value of dev_get_by_index_rcu() can be NULL, so here it
is need to check the return value and return error code if it is NULL.

Signed-off-by: Kaixu Xia 
---
  net/core/filter.c | 2 ++
  1 file changed, 2 insertions(+)

diff --git a/net/core/filter.c b/net/core/filter.c
index 2ca5eecebacf..1263fe07170a 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5573,6 +5573,8 @@ BPF_CALL_4(bpf_skb_fib_lookup, struct sk_buff *, skb,
struct net_device *dev;
  
  		dev = dev_get_by_index_rcu(net, params->ifindex);

+   if (unlikely(!dev))
+   return -EINVAL;
if (!is_skb_forwardable(dev, skb))
rc = BPF_FIB_LKUP_RET_FRAG_NEEDED;


The above logic is quite ugly anyway given we fetched the dev pointer already
earlier in bpf_ipv{4,6}_fib_lookup() and now need to redo it again ... so yeah
there could be a tiny race in here. We wanted to bring this logic closer to what
XDP does anyway, something like below, for example. David, thoughts? Thx

Subject: [PATCH] diff mtu check

Signed-off-by: Daniel Borkmann 
---
 net/core/filter.c | 22 +-
 1 file changed, 5 insertions(+), 17 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index 2ca5eecebacf..3bab0a97fa38 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5547,9 +5547,6 @@ static const struct bpf_func_proto 
bpf_xdp_fib_lookup_proto = {
 BPF_CALL_4(bpf_skb_fib_lookup, struct sk_buff *, skb,
   struct bpf_fib_lookup *, params, int, plen, u32, flags)
 {
-   struct net *net = dev_net(skb->dev);
-   int rc = -EAFNOSUPPORT;
-
if (plen < sizeof(*params))
return -EINVAL;

@@ -5559,25 +5556,16 @@ BPF_CALL_4(bpf_skb_fib_lookup, struct sk_buff *, skb,
switch (params->family) {
 #if IS_ENABLED(CONFIG_INET)
case AF_INET:
-   rc = bpf_ipv4_fib_lookup(net, params, flags, false);
-   break;
+   return bpf_ipv4_fib_lookup(dev_net(skb->dev), params, flags,
+  !skb_is_gso(skb));
 #endif
 #if IS_ENABLED(CONFIG_IPV6)
case AF_INET6:
-   rc = bpf_ipv6_fib_lookup(net, params, flags, false);
-   break;
+   return bpf_ipv6_fib_lookup(dev_net(skb->dev), params, flags,
+  !skb_is_gso(skb));
 #endif
}
-
-   if (!rc) {
-   struct net_device *dev;
-
-   dev = dev_get_by_index_rcu(net, params->ifindex);
-   if (!is_skb_forwardable(dev, skb))
-   rc = BPF_FIB_LKUP_RET_FRAG_NEEDED;
-   }
-
-   return rc;
+   return -EAFNOSUPPORT;
 }

 static const struct bpf_func_proto bpf_skb_fib_lookup_proto = {
--
2.21.0



Re: [PATCH bpf-next v6 06/34] bpf: prepare for memcg-based memory accounting for bpf maps

2020-11-18 Thread Daniel Borkmann

On 11/18/20 2:28 AM, Roman Gushchin wrote:

On Tue, Nov 17, 2020 at 05:11:00PM -0800, Alexei Starovoitov wrote:

On Tue, Nov 17, 2020 at 5:07 PM Roman Gushchin  wrote:

On Tue, Nov 17, 2020 at 04:46:34PM -0800, Roman Gushchin wrote:

On Wed, Nov 18, 2020 at 01:06:17AM +0100, Daniel Borkmann wrote:

On 11/17/20 4:40 AM, Roman Gushchin wrote:

In the absolute majority of cases if a process is making a kernel
allocation, it's memory cgroup is getting charged.

Bpf maps can be updated from an interrupt context and in such
case there is no process which can be charged. It makes the memory
accounting of bpf maps non-trivial.

Fortunately, after commit 4127c6504f25 ("mm: kmem: enable kernel
memcg accounting from interrupt contexts") and b87d8cefe43c
("mm, memcg: rework remote charging API to support nesting")
it's finally possible.

To do it, a pointer to the memory cgroup of the process which created
the map is saved, and this cgroup is getting charged for all
allocations made from an interrupt context.

Allocations made from a process context will be accounted in a usual way.

Signed-off-by: Roman Gushchin 
Acked-by: Song Liu 

[...]

+#ifdef CONFIG_MEMCG_KMEM
+static __always_inline int __bpf_map_update_elem(struct bpf_map *map, void 
*key,
+  void *value, u64 flags)
+{
+ struct mem_cgroup *old_memcg;
+ bool in_interrupt;
+ int ret;
+
+ /*
+  * If update from an interrupt context results in a memory allocation,
+  * the memory cgroup to charge can't be determined from the context
+  * of the current task. Instead, we charge the memory cgroup, which
+  * contained a process created the map.
+  */
+ in_interrupt = in_interrupt();
+ if (in_interrupt)
+ old_memcg = set_active_memcg(map->memcg);
+
+ ret = map->ops->map_update_elem(map, key, value, flags);
+
+ if (in_interrupt)
+ set_active_memcg(old_memcg);
+
+ return ret;


Hmm, this approach here won't work, see also commit 09772d92cd5a ("bpf: avoid
retpoline for lookup/update/delete calls on maps") which removes the indirect
call, so the __bpf_map_update_elem() and therefore the set_active_memcg() is
not invoked for the vast majority of cases.


I see. Well, the first option is to move these calls into map-specific update
functions, but the list is relatively long:
   nsim_map_update_elem()
   cgroup_storage_update_elem()
   htab_map_update_elem()
   htab_percpu_map_update_elem()
   dev_map_update_elem()
   dev_map_hash_update_elem()
   trie_update_elem()
   cpu_map_update_elem()
   bpf_pid_task_storage_update_elem()
   bpf_fd_inode_storage_update_elem()
   bpf_fd_sk_storage_update_elem()
   sock_map_update_elem()
   xsk_map_update_elem()

Alternatively, we can set the active memcg for the whole duration of bpf
execution. It's simpler, but will add some overhead. Maybe we can somehow
mark programs calling into update helpers and skip all others?


Actually, this is problematic if a program updates several maps, because
in theory they can belong to different cgroups.
So it seems that the first option is the way to go. Do you agree?


May be instead of kmalloc_node() that is used by most of the map updates
introduce bpf_map_kmalloc_node() that takes a map pointer as an argument?
And do set_memcg inside?


I suspect it's not only kmalloc_node(), but if there will be 2-3 allocation
helpers, it sounds like a good idea to me! I'll try and get back with v7 soon.


Could this be baked into kmalloc*() API itself given we also need to pass in
__GFP_ACCOUNT everywhere, so we'd have a new API with additional argument where
we pass the memcg pointer to tell it directly where to account it for instead of
having to have the extra set_active_memcg() set/restore dance via BPF wrapper?
It seems there would be not much specifics on BPF itself and if it's more 
generic
it could also be used by other subsystems.
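
E.g. a rough sketch of what I have in mind (hypothetical helper name, internals
could still be the set_active_memcg() set/restore underneath, just hidden behind
a generic, memcg-aware allocation API):

static inline void *kmalloc_node_memcg(size_t size, gfp_t flags, int node,
				       struct mem_cgroup *memcg)
{
	struct mem_cgroup *old_memcg;
	void *ptr;

	old_memcg = set_active_memcg(memcg);
	ptr = kmalloc_node(size, flags | __GFP_ACCOUNT, node);
	set_active_memcg(old_memcg);

	return ptr;
}

BPF would then just pass map->memcg and wouldn't need its own wrappers.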

Thanks,
Daniel


Re: [PATCH bpf-next v6 06/34] bpf: prepare for memcg-based memory accounting for bpf maps

2020-11-17 Thread Daniel Borkmann

On 11/17/20 4:40 AM, Roman Gushchin wrote:

In the absolute majority of cases if a process is making a kernel
allocation, it's memory cgroup is getting charged.

Bpf maps can be updated from an interrupt context and in such
case there is no process which can be charged. It makes the memory
accounting of bpf maps non-trivial.

Fortunately, after commit 4127c6504f25 ("mm: kmem: enable kernel
memcg accounting from interrupt contexts") and b87d8cefe43c
("mm, memcg: rework remote charging API to support nesting")
it's finally possible.

To do it, a pointer to the memory cgroup of the process which created
the map is saved, and this cgroup is getting charged for all
allocations made from an interrupt context.

Allocations made from a process context will be accounted in a usual way.

Signed-off-by: Roman Gushchin 
Acked-by: Song Liu 

[...]
  
+#ifdef CONFIG_MEMCG_KMEM

+static __always_inline int __bpf_map_update_elem(struct bpf_map *map, void 
*key,
+void *value, u64 flags)
+{
+   struct mem_cgroup *old_memcg;
+   bool in_interrupt;
+   int ret;
+
+   /*
+* If update from an interrupt context results in a memory allocation,
+* the memory cgroup to charge can't be determined from the context
+* of the current task. Instead, we charge the memory cgroup, which
+* contained a process created the map.
+*/
+   in_interrupt = in_interrupt();
+   if (in_interrupt)
+   old_memcg = set_active_memcg(map->memcg);
+
+   ret = map->ops->map_update_elem(map, key, value, flags);
+
+   if (in_interrupt)
+   set_active_memcg(old_memcg);
+
+   return ret;


Hmm, this approach here won't work, see also commit 09772d92cd5a ("bpf: avoid
retpoline for lookup/update/delete calls on maps") which removes the indirect
call, so the __bpf_map_update_elem() and therefore the set_active_memcg() is
not invoked for the vast majority of cases.


+}
+#else
+static __always_inline int __bpf_map_update_elem(struct bpf_map *map, void 
*key,
+void *value, u64 flags)
+{
+   return map->ops->map_update_elem(map, key, value, flags);
+}
+#endif
+
  BPF_CALL_4(bpf_map_update_elem, struct bpf_map *, map, void *, key,
   void *, value, u64, flags)
  {
WARN_ON_ONCE(!rcu_read_lock_held());
-   return map->ops->map_update_elem(map, key, value, flags);
+
+   return __bpf_map_update_elem(map, key, value, flags);
  }
  
  const struct bpf_func_proto bpf_map_update_elem_proto = {

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index f3fe9f53f93c..2d77fc2496da 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -31,6 +31,7 @@
  #include 
  #include 
  #include 
+#include 
  
  #define IS_FD_ARRAY(map) ((map)->map_type == BPF_MAP_TYPE_PERF_EVENT_ARRAY || \

  (map)->map_type == BPF_MAP_TYPE_CGROUP_ARRAY || \
@@ -456,6 +457,27 @@ void bpf_map_free_id(struct bpf_map *map, bool do_idr_lock)
__release(_idr_lock);
  }
  
+#ifdef CONFIG_MEMCG_KMEM

+static void bpf_map_save_memcg(struct bpf_map *map)
+{
+   map->memcg = get_mem_cgroup_from_mm(current->mm);
+}
+
+static void bpf_map_release_memcg(struct bpf_map *map)
+{
+   mem_cgroup_put(map->memcg);
+}
+
+#else
+static void bpf_map_save_memcg(struct bpf_map *map)
+{
+}
+
+static void bpf_map_release_memcg(struct bpf_map *map)
+{
+}
+#endif
+
  /* called from workqueue */
  static void bpf_map_free_deferred(struct work_struct *work)
  {
@@ -464,6 +486,7 @@ static void bpf_map_free_deferred(struct work_struct *work)
  
	bpf_map_charge_move(&mem, &map->memory);

security_bpf_map_free(map);
+   bpf_map_release_memcg(map);
/* implementation dependent freeing */
map->ops->map_free(map);
bpf_map_charge_finish();
@@ -875,6 +898,8 @@ static int map_create(union bpf_attr *attr)
if (err)
goto free_map_sec;
  
+	bpf_map_save_memcg(map);

+
err = bpf_map_new_fd(map, f_flags);
if (err < 0) {
/* failed to allocate fd.





Re: [PATCH bpf-next v3 2/2] bpf: Add tests for bpf_lsm_set_bprm_opts

2020-11-17 Thread Daniel Borkmann

On 11/17/20 3:13 AM, KP Singh wrote:
[...]

+
+static int run_set_secureexec(int map_fd, int secureexec)
+{
+


^ same here


+   int child_pid, child_status, ret, null_fd;
+
+   child_pid = fork();
+   if (child_pid == 0) {
+   null_fd = open("/dev/null", O_WRONLY);
+   if (null_fd == -1)
+   exit(errno);
+   dup2(null_fd, STDOUT_FILENO);
+   dup2(null_fd, STDERR_FILENO);
+   close(null_fd);
+
+   /* Ensure that all executions from hereon are
+* secure by setting a local storage which is read by
+* the bprm_creds_for_exec hook and sets bprm->secureexec.
+*/
+   ret = update_storage(map_fd, secureexec);
+   if (ret)
+   exit(ret);
+
+   /* If the binary is executed with securexec=1, the dynamic
+* loader ingores and unsets certain variables like LD_PRELOAD,
+* TMPDIR etc. TMPDIR is used here to simplify the example, as
+* LD_PRELOAD requires a real .so file.
+*
+* If the value of TMPDIR is set, the bash command returns 10
+* and if the value is unset, it returns 20.
+*/
+   execle("/bin/bash", "bash", "-c",
+  "[[ -z \"${TMPDIR}\" ]] || exit 10 && exit 20", NULL,
+  bash_envp);
+   exit(errno);
+   } else if (child_pid > 0) {
+		waitpid(child_pid, &child_status, 0);
+   ret = WEXITSTATUS(child_status);
+
+   /* If a secureexec occured, the exit status should be 20.
+*/
+   if (secureexec && ret == 20)
+   return 0;
+
+   /* If normal execution happened the exit code should be 10.
+*/
+   if (!secureexec && ret == 10)
+   return 0;
+


and here (rest looks good to me)


+   }
+
+   return -EINVAL;
+}
+


Re: [PATCH bpf-next v3 1/2] bpf: Add bpf_lsm_set_bprm_opts helper

2020-11-17 Thread Daniel Borkmann

On 11/17/20 3:13 AM, KP Singh wrote:

From: KP Singh 

The helper allows modification of certain bits on the linux_binprm
struct starting with the secureexec bit which can be updated using the
BPF_LSM_F_BPRM_SECUREEXEC flag.

secureexec can be set by the LSM for privilege gaining executions to set
the AT_SECURE auxv for glibc.  When set, the dynamic linker disables the
use of certain environment variables (like LD_PRELOAD).

Signed-off-by: KP Singh 
---
  include/uapi/linux/bpf.h   | 18 ++
  kernel/bpf/bpf_lsm.c   | 27 +++
  scripts/bpf_helpers_doc.py |  2 ++
  tools/include/uapi/linux/bpf.h | 18 ++
  4 files changed, 65 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 162999b12790..bfa79054d106 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -3787,6 +3787,18 @@ union bpf_attr {
   **ARG_PTR_TO_BTF_ID* of type *task_struct*.
   *Return
   *Pointer to the current task.
+ *
+ * long bpf_lsm_set_bprm_opts(struct linux_binprm *bprm, u64 flags)
+ *


small nit: should have no extra newline (same for the tools/ copy)


+ * Description
+ * Set or clear certain options on *bprm*:
+ *
+ * **BPF_LSM_F_BPRM_SECUREEXEC** Set the secureexec bit
+ * which sets the **AT_SECURE** auxv for glibc. The bit
+ * is cleared if the flag is not specified.
+ * Return
+ * **-EINVAL** if invalid *flags* are passed.
+ *
   */
  #define __BPF_FUNC_MAPPER(FN) \
FN(unspec), \
@@ -3948,6 +3960,7 @@ union bpf_attr {
FN(task_storage_get),   \
FN(task_storage_delete),\
FN(get_current_task_btf),   \
+   FN(lsm_set_bprm_opts),  \
/* */
  
  /* integer value in 'imm' field of BPF_CALL instruction selects which helper

@@ -4119,6 +4132,11 @@ enum bpf_lwt_encap_mode {
BPF_LWT_ENCAP_IP,
  };
  
+/* Flags for LSM helpers */

+enum {
+   BPF_LSM_F_BPRM_SECUREEXEC   = (1ULL << 0),
+};
+
  #define __bpf_md_ptr(type, name)  \
  union {   \
type name;  \
diff --git a/kernel/bpf/bpf_lsm.c b/kernel/bpf/bpf_lsm.c
index 553107f4706a..cd85482228a0 100644
--- a/kernel/bpf/bpf_lsm.c
+++ b/kernel/bpf/bpf_lsm.c
@@ -7,6 +7,7 @@
  #include 
  #include 
  #include 
+#include 
  #include 
  #include 
  #include 
@@ -51,6 +52,30 @@ int bpf_lsm_verify_prog(struct bpf_verifier_log *vlog,
return 0;
  }
  
+/* Mask for all the currently supported BPRM option flags */

+#define BPF_LSM_F_BRPM_OPTS_MASK   BPF_LSM_F_BPRM_SECUREEXEC
+
+BPF_CALL_2(bpf_lsm_set_bprm_opts, struct linux_binprm *, bprm, u64, flags)
+{
+


ditto

Would have fixed up these things on the fly while applying, but one small item
I wanted to bring up here given uapi which will then freeze: it would be cleaner
to call the helper just bpf_bprm_opts_set() or so given it's implied that we
attach to lsm here and we don't use _lsm in the naming for the others either.
Similarly, I'd drop the _LSM from the flag/mask.


+   if (flags & ~BPF_LSM_F_BRPM_OPTS_MASK)
+   return -EINVAL;
+
+   bprm->secureexec = (flags & BPF_LSM_F_BPRM_SECUREEXEC);
+   return 0;
+}
+
+BTF_ID_LIST_SINGLE(bpf_lsm_set_bprm_opts_btf_ids, struct, linux_binprm)
+
+const static struct bpf_func_proto bpf_lsm_set_bprm_opts_proto = {
+   .func   = bpf_lsm_set_bprm_opts,
+   .gpl_only   = false,
+   .ret_type   = RET_INTEGER,
+   .arg1_type  = ARG_PTR_TO_BTF_ID,
+	.arg1_btf_id	= &bpf_lsm_set_bprm_opts_btf_ids[0],
+   .arg2_type  = ARG_ANYTHING,
+};
+
  static const struct bpf_func_proto *
  bpf_lsm_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
  {
@@ -71,6 +96,8 @@ bpf_lsm_func_proto(enum bpf_func_id func_id, const struct 
bpf_prog *prog)
		return &bpf_task_storage_get_proto;
	case BPF_FUNC_task_storage_delete:
		return &bpf_task_storage_delete_proto;
+	case BPF_FUNC_lsm_set_bprm_opts:
+		return &bpf_lsm_set_bprm_opts_proto;
default:
return tracing_prog_func_proto(func_id, prog);
}


Re: [PATCH bpf-next 1/2] bpf: Add bpf_lsm_set_bprm_opts helper

2020-11-16 Thread Daniel Borkmann

On 11/16/20 3:01 PM, KP Singh wrote:

From: KP Singh 

The helper allows modification of certain bits on the linux_binprm
struct starting with the secureexec bit which can be updated using the
BPF_LSM_F_BPRM_SECUREEXEC flag.

secureexec can be set by the LSM for privilege gaining executions to set
the AT_SECURE auxv for glibc.  When set, the dynamic linker disables the
use of certain environment variables (like LD_PRELOAD).

Signed-off-by: KP Singh 

[...]

  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
@@ -4119,6 +4128,11 @@ enum bpf_lwt_encap_mode {
BPF_LWT_ENCAP_IP,
  };
  
+/* Flags for LSM helpers */

+enum {
+   BPF_LSM_F_BPRM_SECUREEXEC   = (1ULL << 0),
+};
+
  #define __bpf_md_ptr(type, name)  \
  union {   \
type name;  \
diff --git a/kernel/bpf/bpf_lsm.c b/kernel/bpf/bpf_lsm.c
index 553107f4706a..4d04fc490a14 100644
--- a/kernel/bpf/bpf_lsm.c
+++ b/kernel/bpf/bpf_lsm.c
@@ -7,6 +7,7 @@
  #include 
  #include 
  #include 
+#include 
  #include 
  #include 
  #include 
@@ -51,6 +52,23 @@ int bpf_lsm_verify_prog(struct bpf_verifier_log *vlog,
return 0;
  }
  
+BPF_CALL_2(bpf_lsm_set_bprm_opts, struct linux_binprm *, bprm, u64, flags)

+{


This should also reject invalid flags. I'd rather change this helper from 
RET_VOID
to RET_INTEGER and throw -EINVAL for everything other than 
BPF_LSM_F_BPRM_SECUREEXEC
passed in here including zero so it can be extended in future.
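
Meaning, something like this as a start (untested sketch; could also be made
stricter and reject zero as well):

	if (flags & ~BPF_LSM_F_BPRM_SECUREEXEC)
		return -EINVAL;

... plus switching .ret_type to RET_INTEGER in the proto below.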


+   bprm->secureexec = (flags & BPF_LSM_F_BPRM_SECUREEXEC);
+   return 0;
+}
+
+BTF_ID_LIST_SINGLE(bpf_lsm_set_bprm_opts_btf_ids, struct, linux_binprm)
+
+const static struct bpf_func_proto bpf_lsm_set_bprm_opts_proto = {
+   .func   = bpf_lsm_set_bprm_opts,
+   .gpl_only   = false,
+   .ret_type   = RET_VOID,
+   .arg1_type  = ARG_PTR_TO_BTF_ID,
+	.arg1_btf_id	= &bpf_lsm_set_bprm_opts_btf_ids[0],
+   .arg2_type  = ARG_ANYTHING,
+};
+


Re: [PATCH bpf-next 2/2] bpf: Expose bpf_d_path helper to sleepable LSM hooks

2020-11-13 Thread Daniel Borkmann

On 11/13/20 4:18 AM, Yonghong Song wrote:



On 11/12/20 9:19 AM, KP Singh wrote:

From: KP Singh 

Sleepable hooks are never called from an NMI/interrupt context, so it is
safe to use the bpf_d_path helper in LSM programs attaching to these
hooks.

The helper is not restricted to sleepable programs and merely uses the
list of sleeable hooks as the initial subset of LSM hooks where it can


sleeable => sleepable

probably not need to resend if no other major changes. The maintainer
can just fix it up before merging.


Did while rebasing & applying, thanks everyone!


be used.

Signed-off-by: KP Singh 


Acked-by: Yonghong Song 




Re: [PATCH v4 bpf] tools: bpftool: Add missing close before bpftool net attach exit

2020-11-13 Thread Daniel Borkmann

On 11/13/20 12:51 PM, Wang Hai wrote:

progfd is created by prog_parse_fd(), before 'bpftool net attach' exit,
it should be closed.

Fixes: 04949ccc273e ("tools: bpftool: add net attach command to attach XDP on 
interface")
Signed-off-by: Wang Hai 


Applied & improved commit msg a bit, thanks!


Re: [PATCH bpf-next v2 1/2] bpf: Augment the set of sleepable LSM hooks

2020-11-12 Thread Daniel Borkmann

On 11/12/20 9:03 PM, KP Singh wrote:

From: KP Singh 

Update the set of sleepable hooks with the ones that do not trigger
a warning with might_fault() when exercised with the correct kernel
config options enabled, i.e.

DEBUG_ATOMIC_SLEEP=y
LOCKDEP=y
PROVE_LOCKING=y

This means that a sleepable LSM eBPF program can be attached to these
LSM hooks. A new helper method bpf_lsm_is_sleepable_hook is added and
the set is maintained locally in bpf_lsm.c

A comment is added about the list of LSM hooks that have been observed
to be called from softirqs, atomic contexts, or the ones that can
trigger pagefaults and thus should not be added to this list.


[...]
  
+/* The set of hooks which are called without pagefaults disabled and are allowed

+ * to "sleep" and thus can be used for sleeable BPF programs.
+ *
+ * There are some hooks which have been observed to be called from a
+ * non-sleepable context and should not be added to this set:
+ *
+ *  bpf_lsm_bpf_prog_free_security
+ *  bpf_lsm_capable
+ *  bpf_lsm_cred_free
+ *  bpf_lsm_d_instantiate
+ *  bpf_lsm_file_alloc_security
+ *  bpf_lsm_file_mprotect
+ *  bpf_lsm_file_send_sigiotask
+ *  bpf_lsm_inet_conn_request
+ *  bpf_lsm_inet_csk_clone
+ *  bpf_lsm_inode_alloc_security
+ *  bpf_lsm_inode_follow_link
+ *  bpf_lsm_inode_permission
+ *  bpf_lsm_key_permission
+ *  bpf_lsm_locked_down
+ *  bpf_lsm_mmap_addr
+ *  bpf_lsm_perf_event_read
+ *  bpf_lsm_ptrace_access_check
+ *  bpf_lsm_req_classify_flow
+ *  bpf_lsm_sb_free_security
+ *  bpf_lsm_sk_alloc_security
+ *  bpf_lsm_sk_clone_security
+ *  bpf_lsm_sk_free_security
+ *  bpf_lsm_sk_getsecid
+ *  bpf_lsm_socket_sock_rcv_skb
+ *  bpf_lsm_sock_graft
+ *  bpf_lsm_task_free
+ *  bpf_lsm_task_getioprio
+ *  bpf_lsm_task_getscheduler
+ *  bpf_lsm_task_kill
+ *  bpf_lsm_task_setioprio
+ *  bpf_lsm_task_setnice
+ *  bpf_lsm_task_setpgid
+ *  bpf_lsm_task_setrlimit
+ *  bpf_lsm_unix_may_send
+ *  bpf_lsm_unix_stream_connect
+ *  bpf_lsm_vm_enough_memory
+ */


I think this is very useful info. I was wondering whether it would make sense
to annotate these more closely to the code so there's less chance this info
becomes stale? Maybe something like below, not sure ... the issue is that if you
would just place a cant_sleep() in there it might be wrong, since this should
only document that the hook can be invoked from a non-sleepable context, not
that it always is.

diff --git a/security/security.c b/security/security.c
index a28045dc9e7f..7899bf32cdaa 100644
--- a/security/security.c
+++ b/security/security.c
@@ -94,6 +94,11 @@ static __initdata bool debug;
pr_info(__VA_ARGS__);   \
} while (0)

+/*
+ * Placeholder for now to document that hook implementation cannot sleep
+ * since it could potentially be called from non-sleepable context, too.
+ */
+#define hook_cant_sleep()  do { } while (0)
+
 static bool __init is_enabled(struct lsm_info *lsm)
 {
if (!lsm->enabled)
@@ -2522,6 +2527,7 @@ void security_bpf_map_free(struct bpf_map *map)
 }
 void security_bpf_prog_free(struct bpf_prog_aux *aux)
 {
+   hook_cant_sleep();
call_void_hook(bpf_prog_free_security, aux);
 }
 #endif /* CONFIG_BPF_SYSCALL */


Re: [PATCH v3 bpf] tools: bpftool: Add missing close before bpftool net attach exit

2020-11-11 Thread Daniel Borkmann

On 11/11/20 2:54 PM, Wang Hai wrote:

progfd is created by prog_parse_fd(), before 'bpftool net attach' exit,
it should be closed.

Fixes: 04949ccc273e ("tools: bpftool: add net attach command to attach XDP on 
interface")
Signed-off-by: Wang Hai 
---
v2->v3: add 'err = 0' before successful return
v1->v2: use cleanup tag instead of repeated closes
  tools/bpf/bpftool/net.c | 18 +++---
  1 file changed, 11 insertions(+), 7 deletions(-)

diff --git a/tools/bpf/bpftool/net.c b/tools/bpf/bpftool/net.c
index 910e7bac6e9e..f927392271cc 100644
--- a/tools/bpf/bpftool/net.c
+++ b/tools/bpf/bpftool/net.c
@@ -578,8 +578,8 @@ static int do_attach(int argc, char **argv)
  
	ifindex = net_parse_dev(&argc, &argv);

if (ifindex < 1) {
-   close(progfd);
-   return -EINVAL;
+   err = -EINVAL;
+   goto cleanup;
}
  
  	if (argc) {

@@ -587,8 +587,8 @@ static int do_attach(int argc, char **argv)
overwrite = true;
} else {
p_err("expected 'overwrite', got: '%s'?", *argv);
-   close(progfd);
-   return -EINVAL;
+   err = -EINVAL;
+   goto cleanup;
}
}
  
@@ -597,16 +597,20 @@ static int do_attach(int argc, char **argv)

err = do_attach_detach_xdp(progfd, attach_type, ifindex,
   overwrite);
  
-	if (err < 0) {

+   if (err) {
p_err("interface %s attach failed: %s",
  attach_type_strings[attach_type], strerror(-err));
-   return err;
+   goto cleanup;
}
  
  	if (json_output)

jsonw_null(json_wtr);
  
-	return 0;

+   err = 0;


Why is the 'err = 0' still needed here given we test for err != 0 earlier?
Would just remove it, otherwise looks good.


+cleanup:
+   close(progfd);
+   return err;
  }
  
  static int do_detach(int argc, char **argv)






Re: [PATCH v3] bpf: Fix unsigned 'datasec_id' compared with zero in check_pseudo_btf_id

2020-11-11 Thread Daniel Borkmann

On 11/11/20 6:03 AM, xiakaixu1...@gmail.com wrote:

From: Kaixu Xia 

The unsigned variable datasec_id is assigned a return value from the call
to check_pseudo_btf_id(), which may return negative error code.

Fixes coccicheck warning:

./kernel/bpf/verifier.c:9616:5-15: WARNING: Unsigned expression compared with 
zero: datasec_id > 0

Reported-by: Tosk Robot 
Signed-off-by: Kaixu Xia 


Looks good, applied & also added Fixes tags, thanks!


Re: [selftest/bpf] b83590ee1a: BUG:KASAN:slab-out-of-bounds_in_l

2020-11-09 Thread Daniel Borkmann

Hi Daniel,

On 11/9/20 3:54 PM, kernel test robot wrote:

Greeting,

FYI, we noticed the following commit (built with gcc-9):

commit: b83590ee1add052518603bae607b0524632b7793 ("[PATCH bpf v3 2/2] selftest/bpf: 
Test bpf_probe_read_user_str() strips trailing bytes after NUL")
url: 
https://github.com/0day-ci/linux/commits/Daniel-Xu/Fix-bpf_probe_read_user_str-overcopying/20201106-033210
base: https://git.kernel.org/cgit/linux/kernel/git/bpf/bpf.git master


I've tossed them from the tree for now as it looks like these are adding 
regressions
for regular strncpy_from_user() calls, please take a look.

Thanks!


in testcase: trinity
version: trinity-x86_64-af355e9-1_2019-12-03
with following parameters:

runtime: 300s

test-description: Trinity is a linux system call fuzz tester.
test-url: http://codemonkey.org.uk/projects/trinity/


on test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 8G

caused below changes (please refer to attached dmesg/kmsg for entire 
log/backtrace):


+-----------------------------------+------------+------------+
|                                   | e65411d04b | b83590ee1a |
+-----------------------------------+------------+------------+
| BUG:KASAN:slab-out-of-bounds_in_l | 0          | 4          |
+-----------------------------------+------------+------------+


If you fix the issue, kindly add following tag
Reported-by: kernel test robot 


[   54.933739] BUG: KASAN: slab-out-of-bounds in link_path_walk+0x8f5/0xa80
[   54.935295] Read of size 1 at addr 88815f726951 by task modprobe/114
[   54.936720]
[   54.937199] CPU: 1 PID: 114 Comm: modprobe Not tainted 
5.9.0-13439-gb83590ee1add #1
[   54.938907] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
1.12.0-1 04/01/2014
[   54.940683] Call Trace:
[   54.941008]  dump_stack+0x84/0xad
[   54.941008]  print_address_description+0x2f/0x220
[   54.941008]  ? pm_suspend.cold+0x70e/0x70e
[   54.941008]  ? _raw_write_lock_irqsave+0x150/0x150
[   54.941008]  ? link_path_walk+0x8f5/0xa80
[   54.941008]  kasan_report.cold+0x37/0x7c
[   54.941008]  ? link_path_walk+0x8f5/0xa80
[   54.941008]  __asan_report_load1_noabort+0x14/0x20
[   54.941008]  link_path_walk+0x8f5/0xa80
[   54.941008]  ? walk_component+0x670/0x670
[   54.941008]  ? deactivate_slab+0x3d9/0x690
[   54.941008]  link_path_walk+0x91/0xb0
[   54.941008]  path_lookupat+0x12f/0x430
[   54.941008]  filename_lookup+0x19a/0x2d0
[   54.941008]  ? may_linkat+0x180/0x180
[   54.941008]  ? __check_object_size+0x2bf/0x390
[   54.941008]  ? strncpy_from_user+0x24b/0x490
[   54.941008]  ? getname_flags+0x13a/0x4a0
[   54.941008]  user_path_at_empty+0x3f/0x50
[   54.941008]  do_faccessat+0xc1/0x5d0
[   54.941008]  ? stream_open+0x60/0x60
[   54.941008]  ? exit_to_user_mode_prepare+0xb9/0x190
[   54.941008]  __x64_sys_access+0x56/0x80
[   54.941008]  do_syscall_64+0x5d/0x70
[   54.941008]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   54.941008] RIP: 0033:0x7f3cc3f345f7
[   54.941008] Code: c8 ff c3 b8 01 00 00 00 0f 05 48 3d 01 f0 ff ff 73 01 c3 48 8d 
0d 19 9b 20 00 f7 d8 89 01 48 83 c8 ff c3 b8 15 00 00 00 0f 05 <48> 3d 01 f0 ff 
ff 73 01 c3 48 8d 0d f9 9a 20 00 f7 d8 89 01 48 83
[   54.941008] RSP: 002b:7ffde47f0b68 EFLAGS: 0246 ORIG_RAX: 
0015
[   54.941008] RAX: ffda RBX:  RCX: 7f3cc3f345f7
[   54.941008] RDX: 0006 RSI: 0004 RDI: 7f3cc3f39bd0
[   54.941008] RBP: 7ffde47f1c70 R08:  R09: 
[   54.941008] R10: 0022 R11: 0246 R12: 
[   54.941008] R13: 7ffde47f9330 R14: 000f R15: 7f3cc413e150
[   54.941008]
[   54.941008] Allocated by task 114:
[   54.941008]  kasan_save_stack+0x23/0x50
[   54.941008]  __kasan_kmalloc+0xe1/0xf0
[   54.941008]  kasan_slab_alloc+0xe/0x10
[   54.941008]  kmem_cache_alloc+0x166/0x360
[   54.941008]  getname_flags+0x4e/0x4a0
[   54.941008]  user_path_at_empty+0x2b/0x50
[   54.941008]  do_faccessat+0xc1/0x5d0
[   54.941008]  __x64_sys_access+0x56/0x80
[   54.941008]  do_syscall_64+0x5d/0x70
[   54.941008]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   54.941008]
[   54.941008] The buggy address belongs to the object at 88815f725900
[   54.941008]  which belongs to the cache names_cache of size 4096
[   54.941008] The buggy address is located 81 bytes to the right of
[   54.941008]  4096-byte region [88815f725900, 88815f726900)
[   54.941008] The buggy address belongs to the page:
[   54.941008] page:(ptrval) refcount:1 mapcount:0 
mapping: index:0x0 pfn:0x15f720
[   54.941008] head:(ptrval) order:3 compound_mapcount:0 
compound_pincount:0
[   54.941008] flags: 0x80010200(slab|head)
[   

Re: [PATCH bpf-next] lib/strncpy_from_user.c: Don't overcopy bytes after NUL terminator

2020-11-04 Thread Daniel Borkmann

On 11/4/20 9:18 PM, Daniel Xu wrote:

On Wed Nov 4, 2020 at 8:24 AM PST, Daniel Borkmann wrote:

On 11/4/20 3:29 AM, Daniel Xu wrote:

do_strncpy_from_user() may copy some extra bytes after the NUL
terminator into the destination buffer. This usually does not matter for
normal string operations. However, when BPF programs key BPF maps with
strings, this matters a lot.

A BPF program may read strings from user memory by calling the
bpf_probe_read_user_str() helper which eventually calls
do_strncpy_from_user(). The program can then key a map with the
resulting string. BPF map keys are fixed-width and string-agnostic,
meaning that map keys are treated as a set of bytes.

The issue is when do_strncpy_from_user() overcopies bytes after the NUL
terminator, it can result in seemingly identical strings occupying
multiple slots in a BPF map. This behavior is subtle and totally
unexpected by the user.

This commit uses the proper word-at-a-time APIs to avoid overcopying.

Signed-off-by: Daniel Xu 


It looks like this is a regression from the recent refactoring of the
mem probing
util functions?


I think it was like this from the beginning, at 6ae08ae3dea2 ("bpf: Add
probe_read_{user, kernel} and probe_read_{user, kernel}_str helpers").
The old bpf_probe_read_str() used the kernel's byte-by-byte copying
routine. bpf_probe_read_user_str() started using strncpy_from_user()
which has been doing the long-sized strides since ~2012 or earlier.

I tried to build and test the kernel at that commit but it seems my
compiler is too new to build that old code. Bunch of build failures.

I assume the refactor you're referring to is 8d92db5c04d1 ("bpf: rework
the compat kernel probe handling").


Ah I see, it was just reusing 3d7081822f7f ("uaccess: Add non-pagefault user-space
read functions"). Potentially it might be a safer choice to just rework the
strncpy_from_user_nofault() to mimic strncpy_from_kernel_nofault() in that
regard?


Could we add a Fixes tag and then we'd also need to target the fix
against bpf tree instead of bpf-next, no?


Sure, will do in v2.


Moreover, a BPF kselftest would help to make sure it doesn't regress in
future again.


Ditto.

[..]

Thanks,
Daniel





Re: [PATCH bpf-next] lib/strncpy_from_user.c: Don't overcopy bytes after NUL terminator

2020-11-04 Thread Daniel Borkmann

On 11/4/20 3:29 AM, Daniel Xu wrote:

do_strncpy_from_user() may copy some extra bytes after the NUL
terminator into the destination buffer. This usually does not matter for
normal string operations. However, when BPF programs key BPF maps with
strings, this matters a lot.

A BPF program may read strings from user memory by calling the
bpf_probe_read_user_str() helper which eventually calls
do_strncpy_from_user(). The program can then key a map with the
resulting string. BPF map keys are fixed-width and string-agnostic,
meaning that map keys are treated as a set of bytes.

The issue is when do_strncpy_from_user() overcopies bytes after the NUL
terminator, it can result in seemingly identical strings occupying
multiple slots in a BPF map. This behavior is subtle and totally
unexpected by the user.

This commit uses the proper word-at-a-time APIs to avoid overcopying.

Signed-off-by: Daniel Xu 


It looks like this is a regression from the recent refactoring of the mem 
probing
util functions? Could we add a Fixes tag and then we'd also need to target the 
fix
against bpf tree instead of bpf-next, no?

Moreover, a BPF kselftest would help to make sure it doesn't regress in future 
again.

Thanks,
Daniel


Re: [PATCH] bpf: don't rely on GCC __attribute__((optimize)) to disable GCSE

2020-10-27 Thread Daniel Borkmann

On 10/27/20 9:57 PM, Ard Biesheuvel wrote:

Commit 3193c0836f203 ("bpf: Disable GCC -fgcse optimization for
___bpf_prog_run()") introduced a __no_fgcse macro that expands to a
function scope __attribute__((optimize("-fno-gcse"))), to disable a
GCC specific optimization that was causing trouble on x86 builds, and
was not expected to have any positive effect in the first place.

However, as the GCC manual documents, __attribute__((optimize))
is not for production use, and results in all other optimization
options to be forgotten for the function in question. This can
cause all kinds of trouble, but in one particular reported case,


Looks like there are a couple more as well aside from __no_fgcse, are you
also planning to fix them?

  arch/powerpc/kernel/setup.h:14:#define __nostackprotector 
__attribute__((__optimize__("no-stack-protector")))
  tools/include/linux/compiler-gcc.h:37:#define __no_tail_call  
__attribute__((optimize("no-optimize-sibling-calls")))


it causes -fno-asynchronous-unwind-tables to be disregarded,
resulting in .eh_frame info to be emitted for the function
inadvertently.


Would have been useful to add a pointer to the original discussion with a
Link tag.

Link: 
https://lore.kernel.org/lkml/CAMuHMdUg0WJHEcq6to0-eODpXPOywLot6UD2=gfhpzoj_hc...@mail.gmail.com/


This reverts commit 3193c0836f203, and instead, it disables the -fgcse
optimization for the entire source file, but only when building for
X86.

Cc: Nick Desaulniers 
Cc: Arvind Sankar 
Cc: Randy Dunlap 
Cc: Josh Poimboeuf 
Cc: Thomas Gleixner 
Cc: Alexei Starovoitov 
Cc: Daniel Borkmann 
Cc: Peter Zijlstra (Intel) 
Cc: Geert Uytterhoeven 
Cc: Kees Cook 
Fixes: 3193c0836f203 ("bpf: Disable GCC -fgcse optimization for 
___bpf_prog_run()")
Signed-off-by: Ard Biesheuvel 

[...]

diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index bdc8cd1b6767..02b58f44c479 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -1,6 +1,8 @@
  # SPDX-License-Identifier: GPL-2.0
  obj-y := core.o
-CFLAGS_core.o += $(call cc-disable-warning, override-init)
+# ___bpf_prog_run() needs GCSE disabled on x86; see 3193c0836f203 for details
+cflags-core-$(CONFIG_X86) := -fno-gcse
+CFLAGS_core.o += $(call cc-disable-warning, override-init) $(cflags-core-y)


Also, this needs to be guarded behind !CONFIG_RETPOLINE and !CONFIG_BPF_JIT_ALWAYS_ON,
in particular the latter, since only in that case is the interpreter compiled in ...
most distros have CONFIG_BPF_JIT_ALWAYS_ON set these days for x86.

Do you have an analysis for the commit log on what else this penalizes in 
core.c if
it's now for the entire translation unit?
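
For the guard, an untested sketch along these lines (reusing the cflags-core-y
variable from the patch) is what I have in mind:

# ___bpf_prog_run() needs GCSE disabled on x86; see 3193c0836f203 for details.
# Only relevant when the interpreter is actually built and used, i.e. neither
# CONFIG_BPF_JIT_ALWAYS_ON nor CONFIG_RETPOLINE are set.
ifeq ($(CONFIG_RETPOLINE)$(CONFIG_BPF_JIT_ALWAYS_ON),)
cflags-core-$(CONFIG_X86) := -fno-gcse
endif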


  obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o 
bpf_iter.o map_iter.o task_iter.o prog_iter.o
  obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o 
bpf_lru_list.o lpm_trie.o map_in_map.o
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 9268d77898b7..55454d2278b1 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -1369,7 +1369,7 @@ u64 __weak bpf_probe_read_kernel(void *dst, u32 size, 
const void *unsafe_ptr)
   *
   * Decode and execute eBPF instructions.
   */
-static u64 __no_fgcse ___bpf_prog_run(u64 *regs, const struct bpf_insn *insn, 
u64 *stack)
+static u64 ___bpf_prog_run(u64 *regs, const struct bpf_insn *insn, u64 *stack)
  {
  #define BPF_INSN_2_LBL(x, y)    [BPF_##x | BPF_##y] = &&x##_##y
  #define BPF_INSN_3_LBL(x, y, z) [BPF_##x | BPF_##y | BPF_##z] = &&x##_##y##_##z





Re: linux-next: build failure after merge of the net-next tree

2020-10-06 Thread Daniel Borkmann

On 10/6/20 7:41 AM, Stephen Rothwell wrote:

On Tue, 6 Oct 2020 07:13:01 +0200 Christoph Hellwig  wrote:


On Tue, Oct 06, 2020 at 02:58:47PM +1100, Stephen Rothwell wrote:

Hi all,

After merging the net-next tree, today's linux-next build (x86_64
allmodconfig) failed like this:


It actually doesn't need that or the two other internal headers.
Bjoern has a fixed, and it was supposed to be queued up according to
patchwork.


Yeah, it is in the bpf-next tree but not merged into the net-next tree
yet.


Yep, applied yesterday. Given a3cf4abf ("dma-mapping: merge <linux/dma-noncoherent.h>
into <linux/dma-map-ops.h>") is in the dma-mapping tree and not yet affecting bpf-next
nor net-next, we were planning to ship bpf-next at the usual cadence this week, so it'll
be in net-next end of week for sure. (If there is urgent reason to have it in net-next
today, please let us know of course.)

Thanks,
Daniel


Re: mb2q experience and couple issues

2020-10-01 Thread Daniel Borkmann

On 10/1/20 11:13 AM, Thomas Gleixner wrote:

On Wed, Sep 30 2020 at 11:12, Alexei Starovoitov wrote:

For the last couple years we've been using mb2q tool to normalize patches
and it worked wonderfully.


Fun. I thought I'm the only user of it :)


We're using it pretty much daily since you've put it on korg :) It's in
a bunch of scripts^hacks we use for bpf trees:

  https://git.kernel.org/pub/scm/linux/kernel/git/dborkman/pw.git/


Recently we've hit few bugs:
curl -s https://patchwork.kernel.org/patch/11807443/mbox/ >
/tmp/mbox.i; ~/bin/mb2q --mboxout mbox.o /tmp/mbox.i
Drop Message w/o Message-ID: No subject
No patches found in mbox

I've tried to debug it, but couldn't figure out what's going on.
The subject and message-id fields are parsed correctly,
but later something happens.
Could you please take a look?


The problem is the mbox storage format. The mbox created by curl has a
mail body which has a line starting with 'From' in the mail body:

   From the VAR btf_id, the verifier can also read the address of the
   ksym's corresponding kernel var from kallsyms and use that to fill
   dst_reg.

The mailbox parser trips over that From and takes it as start of the
next message.

  http://qmail.org/qmail-manual-html/man5/mbox.html

Usually mailbox storage escapes a From at the start of a
newline with '>':

   >From the VAR btf_id, the verifier can also read the address of the
   ksym's corresponding kernel var from kallsyms and use that to fill
   dst_reg.

Yes, it's ugly and I haven't figured out a proper way to deal with
that. There are quite some mbox formats out there and they all are
incompatible with each other and all of them have different horrors.

Let me think about it.


It seems these issues only appeared since maybe a month or so. Perhaps also
something changed on ozlabs/patchwork side.

Cheers,
Daniel


Re: [PATCH] powerpc: net: bpf_jit_comp: Fix misuse of fallthrough

2020-09-29 Thread Daniel Borkmann

On 9/28/20 11:00 AM, zhe...@windriver.com wrote:

From: He Zhe 

The user defined label following "fallthrough" is not considered by GCC
and causes build failure.

kernel-source/include/linux/compiler_attributes.h:208:41: error: attribute
'fallthrough' not preceding a case label or default label [-Werror]
  208 | # define fallthrough                    __attribute__((__fallthrough__))
   ^

Signed-off-by: He Zhe 


Applied, thanks! I've also added a Fixes tag with df561f6688fe ("treewide: Use
fallthrough pseudo-keyword") which added the bug.


Re: [PATCH v7 bpf-next 4/8] selftests/bpf: add bpf_snprintf_btf helper tests

2020-09-29 Thread Daniel Borkmann

On 9/28/20 1:31 PM, Alan Maguire wrote:

Tests verifying snprintf()ing of various data structures,
flags combinations using a tp_btf program. Tests are skipped
if __builtin_btf_type_id is not available to retrieve BTF
type ids.

Signed-off-by: Alan Maguire 

[...]

+void test_snprintf_btf(void)
+{
+   struct netif_receive_skb *skel;
+   struct netif_receive_skb__bss *bss;
+   int err, duration = 0;
+
+   skel = netif_receive_skb__open();
+   if (CHECK(!skel, "skel_open", "failed to open skeleton\n"))
+   return;
+
+   err = netif_receive_skb__load(skel);
+   if (CHECK(err, "skel_load", "failed to load skeleton: %d\n", err))
+   goto cleanup;
+
+   bss = skel->bss;
+
+   err = netif_receive_skb__attach(skel);
+   if (CHECK(err, "skel_attach", "skeleton attach failed: %d\n", err))
+   goto cleanup;
+
+   /* generate receive event */
+   system("ping -c 1 127.0.0.1 > /dev/null");


This generates the following new warning when compiling BPF selftests:

  [...]
  EXT-OBJ  [test_progs] cgroup_helpers.o
  EXT-OBJ  [test_progs] trace_helpers.o
  EXT-OBJ  [test_progs] network_helpers.o
  EXT-OBJ  [test_progs] testing_helpers.o
  TEST-OBJ [test_progs] snprintf_btf.test.o
/root/bpf-next/tools/testing/selftests/bpf/prog_tests/snprintf_btf.c: In 
function ‘test_snprintf_btf’:
/root/bpf-next/tools/testing/selftests/bpf/prog_tests/snprintf_btf.c:30:2: 
warning: ignoring return value of ‘system’, declared with attribute 
warn_unused_result [-Wunused-result]
  system("ping -c 1 127.0.0.1 > /dev/null");
  ^
  [...]

Please fix, thx!
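
E.g. wrapping the call in the existing CHECK() macro would silence the warning
and also catch a failing ping, roughly (untested):

	/* generate receive event */
	if (CHECK(system("ping -c 1 127.0.0.1 > /dev/null"),
		  "system", "failed to generate receive event\n"))
		goto cleanup;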


Re: [PATCH] tools build feature: cleanup feature files on make clean

2020-09-03 Thread Daniel Borkmann

Hi Arnaldo,

On 9/3/20 9:03 PM, Arnaldo Carvalho de Melo wrote:

Em Thu, Aug 27, 2020 at 10:53:36AM +0200, Jesper Dangaard Brouer escreveu:

The system for "Auto-detecting system features" located under
tools/build/ are (currently) used by perf, libbpf and bpftool. It can
contain stalled feature detection files, which are not cleaned up by
libbpf and bpftool on make clean (side-note: perf tool is correct).

Fix this by making the users invoke the make clean target.

Some details about the changes. The libbpf Makefile already had a
clean-config target (which seems to be copy-pasted from perf), but this
target was not "connected" (a make dependency) to clean target. Choose
not to rename target as someone might be using it. Did change the output
from "CLEAN config" to "CLEAN feature-detect", to make it more clear
what happens.


Since this mostly touches BPF, should it go via the BPF tree?


Already applied roughly a week ago:

https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/commit/?id=661b37cd437ef49cd28444f79b9b0c71ea76e8c8

Thanks,
Daniel


Re: [PATCH] xsk: Free variable on error path

2020-09-02 Thread Daniel Borkmann

On 9/2/20 6:33 PM, Alex Dewar wrote:

In xp_create_dma_map(), memory is allocated to dma_map->dma_pages, but
then dma_map is erroneously compared to NULL, rather than the member.
Fix this.

Addresses-Coverity: ("Dead code")
Fixes: 921b68692abb ("xsk: Enable sharing of dma mappings")
Signed-off-by: Alex Dewar 


Thanks, already applied a fix that was sent earlier [0].

  [0] 
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/commit/?id=1d6fd78a213ee3874f46bdce083b7a41d208886d


Re: [PATCH][next] xsk: fix incorrect memory allocation failure check on dma_map->dma_pages

2020-09-02 Thread Daniel Borkmann

On 9/2/20 6:13 PM, Colin King wrote:

From: Colin Ian King 

The failed memory allocation check for dma_map->dma_pages is incorrect,
it is null checking dma_map and not dma_map->dma_pages. Fix this.

Addresses-Coverity: ("Logicall dead code")
Fixes: 921b68692abb ("xsk: Enable sharing of dma mappings")
Signed-off-by: Colin Ian King 


Thanks, already applied a fix that was sent earlier [0].

  [0] 
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/commit/?id=1d6fd78a213ee3874f46bdce083b7a41d208886d


Re: [PATCH][next] xsk: Fix null check on error return path

2020-09-02 Thread Daniel Borkmann

On 9/2/20 5:07 PM, Gustavo A. R. Silva wrote:

Currently, dma_map is being checked, when the right object identifier
to be null-checked is dma_map->dma_pages, instead.

Fix this by null-checking dma_map->dma_pages.

Addresses-Coverity-ID: 1496811 ("Logically dead code")
Fixes: 921b68692abb ("xsk: Enable sharing of dma mappings")
Signed-off-by: Gustavo A. R. Silva 


Applied, thanks!


Re: KASAN: use-after-free Write in xp_put_pool

2020-09-02 Thread Daniel Borkmann

On 9/2/20 8:57 AM, syzbot wrote:

Hello,

syzbot found the following issue on:


Magnus/Bjorn, ptal, thanks!


HEAD commit:dc1a9bf2 octeontx2-pf: Add UDP segmentation offload support
git tree:   net-next
console output: https://syzkaller.appspot.com/x/log.txt?x=16ff67de90
kernel config:  https://syzkaller.appspot.com/x/.config?x=b6856d16f78d8fa9
dashboard link: https://syzkaller.appspot.com/bug?extid=5334f62e4d22804e646a
compiler:   gcc (GCC) 10.1.0-syz 20200507
syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=12e9f27990
C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=125f3e1e90

The issue was bisected to:

commit a1132430c2c55af62d13e9fca752d46f14d548b3
Author: Magnus Karlsson 
Date:   Fri Aug 28 08:26:26 2020 +

 xsk: Add shared umem support between devices

bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=10a373de90
final oops: https://syzkaller.appspot.com/x/report.txt?x=12a373de90
console output: https://syzkaller.appspot.com/x/log.txt?x=14a373de90

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+5334f62e4d22804e6...@syzkaller.appspotmail.com
Fixes: a1132430c2c5 ("xsk: Add shared umem support between devices")

==
BUG: KASAN: use-after-free in instrument_atomic_write 
include/linux/instrumented.h:71 [inline]
BUG: KASAN: use-after-free in atomic_fetch_sub_release 
include/asm-generic/atomic-instrumented.h:220 [inline]
BUG: KASAN: use-after-free in refcount_sub_and_test 
include/linux/refcount.h:266 [inline]
BUG: KASAN: use-after-free in refcount_dec_and_test 
include/linux/refcount.h:294 [inline]
BUG: KASAN: use-after-free in xp_put_pool+0x2c/0x1e0 net/xdp/xsk_buff_pool.c:262
Write of size 4 at addr 8880a6a4d860 by task ksoftirqd/0/9

CPU: 0 PID: 9 Comm: ksoftirqd/0 Not tainted 5.9.0-rc1-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 
01/01/2011
Call Trace:
  __dump_stack lib/dump_stack.c:77 [inline]
  dump_stack+0x18f/0x20d lib/dump_stack.c:118
  print_address_description.constprop.0.cold+0xae/0x497 mm/kasan/report.c:383
  __kasan_report mm/kasan/report.c:513 [inline]
  kasan_report.cold+0x1f/0x37 mm/kasan/report.c:530
  check_memory_region_inline mm/kasan/generic.c:186 [inline]
  check_memory_region+0x13d/0x180 mm/kasan/generic.c:192
  instrument_atomic_write include/linux/instrumented.h:71 [inline]
  atomic_fetch_sub_release include/asm-generic/atomic-instrumented.h:220 
[inline]
  refcount_sub_and_test include/linux/refcount.h:266 [inline]
  refcount_dec_and_test include/linux/refcount.h:294 [inline]
  xp_put_pool+0x2c/0x1e0 net/xdp/xsk_buff_pool.c:262
  xsk_destruct+0x7d/0xa0 net/xdp/xsk.c:1138
  __sk_destruct+0x4b/0x860 net/core/sock.c:1764
  rcu_do_batch kernel/rcu/tree.c:2428 [inline]
  rcu_core+0x5c7/0x1190 kernel/rcu/tree.c:2656
  __do_softirq+0x2de/0xa24 kernel/softirq.c:298
  run_ksoftirqd kernel/softirq.c:652 [inline]
  run_ksoftirqd+0x89/0x100 kernel/softirq.c:644
  smpboot_thread_fn+0x655/0x9e0 kernel/smpboot.c:165
  kthread+0x3b5/0x4a0 kernel/kthread.c:292
  ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:294

Allocated by task 6854:
  kasan_save_stack+0x1b/0x40 mm/kasan/common.c:48
  kasan_set_track mm/kasan/common.c:56 [inline]
  __kasan_kmalloc.constprop.0+0xbf/0xd0 mm/kasan/common.c:461
  kmalloc_node include/linux/slab.h:577 [inline]
  kvmalloc_node+0x61/0xf0 mm/util.c:574
  kvmalloc include/linux/mm.h:750 [inline]
  kvzalloc include/linux/mm.h:758 [inline]
  xp_create_and_assign_umem+0x58/0x8d0 net/xdp/xsk_buff_pool.c:54
  xsk_bind+0x9a0/0xed0 net/xdp/xsk.c:709
  __sys_bind+0x1e9/0x250 net/socket.c:1656
  __do_sys_bind net/socket.c:1667 [inline]
  __se_sys_bind net/socket.c:1665 [inline]
  __x64_sys_bind+0x6f/0xb0 net/socket.c:1665
  do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
  entry_SYSCALL_64_after_hwframe+0x44/0xa9

Freed by task 6854:
  kasan_save_stack+0x1b/0x40 mm/kasan/common.c:48
  kasan_set_track+0x1c/0x30 mm/kasan/common.c:56
  kasan_set_free_info+0x1b/0x30 mm/kasan/generic.c:355
  __kasan_slab_free+0xd8/0x120 mm/kasan/common.c:422
  __cache_free mm/slab.c:3418 [inline]
  kfree+0x103/0x2c0 mm/slab.c:3756
  kvfree+0x42/0x50 mm/util.c:603
  xp_destroy net/xdp/xsk_buff_pool.c:44 [inline]
  xp_destroy+0x45/0x60 net/xdp/xsk_buff_pool.c:38
  xsk_bind+0xbdd/0xed0 net/xdp/xsk.c:719
  __sys_bind+0x1e9/0x250 net/socket.c:1656
  __do_sys_bind net/socket.c:1667 [inline]
  __se_sys_bind net/socket.c:1665 [inline]
  __x64_sys_bind+0x6f/0xb0 net/socket.c:1665
  do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
  entry_SYSCALL_64_after_hwframe+0x44/0xa9

The buggy address belongs to the object at 8880a6a4d800
  which belongs to the cache kmalloc-1k of size 1024
The buggy address is located 96 bytes inside of
  1024-byte region [8880a6a4d800, 8880a6a4dc00)
The buggy address belongs to the page:
page:dd5fc18f 

Re: [PATCH] tools build feature: cleanup feature files on make clean

2020-08-28 Thread Daniel Borkmann

On 8/27/20 10:53 AM, Jesper Dangaard Brouer wrote:

The system for "Auto-detecting system features" located under
tools/build/ are (currently) used by perf, libbpf and bpftool. It can
contain stalled feature detection files, which are not cleaned up by
libbpf and bpftool on make clean (side-note: perf tool is correct).

Fix this by making the users invoke the make clean target.

Some details about the changes. The libbpf Makefile already had a
clean-config target (which seems to be copy-pasted from perf), but this
target was not "connected" (a make dependency) to clean target. Choose
not to rename target as someone might be using it. Did change the output
from "CLEAN config" to "CLEAN feature-detect", to make it more clear
what happens.

This is related to the complaint and troubleshooting in link:
Link: https://lore.kernel.org/lkml/20200818122007.2d1cfe2d@carbon/

Signed-off-by: Jesper Dangaard Brouer 


Applied to bpf-next, thanks!


Re: [PATCH bpf-next] libbpf: simplify the return expression of build_map_pin_path()

2020-08-20 Thread Daniel Borkmann

On 8/19/20 4:53 AM, Xu Wang wrote:

Simplify the return expression.

Signed-off-by: Xu Wang 


Applied, thanks!


Re: [PATCH bpf-next v2] bpf: fix segmentation fault of test_progs

2020-08-11 Thread Daniel Borkmann

On 8/10/20 5:39 PM, Jianlin Lv wrote:

test_progs reports the segmentation fault as below

$ sudo ./test_progs -t mmap --verbose
test_mmap:PASS:skel_open_and_load 0 nsec
..
test_mmap:PASS:adv_mmap1 0 nsec
test_mmap:PASS:adv_mmap2 0 nsec
test_mmap:PASS:adv_mmap3 0 nsec
test_mmap:PASS:adv_mmap4 0 nsec
Segmentation fault

This issue was triggered because mmap() and munmap() used inconsistent
length parameters; mmap() creates a new mapping of 3*page_size, but the
length parameter set in the subsequent re-map and munmap() functions is
4*page_size; this leads to the destruction of the process space.

To fix this issue, first create 4 pages of anonymous mapping, then do all
the mmap() with MAP_FIXED.
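
Roughly, the intended layout is the following (a sketch with illustrative
names such as page_size and map_fd, not the actual selftest code):

#include <sys/mman.h>

static void adv_mmap_sketch(int map_fd, long page_size)
{
        /* Reserve four pages of anonymous memory up front ... */
        void *tmp = mmap(NULL, 4 * page_size, PROT_READ | PROT_WRITE,
                         MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        if (tmp == MAP_FAILED)
                return;

        /* ... then place the map's memory inside it with MAP_FIXED, so all
         * later mmap()/munmap() lengths refer to the same reservation.
         */
        void *mapped = mmap(tmp, 3 * page_size, PROT_READ | PROT_WRITE,
                            MAP_SHARED | MAP_FIXED, map_fd, 0);
        if (mapped != MAP_FAILED) {
                /* exercise the mapping here */
        }

        munmap(tmp, 4 * page_size);     /* always tear down the full reservation */
}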

Another issue is that when unmap the second page fails, the length
parameter to delete tmp1 mappings should be 4*page_size.

Signed-off-by: Jianlin Lv 


Applied, thanks!


Re: [PATCH] kernel: bpf: delete repeated words in comments

2020-08-07 Thread Daniel Borkmann

On 8/7/20 5:31 AM, Randy Dunlap wrote:

Drop repeated words in kernel/bpf/.
{has, the}

Signed-off-by: Randy Dunlap 
Cc: Alexei Starovoitov 
Cc: Daniel Borkmann 
Cc: net...@vger.kernel.org
Cc: b...@vger.kernel.org


Applied, thanks!


Re: [PATCH bpf] bpf: doc: remove references to warning message when using bpf_trace_printk()

2020-08-07 Thread Daniel Borkmann

On 8/7/20 1:50 PM, Alan Maguire wrote:

The BPF helper bpf_trace_printk() no longer uses trace_printk();
it now triggers a dedicated trace event.  Hence the described
warning is no longer present, so remove the discussion of it as
it may confuse people.

Fixes: ac5a72ea5c89 ("bpf: Use dedicated bpf_trace_printk event instead of 
trace_printk()")
Signed-off-by: Alan Maguire 


Applied, thanks!


Re: [PATCH bpf-next v3 00/29] bpf: switch to memcg-based memory accounting

2020-08-03 Thread Daniel Borkmann

On 8/3/20 7:05 PM, Roman Gushchin wrote:

On Mon, Aug 03, 2020 at 06:39:01PM +0200, Daniel Borkmann wrote:

On 8/3/20 5:34 PM, Roman Gushchin wrote:

On Mon, Aug 03, 2020 at 02:05:29PM +0200, Daniel Borkmann wrote:

On 7/30/20 11:22 PM, Roman Gushchin wrote:

Currently bpf is using the memlock rlimit for the memory accounting.
This approach has its downsides and over time has created a significant
amount of problems:

1) The limit is per-user, but because most bpf operations are performed
  as root, the limit has little value.

2) It's hard to come up with a specific maximum value. Especially because
  the counter is shared with non-bpf users (e.g. memlock() users).
  Any specific value is either too low and creates false failures
  or too high and useless.

3) Charging is not connected to the actual memory allocation. Bpf code
  should manually calculate the estimated cost and precharge the counter,
  and then take care of uncharging, including all fail paths.
  It adds to the code complexity and makes it easy to leak a charge.

4) There is no simple way of getting the current value of the counter.
  We've used drgn for it, but it's far from being convenient.

5) Cryptic -EPERM is returned on exceeding the limit. Libbpf even had
  a function to "explain" this case for users.

In order to overcome these problems let's switch to the memcg-based
memory accounting of bpf objects. With the recent addition of the percpu
memory accounting, now it's possible to provide a comprehensive accounting
of memory used by bpf programs and maps.

This approach has the following advantages:
1) The limit is per-cgroup and hierarchical. It's way more flexible and allows
  a better control over memory usage by different workloads.

2) The actual memory consumption is taken into account. It happens automatically
  on the allocation time if __GFP_ACCOUNT flags is passed. Uncharging is 
also
  performed automatically on releasing the memory. So the code on the bpf 
side
  becomes simpler and safer.

3) There is a simple way to get the current value and statistics.

The patchset consists of the following parts:
1) memcg-based accounting for various bpf objects: progs and maps
2) removal of the rlimit-based accounting
3) removal of rlimit adjustments in userspace samples



The diff stat looks nice & agree that rlimit sucks, but I'm missing how this is 
set
is supposed to work reliably, at least I currently fail to see it. Elaborating 
on this
in more depth especially for the case of unprivileged users should be a 
/fundamental/
part of the commit message.

Lets take an example: unprivileged user adds a max sized hashtable to one of its
programs, and configures the map that it will perform runtime allocation. The 
load
succeeds as it doesn't surpass the limits set for the current memcg. Kernel then
processes packets from softirq. Given the runtime allocations, we end up 
mischarging
to whoever ended up triggering __do_softirq(). If, for example, ksoftirq 
thread, then
it's probably reasonable to assume that this might not be accounted e.g. limits 
are
not imposed on the root cgroup. If so we would probably need to drag the 
context of
/where/ this must be charged to __memcg_kmem_charge_page() to do it reliably. 
Otherwise
how do you prevent unprivileged users from OOMing the machine?


this is a valid concern, thank you for bringing it in. It can be resolved by
associating a map with a memory cgroup on creation, so that we can charge
this memory cgroup later, even from a soft-irq context. The question here is
whether we want to do it for all maps, or just for dynamic hashtables
(or any similar cases, if there are any)? I think the second option
is better. With the first option we have to annotate all memory allocations
in bpf maps code with memalloc_use_memcg()/memalloc_unuse_memcg(),
so it's easy to mess it up in the future.
What do you think?


We would need to do it for all maps that are configured with non-prealloc, e.g. 
not
only hash/LRU table but also others like LPM maps etc. I wonder whether program 
entry/
exit could do the memalloc_use_memcg() / memalloc_unuse_memcg() and then 
everything
would be accounted against the prog's memcg from runtime side, but then there's 
the
usual issue with 'unuse'-restore on tail calls, and it doesn't solve the 
syscall side.
But seems like the memalloc_{use,unuse}_memcg()'s remote charging is lightweight
anyway compared to some of the other map update work such as taking bucket lock 
etc.
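
Just to sketch the idea (map->memcg and the wrapper name are purely
illustrative here, not an existing API): remember the map's memcg at
creation time and charge runtime allocations against it, even from
softirq where 'current' is unrelated to the map's owner:

static void *map_charged_kmalloc(struct bpf_map *map, size_t size, gfp_t flags)
{
        void *ptr;

        memalloc_use_memcg(map->memcg);
        ptr = kmalloc(size, flags | __GFP_ACCOUNT);
        memalloc_unuse_memcg();

        return ptr;
}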


I'll explore it and address in the next version. Thank you for suggestions!


Ok.

I'm probably still missing one more thing, but could you elaborate what limits 
would
be enforced if an unprivileged user creates a prog/map on the host (w/o further 
action
such as moving to a specific cgroup)?

From what I can tell via looking at systemd:

  $ cat /proc/self/cgroup
  11:cpuset:/
  10:hugetlb:/
  9:devices:/user.slice
  8:cpu,cpuacct

Re: [PATCH bpf-next v3 00/29] bpf: switch to memcg-based memory accounting

2020-08-03 Thread Daniel Borkmann

On 8/3/20 5:34 PM, Roman Gushchin wrote:

On Mon, Aug 03, 2020 at 02:05:29PM +0200, Daniel Borkmann wrote:

On 7/30/20 11:22 PM, Roman Gushchin wrote:

Currently bpf is using the memlock rlimit for the memory accounting.
This approach has its downsides and over time has created a significant
amount of problems:

1) The limit is per-user, but because most bpf operations are performed
 as root, the limit has little value.

2) It's hard to come up with a specific maximum value. Especially because
 the counter is shared with non-bpf users (e.g. memlock() users).
 Any specific value is either too low and creates false failures
 or too high and useless.

3) Charging is not connected to the actual memory allocation. Bpf code
 should manually calculate the estimated cost and precharge the counter,
 and then take care of uncharging, including all fail paths.
 It adds to the code complexity and makes it easy to leak a charge.

4) There is no simple way of getting the current value of the counter.
 We've used drgn for it, but it's far from being convenient.

5) Cryptic -EPERM is returned on exceeding the limit. Libbpf even had
 a function to "explain" this case for users.

In order to overcome these problems let's switch to the memcg-based
memory accounting of bpf objects. With the recent addition of the percpu
memory accounting, now it's possible to provide a comprehensive accounting
of memory used by bpf programs and maps.

This approach has the following advantages:
1) The limit is per-cgroup and hierarchical. It's way more flexible and allows
 a better control over memory usage by different workloads.

2) The actual memory consumption is taken into account. It happens automatically
 on the allocation time if __GFP_ACCOUNT flags is passed. Uncharging is also
 performed automatically on releasing the memory. So the code on the bpf 
side
 becomes simpler and safer.

3) There is a simple way to get the current value and statistics.

The patchset consists of the following parts:
1) memcg-based accounting for various bpf objects: progs and maps
2) removal of the rlimit-based accounting
3) removal of rlimit adjustments in userspace samples



The diff stat looks nice & agree that rlimit sucks, but I'm missing how this is 
set
is supposed to work reliably, at least I currently fail to see it. Elaborating 
on this
in more depth especially for the case of unprivileged users should be a 
/fundamental/
part of the commit message.

Lets take an example: unprivileged user adds a max sized hashtable to one of its
programs, and configures the map that it will perform runtime allocation. The 
load
succeeds as it doesn't surpass the limits set for the current memcg. Kernel then
processes packets from softirq. Given the runtime allocations, we end up 
mischarging
to whoever ended up triggering __do_softirq(). If, for example, ksoftirq 
thread, then
it's probably reasonable to assume that this might not be accounted e.g. limits 
are
not imposed on the root cgroup. If so we would probably need to drag the 
context of
/where/ this must be charged to __memcg_kmem_charge_page() to do it reliably. 
Otherwise
how do you prevent unprivileged users from OOMing the machine?


this is a valid concern, thank you for bringing it in. It can be resolved by
associating a map with a memory cgroup on creation, so that we can charge
this memory cgroup later, even from a soft-irq context. The question here is
whether we want to do it for all maps, or just for dynamic hashtables
(or any similar cases, if there are any)? I think the second option
is better. With the first option we have to annotate all memory allocations
in bpf maps code with memalloc_use_memcg()/memalloc_unuse_memcg(),
so it's easy to mess it up in the future.
What do you think?


We would need to do it for all maps that are configured with non-prealloc, e.g. 
not
only hash/LRU table but also others like LPM maps etc. I wonder whether program 
entry/
exit could do the memalloc_use_memcg() / memalloc_unuse_memcg() and then 
everything
would be accounted against the prog's memcg from runtime side, but then there's 
the
usual issue with 'unuse'-restore on tail calls, and it doesn't solve the 
syscall side.
But seems like the memalloc_{use,unuse}_memcg()'s remote charging is lightweight
anyway compared to some of the other map update work such as taking bucket lock 
etc.


Similarly, what happens to unprivileged users if kmemcg was not configured into 
the
kernel or has been disabled?


Well, I don't think we can address it. Memcg-based memory accounting requires
enabled memory cgroups, a properly configured cgroup tree and also the kernel
memory accounting turned on to function properly.
Because we at Facebook are using cgroup for the memory accounting and control
everywhere, I might be biased. If there are real !memcg systems which are
actively using non-privileged bpf, we should keep the old system in place
and make i

Re: [PATCH] tools/bpf/bpftool: Fix wrong return value in do_dump()

2020-08-03 Thread Daniel Borkmann

On 8/2/20 1:15 PM, Tianjia Zhang wrote:

In case of btf_id does not exist, a negative error code -ENOENT
should be returned.

Fixes: c93cc69004df3 ("bpftool: add ability to dump BTF types")
Cc: Andrii Nakryiko 
Signed-off-by: Tianjia Zhang 


Applied, thanks!


Re: [PATCH bpf-next v3 00/29] bpf: switch to memcg-based memory accounting

2020-08-03 Thread Daniel Borkmann

On 7/30/20 11:22 PM, Roman Gushchin wrote:

Currently bpf is using the memlock rlimit for the memory accounting.
This approach has its downsides and over time has created a significant
amount of problems:

1) The limit is per-user, but because most bpf operations are performed
as root, the limit has little value.

2) It's hard to come up with a specific maximum value. Especially because
the counter is shared with non-bpf users (e.g. memlock() users).
Any specific value is either too low and creates false failures
or too high and useless.

3) Charging is not connected to the actual memory allocation. Bpf code
should manually calculate the estimated cost and precharge the counter,
and then take care of uncharging, including all fail paths.
It adds to the code complexity and makes it easy to leak a charge.

4) There is no simple way of getting the current value of the counter.
We've used drgn for it, but it's far from being convenient.

5) Cryptic -EPERM is returned on exceeding the limit. Libbpf even had
a function to "explain" this case for users.

In order to overcome these problems let's switch to the memcg-based
memory accounting of bpf objects. With the recent addition of the percpu
memory accounting, now it's possible to provide a comprehensive accounting
of memory used by bpf programs and maps.

This approach has the following advantages:
1) The limit is per-cgroup and hierarchical. It's way more flexible and allows
a better control over memory usage by different workloads.

2) The actual memory consumption is taken into account. It happens automatically
on the allocation time if __GFP_ACCOUNT flags is passed. Uncharging is also
performed automatically on releasing the memory. So the code on the bpf side
becomes simpler and safer.

3) There is a simple way to get the current value and statistics.

The patchset consists of the following parts:
1) memcg-based accounting for various bpf objects: progs and maps
2) removal of the rlimit-based accounting
3) removal of rlimit adjustments in userspace samples


The diff stat looks nice & agree that rlimit sucks, but I'm missing how this is 
set
is supposed to work reliably, at least I currently fail to see it. Elaborating 
on this
in more depth especially for the case of unprivileged users should be a 
/fundamental/
part of the commit message.

Lets take an example: unprivileged user adds a max sized hashtable to one of its
programs, and configures the map that it will perform runtime allocation. The 
load
succeeds as it doesn't surpass the limits set for the current memcg. Kernel then
processes packets from softirq. Given the runtime allocations, we end up 
mischarging
to whoever ended up triggering __do_softirq(). If, for example, ksoftirq 
thread, then
it's probably reasonable to assume that this might not be accounted e.g. limits 
are
not imposed on the root cgroup. If so we would probably need to drag the 
context of
/where/ this must be charged to __memcg_kmem_charge_page() to do it reliably. 
Otherwise
how do you prevent unprivileged users from OOMing the machine?

Similarly, what happens to unprivileged users if kmemcg was not configured into 
the
kernel or has been disabled?

Thanks,
Daniel


Re: [PATCH bpf-next v3] Documentation/bpf: Use valid and new links in index.rst

2020-07-31 Thread Daniel Borkmann

On 7/31/20 10:29 AM, Tiezhu Yang wrote:

There exists an error "404 Not Found" when I click the html link of
"Documentation/networking/filter.rst" in the BPF documentation [1],
fix it.

Additionally, use the new links about "BPF and XDP Reference Guide"
and "bpf(2)" to avoid redirects.

[1] https://www.kernel.org/doc/html/latest/bpf/

Fixes: d9b9170a2653 ("docs: bpf: Rename README.rst to index.rst")
Fixes: cb3f0d56e153 ("docs: networking: convert filter.txt to ReST")
Signed-off-by: Tiezhu Yang 


That looks better, applied, thanks!


Re: [PATCH bpf-next] bpf: fix compilation warning of selftests

2020-07-31 Thread Daniel Borkmann

On 7/31/20 8:16 AM, Jianlin Lv wrote:

Clang compiler version: 12.0.0
The following warning appears during the selftests/bpf compilation:

prog_tests/send_signal.c:51:3: warning: ignoring return value of ‘write’,
declared with attribute warn_unused_result [-Wunused-result]
51 |   write(pipe_c2p[1], buf, 1);
   |   ^~
prog_tests/send_signal.c:54:3: warning: ignoring return value of ‘read’,
declared with attribute warn_unused_result [-Wunused-result]
54 |   read(pipe_p2c[0], buf, 1);
   |   ^
..

prog_tests/stacktrace_build_id_nmi.c:13:2: warning: ignoring return value
of ‘fscanf’, declared with attribute warn_unused_result [-Wunused-result]
13 |  fscanf(f, "%llu", &sample_freq);
   |  ^~~

test_tcpnotify_user.c:133:2: warning:ignoring return value of ‘system’,
declared with attribute warn_unused_result [-Wunused-result]
   133 |  system(test_script);
   |  ^~~
test_tcpnotify_user.c:138:2: warning:ignoring return value of ‘system’,
declared with attribute warn_unused_result [-Wunused-result]
   138 |  system(test_script);
   |  ^~~
test_tcpnotify_user.c:143:2: warning:ignoring return value of ‘system’,
declared with attribute warn_unused_result [-Wunused-result]
   143 |  system(test_script);
   |  ^~~

Add code that fixes the compilation warnings about ignored return values
and handles any errors; checking the return value of library APIs makes
the code more secure.

Signed-off-by: Jianlin Lv 


Looks good overall, there is one small bug that slipped in, see below:

[...]

diff --git a/tools/testing/selftests/bpf/test_tcpnotify_user.c 
b/tools/testing/selftests/bpf/test_tcpnotify_user.c
index f9765ddf0761..869e28c92d73 100644
--- a/tools/testing/selftests/bpf/test_tcpnotify_user.c
+++ b/tools/testing/selftests/bpf/test_tcpnotify_user.c
@@ -130,17 +130,26 @@ int main(int argc, char **argv)
sprintf(test_script,
"iptables -A INPUT -p tcp --dport %d -j DROP",
TESTPORT);
-   system(test_script);
+   if (system(test_script)) {
+   printf("FAILED: execute command: %s\n", test_script);
+   goto err;
+   }
  
  	sprintf(test_script,

"nc 127.0.0.1 %d < /etc/passwd > /dev/null 2>&1 ",
TESTPORT);
-   system(test_script);
+   if (system(test_script)) {
+   printf("FAILED: execute command: %s\n", test_script);
+   goto err;
+   }


Did you try to run this test case? With the patch here it will fail:

  # ./test_tcpnotify_user
  FAILED: execute command: nc 127.0.0.1 12877 < /etc/passwd > /dev/null 2>&1

This is because nc returns 1 as exit code and for the test it is actually 
expected
to fail given the iptables rule we installed for TESTPORT right above and remove
again below.
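
One way to keep the error handling without breaking that expectation would
be something along these lines (untested sketch; helper name is illustrative):

#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>

/* For the nc command a non-zero exit status is the expected outcome here,
 * so only treat "could not run the command at all" as a failure.
 */
static int run_cmd_ignore_status(const char *cmd)
{
        int status = system(cmd);

        if (status == -1 || !WIFEXITED(status)) {
                printf("FAILED: execute command: %s\n", cmd);
                return -1;
        }
        return 0;       /* command ran; its exit status is intentionally ignored */
}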

Please adapt this and send a v2, thanks!


sprintf(test_script,
"iptables -D INPUT -p tcp --dport %d -j DROP",
TESTPORT);
-   system(test_script);
+   if (system(test_script)) {
+   printf("FAILED: execute command: %s\n", test_script);
+   goto err;
+   }
  
	rv = bpf_map_lookup_elem(bpf_map__fd(global_map), &key, &g);

if (rv != 0) {





Re: [PATCH bpf-next 1/1] arm64: bpf: Add BPF exception tables

2020-07-30 Thread Daniel Borkmann

On 7/30/20 11:14 PM, Jean-Philippe Brucker wrote:

On Thu, Jul 30, 2020 at 09:47:39PM +0200, Daniel Borkmann wrote:

On 7/30/20 4:22 PM, Jean-Philippe Brucker wrote:

On Thu, Jul 30, 2020 at 08:28:56AM -0400, Qian Cai wrote:

On Tue, Jul 28, 2020 at 05:21:26PM +0200, Jean-Philippe Brucker wrote:

When a tracing BPF program attempts to read memory without using the
bpf_probe_read() helper, the verifier marks the load instruction with
the BPF_PROBE_MEM flag. Since the arm64 JIT does not currently recognize
this flag it falls back to the interpreter.

Add support for BPF_PROBE_MEM, by appending an exception table to the
BPF program. If the load instruction causes a data abort, the fixup
infrastructure finds the exception table and fixes up the fault, by
clearing the destination register and jumping over the faulting
instruction.

To keep the compact exception table entry format, inspect the pc in
fixup_exception(). A more generic solution would add a "handler" field
to the table entry, like on x86 and s390.
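
For reference, the two entry layouts being contrasted look roughly like
this (simplified sketch, the fields are offsets relative to the entry):

/* arm64's compact entry: two relative offsets, no per-entry handler. */
struct exception_table_entry {
        int insn, fixup;
};

/*
 * The x86/s390-style alternative mentioned above adds a third per-entry
 * field so each entry can reference its own fixup handler:
 *
 *      struct exception_table_entry {
 *              int insn, fixup, handler;
 *      };
 */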

Signed-off-by: Jean-Philippe Brucker 


This will fail to compile on arm64,

https://gitlab.com/cailca/linux-mm/-/blob/master/arm64.config

arch/arm64/mm/extable.o: In function `fixup_exception':
arch/arm64/mm/extable.c:19: undefined reference to `arm64_bpf_fixup_exception'


Thanks for the report, I attached a fix. Daniel, can I squash it and
resend as v2 or is it too late?


If you want I can squash your attached snippet into the original patch of
yours. If you want to send a v2 that is fine as well of course. Let me know.


Yes please squash it into the original patch, sorry for the mess


Done, thanks!


Re: [PATCH bpf-next 1/1] arm64: bpf: Add BPF exception tables

2020-07-30 Thread Daniel Borkmann

On 7/30/20 4:22 PM, Jean-Philippe Brucker wrote:

On Thu, Jul 30, 2020 at 08:28:56AM -0400, Qian Cai wrote:

On Tue, Jul 28, 2020 at 05:21:26PM +0200, Jean-Philippe Brucker wrote:

When a tracing BPF program attempts to read memory without using the
bpf_probe_read() helper, the verifier marks the load instruction with
the BPF_PROBE_MEM flag. Since the arm64 JIT does not currently recognize
this flag it falls back to the interpreter.

Add support for BPF_PROBE_MEM, by appending an exception table to the
BPF program. If the load instruction causes a data abort, the fixup
infrastructure finds the exception table and fixes up the fault, by
clearing the destination register and jumping over the faulting
instruction.

To keep the compact exception table entry format, inspect the pc in
fixup_exception(). A more generic solution would add a "handler" field
to the table entry, like on x86 and s390.

Signed-off-by: Jean-Philippe Brucker 


This will fail to compile on arm64,

https://gitlab.com/cailca/linux-mm/-/blob/master/arm64.config

arch/arm64/mm/extable.o: In function `fixup_exception':
arch/arm64/mm/extable.c:19: undefined reference to `arm64_bpf_fixup_exception'


Thanks for the report, I attached a fix. Daniel, can I squash it and
resend as v2 or is it too late?


If you want I can squash your attached snippet into the original patch of
yours. If you want to send a v2 that is fine as well of course. Let me know.

Thanks,
Daniel


Re: [Linux-kernel-mentees] [PATCH net v2] xdp: Prevent kernel-infoleak in xsk_getsockopt()

2020-07-28 Thread Daniel Borkmann

On 7/28/20 7:36 AM, Peilin Ye wrote:

xsk_getsockopt() is copying uninitialized stack memory to userspace when
`extra_stats` is `false`. Fix it.

Fixes: 8aa5a33578e9 ("xsk: Add new statistics")
Suggested-by: Dan Carpenter 
Signed-off-by: Peilin Ye 
---
Doing `= {};` is sufficient since currently `struct xdp_statistics` is
defined as follows:

struct xdp_statistics {
__u64 rx_dropped;
__u64 rx_invalid_descs;
__u64 tx_invalid_descs;
__u64 rx_ring_full;
__u64 rx_fill_ring_empty_descs;
__u64 tx_ring_empty_descs;
};

When being copied to the userspace, `stats` will not contain any
uninitialized "holes" between struct fields.


I've added above explanation to the commit log since it's useful reasoning for 
later
on 'why' something has been done a certain way. Applied, thanks Peilin!


Re: [PATCH][next] bpf: fix swapped arguments in calls to check_buffer_access

2020-07-28 Thread Daniel Borkmann

On 7/27/20 11:39 PM, Yonghong Song wrote:

On 7/27/20 10:54 AM, Colin King wrote:

From: Colin Ian King 

There are a couple of arguments of the boolean flag zero_size_allowed
and the char pointer buf_info when calling to function check_buffer_access
that are swapped by mistake. Fix these by swapping them to correct
the argument ordering.

Addresses-Coverity: ("Array compared to 0")
Fixes: afbf21dce668 ("bpf: Support readonly/readwrite buffers in verifier")
Signed-off-by: Colin Ian King 


Thanks for the fix!
Acked-by: Yonghong Song 


Sigh, thanks for the fix Colin, applied! Yonghong, could you follow-up with
BPF selftest test cases that exercise these paths? Thx


Re: [PATCH bpf-next] bpf: Generate cookie for new non-initial net NS

2020-07-21 Thread Daniel Borkmann

On 7/20/20 4:09 PM, Jianlin Lv wrote:

For non-initial network NS, the net cookie is generated when
bpf_get_netns_cookie_sock is called for the first time, but it is more
reasonable to complete the cookie generation work when creating a new
network NS, just like init_net.
net_gen_cookie() is moved into setup_net() so that it can serve both the
initial and non-initial network namespaces.

Signed-off-by: Jianlin Lv 


What use-case are you trying to solve? Why should it be different than, say,
socket cookie generation? I'm currently not seeing much of a point in moving
this. When it's not used in the system, it would actually create more work.


---
  net/core/net_namespace.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index dcd61aca343e..5937bd0df56d 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -336,6 +336,7 @@ static __net_init int setup_net(struct net *net, struct 
user_namespace *user_ns)
 	idr_init(&net->netns_ids);
 	spin_lock_init(&net->nsid_lock);
 	mutex_init(&net->ipv4.ra_mutex);
+   net_gen_cookie(net);
  
  	list_for_each_entry(ops, &pernet_list, list) {

error = ops_init(ops, net);
@@ -1101,7 +1102,6 @@ static int __init net_ns_init(void)
panic("Could not allocate generic netns");
  
  	rcu_assign_pointer(init_net.gen, ng);

-   net_gen_cookie(&init_net);
  
  	down_write(&pernet_ops_rwsem);

 if (setup_net(&init_net, &init_user_ns))





Re: [PATCH] Revert "test_bpf: flag tests that cannot be jited on s390"

2020-07-16 Thread Daniel Borkmann

On 7/16/20 4:39 PM, Seth Forshee wrote:

This reverts commit 3203c9010060806ff88c9989aeab4dc8d9a474dc.

The s390 bpf JIT previously had a restriction on the maximum
program size, which required some tests in test_bpf to be flagged
as expected failures. The program size limitation has been removed,
and the tests now pass, so these tests should no longer be flagged.

Fixes: d1242b10ff03 ("s390/bpf: Remove JITed image size limitations")
Signed-off-by: Seth Forshee 


Applied, thanks!


Re: [Linux-kernel-mentees] [PATCH v3] bpf: Fix NULL pointer dereference in __btf_resolve_helper_id()

2020-07-15 Thread Daniel Borkmann

On 7/14/20 8:09 PM, Peilin Ye wrote:

Prevent __btf_resolve_helper_id() from dereferencing `btf_vmlinux`
as NULL. This patch fixes the following syzbot bug:

 
https://syzkaller.appspot.com/bug?id=f823224ada908fa5c207902a5a62065e53ca0fcc

Reported-by: syzbot+ee09bda7017345f1f...@syzkaller.appspotmail.com
Signed-off-by: Peilin Ye 


Looks good, applied, thanks! As far as I can tell all the other occurrences are
gated behind btf_parse_vmlinux() where we also init struct_opts, etc.

So for bpf-next this would then end up looking like ...

int btf_resolve_helper_id(struct bpf_verifier_log *log,
  const struct bpf_func_proto *fn, int arg)
{
int id;

if (fn->arg_type[arg] != ARG_PTR_TO_BTF_ID)
return -EINVAL;
id = fn->btf_id[arg];
if (!id || !btf_vmlinux || id > btf_vmlinux->nr_types)
return -EINVAL;
return id;
}


---
Sorry, I got the link wrong. Thank you for pointing that out.

Change in v3:
 - Fix incorrect syzbot dashboard link.

Change in v2:
 - Split NULL and IS_ERR cases.

  kernel/bpf/btf.c | 5 +
  1 file changed, 5 insertions(+)

diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index 30721f2c2d10..092116a311f4 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -4088,6 +4088,11 @@ static int __btf_resolve_helper_id(struct 
bpf_verifier_log *log, void *fn,
const char *tname, *sym;
u32 btf_id, i;
  
+	if (!btf_vmlinux) {

+   bpf_log(log, "btf_vmlinux doesn't exist\n");
+   return -EINVAL;
+   }
+
if (IS_ERR(btf_vmlinux)) {
bpf_log(log, "btf_vmlinux is malformed\n");
return -EINVAL;





Re: [PATCH] tools/bpftool: Fix error return code in do_skeleton()

2020-07-15 Thread Daniel Borkmann

On 7/15/20 5:13 AM, YueHaibing wrote:

The error return code should be PTR_ERR(obj) rather than
PTR_ERR(NULL).

Fixes: 5dc7a8b21144 ("bpftool, selftests/bpf: Embed object file inside 
skeleton")
Signed-off-by: YueHaibing 
---
  tools/bpf/bpftool/gen.c | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/tools/bpf/bpftool/gen.c b/tools/bpf/bpftool/gen.c
index 10de76b296ba..35f62273cdbd 100644
--- a/tools/bpf/bpftool/gen.c
+++ b/tools/bpf/bpftool/gen.c
@@ -305,8 +305,9 @@ static int do_skeleton(int argc, char **argv)
opts.object_name = obj_name;
	obj = bpf_object__open_mem(obj_data, file_sz, &opts);
if (IS_ERR(obj)) {
+   err = PTR_ERR(obj);
+   p_err("failed to open BPF object file: %ld", err);
obj = NULL;
-   p_err("failed to open BPF object file: %ld", PTR_ERR(obj));
goto out;


Instead of just the error number, could we dump something useful to the user 
here
via libbpf_strerror() given you touch this line? Also, I think the convention in
do_skeleton() is to just return {0,-1} given this is propagated as return code
for bpftool.
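
I.e. roughly something like this (untested sketch):

	obj = bpf_object__open_mem(obj_data, file_sz, &opts);
	if (IS_ERR(obj)) {
		char err_buf[256];

		libbpf_strerror(PTR_ERR(obj), err_buf, sizeof(err_buf));
		p_err("failed to open BPF object file: %s", err_buf);
		obj = NULL;
		err = -1;
		goto out;
	}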


}
  





Re: [PATCH v2 bpf-next 1/2] bpf: use dedicated bpf_trace_printk event instead of trace_printk()

2020-07-10 Thread Daniel Borkmann

On 7/10/20 4:22 PM, Alan Maguire wrote:

The bpf helper bpf_trace_printk() uses trace_printk() under the hood.
This leads to an alarming warning message originating from trace
buffer allocation which occurs the first time a program using
bpf_trace_printk() is loaded.

We can instead create a trace event for bpf_trace_printk() and enable
it in-kernel when/if we encounter a program using the
bpf_trace_printk() helper.  With this approach, trace_printk()
is not used directly and no warning message appears.

This work was started by Steven (see Link) and finished by Alan; added
Steven's Signed-off-by with his permission.

Link: https://lore.kernel.org/r/20200628194334.6238b...@oasis.local.home
Signed-off-by: Steven Rostedt (VMware) 
Signed-off-by: Alan Maguire 
---
  kernel/trace/Makefile|  2 ++
  kernel/trace/bpf_trace.c | 41 -
  kernel/trace/bpf_trace.h | 34 ++
  3 files changed, 72 insertions(+), 5 deletions(-)
  create mode 100644 kernel/trace/bpf_trace.h

diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
index 6575bb0..aeba5ee 100644
--- a/kernel/trace/Makefile
+++ b/kernel/trace/Makefile
@@ -31,6 +31,8 @@ ifdef CONFIG_GCOV_PROFILE_FTRACE
  GCOV_PROFILE := y
  endif
  
+CFLAGS_bpf_trace.o := -I$(src)

+
  CFLAGS_trace_benchmark.o := -I$(src)
  CFLAGS_trace_events_filter.o := -I$(src)
  
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c

index 1d874d8..1414bf5 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -11,6 +11,7 @@
  #include 
  #include 
  #include 
+#include 
  #include 
  #include 
  
@@ -19,6 +20,9 @@

  #include "trace_probe.h"
  #include "trace.h"
  
+#define CREATE_TRACE_POINTS

+#include "bpf_trace.h"
+
  #define bpf_event_rcu_dereference(p)  \
	rcu_dereference_protected(p, lockdep_is_held(&bpf_event_mutex))
  
@@ -374,6 +378,29 @@ static void bpf_trace_copy_string(char *buf, void *unsafe_ptr, char fmt_ptype,

}
  }
  
+static DEFINE_RAW_SPINLOCK(trace_printk_lock);

+
+#define BPF_TRACE_PRINTK_SIZE   1024
+
+


nit: double newline


+static inline __printf(1, 0) int bpf_do_trace_printk(const char *fmt, ...)
+{
+   static char buf[BPF_TRACE_PRINTK_SIZE];
+   unsigned long flags;
+   va_list ap;
+   int ret;
+
+   raw_spin_lock_irqsave(&trace_printk_lock, flags);
+   va_start(ap, fmt);
+   ret = vsnprintf(buf, BPF_TRACE_PRINTK_SIZE, fmt, ap);


nit: s/BPF_TRACE_PRINTK_SIZE/sizeof(buf)/


+   va_end(ap);
+   if (ret >= 0)
+   trace_bpf_trace_printk(buf);


Is there a specific reason you added the 'ret >= 0' check on top of [0]? Given
the vsnprintf() internals you either return 0 or number of characters generated,
no?

  [0] https://lore.kernel.org/bpf/20200628194334.6238b...@oasis.local.home/

Rest lgtm, thanks!


+   raw_spin_unlock_irqrestore(&trace_printk_lock, flags);
+
+   return ret;
+}
+
  /*
   * Only limited trace_printk() conversion specifiers allowed:
   * %d %i %u %x %ld %li %lu %lx %lld %lli %llu %llx %p %pB %pks %pus %s
@@ -483,8 +510,7 @@ static void bpf_trace_copy_string(char *buf, void 
*unsafe_ptr, char fmt_ptype,
   */
  #define __BPF_TP_EMIT()   __BPF_ARG3_TP()
  #define __BPF_TP(...) \
-   __trace_printk(0 /* Fake ip */, \
-  fmt, ##__VA_ARGS__)
+   bpf_do_trace_printk(fmt, ##__VA_ARGS__)
  
  #define __BPF_ARG1_TP(...)		\

((mod[0] == 2 || (mod[0] == 1 && __BITS_PER_LONG == 64))\
@@ -521,10 +547,15 @@ static void bpf_trace_copy_string(char *buf, void 
*unsafe_ptr, char fmt_ptype,
  const struct bpf_func_proto *bpf_get_trace_printk_proto(void)
  {
/*
-* this program might be calling bpf_trace_printk,
-* so allocate per-cpu printk buffers
+* This program might be calling bpf_trace_printk,
+* so enable the associated bpf_trace/bpf_trace_printk event.
+* Repeat this each time as it is possible a user has
+* disabled bpf_trace_printk events.  By loading a program
+* calling bpf_trace_printk() however the user has expressed
+* the intent to see such events.
 */
-   trace_printk_init_buffers();
+   if (trace_set_clr_event("bpf_trace", "bpf_trace_printk", 1))
+   pr_warn_ratelimited("could not enable bpf_trace_printk events");
  
  	return &bpf_trace_printk_proto;

  }
diff --git a/kernel/trace/bpf_trace.h b/kernel/trace/bpf_trace.h
new file mode 100644
index 000..9acbc11
--- /dev/null
+++ b/kernel/trace/bpf_trace.h
@@ -0,0 +1,34 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM bpf_trace
+
+#if !defined(_TRACE_BPF_TRACE_H) || defined(TRACE_HEADER_MULTI_READ)
+
+#define _TRACE_BPF_TRACE_H
+
+#include 
+
+TRACE_EVENT(bpf_trace_printk,
+
+   TP_PROTO(const char *bpf_string),
+
+   TP_ARGS(bpf_string),

Re: [PATCH] MAINTAINERS: XDP: restrict N: and K:

2020-07-10 Thread Daniel Borkmann

On 7/10/20 8:17 AM, Alexander A. Klimov wrote:

Am 09.07.20 um 22:37 schrieb Daniel Borkmann:

On 7/9/20 9:42 PM, Alexander A. Klimov wrote:

Rationale:
Documentation/arm/ixp4xx.rst contains "xdp" as part of "ixdp465"
which has nothing to do with XDP.

Signed-off-by: Alexander A. Klimov 
---
  See also: https://lore.kernel.org/lkml/20200709132607.7fb42415@carbon/

  MAINTAINERS | 4 ++--
  1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index 1d4aa7f942de..2bb7feb838af 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -18708,8 +18708,8 @@ F:    include/trace/events/xdp.h
  F:    kernel/bpf/cpumap.c
  F:    kernel/bpf/devmap.c
  F:    net/core/xdp.c
-N:    xdp
-K:    xdp
+N:    (?:\b|_)xdp(?:\b|_)
+K:    (?:\b|_)xdp(?:\b|_)


Please also include \W to generally match on non-alphanumeric char given you
explicitly want to avoid [a-z0-9] around the term xdp.

Aren't \W, ^ and $ already covered by \b?


Ah, true; it says '\b really means (?:(?<=\w)(?!\w)|(?


Re: [PATCH bpf] selftests: bpf: fix detach from sockmap tests

2020-07-09 Thread Daniel Borkmann

On 7/9/20 1:51 PM, Lorenz Bauer wrote:

Fix sockmap tests which rely on old bpf_prog_detach behaviour.
In the first case, the tests check that detaching without giving
a program succeeds. Since these are not the desired semantics,
invert the condition. In the second case, the clean up code doesn't
supply the necessary program fds.

Reported-by: Martin KaFai Lau 
Signed-off-by: Lorenz Bauer 
Fixes: bb0de3131f4c ("bpf: sockmap: Require attach_bpf_fd when detaching a 
program")


Applied, thanks!

