Re: [PATCH] hrtimer: Update softirq_expires_next correctly after __hrtimer_get_next_event()

2021-04-19 Thread Maciej Żenczykowski
On Thu, Apr 15, 2021 at 9:47 AM Thomas Gleixner  wrote:
>
> On Wed, Apr 14 2021 at 11:49, Lorenzo Colitti wrote:
> > On Wed, Apr 14, 2021 at 2:14 AM Greg KH  wrote:
> >> To give context, the commit is now 46eb1701c046 ("hrtimer: Update
> >> softirq_expires_next correctly after __hrtimer_get_next_event()") and is
> >> attached below.
> >>
> >> The f_ncm.c driver is doing a lot of calls to hrtimer_start() with mode
> >> HRTIMER_MODE_REL_SOFT for I think every packet it gets.  If that should
> >> not be happening, we can easily fix it but I guess no one has actually
> >> had fast USB devices that noticed this until now :)
> >
> > AIUI the purpose of this timer is to support packet aggregation. USB
> > transfers are relatively expensive/high latency. 6 Gbps is 500k
> > 1500-byte packets per second, or one every 2us. So f_ncm buffers as
> > many packets as will fit into 16k (usually, 10 1500-byte packets), and
> > only initiates a USB transfer when those packets have arrived. That
> > ends up doing only one transfer every 20us. It sets a 300us timer to
> > ensure that if the 10 packets haven't arrived, it still sends out
> > whatever it has when the timer fires. The timer is set/updated on
> > every packet buffered by ncm.
> >
> > Is this somehow much more expensive in 5.10.24 than it was before?
> > Even if this driver is somehow "holding it wrong", might there not be
> > other workloads that have a similar problem? What about regressions on
> > those workloads?
>
> Let's put the question of whether this hrtimer usage is sensible or not
> aside for now.
>
> I stared at the change for a while and did some experiments to recreate
> the problem, but that didn't get me anywhere.
>
> Could you please do the following?
>
> Enable tracing and enable the following tracepoints:
>
> timers/hrtimer_cancel
> timers/hrtimer_start
> timers/hrtimer_expire_entry
> irq/softirq_raise
> irq/softirq_enter
> irq/softirq_exit
>
> and function tracing filtered on ncm_wrap_ntb() and
> package_for_tx() only (to reduce the noise).
>
> Run the test on kernels with and without that commit and collect trace
> data for both.
>
> That should give me a pretty clear picture what's going on.

Lorenzo is trying to get the traces you asked for, or rather he’s
trying to get confirmation he can post them.

Our initial observation of these results seems to suggest that
updating the timer (hrtimer_start, which seems to also call
hrtimer_cancel) takes twice as long as it used to.

My gut feeling is that softirq_activated is usually false, and in that
case the old code makes just one call, __hrtimer_get_next_event(...,
HRTIMER_ACTIVE_ALL), while the new code first calls
__hrtimer_get_next_event(..., HRTIMER_ACTIVE_SOFT) and then
__hrtimer_get_next_event(..., HRTIMER_ACTIVE_HARD).
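
Roughly, as I understand it (pseudocode only, not the exact kernel
code; the usual cpu_base argument is assumed):

    /* old: one scan over all clock bases */
    expires_next = __hrtimer_get_next_event(cpu_base, HRTIMER_ACTIVE_ALL);

    /* new: when softirq is not activated, two separate scans */
    soft = __hrtimer_get_next_event(cpu_base, HRTIMER_ACTIVE_SOFT);
    cpu_base->softirq_expires_next = soft;
    expires_next = __hrtimer_get_next_event(cpu_base, HRTIMER_ACTIVE_HARD);
    expires_next = min(expires_next, soft);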

Perhaps __hrtimer_get_next_event() should return both soft and hard
event times in one function call?
Or perhaps hrtimer_start should not call hrtimer_cancel?

We've also been looking at the f_ncm driver itself, and trying to
reduce the number of hrtimer_cancel/start calls.
It's pretty easy to reduce this by a factor of 10x (we’re not yet 100%
certain this is race free), which does drastically improve
performance.
However, it still only takes the regression from 60% to 20%.
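
The gist of the reduction (a sketch of the idea only, not the actual
f_ncm.c diff; field/macro names approximate) is to arm the timer once,
when the first packet of an aggregate is queued, instead of re-arming
it for every packet:

    /* in the tx path, under the ncm lock */
    if (!ncm->skb_tx_data) {
        /* first packet of this NTB: start the 300us flush timer once */
        hrtimer_start(&ncm->task_timer, TX_TIMEOUT_NSECS,
                      HRTIMER_MODE_REL_SOFT);
    }
    /* ... append the packet to the pending NTB as before ... */

With ~10 packets per 16k aggregate that's roughly one hrtimer_start()
per USB transfer instead of ten.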

- Maciej


Re: [PATCH] hrtimer: Update softirq_expires_next correctly after __hrtimer_get_next_event()

2021-04-13 Thread Maciej Żenczykowski
This patch (or at least the version of it that showed up in 5.10.24
LTS when combined with Android Common Kernel and some arm phone
platform drivers) causes a roughly 60% regression (from ~5.3-6 gbps
down to ~2.2 gbps) when running pktgen egressing via the ncm gadget on a
SuperSpeed+ 10gbps USB3 connection.

The regression is not there in 5.10.23, and is present in 5.10.24 and
5.10.26.  Reverting just this one patch is confirmed to restore
performance (both on 5.10.24 and 5.10.26).

We don't know the cause, as we know nothing about hrtimers... but we
lightly suspect that the ncm->task_timer in f_ncm.c is perhaps not
firing as often as it should...

Unfortunately I have no idea how to replicate this on commonly
available hardware (or with a pure stable or 5.11/5.12 Linus tree)
since it requires a gadget capable usb3.1 10gbps controller (which I
only have access to in combination with a 5.10-based arm+dwc3 soc).

(the regression is visible with just usb3.0, but it's smaller due to
usb3.0 topping out at just under 4gbps, though it still drops to
2.2gbps -- and this still doesn't help since usb3 gadget capable
controllers are nearly as rare)

- Maciej & Lorenzo


Re: [PATCH] configfs: make directories inherit uid/gid from creator

2021-01-27 Thread Maciej Żenczykowski
> > Currently a non-root process can create directories, but cannot
> > create stuff in the directories it creates.
>
> Isn't that on purpose?

Is it?  What's the use case of being able to create empty directories
you can't use?
Why allow their creation in the first place then?

> > + (void)configfs_setattr(dentry, &ia);
>
> No need for (void), here, right?

Just being clear we're ignoring any potential error return.

> And this feels like a big user-visible change, what is going to break
> with this?

That I don't know, but it's unlikely to break anything, since it's virtually
impossible to use as it is now.  If non-root creates the directory, it's
currently root-owned, so can only be used by root (or something with
appropriate caps).
Something with capabilities will be able to use it even if it is no
longer owned by root.


[PATCH] configfs: make directories inherit uid/gid from creator

2021-01-23 Thread Maciej Żenczykowski
From: Maciej Żenczykowski 

Currently a non-root process can create directories, but cannot
create stuff in the directories it creates.

Example (before this patch):
  phone:/ $ id
  uid=1000(system) gid=1000(system) groups=1000(system),... context=u:r:su:s0

  phone:/ $ cd /config/usb_gadget/g1/configs/

  phone:/config/usb_gadget/g1/configs $ ls -lZ
  drwxr-xr-x 3 system system u:object_r:configfs:s0  0 2020-12-28 06:03 b.1

  phone:/config/usb_gadget/g1/configs $ mkdir b.2

  phone:/config/usb_gadget/g1/configs $ ls -lZ
  drwxr-xr-x 3 system system u:object_r:configfs:s0  0 2020-12-28 06:03 b.1
  drwxr-xr-x 3 root   root   u:object_r:configfs:s0  0 2020-12-28 06:51 b.2

  phone:/config/usb_gadget/g1/configs $ chown system:system b.2
  chown: 'b.2' to 'system:system': Operation not permitted

  phone:/config/usb_gadget/g1/configs $ ls -lZ b.1
  -rw-r--r-- 1 system system u:object_r:configfs:s0  4096 2020-12-28 05:23 MaxPower
  -rw-r--r-- 1 system system u:object_r:configfs:s0  4096 2020-12-28 05:23 bmAttributes
  lrwxrwxrwx 1 root   root   u:object_r:configfs:s0 0 2020-12-28 05:23 function0 -> ../../../../usb_gadget/g1/functions/ffs.adb
  drwxr-xr-x 2 system system u:object_r:configfs:s0 0 2020-12-28 05:23 strings

  phone:/config/usb_gadget/g1/configs $ ln -s ../../../../usb_gadget/g1/functions/ffs.adb b.2/function0
  ln: cannot create symbolic link from '../../../../usb_gadget/g1/functions/ffs.adb' to 'b.2/function0': Permission denied

Cc: Greg Kroah-Hartman 
Cc: Joel Becker 
Cc: Christoph Hellwig 
Signed-off-by: Maciej Żenczykowski 
Change-Id: Ia907b2def940197b44aa87b337d37c5dde9c5b91
---
 fs/configfs/dir.c | 15 +++
 1 file changed, 15 insertions(+)

diff --git a/fs/configfs/dir.c b/fs/configfs/dir.c
index b839dd1b459f..04f18402ef7c 100644
--- a/fs/configfs/dir.c
+++ b/fs/configfs/dir.c
@@ -1410,6 +1410,21 @@ static int configfs_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode
else
ret = configfs_attach_item(parent_item, item, dentry, frag);
 
+   /* inherit uid/gid from process creating the directory */
+   if (!uid_eq(current_fsuid(), GLOBAL_ROOT_UID) ||
+   !gid_eq(current_fsgid(), GLOBAL_ROOT_GID)) {
+   struct inode * inode = d_inode(dentry);
+   struct iattr ia = {
+   .ia_uid = current_fsuid(),
+   .ia_gid = current_fsgid(),
+   .ia_valid = ATTR_UID | ATTR_GID,
+   };
+   inode->i_uid = ia.ia_uid;
+   inode->i_gid = ia.ia_gid;
+   /* (note: the above manual assignments skip the permission checks) */
+   (void)configfs_setattr(dentry, &ia);
+   }
+
spin_lock(&configfs_dirent_lock);
sd->s_type &= ~CONFIGFS_USET_IN_MKDIR;
if (!ret)
-- 
2.30.0.280.ga3ce27912f-goog



Re: [PATCH] xfrm:fragmented ipv4 tunnel packets in inner interface

2020-11-09 Thread Maciej Żenczykowski
On Mon, Nov 9, 2020 at 1:58 AM Steffen Klassert
 wrote:
>
> On Thu, Nov 05, 2020 at 01:52:01PM +0900, Lorenzo Colitti wrote:
> > On Tue, Sep 15, 2020 at 4:30 PM Steffen Klassert
> >  wrote:
> > > > In esp's tunnel mode, if the inner interface is ipv4 and the outer is ipv4,
> > > > one big packet which travels through the tunnel will be fragmented to the
> > > > outer interface's mtu; the peer server will remove the tunnelled esp header
> > > > and reassemble them into a big packet. After forwarding such a packet to the
> > > > next endpoint, it will be dropped because it exceeds the mtu, or an
> > > > ICMP (packet-too-big) will be returned.
> > >
> > > What is the exact case where packets are dropped? Given that the packet
> > > was fragmented (and reassembled), I'd assume the DF bit was not set. So
> > > every router along the path is allowed to fragment again if needed.
> >
> > In general, isn't it just suboptimal to rely on fragmentation if the
> > sender already knows the packet is too big? That's why we have things
> > like path MTU discovery (RFC 1191).
>
> When we setup packets that are sent from a local socket, we take
> MTU/PMTU info we have into account. So we don't create fragments in
> that case.
>
> When forwarding packets it is different. The router that can not
> TX the packet because it exceeds the MTU of the sending interface
> is responsible to either fragment (if DF is not set), or send a
> PMTU notification (if DF is set). So if we are able to transmit
> the packet, we do it.
>
> > Fragmentation is generally
> > expensive, increases the chance of packet loss, and has historically
> > caused lots of security vulnerabilities. Also, in real world networks,
> > fragments sometimes just don't work, either because intermediate
> > routers don't fragment, or because firewalls drop the fragments due to
> > security reasons.
> >
> > While it's possible in theory to ask these operators to configure
> > their routers to fragment packets, that may not result in the network
> > being fixed, due to hardware constraints, security policy or other
> > reasons.
>
> We can not really do anything here. If a flow has no DF bit set
> on the packets, we can not rely on PMTU information. If we have PMTU
> info on the route, then we have it because some other flow (that has
> DF bit set on the packets) triggered PMTU discovery. That means that
> the PMTU information is reset when this flow (with DF set) stops
> sending packets. So the other flow (with DF not set) will send
> big packets again.

PMTU is by default ignored by forwarding - because it's spoofable.

That said I wonder if my recent changes to honour route mtu (for ipv4)
haven't fixed this particular issue in the presence of correctly
configured device/route mtus...

I don't understand if the problem here is locally generated packets,
or forwarded packets.

It does seem like there is (or was) a bug somewhere... but it might
already be fixed (see above) or might be caused by a misconfiguration
of device mtu or routing rules.

I don't really understand the example.

>
> > Those operators may also be in a position to place
> > requirements on devices that have to use their network. If the Linux
> > stack does not work as is on these networks, then those devices will
> > have to meet those requirements by making out-of-tree changes. It
> > would be good to avoid that if there's a better solution (e.g., make
> > this configurable via sysctl).
>
> We should not try to workaround broken configurations, there are just
> too many possibilities to configure a broken network.


Re: [PATCH] xfrm:fragmented ipv4 tunnel packets in inner interface

2020-11-04 Thread Maciej Żenczykowski
On Wed, Nov 4, 2020 at 7:53 PM lina.wang  wrote:
>
> Besides, several operators do not fragment packets even when the DF bit is
> not set, and instead just drop packets which they think are too big; maybe
> their consideration is that fragmentation is expensive both for the router
> that fragments and for the host that reassembles.
>
> BTW, ipv6 has a different behaviour in such a scenario, which is what we
> expected; ipv4 should follow the same approach. Otherwise packets are
> dropped, which is a serious thing, and there are no hints. What we do is
> just fragment into smaller packets -- or is it possible to give us some
> flag or a sysctl to allow us to change the behaviour?
>
> On Thu, 2020-09-17 at 19:19 +0800, lina.wang wrote:
> > But it is not just one operator's broken or misconfigured router. Across
> > the whole network, it is common to find routers that drop bigger packets
> > silently. I think we should make the code more tolerant of that. The
> > scenario is wifi calling, which mostly uses udp, and you know udp does
> > not retransmit. It is serious when a packet is dropped.
> >
> > On Thu, 2020-09-17 at 09:46 +0200, Steffen Klassert wrote:
> > > On Tue, Sep 15, 2020 at 08:17:40PM +0800, lina.wang wrote:
> > > > We did not get the router's log, since the router is some operator's.
> > > > Actually, after we applied our patch, there is no such issue. Sometimes
> > > > the router will return packet-too-big; most of the time nothing is
> > > > returned and the packet is dropped silently.
> > >
> > > This looks like a broken router, we can't fix that in the code.
> > >
> > > > On Tue, 2020-09-15 at 11:32 +0200, Steffen Klassert wrote:
> > > > > On Tue, Sep 15, 2020 at 05:05:22PM +0800, lina.wang wrote:
> > > > > >
> > > > > > Yes, DF bit is not set.
> > > > >
> > > > > ...
> > > > >
> > > > > > Those two packets are fragments of one big udp packet, whose payload
> > > > > > is 1516 bytes. After getting rid of the tunnel header, it is also a
> > > > > > udp packet, with a payload of 1466 bytes. We did not get any response
> > > > > > for this packet; we guess it is dropped by the router, because if we
> > > > > > reduce the length, we do get a response.
> > > > >
> > > > > If the DF bit is not set, the router should fragment the packet. If it
> > > > > does not do so, it is misconfigured. Do you have access to that 
> > > > > router?

I'm afraid I don't really understand the exact scenario from the
description of the patch.

However... a few questions come to mind.
(a) why not set the DF flag -- does the protocol require larger-than-mtu udp frames?
(b) what is the mtu of the inner tunnel device, and what is the mtu on
the route through the device?
- shouldn't we be fragmenting udp to the route/device mtu?
maybe stuff is simply misconfigured...


[PATCH bpf-next] net-veth: add type safety to veth_xdp_to_ptr() and veth_ptr_to_xdp()

2020-08-18 Thread Maciej Żenczykowski
From: Maciej Żenczykowski 

This reduces likelihood of incorrect use.

Test: builds
Signed-off-by: Maciej Żenczykowski 
---
 drivers/net/veth.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index e56cd562a664..b80cbffeb88e 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -234,14 +234,14 @@ static bool veth_is_xdp_frame(void *ptr)
return (unsigned long)ptr & VETH_XDP_FLAG;
 }
 
-static void *veth_ptr_to_xdp(void *ptr)
+static struct xdp_frame *veth_ptr_to_xdp(void *ptr)
 {
return (void *)((unsigned long)ptr & ~VETH_XDP_FLAG);
 }
 
-static void *veth_xdp_to_ptr(void *ptr)
+static void *veth_xdp_to_ptr(struct xdp_frame *xdp)
 {
-   return (void *)((unsigned long)ptr | VETH_XDP_FLAG);
+   return (void *)((unsigned long)xdp | VETH_XDP_FLAG);
 }
 
 static void veth_ptr_free(void *ptr)
-- 
2.28.0.220.ged08abb693-goog



[PATCH bpf-next 1/2] net-tun: add type safety to tun_xdp_to_ptr() and tun_ptr_to_xdp()

2020-08-18 Thread Maciej Żenczykowski
From: Maciej Żenczykowski 

This reduces likelihood of incorrect use.

Test: builds
Signed-off-by: Maciej Żenczykowski 
---
 drivers/net/tun.c  | 6 +++---
 include/linux/if_tun.h | 8 
 2 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 3c11a77f5709..5dd7f353eeef 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -225,13 +225,13 @@ bool tun_is_xdp_frame(void *ptr)
 }
 EXPORT_SYMBOL(tun_is_xdp_frame);
 
-void *tun_xdp_to_ptr(void *ptr)
+void *tun_xdp_to_ptr(struct xdp_frame *xdp)
 {
-   return (void *)((unsigned long)ptr | TUN_XDP_FLAG);
+   return (void *)((unsigned long)xdp | TUN_XDP_FLAG);
 }
 EXPORT_SYMBOL(tun_xdp_to_ptr);
 
-void *tun_ptr_to_xdp(void *ptr)
+struct xdp_frame *tun_ptr_to_xdp(void *ptr)
 {
return (void *)((unsigned long)ptr & ~TUN_XDP_FLAG);
 }
diff --git a/include/linux/if_tun.h b/include/linux/if_tun.h
index 5bda8cf457b6..6c37e1dbc5df 100644
--- a/include/linux/if_tun.h
+++ b/include/linux/if_tun.h
@@ -28,8 +28,8 @@ struct tun_xdp_hdr {
 struct socket *tun_get_socket(struct file *);
 struct ptr_ring *tun_get_tx_ring(struct file *file);
 bool tun_is_xdp_frame(void *ptr);
-void *tun_xdp_to_ptr(void *ptr);
-void *tun_ptr_to_xdp(void *ptr);
+void *tun_xdp_to_ptr(struct xdp_frame *xdp);
+struct xdp_frame *tun_ptr_to_xdp(void *ptr);
 void tun_ptr_free(void *ptr);
 #else
 #include 
@@ -48,11 +48,11 @@ static inline bool tun_is_xdp_frame(void *ptr)
 {
return false;
 }
-static inline void *tun_xdp_to_ptr(void *ptr)
+static inline void *tun_xdp_to_ptr(struct xdp_frame *xdp)
 {
return NULL;
 }
-static inline void *tun_ptr_to_xdp(void *ptr)
+static inline struct xdp_frame *tun_ptr_to_xdp(void *ptr)
 {
return NULL;
 }
-- 
2.28.0.220.ged08abb693-goog



[PATCH bpf-next 2/2] net-tun: eliminate two tun/xdp related function calls from vhost-net

2020-08-18 Thread Maciej Żenczykowski
From: Maciej Żenczykowski 

This provides a minor performance boost by virtue of inlining
instead of cross module function calls.

Test: builds
Signed-off-by: Maciej Żenczykowski 
---
 drivers/net/tun.c  | 18 --
 include/linux/if_tun.h | 15 ---
 2 files changed, 12 insertions(+), 21 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 5dd7f353eeef..efaef83b8897 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -219,24 +219,6 @@ struct veth {
__be16 h_vlan_TCI;
 };
 
-bool tun_is_xdp_frame(void *ptr)
-{
-   return (unsigned long)ptr & TUN_XDP_FLAG;
-}
-EXPORT_SYMBOL(tun_is_xdp_frame);
-
-void *tun_xdp_to_ptr(struct xdp_frame *xdp)
-{
-   return (void *)((unsigned long)xdp | TUN_XDP_FLAG);
-}
-EXPORT_SYMBOL(tun_xdp_to_ptr);
-
-struct xdp_frame *tun_ptr_to_xdp(void *ptr)
-{
-   return (void *)((unsigned long)ptr & ~TUN_XDP_FLAG);
-}
-EXPORT_SYMBOL(tun_ptr_to_xdp);
-
 static int tun_napi_receive(struct napi_struct *napi, int budget)
 {
struct tun_file *tfile = container_of(napi, struct tun_file, napi);
diff --git a/include/linux/if_tun.h b/include/linux/if_tun.h
index 6c37e1dbc5df..2a7660843444 100644
--- a/include/linux/if_tun.h
+++ b/include/linux/if_tun.h
@@ -27,9 +27,18 @@ struct tun_xdp_hdr {
 #if defined(CONFIG_TUN) || defined(CONFIG_TUN_MODULE)
 struct socket *tun_get_socket(struct file *);
 struct ptr_ring *tun_get_tx_ring(struct file *file);
-bool tun_is_xdp_frame(void *ptr);
-void *tun_xdp_to_ptr(struct xdp_frame *xdp);
-struct xdp_frame *tun_ptr_to_xdp(void *ptr);
+static inline bool tun_is_xdp_frame(void *ptr)
+{
+   return (unsigned long)ptr & TUN_XDP_FLAG;
+}
+static inline void *tun_xdp_to_ptr(struct xdp_frame *xdp)
+{
+   return (void *)((unsigned long)xdp | TUN_XDP_FLAG);
+}
+static inline struct xdp_frame *tun_ptr_to_xdp(void *ptr)
+{
+   return (void *)((unsigned long)ptr & ~TUN_XDP_FLAG);
+}
 void tun_ptr_free(void *ptr);
 #else
 #include 
-- 
2.28.0.220.ged08abb693-goog



[PATCH bpf] net-tun: add type safety to tun_xdp_to_ptr() and tun_ptr_to_xdp()

2020-08-18 Thread Maciej Żenczykowski
From: Maciej Żenczykowski 

Test: builds
Signed-off-by: Maciej Żenczykowski 
---
 drivers/net/tun.c  | 6 +++---
 include/linux/if_tun.h | 8 
 2 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 3c11a77f5709..5dd7f353eeef 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -225,13 +225,13 @@ bool tun_is_xdp_frame(void *ptr)
 }
 EXPORT_SYMBOL(tun_is_xdp_frame);
 
-void *tun_xdp_to_ptr(void *ptr)
+void *tun_xdp_to_ptr(struct xdp_frame *xdp)
 {
-   return (void *)((unsigned long)ptr | TUN_XDP_FLAG);
+   return (void *)((unsigned long)xdp | TUN_XDP_FLAG);
 }
 EXPORT_SYMBOL(tun_xdp_to_ptr);
 
-void *tun_ptr_to_xdp(void *ptr)
+struct xdp_frame *tun_ptr_to_xdp(void *ptr)
 {
return (void *)((unsigned long)ptr & ~TUN_XDP_FLAG);
 }
diff --git a/include/linux/if_tun.h b/include/linux/if_tun.h
index 5bda8cf457b6..6c37e1dbc5df 100644
--- a/include/linux/if_tun.h
+++ b/include/linux/if_tun.h
@@ -28,8 +28,8 @@ struct tun_xdp_hdr {
 struct socket *tun_get_socket(struct file *);
 struct ptr_ring *tun_get_tx_ring(struct file *file);
 bool tun_is_xdp_frame(void *ptr);
-void *tun_xdp_to_ptr(void *ptr);
-void *tun_ptr_to_xdp(void *ptr);
+void *tun_xdp_to_ptr(struct xdp_frame *xdp);
+struct xdp_frame *tun_ptr_to_xdp(void *ptr);
 void tun_ptr_free(void *ptr);
 #else
 #include 
@@ -48,11 +48,11 @@ static inline bool tun_is_xdp_frame(void *ptr)
 {
return false;
 }
-static inline void *tun_xdp_to_ptr(void *ptr)
+static inline void *tun_xdp_to_ptr(struct xdp_frame *xdp)
 {
return NULL;
 }
-static inline void *tun_ptr_to_xdp(void *ptr)
+static inline struct xdp_frame *tun_ptr_to_xdp(void *ptr)
 {
return NULL;
 }
-- 
2.28.0.220.ged08abb693-goog



[PATCH] uml - fix incorrect assumptions about max pid length

2020-08-07 Thread Maciej Żenczykowski
From: Maciej Żenczykowski 

Fixes: is_umdir_used - pid filename too long

pids are no longer limited to 16-bits, bump to 32-bits,
ie. 9 decimal characters.  Additionally sizeof("/") already
returns 2 - ie. it already accounts for trailing zero.

Cc: Jeff Dike 
Cc: Richard Weinberger 
Cc: Anton Ivanov 
Cc: Linux UM Mailing List 
Signed-off-by: Maciej Żenczykowski 
---
 arch/um/os-Linux/umid.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/um/os-Linux/umid.c b/arch/um/os-Linux/umid.c
index 9e16078a4bf8..1d7558dac75f 100644
--- a/arch/um/os-Linux/umid.c
+++ b/arch/um/os-Linux/umid.c
@@ -97,7 +97,7 @@ static int remove_files_and_dir(char *dir)
while ((ent = readdir(directory)) != NULL) {
if (!strcmp(ent->d_name, ".") || !strcmp(ent->d_name, ".."))
continue;
-   len = strlen(dir) + sizeof("/") + strlen(ent->d_name) + 1;
+   len = strlen(dir) + strlen("/") + strlen(ent->d_name) + 1;
if (len > sizeof(file)) {
ret = -E2BIG;
goto out;
@@ -135,7 +135,7 @@ static int remove_files_and_dir(char *dir)
  */
 static inline int is_umdir_used(char *dir)
 {
-   char pid[sizeof("n\0")], *end, *file;
+   char pid[sizeof("n")], *end, *file;
int dead, fd, p, n, err;
size_t filelen;
 
@@ -217,10 +217,10 @@ static int umdir_take_if_dead(char *dir)
 
 static void __init create_pid_file(void)
 {
-   char pid[sizeof("n\0")], *file;
+   char pid[sizeof("n")], *file;
int fd, n;
 
-   n = strlen(uml_dir) + UMID_LEN + sizeof("/pid\0");
+   n = strlen(uml_dir) + UMID_LEN + sizeof("/pid");
file = malloc(n);
if (!file)
return;
-- 
2.28.0.236.gb10cc79966-goog



[PATCH bpf v2] restore behaviour of CAP_SYS_ADMIN allowing the loading of networking bpf programs

2020-06-20 Thread Maciej Żenczykowski
From: Maciej Żenczykowski 

This is a fix for a regression introduced in 5.8-rc1 by:
  commit 2c78ee898d8f10ae6fb2fa23a3fbaec96b1b7366
  'bpf: Implement CAP_BPF'

Before the above commit it was possible to load network bpf programs
with just the CAP_SYS_ADMIN privilege.

The Android bpfloader happens to run in such a configuration (it has
SYS_ADMIN but not NET_ADMIN) and creates maps and loads bpf programs
for later use by Android's netd (which has NET_ADMIN but not SYS_ADMIN).

Cc: Alexei Starovoitov 
Cc: Daniel Borkmann 
Reported-by: John Stultz 
Fixes: 2c78ee898d8f ("bpf: Implement CAP_BPF")
Signed-off-by: Maciej Żenczykowski 
---
 kernel/bpf/syscall.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 8da159936bab..7d946435587d 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -2121,7 +2121,7 @@ static int bpf_prog_load(union bpf_attr *attr, union bpf_attr __user *uattr)
!bpf_capable())
return -EPERM;
 
-   if (is_net_admin_prog_type(type) && !capable(CAP_NET_ADMIN))
+   if (is_net_admin_prog_type(type) && !capable(CAP_NET_ADMIN) && !capable(CAP_SYS_ADMIN))
return -EPERM;
if (is_perfmon_prog_type(type) && !perfmon_capable())
return -EPERM;
-- 
2.27.0.111.gc72c7da667-goog



[PATCH] restore behaviour of CAP_SYS_ADMIN allowing the loading of net bpf program

2020-06-18 Thread Maciej Żenczykowski
From: Maciej Żenczykowski 

This is a 5.8-rc1 regression.

Cc: Alexei Starovoitov 
Cc: Daniel Borkmann 
Fixes: 2c78ee898d8f ("bpf: Implement CAP_BPF")
Signed-off-by: Maciej Żenczykowski 
---
 kernel/bpf/syscall.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 8da159936bab..7d946435587d 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -2121,7 +2121,7 @@ static int bpf_prog_load(union bpf_attr *attr, union bpf_attr __user *uattr)
!bpf_capable())
return -EPERM;
 
-   if (is_net_admin_prog_type(type) && !capable(CAP_NET_ADMIN))
+   if (is_net_admin_prog_type(type) && !capable(CAP_NET_ADMIN) && !capable(CAP_SYS_ADMIN))
return -EPERM;
if (is_perfmon_prog_type(type) && !perfmon_capable())
return -EPERM;
-- 
2.27.0.290.gba653c62da-goog



Re: [PATCH v2] [RFC] security: allow using Clang's zero initialization for stack variables

2020-06-16 Thread Maciej Żenczykowski
> We could make INIT_STACK_ALL_ZERO fall back to INIT_STACK_ALL_PATTERN
> if the compiler flag goes away - does this make sense?

No, I'm pretty sure failing to build, or at least not setting anything,
is better.
AFAIK pattern actually introduces new bugs that aren't visible at all
when neither of these flags is set
(because in practice the default no-flag behaviour seems to zero some
stuff [probably padding] that it doesn't zero with pattern).
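
For illustration only (a made-up userspace example, not code from the
kernel or from that thread): a latent bug like this typically stays
hidden when the stack happens to be zero, keeps returning a benign 0
with =zero, and becomes visibly wrong with =pattern:

    #include <stdio.h>

    static int setup(int ok)
    {
            int err;            /* never initialized */

            if (!ok)
                    err = -1;   /* only the failure path assigns it */
            /* the success path forgot: err = 0; */
            return err;
    }

    int main(void)
    {
            /* no flag: often prints 0 by luck; =zero: always 0;
             * =pattern: prints a poison value, surfacing the bug */
            printf("%d\n", setup(1));
            return 0;
    }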


Re: [PATCH v2] [RFC] security: allow using Clang's zero initialization for stack variables

2020-06-16 Thread Maciej Żenczykowski
> Nit, your From: line of your email does not match your signed-off-by:
> line :(

This is *probably* a matter of configuring git:
git config --global user.name "Alexander Potapenko"
git config --global user.email "gli...@google.com"
git config --global sendemail.from "Alexander Potapenko "

> Gotta love the name...

I've just read through a long discussion thread on clang dev (got
there from this cl's llvm link)...
There's a lot of interesting info in there.  But ultimately it seems
to point out that tons of projects already use this or want to use it.
And there are security and stability benefits for production builds,
while pattern mode can be used for debug builds.

> Anyway, if this is enabled, and clang changes the flag or drops it, does
> the build suddenly break?

(my understanding of the patch is that the config option only becomes
available after a compiler test...
in at least some of our build systems built around kbuild, the flag
going away would then cause a build error because the option would be
missing from the final .config)

> And does gcc have something like this as well, or does that have to come
> in a compiler plugin?


Re: [PATCH v2] [RFC] security: allow using Clang's zero initialization for stack variables

2020-06-16 Thread Maciej Żenczykowski
> In addition to -ftrivial-auto-var-init=pattern (used by
> CONFIG_INIT_STACK_ALL now) Clang also supports zero initialization for
> locals enabled by -ftrivial-auto-var-init=zero.
> The future of this flag is still being debated, see
> https://bugs.llvm.org/show_bug.cgi?id=45497
> Right now it is guarded by another flag,
> -enable-trivial-auto-var-init-zero-knowing-it-will-be-removed-from-clang,
> which means it may not be supported by future Clang releases.
> Another possible resolution is that -ftrivial-auto-var-init=zero will
> persist (as certain users have already started depending on it), but the
> name of the guard flag will change.
>
> In the meantime, zero initialization has proven itself as a good
> production mitigation measure against uninitialized locals. Unlike
> pattern initialization, which has a higher chance of triggering existing
> bugs, zero initialization provides safe defaults for strings, pointers,
> indexes, and sizes. On the other hand, pattern initialization remains
> safer for return values.
> Performance-wise, the difference between pattern and zero initialization
> is usually negligible, although the generated code for zero
> initialization is more compact.
>
> This patch renames CONFIG_INIT_STACK_ALL to
> CONFIG_INIT_STACK_ALL_PATTERN and introduces another config option,
> CONFIG_INIT_STACK_ALL_ZERO, that enables zero initialization for locals
> if the corresponding flags are supported by Clang.

I'm a great fan of zero initialization as opposed to pattern.
I don't understand why clang is refusing to make this a supported option.

Anyway:

Reviewed-by: Maciej Żenczykowski 


Re: [PATCH v3] net: bpf: permit redirect from ingress L3 to egress L2 devices at near max mtu

2020-05-07 Thread Maciej Żenczykowski
(a) not clear why the max is SKB_MAX_ALLOC in the first place (this is
PAGE_SIZE << 2, ie. 16K on x86), while lo mtu is 64k

(b) hmm, if we're not redirecting, then exceeding the ingress device's
mtu doesn't seem to be a problem.

Indeed AFAIK this can already happen, some devices will round mtu up
when they configure the device mru buffers.
(ie. you configure L3 mtu 1500, they treat that as L2 1536 or 1532 [-4
fcs], simply because 3 * 512 is a nice amount of buffers, or they'll
accept not only 1514 L2, but also 1518 L2 or even 1522 L2 for VLAN and
Q-IN-Q -- even if the packets aren't actually VLAN'ed, so your non
VLAN mru might be 1504 or 1508)

Indeed my corp dell workstation with some standard 1 gigabit
motherboard nic has a standard default mtu of 1500, and I've seen it
receive L3 mtu 1520 packets (apparently due to misconfiguration in our
hardware [cisco? juniper?] ipv4->ipv6 translator which can take 1500
mtu ipv4 packets and convert them to 1520 mtu ipv6 packets without
fragmenting or generating icmp too big errors).  While it's obviously
wrong, it does just work (the network paths themselves are also
obviously 1520 clean).

(c) If we are redirecting we'll eventually (after bpf program returns)
hit dev_queue_xmit(), and shouldn't that be what returns an error?

btw. is_skb_forwardable() actually tests
- device is up && (packet is gso || skb->len < dev->mtu +
  dev->hard_header_len + VLAN_HLEN);

which is also wrong, and in 2 ways: VLAN_HLEN makes no sense on
non-ethernet, and the __bpf_skb_max_len() function doesn't account for
VLAN...  (which possibly has implications if you try to redirect to a
vlan interface)
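
Spelled out, the check I mean is roughly the following (from memory,
may not match the exact kernel version):

    static inline bool is_skb_forwardable(const struct net_device *dev,
                                          const struct sk_buff *skb)
    {
            unsigned int len;

            if (!(dev->flags & IFF_UP))
                    return false;

            /* one VLAN tag worth of slack, ethernet or not */
            len = dev->mtu + dev->hard_header_len + VLAN_HLEN;
            if (skb->len <= len)
                    return true;

            /* gso packets get a pass, they are segmented later anyway */
            if (skb_is_gso(skb))
                    return true;

            return false;
    }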

---

I think having an mtu check is useful, but I think the mtu should be
selectable by the bpf program.  Because it might not even be device
mtu at all, it might be path mtu which we should be testing against.
It should also be checked for gso frames, since the max post
segmentation size should be enforced.

---

I agree we should expose dev->mtu (and dev->hard_header_len and hatype)

I'll mull this over a bit more, but I'm not convinced this patch isn't ok as is.
There's just probably more we should do.


[PATCH v3] net: bpf: permit redirect from ingress L3 to egress L2 devices at near max mtu

2020-05-06 Thread Maciej Żenczykowski
From: Maciej Żenczykowski 

__bpf_skb_max_len(skb) is used from:
  bpf_skb_adjust_room
  __bpf_skb_change_tail
  __bpf_skb_change_head

but in the case of forwarding we're likely calling these functions
during receive processing on ingress and bpf_redirect()'ing at
a later point in time to egress on another interface, thus these
mtu checks are for the wrong device (input instead of output).

This is particularly problematic if we're receiving on an L3 1500 mtu
cellular interface, trying to add an L2 header and forwarding to
an L3 mtu 1500 wifi/ethernet device (which is thus L2 1514).

The mtu check prevents us from adding the 14 byte ethernet header prior
to forwarding the packet.

After the packet has already been redirected, we'd need to add
an additional 2nd ebpf program on the target device's egress tc hook,
but then we'd also see non-redirected traffic and have no easy
way to tell apart normal egress with ethernet header packets
from forwarded ethernet headerless packets.

Cc: Alexei Starovoitov 
Cc: Jakub Kicinski 
Signed-off-by: Maciej Żenczykowski 
---
 net/core/filter.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index 7d6ceaa54d21..5c8243930462 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3159,8 +3159,9 @@ static int bpf_skb_net_shrink(struct sk_buff *skb, u32 off, u32 len_diff,
 
 static u32 __bpf_skb_max_len(const struct sk_buff *skb)
 {
-   return skb->dev ? skb->dev->mtu + skb->dev->hard_header_len :
- SKB_MAX_ALLOC;
+   if (skb_at_tc_ingress(skb) || !skb->dev)
+   return SKB_MAX_ALLOC;
+   return skb->dev->mtu + skb->dev->hard_header_len;
 }
 
 BPF_CALL_4(bpf_skb_adjust_room, struct sk_buff *, skb, s32, len_diff,
-- 
2.26.2.526.g744177e7f7-goog



Re: [PATCH v2] net: bpf: permit redirect from L3 to L2 devices at near max mtu

2020-05-06 Thread Maciej Żenczykowski
> > I thought we have established that checking device MTU (m*T*u)
> > at ingress makes a very limited amount of sense, no?
> >
> > Shooting from the hip here, but won't something like:
> >
> > if (!skb->dev || skb->tc_at_ingress)
> > return SKB_MAX_ALLOC;
> > return skb->dev->mtu + skb->dev->hard_header_len;
> >
> > Solve your problem?
>
> I believe that probably does indeed solve the ingress case of tc
> ingress hook on cellular redirecting to wifi.
>
> However, there's 2 possible uplinks - cellular (rawip, L3), and wifi
> (ethernet, L2).
> Thus, there's actually 4 things I'm trying to support:
>
> - ipv6 ingress on cellular uplink (L3/rawip), translate to ipv4,
> forward to wifi/ethernet <- need to add ethernet header
>
> - ipv6 ingress on wifi uplink (L2/ether), translate to ipv4, forward
> to wifi/ethernet <- trivial, no packet size change
>
> - ipv4 egressing through tun (L3), translate to ipv6, forward to
> cellular uplink <- trivial, no packet size change
>
> - ipv4 egressing through tun (L3), translate to ipv6, forward to wifi
> uplink <- need to add ethernet header [*]
>
> I think your approach doesn't solve the reverse path (* up above):
>
> ie. ipv4 packets hitting a tun device (owned by a clat daemon doing
> ipv4<->ipv6 translation in userspace), being stolen by a tc egress
> ebpf hook, mutated to ipv6 by ebpf and bpf_redirect'ed to egress
> through a wifi ipv6-only uplink.
>
> Though arguably in this case I could probably simply increase the tun
> device mtu by another 14, while keeping ipv4 route mtus low...
> (tun mtu already has to be 28 bytes lower than wifi mtu to allow
> replacement of ipv4 with ipv6 header (20 bytes extra), with possibly
> an ipv6 frag header (8 more bytes))
>
> Any further thoughts?

Thinking about this some more, that seems to solve the immediate need
(case 1 above),
and I can work around case 4 with tun mtu bumps.

And maybe the real correct fix would be to simply pass in the desired path mtu
to these 3 functions via 16-bits of the flags argument.
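
Something along these lines, say (purely a sketch of that idea, not an
existing kernel API; the flag names are made up):

    /* hypothetical: program-supplied path mtu in the upper 16 bits of flags */
    #define BPF_F_ADJ_ROOM_MTU_SHIFT    16
    #define BPF_F_ADJ_ROOM_MTU_MASK     (0xffffULL << BPF_F_ADJ_ROOM_MTU_SHIFT)

    static u32 __bpf_skb_max_len(const struct sk_buff *skb, u64 flags)
    {
            u32 mtu = (flags & BPF_F_ADJ_ROOM_MTU_MASK) >> BPF_F_ADJ_ROOM_MTU_SHIFT;

            if (mtu)                    /* the program knows the path mtu */
                    return mtu;
            if (skb_at_tc_ingress(skb) || !skb->dev)
                    return SKB_MAX_ALLOC;
            return skb->dev->mtu + skb->dev->hard_header_len;
    }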


Re: [PATCH v2] net: bpf: permit redirect from L3 to L2 devices at near max mtu

2020-05-06 Thread Maciej Żenczykowski
> I thought we have established that checking device MTU (m*T*u)
> at ingress makes a very limited amount of sense, no?
>
> Shooting from the hip here, but won't something like:
>
> if (!skb->dev || skb->tc_at_ingress)
> return SKB_MAX_ALLOC;
> return skb->dev->mtu + skb->dev->hard_header_len;
>
> Solve your problem?

I believe that probably does indeed solve the ingress case of tc
ingress hook on cellular redirecting to wifi.

However, there's 2 possible uplinks - cellular (rawip, L3), and wifi
(ethernet, L2).
Thus, there's actually 4 things I'm trying to support:

- ipv6 ingress on cellular uplink (L3/rawip), translate to ipv4,
forward to wifi/ethernet <- need to add ethernet header

- ipv6 ingress on wifi uplink (L2/ether), translate to ipv4, forward
to wifi/ethernet <- trivial, no packet size change

- ipv4 egressing through tun (L3), translate to ipv6, forward to
cellular uplink <- trivial, no packet size change

- ipv4 egressing through tun (L3), translate to ipv6, forward to wifi
uplink <- need to add ethernet header [*]

I think your approach doesn't solve the reverse path (* up above):

ie. ipv4 packets hitting a tun device (owned by a clat daemon doing
ipv4<->ipv6 translation in userspace), being stolen by a tc egress
ebpf hook, mutated to ipv6 by ebpf and bpf_redirect'ed to egress
through a wifi ipv6-only uplink.

Though arguably in this case I could probably simply increase the tun
device mtu by another 14, while keeping ipv4 route mtus low...
(tun mtu already has to be 28 bytes lower than wifi mtu to allow
replacement of ipv4 with ipv6 header (20 bytes extra), with possibly
an ipv6 frag header (8 more bytes))

Any further thoughts?


[PATCH v2] net: bpf: permit redirect from L3 to L2 devices at near max mtu

2020-05-06 Thread Maciej Żenczykowski
From: Maciej Żenczykowski 

__bpf_skb_max_len(skb) is used from:
  bpf_skb_adjust_room
  __bpf_skb_change_tail
  __bpf_skb_change_head

but in the case of forwarding we're likely calling these functions
during receive processing on ingress and bpf_redirect()'ing at
a later point in time to egress on another interface, thus these
mtu checks are for the wrong device (input instead of output).

This is particularly problematic if we're receiving on an L3 1500 mtu
cellular interface, trying to add an L2 header and forwarding to
an L3 mtu 1500 wifi/ethernet device (which is thus L2 1514).

The mtu check prevents us from adding the 14 byte ethernet header prior
to forwarding the packet.

After the packet has already been redirected, we'd need to add
an additional 2nd ebpf program on the target device's egress tc hook,
but then we'd also see non-redirected traffic and have no easy
way to tell apart normal egress with ethernet header packets
from forwarded ethernet headerless packets.

Credits to Alexei Starovoitov for the suggestion on how to 'fix' this.

This should probably be further fixed up *somehow*, it's just
not at all clear how.  This does at least make things work.

Cc: Alexei Starovoitov 
Signed-off-by: Maciej Żenczykowski 
---
 net/core/filter.c | 16 ++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index 7d6ceaa54d21..811aba77e24b 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3159,8 +3159,20 @@ static int bpf_skb_net_shrink(struct sk_buff *skb, u32 off, u32 len_diff,
 
 static u32 __bpf_skb_max_len(const struct sk_buff *skb)
 {
-   return skb->dev ? skb->dev->mtu + skb->dev->hard_header_len :
- SKB_MAX_ALLOC;
+   if (skb->dev) {
+   unsigned short header_len = skb->dev->hard_header_len;
+
+   /* HACK: Treat 0 as ETH_HLEN to allow redirect while
+* adding ethernet header from an L3 (tun, rawip, cellular)
+* to an L2 device (tap, ethernet, wifi)
+*/
+   if (!header_len)
+   header_len = ETH_HLEN;
+
+   return skb->dev->mtu + header_len;
+   } else {
+   return SKB_MAX_ALLOC;
+   }
 }
 
 BPF_CALL_4(bpf_skb_adjust_room, struct sk_buff *, skb, s32, len_diff,
-- 
2.26.2.526.g744177e7f7-goog



[PATCH v2] uml: fix a boot splat wrt use of cpu_all_mask

2019-04-10 Thread Maciej Żenczykowski
From: Maciej Żenczykowski 

Memory: 509108K/542612K available (3835K kernel code, 919K rwdata, 1028K rodata, 129K init, 211K bss, 33504K reserved, 0K cma-reserved)
NR_IRQS: 15
clocksource: timer: mask: 0x max_cycles: 0x1cd42e205, max_idle_ns: 881590404426 ns
[ cut here ]
WARNING: CPU: 0 PID: 0 at kernel/time/clockevents.c:458 clockevents_register_device+0x72/0x140
posix-timer cpumask == cpu_all_mask, using cpu_possible_mask instead
Modules linked in:
CPU: 0 PID: 0 Comm: swapper Not tainted 5.1.0-rc4-00048-ged79cc87302b #4
Stack:
 604ebda0 603c5370 604ebe20 6046fd17
  6006fcbb 604ebdb0 603c53b5
 604ebe10 6003bfc4 604ebdd0 901ca
Call Trace:
 [<6006fcbb>] ? printk+0x0/0x94
 [<60083160>] ? clockevents_register_device+0x72/0x140
 [<6001f16e>] show_stack+0x13b/0x155
 [<603c5370>] ? dump_stack_print_info+0xe2/0xeb
 [<6006fcbb>] ? printk+0x0/0x94
 [<603c53b5>] dump_stack+0x2a/0x2c
 [<6003bfc4>] __warn+0x10e/0x13e
 [<60070320>] ? vprintk_func+0xc8/0xcf
 [<60030fd6>] ? block_signals+0x0/0x16
 [<6006fcbb>] ? printk+0x0/0x94
 [<6003c08b>] warn_slowpath_fmt+0x97/0x99
 [<600311a1>] ? set_signals+0x0/0x3f
 [<6003bff4>] ? warn_slowpath_fmt+0x0/0x99
 [<600842cb>] ? tick_oneshot_mode_active+0x44/0x4f
 [<60030fd6>] ? block_signals+0x0/0x16
 [<6006fcbb>] ? printk+0x0/0x94
 [<6007d2d5>] ? __clocksource_select+0x20/0x1b1
 [<60030fd6>] ? block_signals+0x0/0x16
 [<6006fcbb>] ? printk+0x0/0x94
 [<60083160>] clockevents_register_device+0x72/0x140
 [<60031192>] ? get_signals+0x0/0xf
 [<60030fd6>] ? block_signals+0x0/0x16
 [<6006fcbb>] ? printk+0x0/0x94
 [<60002eec>] um_timer_setup+0xc8/0xca
 [<60001b59>] start_kernel+0x47f/0x57e
 [<600035bc>] start_kernel_proc+0x49/0x4d
 [<6006c483>] ? kmsg_dump_register+0x82/0x8a
 [<6001de62>] new_thread_handler+0x81/0xb2
 [<60003571>] ? kmsg_dumper_stdout_init+0x1a/0x1c
 [<60020c75>] uml_finishsetup+0x54/0x59

random: get_random_bytes called from init_oops_id+0x27/0x34 with crng_init=0
---[ end trace 00173d0117a88acb ]---
Calibrating delay loop... 6941.90 BogoMIPS (lpj=34709504)

Signed-off-by: Maciej Żenczykowski 
Cc: Jeff Dike 
Cc: Richard Weinberger 
Cc: Anton Ivanov 
Cc: linux...@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
---
 arch/um/kernel/time.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/um/kernel/time.c b/arch/um/kernel/time.c
index 052de4c8acb2..0c572a48158e 100644
--- a/arch/um/kernel/time.c
+++ b/arch/um/kernel/time.c
@@ -56,7 +56,7 @@ static int itimer_one_shot(struct clock_event_device *evt)
 static struct clock_event_device timer_clockevent = {
.name   = "posix-timer",
.rating = 250,
-   .cpumask= cpu_all_mask,
+   .cpumask= cpu_possible_mask,
.features   = CLOCK_EVT_FEAT_PERIODIC |
  CLOCK_EVT_FEAT_ONESHOT,
.set_state_shutdown = itimer_shutdown,
-- 
2.21.0.392.gf8f6787159e-goog



[PATCH] uml: fix a boot splat wrt use of cpu_all_mask

2019-04-10 Thread Maciej Żenczykowski
From: Maciej Żenczykowski 

Memory: 509108K/542612K available (3835K kernel code, 919K rwdata, 1028K rodata, 129K init, 211K bss, 33504K reserved, 0K cma-reserved)
NR_IRQS: 15
clocksource: timer: mask: 0x max_cycles: 0x1cd42e205, max_idle_ns: 881590404426 ns
[ cut here ]
WARNING: CPU: 0 PID: 0 at kernel/time/clockevents.c:458 clockevents_register_device+0x72/0x140
posix-timer cpumask == cpu_all_mask, using cpu_possible_mask instead
Modules linked in:
CPU: 0 PID: 0 Comm: swapper Not tainted 5.1.0-rc4-00048-ged79cc87302b #4
Stack:
 604ebda0 603c5370 604ebe20 6046fd17
  6006fcbb 604ebdb0 603c53b5
 604ebe10 6003bfc4 604ebdd0 901ca
Call Trace:
 [<6006fcbb>] ? printk+0x0/0x94
 [<60083160>] ? clockevents_register_device+0x72/0x140
 [<6001f16e>] show_stack+0x13b/0x155
 [<603c5370>] ? dump_stack_print_info+0xe2/0xeb
 [<6006fcbb>] ? printk+0x0/0x94
 [<603c53b5>] dump_stack+0x2a/0x2c
 [<6003bfc4>] __warn+0x10e/0x13e
 [<60070320>] ? vprintk_func+0xc8/0xcf
 [<60030fd6>] ? block_signals+0x0/0x16
 [<6006fcbb>] ? printk+0x0/0x94
 [<6003c08b>] warn_slowpath_fmt+0x97/0x99
 [<600311a1>] ? set_signals+0x0/0x3f
 [<6003bff4>] ? warn_slowpath_fmt+0x0/0x99
 [<600842cb>] ? tick_oneshot_mode_active+0x44/0x4f
 [<60030fd6>] ? block_signals+0x0/0x16
 [<6006fcbb>] ? printk+0x0/0x94
 [<6007d2d5>] ? __clocksource_select+0x20/0x1b1
 [<60030fd6>] ? block_signals+0x0/0x16
 [<6006fcbb>] ? printk+0x0/0x94
 [<60083160>] clockevents_register_device+0x72/0x140
 [<60031192>] ? get_signals+0x0/0xf
 [<60030fd6>] ? block_signals+0x0/0x16
 [<6006fcbb>] ? printk+0x0/0x94
 [<60002eec>] um_timer_setup+0xc8/0xca
 [<60001b59>] start_kernel+0x47f/0x57e
 [<600035bc>] start_kernel_proc+0x49/0x4d
 [<6006c483>] ? kmsg_dump_register+0x82/0x8a
 [<6001de62>] new_thread_handler+0x81/0xb2
 [<60003571>] ? kmsg_dumper_stdout_init+0x1a/0x1c
 [<60020c75>] uml_finishsetup+0x54/0x59

random: get_random_bytes called from init_oops_id+0x27/0x34 with crng_init=0
---[ end trace 00173d0117a88acb ]---
Calibrating delay loop... 6941.90 BogoMIPS (lpj=34709504)

Signed-off-by: Maciej Żenczykowski 
Cc: Jeff Dike 
Cc: Richard Weinberger 
Cc: Anton Ivanov 
Cc: linux...@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
---
 arch/um/kernel/time.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/um/kernel/time.c b/arch/um/kernel/time.c
index 052de4c8acb2..7c2ea3b57102 100644
--- a/arch/um/kernel/time.c
+++ b/arch/um/kernel/time.c
@@ -56,7 +56,7 @@ static int itimer_one_shot(struct clock_event_device *evt)
 static struct clock_event_device timer_clockevent = {
.name   = "posix-timer",
.rating = 250,
-   .cpumask= cpu_all_mask,
+   .cpumask= cpu_possible_mask, //cpu_all_mask,
.features   = CLOCK_EVT_FEAT_PERIODIC |
  CLOCK_EVT_FEAT_ONESHOT,
.set_state_shutdown = itimer_shutdown,
-- 
2.21.0.392.gf8f6787159e-goog



[PATCH] net: dev_is_mac_header_xmit() true for ARPHRD_RAWIP

2019-01-24 Thread Maciej Żenczykowski
From: Maciej Żenczykowski 

__bpf_redirect() and act_mirred checks this boolean
to determine whether to prefix an ethernet header.

Signed-off-by: Maciej Żenczykowski 
---
 include/linux/if_arp.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/linux/if_arp.h b/include/linux/if_arp.h
index 6756fea18b69..e44746de95cd 100644
--- a/include/linux/if_arp.h
+++ b/include/linux/if_arp.h
@@ -54,6 +54,7 @@ static inline bool dev_is_mac_header_xmit(const struct net_device *dev)
case ARPHRD_IPGRE:
case ARPHRD_VOID:
case ARPHRD_NONE:
+   case ARPHRD_RAWIP:
return false;
default:
return true;
-- 
2.20.1.321.g9e740568ce-goog



Re: [PATCH] userns: honour no_new_privs for cap_bset during user ns creation/switch

2017-12-21 Thread Maciej Żenczykowski
> Good point about CAP_DAC_OVERRIDE on files you own.
>
> I think there is an argument that you are playing dangerous games with
> the permission system there, as it isn't effectively a file you own if
> you can't read it, and you can't change its permissions.

Append-only files are useful - particularly for logging.
It could also simply be a non-readable file on a R/O filesystem.

> Given little things like that I can completely see no_new_privs meaning
> you can't create a user namespace.  That seems consistent with the
> meaning and philosophy of no_new_privs.  So simple it is hard to get
> wrong.

Yes, I could totally buy the argument that no_new_privs should prevent
creating a user ns.

However, there's also setns() and that's a fair bit harder to reason about.
Entirely deny it?  But that actually seems potentially useful...
Allow it but cap it?  That's what this does...

> We could do more clever things like plug this whole in user namespaces,
> and that would not hurt my feelings.

Sure, this particular one wouldn't be all that easy I think... and how
many such holes are there?
I found this particular one *after* your first reply in this thread.

> However unless that is our only
> choice to avoid badly breaking userspace I would have to have to depend
> on user namespaces being perfect for no_new_privs to be a proper jail.

This stuff is ridiculously complex to get right from userspace. :-(

> As a general rule user namespaces are where we tackle the subtle scary
> things that should work, and no_new_privs is where we implement a simple
> hard to get wrong jail.  Most of the time the effect is the same to an
> outside observer (bounded permissions), but there is a real difference
> in difficulty of implementation.

So, where to now...

Would you accept patches that:

- make no_new_priv block user ns creation?

- make no_new_priv block user ns transition?

Or perhaps we can assume that lack of create privs is sufficient, and
if there's a pre-existing user ns for you to enter, then that's
acceptable...
Although this implies you probably always want to combine no_new_privs
with a leaf user ns, or no_new_privs isn't all that useful for root in
root ns...
This added complexity probably means it should be blocked...

- inherits bset across user ns creation/transition based on X?
[this is the one we care about, because there are simply too many bugs
in the kernel wrt. certain caps]
X could be:
- a new flag similar to no_new_priv
- a new securebit flag (w/lockbit)  [provided securebits survive a
userns transition, haven't checked]
- or perhaps a new capability
- something else?

How do we make forward progress?


Re: [PATCH] userns: honour no_new_privs for cap_bset during user ns creation/switch

2017-12-21 Thread Maciej Żenczykowski
On Thu, Dec 21, 2017 at 10:44 PM, Eric W. Biederman
 wrote:
> No.  This makes no logical sense.
>
> A task that enters a user namespace loses all capabilities to everything
> outside of the user namespace.  Capabilities inside a user namespace are
> only valid for objects created inside that user namespace.
>
> So limiting capabilities inside a user namespace when the capability
> bounding set is already fully honored by not giving the processes any of
> those capabilities makes no logical sense.
>
> If the concern is kernel attack surface versus logical permissions we
> can look at ways to reduce the attack surface but that needs to be fully
> discussed in the change log.

Here's an example of using user namespaces to read a file you
shouldn't be able to.

lpk19:~# uname -r
4.15.0-smp-d1ce8ceb8ba8

(we start as true global root)
lpk19:~# id
uid=0(root) gid=0(root) groups=0(root)

(cleanup after previous run)
lpk19:~# cd /; chattr -i /immu; rm -f /immu/log; rmdir /immu

(now we create an append only logfile owned by target user:group)
lpk19:~# cd /; mkdir /immu; touch /immu/log; chown produser:prod
/immu/log; chmod a-rwx,u+w /immu/log; chattr +a /immu/log

(let's show what things look like)
lpk19:~# chattr +i /immu; ls -ld / /immu /immu/log; lsattr -d / /immu /immu/log
drwxr-xr-x 22 root root 4096 Dec 21 16:33 /
drwxr-xr-x 2 root root 4096 Dec 21 16:23 /immu
--w--- 1 produser prod 0 Dec 21 16:23 /immu/log
---I--e /
i-e /immu
-ae /immu/log

(the immutable bit prevents us from changing permissions on the file)
lpk19:/# chmod a+rwx /immu/log
chmod: changing permissions of '/immu/log': Operation not permitted

(the append only bit prevents us from simply overwriting the file)
lpk19:/# echo log1 > /immu/log
-bash: /immu/log: Operation not permitted

(but we can append to it)
lpk19:/# echo log1 >> /immu/log

(we're global root with CAP_DAC_OVERRIDE, so we can *still* read it)
lpk19:/# cat /immu/log
log1

(let's transition to target user)
lpk19:/# su - produser

produser@lpk19:~$ id
uid=2080(produser) gid=620(prod) groups=620(prod)

(we can't overwrite it)
produser@lpk19:~$ echo log2 > /immu/log
-su: /immu/log: Operation not permitted

(but we can log to it: as intended)
produser@lpk19:~$ echo log2 >> /immu/log

(we can't change its permissions, cause it's in an immutable directory)
produser@lpk19:~$ chmod u+r /immu/log
chmod: changing permissions of '/immu/log': Operation not permitted

(we can't dump the file, cause we don't have CAP_DAC_OVERRIDE)
produser@lpk19:~$ cat /immu/log
cat: /immu/log: Permission denied

(or can we?)
produser@lpk19:~$ unshare -U -r cat /immu/log
log1
log2



Now, of course, the above patch doesn't actually fix this on its own,
since 'su' doesn't (yet?) know to restrict bset or to set
no_new_privs.

But: it allows the sandbox equivalent of su to drop CAP_DAC_OVERRIDE
from its inh/eff/perm/ambient/bset, and set no_new_privs.
Now the unshare won't gain CAP_DAC_OVERRIDE and won't be able to cat
the non-readable append-only log file.

IMHO the point of having a capability bounding set and/or no_new_privs
is to never be able to regain capabilities.
Note also that 'no_new_privs' isn't cleared across an
unshare(CLONE_NEWUSER) [presumably also applies to setns()].

We can of course argue the implementation details (for example instead
of using the existing no_new_privs flag, add a new
keep_bset_across_userns_transitions securebits flag)... but
*something* has to be done.
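
For the securebits variant the shape would be something like this
(hypothetical only -- names and bit numbers made up):

    /* hypothetical addition to include/uapi/linux/securebits.h */
    #define SECURE_KEEP_CAP_BSET_ACROSS_USERNS          8
    #define SECURE_KEEP_CAP_BSET_ACROSS_USERNS_LOCKED   9   /* make it one-way */

    /* and then in set_cred_user_ns(): */
    cred->cap_bset = issecure(SECURE_KEEP_CAP_BSET_ACROSS_USERNS)
                           ? current_cred()->cap_bset : CAP_FULL_SET;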

- Maciej


[PATCH] userns: honour no_new_privs for cap_bset during user ns creation/switch

2017-12-21 Thread Maciej Żenczykowski
From: Maciej Żenczykowski <m...@google.com>

This allows locking down user namespaces tighter,
and it could even be considered a security fix.

Signed-off-by: Maciej Żenczykowski <m...@google.com>
---
 kernel/user_namespace.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index 246d4d4ce5c7..2354f7ade78a 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -50,11 +50,12 @@ static void set_cred_user_ns(struct cred *cred, struct user_namespace *user_ns)
 * anything as the capabilities are bound to the new user namespace.
 */
cred->securebits = SECUREBITS_DEFAULT;
+   cred->cap_bset = task_no_new_privs(current) ? current_cred()->cap_bset
+   : CAP_FULL_SET;
cred->cap_inheritable = CAP_EMPTY_SET;
-   cred->cap_permitted = CAP_FULL_SET;
-   cred->cap_effective = CAP_FULL_SET;
+   cred->cap_permitted = cred->cap_bset;
+   cred->cap_effective = cred->cap_bset;
cred->cap_ambient = CAP_EMPTY_SET;
-   cred->cap_bset = CAP_FULL_SET;
 #ifdef CONFIG_KEYS
key_put(cred->request_key_auth);
cred->request_key_auth = NULL;
-- 
2.15.1.620.gb9897f4670-goog



Re: GSO with udp_tunnel_xmit_skb

2015-11-08 Thread Maciej Żenczykowski
> Once it figures out which gso_inner_segment to use, it calls
> __skb_udp_tunnel_segment with it, which then does some curious header
> calculations on various lengths (that I need to read carefully), and
> then proceeds to split the segments using our gso_inner_segment
> function of choice, and then adds the length and checksum header
> fields. Unfortunately, it doesn't add the UDP source and destination
> port header fields. That means I might as well be building the UDP
> headers ahead of time myself, which is a bit of a bummer.

I'm guessing the udp src dst port (and ??possibly?? optional gue
headers) are meant to be part of the external headers that are already
pre-populated.

> Anyway, the idea would be to [ab]use SKB_GSO_UDP_TUNNEL with a
> scintillating gso_inner_segment function for a custom inner_ipproto
> field, in order to make a superpacket.

That's probably basically what that was designed for.  So doesn't seem
like an abuse.

Tunnel GSO offloads are still very very fresh and actively being
worked on (by Tom and Eric among others).
I'm afraid my knowledge of them at HEAD is very limited.
I've only recently started experimenting in this area myself.

> How's this looking as a strategy (and an outline of the "niggly bits"
> as you put it)?

Looks fine.  Devil is in the details.  You may discover that the stack
is still missing some things you'll need to add in.

(for example, personally I'm trying to understand if CHECKSUM_PARTIAL
shouldn't carry an extra bit of information specifying whether
we need a TCP or UDP style checksum, since they differ in how a
checksum of 0 is transmitted, it appears this causes nic drivers
to need to redigest the packet to figure it out before they can pass
it on to the hardware)
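
(The UDP special case in question is the CSUM_MANGLED_0 rule; below is
a rough, purely illustrative kernel-style sketch with a made-up helper
name, not any existing driver's code.)

#include <linux/skbuff.h>
#include <linux/udp.h>
#include <net/checksum.h>

/* A computed UDP checksum of 0 must go on the wire as 0xffff
 * (CSUM_MANGLED_0), because a literal 0 in the UDP header means
 * "no checksum present".  TCP has no such rule, which is exactly the
 * bit of information a device finishing a CHECKSUM_PARTIAL packet
 * currently has to rediscover by parsing the headers. */
static void udp_finish_csum(struct udphdr *uh, __wsum sum)
{
        uh->check = csum_fold(sum);
        if (uh->check == 0)
                uh->check = CSUM_MANGLED_0;
}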

- Maciej


Re: GSO with udp_tunnel_xmit_skb

2015-11-07 Thread Maciej Żenczykowski
> What I was thinking about is this: My driver receives a super-packet.
> By calling skb_gso_segment(), I'm given a list of equal sized packets
> (of gso_size each), except for the last one which is either the same
> size or smaller than the rest. Let's say calling skb_gso_segment()
> gives me a list of 1300 byte packets.

This isn't particularly efficient.  This is basically equivalent to doing
GSO before the superpacket reaches your driver (you might get some
savings by not bothering to look at the packet headers of the second
and subsequent packets, but that's most likely minimal savings).

In particular you're allocating a new skb and clearing it for each of those
1300 byte packets (and deallocating the superpacket skb).  And then you
are presumably deallocating all those freshly allocated skbs - since
I'm guessing
you are creating new skbs for transmit.
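
(For reference, the segment-then-loop pattern under discussion looks
roughly like the sketch below; the helper name is made up and the
transform/transmit step is elided.)

#include <linux/err.h>
#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* Sketch: split the GSO superpacket in software and walk the resulting
 * list.  This is the expensive path described above: one freshly
 * allocated skb per ~1300 byte segment. */
static void xmit_as_segments(struct sk_buff *skb)
{
        struct sk_buff *segs, *seg, *next;

        segs = skb_gso_segment(skb, 0);
        if (IS_ERR_OR_NULL(segs))
                return;                 /* real code would drop/count here */
        consume_skb(skb);               /* the superpacket itself is done */

        for (seg = segs; seg; seg = next) {
                next = seg->next;
                seg->next = NULL;
                /* transform 'seg' and hand it to udp_tunnel_xmit_skb() */
        }
}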

What you really want to do (although of course it's much harder) is to
not call skb_gso_segment() at all for packet formats you know how to
handle (ideally you can handle anything you claim to be able to handle
via the features bits), and instead reach directly into the skb and
grab the right portions of it and handle them directly.  That way you
only ever have the one incoming skb, but yes, it requires considerable
effort.

This should get you a fair bit of savings.

> Next, I do a particular
> transformation to the packet. Let's say I encrypt it somehow, and I
> add on some additional information. Now all those 1300 byte packets
> yield new 1400 byte packets. It is time to send those 1400 byte
> packets to a particular destination.

Are you in control of the receiver?  Can you modify packet format?

> Since they're all children of the
> same skb_gso_segment()ified packet, they're all destined for the same
> destination. So, one solution is to do this:
>
> for each skb in list:
> udp_tunnel_xmit_skb(dst, skb);
>
> But this does not perform how I'd like it to perform. The reason is
> that now each and every one of these packets has to traverse the whole
> networking stack, including various netfilter postrouting hooks and
> such, but most importantly, it means the ethernet driver that's
> sending the physical packet has to process each and every one.

Theoretically you could manually add the proper headers to each
of the new packets, and create a chain and send that, although
honestly I'm not sure if the stack is at all capable of dealing with
that atm.

Alternatively instead of sending through the stack, put on full ethernet
headers and send straight to the nic via the nic's xmit function.

> My hope was that instead of doing the `for each` above, I could
> instead do something like:
>
> superpacket->gso_size = 1400
> for each skb in list:
> add_to_superpacket_as_ufo(skb, superpacket);
> udp_tunnel_xmit_skb(dst, superpacket);

UFO = UDP Fragmentation Offload = really meaning 'UDP transmit
checksum offload + IP fragmentation offload'

so when you send that out you get ip fragments of 1 udp packet, not
many individual udp packets.

> And that way, the superpacket would only have to traverse the
> networking stack once, leaving it either to the final ethernet driver
> to send in a big chunk to the ethernet card, or to the
> skb_gso_segment() call in core.c's validate_xmit_skb().

> Is this conceptually okay? What you wrote would seem to indicate it
> doesn't make sense conceptually, but I'm not sure.

This definitely doesn't make sense with UFO.

---

It is possible some hardware (possibly some intel nics, maybe bnx2x)
could be tricked into doing udp segmentation with their tcp segmentation
engine.  Theoretically (based on having glanced at the datasheets) the
intel nic segmentation is pretty generic, and it would appear at first
glance that with the right driver hacks (populating the transmit descriptor
correctly) it could be made to work.  I mention bnx2x because
they managed to make tcp segmentation work with tunnels,
so it's possible that the support is generic enough for it to be possible (with
driver changes).  Who knows.

It may or may not require putting on a fake 20-byte TCP header.
There's some tunnel spec that basically does that (you should be able
to find an RFC online [perhaps I'm thinking of STT - Stateless
Transport Tunneling]).

I don't think there is currently any way to set up a linux skb with the
right metadata for it to just happen though.

It does seem like something that could be potentially worth adding though.

> So you mean to say UFO is mostly useful for just IP fragmentation?
> Don't some NICs also generate individual UDP packets when you pass it
> a big buffer of multiple pieces of data all at once?

I'm not actually aware of any nics doing that.  It's possible that if
you take an IP/TCP TSO superpacket and stuff an extra IP/UDP header on
it, the existing tunnel offload stuff in the kernel might make that
happen with some nics.  Unsure though (as in unsure whether IP/UDP
tunneling is currently supported; I know IP/GRE is).

Re: GSO with udp_tunnel_xmit_skb

2015-11-07 Thread Maciej Żenczykowski
> I suppose this is about UFO.
>
> Specifically -- let's say I have a list of 500 skbs, which have their
> data in place but don't yet have an IP or UDP header etc. I want to
> send out these out using udp_tunnel_xmit_skb. Right now, if I just
> send them all out, one after another, they don't seem to be getting
> assembled into a super packet suitable for UFO. Instead, they're just
> sent one at a time, and I get the vast majority of `perf top` CPU
> usage in my ethernet card's driver and along the path to it -- the
> problem that UFO is supposed to solve.
>
> So my question is -- how can I make UFO happen with udp_tunnel_xmit_skb?

UFO will never collapse multiple (UDP) packets.

It would be incorrect to do so, since UDP has to maintain packet
framing boundaries, and the only way to mark that on the wire is via
individual appropriately sized packets.

UFO prevents the need to do IP fragmentation on overly large
*singular* UDP packets.

The case where UFO (should) help is if you are taking a TCP TSO
segment of 10k and adding UDP headers and sending it out as a
20+8+10k UDP packet.
Without UFO this would now need to be software (potentially
checksummed and) ip fragmented into (8+10k)/(1500-20) packets
(assuming 1500 mtu), with UFO hw offload the nic deals with that (it
does the checksumming and it does the ip fragmentation).
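
(As a concrete worked example, taking "10k" literally as 10240 bytes of
TCP payload and plain 20 byte IPv4 headers:)

  inner UDP datagram   = 8 + 10240          = 10248 bytes
  payload per fragment = 1500 - 20          = 1480 bytes (a multiple of 8, as required)
  fragments needed     = ceil(10248 / 1480) = 7

so the nic ends up emitting 7 on-the-wire IP fragments for that single
UDP packet; without UFO the host has to build those 7 fragments itself.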

Although note: in the case of UDP+TCP TSO this has reliability issues,
since a loss of a single frame will now lose the entire fragmented IP
UDP datagram and thus lose the entire TCP TSO segment,
meaning that you probably do not want to use this unless your network
is lossless (i.e. loopback, veth and other virtual networks come to
mind).

I guess UDP encap of a larger-than-mtu UDP packet is probably a valid
use case for UFO, since we'd have ip fragmented anyway, and it's
cheaper to ip fragment on the outer IP header than on the inner.


Re: [PATCH net-next v1 0/7] net: extend ethtool link mode bitmaps to 48 bits

2015-01-04 Thread Maciej Żenczykowski
>> I can send updates to other drivers, even though it's rather pointless
>> to update 1G drivers at this point for example. Please let me know,
>> but I'd prefer to do this in follow-up patches outside this first
>> patch series.
> [...]
>
> They should be changed to ensure they reject setting any of the high
> advertising flags, but it's not urgent.

If old drivers advertised a get/set_bits function while new drivers
advertised a get/set_new_bits function, you could avoid updating any
old drivers and simply take care of rejecting invalid bits in core:
call set_new_bits if it is provided; otherwise reject any request that
sets the new high bits, and call set_bits only when none of them are
set.
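
Roughly, the core dispatch could then look like the sketch below
(purely illustrative: the struct, op names and mask are made up to
match the suggestion above, they are not the real ethtool_ops members):

#include <errno.h>
#include <stdint.h>

#define OLD_LINK_MODE_MASK 0xffffffffULL       /* bits the old 32-bit API covers */

struct link_mode_ops {
        int (*set_bits)(uint32_t modes);       /* old-style driver op */
        int (*set_new_bits)(uint64_t modes);   /* new, 48-bit capable op */
};

static int core_set_link_modes(const struct link_mode_ops *ops, uint64_t modes)
{
        if (ops->set_new_bits)
                return ops->set_new_bits(modes);
        if (modes & ~OLD_LINK_MODE_MASK)
                return -EINVAL;                /* old driver: reject the high bits */
        return ops->set_bits((uint32_t)modes);
}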


Re: NULL pointer dereference in ext4_ext_remove_space on 3.5.1

2012-08-16 Thread Maciej Żenczykowski
> Thanks, that's really helpful.   I can say that using a 4MB journal and
> running fsstress is _not_ enough to trigger the problem.
>
> Looking more closely at what might be needed to trigger the bug, 'i'
> gets left uninitialized when err is set to -EAGAIN, and that happens
> when ext4_ext_truncate_extend_restart() is unable to extend the
> journal transaction.  But that also means we need to be deleting a
> sufficiently large enough file that the blocks span multiple block
> groups (which is why we need to extend the transaction, so we can
> modify more bitmap blocks) at the point when there is no more room in
> the journal, so we have to close the current transaction, and then
> retry it again with a new journal handle in a new transaction.
>
> So that implies that untaring a bunch of kernels probably won't be
> sufficient, since the files will be too small.  What we probably will
> need to do is to fill a large file system with lots of large files,
> use a small journal, and then try to do an rm -rf.
>
>  - Ted

My suggestion of untarring kernels was to cause the big multi gigabyte
files created later on to be massively fragmented, and thus have tons
of extents and a relatively deep extent tree.
But maybe that's not needed to trigger this bug, if as you say, it is
caused by the absolute number of disks blocks being freed and not by
the size/depth/complexity of the extent tree.
My knowledge of the internals of ext4 is pretty much non-existent. ;-)
 In this case I'm just an end user.


Re: NULL pointer dereference in ext4_ext_remove_space on 3.5.1

2012-08-16 Thread Maciej Żenczykowski
> Maciej, you weren't able to reliably repro the crash were you?  I'm
> pretty sure this should fix the crash, but it would be really great to
> confirm things.
>
> I suspect creating a file system with a really small journal may make
> it easier to reproduce, but I haven't had time to try create a
> reliable repro for this bug yet.

This happened twice to me while moving data off of a ~1TB ext4 partition.
The data portion was on a stripe raid across 2 ~500GB drives, the
journal was on a relatively large partition (500MB?) on an SSD.
(crypto and lvm were also involved).
I've since emptied the partition and deleted even the raid array.

Both times it happened during rm, first time rm -rf of a directory
tree, second time during rm of a 250GB disk image generated by dd
(from a notebook drive).
Both rm's were manually run by me from a shell command line, and there
was pretty much nothing else happening on the machine at the time.

I'm not aware of there having been anything interesting (like:
holes/punch/sparseness, much r/w activity in the middle of files, etc)
on this filesystem, it was pretty much just a write-once data backup
that I had copied elsewhere and was deleting.  The 250GB disk image
was definitely just a sequentially written disk dump, and I think the
same thing holds true for the contents of the wiped directory tree
(although in many much smaller files).

I know i=1 in both cases (and disassembly pointed out the location
where the above debug patch is BUGing), but I don't think it's
possible to figure out what inode # it crashed on.

Perhaps just untarring a bunch of kernels onto an empty partition,
filling it up, then deleting those kernels should be sufficient to
repro this (untried).

Perhaps something like:
  create 1TB filesystem
  untar a thousand kernel source trees onto it
  create 20GB files of junk until it is full
  rm -rf /
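
(A rough shell rendering of that recipe, purely illustrative: the
device, tarball name, sizes and mount point are made up, and the final
rm is of course aimed at the test mount point, not the real root.)

  # deliberately tiny (4MB) journal so transaction restarts are likely
  mkfs.ext4 -J size=4 /dev/sdX
  mount /dev/sdX /mnt/test

  # fragment free space with lots of small files
  for i in $(seq 1000); do
    mkdir /mnt/test/src$i && tar xf linux-src.tar -C /mnt/test/src$i
  done

  # then fill the rest with large files until the disk is full
  i=0
  while dd if=/dev/zero of=/mnt/test/junk$i bs=1M count=20480; do
    i=$((i+1))
  done

  rm -rf /mnt/test/*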


- Maciej


Re: NULL pointer dereference in ext4_ext_remove_space on 3.5.1

2012-08-16 Thread Maciej Żenczykowski
This would probably be much more readable code if the 'i=0' init was
before path=kzalloc.

On Thu, Aug 16, 2012 at 8:25 AM, Theodore Ts'o  wrote:
> On Thu, Aug 16, 2012 at 07:10:51PM +0800, Fengguang Wu wrote:
>>
>> Here is the dmesg. BTW, it seems 3.5.0 don't have this issue.
>
> Fengguang,
>
> It sounds like you have a (at least fairly) reliable reproduction for
> this problem?  Is it something you can share?  It would be good to get
> this into our test suites, since it was _not_ something that was
> caught by xfstests, apparently.
>
> Can you see if this patch addresses it?  (The first two patch hunks
> are the same debugging additions I had posted before.)
>
> It looks like the responsible commit is 968dee7722: "ext4: fix hole
> punch failure when depth is greater than 0".  I had thought this patch
> was low risk if you weren't using the new punch ioctl, but it turns
> out it did make a critical change in the non-punch (i.e., truncate)
> code path, which is what the addition of "i = 0;" in the patch below
> addresses.
>
> Regards,
>
>   - Ted
>
> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> index 769151d..fa829dc 100644
> --- a/fs/ext4/extents.c
> +++ b/fs/ext4/extents.c
> @@ -2432,6 +2432,10 @@ ext4_ext_rm_leaf(handle_t *handle, struct inode *inode,
>
> /* the header must be checked already in ext4_ext_remove_space() */
> ext_debug("truncate since %u in leaf to %u\n", start, end);
> +   if (!path[depth].p_hdr && !path[depth].p_bh) {
> +   EXT4_ERROR_INODE(inode, "depth %d", depth);
> +   BUG_ON(1);
> +   }
> if (!path[depth].p_hdr)
> path[depth].p_hdr = ext_block_hdr(path[depth].p_bh);
> eh = path[depth].p_hdr;
> @@ -2730,6 +2734,10 @@ cont:
> /* this is index block */
> if (!path[i].p_hdr) {
> ext_debug("initialize header\n");
> +   if (!path[i].p_hdr && !path[i].p_bh) {
> +   EXT4_ERROR_INODE(inode, "i=%d", i);
> +   BUG_ON(1);
> +   }
> path[i].p_hdr = ext_block_hdr(path[i].p_bh);
> }
>
> @@ -2828,6 +2836,7 @@ out:
> kfree(path);
> if (err == -EAGAIN) {
> path = NULL;
> +   i = 0;
> goto again;
> }
> ext4_journal_stop(handle);



-- 
Maciej A. Żenczykowski
Kernel Networking Developer @ Google
1600 Amphitheatre Parkway, Mountain View, CA 94043
tel: +1 (650) 253-0062

