Re: [PATCH 0/5] RFC: connector: Add network namespace awareness
On Tue, 2020-07-14 at 15:03 +1000, Aleksa Sarai wrote: > On 2020-07-13, Eric W. Biederman wrote: > > Matt Bennett writes: > > > > > On Thu, 2020-07-02 at 21:10 +0200, Christian Brauner wrote: > > > > On Thu, Jul 02, 2020 at 08:17:38AM -0500, Eric W. Biederman wrote: > > > > > Matt Bennett writes: > > > > > > > > > > > Previously the connector functionality could only be used by > > > > > > processes running in the > > > > > > default network namespace. This meant that any process that uses > > > > > > the connector functionality > > > > > > could not operate correctly when run inside a container. This is a > > > > > > draft patch series that > > > > > > attempts to now allow this functionality outside of the default > > > > > > network namespace. > > > > > > > > > > > > I see this has been discussed previously [1], but am not sure how > > > > > > my changes relate to all > > > > > > of the topics discussed there and/or if there are any unintended > > > > > > side effects from my draft > > > > > > changes. > > > > > > > > > > Is there a piece of software that uses connector that you want to get > > > > > working in containers? > > > > > > We have an IPC system [1] where processes can register their socket > > > details (unix, tcp, tipc, ...) to a 'monitor' process. Processes can > > > then get notified when other processes they are interested in > > > start/stop their servers and use the registered details to connect to > > > them. Everything works unless a process crashes, in which case the > > > monitoring process never removes their details. Therefore the > > > monitoring process uses the connector functionality with > > > PROC_EVENT_EXIT to detect when a process crashes and removes the > > > details if it is a previously registered PID. > > > > > > This was working for us until we tried to run our system in a container. > > > > > > > > > > > > > I am curious what the motivation is because up until now there has > > > > > been > > > > > nothing very interesting using this functionality. So it hasn't been > > > > > worth anyone's time to make the necessary changes to the code. > > > > > > > > Imho, we should just state once and for all that the proc connector will > > > > not be namespaced. This is such a corner-case thing and has been > > > > non-namespaced for such a long time without consistent push for it to be > > > > namespaced combined with the fact that this needs quite some code to > > > > make it work correctly that I fear we end up buying more bugs than we're > > > > selling features. And realistically, you and I will end up maintaining > > > > this and I feel this is not worth the time(?). Maybe I'm being too > > > > pessimistic though. > > > > > > > > > > Fair enough. I can certainly look for another way to detect process > > > crashes. Interestingly I found a patch set [2] on the mailing list > > > that attempts to solve the problem I wish to solve, but it doesn't > > > look like the patches were ever developed further. From reading the > > > discussion thread on that patch set it appears that I should be doing > > > some form of polling on the /proc files. > > > > Recently Christian Brauner implemented pidfd complete with a poll > > operation that reports when a process terminates. > > > > If you are willing to change your userspace code switching to pidfd > > should be all that you need. > > While this does solve the problem of getting exit notifications in > general, you cannot get the exit code. But if they don't care about that > then we can solve that problem another time. :D > From first glance using pidfd will do exactly what we need. Not being able to get the exit code will not be an issue. In fact I think it will be an improvement over the connector as the listener will now only be waiting for the PIDs we actually care about - rather than getting woken up on every single process exit and having to check if it cares about the PID. Many thanks Eric and others, Matt
Re: [PATCH 0/5] RFC: connector: Add network namespace awareness
On Thu, 2020-07-02 at 21:10 +0200, Christian Brauner wrote: > On Thu, Jul 02, 2020 at 08:17:38AM -0500, Eric W. Biederman wrote: > > Matt Bennett writes: > > > > > Previously the connector functionality could only be used by processes > > > running in the > > > default network namespace. This meant that any process that uses the > > > connector functionality > > > could not operate correctly when run inside a container. This is a draft > > > patch series that > > > attempts to now allow this functionality outside of the default network > > > namespace. > > > > > > I see this has been discussed previously [1], but am not sure how my > > > changes relate to all > > > of the topics discussed there and/or if there are any unintended side > > > effects from my draft > > > changes. > > > > Is there a piece of software that uses connector that you want to get > > working in containers? We have an IPC system [1] where processes can register their socket details (unix, tcp, tipc, ...) to a 'monitor' process. Processes can then get notified when other processes they are interested in start/stop their servers and use the registered details to connect to them. Everything works unless a process crashes, in which case the monitoring process never removes their details. Therefore the monitoring process uses the connector functionality with PROC_EVENT_EXIT to detect when a process crashes and removes the details if it is a previously registered PID. This was working for us until we tried to run our system in a container. > > > > I am curious what the motivation is because up until now there has been > > nothing very interesting using this functionality. So it hasn't been > > worth anyone's time to make the necessary changes to the code. > > Imho, we should just state once and for all that the proc connector will > not be namespaced. This is such a corner-case thing and has been > non-namespaced for such a long time without consistent push for it to be > namespaced combined with the fact that this needs quite some code to > make it work correctly that I fear we end up buying more bugs than we're > selling features. And realistically, you and I will end up maintaining > this and I feel this is not worth the time(?). Maybe I'm being too > pessimistic though. > Fair enough. I can certainly look for another way to detect process crashes. Interestingly I found a patch set [2] on the mailing list that attempts to solve the problem I wish to solve, but it doesn't look like the patches were ever developed further. From reading the discussion thread on that patch set it appears that I should be doing some form of polling on the /proc files. Best regards, Matt [1] https://github.com/alliedtelesis/cmsg/blob/master/cmsg/src/service_listener/netlink.c#L61 [2] https://lkml.org/lkml/2018/10/29/638
Re: [PATCH 0/5] RFC: connector: Add network namespace awareness
On Thu, 2020-07-02 at 13:59 -0500, Eric W. Biederman wrote: > Matt Bennett writes: > > > Previously the connector functionality could only be used by processes > > running in the > > default network namespace. This meant that any process that uses the > > connector functionality > > could not operate correctly when run inside a container. This is a draft > > patch series that > > attempts to now allow this functionality outside of the default network > > namespace. > > > > I see this has been discussed previously [1], but am not sure how my > > changes relate to all > > of the topics discussed there and/or if there are any unintended side > > effects from my draft > > In a quick skim this patchset does not look like it approaches a correct > conversion to having code that works in multiple namespaces. > > I will take the changes to proc_id_connector for example. > You report the values in the callers current namespaces. > > Which means an unprivileged user can create a user namespace and get > connector to report whichever ids they want to users in another > namespace. AKA lie. > > So this appears to make connector completely unreliable. > > Eric > Hi Eric, Thank you for taking the time to review. I wrote these patches in an attempt to show that I was willing to do the work myself rather than simply asking for someone else to do it for me. The changes worked for my use cases when I tested them, but I expected that some of the changes would be incorrect and that I would need some guidance. I can spend some time to really dig in and fully understand the changes I am trying to make (I have limited kernel development experience) but based on the rest of the discussion threads it seems that there is likely no appetite to ever support namespaces with the connector. Best regards, Matt
[PATCH 1/5] connector: Use task pid helpers
In preparation for supporting the connector outside of the default network namespace we switch to using these helpers now. As the connector is still only supported in the default namespace this change is a no-op. Signed-off-by: Matt Bennett --- drivers/connector/cn_proc.c | 48 ++--- 1 file changed, 24 insertions(+), 24 deletions(-) diff --git a/drivers/connector/cn_proc.c b/drivers/connector/cn_proc.c index 646ad385e490..36a7823c56ec 100644 --- a/drivers/connector/cn_proc.c +++ b/drivers/connector/cn_proc.c @@ -83,11 +83,11 @@ void proc_fork_connector(struct task_struct *task) ev->what = PROC_EVENT_FORK; rcu_read_lock(); parent = rcu_dereference(task->real_parent); - ev->event_data.fork.parent_pid = parent->pid; - ev->event_data.fork.parent_tgid = parent->tgid; + ev->event_data.fork.parent_pid = task_pid_vnr(parent); + ev->event_data.fork.parent_tgid = task_tgid_vnr(parent); rcu_read_unlock(); - ev->event_data.fork.child_pid = task->pid; - ev->event_data.fork.child_tgid = task->tgid; + ev->event_data.fork.child_pid = task_pid_vnr(task); + ev->event_data.fork.child_tgid = task_tgid_vnr(task); memcpy(>id, _proc_event_id, sizeof(msg->id)); msg->ack = 0; /* not used */ @@ -110,8 +110,8 @@ void proc_exec_connector(struct task_struct *task) memset(>event_data, 0, sizeof(ev->event_data)); ev->timestamp_ns = ktime_get_ns(); ev->what = PROC_EVENT_EXEC; - ev->event_data.exec.process_pid = task->pid; - ev->event_data.exec.process_tgid = task->tgid; + ev->event_data.exec.process_pid = task_pid_vnr(task); + ev->event_data.exec.process_tgid = task_tgid_vnr(task); memcpy(>id, _proc_event_id, sizeof(msg->id)); msg->ack = 0; /* not used */ @@ -134,8 +134,8 @@ void proc_id_connector(struct task_struct *task, int which_id) ev = (struct proc_event *)msg->data; memset(>event_data, 0, sizeof(ev->event_data)); ev->what = which_id; - ev->event_data.id.process_pid = task->pid; - ev->event_data.id.process_tgid = task->tgid; + ev->event_data.id.process_pid = task_pid_vnr(task); + ev->event_data.id.process_tgid = task_tgid_vnr(task); rcu_read_lock(); cred = __task_cred(task); if (which_id == PROC_EVENT_UID) { @@ -172,8 +172,8 @@ void proc_sid_connector(struct task_struct *task) memset(>event_data, 0, sizeof(ev->event_data)); ev->timestamp_ns = ktime_get_ns(); ev->what = PROC_EVENT_SID; - ev->event_data.sid.process_pid = task->pid; - ev->event_data.sid.process_tgid = task->tgid; + ev->event_data.sid.process_pid = task_pid_vnr(task); + ev->event_data.sid.process_tgid = task_tgid_vnr(task); memcpy(>id, _proc_event_id, sizeof(msg->id)); msg->ack = 0; /* not used */ @@ -196,11 +196,11 @@ void proc_ptrace_connector(struct task_struct *task, int ptrace_id) memset(>event_data, 0, sizeof(ev->event_data)); ev->timestamp_ns = ktime_get_ns(); ev->what = PROC_EVENT_PTRACE; - ev->event_data.ptrace.process_pid = task->pid; - ev->event_data.ptrace.process_tgid = task->tgid; + ev->event_data.ptrace.process_pid = task_pid_vnr(task); + ev->event_data.ptrace.process_tgid = task_tgid_vnr(task); if (ptrace_id == PTRACE_ATTACH) { - ev->event_data.ptrace.tracer_pid = current->pid; - ev->event_data.ptrace.tracer_tgid = current->tgid; + ev->event_data.ptrace.tracer_pid = task_pid_vnr(current); + ev->event_data.ptrace.tracer_tgid = task_tgid_vnr(current); } else if (ptrace_id == PTRACE_DETACH) { ev->event_data.ptrace.tracer_pid = 0; ev->event_data.ptrace.tracer_tgid = 0; @@ -228,8 +228,8 @@ void proc_comm_connector(struct task_struct *task) memset(>event_data, 0, sizeof(ev->event_data)); ev->timestamp_ns = ktime_get_ns(); ev->what = PROC_EVENT_COMM; - ev->event_data.comm.process_pid = task->pid; - ev->event_data.comm.process_tgid = task->tgid; + ev->event_data.comm.process_pid = task_pid_vnr(task); + ev->event_data.comm.process_tgid = task_tgid_vnr(task); get_task_comm(ev->event_data.comm.comm, task); memcpy(>id, _proc_event_id, sizeof(msg->id)); @@ -254,14 +254,14 @@ void proc_coredump_connector(struct task_struct *task) memset(>event_data, 0, sizeof(ev->event_data)); ev->timestamp_ns = ktime_get_ns(); ev->what = PROC_EVENT_COREDUMP; - ev->event_data.coredump.process_pid = task->pid; - ev->event_
[PATCH 5/5] connector: Create connector per namespace
Move to storing the connector instance per network namespace. In doing so the ability to use the connector functionality outside the default namespace is now available. Signed-off-by: Matt Bennett --- drivers/connector/cn_proc.c | 49 ++ drivers/connector/connector.c | 171 -- drivers/hv/hv_fcopy.c | 1 + include/linux/connector.h | 14 ++- include/net/net_namespace.h | 4 + kernel/exit.c | 2 +- 6 files changed, 190 insertions(+), 51 deletions(-) diff --git a/drivers/connector/cn_proc.c b/drivers/connector/cn_proc.c index 9202be177a30..661d921fd146 100644 --- a/drivers/connector/cn_proc.c +++ b/drivers/connector/cn_proc.c @@ -17,6 +17,7 @@ #include #include #include +#include #include #include @@ -37,7 +38,6 @@ static inline struct cn_msg *buffer_to_cn_msg(__u8 *buffer) return (struct cn_msg *)(buffer + 4); } -static atomic_t proc_event_num_listeners = ATOMIC_INIT(0); static struct cb_id cn_proc_event_id = { CN_IDX_PROC, CN_VAL_PROC }; /* local_event.count is used as the sequence number of the netlink message */ @@ -51,6 +51,9 @@ static DEFINE_PER_CPU(struct local_event, local_event) = { static inline void send_msg(struct cn_msg *msg) { + int ret = 0; + struct net *net = current->nsproxy->net_ns; + local_lock(_event.lock); msg->seq = __this_cpu_inc_return(local_event.count) - 1; @@ -62,7 +65,9 @@ static inline void send_msg(struct cn_msg *msg) * * If cn_netlink_send() fails, the data is not sent. */ - cn_netlink_send(_net, msg, 0, CN_IDX_PROC, GFP_NOWAIT); + ret = cn_netlink_send(net, msg, 0, CN_IDX_PROC, GFP_NOWAIT); + if (ret == -ESRCH && netlink_has_listeners(net->cdev.nls, CN_IDX_PROC) == 0) + atomic_set(&(net->cdev.proc_event_num_listeners), 0); local_unlock(_event.lock); } @@ -73,8 +78,9 @@ void proc_fork_connector(struct task_struct *task) struct proc_event *ev; __u8 buffer[CN_PROC_MSG_SIZE] __aligned(8); struct task_struct *parent; + struct net *net = current->nsproxy->net_ns; - if (atomic_read(_event_num_listeners) < 1) + if (atomic_read(&(net->cdev.proc_event_num_listeners)) < 1) return; msg = buffer_to_cn_msg(buffer); @@ -102,8 +108,9 @@ void proc_exec_connector(struct task_struct *task) struct cn_msg *msg; struct proc_event *ev; __u8 buffer[CN_PROC_MSG_SIZE] __aligned(8); + struct net *net = current->nsproxy->net_ns; - if (atomic_read(_event_num_listeners) < 1) + if (atomic_read(&(net->cdev.proc_event_num_listeners)) < 1) return; msg = buffer_to_cn_msg(buffer); @@ -127,8 +134,9 @@ void proc_id_connector(struct task_struct *task, int which_id) struct proc_event *ev; __u8 buffer[CN_PROC_MSG_SIZE] __aligned(8); const struct cred *cred; + struct net *net = current->nsproxy->net_ns; - if (atomic_read(_event_num_listeners) < 1) + if (atomic_read(&(net->cdev.proc_event_num_listeners)) < 1) return; msg = buffer_to_cn_msg(buffer); @@ -164,8 +172,9 @@ void proc_sid_connector(struct task_struct *task) struct cn_msg *msg; struct proc_event *ev; __u8 buffer[CN_PROC_MSG_SIZE] __aligned(8); + struct net *net = current->nsproxy->net_ns; - if (atomic_read(_event_num_listeners) < 1) + if (atomic_read(&(net->cdev.proc_event_num_listeners)) < 1) return; msg = buffer_to_cn_msg(buffer); @@ -188,8 +197,9 @@ void proc_ptrace_connector(struct task_struct *task, int ptrace_id) struct cn_msg *msg; struct proc_event *ev; __u8 buffer[CN_PROC_MSG_SIZE] __aligned(8); + struct net *net = current->nsproxy->net_ns; - if (atomic_read(_event_num_listeners) < 1) + if (atomic_read(&(net->cdev.proc_event_num_listeners)) < 1) return; msg = buffer_to_cn_msg(buffer); @@ -220,8 +230,9 @@ void proc_comm_connector(struct task_struct *task) struct cn_msg *msg; struct proc_event *ev; __u8 buffer[CN_PROC_MSG_SIZE] __aligned(8); + struct net *net = current->nsproxy->net_ns; - if (atomic_read(_event_num_listeners) < 1) + if (atomic_read(&(net->cdev.proc_event_num_listeners)) < 1) return; msg = buffer_to_cn_msg(buffer); @@ -246,8 +257,9 @@ void proc_coredump_connector(struct task_struct *task) struct proc_event *ev; struct task_struct *parent; __u8 buffer[CN_PROC_MSG_SIZE] __aligned(8); + struct net *net = current->nsproxy->net_ns; - if (atomic_read(_event_num_listeners) < 1) + if (atomic_read(&(net->cdev.proc_event
[PATCH 3/5] connector: Ensure callback entry is released
Currently the entry itself appears to be being leaked. Signed-off-by: Matt Bennett --- drivers/connector/cn_queue.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/drivers/connector/cn_queue.c b/drivers/connector/cn_queue.c index 49295052ba8b..a82ceeb37f26 100644 --- a/drivers/connector/cn_queue.c +++ b/drivers/connector/cn_queue.c @@ -132,8 +132,10 @@ void cn_queue_free_dev(struct cn_queue_dev *dev) struct cn_callback_entry *cbq, *n; spin_lock_bh(>queue_lock); - list_for_each_entry_safe(cbq, n, >queue_list, callback_entry) + list_for_each_entry_safe(cbq, n, >queue_list, callback_entry) { list_del(>callback_entry); + cn_queue_release_callback(cbq); + } spin_unlock_bh(>queue_lock); while (atomic_read(>refcnt)) { -- 2.27.0
[PATCH 4/5] connector: Prepare for supporting multiple namespaces
Extend the existing function definitions / call sites to start passing the network namespace. For now we still only pass the default namespace. Signed-off-by: Matt Bennett --- Documentation/driver-api/connector.rst | 6 +++--- drivers/connector/cn_proc.c| 5 +++-- drivers/connector/cn_queue.c | 5 +++-- drivers/connector/connector.c | 21 - drivers/hv/hv_utils_transport.c| 6 -- drivers/md/dm-log-userspace-transfer.c | 6 -- drivers/video/fbdev/uvesafb.c | 8 +--- drivers/w1/w1_netlink.c| 19 +++ include/linux/connector.h | 24 samples/connector/cn_test.c| 6 -- 10 files changed, 65 insertions(+), 41 deletions(-) diff --git a/Documentation/driver-api/connector.rst b/Documentation/driver-api/connector.rst index c100c7482289..4fb1f73d76ad 100644 --- a/Documentation/driver-api/connector.rst +++ b/Documentation/driver-api/connector.rst @@ -25,9 +25,9 @@ handling, etc... The Connector driver allows any kernelspace agents to use netlink based networking for inter-process communication in a significantly easier way:: - int cn_add_callback(struct cb_id *id, char *name, void (*callback) (struct cn_msg *, struct netlink_skb_parms *)); - void cn_netlink_send_multi(struct cn_msg *msg, u16 len, u32 portid, u32 __group, int gfp_mask); - void cn_netlink_send(struct cn_msg *msg, u32 portid, u32 __group, int gfp_mask); + int cn_add_callback(struct cb_id *id, char *name, void (*callback) (struct net *, struct cn_msg *, struct netlink_skb_parms *)); + void cn_netlink_send_multi(struct net *net, struct cn_msg *msg, u16 len, u32 portid, u32 __group, int gfp_mask); + void cn_netlink_send(struct net *net, struct cn_msg *msg, u32 portid, u32 __group, int gfp_mask); struct cb_id { diff --git a/drivers/connector/cn_proc.c b/drivers/connector/cn_proc.c index d90aea555a21..9202be177a30 100644 --- a/drivers/connector/cn_proc.c +++ b/drivers/connector/cn_proc.c @@ -16,6 +16,7 @@ #include #include #include +#include #include #include @@ -61,7 +62,7 @@ static inline void send_msg(struct cn_msg *msg) * * If cn_netlink_send() fails, the data is not sent. */ - cn_netlink_send(msg, 0, CN_IDX_PROC, GFP_NOWAIT); + cn_netlink_send(_net, msg, 0, CN_IDX_PROC, GFP_NOWAIT); local_unlock(_event.lock); } @@ -343,7 +344,7 @@ static void cn_proc_ack(int err, int rcvd_seq, int rcvd_ack) * cn_proc_mcast_ctl * @data: message sent from userspace via the connector */ -static void cn_proc_mcast_ctl(struct cn_msg *msg, +static void cn_proc_mcast_ctl(struct net *net, struct cn_msg *msg, struct netlink_skb_parms *nsp) { enum proc_cn_mcast_op *mc_op = NULL; diff --git a/drivers/connector/cn_queue.c b/drivers/connector/cn_queue.c index a82ceeb37f26..22fdd2b149af 100644 --- a/drivers/connector/cn_queue.c +++ b/drivers/connector/cn_queue.c @@ -16,11 +16,12 @@ #include #include #include +#include static struct cn_callback_entry * cn_queue_alloc_callback_entry(struct cn_queue_dev *dev, const char *name, struct cb_id *id, - void (*callback)(struct cn_msg *, + void (*callback)(struct net *, struct cn_msg *, struct netlink_skb_parms *)) { struct cn_callback_entry *cbq; @@ -58,7 +59,7 @@ int cn_cb_equal(struct cb_id *i1, struct cb_id *i2) int cn_queue_add_callback(struct cn_queue_dev *dev, const char *name, struct cb_id *id, - void (*callback)(struct cn_msg *, + void (*callback)(struct net *, struct cn_msg *, struct netlink_skb_parms *)) { struct cn_callback_entry *cbq, *__cbq; diff --git a/drivers/connector/connector.c b/drivers/connector/connector.c index 2d22d6bf52f2..82fcaa4d8be3 100644 --- a/drivers/connector/connector.c +++ b/drivers/connector/connector.c @@ -58,8 +58,8 @@ static int cn_already_initialized; * The message is sent to, the portid if given, the group if given, both if * both, or if both are zero then the group is looked up and sent there. */ -int cn_netlink_send_mult(struct cn_msg *msg, u16 len, u32 portid, u32 __group, - gfp_t gfp_mask) +int cn_netlink_send_mult(struct net *net, struct cn_msg *msg, u16 len, +u32 portid, u32 __group, gfp_t gfp_mask) { struct cn_callback_entry *__cbq; unsigned int size; @@ -118,17 +118,18 @@ int cn_netlink_send_mult(struct cn_msg *msg, u16 len, u32 portid, u32 __group, EXPORT_SYMBOL_GPL(cn_netlink_send_mult); /* same as cn_netlink_send_mult except msg->len is used for len */ -int cn_netlink_send(struct cn_msg *msg, u32 portid, u32 __group, - gfp_t gfp_mask) +
[PATCH 2/5] connector: Use 'current_user_ns' function
In preparation for supporting the connector outside of the default network namespace we switch to using this function now. As the connector is still only supported in the default namespace this change is a no-op. Signed-off-by: Matt Bennett --- drivers/connector/cn_proc.c | 10 +- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/drivers/connector/cn_proc.c b/drivers/connector/cn_proc.c index 36a7823c56ec..d90aea555a21 100644 --- a/drivers/connector/cn_proc.c +++ b/drivers/connector/cn_proc.c @@ -139,11 +139,11 @@ void proc_id_connector(struct task_struct *task, int which_id) rcu_read_lock(); cred = __task_cred(task); if (which_id == PROC_EVENT_UID) { - ev->event_data.id.r.ruid = from_kuid_munged(_user_ns, cred->uid); - ev->event_data.id.e.euid = from_kuid_munged(_user_ns, cred->euid); + ev->event_data.id.r.ruid = from_kuid_munged(current_user_ns(), cred->uid); + ev->event_data.id.e.euid = from_kuid_munged(current_user_ns(), cred->euid); } else if (which_id == PROC_EVENT_GID) { - ev->event_data.id.r.rgid = from_kgid_munged(_user_ns, cred->gid); - ev->event_data.id.e.egid = from_kgid_munged(_user_ns, cred->egid); + ev->event_data.id.r.rgid = from_kgid_munged(current_user_ns(), cred->gid); + ev->event_data.id.e.egid = from_kgid_munged(current_user_ns(), cred->egid); } else { rcu_read_unlock(); return; @@ -362,7 +362,7 @@ static void cn_proc_mcast_ctl(struct cn_msg *msg, return; /* Can only change if privileged. */ - if (!__netlink_ns_capable(nsp, _user_ns, CAP_NET_ADMIN)) { + if (!__netlink_ns_capable(nsp, current_user_ns(), CAP_NET_ADMIN)) { err = EPERM; goto out; } -- 2.27.0
[PATCH 0/5] RFC: connector: Add network namespace awareness
Previously the connector functionality could only be used by processes running in the default network namespace. This meant that any process that uses the connector functionality could not operate correctly when run inside a container. This is a draft patch series that attempts to now allow this functionality outside of the default network namespace. I see this has been discussed previously [1], but am not sure how my changes relate to all of the topics discussed there and/or if there are any unintended side effects from my draft changes. Thanks. [1] https://marc.info/?l=linux-kernel=150806196728365=2 Matt Bennett (5): connector: Use task pid helpers connector: Use 'current_user_ns' function connector: Ensure callback entry is released connector: Prepare for supporting multiple namespaces connector: Create connector per namespace Documentation/driver-api/connector.rst | 6 +- drivers/connector/cn_proc.c| 110 +++--- drivers/connector/cn_queue.c | 9 +- drivers/connector/connector.c | 192 - drivers/hv/hv_fcopy.c | 1 + drivers/hv/hv_utils_transport.c| 6 +- drivers/md/dm-log-userspace-transfer.c | 6 +- drivers/video/fbdev/uvesafb.c | 8 +- drivers/w1/w1_netlink.c| 19 +-- include/linux/connector.h | 38 +++-- include/net/net_namespace.h| 4 + kernel/exit.c | 2 +- samples/connector/cn_test.c| 6 +- 13 files changed, 286 insertions(+), 121 deletions(-) -- 2.27.0
Re: [PATCH] mmc: core: Check for timeout before checking mmc device state
Hi Uffe, We are using the Octeon mmc host driver supplied from the Cavium SDK (I don't believe it is released to upstream linux). We have both a mmc flash memory device and an SD card reader attached to the mmc bus. In the host driver code their is a mutex which must be obtained before the driver can access the mmc bus. This stops the mmc flash and SD card reader being written to in parallel (otherwise the signal on the bus will be corrupted). It doesn't prevent parallel requests, it's just that the second request will block on this mutex until the first request has been completed. In our specific case the following is occurring: 1. mmc_blk_part_switch() is called to switch partition on the mmc flash device. This calls mmc_switch with a timeout_ms value of 'card->ext_csd.part_time' which is 10ms in this case. 2. In __mmc_switch() the command to switch partition is sent to the mmc flash. 3. Between the command being sent to the flash and then the host polling the status of the device (no busy detection hardware) a read or write operation is begun on the SD card (in our case a Specification Version 2.00 card). In my testing I have seen the bus be blocked up to 800ms while completing this operation. 4. The host polls the device for the status but blocks the first time on the mutex for ~800ms while the SD card operation completes. 5. Finally the host gains the mutex and gets the status from the flash device. In my testing at this stage the status was never still 'R1_STATE_PRG' (it has been 800ms since the command was sent after all). However the timeout check fails because it has been 800ms compared to the original timeout_ms value passed in of 10ms. Therefore even though the device has left the 'R1_STATE_PRG' state we return early with an error that eventually gets printed to the log. This does not affect any functionality as the host will simply try to switch the partition again and if the bus does not block again then there are no issues. By putting the timeout check before we read the status of the device (and potentially block for longer than the timeout) we don't return an early error if the device has indeed left the programming state. We might as well continue through the function as after we return the error the host is just going to issue the command again. Please excuse me if I have missed something fundamental. Thanks, Matt On Mon, 2015-05-11 at 12:00 +0200, Ulf Hansson wrote: > On 8 May 2015 at 04:40, Matt Bennett wrote: > > On a system that has multiple devices on the mmc bus the host can > > block on the mutex that protects access to the bus. Some operations > > What mutex are you referring to? > > And why, exactly, does it prevent parallel requests on different cards > (devices)? > > Kind regards > Uffe > > > require the status of the device to be polled to see when the device > > finishes executing the previous command that was sent to it (if > > there is no busy detection in hardware). The current execution order > > to check the status is: > > > > LOOP > > { > > 1. Send command to device to retrieve the status (this can block). > > 2. Check we haven't exceeded the timeout value. If we have then > > return an error. > > 3. If the device is no longer in the program state then exit the > > loop and continue through the function. > > } > > > > If the send command blocks (and the timeout is exceeded) then the > > function returns (and prints) an error even though the device has > > likely left the programming state (due to the lengthy period of > > time while the bus is blocked). By moving the timeout check before > > retrieving the device status in the loop we better handle the case > > where the mmc bus has been blocked but the device has left the > > programming state. > > > > Signed-off-by: Matt Bennett > > Cc: kuninori.morimoto...@renesas.com > > Cc: jh80.ch...@samsung.com > > Cc: sb...@codeaurora.org > > Cc: johan.rudh...@axis.com > > Cc: linux-kernel@vger.kernel.org > > Cc: linux-...@vger.kernel.org > > --- > > drivers/mmc/card/block.c | 22 +++--- > > drivers/mmc/core/core.c| 20 ++-- > > drivers/mmc/core/mmc_ops.c | 14 +++--- > > 3 files changed, 28 insertions(+), 28 deletions(-) > > > > diff --git a/drivers/mmc/card/block.c b/drivers/mmc/card/block.c > > index 2c25271..0abefde 100644 > > --- a/drivers/mmc/card/block.c > > +++ b/drivers/mmc/card/block.c > > @@ -747,6 +747,17 @@ static int card_busy_detect(struct mmc_card *card, > > unsigned int timeout_ms, > > u32 status; > > > > do { > > + /* > > +* Timeout if the device
Re: [PATCH] mmc: core: Check for timeout before checking mmc device state
Hi Uffe, We are using the Octeon mmc host driver supplied from the Cavium SDK (I don't believe it is released to upstream linux). We have both a mmc flash memory device and an SD card reader attached to the mmc bus. In the host driver code their is a mutex which must be obtained before the driver can access the mmc bus. This stops the mmc flash and SD card reader being written to in parallel (otherwise the signal on the bus will be corrupted). It doesn't prevent parallel requests, it's just that the second request will block on this mutex until the first request has been completed. In our specific case the following is occurring: 1. mmc_blk_part_switch() is called to switch partition on the mmc flash device. This calls mmc_switch with a timeout_ms value of 'card-ext_csd.part_time' which is 10ms in this case. 2. In __mmc_switch() the command to switch partition is sent to the mmc flash. 3. Between the command being sent to the flash and then the host polling the status of the device (no busy detection hardware) a read or write operation is begun on the SD card (in our case a Specification Version 2.00 card). In my testing I have seen the bus be blocked up to 800ms while completing this operation. 4. The host polls the device for the status but blocks the first time on the mutex for ~800ms while the SD card operation completes. 5. Finally the host gains the mutex and gets the status from the flash device. In my testing at this stage the status was never still 'R1_STATE_PRG' (it has been 800ms since the command was sent after all). However the timeout check fails because it has been 800ms compared to the original timeout_ms value passed in of 10ms. Therefore even though the device has left the 'R1_STATE_PRG' state we return early with an error that eventually gets printed to the log. This does not affect any functionality as the host will simply try to switch the partition again and if the bus does not block again then there are no issues. By putting the timeout check before we read the status of the device (and potentially block for longer than the timeout) we don't return an early error if the device has indeed left the programming state. We might as well continue through the function as after we return the error the host is just going to issue the command again. Please excuse me if I have missed something fundamental. Thanks, Matt On Mon, 2015-05-11 at 12:00 +0200, Ulf Hansson wrote: On 8 May 2015 at 04:40, Matt Bennett matt.benn...@alliedtelesis.co.nz wrote: On a system that has multiple devices on the mmc bus the host can block on the mutex that protects access to the bus. Some operations What mutex are you referring to? And why, exactly, does it prevent parallel requests on different cards (devices)? Kind regards Uffe require the status of the device to be polled to see when the device finishes executing the previous command that was sent to it (if there is no busy detection in hardware). The current execution order to check the status is: LOOP { 1. Send command to device to retrieve the status (this can block). 2. Check we haven't exceeded the timeout value. If we have then return an error. 3. If the device is no longer in the program state then exit the loop and continue through the function. } If the send command blocks (and the timeout is exceeded) then the function returns (and prints) an error even though the device has likely left the programming state (due to the lengthy period of time while the bus is blocked). By moving the timeout check before retrieving the device status in the loop we better handle the case where the mmc bus has been blocked but the device has left the programming state. Signed-off-by: Matt Bennett matt.benn...@alliedtelesis.co.nz Cc: kuninori.morimoto...@renesas.com Cc: jh80.ch...@samsung.com Cc: sb...@codeaurora.org Cc: johan.rudh...@axis.com Cc: linux-kernel@vger.kernel.org Cc: linux-...@vger.kernel.org --- drivers/mmc/card/block.c | 22 +++--- drivers/mmc/core/core.c| 20 ++-- drivers/mmc/core/mmc_ops.c | 14 +++--- 3 files changed, 28 insertions(+), 28 deletions(-) diff --git a/drivers/mmc/card/block.c b/drivers/mmc/card/block.c index 2c25271..0abefde 100644 --- a/drivers/mmc/card/block.c +++ b/drivers/mmc/card/block.c @@ -747,6 +747,17 @@ static int card_busy_detect(struct mmc_card *card, unsigned int timeout_ms, u32 status; do { + /* +* Timeout if the device never becomes ready for data and never +* leaves the program state. +*/ + if (time_after(jiffies, timeout)) { + pr_err(%s: Card stuck in programming state! %s %s\n, + mmc_hostname(card-host), + req-rq_disk-disk_name, __func__
[PATCH] mmc: core: Check for timeout before checking mmc device state
On a system that has multiple devices on the mmc bus the host can block on the mutex that protects access to the bus. Some operations require the status of the device to be polled to see when the device finishes executing the previous command that was sent to it (if there is no busy detection in hardware). The current execution order to check the status is: LOOP { 1. Send command to device to retrieve the status (this can block). 2. Check we haven't exceeded the timeout value. If we have then return an error. 3. If the device is no longer in the program state then exit the loop and continue through the function. } If the send command blocks (and the timeout is exceeded) then the function returns (and prints) an error even though the device has likely left the programming state (due to the lengthy period of time while the bus is blocked). By moving the timeout check before retrieving the device status in the loop we better handle the case where the mmc bus has been blocked but the device has left the programming state. Signed-off-by: Matt Bennett Cc: kuninori.morimoto...@renesas.com Cc: jh80.ch...@samsung.com Cc: sb...@codeaurora.org Cc: johan.rudh...@axis.com Cc: linux-kernel@vger.kernel.org Cc: linux-...@vger.kernel.org --- drivers/mmc/card/block.c | 22 +++--- drivers/mmc/core/core.c| 20 ++-- drivers/mmc/core/mmc_ops.c | 14 +++--- 3 files changed, 28 insertions(+), 28 deletions(-) diff --git a/drivers/mmc/card/block.c b/drivers/mmc/card/block.c index 2c25271..0abefde 100644 --- a/drivers/mmc/card/block.c +++ b/drivers/mmc/card/block.c @@ -747,6 +747,17 @@ static int card_busy_detect(struct mmc_card *card, unsigned int timeout_ms, u32 status; do { + /* +* Timeout if the device never becomes ready for data and never +* leaves the program state. +*/ + if (time_after(jiffies, timeout)) { + pr_err("%s: Card stuck in programming state! %s %s\n", + mmc_hostname(card->host), + req->rq_disk->disk_name, __func__); + return -ETIMEDOUT; + } + err = get_card_status(card, , 5); if (err) { pr_err("%s: error %d requesting status\n", @@ -766,17 +777,6 @@ static int card_busy_detect(struct mmc_card *card, unsigned int timeout_ms, break; /* -* Timeout if the device never becomes ready for data and never -* leaves the program state. -*/ - if (time_after(jiffies, timeout)) { - pr_err("%s: Card stuck in programming state! %s %s\n", - mmc_hostname(card->host), - req->rq_disk->disk_name, __func__); - return -ETIMEDOUT; - } - - /* * Some cards mishandle the status bits, * so make sure to check both the busy * indication and the card state. diff --git a/drivers/mmc/core/core.c b/drivers/mmc/core/core.c index c296bc0..6e56cb3 100644 --- a/drivers/mmc/core/core.c +++ b/drivers/mmc/core/core.c @@ -2047,6 +2047,16 @@ static int mmc_do_erase(struct mmc_card *card, unsigned int from, timeout = jiffies + msecs_to_jiffies(MMC_CORE_TIMEOUT_MS); do { + /* Timeout if the device never becomes ready for data and +* never leaves the program state. +*/ + if (time_after(jiffies, timeout)) { + pr_err("%s: Card stuck in programming state! %s\n", + mmc_hostname(card->host), __func__); + err = -EIO; + goto out; + } + memset(, 0, sizeof(struct mmc_command)); cmd.opcode = MMC_SEND_STATUS; cmd.arg = card->rca << 16; @@ -2060,16 +2070,6 @@ static int mmc_do_erase(struct mmc_card *card, unsigned int from, goto out; } - /* Timeout if the device never becomes ready for data and -* never leaves the program state. -*/ - if (time_after(jiffies, timeout)) { - pr_err("%s: Card stuck in programming state! %s\n", - mmc_hostname(card->host), __func__); - err = -EIO; - goto out; - } - } while (!(cmd.resp[0] & R1_READY_FOR_DATA) || (R1_CURRENT_STATE(cmd.resp[0]) == R1_STATE_PRG)); out: diff --git a/drivers/mmc/core/mmc_ops.c b/drivers/mmc/core/mmc_ops.c index 0ea042d..b30ed91 100644 ---
[PATCH] mmc: core: Check for timeout before checking mmc device state
On a system that has multiple devices on the mmc bus the host can block on the mutex that protects access to the bus. Some operations require the status of the device to be polled to see when the device finishes executing the previous command that was sent to it (if there is no busy detection in hardware). The current execution order to check the status is: LOOP { 1. Send command to device to retrieve the status (this can block). 2. Check we haven't exceeded the timeout value. If we have then return an error. 3. If the device is no longer in the program state then exit the loop and continue through the function. } If the send command blocks (and the timeout is exceeded) then the function returns (and prints) an error even though the device has likely left the programming state (due to the lengthy period of time while the bus is blocked). By moving the timeout check before retrieving the device status in the loop we better handle the case where the mmc bus has been blocked but the device has left the programming state. Signed-off-by: Matt Bennett matt.benn...@alliedtelesis.co.nz Cc: kuninori.morimoto...@renesas.com Cc: jh80.ch...@samsung.com Cc: sb...@codeaurora.org Cc: johan.rudh...@axis.com Cc: linux-kernel@vger.kernel.org Cc: linux-...@vger.kernel.org --- drivers/mmc/card/block.c | 22 +++--- drivers/mmc/core/core.c| 20 ++-- drivers/mmc/core/mmc_ops.c | 14 +++--- 3 files changed, 28 insertions(+), 28 deletions(-) diff --git a/drivers/mmc/card/block.c b/drivers/mmc/card/block.c index 2c25271..0abefde 100644 --- a/drivers/mmc/card/block.c +++ b/drivers/mmc/card/block.c @@ -747,6 +747,17 @@ static int card_busy_detect(struct mmc_card *card, unsigned int timeout_ms, u32 status; do { + /* +* Timeout if the device never becomes ready for data and never +* leaves the program state. +*/ + if (time_after(jiffies, timeout)) { + pr_err(%s: Card stuck in programming state! %s %s\n, + mmc_hostname(card-host), + req-rq_disk-disk_name, __func__); + return -ETIMEDOUT; + } + err = get_card_status(card, status, 5); if (err) { pr_err(%s: error %d requesting status\n, @@ -766,17 +777,6 @@ static int card_busy_detect(struct mmc_card *card, unsigned int timeout_ms, break; /* -* Timeout if the device never becomes ready for data and never -* leaves the program state. -*/ - if (time_after(jiffies, timeout)) { - pr_err(%s: Card stuck in programming state! %s %s\n, - mmc_hostname(card-host), - req-rq_disk-disk_name, __func__); - return -ETIMEDOUT; - } - - /* * Some cards mishandle the status bits, * so make sure to check both the busy * indication and the card state. diff --git a/drivers/mmc/core/core.c b/drivers/mmc/core/core.c index c296bc0..6e56cb3 100644 --- a/drivers/mmc/core/core.c +++ b/drivers/mmc/core/core.c @@ -2047,6 +2047,16 @@ static int mmc_do_erase(struct mmc_card *card, unsigned int from, timeout = jiffies + msecs_to_jiffies(MMC_CORE_TIMEOUT_MS); do { + /* Timeout if the device never becomes ready for data and +* never leaves the program state. +*/ + if (time_after(jiffies, timeout)) { + pr_err(%s: Card stuck in programming state! %s\n, + mmc_hostname(card-host), __func__); + err = -EIO; + goto out; + } + memset(cmd, 0, sizeof(struct mmc_command)); cmd.opcode = MMC_SEND_STATUS; cmd.arg = card-rca 16; @@ -2060,16 +2070,6 @@ static int mmc_do_erase(struct mmc_card *card, unsigned int from, goto out; } - /* Timeout if the device never becomes ready for data and -* never leaves the program state. -*/ - if (time_after(jiffies, timeout)) { - pr_err(%s: Card stuck in programming state! %s\n, - mmc_hostname(card-host), __func__); - err = -EIO; - goto out; - } - } while (!(cmd.resp[0] R1_READY_FOR_DATA) || (R1_CURRENT_STATE(cmd.resp[0]) == R1_STATE_PRG)); out: diff --git a/drivers/mmc/core/mmc_ops.c b/drivers/mmc/core/mmc_ops.c index 0ea042d..b30ed91 100644 --- a/drivers/mmc/core/mmc_ops.c +++ b/drivers/mmc/core/mmc_ops.c @@ -526,6
Re: [PATCH] MIPS: OCTEON: Stop .bss memory being allocated to the kernel
I mistakenly only included David in the original patch email so I have added the other people from get_maintainer.pl. I believe this patch could be fixing a rather significant problem (a potential kernel panic) so some feedback would be appreciated. Thanks, Matt On Thu, 2015-04-30 at 11:48 +1200, Matt Bennett wrote: > During development work on a 3.16 based kernel it was found that a > number of builds would panic during the kernel init process, more > specifically in 'delayed_fput()'. The panic showed the kernel trying > to access a memory address of '0xb7fdc00' while traversing the > 'delayed_fput_list' structure. Comparing this memory address to the > value of the pointer used on builds that did not panic confirmed > that the pointer on crashing builds must have been corrupted at some > stage earlier in the init process. > > By traversing the list earlier and earlier in the code it was found > that 'plat_mem_setup()' was responsible for corrupting the list. > Specifically the line: > > memory = cvmx_bootmem_phy_alloc(mem_alloc_size, > __pa_symbol(&__init_end), -1, > 0x10, > CVMX_BOOTMEM_FLAG_NO_LOCKING); > > Which would eventually call: > > cvmx_bootmem_phy_set_size(new_ent_addr, > cvmx_bootmem_phy_get_size > (ent_addr) - > (desired_min_addr - > ent_addr)); > > Where 'new_ent_addr'=0x480 (the address of 'delayed_fput_list') > and the second argument (size)=0xb7fdc00 (the address causing the > kernel panic). The job of this part of 'plat_mem_setup()' is to > allocate chunks of memory for the kernel to use. At the start of > each chunk of memory the size of the chunk is written, hence the > value 0xb7fdc00 is written onto memory at 0x480, therefore the > kernel panics when it goes back to access 'delayed_fput_list' later > on in the initialisation process. > > On builds that were not crashing it was found that the compiler had > placed 'delayed_fput_list' at 0x488, meaning it wasn't corrupted > (but something else in memory was overwritten). > > As can be seen in the first function call above the code begins to > allocate chunks of memory beginning from the symbol '__init_end'. > The MIPS linker script (vmlinux.lds.S) however defines the .bss > section to begin after '__init_end'. Therefore memory within the > .bss section is allocated to the kernel to use (System.map shows > 'delayed_fput_list' and other kernel structures to be in .bss). > > To stop the kernel panic (and the .bss section being corrupted) > memory should begin being allocated from the symbol '_end'. > > Signed-off-by: Matt Bennett > Cc: linux-m...@linux-mips.org > Cc: linux-kernel@vger.kernel.org > --- > arch/mips/cavium-octeon/setup.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/arch/mips/cavium-octeon/setup.c b/arch/mips/cavium-octeon/setup.c > index 7e4367b..f632f14 100644 > --- a/arch/mips/cavium-octeon/setup.c > +++ b/arch/mips/cavium-octeon/setup.c > @@ -1008,7 +1008,7 @@ void __init plat_mem_setup(void) > while ((boot_mem_map.nr_map < BOOT_MEM_MAP_MAX) > && (total < MAX_MEMORY)) { > memory = cvmx_bootmem_phy_alloc(mem_alloc_size, > - __pa_symbol(&__init_end), -1, > + __pa_symbol(&_end), -1, > 0x10, > CVMX_BOOTMEM_FLAG_NO_LOCKING); > if (memory >= 0) {
Re: [PATCH] MIPS: OCTEON: Stop .bss memory being allocated to the kernel
I mistakenly only included David in the original patch email so I have added the other people from get_maintainer.pl. I believe this patch could be fixing a rather significant problem (a potential kernel panic) so some feedback would be appreciated. Thanks, Matt On Thu, 2015-04-30 at 11:48 +1200, Matt Bennett wrote: During development work on a 3.16 based kernel it was found that a number of builds would panic during the kernel init process, more specifically in 'delayed_fput()'. The panic showed the kernel trying to access a memory address of '0xb7fdc00' while traversing the 'delayed_fput_list' structure. Comparing this memory address to the value of the pointer used on builds that did not panic confirmed that the pointer on crashing builds must have been corrupted at some stage earlier in the init process. By traversing the list earlier and earlier in the code it was found that 'plat_mem_setup()' was responsible for corrupting the list. Specifically the line: memory = cvmx_bootmem_phy_alloc(mem_alloc_size, __pa_symbol(__init_end), -1, 0x10, CVMX_BOOTMEM_FLAG_NO_LOCKING); Which would eventually call: cvmx_bootmem_phy_set_size(new_ent_addr, cvmx_bootmem_phy_get_size (ent_addr) - (desired_min_addr - ent_addr)); Where 'new_ent_addr'=0x480 (the address of 'delayed_fput_list') and the second argument (size)=0xb7fdc00 (the address causing the kernel panic). The job of this part of 'plat_mem_setup()' is to allocate chunks of memory for the kernel to use. At the start of each chunk of memory the size of the chunk is written, hence the value 0xb7fdc00 is written onto memory at 0x480, therefore the kernel panics when it goes back to access 'delayed_fput_list' later on in the initialisation process. On builds that were not crashing it was found that the compiler had placed 'delayed_fput_list' at 0x488, meaning it wasn't corrupted (but something else in memory was overwritten). As can be seen in the first function call above the code begins to allocate chunks of memory beginning from the symbol '__init_end'. The MIPS linker script (vmlinux.lds.S) however defines the .bss section to begin after '__init_end'. Therefore memory within the .bss section is allocated to the kernel to use (System.map shows 'delayed_fput_list' and other kernel structures to be in .bss). To stop the kernel panic (and the .bss section being corrupted) memory should begin being allocated from the symbol '_end'. Signed-off-by: Matt Bennett matt.benn...@alliedtelesis.co.nz Cc: linux-m...@linux-mips.org Cc: linux-kernel@vger.kernel.org --- arch/mips/cavium-octeon/setup.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/mips/cavium-octeon/setup.c b/arch/mips/cavium-octeon/setup.c index 7e4367b..f632f14 100644 --- a/arch/mips/cavium-octeon/setup.c +++ b/arch/mips/cavium-octeon/setup.c @@ -1008,7 +1008,7 @@ void __init plat_mem_setup(void) while ((boot_mem_map.nr_map BOOT_MEM_MAP_MAX) (total MAX_MEMORY)) { memory = cvmx_bootmem_phy_alloc(mem_alloc_size, - __pa_symbol(__init_end), -1, + __pa_symbol(_end), -1, 0x10, CVMX_BOOTMEM_FLAG_NO_LOCKING); if (memory = 0) {
[PATCH] MIPS: OCTEON: Stop .bss memory being allocated to the kernel
During development work on a 3.16 based kernel it was found that a number of builds would panic during the kernel init process, more specifically in 'delayed_fput()'. The panic showed the kernel trying to access a memory address of '0xb7fdc00' while traversing the 'delayed_fput_list' structure. Comparing this memory address to the value of the pointer used on builds that did not panic confirmed that the pointer on crashing builds must have been corrupted at some stage earlier in the init process. By traversing the list earlier and earlier in the code it was found that 'plat_mem_setup()' was responsible for corrupting the list. Specifically the line: memory = cvmx_bootmem_phy_alloc(mem_alloc_size, __pa_symbol(&__init_end), -1, 0x10, CVMX_BOOTMEM_FLAG_NO_LOCKING); Which would eventually call: cvmx_bootmem_phy_set_size(new_ent_addr, cvmx_bootmem_phy_get_size (ent_addr) - (desired_min_addr - ent_addr)); Where 'new_ent_addr'=0x480 (the address of 'delayed_fput_list') and the second argument (size)=0xb7fdc00 (the address causing the kernel panic). The job of this part of 'plat_mem_setup()' is to allocate chunks of memory for the kernel to use. At the start of each chunk of memory the size of the chunk is written, hence the value 0xb7fdc00 is written onto memory at 0x480, therefore the kernel panics when it goes back to access 'delayed_fput_list' later on in the initialisation process. On builds that were not crashing it was found that the compiler had placed 'delayed_fput_list' at 0x488, meaning it wasn't corrupted (but something else in memory was overwritten). As can be seen in the first function call above the code begins to allocate chunks of memory beginning from the symbol '__init_end'. The MIPS linker script (vmlinux.lds.S) however defines the .bss section to begin after '__init_end'. Therefore memory within the .bss section is allocated to the kernel to use (System.map shows 'delayed_fput_list' and other kernel structures to be in .bss). To stop the kernel panic (and the .bss section being corrupted) memory should begin being allocated from the symbol '_end'. Signed-off-by: Matt Bennett Cc: linux-m...@linux-mips.org Cc: linux-kernel@vger.kernel.org --- arch/mips/cavium-octeon/setup.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/mips/cavium-octeon/setup.c b/arch/mips/cavium-octeon/setup.c index 7e4367b..f632f14 100644 --- a/arch/mips/cavium-octeon/setup.c +++ b/arch/mips/cavium-octeon/setup.c @@ -1008,7 +1008,7 @@ void __init plat_mem_setup(void) while ((boot_mem_map.nr_map < BOOT_MEM_MAP_MAX) && (total < MAX_MEMORY)) { memory = cvmx_bootmem_phy_alloc(mem_alloc_size, - __pa_symbol(&__init_end), -1, + __pa_symbol(&_end), -1, 0x10, CVMX_BOOTMEM_FLAG_NO_LOCKING); if (memory >= 0) { -- 1.9.1 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] MIPS: OCTEON: Stop .bss memory being allocated to the kernel
During development work on a 3.16 based kernel it was found that a number of builds would panic during the kernel init process, more specifically in 'delayed_fput()'. The panic showed the kernel trying to access a memory address of '0xb7fdc00' while traversing the 'delayed_fput_list' structure. Comparing this memory address to the value of the pointer used on builds that did not panic confirmed that the pointer on crashing builds must have been corrupted at some stage earlier in the init process. By traversing the list earlier and earlier in the code it was found that 'plat_mem_setup()' was responsible for corrupting the list. Specifically the line: memory = cvmx_bootmem_phy_alloc(mem_alloc_size, __pa_symbol(__init_end), -1, 0x10, CVMX_BOOTMEM_FLAG_NO_LOCKING); Which would eventually call: cvmx_bootmem_phy_set_size(new_ent_addr, cvmx_bootmem_phy_get_size (ent_addr) - (desired_min_addr - ent_addr)); Where 'new_ent_addr'=0x480 (the address of 'delayed_fput_list') and the second argument (size)=0xb7fdc00 (the address causing the kernel panic). The job of this part of 'plat_mem_setup()' is to allocate chunks of memory for the kernel to use. At the start of each chunk of memory the size of the chunk is written, hence the value 0xb7fdc00 is written onto memory at 0x480, therefore the kernel panics when it goes back to access 'delayed_fput_list' later on in the initialisation process. On builds that were not crashing it was found that the compiler had placed 'delayed_fput_list' at 0x488, meaning it wasn't corrupted (but something else in memory was overwritten). As can be seen in the first function call above the code begins to allocate chunks of memory beginning from the symbol '__init_end'. The MIPS linker script (vmlinux.lds.S) however defines the .bss section to begin after '__init_end'. Therefore memory within the .bss section is allocated to the kernel to use (System.map shows 'delayed_fput_list' and other kernel structures to be in .bss). To stop the kernel panic (and the .bss section being corrupted) memory should begin being allocated from the symbol '_end'. Signed-off-by: Matt Bennett matt.benn...@alliedtelesis.co.nz Cc: linux-m...@linux-mips.org Cc: linux-kernel@vger.kernel.org --- arch/mips/cavium-octeon/setup.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/mips/cavium-octeon/setup.c b/arch/mips/cavium-octeon/setup.c index 7e4367b..f632f14 100644 --- a/arch/mips/cavium-octeon/setup.c +++ b/arch/mips/cavium-octeon/setup.c @@ -1008,7 +1008,7 @@ void __init plat_mem_setup(void) while ((boot_mem_map.nr_map BOOT_MEM_MAP_MAX) (total MAX_MEMORY)) { memory = cvmx_bootmem_phy_alloc(mem_alloc_size, - __pa_symbol(__init_end), -1, + __pa_symbol(_end), -1, 0x10, CVMX_BOOTMEM_FLAG_NO_LOCKING); if (memory = 0) { -- 1.9.1 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/