Re: [PATCH] vsock.7: document VSOCK socket address family

2018-02-01 Thread Michael Kerrisk (man-pages)
On 1 February 2018 at 19:03, Stefan Hajnoczi <stefa...@redhat.com> wrote:
> On Tue, Jan 30, 2018 at 10:31:54PM +0100, Michael Kerrisk (man-pages) wrote:
>> Hi Stefan,
>>
>> Ping on the below please, since it either blocks the man-pages release
>> I'd currently like to make, or I must remove the vsock.7 page for this
>> release.
>
> Sorry for the delay.  The verbatim license is fine.

Thanks, Stefan!

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: [PATCH] vsock.7: document VSOCK socket address family

2018-01-30 Thread Michael Kerrisk (man-pages)
Hi Stefan,

Ping on the below please, since it either blocks the man-pages release
I'd currently like to make, or I must remove the vsock.7 page for this
release.

Thanks,

Michael



On 26 January 2018 at 22:47, Michael Kerrisk (man-pages)
<mtk.manpa...@gmail.com> wrote:
> Stefan,
>
> I've just now noted that your page came with no license. What license
> do you want to use? Please see
> https://www.kernel.org/doc/man-pages/licenses.html
>
> Thanks,
>
> Michael
>
>
> On 30 November 2017 at 12:21, Stefan Hajnoczi <stefa...@redhat.com> wrote:
>> The AF_VSOCK address family has been available since Linux 3.9 without a
>> corresponding man page.
>>
>> This patch adds vsock.7 and describes its use along the same lines as
>> existing ip.7, unix.7, and netlink.7 man pages.
>>
>> CC: Jorgen Hansen <jhan...@vmware.com>
>> CC: Dexuan Cui <de...@microsoft.com>
>> Signed-off-by: Stefan Hajnoczi <stefa...@redhat.com>
>> ---
>>  man7/vsock.7 | 175 +++
>>  1 file changed, 175 insertions(+)
>>  create mode 100644 man7/vsock.7
>>
>> diff --git a/man7/vsock.7 b/man7/vsock.7
>> new file mode 100644
>> index 0..48c6c2e1e
>> --- /dev/null
>> +++ b/man7/vsock.7
>> @@ -0,0 +1,175 @@
>> +.TH VSOCK 7 2017-11-30 "Linux" "Linux Programmer's Manual"
>> +.SH NAME
>> +vsock \- Linux VSOCK address family
>> +.SH SYNOPSIS
>> +.B #include <sys/socket.h>
>> +.br
>> +.B #include <linux/vm_sockets.h>
>> +.PP
>> +.IB stream_socket " = socket(AF_VSOCK, SOCK_STREAM, 0);"
>> +.br
>> +.IB datagram_socket " = socket(AF_VSOCK, SOCK_DGRAM, 0);"
>> +.SH DESCRIPTION
>> +The VSOCK address family facilitates communication between virtual machines and
>> +the host they are running on.  This address family is used by guest agents and
>> +hypervisor services that need a communications channel that is independent of
>> +virtual machine network configuration.
>> +.PP
>> +Valid socket types are
>> +.B SOCK_STREAM
>> +and
>> +.B SOCK_DGRAM .
>> +.B SOCK_STREAM
>> +provides connection-oriented byte streams with guaranteed, in-order delivery.
>> +.B SOCK_DGRAM
>> +provides a connectionless datagram packet service.  Availability of these
>> +socket types is dependent on the underlying hypervisor.
>> +.PP
>> +A new socket is created with
>> +.PP
>> +socket(AF_VSOCK, socket_type, 0);
>> +.PP
>> +When a process wants to establish a connection it calls
>> +.BR connect (2)
>> +with a given destination socket address.  The socket is automatically bound to
>> +a free port if unbound.
>> +.PP
>> +A process can listen for incoming connections by first binding to a socket address using
>> +.BR bind (2)
>> +and then calling
>> +.BR listen (2).
>> +.PP
>> +Data is transferred using the usual
>> +.BR send (2)
>> +and
>> +.BR recv (2)
>> +family of socket system calls.
>> +.SS Address format
>> +A socket address is defined as a combination of a 32-bit Context Identifier (CID) and a 32-bit port number.  The CID identifies the source or destination, which is either a virtual machine or the host.  The port number differentiates between multiple services running on a single machine.
>> +.PP
>> +.in +4n
>> +.EX
>> +struct sockaddr_vm {
>> +    sa_family_t     svm_family;     /* address family: AF_VSOCK */
>> +    unsigned short  svm_reserved1;
>> +    unsigned int    svm_port;       /* port in native byte order */
>> +    unsigned int    svm_cid;        /* address in native byte order */
>> +};
>> +.EE
>> +.in
>> +.PP
>> +.I svm_family
>> +is always set to
>> +.BR AF_VSOCK .
>> +.I svm_reserved1
>> +is always set to 0.
>> +.I svm_port
>> +contains the port in native byte order.
>> +The port numbers below 1024 are called
>> +.IR "privileged ports" .
>> +Only a process with
>> +.B CAP_NET_BIND_SERVICE
>> +capability may
>> +.BR bind (2)
>> +to these port numbers.
>> +.PP
>> +There are several special addresses:
>> +.B VMADDR_CID_ANY
>> +(-1U)
>> +means any address for binding;
>> +.B VMADDR_CID_HYPERVISOR
>> +(0) and
>> +.B VMADDR_CID_RESERVED
>> +(1) are unused addresses;
>> +.B VMADDR_CID_HOST
>> +(2)
>> +is the well-known address of the host.
>> +.PP
>> +The special constant
>> +.B VMADDR_P

Re: [PATCH] vsock.7: document VSOCK socket address family

2018-01-26 Thread Michael Kerrisk (man-pages)
Stefan,

I've just now noted that your page came with no license. What license
do you want to use? Please see
https://www.kernel.org/doc/man-pages/licenses.html

Thanks,

Michael


On 30 November 2017 at 12:21, Stefan Hajnoczi <stefa...@redhat.com> wrote:
> The AF_VSOCK address family has been available since Linux 3.9 without a
> corresponding man page.
>
> This patch adds vsock.7 and describes its use along the same lines as
> existing ip.7, unix.7, and netlink.7 man pages.
>
> CC: Jorgen Hansen <jhan...@vmware.com>
> CC: Dexuan Cui <de...@microsoft.com>
> Signed-off-by: Stefan Hajnoczi <stefa...@redhat.com>
> ---
>  man7/vsock.7 | 175 +++
>  1 file changed, 175 insertions(+)
>  create mode 100644 man7/vsock.7
>
> diff --git a/man7/vsock.7 b/man7/vsock.7
> new file mode 100644
> index 0..48c6c2e1e
> --- /dev/null
> +++ b/man7/vsock.7
> @@ -0,0 +1,175 @@
> +.TH VSOCK 7 2017-11-30 "Linux" "Linux Programmer's Manual"
> +.SH NAME
> +vsock \- Linux VSOCK address family
> +.SH SYNOPSIS
> +.B #include <sys/socket.h>
> +.br
> +.B #include <linux/vm_sockets.h>
> +.PP
> +.IB stream_socket " = socket(AF_VSOCK, SOCK_STREAM, 0);"
> +.br
> +.IB datagram_socket " = socket(AF_VSOCK, SOCK_DGRAM, 0);"
> +.SH DESCRIPTION
> +The VSOCK address family facilitates communication between virtual machines and
> +the host they are running on.  This address family is used by guest agents and
> +hypervisor services that need a communications channel that is independent of
> +virtual machine network configuration.
> +.PP
> +Valid socket types are
> +.B SOCK_STREAM
> +and
> +.B SOCK_DGRAM .
> +.B SOCK_STREAM
> +provides connection-oriented byte streams with guaranteed, in-order delivery.
> +.B SOCK_DGRAM
> +provides a connectionless datagram packet service.  Availability of these
> +socket types is dependent on the underlying hypervisor.
> +.PP
> +A new socket is created with
> +.PP
> +socket(AF_VSOCK, socket_type, 0);
> +.PP
> +When a process wants to establish a connection it calls
> +.BR connect (2)
> +with a given destination socket address.  The socket is automatically bound to
> +a free port if unbound.
> +.PP
> +A process can listen for incoming connections by first binding to a socket address using
> +.BR bind (2)
> +and then calling
> +.BR listen (2).
> +.PP
> +Data is transferred using the usual
> +.BR send (2)
> +and
> +.BR recv (2)
> +family of socket system calls.
> +.SS Address format
> +A socket address is defined as a combination of a 32-bit Context Identifier (CID) and a 32-bit port number.  The CID identifies the source or destination, which is either a virtual machine or the host.  The port number differentiates between multiple services running on a single machine.
> +.PP
> +.in +4n
> +.EX
> +struct sockaddr_vm {
> +    sa_family_t     svm_family;     /* address family: AF_VSOCK */
> +    unsigned short  svm_reserved1;
> +    unsigned int    svm_port;       /* port in native byte order */
> +    unsigned int    svm_cid;        /* address in native byte order */
> +};
> +.EE
> +.in
> +.PP
> +.I svm_family
> +is always set to
> +.BR AF_VSOCK .
> +.I svm_reserved1
> +is always set to 0.
> +.I svm_port
> +contains the port in native byte order.
> +The port numbers below 1024 are called
> +.IR "privileged ports" .
> +Only a process with
> +.B CAP_NET_BIND_SERVICE
> +capability may
> +.BR bind (2)
> +to these port numbers.
> +.PP
> +There are several special addresses:
> +.B VMADDR_CID_ANY
> +(-1U)
> +means any address for binding;
> +.B VMADDR_CID_HYPERVISOR
> +(0) and
> +.B VMADDR_CID_RESERVED
> +(1) are unused addresses;
> +.B VMADDR_CID_HOST
> +(2)
> +is the well-known address of the host.
> +.PP
> +The special constant
> +.B VMADDR_PORT_ANY
> +(-1U)
> +means any port number for binding.
> +.SS Live migration
> +Sockets are affected by live migration of virtual machines.  Connected
> +.B SOCK_STREAM
> +sockets become disconnected when the virtual machine migrates to a new host.
> +Applications must reconnect when this happens.
> +.PP
> +The local CID may change across live migration if the old CID is not available
> +on the new host.  Bound sockets are automatically updated to the new CID.
> +.SS Ioctls
> +.TP
> +.B IOCTL_VM_SOCKETS_GET_LOCAL_CID
> +Get the CID of the local machine.  The argument is a pointer to an unsigned int.
> +.IP
> +.in +4n
> +.EX
> +.IB error " = ioctl(" socket ", " IOCTL_VM_SOCKETS_GET_LOCAL_CID ", "  
> ");"
> +.EE
> +.in
> +.IP
> +Consider using
> +.B VMADDR_CID_ANY
> +when binding instead of getting the local CID with
> +.B IOCTL_VM_SOCKETS_GET_LOCAL_CID .
> +.SH ERRORS
> +.TP
> +.B EACCES
> +Unable to bind to a privileged port without the
> +.B CAP_NET_BIND_SERVICE
> +capability.
> +.TP
> +.B EINVAL
> +Invalid parameters.  This includes:
> +attempting to bind a socket that is already bound, providing an invalid struct
> +.B sockaddr_vm ,
> +and other input validation errors.
> +.TP
> +.B EOPNOTSUPP
> +Operation not 

Re: aio poll, io_pgetevents and a new in-kernel poll API V2

2018-01-10 Thread Michael Kerrisk (man-pages)
Hi Christoph,

On 01/10/2018 04:58 PM, Christoph Hellwig wrote:
> Hi all,
> 
> this series adds support for the IOCB_CMD_POLL operation to poll for the
> readiness of file descriptors using the aio subsystem.  The API is based
> on patches that existed in RHAS2.1 and RHEL3, which means it already is
> supported by libaio.  To implement the poll support efficiently new
> methods to poll are introduced in struct file_operations:  get_poll_head
> and poll_mask.  The first one returns a wait_queue_head to wait on
> (lifetime is bound by the file), and the second does a non-blocking
> check for the POLL* events.  This allows aio poll to work without
> any additional context switches, unlike epoll.
> 
> To make the interface fully useful a new io_pgetevents system call is
> added, which atomically saves and restores the signal mask over the
> io_pgetevents system call.  It is the logical equivalent to pselect and
> ppoll for io_pgetevents.
> 
> The corresponding libaio changes for io_pgetevents support and
> documentation, as well as a test case will be posted in a separate
> series.
> 
> The changes were sponsored by Scylladb, and improve performance
> of the seastar framework up to 10%, while also removing the need
> for a privileged SCHED_FIFO epoll listener thread.
> 
> The patches are on top of Al's __poll_t annotations, so I've also
> prepared a git branch on top of those here:
> 
> git://git.infradead.org/users/hch/vfs.git aio-poll
> 
> Gitweb:
> 
> http://git.infradead.org/users/hch/vfs.git/shortlog/refs/heads/aio-poll.2
> 
> Libaio changes:
> 
> http://git.infradead.org/users/hch/libaio.git/shortlog/refs/heads/aio-poll
> 
> Seastar changes:
> 
> https://github.com/avikivity/seastar/commits/aio
> 
> Changes since V1:
>  - handle the NULL ->poll case in vfs_poll
>  - dropped the file argument to the ->poll_mask socket operation
>  - replace the ->pre_poll socket operation with ->get_poll_head as
>    in the file operations

Are there some man pages patches already for these changes?
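
For readers less familiar with the pattern the cover letter refers to, here is a minimal sketch of how ppoll(2) applies a temporary signal mask for the duration of a wait, which is the behaviour io_pgetevents is described as providing for AIO completions. It uses only ppoll(2) itself, not the proposed system call; the helper name read_with_timeout and the choice of SIGINT are assumptions made for illustration.

```c
#define _GNU_SOURCE             /* for ppoll() */
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <signal.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper: wait up to timeout_ms for fd to become readable,
 * then read from it.  While waiting, SIGINT is unblocked (even if the
 * caller has it blocked); ppoll() swaps the mask in and out atomically,
 * so there is no window in which the signal can be missed.
 * Returns the read() result, or -1 with errno set (ETIMEDOUT on timeout).
 */
ssize_t read_with_timeout(int fd, void *buf, size_t len, long timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    struct timespec ts;
    sigset_t during_wait;
    int ready;

    assert(timeout_ms >= 0);
    ts.tv_sec = timeout_ms / 1000;
    ts.tv_nsec = (timeout_ms % 1000) * 1000000L;

    /* Start from the caller's current mask and allow SIGINT during the wait. */
    if (sigprocmask(SIG_SETMASK, NULL, &during_wait) < 0)
        return -1;
    sigdelset(&during_wait, SIGINT);

    ready = ppoll(&pfd, 1, &ts, &during_wait);
    if (ready < 0)
        return -1;              /* errno set by ppoll(), e.g. EINTR */
    if (ready == 0) {
        errno = ETIMEDOUT;
        return -1;
    }
    return read(fd, buf, len);
}
```

A caller would typically block SIGINT with sigprocmask(2) beforehand, so that the signal can only be delivered while the process is inside the wait.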

Thanks,

Michael



-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: [PATCHv3 0/2] capability controlled user-namespaces

2017-12-30 Thread Michael Kerrisk (man-pages)
Hello Mahesh,

On 12/28/2017 01:45 AM, Mahesh Bandewar (महेश बंडेवार) wrote:
> On Wed, Dec 27, 2017 at 12:23 PM, Michael Kerrisk (man-pages)
> <mtk.manpa...@gmail.com> wrote:
>> Hello Mahesh,
>>
>> On 27 December 2017 at 18:09, Mahesh Bandewar (महेश बंडेवार)
>> <mahe...@google.com> wrote:
>>> Hello James,
>>>
>>> Seems like I missed your name to be added into the review of this
>>> patch series. Would you be willing to pull this into the security
>>> tree? Serge Hallyn has already ACKed it.
>>
>> We seem to have no formal documentation/specification of this feature.
>> I think that should be written up before this patch goes into
>> mainline...
>>
> absolutely. I have added enough information into the Documentation dir
> relevant to this feature (please look at the  individual patches),
> that could be used. I could help if needed.

Yes, but I think that the documentation is rather incomplete.
I'll also reply to the relevant Documentation thread.

See also some comments below about this commit message, which
should make things *much* easier for the reader.

>>> On Tue, Dec 5, 2017 at 2:30 PM, Mahesh Bandewar <mah...@bandewar.net> wrote:
>>>> From: Mahesh Bandewar <mahe...@google.com>
>>>>
>>>> TL;DR version
>>>> -------------
>>>> Creating a sandbox environment with namespaces is challenging
>>>> considering what these sandboxed processes can engage in, e.g.
>>>> CVE-2017-6074, CVE-2017-7184, CVE-2017-7308, to name just a few.
>>>> The current form of user-namespaces, however, if changed a bit, can allow
>>>> us to create a sandbox environment without locking down user-
>>>> namespaces.
>>>>
>>>> Detailed version
>>>> ----------------
>>>>
>>>> Problem
>>>> -------
>>>> User-namespaces in the current form have increased the attack surface as
>>>> any process can acquire capabilities which are not available to them (by
>>>> default) by performing a combination of clone()/unshare()/setns() syscalls.
>>>>
>>>> #define _GNU_SOURCE
>>>> #include <netinet/in.h>   /* IPPROTO_RAW */
>>>> #include <sched.h>        /* unshare(), CLONE_NEWUSER, CLONE_NEWNET */
>>>> #include <stdio.h>
>>>> #include <sys/socket.h>
>>>> #include <unistd.h>       /* close() */
>>>>
>>>> int main(int ac, char **av)
>>>> {
>>>>     int sock = -1;
>>>>
>>>>     printf("Attempting to open RAW socket before unshare()...\n");
>>>>     sock = socket(AF_INET6, SOCK_RAW, IPPROTO_RAW);
>>>>     if (sock < 0) {
>>>>         perror("socket() SOCK_RAW failed");
>>>>     } else {
>>>>         printf("Successfully opened RAW-Sock before unshare().\n");
>>>>         close(sock);
>>>>         sock = -1;
>>>>     }
>>>>
>>>>     /* Enter new user and network namespaces. */
>>>>     if (unshare(CLONE_NEWUSER | CLONE_NEWNET) < 0) {
>>>>         perror("unshare() failed");
>>>>         return 1;
>>>>     }
>>>>
>>>>     printf("Attempting to open RAW socket after unshare()...\n");
>>>>     sock = socket(AF_INET6, SOCK_RAW, IPPROTO_RAW);
>>>>     if (sock < 0) {
>>>>         perror("socket() SOCK_RAW failed");
>>>>     } else {
>>>>         printf("Successfully opened RAW-Sock after unshare().\n");
>>>>         close(sock);
>>>>         sock = -1;
>>>>     }
>>>>
>>>>     return 0;
>>>> }
>>>>
>>>> The above example shows how easy it is to acquire the NET_RAW capability;
>>>> once it is acquired, a process with malicious intent could exploit the
>>>> above-mentioned or similar issues, whether already discovered or not.

But you do not actually describe what the problem is. I think
it's not sufficient to simply refer to some CVEs.
Your mail message/commit should clearly describe what the issue is,
rather than leave the reader to decipher a bunch of CVEs, and derive
your concerns from those CVEs.

>>>> Note
>>>> that this is just an example and the problem/solution is not limited
>>>> to NET_RAW capability *only*.
>>>>
>>>> The easiest fix one can apply here is to lock-down user-namespaces which
>>>> many of the distros do (i.e. don't allow users to create user namespaces),
>>>> but unfortunately that prevents everyone from using them.
>>>>
>

Re: [PATCHv3 1/2] capability: introduce sysctl for controlled user-ns capability whitelist

2017-12-30 Thread Michael Kerrisk (man-pages)
Hello Mahesh,

On 12/05/2017 11:31 PM, Mahesh Bandewar wrote:
> From: Mahesh Bandewar <mahe...@google.com>
> 
> Add a sysctl variable kernel.controlled_userns_caps_whitelist. This
> takes input as capability mask expressed as two comma separated hex
> u32 words. The mask, however, is stored in kernel as kernel_cap_t type.

Just by the way, why is it not expressed as a 64-bit value? (The answer
to that question should, I think, be part of this commit message.)

> Any capabilities that are not part of this mask will be controlled and
> will not be allowed to processes in controlled user-ns.
> 
> Acked-by: Serge Hallyn 
> Signed-off-by: Mahesh Bandewar 
> ---
> v3:
>   Added couple of comments as requested by Serge Hallyn
> v2:
>   Rebase
> v1:
>   Initial submission
> 
>  Documentation/sysctl/kernel.txt | 21 ++
>  include/linux/capability.h      |  3 +++
>  kernel/capability.c             | 47 +
>  kernel/sysctl.c                 |  5 +
>  4 files changed, 76 insertions(+)
> 
> diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
> index 694968c7523c..a1d39dbae847 100644
> --- a/Documentation/sysctl/kernel.txt
> +++ b/Documentation/sysctl/kernel.txt
> @@ -25,6 +25,7 @@ show up in /proc/sys/kernel:
>  - bootloader_version  [ X86 only ]
>  - callhome            [ S390 only ]
>  - cap_last_cap
> +- controlled_userns_caps_whitelist
>  - core_pattern
>  - core_pipe_limit
>  - core_uses_pid
> @@ -187,6 +188,26 @@ CAP_LAST_CAP from the kernel.
>  
>  ==
>  
> +controlled_userns_caps_whitelist
> +
> +Capability mask that is whitelisted for "controlled" user namespaces.

How is a user-ns marked as "controlled"? Please clarify this.

> +Any capability that is missing from this mask will not be allowed to
> +any process that is attached to a controlled-userns. e.g. if CAP_NET_RAW
> +is not part of this mask, then processes running inside any controlled
> +userns's will not be allowed to perform action that needs CAP_NET_RAW
> +capability. However, processes that are attached to a parent user-ns
> +hierarchy that is *not* controlled and has CAP_NET_RAW can continue
> +performing those actions. User-namespaces are marked "controlled" at
> +the time of their creation based on the capabilities of the creator.
> +A process that does not have CAP_SYS_ADMIN will create user-namespaces
> +that are controlled.
> +
> +The value is expressed as two comma separated hex words (u32). This
> +sysctl is available in init-ns and users with CAP_SYS_ADMIN in init-ns
> +are allowed to make changes.

Could you add here a shell session that demonstrates the use of these
interfaces and how they allow/disallow capabilities?

Is there a way for a process to see whether it is in a controlled or an
uncontrolled user-ns? I think it would be good to explain that in this
doc patch.
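
To make the value format concrete, here is a small userspace sketch that parses such a two-word string and tests one capability bit. It is illustrative only: the helper names are invented, and the word order (first word holding bits 63..32) is an assumption, since the patch text above does not state it.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Value of CAP_NET_RAW in <linux/capability.h>, repeated here so the
 * sketch stands alone. */
#define EXAMPLE_CAP_NET_RAW 13

/*
 * Hypothetical helper: parse "hhhhhhhh,llllllll" into one 64-bit mask.
 * ASSUMPTION: the first word is the upper 32 bits, the second the lower.
 * Returns 0 on success, -1 on malformed input.
 */
int parse_caps_whitelist(const char *text, uint64_t *mask)
{
    unsigned int hi, lo;
    int consumed = 0;

    assert(text != NULL && mask != NULL);
    if (sscanf(text, "%8x,%8x%n", &hi, &lo, &consumed) != 2)
        return -1;
    /* Accept only end of string or a trailing newline after the words. */
    if (text[consumed] != '\0' && text[consumed] != '\n')
        return -1;
    *mask = ((uint64_t)hi << 32) | lo;
    return 0;
}

/* Nonzero if capability number cap is set in mask. */
int mask_has_cap(uint64_t mask, int cap)
{
    return cap >= 0 && cap < 64 && ((mask >> cap) & 1);
}
```

With the assumed word order, the string "0000003f,ffffdfff" would describe a whitelist in which CAP_NET_RAW (bit 13) is cleared while the neighbouring bits stay set.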

Thanks,

Michael

> +==
> +
>  core_pattern:
>  
>  core_pattern is used to specify a core dumpfile pattern name.
> diff --git a/include/linux/capability.h b/include/linux/capability.h
> index f640dcbc880c..7d79a4689625 100644
> --- a/include/linux/capability.h
> +++ b/include/linux/capability.h
> @@ -14,6 +14,7 @@
>  #define _LINUX_CAPABILITY_H
>  
>  #include 
> +#include 
>  
>  
>  #define _KERNEL_CAPABILITY_VERSION _LINUX_CAPABILITY_VERSION_3
> @@ -248,6 +249,8 @@ extern bool ptracer_capable(struct task_struct *tsk, struct user_namespace *ns);
>  
>  /* audit system wants to get cap info from files as well */
>  extern int get_vfs_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data *cpu_caps);
> +int proc_douserns_caps_whitelist(struct ctl_table *table, int write,
> +  void __user *buff, size_t *lenp, loff_t *ppos);
>  
>  extern int cap_convert_nscap(struct dentry *dentry, void **ivalue, size_t size);
>  
> diff --git a/kernel/capability.c b/kernel/capability.c
> index 1e1c0236f55b..4a859b7d4902 100644
> --- a/kernel/capability.c
> +++ b/kernel/capability.c
> @@ -29,6 +29,8 @@ EXPORT_SYMBOL(__cap_empty_set);
>  
>  int file_caps_enabled = 1;
>  
> +kernel_cap_t controlled_userns_caps_whitelist = CAP_FULL_SET;
> +
>  static int __init file_caps_disable(char *str)
>  {
>   file_caps_enabled = 0;
> @@ -507,3 +509,48 @@ bool ptracer_capable(struct task_struct *tsk, struct user_namespace *ns)
>   rcu_read_unlock();
>   return (ret == 0);
>  }
> +
> +/* Controlled-userns capabilities routines */
> +#ifdef CONFIG_SYSCTL
> +int proc_douserns_caps_whitelist(struct ctl_table *table, int write,
> +  void __user *buff, size_t *lenp, loff_t *ppos)
> +{
> + DECLARE_BITMAP(caps_bitmap, CAP_LAST_CAP);
> + struct ctl_table caps_table;
> + char tbuf[NAME_MAX];
> + int ret;
> +
> + ret = 

Re: [PATCHv3 0/2] capability controlled user-namespaces

2017-12-27 Thread Michael Kerrisk (man-pages)
Hello Mahesh,

On 27 December 2017 at 18:09, Mahesh Bandewar (महेश बंडेवार)
<mahe...@google.com> wrote:
> Hello James,
>
> Seems like I missed your name to be added into the review of this
> patch series. Would you be willing to pull this into the security
> tree? Serge Hallyn has already ACKed it.

We seem to have no formal documentation/specification of this feature.
I think that should be written up before this patch goes into
mainline...

Cheers,

Michael


>
> On Tue, Dec 5, 2017 at 2:30 PM, Mahesh Bandewar <mah...@bandewar.net> wrote:
>> From: Mahesh Bandewar <mahe...@google.com>
>>
>> TL;DR version
>> -------------
>> Creating a sandbox environment with namespaces is challenging
>> considering what these sandboxed processes can engage in, e.g.
>> CVE-2017-6074, CVE-2017-7184, CVE-2017-7308, to name just a few.
>> The current form of user-namespaces, however, if changed a bit, can allow
>> us to create a sandbox environment without locking down user-
>> namespaces.
>>
>> Detailed version
>> ----------------
>>
>> Problem
>> -------
>> User-namespaces in the current form have increased the attack surface as
>> any process can acquire capabilities which are not available to them (by
>> default) by performing a combination of clone()/unshare()/setns() syscalls.
>>
>> #define _GNU_SOURCE
>> #include <netinet/in.h>   /* IPPROTO_RAW */
>> #include <sched.h>        /* unshare(), CLONE_NEWUSER, CLONE_NEWNET */
>> #include <stdio.h>
>> #include <sys/socket.h>
>> #include <unistd.h>       /* close() */
>>
>> int main(int ac, char **av)
>> {
>>     int sock = -1;
>>
>>     printf("Attempting to open RAW socket before unshare()...\n");
>>     sock = socket(AF_INET6, SOCK_RAW, IPPROTO_RAW);
>>     if (sock < 0) {
>>         perror("socket() SOCK_RAW failed");
>>     } else {
>>         printf("Successfully opened RAW-Sock before unshare().\n");
>>         close(sock);
>>         sock = -1;
>>     }
>>
>>     /* Enter new user and network namespaces. */
>>     if (unshare(CLONE_NEWUSER | CLONE_NEWNET) < 0) {
>>         perror("unshare() failed");
>>         return 1;
>>     }
>>
>>     printf("Attempting to open RAW socket after unshare()...\n");
>>     sock = socket(AF_INET6, SOCK_RAW, IPPROTO_RAW);
>>     if (sock < 0) {
>>         perror("socket() SOCK_RAW failed");
>>     } else {
>>         printf("Successfully opened RAW-Sock after unshare().\n");
>>         close(sock);
>>         sock = -1;
>>     }
>>
>>     return 0;
>> }
>>
>> The above example shows how easy it is to acquire the NET_RAW capability;
>> once it is acquired, a process with malicious intent could exploit the
>> above-mentioned or similar issues, whether already discovered or not. Note
>> that this is just an example and the problem/solution is not limited
>> to the NET_RAW capability *only*.
>>
>> The easiest fix one can apply here is to lock-down user-namespaces which
>> many of the distros do (i.e. don't allow users to create user namespaces),
>> but unfortunately that prevents everyone from using them.
>>
>> Approach
>> --------
>> Introduce a notion of 'controlled' user-namespaces. Every process on
>> the host is allowed to create user-namespaces (governed by the limit
>> imposed by per-ns sysctl) however, mark user-namespaces created by
>> sandboxed processes as 'controlled'. Use this 'mark' at the time of
>> capability check in conjunction with a global capability whitelist.
>> If the capability is not whitelisted, processes that belong to
>> controlled user-namespaces will not be allowed.
>>
>> Once a user-ns is marked as 'controlled'; all its child user-
>> namespaces are marked as 'controlled' too.
>>
>> A global whitelist is list of capabilities governed by the
>> sysctl which is available to (privileged) user in init-ns to modify
>> while it's applicable to all controlled user-namespaces on the host.
>>
>> Marking user-namespaces controlled without modifying the whitelist is
>> equivalent of the current behavior. The default value of whitelist includes
>> all capabilities so that the compatibility is maintained. However it gives
>> admins fine-grained ability to control various capabilities system wide
>> without locking down user-namespaces.
>>
>> Please see individual patches in this series.
>>
>> Mahesh Bandewar (2):
>>   capability: introduce sysctl for controlled user-ns capability whitelist
>>   userns: control capabilities of some user namespaces
>>
>>  Documentation/sysctl/kernel.txt | 21 +
>>  include/linux/capability.h      |  7 ++
>>  include/linux/user_namespace.h  | 25
>>  kernel/capability.c             | 52 +
>>  kernel/sysctl.c                 |  5
>>  kernel/user_namespace.c         |  4
>>  security/commoncap.c            |  8 +++
>>  7 files changed, 122 insertions(+)
>>
>> --
>> 2.15.0.531.g2ccb3012c9-goog
>>



-- 

Re: [PATCH v2] vsock.7: document VSOCK socket address family

2017-12-11 Thread Michael Kerrisk (man-pages)
On 12/06/2017 03:06 PM, Jorgen S. Hansen wrote:
> 
>> On Dec 5, 2017, at 11:56 AM, Stefan Hajnoczi <stefa...@redhat.com> wrote:
>>
>> The AF_VSOCK address family has been available since Linux 3.9 without a
>> corresponding man page.
>>
>> This patch adds vsock.7 and describes its use along the same lines as
>> existing ip.7, unix.7, and netlink.7 man pages.
>>
>> CC: Jorgen Hansen <jhan...@vmware.com>
>> CC: Dexuan Cui <de...@microsoft.com>
>> Signed-off-by: Stefan Hajnoczi <stefa...@redhat.com>
>> ---
>> man7/vsock.7 | 180 +++
>> 1 file changed, 180 insertions(+)
>> create mode 100644 man7/vsock.7
>>
>> diff --git a/man7/vsock.7 b/man7/vsock.7
>> new file mode 100644
>> index 0..46dc561f5
>> --- /dev/null
>> +++ b/man7/vsock.7
>> @@ -0,0 +1,180 @@
>> +.TH VSOCK 7 2017-11-30 "Linux" "Linux Programmer's Manual"
>> +.SH NAME
>> +vsock \- Linux VSOCK address family
>> +.SH SYNOPSIS
>> +.B #include <sys/socket.h>
>> +.br
>> +.B #include <linux/vm_sockets.h>
>> +.PP
>> +.IB stream_socket " = socket(AF_VSOCK, SOCK_STREAM, 0);"
>> +.br
>> +.IB datagram_socket " = socket(AF_VSOCK, SOCK_DGRAM, 0);"
>> +.SH DESCRIPTION
>> +The VSOCK address family facilitates communication between virtual machines and
>> +the host they are running on.  This address family is used by guest agents and
>> +hypervisor services that need a communications channel that is independent of
>> +virtual machine network configuration.
>> +.PP
>> +Valid socket types are
>> +.B SOCK_STREAM
>> +and
>> +.BR SOCK_DGRAM .
>> +.B SOCK_STREAM
>> +provides connection-oriented byte streams with guaranteed, in-order delivery.
>> +.B SOCK_DGRAM
>> +provides a connectionless datagram packet service with best-effort delivery and
>> +best-effort ordering.  Availability of these socket types is dependent on the
>> +underlying hypervisor.
>> +.PP
>> +A new socket is created with
>> +.PP
>> +socket(AF_VSOCK, socket_type, 0);
>> +.PP
>> +When a process wants to establish a connection it calls
>> +.BR connect (2)
>> +with a given destination socket address.  The socket is automatically bound to
>> +a free port if unbound.
>> +.PP
>> +A process can listen for incoming connections by first binding to a socket
>> +address using
>> +.BR bind (2)
>> +and then calling
>> +.BR listen (2).
>> +.PP
>> +Data is transferred using the usual
>> +.BR send (2)
>> +and
>> +.BR recv (2)
>> +family of socket system calls.
>> +.SS Address format
>> +A socket address is defined as a combination of a 32-bit Context Identifier
>> +(CID) and a 32-bit port number.  The CID identifies the source or destination,
>> +which is either a virtual machine or the host.  The port number differentiates
>> +between multiple services running on a single machine.
>> +.PP
>> +.in +4n
>> +.EX
>> +struct sockaddr_vm {
>> +    sa_family_t     svm_family;     /* address family: AF_VSOCK */
>> +    unsigned short  svm_reserved1;
>> +    unsigned int    svm_port;       /* port in native byte order */
>> +    unsigned int    svm_cid;        /* address in native byte order */
>> +};
>> +.EE
>> +.in
>> +.PP
>> +.I svm_family
>> +is always set to
>> +.BR AF_VSOCK .
>> +.I svm_reserved1
>> +is always set to 0.
>> +.I svm_port
>> +contains the port in native byte order.
>> +The port numbers below 1024 are called
>> +.IR "privileged ports" .
>> +Only a process with
>> +.B CAP_NET_BIND_SERVICE
>> +capability may
>> +.BR bind (2)
>> +to these port numbers.
>> +.PP
>> +There are several special addresses:
>> +.B VMADDR_CID_ANY
>> +(-1U)
>> +means any address for binding;
>> +.B VMADDR_CID_HYPERVISOR
>> +(0) is reserved for services built into the hypervisor;
>> +.B VMADDR_CID_RESERVED
>> +(1) must not be used;
>> +.B VMADDR_CID_HOST
>> +(2)
>> +is the well-known address of the host.
>> +.PP
>> +The special constant
>> +.B VMADDR_PORT_ANY
>> +(-1U)
>> +means any port number for binding.
>> +.SS Live migration
>> +Sockets are affected by live migration of virtual machines.  Connected
>> +.B SOCK_STREAM
>> +sockets become disconnected when the virtual machine migrates to a new host.
>> +Applications must reconnect when this happens.
>> +.PP
>> +The local CID may change across live migration if the old CID is not available
>> +on the new host.  Bound sockets are automatically updated to the new CID.
>> +.SS Ioctls
>> +.TP
>> +.B IOCTL_VM_SOCKETS_GET_LOCAL_CID
>> +Get the CID of the local machine.  The argument is a pointer to an unsigned int.
>> +.IP
>> +.in +4n
>> +.EX
>> +.IB error " = ioctl(" socket ", " IOCTL_VM_SOCKETS_GET_LOCAL_CID ", "  
>> ");"
>> +.EE
>> +.in
>> +.IP
>> +Consider using
>> +.B VMADDR_CID_ANY
>> +when binding instead of getting the local CID with
>> +.BR IOCTL_VM_SOCKETS_GET_LOCAL_CID .
>> +.SH ERRORS
>> +.TP
>> +.B EACCES
>> +Unable to bind to a privileged port without the
>> +.B CAP_NET_BIND_SERVICE
>> +capability.
>> +.TP
>> +.B EINVAL
>> +Invalid parameters.  This includes:
>> +attempting to bind a socket that is 

Re: [PATCH v2] vsock.7: document VSOCK socket address family

2017-12-11 Thread Michael Kerrisk (man-pages)
Hello Stefan,

Thanks for this page!

I have applied your patch, and made a few tweaks, but
I have some minor questions. Please see below.

On 12/05/2017 11:56 AM, Stefan Hajnoczi wrote:
> The AF_VSOCK address family has been available since Linux 3.9 without a
> corresponding man page.
> 
> This patch adds vsock.7 and describes its use along the same lines as
> existing ip.7, unix.7, and netlink.7 man pages.
> 
> CC: Jorgen Hansen <jhan...@vmware.com>
> CC: Dexuan Cui <de...@microsoft.com>
> Signed-off-by: Stefan Hajnoczi <stefa...@redhat.com>
> ---
>  man7/vsock.7 | 180 +++
>  1 file changed, 180 insertions(+)
>  create mode 100644 man7/vsock.7
> 
> diff --git a/man7/vsock.7 b/man7/vsock.7
> new file mode 100644
> index 0..46dc561f5
> --- /dev/null
> +++ b/man7/vsock.7
> @@ -0,0 +1,180 @@
> +.TH VSOCK 7 2017-11-30 "Linux" "Linux Programmer's Manual"
> +.SH NAME
> +vsock \- Linux VSOCK address family
> +.SH SYNOPSIS
> +.B #include <sys/socket.h>
> +.br
> +.B #include <linux/vm_sockets.h>
> +.PP
> +.IB stream_socket " = socket(AF_VSOCK, SOCK_STREAM, 0);"
> +.br
> +.IB datagram_socket " = socket(AF_VSOCK, SOCK_DGRAM, 0);"
> +.SH DESCRIPTION
> +The VSOCK address family facilitates communication between virtual machines and
> +the host they are running on.  This address family is used by guest agents and
> +hypervisor services that need a communications channel that is independent of
> +virtual machine network configuration.
> +.PP
> +Valid socket types are
> +.B SOCK_STREAM
> +and
> +.BR SOCK_DGRAM .
> +.B SOCK_STREAM
> +provides connection-oriented byte streams with guaranteed, in-order delivery.
> +.B SOCK_DGRAM
> +provides a connectionless datagram packet service with best-effort delivery and
> +best-effort ordering.  Availability of these socket types is dependent on the
> +underlying hypervisor.
> +.PP
> +A new socket is created with
> +.PP
> +socket(AF_VSOCK, socket_type, 0);
> +.PP
> +When a process wants to establish a connection it calls
> +.BR connect (2)
> +with a given destination socket address.  The socket is automatically bound 
> to
> +a free port if unbound.
> +.PP
> +A process can listen for incoming connections by first binding to a socket
> +address using
> +.BR bind (2)
> +and then calling
> +.BR listen (2).
> +.PP
> +Data is transferred using the usual
> +.BR send (2)
> +and
> +.BR recv (2)

Or equally, write(2) and read(2), right? By failing to mention those, the
text subtly implies that send(2) and recv(2) are preferred, but I don't
suppose that is true.

> +family of socket system calls.
> +.SS Address format
> +A socket address is defined as a combination of a 32-bit Context Identifier
> +(CID) and a 32-bit port number.  The CID identifies the source or 
> destination,
> +which is either a virtual machine or the host.  The port number 
> differentiates
> +between multiple services running on a single machine.
> +.PP
> +.in +4n
> +.EX
> +struct sockaddr_vm {
> +    sa_family_t    svm_family;     /* address family: AF_VSOCK */
> +    unsigned short svm_reserved1;
> +    unsigned int   svm_port;       /* port in native byte order */
> +    unsigned int   svm_cid;        /* address in native byte order */
> +};
> +.EE
> +.in
> +.PP
> +.I svm_family
> +is always set to
> +.BR AF_VSOCK .
> +.I svm_reserved1
> +is always set to 0.
> +.I svm_port
> +contains the port in native byte order.
> +The port numbers below 1024 are called
> +.IR "privileged ports" .
> +Only a process with
> +.B CAP_NET_BIND_SERVICE
> +capability may
> +.BR bind (2)
> +to these port numbers.
> +.PP
> +There are several special addresses:
> +.B VMADDR_CID_ANY
> +(-1U)
> +means any address for binding;
> +.B VMADDR_CID_HYPERVISOR
> +(0) is reserved for services built into the hypervisor;
> +.B VMADDR_CID_RESERVED
> +(1) must not be used;
> +.B VMADDR_CID_HOST
> +(2)
> +is the well-known address of the host.
> +.PP
> +The special constant
> +.B VMADDR_PORT_ANY
> +(-1U)
> +means any port number for binding.
> +.SS Live migration
> +Sockets are affected by live migration of virtual machines.  Connected
> +.B SOCK_STREAM
> +sockets become disconnected when the virtual machine migrates to a new host.
> +Applications must reconnect when this happens.
> +.PP
> +The local CID may change across live migration if the old CID is not 
> available
> +on the new host.  Bound sockets are automatically updated to the new CID.
> +.SS Ioctls
> +.TP
> +.B IOCTL_VM_SOCKETS_GET_LOCAL_CID
> +Get the CID of the local machine.  The argument is a pointer to an unsigned 
> int.
> +.IP
> +.in +4n
> +.EX
> +.IB error " = ioctl(" socket ", " IOCTL_VM_SOCKETS_GET_LOCAL_CID ", " &cid ");"
> +.EE
> +.in
> +.IP
> +Consider using
> +.B VMADDR_CID_ANY
> +when binding instead of getting the local CID with
> +.BR IOCTL_VM_SOCKETS_GET_LOCAL_CID .
> +.SH ERRORS
> +.TP
> +.B EACCES
> +Unable to bind to a privileged port without the
> +.B CAP_NET_BIND_SERVICE
> +capability.
> +.TP
> +.B EINVAL
> +Invalid 

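For readers following the address format described in the patch, here is a minimal sketch of filling in a struct sockaddr_vm and connecting a stream socket to it. The helper names vsock_addr() and vsock_connect() and the port number 1234 are illustrative assumptions, not part of the page; whether the connection succeeds depends on the hypervisor and on a service listening at that port.

```c
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/vm_sockets.h>

/* Build a vsock address; unused and reserved fields are zeroed.
 * (Hypothetical helper, not part of any library.) */
struct sockaddr_vm vsock_addr(unsigned int cid, unsigned int port)
{
    struct sockaddr_vm addr;

    memset(&addr, 0, sizeof addr);
    addr.svm_family = AF_VSOCK;
    addr.svm_cid = cid;     /* native byte order, no htonl() */
    addr.svm_port = port;   /* native byte order, no htons() */
    return addr;
}

/* Connect a SOCK_STREAM vsock socket to cid:port.
 * Returns the connected descriptor, or -1 with errno set. */
int vsock_connect(unsigned int cid, unsigned int port)
{
    struct sockaddr_vm addr = vsock_addr(cid, port);
    int fd = socket(AF_VSOCK, SOCK_STREAM, 0);

    if (fd == -1)
        return -1;
    if (connect(fd, (struct sockaddr *)&addr, sizeof addr) == -1) {
        close(fd);
        return -1;
    }
    return fd;
}
```

A guest agent would typically call vsock_connect(VMADDR_CID_HOST, 1234) to reach a service on the host; a server would instead bind(2) to vsock_addr(VMADDR_CID_ANY, 1234) and call listen(2).
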
Re: Incorrect behaviour or documentation problem of SO_RXQ_OVFL

2017-11-20 Thread Michael Kerrisk (man-pages)
[Adding Neil, who wrote the original text. Maybe he has also some
suggested improvement.]

Hello Petr and Tobias,

Thank you both for your reports about the incorrect documentation. See below.

On 15 November 2017 at 16:14, Petr Malat  wrote:
> Hi!
> Generic SO_RXQ_OVFL helpers sock_skb_set_dropcount() and sock_recv_drops()
> implement returning of sk->sk_drops (the total number of dropped packets),
> although the documentation says the number of dropped packets since the
> last received one should be returned (quoting the current socket.7):
>   SO_RXQ_OVFL (since Linux 2.6.33)
>   Indicates that an unsigned 32-bit value ancillary message (cmsg)
>   should be attached to received skbs indicating the number of packets
>   dropped by the socket between the last received packet and this
>   received packet.
>
> I assume the documentation needs to be updated, as fixing this in the
> code could break programs depending on the current behavior, although
> the formerly planned functionality seems to be more useful.
>
> The problem can be revealed with the following program:
>
> #include <arpa/inet.h>
> #include <netinet/in.h>
> #include <stdio.h>
> #include <string.h>
> #include <sys/socket.h>
> #include <sys/types.h>
> #include <sys/uio.h>
>
> int extract_drop(struct msghdr *msg)
> {
>     struct cmsghdr *cmsg;
>     int rtn;
>
>     for (cmsg = CMSG_FIRSTHDR(msg); cmsg; cmsg = CMSG_NXTHDR(msg,cmsg)) {
>         if (cmsg->cmsg_level == SOL_SOCKET &&
>             cmsg->cmsg_type == SO_RXQ_OVFL) {
>             memcpy(&rtn, CMSG_DATA(cmsg), sizeof rtn);
>             return rtn;
>         }
>     }
>     return -1;
> }
>
> int main(int argc, char *argv[])
> {
>     struct sockaddr_in addr = { .sin_family = AF_INET };
>     char msg[48*1024], cmsgbuf[256];
>     struct iovec iov = { .iov_base = msg, .iov_len = sizeof msg };
>     int sk1, sk2, i, one = 1;
>
>     sk1 = socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP);
>     sk2 = socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP);
>
>     inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);
>     addr.sin_port = htons(5);
>
>     bind(sk1, (struct sockaddr*)&addr, sizeof addr);
>     connect(sk2, (struct sockaddr*)&addr, sizeof addr);
>
>     // Kernel doubles this limit, but it accounts also the SKB overhead,
>     // but it receives as long as there is at least 1 byte free.
>     i = sizeof msg;
>     setsockopt(sk1, SOL_SOCKET, SO_RCVBUF, &i, sizeof i);
>     setsockopt(sk1, SOL_SOCKET, SO_RXQ_OVFL, &one, sizeof one);
>
>     for (i = 0; i < 4; i++) {
>         int rtn;
>
>         send(sk2, msg, sizeof msg, 0);
>         send(sk2, msg, sizeof msg, 0);
>         send(sk2, msg, sizeof msg, 0);
>
>         do {
>             struct msghdr msghdr = {
>                 .msg_iov = &iov, .msg_iovlen = 1,
>                 .msg_control = &cmsgbuf,
>                 .msg_controllen = sizeof cmsgbuf };
>             rtn = recvmsg(sk1, &msghdr, MSG_DONTWAIT);
>             if (rtn > 0) {
>                 printf("rtn: %d drop %d\n", rtn,
>                     extract_drop(&msghdr));
>             } else {
>                 printf("rtn: %d\n", rtn);
>             }
>         } while (rtn > 0);
>     }
>
>     return 0;
> }
>
> which prints
>   rtn: 49152 drop -1
>   rtn: 49152 drop -1
>   rtn: -1
>   rtn: 49152 drop 1
>   rtn: 49152 drop 1
>   rtn: -1
>   rtn: 49152 drop 2
>   rtn: 49152 drop 2
>   rtn: -1
>   rtn: 49152 drop 3
>   rtn: 49152 drop 3
>   rtn: -1
> although it should print (according to the documentation):
>   rtn: 49152 drop 0
>   rtn: 49152 drop 0
>   rtn: -1
>   rtn: 49152 drop 1
>   rtn: 49152 drop 0
>   rtn: -1
>   rtn: 49152 drop 1
>   rtn: 49152 drop 0
>   rtn: -1
>   rtn: 49152 drop 1
>   rtn: 49152 drop 0
>   rtn: -1
>
> Please keep me on To:/CC: as I'm not on the list.

Thanks for the test program. Tobias reported the same issue, and I've
applied his suggested change to the page. (See below.)

Cheers,

Michael

diff --git a/man7/socket.7 b/man7/socket.7
index 79966a6fd..1a2cfe9cc 100644
--- a/man7/socket.7
+++ b/man7/socket.7
@@ -881,8 +881,7 @@ compete to receive datagrams on the same socket.
 .\" commit 3b885787ea4112eaa80945999ea0901bf742707f
 Indicates that an unsigned 32-bit value ancillary message (cmsg)
 should be attached to received skbs indicating
-the number of packets dropped by the socket between
-the last received packet and this received packet.
+the number of packets dropped by the socket since its creation.
 .TP
 .B SO_SNDBUF
 Sets or gets the maximum socket send buffer in bytes.


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: Bug in socket(7) man page

2017-11-20 Thread Michael Kerrisk (man-pages)
[CC widened]

Tobias,

On 7 August 2017 at 13:53, Tobias Klausmann  wrote:
> Hi!
>
> This bug pertains to the manpage as visible on man7.org right
> now.
>
> The socket(7) man page has this paragraph:
>
>SO_RXQ_OVFL (since Linux 2.6.33)
>   Indicates that an unsigned 32-bit value ancillary message (cmsg)
>   should be attached to received skbs indicating the number of
>   packets dropped by the socket between the last received packet
>   and this received packet.
>
> The second half is wrong: the counter (internally,
> SOCK_SKB_CB(skb)->dropcount) is *not* reset after every packet.
> That is, it is a proper counter, not a gauge, in monitoring
> parlance.
>
> A better version of that paragraph:
>
>SO_RXQ_OVFL (since Linux 2.6.33)
>   Indicates that an unsigned 32-bit value ancillary message (cmsg)
>   should be attached to received skbs indicating the number of
>   packets dropped by the socket since its creation.

Thanks for the report. See also my reply to Petr in just a moment.
I've taken your suggested text change.
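
Since the value is a running counter rather than a per-packet gauge, an application that wants "drops since the previous datagram" has to compute the difference itself. A minimal sketch, assuming the caller keeps the previous counter value around; drops_since() is a hypothetical helper name, not an existing API:

```c
#include <stdint.h>

/* Number of packets dropped between two SO_RXQ_OVFL readings.
 * The counter is an unsigned 32-bit value, so the subtraction is done
 * modulo 2^32 and stays correct across a wraparound of the counter. */
uint32_t drops_since(uint32_t previous, uint32_t current)
{
    return current - previous;
}
```

The first reading on a socket can be compared against 0, because the counter starts at zero when the socket is created.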

Cheers,

Michael


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: [patch] netlink.7: srcfix Change buffer size in example code about reading netlink message.

2017-08-15 Thread Michael Kerrisk (man-pages)
On 11/14/2016 11:36 PM, Rick Jones wrote:
>> Let's change the example so others don't propagate the problem further.
>>
>> Signed-off-by David Wilder 
>>
>> --- man7/netlink.7.orig 2016-11-14 13:30:36.522101156 -0800
>> +++ man7/netlink.7  2016-11-14 13:30:51.002086354 -0800
>> @@ -511,7 +511,7 @@
>>  .in +4n
>>  .nf
>>  int len;
>> -char buf[4096];
>> +char buf[8192];
> 
> Since there doesn't seem to be a define one could use in the user space 
> linux/netlink.h (?), but there are comments in the example code in the 
> manpage, how about also including a brief comment to the effect that 
> using 8192 bytes will avoid message truncation problems on platforms 
> with a large PAGE_SIZE?
> 
> /* avoid msg truncation on > 4096 byte PAGE_SIZE platforms */
> 
> or something like that.

Thanks for the suggestion, Rick. Done!

Cheers,

Michael


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: [patch] netlink.7: srcfix Change buffer size in example code about reading netlink message.

2017-08-15 Thread Michael Kerrisk (man-pages)
On 11/14/2016 11:20 PM, dwilder wrote:
> The example code in netlink(7) (for reading netlink message) suggests
> using a 4k read buffer with recvmsg.  This can cause truncated messages
> on systems using a page size >4096.  Please see:
> linux/include/linux/netlink.h (in the kernel source)
> 
> 
> /*
>   *  skb should fit one page. This choice is good for headerless malloc.
>   *  But we should limit to 8K so that userspace does not have to
>   *  use enormous buffer sizes on recvmsg() calls just to avoid
>   *  MSG_TRUNC when PAGE_SIZE is very large.
>   */
> #if PAGE_SIZE < 8192UL
> #define NLMSG_GOODSIZE  SKB_WITH_OVERHEAD(PAGE_SIZE)
> #else
> #define NLMSG_GOODSIZE  SKB_WITH_OVERHEAD(8192UL)
> #endif
> 
> #define NLMSG_DEFAULT_SIZE (NLMSG_GOODSIZE - NLMSG_HDRLEN)
> 
> 
> I was troubleshooting some up-stream code on a ppc64le system
> (page size of 64k). This code had duplicated the example from
> netlink(7) and was using a 4k buffer.  On x86-64 with a 4k page size
> this is not a problem, however on the 64k page system some messages
> were truncated.  Using an 8k buffer as implied in netlink.h prevents
> problems with any page size.
> 
> Let's change the example so others don't propagate the problem further.
> 
> Signed-off-by David Wilder 

Thanks, David. Patch applied.
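
For anyone adapting the example, a minimal sketch of a receive helper that uses the 8192-byte buffer size and also reports truncation through MSG_TRUNC in msg_flags, so a too-small buffer is noticed rather than silently accepted. The names NL_RECV_BUFSIZE and recv_checked() are assumptions made for illustration; only recvmsg(2) and MSG_TRUNC come from the system headers.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Buffer size suggested in the thread: large enough for
 * NLMSG_GOODSIZE on any page size. (Hypothetical constant name.) */
enum { NL_RECV_BUFSIZE = 8192 };

/* Receive one datagram into buf. On success returns the number of
 * bytes stored and sets *truncated to 1 if the kernel had to cut
 * the message short, 0 otherwise. Returns -1 on error. */
ssize_t recv_checked(int fd, void *buf, size_t len, int *truncated)
{
    struct iovec iov = { .iov_base = buf, .iov_len = len };
    struct msghdr msg;
    ssize_t n;

    memset(&msg, 0, sizeof msg);
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;

    n = recvmsg(fd, &msg, 0);
    if (n >= 0 && truncated != NULL)
        *truncated = (msg.msg_flags & MSG_TRUNC) != 0;
    return n;
}
```

With a netlink socket the call would look like recv_checked(fd, buf, NL_RECV_BUFSIZE, &truncated), where buf is a char array of NL_RECV_BUFSIZE bytes.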

Cheers,

Michael


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: [patch] socket.7: Document SO_INCOMING_CPU

2017-04-20 Thread Michael Kerrisk (man-pages)
On 04/19/2017 10:13 PM, Eric Dumazet wrote:
> On Wed, 2017-04-19 at 20:48 +0200, Michael Kerrisk (man-pages) wrote:
>> Hi Eric,
>>
>> [reordering for clarity]
>>
>>>> On 02/19/2017 09:55 PM, Michael Kerrisk (man-pages) wrote:
>>>>> [CC += Eric, so that he might review]
>>>>>
>>>>> Hello Francois,
>>>>>
>>>>> On 02/18/2017 05:06 AM, Francois Saint-Jacques wrote:
>>>>>> This socket option is undocumented. Applies on the latest version
>>>>>> (man-pages-4.09-511).
>>>>>>
>>>>>> diff --git a/man7/socket.7 b/man7/socket.7
>>>>>> index 3efd7a5d8..1a3ffa253 100644
>>>>>> --- a/man7/socket.7
>>>>>> +++ b/man7/socket.7
>>>>>> @@ -490,6 +490,26 @@ flag on a socket
>>>>>>  operation.
>>>>>>  Expects an integer boolean flag.
>>>>>>  .TP
>>>>>> +.BR SO_INCOMING_CPU " (getsockopt since Linux 3.19, setsockopt since
>>>>>> Linux 4.4)"
>>>>>> +.\" getsocktop 2c8c56e15df3d4c2af3d656e44feb18789f75837
>>>>>> +.\" setsocktop 70da268b569d32a9fddeea85dc18043de9d89f89
>>>>>> +Sets or gets the cpu affinity of a socket. Expects an integer flag.
>>>>>> +.sp
>>>>>> +.in +4n
>>>>>> +.nf
>>>>>> +int cpu = 1;
>>>>>> +socklen_t len = sizeof(cpu);
>>>>>> +setsockopt(fd, SOL_SOCKET, SO_INCOMING_CPU, &cpu, len);
>>>>>> +.fi
>>>>>> +.in
>>>>>> +.sp
>>>>>> +The typical use case is one listener per RX queue, as the associated 
>>>>>> listener
>>>>>> +should only accept flows handled in softirq by the same cpu.  This 
>>>>>> provides
>>>>>> +optimal NUMA behavior and keep cpu caches hot.
>>>>>> +.TP
>>>>>>  .B SO_KEEPALIVE
>>>>>>  Enable sending of keep-alive messages on connection-oriented sockets.
>>>>>>  Expects an integer boolean flag.
>>>>>
>>>>> Thank you! Patch applied.
>>>>>
>>>>> I have tried to enhance the description somewhat. I'm not sure whether
>>>>> what I've written is quite correct (or whether it should be further
>>>>> extended). Eric, could you please take a look at the following, and let 
>>>>> me know if anything needs fixing:
>>>>>
>>>>>SO_INCOMING_CPU  (gettable  since Linux 3.19, settable since Linux
>>>>>4.4)
>>>>>   Sets or gets the CPU affinity  of  a  socket.   Expects  an
>>>>>   integer flag.
>>>>>
>>>>>   int cpu = 1;
>>>>>   socklen_t len = sizeof(cpu);
>>>>>   setsockopt(fd, SOL_SOCKET, SO_INCOMING_CPU, &cpu, len);
>>>>>
>>>>>   Because  all  of the packets for a single stream (i.e., all
>>>>>   packets for the same 4-tuple) arrive on the single RX queue
>>>>>   that  is  associated with a particular CPU, the typical use
>>>>>   case is to employ one listening process per RX queue,  with
>>>>>   the  incoming  flow being handled by a listener on the same
>>>>>   CPU that is handling the RX queue.  This  provides  optimal
>>>>>   NUMA behavior and keeps CPU caches hot.
>>
>>> Hi Michael
>>>
>>> Sorry for the delay.
>>
>> Thanks for the reply, but I think you are assuming I know more than 
>> I do. I'd like you to elaborate a little please. See below.
>>
>>> Note that setting the option is not supported if SO_REUSEPORT is used.
>>
>> Please define "not supported". Does this yield an API diagnostic?
>> If so, what is it?
>>
>>> Socket will be selected from an array, either by a hash or BPF program
>>> that has no access to this information.
>>
>> Sorry -- I'm lost here. How does this comment relate to the proposed
>> man page text above?
> 
> Simply that :
> 
> If an application uses both SO_INCOMING_CPU and SO_REUSEPORT, then
> SO_REUSEPORT logic, selecting the socket to receive the packet, ignores
> SO_INCOMING_CPU setting.
> 
> This does not need to be documented, because it is an implementation
> detail/bug that could be changed, if someone cares enough.

Okay, thanks, Eric. I'll just merge the page text as it currently 
is then.

Cheers,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: [patch] socket.7: Document SO_INCOMING_CPU

2017-04-19 Thread Michael Kerrisk (man-pages)
Hi Eric,

[reordering for clarity]

>> On 02/19/2017 09:55 PM, Michael Kerrisk (man-pages) wrote:
>>> [CC += Eric, so that he might review]
>>>
>>> Hello Francois,
>>>
>>> On 02/18/2017 05:06 AM, Francois Saint-Jacques wrote:
>>>> This socket option is undocumented. Applies on the latest version
>>>> (man-pages-4.09-511).
>>>>
>>>> diff --git a/man7/socket.7 b/man7/socket.7
>>>> index 3efd7a5d8..1a3ffa253 100644
>>>> --- a/man7/socket.7
>>>> +++ b/man7/socket.7
>>>> @@ -490,6 +490,26 @@ flag on a socket
>>>>  operation.
>>>>  Expects an integer boolean flag.
>>>>  .TP
>>>> +.BR SO_INCOMING_CPU " (getsockopt since Linux 3.19, setsockopt since
>>>> Linux 4.4)"
>>>> +.\" getsocktop 2c8c56e15df3d4c2af3d656e44feb18789f75837
>>>> +.\" setsocktop 70da268b569d32a9fddeea85dc18043de9d89f89
>>>> +Sets or gets the cpu affinity of a socket. Expects an integer flag.
>>>> +.sp
>>>> +.in +4n
>>>> +.nf
>>>> +int cpu = 1;
>>>> +socklen_t len = sizeof(cpu);
>>>> +setsockopt(fd, SOL_SOCKET, SO_INCOMING_CPU, &cpu, len);
>>>> +.fi
>>>> +.in
>>>> +.sp
>>>> +The typical use case is one listener per RX queue, as the associated 
>>>> listener
>>>> +should only accept flows handled in softirq by the same cpu.  This 
>>>> provides
>>>> +optimal NUMA behavior and keep cpu caches hot.
>>>> +.TP
>>>>  .B SO_KEEPALIVE
>>>>  Enable sending of keep-alive messages on connection-oriented sockets.
>>>>  Expects an integer boolean flag.
>>>
>>> Thank you! Patch applied.
>>>
>>> I have tried to enhance the description somewhat. I'm not sure whether
>>> what I've written is quite correct (or whether it should be further
>>> extended). Eric, could you please take a look at the following, and let 
>>> me know if anything needs fixing:
>>>
>>>SO_INCOMING_CPU  (gettable  since Linux 3.19, settable since Linux
>>>4.4)
>>>   Sets or gets the CPU affinity  of  a  socket.   Expects  an
>>>   integer flag.
>>>
>>>   int cpu = 1;
>>>   socklen_t len = sizeof(cpu);
>>>   setsockopt(fd, SOL_SOCKET, SO_INCOMING_CPU, &cpu, len);
>>>
>>>   Because  all  of the packets for a single stream (i.e., all
>>>   packets for the same 4-tuple) arrive on the single RX queue
>>>   that  is  associated with a particular CPU, the typical use
>>>   case is to employ one listening process per RX queue,  with
>>>   the  incoming  flow being handled by a listener on the same
>>>   CPU that is handling the RX queue.  This  provides  optimal
>>>   NUMA behavior and keeps CPU caches hot.

> Hi Michael
> 
> Sorry for the delay.

Thanks for the reply, but I think you are assuming I know more than 
I do. I'd like you to elaborate a little please. See below.

> Note that setting the option is not supported if SO_REUSEPORT is used.

Please define "not supported". Does this yield an API diagnostic?
If so, what is it?

> Socket will be selected from an array, either by a hash or BPF program
> that has no access to this information.

Sorry -- I'm lost here. How does this comment relate to the proposed
man page text above?

Thanks,

Michael


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: [patch] socket.7: Document SO_INCOMING_CPU

2017-04-19 Thread Michael Kerrisk (man-pages)
Ping Eric!

Would you have a chance to review the proposed text below, please.

Thanks,

Michael

On 02/19/2017 09:55 PM, Michael Kerrisk (man-pages) wrote:
> [CC += Eric, so that he might review]
> 
> Hello Francois,
> 
> On 02/18/2017 05:06 AM, Francois Saint-Jacques wrote:
>> This socket option is undocumented. Applies on the latest version
>> (man-pages-4.09-511).
>>
>> diff --git a/man7/socket.7 b/man7/socket.7
>> index 3efd7a5d8..1a3ffa253 100644
>> --- a/man7/socket.7
>> +++ b/man7/socket.7
>> @@ -490,6 +490,26 @@ flag on a socket
>>  operation.
>>  Expects an integer boolean flag.
>>  .TP
>> +.BR SO_INCOMING_CPU " (getsockopt since Linux 3.19, setsockopt since
>> Linux 4.4)"
>> +.\" getsocktop 2c8c56e15df3d4c2af3d656e44feb18789f75837
>> +.\" setsocktop 70da268b569d32a9fddeea85dc18043de9d89f89
>> +Sets or gets the cpu affinity of a socket. Expects an integer flag.
>> +.sp
>> +.in +4n
>> +.nf
>> +int cpu = 1;
>> +socklen_t len = sizeof(cpu);
>> +setsockopt(fd, SOL_SOCKET, SO_INCOMING_CPU, &cpu, len);
>> +.fi
>> +.in
>> +.sp
>> +The typical use case is one listener per RX queue, as the associated 
>> listener
>> +should only accept flows handled in softirq by the same cpu.  This provides
>> +optimal NUMA behavior and keep cpu caches hot.
>> +.TP
>>  .B SO_KEEPALIVE
>>  Enable sending of keep-alive messages on connection-oriented sockets.
>>  Expects an integer boolean flag.
> 
> Thank you! Patch applied.
> 
> I have tried to enhance the description somewhat. I'm not sure whether
> what I've written is quite correct (or whether it should be further
> extended). Eric, could you please take a look at the following, and let 
> me know if anything needs fixing:
> 
>SO_INCOMING_CPU  (gettable  since Linux 3.19, settable since Linux
>4.4)
>   Sets or gets the CPU affinity  of  a  socket.   Expects  an
>   integer flag.
> 
>   int cpu = 1;
>   socklen_t len = sizeof(cpu);
>   setsockopt(fd, SOL_SOCKET, SO_INCOMING_CPU, &cpu, len);
> 
>   Because  all  of the packets for a single stream (i.e., all
>   packets for the same 4-tuple) arrive on the single RX queue
>   that  is  associated with a particular CPU, the typical use
>   case is to employ one listening process per RX queue,  with
>   the  incoming  flow being handled by a listener on the same
>   CPU that is handling the RX queue.  This  provides  optimal
>   NUMA behavior and keeps CPU caches hot.
> 
> Cheers,
> 
> Michael
> 


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: [patch] socket.7: Document SO_INCOMING_CPU

2017-02-19 Thread Michael Kerrisk (man-pages)
[CC += Eric, so that he might review]

Hello Francois,

On 02/18/2017 05:06 AM, Francois Saint-Jacques wrote:
> This socket option is undocumented. Applies on the latest version
> (man-pages-4.09-511).
> 
> diff --git a/man7/socket.7 b/man7/socket.7
> index 3efd7a5d8..1a3ffa253 100644
> --- a/man7/socket.7
> +++ b/man7/socket.7
> @@ -490,6 +490,26 @@ flag on a socket
>  operation.
>  Expects an integer boolean flag.
>  .TP
> +.BR SO_INCOMING_CPU " (getsockopt since Linux 3.19, setsockopt since
> Linux 4.4)"
> +.\" getsocktop 2c8c56e15df3d4c2af3d656e44feb18789f75837
> +.\" setsocktop 70da268b569d32a9fddeea85dc18043de9d89f89
> +Sets or gets the cpu affinity of a socket. Expects an integer flag.
> +.sp
> +.in +4n
> +.nf
> +int cpu = 1;
> +socklen_t len = sizeof(cpu);
> +setsockopt(fd, SOL_SOCKET, SO_INCOMING_CPU, &cpu, len);
> +.fi
> +.in
> +.sp
> +The typical use case is one listener per RX queue, as the associated listener
> +should only accept flows handled in softirq by the same cpu.  This provides
> +optimal NUMA behavior and keep cpu caches hot.
> +.TP
>  .B SO_KEEPALIVE
>  Enable sending of keep-alive messages on connection-oriented sockets.
>  Expects an integer boolean flag.

Thank you! Patch applied.

I have tried to enhance the description somewhat. I'm not sure whether
what I've written is quite correct (or whether it should be further
extended). Eric, could you please take a look at the following, and let 
me know if anything needs fixing:

   SO_INCOMING_CPU  (gettable  since Linux 3.19, settable since Linux
   4.4)
  Sets or gets the CPU affinity  of  a  socket.   Expects  an
  integer flag.

  int cpu = 1;
  socklen_t len = sizeof(cpu);
  setsockopt(fd, SOL_SOCKET, SO_INCOMING_CPU, &cpu, len);

  Because  all  of the packets for a single stream (i.e., all
  packets for the same 4-tuple) arrive on the single RX queue
  that  is  associated with a particular CPU, the typical use
  case is to employ one listening process per RX queue,  with
  the  incoming  flow being handled by a listener on the same
  CPU that is handling the RX queue.  This  provides  optimal
  NUMA behavior and keeps CPU caches hot.

Cheers,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: [PATCH v2 00/10] userns: sysctl limits for namespaces

2016-07-26 Thread Michael Kerrisk (man-pages)
On 26 July 2016 at 18:52, Kees Cook <keesc...@chromium.org> wrote:
> On Tue, Jul 26, 2016 at 8:06 AM, Eric W. Biederman
> <ebied...@xmission.com> wrote:
>> "Michael Kerrisk (man-pages)" <mtk.manpa...@gmail.com> writes:
>>
>>> Hello Eric,
>>>
>>> I realized I had a question after the last mail.
>>>
>>> On 07/21/2016 06:39 PM, Eric W. Biederman wrote:
>>>>
>>>> This patchset addresses two use cases:
>>>> - Implement a sane upper bound on the number of namespaces.
>>>> - Provide a way for sandboxes to limit the attack surface from
>>>>   namespaces.
>>>
>>> Can you say more about the second point? What exactly is the
>>> problem that is being addressed, and how does the patch series
>>> address it? (It would be good to have those details in the
>>> revised commit message...)
>>
>> At some point it was reported that seccomp was not sufficient to disable
>> namespace creation.  I need to go back and look at that claim to see
>> which set of circumstances that was referring to.  Seccomp doesn't stack
>> so I can see why it is an issue.
>
> seccomp does stack. The trouble usually comes from a perception that
> seccomp overhead is not trivial, so setting a system-wide policy is a
> bit of a large hammer for such a limitation. Also, at the time,
> seccomp could be bypassed with ptrace, but this (as of v4.8) is no
> longer true.

Sounds like someone needs to send me a patch for the seccomp.2 man page?

Cheers,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: [PATCH v2 00/10] userns: sysctl limits for namespaces

2016-07-26 Thread Michael Kerrisk (man-pages)

Hello Eric,

I realized I had a question after the last mail.

On 07/21/2016 06:39 PM, Eric W. Biederman wrote:


> This patchset addresses two use cases:
> - Implement a sane upper bound on the number of namespaces.
> - Provide a way for sandboxes to limit the attack surface from
>   namespaces.


Can you say more about the second point? What exactly is the
problem that is being addressed, and how does the patch series
address it? (It would be good to have those details in the
revised commit message...)

Cheers,

Michael




Re: [PATCH v2 00/10] userns: sysctl limits for namespaces

2016-07-26 Thread Michael Kerrisk (man-pages)

Hello Eric,

On 07/21/2016 06:39 PM, Eric W. Biederman wrote:


> This patchset addresses two use cases:
> - Implement a sane upper bound on the number of namespaces.
> - Provide a way for sandboxes to limit the attack surface from
>   namespaces.
>
> The maximum sane case I can imagine is if every process is a fat
> process, so I set the maximum number of namespaces to the maximum
> number of threads.
>
> I make these limits recursive and per user namespace so that a
> usernamespace root can reduce the limits further.  If a user namespace
> root raises the limit the limit in the parent namespace will be honored.
>
> I have cut this implementation to the bare minimum needed to achieve
> these objectives.
>
> Does anyone know if there is a proper error code to return for resource
> limit exceeded?  I am currently using -EUSERS or -ENFILE but both of
> those feel a little wrong.


ENFILE certainly seems weird. I suppose my first question is: why two
different errors?

Some alternatives you might want to consider: E2BIG, EOVERFLOW,
or (maybe) ERANGE.

Cheers,

Michael








Re: [PATCH] netlink.7: describe netlink socket options

2016-06-12 Thread Michael Kerrisk (man-pages)
Hi Andrey,

On 06/10/2016 10:28 PM, Andrey Vagin wrote:
> Cc: Kir Kolyshkin 
> Cc: Michael Kerrisk 
> Cc: Herbert Xu 
> Cc: Patrick McHardy 
> Cc: Christophe Ricard 
> Cc: Nicolas Dichtel 
> Signed-off-by: Andrey Vagin 
> ---
>  man7/netlink.7 | 75 
> ++
>  1 file changed, 75 insertions(+)


Thanks for the nicely done patch. Applied!
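
As a small illustration of the SOL_NETLINK option level the patch documents, here is a sketch of joining and leaving a multicast group with NETLINK_ADD_MEMBERSHIP and NETLINK_DROP_MEMBERSHIP. The helper name netlink_set_membership() is an assumption for illustration, and the fallback definition of SOL_NETLINK is only there for older C libraries that do not provide it.

```c
#include <sys/socket.h>
#include <linux/netlink.h>

#ifndef SOL_NETLINK
#define SOL_NETLINK 270   /* fallback for C libraries that lack it */
#endif

/* Join (join != 0) or leave (join == 0) the netlink multicast group
 * numbered 'group' on socket fd. optval is a pointer to an int, as
 * for most netlink socket options.
 * Returns 0 on success, -1 with errno set on failure. */
int netlink_set_membership(int fd, int group, int join)
{
    int opt = join ? NETLINK_ADD_MEMBERSHIP : NETLINK_DROP_MEMBERSHIP;

    return setsockopt(fd, SOL_NETLINK, opt, &group, sizeof group);
}
```

Unlike the nl_groups bit mask in struct sockaddr_nl, these options take a group number, so they are not limited to the first 32 groups.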

Cheers,

Michael


> diff --git a/man7/netlink.7 b/man7/netlink.7
> index 513f854..b4848df 100644
> --- a/man7/netlink.7
> +++ b/man7/netlink.7
> @@ -368,6 +368,81 @@ and
>  .BR NETLINK_SELINUX
>  groups allow other users to receive messages.
>  No groups allow other users to send messages.
> +
> +.SS Socket options
> +To set or get a netlink socket option, call
> +.BR getsockopt (2)
> +to read or
> +.BR setsockopt (2)
> +to write the option with the option level argument set to
> +.BR SOL_NETLINK .
> +Unless otherwise noted,
> +.I optval
> +is a pointer to an
> +.IR int .
> +.TP
> +.BR NETLINK_PKTINFO " (since Linux 2.6.14)"
> +Enable
> +.B nl_pktinfo
> +control messages for received packets to get the extended
> +destination group number.
> +.TP
> +.BR NETLINK_ADD_MEMBERSHIP ,\  NETLINK_DROP_MEMBERSHIP " (since Linux 
> 2.6.14)"
> +Join/leave a group specified by
> +.IR optval .
> +.\"  commit 9a4595bc7e67962f13232ee55a64e063062c3a99
> +.\"  Author: Patrick McHardy 
> +.TP
> +.BR NETLINK_LIST_MEMBERSHIPS " (since Linux 4.2)"
> +Retrieve all groups a socket is a member of.
> +.I optval
> +is a pointer to
> +.B __u32
> +and
> +.I optlen
> +is the size of the array. The array is filled with the full membership set 
> of the
> +socket, and the required array size is returned in
> +.I optlen.
> +.\"  commit b42be38b2778eda2237fc759e55e3b698b05b315
> +.\"  Author: David Herrmann 
> +.TP
> +.BR NETLINK_BROADCAST_ERROR " (since Linux 2.6.30)"
> +When not set,
> +.B netlink_broadcast()
> +only reports
> +.B ESRCH
> +errors and silently ignores
> +.B ENOBUFS
> +errors.
> +.\"  commit be0c22a46cfb79ab2342bb28fde99afa94ef868e
> +.\"  Author: Pablo Neira Ayuso 
> +.TP
> +.BR NETLINK_NO_ENOBUFS " (since Linux 2.6.30)"
> +This flag can be used by unicast and broadcast listeners to avoid receiving
> +.B ENOBUFS
> +errors.
> +.\"  commit 38938bfe3489394e2eed5e40c9bb8f66a2ce1405
> +.\"  Author: Pablo Neira Ayuso 
> +.TP
> +.BR NETLINK_LISTEN_ALL_NSID " (since Linux 4.2)"
> +When set, this socket will receive netlink notifications from all network 
> namespaces that
> +have an
> +.I nsid
> +assigned into the network namespace where the socket has been opened. The
> +.I nsid
> +is sent to user space via ancillary data.
> +.\"  commit 59324cf35aba5336b611074028777838a963d03b
> +.\"  Author: Nicolas Dichtel 
> +.TP
> +.BR NETLINK_CAP_ACK " (since Linux 4.2)"
> +The kernel may fail to allocate the necessary room for the acknowledgment
> +message back to userspace. This option trims off the payload of the original
> +netlink message.
> +The netlink message header is still included, so the user can guess from the
> +sequence number what is the message that has triggered the acknowledgment.
> +.\"  commit 0a6a3a23ea6efde079a5b77688541a98bf202721
> +.\"  Author: Christophe Ricard 
> +
>  .SH VERSIONS
>  The socket interface to netlink is a new feature of Linux 2.2.
>  
> 


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: [PATCH] ip.7: Fix incorrect sockopt name

2016-03-25 Thread Michael Kerrisk (man-pages)
Hello Benjamin,

On 03/22/2016 09:28 AM, Benjamin Poirier wrote:
> "IP_LEAVE_GROUP" does not exist. It was perhaps a confusion with
> MCAST_LEAVE_GROUP. Change the text to IP_DROP_MEMBERSHIP which has the same
> function as MCAST_LEAVE_GROUP and is documented in the ip.7 man page.
> 
> Reference:
> Linux kernel net/ipv4/ip_sockglue.c do_ip_setsockopt()

Thanks! Applied.

Cheers,

Michael


> Cc: Radek Pazdera 
> Signed-off-by: Benjamin Poirier 
> ---
>  man7/ip.7 | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/man7/ip.7 b/man7/ip.7
> index 3905573..37e2c86 100644
> --- a/man7/ip.7
> +++ b/man7/ip.7
> @@ -376,7 +376,7 @@ a given multicast group that come from a given source.
>  If the application has subscribed to multiple sources within
>  the same group, data from the remaining sources will still be delivered.
>  To stop receiving data from all sources at once, use
> -.BR IP_LEAVE_GROUP .
> +.BR IP_DROP_MEMBERSHIP .
>  .IP
>  Argument is an
>  .I ip_mreq_source
> 


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: [PATCH v2] socket.7: Document some BPF-related socket options

2016-03-01 Thread Michael Kerrisk (man-pages)
On 03/01/2016 11:10 AM, Vincent Bernat wrote:
>  ❦  1 mars 2016 11:03 +0100, "Michael Kerrisk (man-pages)" 
> <mtk.manpa...@gmail.com> :
> 
>>   Once   the   SO_LOCK_FILTER  option  has  been  enabled,
>>   attempts by an unprivileged process to change or  remove
>>   the  filter  attached  to  a  socket,  or to disable the
>>   SO_LOCK_FILTER option will fail with the error EPERM.
> 
> You should remove "unprivileged". I didn't try to check for permissions
> because I was just lazy (and I didn't have a need for it). As root, you
> can just recreate another socket.

Bother. That's what I meant to do, and then I omitted to do it! Done now.
And thanks for catching that, Vincent.

Revised text below, with another query.

   SO_LOCK_FILTER
  When set, this option will prevent changing the  filters
  associated  with  the socket.  These filters include any
  set   using   the   socket   options   SO_ATTACH_FILTER,
  SO_ATTACH_BPF,    SO_ATTACH_REUSEPORT_CBPF,    and
  SO_ATTACH_REUSEPORT_EBPF.

  The typical use case is for a privileged process to  set
  up  a  socket with restrictive filters, set SO_LOCK_FIL‐
  TER, and then either drop its  privileges  or  pass  the
  socket file descriptor to an unprivileged process.

  Once   the   SO_LOCK_FILTER  option  has  been  enabled,
  attempts to change or remove the filter  attached  to  a
  socket,  or  to  disable  the SO_LOCK_FILTER option will
  fail with the error EPERM.

I think the second paragraph should probably drop mention of privileges,
right? In fact, maybe just drop the paragraph altogether?

Cheers,

Michael
 




Re: [PATCH v2] socket.7: Document some BPF-related socket options

2016-03-01 Thread Michael Kerrisk (man-pages)
Hi Craig,

On 02/29/2016 06:36 PM, Craig Gallek wrote:
> From: Craig Gallek 

Thanks for the improvements. I've applied the patch and tweaked things
somewhat, but I have a few comments and queries below. I'd be 
grateful if you'd check these, in case I have introduced any errors.
(The tweaked version of the page can be found in the Git repo.)

> Document the behavior and the first kernel version for each of the
> following socket options:
> SO_ATTACH_FILTER
> SO_ATTACH_BPF
> SO_ATTACH_REUSEPORT_CBPF
> SO_ATTACH_REUSEPORT_EBPF
> SO_DETACH_FILTER
> SO_DETACH_BPF
> SO_LOCK_FILTER
> 
> Signed-off-by: Craig Gallek 
> ---
> v2 changes:
> - Content suggestions from Michael Kerrisk :
>   * Clarify socket filter return value semantics
>   * Clarify wording of minimal kernel versions
>   * Explain behavior of multiple calls using SO_ATTACH_[BPF|FILTER]
>   * Define 'reuseport groups' in SO_ATTACH_REUSEPORT_*
> - Include SO_LOCK_FILTER documentation mostly based off of the wording
>   in the commit message by Vincent Bernat 
>   d59577b6ffd3 ("sk-filter: Add ability to lock a socket filter program")
> 
> ---
>  man7/socket.7 | 136 
> +-
>  1 file changed, 115 insertions(+), 21 deletions(-)
> 
> diff --git a/man7/socket.7 b/man7/socket.7
> index db7cb8324dde..d22107cc47d7 100644
> --- a/man7/socket.7
> +++ b/man7/socket.7
> @@ -41,9 +41,6 @@
>  .\"  SO_GET_FILTER (3.8)
>  .\"  commit a8fc92778080c845eaadc369a0ecf5699a03bef0
>  .\"  Author: Pavel Emelyanov 
> -.\"  SO_LOCK_FILTER (3.9)
> -.\"  commit d59577b6ffd313d0ab3be39cb1ab47e29bdc9182
> -.\"  Author: Vincent Bernat 
>  .\"  SO_SELECT_ERR_QUEUE (3.10)
>  .\" commit 7d4c04fc170087119727119074e72445f2bb192b
>  .\"  Author: Keller, Jacob E 
> @@ -53,13 +50,6 @@
>  .\" SO_BPF_EXTENSIONS (3.14)
>  .\" commit ea02f9411d9faa3553ed09ce0ec9f00ceae9885e
>  .\"  Author: Michal Sekletar 
> -.\" SO_ATTACH_BPF (3.19)
> -.\" and SO_DETACH_BPF as synonym for SO_DETACH_FILTER
> -.\" commit 89aa075832b0da4402acebd698d0411dcc82d03e
> -.\"  Author: Alexei Starovoitov 
> -.\"  SO_ATTACH_REUSEPORT_CBPF, SO_ATTACH_REUSEPORT_EBPF (4.5)
> -.\"  commit 538950a1b7527a0a52ccd9337e3fcd304f027f13
> -.\"  Author: Craig Gallek 
>  .\"
>  .TH SOCKET 7 2015-05-07 Linux "Linux Programmer's Manual"
>  .SH NAME
> @@ -311,6 +301,90 @@ The value 0 indicates that this is not a listening 
> socket,
>  the value 1 indicates that this is a listening socket.
>  This socket option is read-only.
>  .TP
> +.BR SO_ATTACH_FILTER " and " SO_ATTACH_BPF
> +Attach a classic or extended BPF program (respectively) to the socket
> +for use as a filter of incoming packets. A packet will be dropped if
> +the filter program returns zero.  If the filter program returns a
> +non-zero value which is less than the packet's data length, the packet
> +will be truncated to the length returned.  If the value returned by
> +the filter is greater than or equal to the packet's data length, the
> +packet is allowed to proceed unmodified.
> +
> +The argument for
> +.BR SO_ATTACH_FILTER
> +is a
> +.I sock_fprog
> +structure in
> +.B .
> +.sp
> +.in +4n
> +.nf
> +struct sock_fprog {
> +unsigned short  len;
> +struct sock_filter *filter;
> +};
> +.fi
> +.in
> +.IP
> +The argument for
> +.BR SO_ATTACH_BPF
> +is a file descriptor returned by the
> +.BR bpf (2)
> +system call and must refer to a program of type
> +.BR BPF_PROG_TYPE_SOCKET_FILTER.
> +These options may be set multiple times for a given socket, each time
> +replacing the previous filter program.  The classic and extended
> +versions may be called on the same socket, but the previous filter
> +will always be replaced such that a socket never has more than one
> +filter defined.
> +
> +.BR SO_ATTACH_FILTER
> +is available since Linux 2.2.
> +.BR SO_ATTACH_BPF
> +is available since Linux 3.19.  Both classic and extended BPF are
> +explained in the kernel source file
> +.I Documentation/networking/filter.txt
> +.TP
> +.BR SO_ATTACH_REUSEPORT_CBPF " and " SO_ATTACH_REUSEPORT_EBPF " (since Linux 
> 4.5)"
> +For use with the
> +.BR SO_REUSEPORT
> +option, these options allow the user to set a classic or extended
> +BPF program (respectively) which defines how packets are assigned to
> +the sockets in the reuseport group (that is, all sockets which have
> +.BR SO_REUSEPORT
> +set and are using the same local address to receive packets).  The BPF
> +program must return an index between 0 and N-1 representing the socket
> +which should receive the packet (where N is the number of sockets in
> +the group). If the BPF program returns an invalid index, socket
> +selection will fall back to the plain
> 

Re: [PATCH] socket.7: Document some BPF-related socket options

2016-02-28 Thread Michael Kerrisk (man-pages)
Hello Craig,

Thanks for putting this together. I have a few comments.
Would you please amend your patch and resend? (And include Alexei
in a "Reviewed-by" tag.)

On 02/25/2016 09:27 PM, Craig Gallek wrote:
> From: Craig Gallek 
> 
> Document the behavior and the first kernel version for each of the
> following socket options:
> SO_ATTACH_FILTER
> SO_ATTACH_BPF
> SO_ATTACH_REUSEPORT_CBPF
> SO_ATTACH_REUSEPORT_EBPF
> SO_DETACH_FILTER
> SO_DETACH_BPF
> 
> Signed-off-by: Craig Gallek 
> ---
>  man7/socket.7 | 104 
> --
>  1 file changed, 86 insertions(+), 18 deletions(-)
> 
> diff --git a/man7/socket.7 b/man7/socket.7
> index db7cb8324dde..79b4f3158541 100644
> --- a/man7/socket.7
> +++ b/man7/socket.7
> @@ -53,13 +53,6 @@
>  .\" SO_BPF_EXTENSIONS (3.14)
>  .\" commit ea02f9411d9faa3553ed09ce0ec9f00ceae9885e
>  .\"  Author: Michal Sekletar 
> -.\" SO_ATTACH_BPF (3.19)
> -.\" and SO_DETACH_BPF as synonym for SO_DETACH_FILTER
> -.\" commit 89aa075832b0da4402acebd698d0411dcc82d03e
> -.\"  Author: Alexei Starovoitov 
> -.\"  SO_ATTACH_REUSEPORT_CBPF, SO_ATTACH_REUSEPORT_EBPF (4.5)
> -.\"  commit 538950a1b7527a0a52ccd9337e3fcd304f027f13
> -.\"  Author: Craig Gallek 
>  .\"
>  .TH SOCKET 7 2015-05-07 Linux "Linux Programmer's Manual"
>  .SH NAME
> @@ -311,6 +304,80 @@ The value 0 indicates that this is not a listening 
> socket,
>  the value 1 indicates that this is a listening socket.
>  This socket option is read-only.
>  .TP
> +.BR SO_ATTACH_FILTER " and " SO_ATTACH_BPF
> +Attach a classic or extended BPF program (respectively) to the socket
> +for use as a filter of incoming packets.  A packet will be dropped if
> +the filter returns zero or have its data truncated to the non-zero
> +length returned.  

I find that last sentence hard to parse. How about something like:

A packet will be dropped if the filter program returns zero or will 
have its data truncated to the non-zero length returned [returned by 
what? The filter? Make this clearer please.]

>If the value returned is greater or equal to the
> +packet's data length, the packet is allowed to proceed unmodified.
> +
> +The argument for
> +.BR SO_ATTACH_FILTER
> +is a
> +.I sock_fprog
> +structure in
> +.B .
> +.sp
> +.in +4n
> +.nf
> +struct sock_fprog {
> +unsigned short  len;
> +struct sock_filter *filter;
> +};
> +.fi
> +.in
> +.IP
> +The argument for
> +.BR SO_ATTACH_BPF
> +is a file descriptor returned by the
> +.BR bpf (2)
> +system call and must represent a program of type

s/represent/refer to/

> +.BR BPF_PROG_TYPE_SOCKET_FILTER.
> +
> +.BR SO_ATTACH_FILTER
> +is available in Linux 2.2.

s/in/since/

> +.BR SO_ATTACH_BPF
> +is available in Linux 3.19.  Both classic and extended BPF are

s/in/since/

> +explained in the kernel source file
> +.I Documentation/networking/filter.txt

Presumably, it is not possible to attach multiple filters to a socket.
This should be stated explicitly somewhere here, as well as an
explanation of what happens if you try to add a filter to a socket
that already has one. Does it replace the existing filter, or does
an error result?

Seems like SOCK_FILTER_LOCKED also needs documenting here somewhere...

> +.TP
> +.BR SO_ATTACH_REUSEPORT_CBPF " and " SO_ATTACH_REUSEPORT_EBPF " (since Linux 
> 4.5)"
> +For use with the
> +.BR SO_REUSEPORT
> +option, these options allow the user to define a classic or extended
> +BPF program (respectively) which defines how packets are assigned to
> +the sockets in the reuseport group.  The program must return an index

Is there some documentation on "reuseport groups" that we can refer
to here? If yes, please add a reference.

s/program/BPF program/

> +between 0 and N-1 representing the socket which should receive the
> +packet (where N is the number of sockets in the group). If the BPF
> +program returns an invalid index, socket selection will fall back to
> +the plain
> +.BR SO_REUSEPORT
> +mechanism.
> +
> +Sockets are numbered in the order in which they are added to the group
> +(that is, the order of
> +.BR bind (2)
> +calls for UDP sockets or the order of
> +.BR listen (2)
> +calls for TCP sockets).  New sockets added to the group will inherit
> +the program.  When a socket is removed from the group (via

s/program/BPF program/

s/the group/a reuseport group/

> +.BR close (2))
> +the last socket in the group will be moved into the closed socket's
> +position.

Wow! That's interesting behavior that seems like it could easily 
trip up users!
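
A toy model of the renumbering may help show why this can surprise users.
This is plain user-space C, not kernel code; the struct and function names
are invented, and the array capacity is arbitrary. It only mimics the rule
stated in the patch: on removal, the last socket takes over the vacated
index.

```c
#include <assert.h>
#include <stddef.h>

#define GROUP_MAX 8    /* arbitrary capacity for this model */

struct reuseport_model {
    int socks[GROUP_MAX];    /* socket IDs, indexed as the BPF program sees them */
    int num;                 /* number of sockets currently in the group */
};

/* Append a socket to the group; returns its index, or -1 if full. */
int model_add(struct reuseport_model *g, int sock_id)
{
    assert(g != NULL);
    if (g->num == GROUP_MAX)
        return -1;
    g->socks[g->num] = sock_id;
    return g->num++;
}

/*
 * Remove a socket from the group. The last socket is moved into the
 * vacated slot, so its index changes while all others stay put.
 * Returns the index that was vacated, or -1 if the socket was not found.
 */
int model_remove(struct reuseport_model *g, int sock_id)
{
    assert(g != NULL);
    for (int i = 0; i < g->num; i++) {
        if (g->socks[i] == sock_id) {
            g->socks[i] = g->socks[g->num - 1];
            g->num--;
            return i;
        }
    }
    return -1;
}
```

For example, with sockets 100, 101, 102, 103 at indexes 0 to 3, removing
101 leaves 103 at index 1 and 102 at index 2. A BPF program that keeps
returning index 1 now steers packets to a different socket than before.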

> +
> +These options may be set repeatedly at any time on any single socket
> +in the group to replace the current BPF program used by all sockets in
> +the group.
> +.BR SO_ATTACH_REUSEPORT_CBPF
> +takes the same socket argument type as
> +.BR SO_ATTACH_FILTER
> +and
> +.BR 

Re: [PATCH 1/1] include/uapi/linux/sockios.h: mark SIOCRTMSG unused

2015-12-30 Thread Michael Kerrisk (man-pages)
Hi Heinrich,

On 12/29/2015 11:22 PM, Heinrich Schuchardt wrote:
> IOCTL SIOCRTMSG does nothing but return EINVAL.
> 
> So comment it as unused.

Can you say something about how you confirmed this?
It's not immediately obvious from the code.
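
For what it's worth, a run-time probe along the following lines is one way
to gather supporting evidence. It is only a sketch: it shows what one
particular kernel does for one socket type, and it is no substitute for
tracing the code paths. probe_siocrtmsg() is a made-up name, and the
fallback #define simply repeats the value from the header being patched.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef SIOCRTMSG
#define SIOCRTMSG 0x890D    /* value from include/uapi/linux/sockios.h */
#endif

/*
 * Issue ioctl(SIOCRTMSG) on an AF_INET datagram socket.
 * Returns the resulting errno value, 0 if the call unexpectedly
 * succeeded, or -1 if no socket could be created.
 */
int probe_siocrtmsg(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    int err;

    if (fd == -1)
        return -1;

    errno = 0;
    err = (ioctl(fd, SIOCRTMSG, NULL) == -1) ? errno : 0;

    close(fd);
    return err;
}
```

If the commit message is right, the return value should be EINVAL on an
AF_INET socket; other address families would need to be probed separately.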

Cheers,

Michael


> Signed-off-by: Heinrich Schuchardt 
> ---
>  include/uapi/linux/sockios.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/include/uapi/linux/sockios.h b/include/uapi/linux/sockios.h
> index e888b1a..8e7890b 100644
> --- a/include/uapi/linux/sockios.h
> +++ b/include/uapi/linux/sockios.h
> @@ -27,7 +27,7 @@
>  /* Routing table calls. */
>  #define SIOCADDRT0x890B  /* add routing table entry  */
>  #define SIOCDELRT0x890C  /* delete routing table entry   */
> -#define SIOCRTMSG0x890D  /* call to routing system   */
> +#define SIOCRTMSG0x890D  /* unused   */
>  
>  /* Socket configuration controls. */
>  #define SIOCGIFNAME  0x8910  /* get iface name   */
> 




Re: [patch] poll.2: timeout_ts is a pointer, so use -> not . for member access

2015-12-23 Thread Michael Kerrisk (man-pages)
Hello Richard,

On 23 December 2015 at 20:30, richardvo...@gmail.com
 wrote:
> From the context, it is apparent that in the code explaining ppoll in
> terms of poll, timeout_ts must be a pointer.
>
> Usage #1:   ready = ppoll(, nfds, timeout_ts, );
>
> Usage #2:(timeout_ts == NULL)
>
> Thus member access in (timeout_ts.tv_sec * 1000 + timeout_ts.tv_nsec /
> 100) is an error.

Thanks. Patch applied.

Cheers,

Michael


> man2/poll.2 | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/man2/poll.2 b/man2/poll.2
> index bcbecad..34b55a6 100644
> --- a/man2/poll.2
> +++ b/man2/poll.2
> @@ -266,7 +266,7 @@ executing the following calls:
>  int timeout;
>
>  timeout = (timeout_ts == NULL) ? \-1 :
> -  (timeout_ts.tv_sec * 1000 + timeout_ts.tv_nsec / 100);
> +  (timeout_ts\->tv_sec * 1000 + timeout_ts\->tv_nsec / 100);
>  pthread_sigmask(SIG_SETMASK, , );
>  ready = poll(, nfds, timeout);
>  pthread_sigmask(SIG_SETMASK, , NULL);
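
As a stand-alone illustration of the conversion being fixed here, the
sketch below turns a ppoll()-style timeout pointer into a poll()-style
millisecond value. timespec_to_poll_ms() is a name invented for this
example. Note that the divisor appears as "100" in the quoted text above,
which looks like an artifact of the archive; converting nanoseconds to
milliseconds requires dividing by 1000000, and that is what the sketch uses.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/*
 * Convert a ppoll()-style timeout to a poll()-style one.
 * A NULL pointer means "block indefinitely", which poll() expresses
 * as -1. Otherwise the result is the timeout in whole milliseconds
 * (sub-millisecond remainders are discarded). Overflow for very large
 * tv_sec values is not handled in this sketch.
 */
int timespec_to_poll_ms(const struct timespec *timeout_ts)
{
    if (timeout_ts == NULL)
        return -1;

    /* timeout_ts is a pointer, hence "->" rather than "." below. */
    assert(timeout_ts->tv_nsec >= 0 && timeout_ts->tv_nsec < 1000000000L);
    return (int) (timeout_ts->tv_sec * 1000 + timeout_ts->tv_nsec / 1000000);
}
```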





Re: [PATCH v2 4/5] seccomp: add a way to access filters via bpf fds

2015-09-11 Thread Michael Kerrisk (man-pages)
Hi Tycho,

On 11 September 2015 at 02:21, Tycho Andersen
 wrote:
> This patch adds a way for a process that is "real root" to access the
> seccomp filters of another process. The process first does a
> PTRACE_SECCOMP_GET_FILTER_FD to get an fd with that process' seccomp filter
> attached, and then iterates on this with PTRACE_SECCOMP_NEXT_FILTER using
> bpf(BPF_PROG_DUMP) to dump the actual program at each step.

Do you have a man-pages patch for this change?

Cheers,

Michael

> Signed-off-by: Tycho Andersen 
> CC: Kees Cook 
> CC: Will Drewry 
> CC: Oleg Nesterov 
> CC: Andy Lutomirski 
> CC: Pavel Emelyanov 
> CC: Serge E. Hallyn 
> CC: Alexei Starovoitov 
> CC: Daniel Borkmann 
> ---
>  include/linux/bpf.h | 12 ++
>  include/linux/seccomp.h | 14 +++
>  include/uapi/linux/ptrace.h |  3 +++
>  kernel/bpf/syscall.c| 26 -
>  kernel/ptrace.c |  7 ++
>  kernel/seccomp.c| 57 
> +
>  6 files changed, 118 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index f57d7fe..bfd9cab 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -162,6 +162,8 @@ void bpf_register_prog_type(struct bpf_prog_type_list 
> *tl);
>  void bpf_register_map_type(struct bpf_map_type_list *tl);
>
>  struct bpf_prog *bpf_prog_get(u32 ufd);
> +int bpf_prog_set(u32 ufd, struct bpf_prog *new);
> +int bpf_new_fd(struct bpf_prog *prog, int flags);
>  void bpf_prog_put(struct bpf_prog *prog);
>  void bpf_prog_put_rcu(struct bpf_prog *prog);
>
> @@ -180,6 +182,16 @@ static inline struct bpf_prog *bpf_prog_get(u32 ufd)
> return ERR_PTR(-EOPNOTSUPP);
>  }
>
> +static inline int bpf_prog_set(u32 ufd, struct bpf_prog *new)
> +{
> +   return -EINVAL;
> +}
> +
> +static inline int bpf_new_fd(struct bpf_prog *prog, int flags)
> +{
> +   return -EINVAL;
> +}
> +
>  static inline void bpf_prog_put(struct bpf_prog *prog)
>  {
>  }
> diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
> index a19ddac..41b083c 100644
> --- a/include/linux/seccomp.h
> +++ b/include/linux/seccomp.h
> @@ -95,4 +95,18 @@ static inline void get_seccomp_filter(struct task_struct 
> *tsk)
> return;
>  }
>  #endif /* CONFIG_SECCOMP_FILTER */
> +
> +#if defined(CONFIG_SECCOMP_FILTER) && defined(CONFIG_CHECKPOINT_RESTORE)
> +extern long seccomp_get_filter_fd(struct task_struct *child);
> +extern long seccomp_next_filter(struct task_struct *child, u32 fd);
> +#else
> +static inline long seccomp_get_filter_fd(struct task_struct *child)
> +{
> +   return -EINVAL;
> +}
> +static inline long seccomp_next_filter(struct task_struct *child, u32 fd)
> +{
> +   return -EINVAL;
> +}
> +#endif /* CONFIG_SECCOMP_FILTER && CONFIG_CHECKPOINT_RESTORE */
>  #endif /* _LINUX_SECCOMP_H */
> diff --git a/include/uapi/linux/ptrace.h b/include/uapi/linux/ptrace.h
> index cf1019e..041c3c3 100644
> --- a/include/uapi/linux/ptrace.h
> +++ b/include/uapi/linux/ptrace.h
> @@ -23,6 +23,9 @@
>
>  #define PTRACE_SYSCALL   24
>
> +#define PTRACE_SECCOMP_GET_FILTER_FD   40
> +#define PTRACE_SECCOMP_NEXT_FILTER 41
> +
>  /* 0x4200-0x4300 are reserved for architecture-independent additions.  */
>  #define PTRACE_SETOPTIONS  0x4200
>  #define PTRACE_GETEVENTMSG 0x4201
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 58ae9f4..ac3ed1c 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -506,6 +506,30 @@ struct bpf_prog *bpf_prog_get(u32 ufd)
>  }
>  EXPORT_SYMBOL_GPL(bpf_prog_get);
>
> +int bpf_prog_set(u32 ufd, struct bpf_prog *new)
> +{
> +   struct fd f;
> +   struct bpf_prog *prog;
> +
> +   f = fdget(ufd);
> +
> +   prog = get_prog(f);
> +   if (!IS_ERR(prog) && prog)
> +   bpf_prog_put(prog);
> +
> +   atomic_inc(>aux->refcnt);
> +   f.file->private_data = new;
> +   fdput(f);
> +   return 0;
> +}
> +EXPORT_SYMBOL_GPL(bpf_prog_set);
> +
> +int bpf_new_fd(struct bpf_prog *prog, int flags)
> +{
> +   return anon_inode_getfd("bpf-prog", _prog_fops, prog, flags);
> +}
> +EXPORT_SYMBOL_GPL(bpf_new_fd);
> +
>  /* last field in 'union bpf_attr' used by this command */
>  #defineBPF_PROG_LOAD_LAST_FIELD kern_version
>
> @@ -572,7 +596,7 @@ static int bpf_prog_load(union bpf_attr *attr)
> if (err < 0)
> goto free_used_maps;
>
> -   err = anon_inode_getfd("bpf-prog", _prog_fops, prog, O_RDWR | 
> O_CLOEXEC);
> +   err = bpf_new_fd(prog, O_RDWR | O_CLOEXEC);
> if (err < 0)
> /* failed to allocate fd */
> goto free_used_maps;
> diff --git a/kernel/ptrace.c b/kernel/ptrace.c
> index c8e0e05..a151c35 

Re: [PATCH v2 1/5] ebpf: add a seccomp program type

2015-09-11 Thread Michael Kerrisk (man-pages)
On 11 September 2015 at 02:20, Tycho Andersen
 wrote:
> seccomp uses eBPF as its underlying storage and execution format, and eBPF
> has features that seccomp would like to make use of in the future. This
> patch adds a formal seccomp type to the eBPF verifier.
>
> The current implementation of the seccomp eBPF type is very limited, and
> doesn't support some interesting features (notably, maps) of eBPF. However,
> the primary motivation for this patchset is to enable checkpoint/restore
> for seccomp filters later in the series, so this limited feature set is ok
> for now.

Hi Tycho,

Seems like a man-pages patch is warranted here also?

Cheers,

Michael


> v2: * don't allow seccomp eBPF programs to call any functions
> * get rid of superfluous seccomp_convert_ctx_access
>
> Signed-off-by: Tycho Andersen 
> CC: Kees Cook 
> CC: Will Drewry 
> CC: Oleg Nesterov 
> CC: Andy Lutomirski 
> CC: Pavel Emelyanov 
> CC: Serge E. Hallyn 
> CC: Alexei Starovoitov 
> CC: Daniel Borkmann 
> ---
>  include/uapi/linux/bpf.h |  1 +
>  net/core/filter.c| 31 +++
>  2 files changed, 32 insertions(+)
>
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 92a48e2..631cdee 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -123,6 +123,7 @@ enum bpf_prog_type {
> BPF_PROG_TYPE_KPROBE,
> BPF_PROG_TYPE_SCHED_CLS,
> BPF_PROG_TYPE_SCHED_ACT,
> +   BPF_PROG_TYPE_SECCOMP,
>  };
>
>  #define BPF_PSEUDO_MAP_FD  1
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 13079f0..faaae67 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -1612,6 +1612,15 @@ tc_cls_act_func_proto(enum bpf_func_id func_id)
> }
>  }
>
> +static const struct bpf_func_proto *
> +seccomp_func_proto(enum bpf_func_id func_id)
> +{
> +   /* At some point in the future seccomp filters may grow support for
> +* eBPF functions. For now, these are disabled.
> +*/
> +   return NULL;
> +}
> +
>  static bool __is_valid_access(int off, int size, enum bpf_access_type type)
>  {
> /* check bounds */
> @@ -1662,6 +1671,17 @@ static bool tc_cls_act_is_valid_access(int off, int 
> size,
> return __is_valid_access(off, size, type);
>  }
>
> +static bool seccomp_is_valid_access(int off, int size,
> +   enum bpf_access_type type)
> +{
> +   if (type == BPF_WRITE)
> +   return false;
> +
> +   if (off < 0 || off >= sizeof(struct seccomp_data) || off & 3)
> +   return false;
> +
> +   return true;
> +}
>  static u32 bpf_net_convert_ctx_access(enum bpf_access_type type, int dst_reg,
>   int src_reg, int ctx_off,
>   struct bpf_insn *insn_buf)
> @@ -1795,6 +1815,11 @@ static const struct bpf_verifier_ops tc_cls_act_ops = {
> .convert_ctx_access = bpf_net_convert_ctx_access,
>  };
>
> +static const struct bpf_verifier_ops seccomp_ops = {
> +   .get_func_proto = seccomp_func_proto,
> +   .is_valid_access = seccomp_is_valid_access,
> +};
> +
>  static struct bpf_prog_type_list sk_filter_type __read_mostly = {
> .ops = _filter_ops,
> .type = BPF_PROG_TYPE_SOCKET_FILTER,
> @@ -1810,11 +1835,17 @@ static struct bpf_prog_type_list sched_act_type 
> __read_mostly = {
> .type = BPF_PROG_TYPE_SCHED_ACT,
>  };
>
> +static struct bpf_prog_type_list seccomp_type __read_mostly = {
> +   .ops = _ops,
> +   .type = BPF_PROG_TYPE_SECCOMP,
> +};
> +
>  static int __init register_sk_filter_ops(void)
>  {
> bpf_register_prog_type(_filter_type);
> bpf_register_prog_type(_cls_type);
> bpf_register_prog_type(_act_type);
> +   bpf_register_prog_type(_type);
>
> return 0;
>  }
> --
> 2.1.4
>





Re: [PATCH v2 5/5] seccomp: add a way to attach a filter via eBPF fd

2015-09-11 Thread Michael Kerrisk (man-pages)
On 11 September 2015 at 02:21, Tycho Andersen
 wrote:
> This is the final bit needed to support seccomp filters created via the bpf
> syscall. The patch adds a new seccomp operation SECCOMP_MODE_FILTER_EBPF,
> which takes exactly one command (presumably to be expanded upon later when
> seccomp EBPFs support more interesting things) and an argument struct
> similar to that of bpf(), although the size is explicit in the struct to
> avoid changing the signature of seccomp().
>
> v2: Don't abuse seccomp's third argument; use a separate command and a
> pointer to a structure instead.

Hi Tycho,

Here, I'm entering broken record territory :-). Seems like a man-pages
patch is warranted here also?

Cheers,

Michael


> Signed-off-by: Tycho Andersen 
> CC: Kees Cook 
> CC: Will Drewry 
> CC: Oleg Nesterov 
> CC: Andy Lutomirski 
> CC: Pavel Emelyanov 
> CC: Serge E. Hallyn 
> CC: Alexei Starovoitov 
> CC: Daniel Borkmann 
> ---
>  include/uapi/linux/seccomp.h |  16 +
>  kernel/seccomp.c | 135 
> ++-
>  2 files changed, 138 insertions(+), 13 deletions(-)
>
> diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
> index 0f238a4..a8694e2 100644
> --- a/include/uapi/linux/seccomp.h
> +++ b/include/uapi/linux/seccomp.h
> @@ -13,10 +13,14 @@
>  /* Valid operations for seccomp syscall. */
>  #define SECCOMP_SET_MODE_STRICT0
>  #define SECCOMP_SET_MODE_FILTER1
> +#define SECCOMP_MODE_FILTER_EBPF   2
>
>  /* Valid flags for SECCOMP_SET_MODE_FILTER */
>  #define SECCOMP_FILTER_FLAG_TSYNC  1
>
> +/* Valid cmds for SECCOMP_MODE_FILTER_EBPF */
> +#define SECCOMP_EBPF_ADD_FD0
> +
>  /*
>   * All BPF programs must return a 32-bit value.
>   * The bottom 16-bits are for optional return data.
> @@ -51,4 +55,16 @@ struct seccomp_data {
> __u64 args[6];
>  };
>
> +struct seccomp_ebpf {
> +   unsigned int size;
> +
> +   union {
> +   /* SECCOMP_EBPF_ADD_FD */
> +   struct {
> +   unsigned intadd_flags;
> +   __u32   add_fd;
> +   };
> +   };
> +};
> +
>  #endif /* _UAPI_LINUX_SECCOMP_H */
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index 1856f69..e78175a 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -65,6 +65,9 @@ struct seccomp_filter {
>  /* Limit any path through the tree to 256KB worth of instructions. */
>  #define MAX_INSNS_PER_PATH ((1 << 18) / sizeof(struct sock_filter))
>
> +static long seccomp_install_filter(unsigned int flags,
> +  struct seccomp_filter *prepared);
> +
>  /*
>   * Endianness is explicitly ignored and left for BPF program authors to 
> manage
>   * as per the specific architecture.
> @@ -356,17 +359,6 @@ static struct seccomp_filter 
> *seccomp_prepare_filter(struct sock_fprog *fprog)
>
> BUG_ON(INT_MAX / fprog->len < sizeof(struct sock_filter));
>
> -   /*
> -* Installing a seccomp filter requires that the task has
> -* CAP_SYS_ADMIN in its namespace or be running with no_new_privs.
> -* This avoids scenarios where unprivileged tasks can affect the
> -* behavior of privileged children.
> -*/
> -   if (!task_no_new_privs(current) &&
> -   security_capable_noaudit(current_cred(), current_user_ns(),
> -CAP_SYS_ADMIN) != 0)
> -   return ERR_PTR(-EACCES);
> -
> /* Allocate a new seccomp_filter */
> sfilter = kzalloc(sizeof(*sfilter), GFP_KERNEL | __GFP_NOWARN);
> if (!sfilter)
> @@ -510,8 +502,105 @@ static void seccomp_send_sigsys(int syscall, int reason)
> info.si_syscall = syscall;
> force_sig_info(SIGSYS, , current);
>  }
> +
>  #endif /* CONFIG_SECCOMP_FILTER */
>
> +#if defined(CONFIG_BPF_SYSCALL) && defined(CONFIG_SECCOMP_FILTER)
> +static struct seccomp_filter *seccomp_prepare_ebpf(int fd)
> +{
> +   struct seccomp_filter *ret;
> +   struct bpf_prog *prog;
> +
> +   prog = bpf_prog_get(fd);
> +   if (IS_ERR(prog))
> +   return (struct seccomp_filter *) prog;
> +
> +   if (prog->type != BPF_PROG_TYPE_SECCOMP) {
> +   bpf_prog_put(prog);
> +   return ERR_PTR(-EINVAL);
> +   }
> +
> +   ret = kzalloc(sizeof(*ret), GFP_KERNEL | __GFP_NOWARN);
> +   if (!ret) {
> +   bpf_prog_put(prog);
> +   return ERR_PTR(-ENOMEM);
> +   }
> +
> +   ret->prog = prog;
> +   atomic_set(>usage, 1);
> +
> +   /* Intentionally don't bpf_prog_put() here, because the underlying 
> prog
> +* is refcounted too and we're holding a reference from the struct
> +* 

Re: [PATCH 5/6] seccomp: add a way to attach a filter via eBPF fd

2015-09-05 Thread Michael Kerrisk (man-pages)
On 09/04/2015 10:41 PM, Kees Cook wrote:
> On Fri, Sep 4, 2015 at 9:04 AM, Tycho Andersen
>  wrote:
>> This is the final bit needed to support seccomp filters created via the bpf
>> syscall.

Hmm. Thanks, Kees, for CCing linux-api@. That really should have been done at
the outset.

Tycho, where's the man-pages patch describing this new kernel-userspace
API feature? :-)

>> One concern with this patch is exactly what the interface should look like
>> for users, since seccomp()'s second argument is a pointer, we could ask
>> people to pass a pointer to the fd, but implies we might write to it which
>> seems impolite. Right now we cast the pointer (and force the user to cast
>> it), which generates ugly warnings. I'm not sure what the right answer is
>> here.
>>
>> Signed-off-by: Tycho Andersen 
>> CC: Kees Cook 
>> CC: Will Drewry 
>> CC: Oleg Nesterov 
>> CC: Andy Lutomirski 
>> CC: Pavel Emelyanov 
>> CC: Serge E. Hallyn 
>> CC: Alexei Starovoitov 
>> CC: Daniel Borkmann 
>> ---
>>  include/linux/seccomp.h  |  3 +-
>>  include/uapi/linux/seccomp.h |  1 +
>>  kernel/seccomp.c | 70 
>> 
>>  3 files changed, 61 insertions(+), 13 deletions(-)
>>
>> diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
>> index d1a86ed..a725dd5 100644
>> --- a/include/linux/seccomp.h
>> +++ b/include/linux/seccomp.h
>> @@ -3,7 +3,8 @@
>>
>>  #include 
>>
>> -#define SECCOMP_FILTER_FLAG_MASK   (SECCOMP_FILTER_FLAG_TSYNC)
>> +#define SECCOMP_FILTER_FLAG_MASK   (\
>> +   SECCOMP_FILTER_FLAG_TSYNC | SECCOMP_FILTER_FLAG_EBPF)
>>
>>  #ifdef CONFIG_SECCOMP
>>
>> diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
>> index 0f238a4..c29a423 100644
>> --- a/include/uapi/linux/seccomp.h
>> +++ b/include/uapi/linux/seccomp.h
>> @@ -16,6 +16,7 @@
>>
>>  /* Valid flags for SECCOMP_SET_MODE_FILTER */
>>  #define SECCOMP_FILTER_FLAG_TSYNC  1
>> +#define SECCOMP_FILTER_FLAG_EBPF   (1 << 1)
>>
>>  /*
>>   * All BPF programs must return a 32-bit value.
>> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
>> index a2c5b32..9c6bea6 100644
>> --- a/kernel/seccomp.c
>> +++ b/kernel/seccomp.c
>> @@ -355,17 +355,6 @@ static struct seccomp_filter 
>> *seccomp_prepare_filter(struct sock_fprog *fprog)
>>
>> BUG_ON(INT_MAX / fprog->len < sizeof(struct sock_filter));
>>
>> -   /*
>> -* Installing a seccomp filter requires that the task has
>> -* CAP_SYS_ADMIN in its namespace or be running with no_new_privs.
>> -* This avoids scenarios where unprivileged tasks can affect the
>> -* behavior of privileged children.
>> -*/
>> -   if (!task_no_new_privs(current) &&
>> -   security_capable_noaudit(current_cred(), current_user_ns(),
>> -CAP_SYS_ADMIN) != 0)
>> -   return ERR_PTR(-EACCES);
>> -
>> /* Allocate a new seccomp_filter */
>> sfilter = kzalloc(sizeof(*sfilter), GFP_KERNEL | __GFP_NOWARN);
>> if (!sfilter)
>> @@ -509,6 +498,48 @@ static void seccomp_send_sigsys(int syscall, int reason)
>> info.si_syscall = syscall;
>> force_sig_info(SIGSYS, , current);
>>  }
>> +
>> +#ifdef CONFIG_BPF_SYSCALL
>> +static struct seccomp_filter *seccomp_prepare_ebpf(const char __user 
>> *filter)
>> +{
>> +   /* XXX: this cast generates a warning. should we make people pass in
>> +* , or is there some nicer way of doing this?
>> +*/
>> +   u32 fd = (u32) filter;
> 
> I think this is probably the right way to do it, modulo getting the
> warning fixed. Let me invoke the great linux-api subscribers to get
> some more opinions.

Sigh. It's sad, but using a cast does seem the simplest option.
But, how about another idea...

> tl;dr: adding SECCOMP_FILTER_FLAG_EBPF to the flags changes the
> pointer argument into an fd argument. Is this sane, should it be a
> pointer to an fd, or should it not be a flag at all, creating a new
> seccomp command instead (SECCOMP_MODE_FILTER_EBPF)?

What about

seccomp(SECCOMP_MODE_FILTER_EBPF, flags, structp)

Where structp is a pointer to something like

struct seccomp_ebpf {
    int size;  /* Size of this whole struct */
    int fd;
};
'size' allows for future expansion of the struct, and placing 'fd'
inside a struct avoids the unpleasant implication (that the kernel
might write to the fd) that would be made by passing a pointer to
an fd as the third argument.
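
To sketch how the 'size' field could be used, here is a user-space model
of the check the receiving side might perform. Everything except the
struct layout is hypothetical: seccomp_ebpf_parse(), the size cap, and the
choice of error codes are assumptions for illustration, loosely following
the convention used by other size-versioned structs such as
perf_event_attr.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define SECCOMP_EBPF_MAX_SIZE 4096    /* arbitrary cap for this sketch */

struct seccomp_ebpf {
    int size;    /* size of this whole struct, as the caller knows it */
    int fd;
};

/*
 * Hypothetical helper: validate a caller-supplied struct whose first
 * member is its own size, and copy the fields we understand into *out.
 *  - smaller than the first version of the struct: -EINVAL
 *  - larger (a newer caller): accepted only if every byte beyond the
 *    fields we know about is zero, otherwise -E2BIG
 * Returns 0 on success.
 */
int seccomp_ebpf_parse(const void *ubuf, struct seccomp_ebpf *out)
{
    const unsigned char *p = ubuf;
    int size;

    assert(ubuf != NULL && out != NULL);

    /* Read the size first; it tells us how much else may be read. */
    memcpy(&size, p, sizeof(size));
    if (size < (int) sizeof(*out) || size > SECCOMP_EBPF_MAX_SIZE)
        return -EINVAL;

    /* Unknown trailing fields must be zero, i.e. "not requested". */
    for (size_t i = sizeof(*out); i < (size_t) size; i++)
        if (p[i] != 0)
            return -E2BIG;

    memcpy(out, p, sizeof(*out));
    return 0;
}
```

The point of the design is that an old kernel can accept a struct from a
newer program as long as the program does not use any of the new fields.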

Cheers,

Michael


> -Kees
> 
>> +   struct seccomp_filter *ret;
>> +   struct bpf_prog *prog;
>> +
>> +   prog = bpf_prog_get(fd);
>> +   if (IS_ERR(prog))
>> +   return (struct seccomp_filter *) prog;
>> +
>> +   if (prog->type !=