Re: [PATCH] rtnetlink.7: Remove IPv4 from description

2021-01-17 Thread Michael Kerrisk (man-pages)
Hi Alex and Pali

On 1/16/21 4:04 PM, Alejandro Colomar wrote:
> From: Pali Rohár 
> 
> rtnetlink is not only used for IPv4
> 
> Signed-off-by: Pali Rohár 
> Signed-off-by: Alejandro Colomar 

Thanks. Patch applied.

Cheers,

Michael

> ---
>  man7/rtnetlink.7 | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/man7/rtnetlink.7 b/man7/rtnetlink.7
> index cd6809320..aec005ff9 100644
> --- a/man7/rtnetlink.7
> +++ b/man7/rtnetlink.7
> @@ -13,7 +13,7 @@
>  .\"
>  .TH RTNETLINK  7 2020-06-09 "Linux" "Linux Programmer's Manual"
>  .SH NAME
> -rtnetlink \- Linux IPv4 routing socket
> +rtnetlink \- Linux routing socket
>  .SH SYNOPSIS
>  .nf
>  .B #include 
> 


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: [patch] socket.7: document SO_INCOMING_NAPI_ID

2020-10-28 Thread Michael Kerrisk (man-pages)
On 10/28/20 2:15 AM, Sridhar Samudrala wrote:
> Add documentation for SO_INCOMING_NAPI_ID in socket.7 man page.

Hello Sridhar,

Thank you!

Would it be possible for you to resubmit the patch, with a commit
message that says how you obtained or verified the information?
This info is useful for review, but also for understanding changes
when people look at the history in the future.

Also, please start new sentences on new lines (so-called
semantic newlines).

Thanks,

Michael

> Signed-off-by: Sridhar Samudrala 
> ---
>  man7/socket.7 | 12 
>  1 file changed, 12 insertions(+)
> 
> diff --git a/man7/socket.7 b/man7/socket.7
> index 850d3162f..1f38273e9 100644
> --- a/man7/socket.7
> +++ b/man7/socket.7
> @@ -519,6 +519,18 @@ This provides optimal NUMA behavior and keeps CPU caches hot.
>  .\" SO_REUSEPORT logic, selecting the socket to receive the packet, ignores
>  .\" SO_INCOMING_CPU setting.
>  .TP
> +.BR SO_INCOMING_NAPI_ID " (gettable since Linux 4.12)"
> +.\" getsockopt 6d4339028b350efbf87c61e6d9e113e5373545c9
> +Returns a system level unique ID called NAPI ID that is associated with a RX
> +queue on which the last packet associated with that socket is received.
> +.IP
> +This can be used by an application to split the incoming flows among worker
> +threads based on the RX queue on which the packets associated with the flows
> +are received. It allows each worker thread to be associated with a NIC HW
> +receive queue and service all the connection requests received on that RX
> +queue. This mapping between a app thread and a HW NIC queue streamlines the
> +flow of data from the NIC to the application.
> +.TP
>  .B SO_KEEPALIVE
>  Enable sending of keep-alive messages on connection-oriented sockets.
>  Expects an integer boolean flag.
> 
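
As an illustration of the option documented in the patch above, here is a minimal sketch of reading the NAPI ID with getsockopt(2). The helper name get_incoming_napi_id() is hypothetical, and the fallback value 56 is an assumption taken from the generic socket option numbering (some architectures use a different number). In current kernels a returned ID of 0 appears to mean that no packet has been attributed to a NAPI context yet.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <unistd.h>

/* Assumption: fallback for older libc headers, using the generic
   option number; some architectures define a different value. */
#ifndef SO_INCOMING_NAPI_ID
#define SO_INCOMING_NAPI_ID 56
#endif

/* Hypothetical helper: fetch the NAPI ID recorded for the last packet
   received on sockfd.  Returns 0 and stores the ID in *napi_id on
   success; returns -1 with errno set on failure (for example
   ENOPROTOOPT on kernels built without busy-poll support). */
int
get_incoming_napi_id(int sockfd, unsigned int *napi_id)
{
    socklen_t len = sizeof(*napi_id);

    assert(napi_id != NULL);
    *napi_id = 0;
    return getsockopt(sockfd, SOL_SOCKET, SO_INCOMING_NAPI_ID,
                      napi_id, &len);
}
```

A server could call this on each accepted connection and hand the socket to the worker thread that services the matching RX queue.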




Re: [patch] socket.7: document SO_INCOMING_NAPI_ID

2020-10-28 Thread Michael Kerrisk (man-pages)
On 10/28/20 7:13 AM, Michael Kerrisk (man-pages) wrote:
> On 10/28/20 2:15 AM, Sridhar Samudrala wrote:
>> Add documentation for SO_INCOMING_NAPI_ID in socket.7 man page.
> 
> Hello Sridhar,
> 
> Thank you!
> 
> Would it be possible for you to resubmit the patch, with a commit
> message that says how you obtained or verified the information?
> This info is useful for review, but also for understanding changes
> when people look at the history in the future.

D'oh! One thing I should have checked before I hit send, I guess:

[[
commit 6d4339028b350efbf87c61e6d9e113e5373545c9
Author: Sridhar Samudrala 
Date:   Fri Mar 24 10:08:36 2017 -0700

net: Introduce SO_INCOMING_NAPI_ID
]]

But, it helps if you tell me that in the accompanying mail
message.

Thanks again for the patch. I'll apply and fix the newlines.

Cheers,

Michael



Re: [patch] freeaddrinfo.3: memory leaks in freeaddrinfo examples

2020-09-17 Thread Michael Kerrisk (man-pages)
[CC += beej, to alert the author about the memory leaks 
in the network programming guide]

Hello Marko,

> On Thu, Sep 17, 2020 at 7:42 AM Michael Kerrisk (man-pages) <
> mtk.manpa...@gmail.com> wrote:
> 
>> Hi Marko,
>>
>> On Thu, 17 Sep 2020 at 07:34, Marko Hrastovec 
>> wrote:
>>>
>>> Hi,
>>>
>>> examples in freeaddrinfo.3 have a memory leak, which is replicated in
>>> many real world programs copying an example from manual pages. The two
>>> examples should have different order of lines, which is done in the
>>> following patch.
>>>
>>> diff --git a/man3/getaddrinfo.3 b/man3/getaddrinfo.3
>>> index c9a4b3e43..4d383bea0 100644
>>> --- a/man3/getaddrinfo.3
>>> +++ b/man3/getaddrinfo.3
>>> @@ -711,13 +711,13 @@ main(int argc, char *argv[])
>>>  close(sfd);
>>>  }
>>>
>>> +freeaddrinfo(result);   /* No longer needed */
>>> +
>>>  if (rp == NULL) {   /* No address succeeded */
>>>  fprintf(stderr, "Could not bind\en");
>>>  exit(EXIT_FAILURE);
>>>  }
>>>
>>> -freeaddrinfo(result);   /* No longer needed */
>>> -
>>>  /* Read datagrams and echo them back to sender */
>>>
>>>  for (;;) {
>>> @@ -804,13 +804,13 @@ main(int argc, char *argv[])
>>>  close(sfd);
>>>  }
>>>
>>> +freeaddrinfo(result);   /* No longer needed */
>>> +
>>>  if (rp == NULL) {   /* No address succeeded */
>>>  fprintf(stderr, "Could not connect\en");
>>>  exit(EXIT_FAILURE);
>>>  }
>>>
>>> -freeaddrinfo(result);   /* No longer needed */
>>> -
>>>  /* Send remaining command\-line arguments as separate
>>> datagrams, and read responses from server */
>>>
>>
>> When you say "memory leak", do you mean that something like valgrind
>> complains? I mean, strictly speaking, there is no memory leak that I
>> can see that is fixed by that patch, since the if-branches above which
>> the freeaddrinfo() calls are moved terminate the process in each
>> case.
>
> You are right about terminating the process. However, people copy that
> example and put the code in a function, changing "exit" to "return". There
> are a bunch of examples like that here https://beej.us/guide/bgnet/html/#poll,
> for instance.

Oh -- I see what you mean.

> That error bothered me when reading the network programming
> guide https://beej.us/guide/bgnet/html/. Then I looked for information
> elsewhere:
> -
> https://stackoverflow.com/questions/6712740/valgrind-reporting-that-getaddrinfo-is-leaking-memory
> -
> https://stackoverflow.com/questions/15690303/server-client-sockets-freeaddrinfo3-placement
> And finally, I checked manual pages and saw where these errors come from.
> 
> When you change that to a function and return without doing freeaddrinfo,
> that is a memory leak. I believe an example should show good programming
> practices. Relying on process exit to release the memory is not
> such a practice. In my opinion, these examples lead people to make mistakes in
> their programs.

Yes, I can buy that argument. I've applied your patch.
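
For illustration, here is a minimal sketch of the pattern Marko describes, with the example's logic moved into a function that returns instead of exiting. The function name bind_datagram_socket() is hypothetical; the point is only that freeaddrinfo() runs before the outcome is tested, so no return path leaks the list.

```c
#include <assert.h>
#include <netdb.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: bind a datagram socket to the given service on
   any local address.  Returns the bound descriptor, or -1 if no
   address could be bound.  The address list is released on every
   path before returning. */
int
bind_datagram_socket(const char *service)
{
    struct addrinfo hints = {0};
    struct addrinfo *result, *rp;
    int sfd = -1;

    assert(service != NULL);
    hints.ai_family = AF_UNSPEC;     /* Allow IPv4 or IPv6 */
    hints.ai_socktype = SOCK_DGRAM;
    hints.ai_flags = AI_PASSIVE;     /* Wildcard address */

    if (getaddrinfo(NULL, service, &hints, &result) != 0)
        return -1;                   /* No list to free on failure */

    for (rp = result; rp != NULL; rp = rp->ai_next) {
        sfd = socket(rp->ai_family, rp->ai_socktype, rp->ai_protocol);
        if (sfd == -1)
            continue;
        if (bind(sfd, rp->ai_addr, rp->ai_addrlen) == 0)
            break;                   /* Success */
        close(sfd);
        sfd = -1;
    }

    freeaddrinfo(result);            /* Free before testing the outcome */

    return sfd;                      /* -1 if no address succeeded */
}
```

Because the descriptor is reset to -1 after each failed bind(), the function does not need to inspect rp after the list has been freed.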

Thanks,

Michael



Re: [patch] freeaddrinfo.3: memory leaks in freeaddrinfo examples

2020-09-16 Thread Michael Kerrisk (man-pages)
Hi Marko,

On Thu, 17 Sep 2020 at 07:34, Marko Hrastovec  wrote:
>
> Hi,
>
> examples in freeaddrinfo.3 have a memory leak, which is replicated in many
> real-world programs copying an example from manual pages. The two examples
> should have a different order of lines, which is done in the following patch.
>
> diff --git a/man3/getaddrinfo.3 b/man3/getaddrinfo.3
> index c9a4b3e43..4d383bea0 100644
> --- a/man3/getaddrinfo.3
> +++ b/man3/getaddrinfo.3
> @@ -711,13 +711,13 @@ main(int argc, char *argv[])
>  close(sfd);
>  }
>
> +freeaddrinfo(result);   /* No longer needed */
> +
>  if (rp == NULL) {   /* No address succeeded */
>  fprintf(stderr, "Could not bind\en");
>  exit(EXIT_FAILURE);
>  }
>
> -freeaddrinfo(result);   /* No longer needed */
> -
>  /* Read datagrams and echo them back to sender */
>
>  for (;;) {
> @@ -804,13 +804,13 @@ main(int argc, char *argv[])
>  close(sfd);
>  }
>
> +freeaddrinfo(result);   /* No longer needed */
> +
>  if (rp == NULL) {   /* No address succeeded */
>  fprintf(stderr, "Could not connect\en");
>  exit(EXIT_FAILURE);
>  }
>
> -freeaddrinfo(result);   /* No longer needed */
> -
>  /* Send remaining command\-line arguments as separate
> datagrams, and read responses from server */
>

When you say "memory leak", do you mean that something like valgrind
complains? I mean, strictly speaking, there is no memory leak that I
can see that is fixed by that patch, since the if-branches above which
the freeaddrinfo() calls are moved terminate the process in each
case.

Thanks,

Michael




Re: [PATCH] veth.4: Add a more direct example

2020-05-19 Thread Michael Kerrisk (man-pages)
Hello Devin

On 5/18/20 10:58 PM, Devin J. Pohly wrote:
> iproute2 allows you to specify the netns for either side of a veth
> interface at creation time.  Add an example of this to veth(4) so it
> doesn't sound like you have to move the interfaces in a separate step.
> 
> Verified with commands:
> # ip netns add alpha
> # ip netns add bravo
> # ip link add foo netns alpha type veth peer bar netns bravo
> # ip -n alpha link show
> # ip -n bravo link show

Nice patch, and nice commit message! Thanks. Applied.

Cheers,

Michael

> ---
>  man4/veth.4 | 16 +---
>  1 file changed, 13 insertions(+), 3 deletions(-)
> 
> diff --git a/man4/veth.4 b/man4/veth.4
> index 20294c097..2d59882a0 100644
> --- a/man4/veth.4
> +++ b/man4/veth.4
> @@ -63,13 +63,23 @@ A particularly interesting use case is to place one end of a
>  .B veth
>  pair in one network namespace and the other end in another network namespace,
>  thus allowing communication between network namespaces.
> -To do this, one first creates the
> +To do this, one can provide the
> +.B netns
> +parameter when creating the interfaces:
> +.PP
> +.in +4n
> +.EX
> +# ip link add  netns  type veth peer  netns 
> +.EE
> +.in
> +.PP
> +or, for an existing
>  .B veth
> -device as above and then moves one side of the pair to the other namespace:
> +pair, move one side to the other namespace:
>  .PP
>  .in +4n
>  .EX
> -# ip link set  netns 
> +# ip link set  netns 
>  .EE
>  .in
>  .PP
> 




Re: [PATCH] vsock.7: document VSOCK socket address family

2018-02-01 Thread Michael Kerrisk (man-pages)
On 1 February 2018 at 19:03, Stefan Hajnoczi  wrote:
> On Tue, Jan 30, 2018 at 10:31:54PM +0100, Michael Kerrisk (man-pages) wrote:
>> Hi Stefan,
>>
>> Ping on the below please, since it either blocks the man-pages release
>> I'd currently like to make, or I must remove the vsock.7 page for this
>> release.
>
> Sorry for the delay.  The verbatim license is fine.

Thanks, Stefan!



Re: [PATCH] vsock.7: document VSOCK socket address family

2018-01-30 Thread Michael Kerrisk (man-pages)
Hi Stefan,

Ping on the below please, since it either blocks the man-pages release
I'd currently like to make, or I must remove the vsock.7 page for this
release.

Thanks,

Michael



On 26 January 2018 at 22:47, Michael Kerrisk (man-pages)
 wrote:
> Stefan,
>
> I've just now noted that your page came with no license. What license
> do you want to use? Please see
> https://www.kernel.org/doc/man-pages/licenses.html
>
> Thanks,
>
> Michael
>
>
> On 30 November 2017 at 12:21, Stefan Hajnoczi  wrote:
>> The AF_VSOCK address family has been available since Linux 3.9 without a
>> corresponding man page.
>>
>> This patch adds vsock.7 and describes its use along the same lines as
>> existing ip.7, unix.7, and netlink.7 man pages.
>>
>> CC: Jorgen Hansen 
>> CC: Dexuan Cui 
>> Signed-off-by: Stefan Hajnoczi 
>> ---
>>  man7/vsock.7 | 175 
>> +++
>>  1 file changed, 175 insertions(+)
>>  create mode 100644 man7/vsock.7
>>
>> diff --git a/man7/vsock.7 b/man7/vsock.7
>> new file mode 100644
>> index 0..48c6c2e1e
>> --- /dev/null
>> +++ b/man7/vsock.7
>> @@ -0,0 +1,175 @@
>> +.TH VSOCK 7 2017-11-30 "Linux" "Linux Programmer's Manual"
>> +.SH NAME
>> +vsock \- Linux VSOCK address family
>> +.SH SYNOPSIS
>> +.B #include 
>> +.br
>> +.B #include 
>> +.PP
>> +.IB stream_socket " = socket(AF_VSOCK, SOCK_STREAM, 0);"
>> +.br
>> +.IB datagram_socket " = socket(AF_VSOCK, SOCK_DGRAM, 0);"
>> +.SH DESCRIPTION
>> +The VSOCK address family facilitates communication between virtual machines and
>> +the host they are running on.  This address family is used by guest agents and
>> +hypervisor services that need a communications channel that is independent of
>> +virtual machine network configuration.
>> +.PP
>> +Valid socket types are
>> +.B SOCK_STREAM
>> +and
>> +.B SOCK_DGRAM .
>> +.B SOCK_STREAM
>> +provides connection-oriented byte streams with guaranteed, in-order delivery.
>> +.B SOCK_DGRAM
>> +provides a connectionless datagram packet service.  Availability of these
>> +socket types is dependent on the underlying hypervisor.
>> +.PP
>> +A new socket is created with
>> +.PP
>> +socket(AF_VSOCK, socket_type, 0);
>> +.PP
>> +When a process wants to establish a connection it calls
>> +.BR connect (2)
>> +with a given destination socket address.  The socket is automatically bound to
>> +a free port if unbound.
>> +.PP
>> +A process can listen for incoming connections by first binding to a socket address using
>> +.BR bind (2)
>> +and then calling
>> +.BR listen (2).
>> +.PP
>> +Data is transferred using the usual
>> +.BR send (2)
>> +and
>> +.BR recv (2)
>> +family of socket system calls.
>> +.SS Address format
>> +A socket address is defined as a combination of a 32-bit Context Identifier (CID) and a 32-bit port number.  The CID identifies the source or destination, which is either a virtual machine or the host.  The port number differentiates between multiple services running on a single machine.
>> +.PP
>> +.in +4n
>> +.EX
>> +struct sockaddr_vm {
>> +    sa_family_t    svm_family;    /* address family: AF_VSOCK */
>> +    unsigned short svm_reserved1;
>> +    unsigned int   svm_port;      /* port in native byte order */
>> +    unsigned int   svm_cid;       /* address in native byte order */
>> +};
>> +.EE
>> +.in
>> +.PP
>> +.I svm_family
>> +is always set to
>> +.BR AF_VSOCK .
>> +.I svm_reserved1
>> +is always set to 0.
>> +.I svm_port
>> +contains the port in native byte order.
>> +The port numbers below 1024 are called
>> +.IR "privileged ports" .
>> +Only a process with
>> +.B CAP_NET_BIND_SERVICE
>> +capability may
>> +.BR bind (2)
>> +to these port numbers.
>> +.PP
>> +There are several special addresses:
>> +.B VMADDR_CID_ANY
>> +(-1U)
>> +means any address for binding;
>> +.B VMADDR_CID_HYPERVISOR
>> +(0) and
>> +.B VMADDR_CID_RESERVED
>> +(1) are unused addresses;
>> +.B VMADDR_CID_HOST
>> +(2)
>> +is the well-known address of the host.
>> +.PP
>> +The special constant
>> +.B VMADDR_PORT_ANY
>> +(-1U)
>> +means any port number for binding.
>> +.SS Live migration
>> +Sockets are affect

Re: [PATCH] vsock.7: document VSOCK socket address family

2018-01-26 Thread Michael Kerrisk (man-pages)
nding instead of getting the local CID with
> +.B IOCTL_VM_SOCKETS_GET_LOCAL_CID .
> +.SH ERRORS
> +.TP
> +.B EACCES
> +Unable to bind to a privileged port without the
> +.B CAP_NET_BIND_SERVICE
> +capability.
> +.TP
> +.B EINVAL
> +Invalid parameters.  This includes:
> +attempting to bind a socket that is already bound, providing an invalid struct
> +.B sockaddr_vm ,
> +and other input validation errors.
> +.TP
> +.B EOPNOTSUPP
> +Operation not supported.  This includes:
> +the
> +.B MSG_OOB
> +flag that is not implemented for
> +.B sendmsg (2)
> +and
> +.B MSG_PEEK
> +for
> +.B recvmsg (2).
> +.TP
> +.B EADDRINUSE
> +Unable to bind to a port that is already in use.
> +.TP
> +.B EADDRNOTAVAIL
> +Unable to find a free port for binding or unable to bind to a non-local CID.
> +.TP
> +.B ENOTCONN
> +Unable to perform operation on an unconnected socket.
> +.TP
> +.B ENOPROTOOPT
> +Invalid socket option in
> +.B setsockopt (2)
> +or
> +.B getsockopt (2).
> +.TP
> +.B EPROTONOSUPPORT
> +Invalid socket protocol number.  Protocol should always be 0.
> +.TP
> +.B ESOCKTNOSUPPORT
> +Unsupported socket type in
> +.B socket (2).
> +Only
> +.B SOCK_STREAM
> +and
> +.B SOCK_DGRAM
> +are valid.
> +.SH VERSIONS
> +Support for VMware has been available since Linux 3.9.  KVM (virtio) is
> +supported since Linux 4.8.  Hyper-V is supported since 4.14.
> +.SH SEE ALSO
> +.BR socket (2),
> +.BR bind (2),
> +.BR connect (2),
> +.BR listen (2),
> +.BR send (2),
> +.BR recv (2),
> +.BR capabilities (7)
> --
> 2.14.3
>
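
To make the advice about VMADDR_CID_ANY in the patch above concrete, here is a minimal sketch of filling in a struct sockaddr_vm and binding a listening stream socket without querying the local CID first. The helper names vsock_addr_init() and vsock_listen_any() are hypothetical, and the code assumes the kernel headers provide linux/vm_sockets.h.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>
#include <linux/vm_sockets.h>

/* Hypothetical helper: fill in a VSOCK address.  CID and port are in
   native byte order; reserved fields are zeroed. */
void
vsock_addr_init(struct sockaddr_vm *addr, unsigned int cid,
                unsigned int port)
{
    assert(addr != NULL);
    memset(addr, 0, sizeof(*addr));
    addr->svm_family = AF_VSOCK;
    addr->svm_cid = cid;
    addr->svm_port = port;
}

/* Hypothetical helper: create a stream socket listening on the given
   port on any local CID.  Returns the descriptor, or -1 with errno
   set (for example when no VSOCK transport is available). */
int
vsock_listen_any(unsigned int port)
{
    struct sockaddr_vm addr;
    int fd, saved;

    fd = socket(AF_VSOCK, SOCK_STREAM, 0);
    if (fd == -1)
        return -1;

    vsock_addr_init(&addr, VMADDR_CID_ANY, port);
    if (bind(fd, (struct sockaddr *) &addr, sizeof(addr)) == -1 ||
        listen(fd, 1) == -1) {
        saved = errno;               /* close() may overwrite errno */
        close(fd);
        errno = saved;
        return -1;
    }
    return fd;
}
```

Passing VMADDR_PORT_ANY as the port would, per the page text, let the kernel pick a free port.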





Re: aio poll, io_pgetevents and a new in-kernel poll API V2

2018-01-10 Thread Michael Kerrisk (man-pages)
Hi Christoph,

On 01/10/2018 04:58 PM, Christoph Hellwig wrote:
> Hi all,
> 
> this series adds support for the IOCB_CMD_POLL operation to poll for the
> readiness of file descriptors using the aio subsystem.  The API is based
> on patches that existed in RHAS2.1 and RHEL3, which means it already is
> supported by libaio.  To implement the poll support efficiently new
> methods to poll are introduced in struct file_operations:  get_poll_head
> and poll_mask.  The first one returns a wait_queue_head to wait on
> (lifetime is bound by the file), and the second does a non-blocking
> check for the POLL* events.  This allows aio poll to work without
> any additional context switches, unlike epoll.
> 
> To make the interface fully useful a new io_pgetevents system call is
> added, which atomically saves and restores the signal mask over the
> io_pgetevents system call.  It is the logical equivalent to pselect and
> ppoll for io_pgetevents.
> 
> The corresponding libaio changes for io_pgetevents support and
> documentation, as well as a test case will be posted in a separate
> series.
> 
> The changes were sponsored by Scylladb, and improve performance
> of the seastar framework up to 10%, while also removing the need
> for a privileged SCHED_FIFO epoll listener thread.
> 
> The patches are on top of Al's __poll_t annotations, so I've also
> prepared a git branch on top of those here:
> 
> git://git.infradead.org/users/hch/vfs.git aio-poll
> 
> Gitweb:
> 
> http://git.infradead.org/users/hch/vfs.git/shortlog/refs/heads/aio-poll.2
> 
> Libaio changes:
> 
> http://git.infradead.org/users/hch/libaio.git/shortlog/refs/heads/aio-poll
> 
> Seastar changes:
> 
> https://github.com/avikivity/seastar/commits/aio
> 
> Changes since V1:
>  - handle the NULL ->poll case in vfs_poll
>  - dropped the file argument to the ->poll_mask socket operation
>  - replace the ->pre_poll socket operation with ->get_poll_head as
>in the file operations
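
The cover letter compares io_pgetevents to pselect and ppoll. Since the signature of the new call is not given in this thread, the sketch below uses pselect(2) as a stand-in to show the same idea: a signal mask that applies only for the duration of the wait. The helper name wait_readable() is hypothetical.

```c
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <sys/select.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: wait until fd is readable or the timeout
   expires, with wait_mask installed as the signal mask for the
   duration of the wait only.  Returns 1 if readable, 0 on timeout,
   -1 on error (errno set, e.g. EINTR when a signal that wait_mask
   leaves unblocked was handled). */
int
wait_readable(int fd, const sigset_t *wait_mask, long timeout_ms)
{
    struct timespec ts;
    fd_set rfds;

    assert(fd >= 0 && fd < FD_SETSIZE);
    ts.tv_sec = timeout_ms / 1000;
    ts.tv_nsec = (timeout_ms % 1000) * 1000000L;

    FD_ZERO(&rfds);
    FD_SET(fd, &rfds);

    /* The mask is swapped in, the wait performed, and the old mask
       restored as one atomic step, so a signal cannot arrive in the
       gap between unblocking it and starting to wait. */
    return pselect(fd + 1, &rfds, NULL, NULL, &ts, wait_mask);
}
```

A caller would normally keep the signal blocked with sigprocmask(2) and pass a mask that unblocks it, so the handler can run only while the process is waiting.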

Are there some man pages patches already for these changes?

Thanks,

Michael





Re: [PATCHv3 0/2] capability controlled user-namespaces

2017-12-30 Thread Michael Kerrisk (man-pages)
Hello Mahesh,

On 12/28/2017 01:45 AM, Mahesh Bandewar (महेश बंडेवार) wrote:
> On Wed, Dec 27, 2017 at 12:23 PM, Michael Kerrisk (man-pages)
>  wrote:
>> Hello Mahesh,
>>
>> On 27 December 2017 at 18:09, Mahesh Bandewar (महेश बंडेवार)
>>  wrote:
>>> Hello James,
>>>
>>> Seems like I missed your name to be added into the review of this
>>> patch series. Would you be willing to pull this into the security
>>> tree? Serge Hallyn has already ACKed it.
>>
>> We seem to have no formal documentation/specification of this feature.
>> I think that should be written up before this patch goes into
>> mainline...
>>
> absolutely. I have added enough information into the Documentation dir
> relevant to this feature (please look at the  individual patches),
> that could be used. I could help if needed.

Yes, but I think that the documentation is rather incomplete.
I'll also reply to the relevant Documentation thread.

See also some comments below about this commit message, which
should make things *much* easier for the reader.

>>> On Tue, Dec 5, 2017 at 2:30 PM, Mahesh Bandewar  wrote:
>>>> From: Mahesh Bandewar 
>>>>
>>>> TL;DR version
>>>> -
>>>> Creating a sandbox environment with namespaces is challenging
>>>> considering what these sandboxed processes can engage into. e.g.
>>>> CVE-2017-6074, CVE-2017-7184, CVE-2017-7308 etc. just to name few.
>>>> Current form of user-namespaces, however, if changed a bit can allow
>>>> us to create a sandbox environment without locking down user-
>>>> namespaces.
>>>>
>>>> Detailed version
>>>> 
>>>>
>>>> Problem
>>>> ---
>>>> User-namespaces in the current form have increased the attack surface as
>>>> any process can acquire capabilities which are not available to them (by
>>>> default) by performing combination of clone()/unshare()/setns() syscalls.
>>>>
>>>> #define _GNU_SOURCE
>>>> #include <stdio.h>
>>>> #include <sched.h>
>>>> #include <unistd.h>
>>>> #include <sys/socket.h>
>>>> #include <netinet/in.h>
>>>>
>>>> int main(int ac, char **av)
>>>> {
>>>>     int sock = -1;
>>>>
>>>>     /* Without CAP_NET_RAW this attempt is expected to fail */
>>>>     printf("Attempting to open RAW socket before unshare()...\n");
>>>>     sock = socket(AF_INET6, SOCK_RAW, IPPROTO_RAW);
>>>>     if (sock < 0) {
>>>>         perror("socket() SOCK_RAW failed");
>>>>     } else {
>>>>         printf("Successfully opened RAW-Sock before unshare().\n");
>>>>         close(sock);
>>>>         sock = -1;
>>>>     }
>>>>
>>>>     /* New user and network namespaces: full capabilities inside them */
>>>>     if (unshare(CLONE_NEWUSER | CLONE_NEWNET) < 0) {
>>>>         perror("unshare() failed");
>>>>         return 1;
>>>>     }
>>>>
>>>>     printf("Attempting to open RAW socket after unshare()...\n");
>>>>     sock = socket(AF_INET6, SOCK_RAW, IPPROTO_RAW);
>>>>     if (sock < 0) {
>>>>         perror("socket() SOCK_RAW failed");
>>>>     } else {
>>>>         printf("Successfully opened RAW-Sock after unshare().\n");
>>>>         close(sock);
>>>>         sock = -1;
>>>>     }
>>>>
>>>>     return 0;
>>>> }
>>>>
>>>> The above example shows how easy it is to acquire NET_RAW capabilities
>>>> and once acquired, these processes could take benefit of above mentioned
>>>> or similar issues discovered/undiscovered with malicious intent.

But you do not actually describe what the problem is. I think
it's not sufficient to simply refer to some CVEs.
Your mail message/commit should clearly describe what the issue is,
rather than leave the reader to decipher a bunch of CVEs, and derive
your concerns from those CVEs.

>>>> Note
>>>> that this is just an example and the problem/solution is not limited
>>>> to NET_RAW capability *only*.
>>>>
>>>> The easiest fix one can apply here is to lock-down user-namespaces which
>>>> many of the distros do (i.e. don't allow users to create user namespaces),
>>>> but unfortunately that prevents everyone from using them.
>>>>
>>>> Approach
>>>> 
>>>> Introduce a notion of '

Re: [PATCHv3 1/2] capability: introduce sysctl for controlled user-ns capability whitelist

2017-12-30 Thread Michael Kerrisk (man-pages)
CTL
> +int proc_douserns_caps_whitelist(struct ctl_table *table, int write,
> +  void __user *buff, size_t *lenp, loff_t *ppos)
> +{
> + DECLARE_BITMAP(caps_bitmap, CAP_LAST_CAP);
> + struct ctl_table caps_table;
> + char tbuf[NAME_MAX];
> + int ret;
> +
> + ret = bitmap_from_u32array(caps_bitmap, CAP_LAST_CAP,
> +controlled_userns_caps_whitelist.cap,
> +_KERNEL_CAPABILITY_U32S);
> + if (ret != CAP_LAST_CAP)
> + return -1;
> +
> + scnprintf(tbuf, NAME_MAX, "%*pb", CAP_LAST_CAP, caps_bitmap);
> +
> + caps_table.data = tbuf;
> + caps_table.maxlen = NAME_MAX;
> + caps_table.mode = table->mode;
> + ret = proc_dostring(&caps_table, write, buff, lenp, ppos);
> + if (ret)
> + return ret;
> + if (write) {
> + kernel_cap_t tmp;
> +
> + if (!capable(CAP_SYS_ADMIN))
> + return -EPERM;
> +
> + ret = bitmap_parse_user(buff, *lenp, caps_bitmap, CAP_LAST_CAP);
> + if (ret)
> + return ret;
> +
> + ret = bitmap_to_u32array(tmp.cap, _KERNEL_CAPABILITY_U32S,
> +  caps_bitmap, CAP_LAST_CAP);
> + if (ret != CAP_LAST_CAP)
> + return -1;
> +
> + controlled_userns_caps_whitelist = tmp;
> + }
> + return 0;
> +}
> +#endif /* CONFIG_SYSCTL */
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 557d46728577..759b6c286806 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -1217,6 +1217,11 @@ static struct ctl_table kern_table[] = {
>   .extra2 = &one,
>   },
>  #endif
> + {
> + .procname   = "controlled_userns_caps_whitelist",
> + .mode   = 0644,
> + .proc_handler   = proc_douserns_caps_whitelist,
> + },
>   { }
>  };
>  
> 




Re: [PATCHv3 0/2] capability controlled user-namespaces

2017-12-27 Thread Michael Kerrisk (man-pages)
Hello Mahesh,

On 27 December 2017 at 18:09, Mahesh Bandewar (महेश बंडेवार)
 wrote:
> Hello James,
>
> Seems like I missed your name to be added into the review of this
> patch series. Would you be willing to pull this into the security
> tree? Serge Hallyn has already ACKed it.

We seem to have no formal documentation/specification of this feature.
I think that should be written up before this patch goes into
mainline...

Cheers,

Michael


>
> On Tue, Dec 5, 2017 at 2:30 PM, Mahesh Bandewar  wrote:
>> From: Mahesh Bandewar 
>>
>> TL;DR version
>> -
>> Creating a sandbox environment with namespaces is challenging
>> considering what these sandboxed processes can engage into. e.g.
>> CVE-2017-6074, CVE-2017-7184, CVE-2017-7308 etc. just to name few.
>> Current form of user-namespaces, however, if changed a bit can allow
>> us to create a sandbox environment without locking down user-
>> namespaces.
>>
>> Detailed version
>> 
>>
>> Problem
>> ---
>> User-namespaces in the current form have increased the attack surface as
>> any process can acquire capabilities which are not available to them (by
>> default) by performing combination of clone()/unshare()/setns() syscalls.
>>
>> #define _GNU_SOURCE
>> #include <stdio.h>
>> #include <sched.h>
>> #include <unistd.h>
>> #include <sys/socket.h>
>> #include <netinet/in.h>
>>
>> int main(int ac, char **av)
>> {
>>     int sock = -1;
>>
>>     /* Without CAP_NET_RAW this attempt is expected to fail */
>>     printf("Attempting to open RAW socket before unshare()...\n");
>>     sock = socket(AF_INET6, SOCK_RAW, IPPROTO_RAW);
>>     if (sock < 0) {
>>         perror("socket() SOCK_RAW failed");
>>     } else {
>>         printf("Successfully opened RAW-Sock before unshare().\n");
>>         close(sock);
>>         sock = -1;
>>     }
>>
>>     /* New user and network namespaces: full capabilities inside them */
>>     if (unshare(CLONE_NEWUSER | CLONE_NEWNET) < 0) {
>>         perror("unshare() failed");
>>         return 1;
>>     }
>>
>>     printf("Attempting to open RAW socket after unshare()...\n");
>>     sock = socket(AF_INET6, SOCK_RAW, IPPROTO_RAW);
>>     if (sock < 0) {
>>         perror("socket() SOCK_RAW failed");
>>     } else {
>>         printf("Successfully opened RAW-Sock after unshare().\n");
>>         close(sock);
>>         sock = -1;
>>     }
>>
>>     return 0;
>> }
>>
>> The above example shows how easy it is to acquire NET_RAW capabilities
>> and once acquired, these processes could take benefit of above mentioned
>> or similar issues discovered/undiscovered with malicious intent. Note
>> that this is just an example and the problem/solution is not limited
>> to NET_RAW capability *only*.
>>
>> The easiest fix one can apply here is to lock-down user-namespaces which
>> many of the distros do (i.e. don't allow users to create user namespaces),
>> but unfortunately that prevents everyone from using them.
>>
>> Approach
>> 
>> Introduce a notion of 'controlled' user-namespaces. Every process on
>> the host is allowed to create user-namespaces (governed by the limit
>> imposed by per-ns sysctl) however, mark user-namespaces created by
>> sandboxed processes as 'controlled'. Use this 'mark' at the time of
>> capability check in conjunction with a global capability whitelist.
>> If the capability is not whitelisted, processes that belong to
>> controlled user-namespaces will not be allowed.
>>
>> Once a user-ns is marked as 'controlled'; all its child user-
>> namespaces are marked as 'controlled' too.
>>
>> A global whitelist is list of capabilities governed by the
>> sysctl which is available to (privileged) user in init-ns to modify
>> while it's applicable to all controlled user-namespaces on the host.
>>
>> Marking user-namespaces controlled without modifying the whitelist is
>> equivalent of the current behavior. The default value of whitelist includes
>> all capabilities so that the compatibility is maintained. However it gives
>> admins fine-grained ability to control various capabilities system wide
>> without locking down user-namespaces.
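
The approach described above can be modelled in a few lines of user-space C. This is only a sketch of the stated rules (the mark is inherited by child namespaces, and the whitelist applies only to controlled namespaces). It is not the kernel code from the patches, and every name in it (userns_model, userns_create, cap_allowed) is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user-space model of a user namespace. */
struct userns_model {
    const struct userns_model *parent;  /* NULL for the initial namespace */
    bool controlled;                    /* set at creation, never cleared */
};

/* A new namespace is controlled if its creator is sandboxed or if its
   parent is already controlled. */
struct userns_model
userns_create(const struct userns_model *parent, bool creator_sandboxed)
{
    struct userns_model ns;

    assert(parent != NULL);
    ns.parent = parent;
    ns.controlled = creator_sandboxed || parent->controlled;
    return ns;
}

/* Extra check layered on top of the normal capability test: in a
   controlled namespace a capability is usable only if its bit is set
   in the global whitelist.  Uncontrolled namespaces are unaffected. */
bool
cap_allowed(const struct userns_model *ns, int cap, uint64_t whitelist)
{
    assert(ns != NULL && cap >= 0 && cap < 64);
    if (!ns->controlled)
        return true;
    return (whitelist >> cap) & 1;
}
```

With a whitelist that has every bit set, cap_allowed() returns true for every namespace, which corresponds to the compatibility default described above.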
>>
>> Please see individual patches in this series.
>>
>> Mahesh Bandewar (2):
>>   capability: introduce sysctl for controlled user-ns capability whitelist
>>   userns: control capabilities of some user namespaces
>>
>>  Docume

Re: [PATCH v2] vsock.7: document VSOCK socket address family

2017-12-11 Thread Michael Kerrisk (man-pages)
 live migration if the old CID is not 
>> available
>> +on the new host.  Bound sockets are automatically updated to the new CID.
>> +.SS Ioctls
>> +.TP
>> +.B IOCTL_VM_SOCKETS_GET_LOCAL_CID
>> +Get the CID of the local machine.  The argument is a pointer to an unsigned int.
>> +.IP
>> +.in +4n
>> +.EX
>> +.IB error " = ioctl(" socket ", " IOCTL_VM_SOCKETS_GET_LOCAL_CID ", " &cid ");"
>> +.EE
>> +.in
>> +.IP
>> +Consider using
>> +.B VMADDR_CID_ANY
>> +when binding instead of getting the local CID with
>> +.BR IOCTL_VM_SOCKETS_GET_LOCAL_CID .
>> +.SH ERRORS
>> +.TP
>> +.B EACCES
>> +Unable to bind to a privileged port without the
>> +.B CAP_NET_BIND_SERVICE
>> +capability.
>> +.TP
>> +.B EINVAL
>> +Invalid parameters.  This includes:
>> +attempting to bind a socket that is already bound, providing an invalid struct
>> +.BR sockaddr_vm ,
>> +and other input validation errors.
>> +.TP
>> +.B EOPNOTSUPP
>> +Operation not supported.  This includes:
>> +the
>> +.B MSG_OOB
>> +flag that is not implemented for
>> +.BR sendmsg (2)
>> +and
>> +.B MSG_PEEK
>> +for
>> +.BR recvmsg (2).
>> +.TP
>> +.B EADDRINUSE
>> +Unable to bind to a port that is already in use.
>> +.TP
>> +.B EADDRNOTAVAIL
>> +Unable to find a free port for binding or unable to bind to a non-local CID.
>> +.TP
>> +.B ENOTCONN
>> +Unable to perform operation on an unconnected socket.
>> +.TP
>> +.B ENOPROTOOPT
>> +Invalid socket option in
>> +.BR setsockopt (2)
>> +or
>> +.BR getsockopt (2).
>> +.TP
>> +.B EPROTONOSUPPORT
>> +Invalid socket protocol number.  Protocol should always be 0.
>> +.TP
>> +.B ESOCKTNOSUPPORT
>> +Unsupported socket type in
>> +.BR socket (2).
>> +Only
>> +.B SOCK_STREAM
>> +and
>> +.B SOCK_DGRAM
>> +are valid.
>> +.SH VERSIONS
>> +Support for VMware (VMCI) has been available since Linux 3.9.  KVM (virtio) is
>> +supported since Linux 4.8.  Hyper-V is supported since 4.14.
>> +.SH SEE ALSO
>> +.BR socket (2),
>> +.BR bind (2),
>> +.BR connect (2),
>> +.BR listen (2),
>> +.BR send (2),
>> +.BR recv (2),
>> +.BR capabilities (7)
>> -- 
>> 2.14.3
>>
> 
> Looks great to me. Thanks for doing this. I don’t have anything to add.
> 
> Reviewed-by: Jorgen Hansen 

Thanks, Jorgen!

Cheers,

Michael




Re: [PATCH v2] vsock.7: document VSOCK socket address family

2017-12-11 Thread Michael Kerrisk (man-pages)
+.in +4n
> +.EX
> +.IB error " = ioctl(" socket ", " IOCTL_VM_SOCKETS_GET_LOCAL_CID ", " &cid ");"
> +.EE
> +.in
> +.IP
> +Consider using
> +.B VMADDR_CID_ANY
> +when binding instead of getting the local CID with
> +.BR IOCTL_VM_SOCKETS_GET_LOCAL_CID .
> +.SH ERRORS
> +.TP
> +.B EACCES
> +Unable to bind to a privileged port without the
> +.B CAP_NET_BIND_SERVICE
> +capability.
> +.TP
> +.B EINVAL
> +Invalid parameters.  This includes:
> +attempting to bind a socket that is already bound, providing an invalid struct
> +.BR sockaddr_vm ,
> +and other input validation errors.
> +.TP
> +.B EOPNOTSUPP
> +Operation not supported.  This includes:
> +the
> +.B MSG_OOB
> +flag that is not implemented for
> +.BR sendmsg (2)
> +and
> +.B MSG_PEEK
> +for
> +.BR recvmsg (2).

So these errors might also occur for send() and recv(), right?

> +.TP
> +.B EADDRINUSE
> +Unable to bind to a port that is already in use.
> +.TP
> +.B EADDRNOTAVAIL
> +Unable to find a free port for binding or unable to bind to a non-local CID.
> +.TP
> +.B ENOTCONN
> +Unable to perform operation on an unconnected socket.
> +.TP
> +.B ENOPROTOOPT
> +Invalid socket option in
> +.BR setsockopt (2)
> +or
> +.BR getsockopt (2).
> +.TP
> +.B EPROTONOSUPPORT
> +Invalid socket protocol number.  Protocol should always be 0.
> +.TP
> +.B ESOCKTNOSUPPORT
> +Unsupported socket type in
> +.BR socket (2).
> +Only
> +.B SOCK_STREAM
> +and
> +.B SOCK_DGRAM
> +are valid.
> +.SH VERSIONS
> +Support for VMware (VMCI) has been available since Linux 3.9.  KVM (virtio) is
> +supported since Linux 4.8.  Hyper-V is supported since 4.14.
> +.SH SEE ALSO
> +.BR socket (2),
> +.BR bind (2),
> +.BR connect (2),
> +.BR listen (2),
> +.BR send (2),
> +.BR recv (2),
> +.BR capabilities (7)

Cheers,

Michael


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: Incorrect behaviour or documentation problem of SO_RXQ_OVFL

2017-11-20 Thread Michael Kerrisk (man-pages)
[Adding Neil, who wrote the original text. Maybe he also has some
suggested improvements.]

Hello Petr and Tobias,

Thank you both for your reports about the incorrect documentation. See below.

On 15 November 2017 at 16:14, Petr Malat  wrote:
> Hi!
> Generic SO_RXQ_OVFL helpers sock_skb_set_dropcount() and sock_recv_drops()
> implement returning of sk->sk_drops (the total number of dropped packets),
> although the documentation says the number of dropped packets since the
> last received one should be returned (quoting the current socket.7):
>   SO_RXQ_OVFL (since Linux 2.6.33)
>   Indicates that an unsigned 32-bit value ancillary message (cmsg)
>   should be attached to received skbs indicating the number of packets
>   dropped by the socket between the last received packet and this
>   received packet.
>
> I assume the documentation needs to be updated, as fixing this in the
> code could break programs depending on the current behavior, although
> the formerly planned functionality seems to be more useful.
>
> The problem can be revealed with the following program:
>
> #include <stdio.h>
> #include <string.h>
> #include <sys/types.h>
> #include <sys/socket.h>
> #include <sys/uio.h>
> #include <netinet/in.h>
> #include <arpa/inet.h>
>
> int extract_drop(struct msghdr *msg)
> {
>     struct cmsghdr *cmsg;
>     int rtn;
>
>     for (cmsg = CMSG_FIRSTHDR(msg); cmsg; cmsg = CMSG_NXTHDR(msg,cmsg)) {
>         if (cmsg->cmsg_level == SOL_SOCKET &&
>             cmsg->cmsg_type == SO_RXQ_OVFL) {
>             memcpy(&rtn, CMSG_DATA(cmsg), sizeof rtn);
>             return rtn;
>         }
>     }
>     return -1;
> }
>
> int main(int argc, char *argv[])
> {
>     struct sockaddr_in addr = { .sin_family = AF_INET };
>     char msg[48*1024], cmsgbuf[256];
>     struct iovec iov = { .iov_base = msg, .iov_len = sizeof msg };
>     int sk1, sk2, i, one = 1;
>
>     sk1 = socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP);
>     sk2 = socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP);
>
>     inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);
>     addr.sin_port = htons(5);
>
>     bind(sk1, (struct sockaddr*)&addr, sizeof addr);
>     connect(sk2, (struct sockaddr*)&addr, sizeof addr);
>
>     // Kernel doubles this limit, but it accounts also the SKB overhead,
>     // but it receives as long as there is at least 1 byte free.
>     i = sizeof msg;
>     setsockopt(sk1, SOL_SOCKET, SO_RCVBUF, &i, sizeof i);
>     setsockopt(sk1, SOL_SOCKET, SO_RXQ_OVFL, &one, sizeof one);
>
>     for (i = 0; i < 4; i++) {
>         int rtn;
>
>         send(sk2, msg, sizeof msg, 0);
>         send(sk2, msg, sizeof msg, 0);
>         send(sk2, msg, sizeof msg, 0);
>
>         do {
>             struct msghdr msghdr = {
>                 .msg_iov = &iov, .msg_iovlen = 1,
>                 .msg_control = &cmsgbuf,
>                 .msg_controllen = sizeof cmsgbuf };
>             rtn = recvmsg(sk1, &msghdr, MSG_DONTWAIT);
>             if (rtn > 0) {
>                 printf("rtn: %d drop %d\n", rtn,
>                        extract_drop(&msghdr));
>             } else {
>                 printf("rtn: %d\n", rtn);
>             }
>         } while (rtn > 0);
>     }
>
>     return 0;
> }
>
> which prints
>   rtn: 49152 drop -1
>   rtn: 49152 drop -1
>   rtn: -1
>   rtn: 49152 drop 1
>   rtn: 49152 drop 1
>   rtn: -1
>   rtn: 49152 drop 2
>   rtn: 49152 drop 2
>   rtn: -1
>   rtn: 49152 drop 3
>   rtn: 49152 drop 3
>   rtn: -1
> although it should print (according to the documentation):
>   rtn: 49152 drop 0
>   rtn: 49152 drop 0
>   rtn: -1
>   rtn: 49152 drop 1
>   rtn: 49152 drop 0
>   rtn: -1
>   rtn: 49152 drop 1
>   rtn: 49152 drop 0
>   rtn: -1
>   rtn: 49152 drop 1
>   rtn: 49152 drop 0
>   rtn: -1
>
> Please keep me on To:/CC: as I'm not on the list.

Thanks for the test program. Tobias reported the same issue, and I've
applied his suggested change to the page. (See below.)

Cheers,

Michael

diff --git a/man7/socket.7 b/man7/socket.7
index 79966a6fd..1a2cfe9cc 100644
--- a/man7/socket.7
+++ b/man7/socket.7
@@ -881,8 +881,7 @@ compete to receive datagrams on the same socket.
 .\" commit 3b885787ea4112eaa80945999ea0901bf742707f
 Indicates that an unsigned 32-bit value ancillary message (cmsg)
 should be attached to received skbs indicating
-the number of packets dropped by the socket between
-the last received packet and this received packet.
+the number of packets dropped by the socket since its creation.
 .TP
 .B SO_SNDBUF
 Sets or gets the maximum socket send buffer in bytes.
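
Since the value reported via SO_RXQ_OVFL is a running total, an
application that wants the number of drops between two received packets
has to subtract consecutive readings itself. A minimal sketch, assuming
the counter is read as an unsigned 32-bit value as the page text says;
rxq_ovfl_delta() is a hypothetical helper name:

```c
#include <stdint.h>

/* Number of packets dropped between two consecutive SO_RXQ_OVFL readings.
 * Unsigned subtraction wraps modulo 2^32, so the result stays correct
 * even if the kernel's cumulative counter overflowed in between.
 * rxq_ovfl_delta() is a hypothetical helper used only for illustration. */
uint32_t rxq_ovfl_delta(uint32_t prev, uint32_t cur)
{
    return cur - prev;
}
```

With Petr's output above, readings of 1 and then 2 on successive bursts
would give a delta of 1 each time, which is roughly what the old
documentation led readers to expect from the option itself.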


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: Bug in socket(7) man page

2017-11-20 Thread Michael Kerrisk (man-pages)
[CC widened]

Tobias,

On 7 August 2017 at 13:53, Tobias Klausmann  wrote:
> Hi!
>
> This bug pertains to the manpage as visible on man7.org right
> now.
>
> The socket(7) man page has this paragraph:
>
>    SO_RXQ_OVFL (since Linux 2.6.33)
>           Indicates that an unsigned 32-bit value ancillary message (cmsg)
>           should be attached to received skbs indicating the number of
>           packets dropped by the socket between the last received packet
>           and this received packet.
>
> The second half is wrong: the counter (internally,
> SOCK_SKB_CB(skb)->dropcount) is *not* reset after every packet.
> That is, it is a proper counter, not a gauge, in monitoring
> parlance.
>
> A better version of that paragraph:
>
>    SO_RXQ_OVFL (since Linux 2.6.33)
>           Indicates that an unsigned 32-bit value ancillary message (cmsg)
>           should be attached to received skbs indicating the number of
>           packets dropped by the socket since its creation.

Thanks for the report. See also my reply to Petr in just a moment.
I've taken your suggested text change.

Cheers,

Michael


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: [patch] netlink.7: srcfix Change buffer size in example code about reading netlink message.

2017-08-15 Thread Michael Kerrisk (man-pages)
On 11/14/2016 11:36 PM, Rick Jones wrote:
>> Let's change the example so others don't propagate the problem further.
>>
>> Signed-off-by David Wilder 
>>
>> --- man7/netlink.7.orig 2016-11-14 13:30:36.522101156 -0800
>> +++ man7/netlink.7  2016-11-14 13:30:51.002086354 -0800
>> @@ -511,7 +511,7 @@
>>  .in +4n
>>  .nf
>>  int len;
>> -char buf[4096];
>> +char buf[8192];
> 
> Since there doesn't seem to be a define one could use in the user space 
> linux/netlink.h (?), but there are comments in the example code in the 
> manpage, how about also including a brief comment to the effect that 
> using 8192 bytes will avoid message truncation problems on platforms 
> with a large PAGE_SIZE?
> 
> /* avoid msg truncation on > 4096 byte PAGE_SIZE platforms */
> 
> or something like that.

Thanks for the suggestion, Rick. Done!

Cheers,

Michael


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: [patch] netlink.7: srcfix Change buffer size in example code about reading netlink message.

2017-08-15 Thread Michael Kerrisk (man-pages)
On 11/14/2016 11:20 PM, dwilder wrote:
> The example code in netlink(7) (for reading a netlink message) suggests using
> a 4k read buffer with recvmsg.  This can cause truncated messages on systems
> using a page size >4096.  Please see:
> linux/include/linux/netlink.h (in the kernel source)
> 
> 
> /*
>   *  skb should fit one page. This choice is good for headerless malloc.
>   *  But we should limit to 8K so that userspace does not have to
>   *  use enormous buffer sizes on recvmsg() calls just to avoid
>   *  MSG_TRUNC when PAGE_SIZE is very large.
>   */
> #if PAGE_SIZE < 8192UL
> #define NLMSG_GOODSIZE  SKB_WITH_OVERHEAD(PAGE_SIZE)
> #else
> #define NLMSG_GOODSIZE  SKB_WITH_OVERHEAD(8192UL)
> #endif
> 
> #define NLMSG_DEFAULT_SIZE (NLMSG_GOODSIZE - NLMSG_HDRLEN)
> 
> 
> I was troubleshooting some upstream code on a ppc64le system
> (page size of 64k).  This code had duplicated the example from netlink(7) and
> was using a 4k buffer.  On x86-64 with a 4k page size this is not a problem,
> however on the 64k page system some messages were truncated.  Using an 8k buffer
> as implied in netlink.h prevents problems with any page size.
> 
> Let's change the example so others don't propagate the problem further.
> 
> Signed-off-by David Wilder 

Thanks, David. Patch applied.
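
For code that already copied the old example, one way to notice the
problem David describes is to check msg_flags for MSG_TRUNC after
recvmsg(). A minimal sketch, assuming a datagram-style socket such as
netlink; NL_RECV_BUFSIZE and recv_check_trunc() are hypothetical names
chosen for illustration:

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* 8192 follows the cap on NLMSG_GOODSIZE quoted above (assumed name). */
#define NL_RECV_BUFSIZE 8192

/* Receive one datagram into buf and report whether it was truncated.
 * Returns the recvmsg() result; *truncated is set only on success.
 * recv_check_trunc() is a hypothetical helper, not a library function. */
ssize_t recv_check_trunc(int fd, void *buf, size_t len, int *truncated)
{
    struct iovec iov = { .iov_base = buf, .iov_len = len };
    struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1 };
    ssize_t n = recvmsg(fd, &msg, 0);

    if (n >= 0)
        *truncated = (msg.msg_flags & MSG_TRUNC) != 0;
    return n;
}
```

A caller would typically declare char buf[NL_RECV_BUFSIZE] and treat a
set truncated flag as an error, since the rest of that message has been
discarded by the kernel.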

Cheers,

Michael


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: [patch] socket.7: Document SO_INCOMING_CPU

2017-04-20 Thread Michael Kerrisk (man-pages)
On 04/19/2017 10:13 PM, Eric Dumazet wrote:
> On Wed, 2017-04-19 at 20:48 +0200, Michael Kerrisk (man-pages) wrote:
>> Hi Eric,
>>
>> [reordering for clarity]
>>
>>>> On 02/19/2017 09:55 PM, Michael Kerrisk (man-pages) wrote:
>>>>> [CC += Eric, so that he might review]
>>>>>
>>>>> Hello Francois,
>>>>>
>>>>> On 02/18/2017 05:06 AM, Francois Saint-Jacques wrote:
>>>>>> This socket option is undocumented. Applies on the latest version
>>>>>> (man-pages-4.09-511).
>>>>>>
>>>>>> diff --git a/man7/socket.7 b/man7/socket.7
>>>>>> index 3efd7a5d8..1a3ffa253 100644
>>>>>> --- a/man7/socket.7
>>>>>> +++ b/man7/socket.7
>>>>>> @@ -490,6 +490,26 @@ flag on a socket
>>>>>>  operation.
>>>>>>  Expects an integer boolean flag.
>>>>>>  .TP
>>>>>> +.BR SO_INCOMING_CPU " (getsockopt since Linux 3.19, setsockopt since Linux 4.4)"
>>>>>> +.\" getsockopt 2c8c56e15df3d4c2af3d656e44feb18789f75837
>>>>>> +.\" setsockopt 70da268b569d32a9fddeea85dc18043de9d89f89
>>>>>> +Sets or gets the cpu affinity of a socket. Expects an integer flag.
>>>>>> +.sp
>>>>>> +.in +4n
>>>>>> +.nf
>>>>>> +int cpu = 1;
>>>>>> +socklen_t len = sizeof(cpu);
>>>>>> +setsockopt(fd, SOL_SOCKET, SO_INCOMING_CPU, &cpu, len);
>>>>>> +.fi
>>>>>> +.in
>>>>>> +.sp
>>>>>> +The typical use case is one listener per RX queue, as the associated listener
>>>>>> +should only accept flows handled in softirq by the same cpu.  This provides
>>>>>> +optimal NUMA behavior and keeps cpu caches hot.
>>>>>> +.TP
>>>>>>  .B SO_KEEPALIVE
>>>>>>  Enable sending of keep-alive messages on connection-oriented sockets.
>>>>>>  Expects an integer boolean flag.
>>>>>
>>>>> Thank you! Patch applied.
>>>>>
>>>>> I have tried to enhance the description somewhat. I'm not sure whether
>>>>> what I've written is quite correct (or whether it should be further
>>>>> extended). Eric, could you please take a look at the following, and let 
>>>>> me know if anything needs fixing:
>>>>>
>>>>>    SO_INCOMING_CPU (gettable since Linux 3.19, settable since Linux 4.4)
>>>>>        Sets or gets the CPU affinity of a socket.  Expects an
>>>>>        integer flag.
>>>>>
>>>>>            int cpu = 1;
>>>>>            socklen_t len = sizeof(cpu);
>>>>>            setsockopt(fd, SOL_SOCKET, SO_INCOMING_CPU, &cpu, len);
>>>>>
>>>>>        Because all of the packets for a single stream (i.e., all
>>>>>        packets for the same 4-tuple) arrive on the single RX queue
>>>>>        that is associated with a particular CPU, the typical use
>>>>>        case is to employ one listening process per RX queue, with
>>>>>        the incoming flow being handled by a listener on the same
>>>>>        CPU that is handling the RX queue.  This provides optimal
>>>>>        NUMA behavior and keeps CPU caches hot.
>>
>>> Hi Michael
>>>
>>> Sorry for the delay.
>>
>> Thanks for the reply, but I think you are assuming I know more than 
>> I do. I'd like you to elaborate a little please. See below.
>>
>>> Note that setting the option is not supported if SO_REUSEPORT is used.
>>
>> Please define "not supported". Does this yield an API diagnostic?
>> If so, what is it?
>>
>>> Socket will be selected from an array, either by a hash or BPF program
>>> that has no access to this information.
>>
>> Sorry -- I'm lost here. How does this comment relate to the proposed
>> man page text above?
> 
> Simply that:
> 
> If an application uses both SO_INCOMING_CPU and SO_REUSEPORT, then
> SO_REUSEPORT logic, selecting the socket to receive the packet, ignores
> SO_INCOMING_CPU setting.
> 
> This does not need to be documented, because it is an implementation
> detail/bug that could be changed, if someone cares enough.

Okay, thanks, Eric. I'll just merge the page text as it currently 
is then.

Cheers,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: [patch] socket.7: Document SO_INCOMING_CPU

2017-04-19 Thread Michael Kerrisk (man-pages)
Hi Eric,

[reordering for clarity]

>> On 02/19/2017 09:55 PM, Michael Kerrisk (man-pages) wrote:
>>> [CC += Eric, so that he might review]
>>>
>>> Hello Francois,
>>>
>>> On 02/18/2017 05:06 AM, Francois Saint-Jacques wrote:
>>>> This socket option is undocumented. Applies on the latest version
>>>> (man-pages-4.09-511).
>>>>
>>>> diff --git a/man7/socket.7 b/man7/socket.7
>>>> index 3efd7a5d8..1a3ffa253 100644
>>>> --- a/man7/socket.7
>>>> +++ b/man7/socket.7
>>>> @@ -490,6 +490,26 @@ flag on a socket
>>>>  operation.
>>>>  Expects an integer boolean flag.
>>>>  .TP
>>>> +.BR SO_INCOMING_CPU " (getsockopt since Linux 3.19, setsockopt since Linux 4.4)"
>>>> +.\" getsockopt 2c8c56e15df3d4c2af3d656e44feb18789f75837
>>>> +.\" setsockopt 70da268b569d32a9fddeea85dc18043de9d89f89
>>>> +Sets or gets the cpu affinity of a socket. Expects an integer flag.
>>>> +.sp
>>>> +.in +4n
>>>> +.nf
>>>> +int cpu = 1;
>>>> +socklen_t len = sizeof(cpu);
>>>> +setsockopt(fd, SOL_SOCKET, SO_INCOMING_CPU, &cpu, len);
>>>> +.fi
>>>> +.in
>>>> +.sp
>>>> +The typical use case is one listener per RX queue, as the associated listener
>>>> +should only accept flows handled in softirq by the same cpu.  This provides
>>>> +optimal NUMA behavior and keeps cpu caches hot.
>>>> +.TP
>>>>  .B SO_KEEPALIVE
>>>>  Enable sending of keep-alive messages on connection-oriented sockets.
>>>>  Expects an integer boolean flag.
>>>
>>> Thank you! Patch applied.
>>>
>>> I have tried to enhance the description somewhat. I'm not sure whether
>>> what I've written is quite correct (or whether it should be further
>>> extended). Eric, could you please take a look at the following, and let 
>>> me know if anything needs fixing:
>>>
>>>    SO_INCOMING_CPU (gettable since Linux 3.19, settable since Linux 4.4)
>>>        Sets or gets the CPU affinity of a socket.  Expects an
>>>        integer flag.
>>>
>>>            int cpu = 1;
>>>            socklen_t len = sizeof(cpu);
>>>            setsockopt(fd, SOL_SOCKET, SO_INCOMING_CPU, &cpu, len);
>>>
>>>        Because all of the packets for a single stream (i.e., all
>>>        packets for the same 4-tuple) arrive on the single RX queue
>>>        that is associated with a particular CPU, the typical use
>>>        case is to employ one listening process per RX queue, with
>>>        the incoming flow being handled by a listener on the same
>>>        CPU that is handling the RX queue.  This provides optimal
>>>        NUMA behavior and keeps CPU caches hot.

> Hi Michael
> 
> Sorry for the delay.

Thanks for the reply, but I think you are assuming I know more than 
I do. I'd like you to elaborate a little please. See below.

> Note that setting the option is not supported if SO_REUSEPORT is used.

Please define "not supported". Does this yield an API diagnostic?
If so, what is it?

> Socket will be selected from an array, either by a hash or BPF program
> that has no access to this information.

Sorry -- I'm lost here. How does this comment relate to the proposed
man page text above?

Thanks,

Michael


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: [patch] socket.7: Document SO_INCOMING_CPU

2017-04-19 Thread Michael Kerrisk (man-pages)
Ping Eric!

Would you have a chance to review the proposed text below, please.

Thanks,

Michael

On 02/19/2017 09:55 PM, Michael Kerrisk (man-pages) wrote:
> [CC += Eric, so that he might review]
> 
> Hello Francois,
> 
> On 02/18/2017 05:06 AM, Francois Saint-Jacques wrote:
>> This socket option is undocumented. Applies on the latest version
>> (man-pages-4.09-511).
>>
>> diff --git a/man7/socket.7 b/man7/socket.7
>> index 3efd7a5d8..1a3ffa253 100644
>> --- a/man7/socket.7
>> +++ b/man7/socket.7
>> @@ -490,6 +490,26 @@ flag on a socket
>>  operation.
>>  Expects an integer boolean flag.
>>  .TP
>> +.BR SO_INCOMING_CPU " (getsockopt since Linux 3.19, setsockopt since Linux 4.4)"
>> +.\" getsockopt 2c8c56e15df3d4c2af3d656e44feb18789f75837
>> +.\" setsockopt 70da268b569d32a9fddeea85dc18043de9d89f89
>> +Sets or gets the cpu affinity of a socket. Expects an integer flag.
>> +.sp
>> +.in +4n
>> +.nf
>> +int cpu = 1;
>> +socklen_t len = sizeof(cpu);
>> +setsockopt(fd, SOL_SOCKET, SO_INCOMING_CPU, &cpu, len);
>> +.fi
>> +.in
>> +.sp
>> +The typical use case is one listener per RX queue, as the associated listener
>> +should only accept flows handled in softirq by the same cpu.  This provides
>> +optimal NUMA behavior and keeps cpu caches hot.
>> +.TP
>>  .B SO_KEEPALIVE
>>  Enable sending of keep-alive messages on connection-oriented sockets.
>>  Expects an integer boolean flag.
> 
> Thank you! Patch applied.
> 
> I have tried to enhance the description somewhat. I'm not sure whether
> what I've written is quite correct (or whether it should be further
> extended). Eric, could you please take a look at the following, and let 
> me know if anything needs fixing:
> 
>    SO_INCOMING_CPU (gettable since Linux 3.19, settable since Linux 4.4)
>        Sets or gets the CPU affinity of a socket.  Expects an
>        integer flag.
>
>            int cpu = 1;
>            socklen_t len = sizeof(cpu);
>            setsockopt(fd, SOL_SOCKET, SO_INCOMING_CPU, &cpu, len);
>
>        Because all of the packets for a single stream (i.e., all
>        packets for the same 4-tuple) arrive on the single RX queue
>        that is associated with a particular CPU, the typical use
>        case is to employ one listening process per RX queue, with
>        the incoming flow being handled by a listener on the same
>        CPU that is handling the RX queue.  This provides optimal
>        NUMA behavior and keeps CPU caches hot.
> 
> Cheers,
> 
> Michael
> 


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: [net-next PATCH 0/5] Add busy poll support for epoll under certain circumstances

2017-03-18 Thread Michael Kerrisk
[CC += linux-...@vger.kernel.org]

Hello Alexander

Since this is a kernel-user-space API change, please CC linux-api@
(and on future iterations of the patch). The kernel source file
Documentation/SubmitChecklist notes that all Linux kernel patches that
change userspace interfaces should be CCed to
linux-...@vger.kernel.org, so that the various parties who are
interested in API changes are informed. For further information, see
https://www.kernel.org/doc/man-pages/linux-api-ml.html

Thanks,

Michael



On Thu, Mar 16, 2017 at 7:32 PM, Alexander Duyck
 wrote:
> This patch series is meant to add busy polling support to epoll when all of
> the sockets on a given epoll are either local or are being sourced by the
> same NAPI ID.
>
> In order to support this the first two patches clean up a few issues we
> found with the NAPI ID tracking and infrastructure.
>
> In the third patch we introduce SO_INCOMING_NAPI_ID so that applications
> have a means of trying to sort their incoming sockets to identify which
> requests should be routed where in order to keep the epoll listener aligned
> to a given Rx queue without having to rely on IRQ pinning.
>
> Finally the last two patches refactor the existing busy poll infrastructure
> to make it so that we can call it without necessarily needing a socket, and
> enable the bits needed to support epoll when all of the sockets on the
> epoll either share the same NAPI ID, or simply are reporting no NAPI ID.
>
> ---
>
> Sridhar Samudrala (5):
>   net: Do not record sender_cpu as napi_id in socket receive paths
>   net: Call sk_mark_napi_id() in the ACK receive path
>   net: Introduce SO_INCOMING_NAPI_ID
>   net: Commonize busy polling code to focus on napi_id instead of socket
>   epoll: Add busy poll support to epoll with socket fds.
>
>
>  arch/alpha/include/uapi/asm/socket.h   |   2 +
>  arch/avr32/include/uapi/asm/socket.h   |   2 +
>  arch/frv/include/uapi/asm/socket.h     |   2 +
>  arch/ia64/include/uapi/asm/socket.h    |   2 +
>  arch/m32r/include/uapi/asm/socket.h    |   2 +
>  arch/mips/include/uapi/asm/socket.h    |   2 +
>  arch/mn10300/include/uapi/asm/socket.h |   2 +
>  arch/parisc/include/uapi/asm/socket.h  |   2 +
>  arch/powerpc/include/uapi/asm/socket.h |   2 +
>  arch/s390/include/uapi/asm/socket.h    |   2 +
>  arch/sparc/include/uapi/asm/socket.h   |   2 +
>  arch/xtensa/include/uapi/asm/socket.h  |   2 +
>  fs/eventpoll.c                         | 115
>  include/net/busy_poll.h                |  14 +++-
>  include/uapi/asm-generic/socket.h      |   2 +
>  net/core/dev.c                         |  16 ++--
>  net/core/sock.c                        |  22 ++
>  net/ipv4/tcp_ipv4.c                    |   1
>  18 files changed, 183 insertions(+), 11 deletions(-)
>
> --



-- 
Michael Kerrisk Linux man-pages maintainer;
http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface", http://blog.man7.org/


Re: [PATCH net-next RFC v1 00/27] afnetns: new namespace type for separation on protocol level

2017-03-13 Thread Michael Kerrisk
[CC += linux-...@vger.kernel.org]

Hannes,

Since this is a kernel-user-space API change, please CC linux-api@
(and on future iterations of the series). The kernel source file
Documentation/SubmitChecklist notes that all Linux kernel patches that
change userspace interfaces should be CCed to
linux-...@vger.kernel.org, so that the various parties who are
interested in API changes are informed. For further information, see
https://www.kernel.org/doc/man-pages/linux-api-ml.html

Thanks,

Michael


On Mon, Mar 13, 2017 at 12:44 AM, Hannes Frederic Sowa
 wrote:
> Hi,
>
> On Sun, 2017-03-12 at 16:26 -0700, David Miller wrote:
>> From: Hannes Frederic Sowa 
>> Date: Mon, 13 Mar 2017 00:01:24 +0100
>>
>> > afnetns behaves like ordinary namespaces: clone, unshare, setns syscalls
>> > can work with afnetns with one limitation: one cannot cross the realm
>> > of a network namespace while changing the afnetns compartment. To get
>> > into a new afnetns in a different net namespace, one must first change
>> > to the net namespace and afterwards switch to the desired afnetns.
>>
>> Please explain why this is useful, who wants this kind of facility,
>> and how it will be used.
>
> Yes, I have to enhance the cover letter:
>
> The work behind all this is to provide more dense container hosting.
> Right now we lose performance, because all packets need to be forwarded
> through either a bridge or must be routed until they reach the
> containers. For example, we can't make use of early demuxing for the
> incoming packets. We basically pass the networking stack twice for
> every packet.
>
> The usage is very much in line with how network namespaces are used
> nowadays:
>
> ip afnetns add afns-1
> ip address add 192.168.1.1/24 dev eth0 afnetns afns-1
> ip afnetns exec afns-1 /usr/sbin/httpd
>
> this spawns a shell where all child processes will only have access to
> the specific ip addresses, even though they do a wildcard bind. Source
> address selection will also use only the ip addresses available to the
> children.
>
> In some sense it has lots of characteristics like ipvlan, allowing a
> single MAC address to host lots of IP addresses which will end up in
> different namespaces. Unlike ipvlan, however, it will also solve the
> problem around duplicate address detection and multiplexing packets to
> the IGMP or MLD state machines.
>
> The resource consumption in comparison with ordinary namespaces will be
> much lower. All in all, we will have far fewer networking subsystems to
> cross compared to normal netns solutions.
>
> Some more information is also in the first patch, which adds
> documentation.
>
> Bye,
> Hannes
>



-- 
Michael Kerrisk Linux man-pages maintainer;
http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface", http://blog.man7.org/


Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY

2017-02-27 Thread Michael Kerrisk
cles--  cpu cycles
>          std      zc   %     std      zc   %
> 4K    27,609  11,217  41  49,217  39,175  79
> 16K   21,370   3,823  18  43,540  29,213  67
> 64K   20,557   2,312  11  42,189  26,910  64
> 256K  21,110   2,134  10  43,006  27,104  63
> 1M    20,987   1,610   8  42,759  25,931  61
>
> Perf record indicates the main source of these differences. Process
> cycles only at 1M writes (perf record; perf report -n):
>
> std:
> Samples: 42K of event 'cycles', Event count (approx.): 21258597313
>  79.41% 33884  netperf  [kernel.kallsyms]  [k] copy_user_generic_string
>   3.27%  1396  netperf  [kernel.kallsyms]  [k] tcp_sendmsg
>   1.66%   694  netperf  [kernel.kallsyms]  [k] get_page_from_freelist
>   0.79%   325  netperf  [kernel.kallsyms]  [k] tcp_ack
>   0.43%   188  netperf  [kernel.kallsyms]  [k] __alloc_skb
>
> zc:
> Samples: 1K of event 'cycles', Event count (approx.): 1439509124
>  30.36%   584  netperf.zerocop  [kernel.kallsyms]  [k] gup_pte_range
>  14.63%   284  netperf.zerocop  [kernel.kallsyms]  [k] __zerocopy_sg_from_iter
>   8.03%   159  netperf.zerocop  [kernel.kallsyms]  [k] skb_zerocopy_add_frags_iter
>   4.84%    96  netperf.zerocop  [kernel.kallsyms]  [k] __alloc_skb
>   3.10%    60  netperf.zerocop  [kernel.kallsyms]  [k] kmem_cache_alloc_node
>
>
> * Safety
>
> The number of pages that can be pinned on behalf of a user with
> MSG_ZEROCOPY is bound by the locked memory ulimit.
>
> While the kernel holds process memory pinned, a process cannot safely
> reuse those pages for other purposes. Packets looped onto the receive
> stack and queued to a socket can be held indefinitely. Avoid unbounded
> notification latency by restricting user pages to egress paths only.
> skb_orphan_frags_rx() will create a private copy of pages even for
> refcounted packets when these are looped, as did skb_orphan_frags for
> the original tun zerocopy implementation.
>
> Pages are not remapped read-only. Processes can modify packet contents
> while packets are in flight in the kernel path. Bytes on which kernel
> control flow depends (headers) are copied to avoid TOCTTOU attacks.
> Datapath integrity does not otherwise depend on payload, with three
> exceptions: checksums, optional sk_filter/tc u32/.. and device +
> driver logic. The effect of wrong checksums is limited to the
> misbehaving process. TC filters that access contents may have to be
> excluded by adding an skb_orphan_frags_rx.
>
> Processes can also safely avoid OOM conditions by bounding the number
> of bytes passed with MSG_ZEROCOPY and by removing shared pages after
> transmission from their own memory map.
>
>
> * Limitations / Known Issues
>
> - PF_INET6 is not yet supported.
> - TCP does not build max GSO packets, especially for
>  small send buffers (< 4 KB)
>
> Willem de Bruijn (12):
>   sock: allocate skbs from optmem
>   sock: skb_copy_ubufs support for compound pages
>   sock: add generic socket zerocopy
>   sock: enable sendmsg zerocopy
>   sock: sendmsg zerocopy notification coalescing
>   sock: sendmsg zerocopy ulimit
>   sock: sendmsg zerocopy limit bytes per notification
>   tcp: enable sendmsg zerocopy
>   udp: enable sendmsg zerocopy
>   raw: enable sendmsg zerocopy with IP_HDRINCL
>   packet: enable sendmsg zerocopy
>   test: add sendmsg zerocopy tests
>
>  drivers/net/tun.c                             |   2 +-
>  drivers/vhost/net.c                           |   1 +
>  include/linux/sched.h                         |   2 +-
>  include/linux/skbuff.h                        |  94 +++-
>  include/linux/socket.h                        |   1 +
>  include/net/sock.h                            |   4 +
>  include/uapi/linux/errqueue.h                 |   1 +
>  net/core/datagram.c                           |  35 +-
>  net/core/dev.c                                |   4 +-
>  net/core/skbuff.c                             | 327 --
>  net/core/sock.c                               |  29 ++
>  net/ipv4/ip_output.c                          |  34 +-
>  net/ipv4/raw.c                                |  27 +-
>  net/ipv4/tcp.c                                |  37 +-
>  net/packet/af_packet.c                        |  52 ++-
>  tools/testing/selftests/net/.gitignore        |   2 +
>  tools/testing/selftests/net/Makefile          |   1 +
>  tools/testing/selftests/net/snd_zerocopy.c    | 354 +++
>  tools/testing/selftests/net/snd_zerocopy_lo.c | 596 ++
>  19 files changed, 1536 insertions(+), 67 deletions(-)
>  create mode 100644 tools/testing/selftests/net/snd_zerocopy.c
>  create mode 100644 tools/testing/selftests/net/snd_zerocopy_lo.c
>
> --
> 2.11.0.483.g087da7b7c-goog
>
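
The Safety section above suggests that a process bound the number of
bytes it passes with MSG_ZEROCOPY. One way to do that in user space is a
small in-flight byte budget: reserve before a zerocopy send, release
when the completion notification for that send arrives. This is only a
sketch of that bookkeeping; struct zc_budget, zc_reserve() and
zc_release() are hypothetical names, and no kernel interface is involved:

```c
#include <stdbool.h>
#include <stddef.h>

/* Tracks how many bytes are currently pinned by outstanding zerocopy sends.
 * All names here are hypothetical; this is application-side bookkeeping. */
struct zc_budget {
    size_t limit;       /* maximum bytes allowed in flight */
    size_t in_flight;   /* bytes sent with zerocopy and not yet completed */
};

/* Try to reserve len bytes; returns false if the budget would be exceeded,
 * in which case the caller could fall back to an ordinary copying send. */
bool zc_reserve(struct zc_budget *b, size_t len)
{
    if (len > b->limit - b->in_flight)
        return false;
    b->in_flight += len;
    return true;
}

/* Return len bytes to the budget once their completion has been seen.
 * The amount is clamped so that in_flight never underflows. */
void zc_release(struct zc_budget *b, size_t len)
{
    b->in_flight -= (len < b->in_flight) ? len : b->in_flight;
}
```

Choosing the limit below the locked-memory ulimit mentioned above would
keep the application from running into the kernel-side bound first.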



-- 
Michael Kerrisk Linux man-pages maintainer;
http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface", http://blog.man7.org/


Re: [patch] socket.7: Document SO_INCOMING_CPU

2017-02-19 Thread Michael Kerrisk (man-pages)
[CC += Eric, so that he might review]

Hello Francois,

On 02/18/2017 05:06 AM, Francois Saint-Jacques wrote:
> This socket option is undocumented. Applies on the latest version
> (man-pages-4.09-511).
> 
> diff --git a/man7/socket.7 b/man7/socket.7
> index 3efd7a5d8..1a3ffa253 100644
> --- a/man7/socket.7
> +++ b/man7/socket.7
> @@ -490,6 +490,26 @@ flag on a socket
>  operation.
>  Expects an integer boolean flag.
>  .TP
> +.BR SO_INCOMING_CPU " (getsockopt since Linux 3.19, setsockopt since Linux 4.4)"
> +.\" getsockopt 2c8c56e15df3d4c2af3d656e44feb18789f75837
> +.\" setsockopt 70da268b569d32a9fddeea85dc18043de9d89f89
> +Sets or gets the cpu affinity of a socket. Expects an integer flag.
> +.sp
> +.in +4n
> +.nf
> +int cpu = 1;
> +socklen_t len = sizeof(cpu);
> +setsockopt(fd, SOL_SOCKET, SO_INCOMING_CPU, &cpu, len);
> +.fi
> +.in
> +.sp
> +The typical use case is one listener per RX queue, as the associated listener
> +should only accept flows handled in softirq by the same cpu.  This provides
> +optimal NUMA behavior and keeps cpu caches hot.
> +.TP
>  .B SO_KEEPALIVE
>  Enable sending of keep-alive messages on connection-oriented sockets.
>  Expects an integer boolean flag.

Thank you! Patch applied.

I have tried to enhance the description somewhat. I'm not sure whether
what I've written is quite correct (or whether it should be further
extended). Eric, could you please take a look at the following, and let 
me know if anything needs fixing:

   SO_INCOMING_CPU  (gettable  since Linux 3.19, settable since Linux
   4.4)
  Sets or gets the CPU affinity  of  a  socket.   Expects  an
  integer flag.

  int cpu = 1;
  socklen_t len = sizeof(cpu);
  setsockopt(fd, SOL_SOCKET, SO_INCOMING_CPU, &cpu, len);

  Because  all  of the packets for a single stream (i.e., all
  packets for the same 4-tuple) arrive on the single RX queue
  that  is  associated with a particular CPU, the typical use
  case is to employ one listening process per RX queue,  with
  the  incoming  flow being handled by a listener on the same
  CPU that is handling the RX queue.  This  provides  optimal
  NUMA behavior and keeps CPU caches hot.

Cheers,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: [PATCH v2 00/10] userns: sysctl limits for namespaces

2016-07-26 Thread Michael Kerrisk (man-pages)
On 26 July 2016 at 18:52, Kees Cook  wrote:
> On Tue, Jul 26, 2016 at 8:06 AM, Eric W. Biederman
>  wrote:
>> "Michael Kerrisk (man-pages)"  writes:
>>
>>> Hello Eric,
>>>
>>> I realized I had a question after the last mail.
>>>
>>> On 07/21/2016 06:39 PM, Eric W. Biederman wrote:
>>>>
>>>> This patchset addresses two use cases:
>>>> - Implement a sane upper bound on the number of namespaces.
>>>> - Provide a way for sandboxes to limit the attack surface from
>>>>   namespaces.
>>>
>>> Can you say more about the second point? What exactly is the
>>> problem that is being addressed, and how does the patch series
>>> address it? (It would be good to have those details in the
>>> revised commit message...)
>>
>> At some point it was reported that seccomp was not sufficient to disable
>> namespace creation.  I need to go back and look at that claim to see
>> which set of circumstances that was referring to.  Seccomp doesn't stack
>> so I can see why it is an issue.
>
> seccomp does stack. The trouble usually comes from a perception that
> seccomp overhead is not trivial, so setting a system-wide policy is a
> bit of a large hammer for such a limitation. Also, at the time,
> seccomp could be bypassed with ptrace, but this (as of v4.8) is no
> longer true.

Sounds like someone needs to send me a patch for the seccomp.2 man page?

Cheers,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: [PATCH v2 00/10] userns: sysctl limits for namespaces

2016-07-26 Thread Michael Kerrisk (man-pages)

Hello Eric,

I realized I had a question after the last mail.

On 07/21/2016 06:39 PM, Eric W. Biederman wrote:


> This patchset addresses two use cases:
> - Implement a sane upper bound on the number of namespaces.
> - Provide a way for sandboxes to limit the attack surface from
>   namespaces.


Can you say more about the second point? What exactly is the
problem that is being addressed, and how does the patch series
address it? (It would be good to have those details in the
revised commit message...)

Cheers,

Michael




Re: [PATCH v2 00/10] userns: sysctl limits for namespaces

2016-07-26 Thread Michael Kerrisk (man-pages)

Hello Eric,

On 07/21/2016 06:39 PM, Eric W. Biederman wrote:


> This patchset addresses two use cases:
> - Implement a sane upper bound on the number of namespaces.
> - Provide a way for sandboxes to limit the attack surface from
>   namespaces.
>
> The maximum sane case I can imagine is if every process is a fat
> process, so I set the maximum number of namespaces to the maximum
> number of threads.
>
> I make these limits recursive and per user namespace so that a
> usernamespace root can reduce the limits further.  If a user namespace
> root raises the limit the limit in the parent namespace will be honored.
>
> I have cut this implementation to the bare minimum needed to achieve
> these objectives.
>
> Does anyone know if there is a proper error code to return for resource
> limit exceeded?  I am currently using -EUSERS or -ENFILE but both of
> those feel a little wrong.


ENFILE certainly seems weird. I suppose my first question is: why two
different errors?

Some alternatives you might want to consider: E2BIG, EOVERFLOW,
or (maybe) ERANGE.

Cheers,

Michael








Re: [PATCH] netlink.7: describe netlink socket options

2016-06-12 Thread Michael Kerrisk (man-pages)
Hi Andrey,

On 06/10/2016 10:28 PM, Andrey Vagin wrote:
> Cc: Kir Kolyshkin 
> Cc: Michael Kerrisk 
> Cc: Herbert Xu 
> Cc: Patrick McHardy 
> Cc: Christophe Ricard 
> Cc: Nicolas Dichtel 
> Signed-off-by: Andrey Vagin 
> ---
>  man7/netlink.7 | 75 ++
>  1 file changed, 75 insertions(+)


Thanks for the nicely done patch. Applied!

Cheers,

Michael


> diff --git a/man7/netlink.7 b/man7/netlink.7
> index 513f854..b4848df 100644
> --- a/man7/netlink.7
> +++ b/man7/netlink.7
> @@ -368,6 +368,81 @@ and
>  .BR NETLINK_SELINUX
>  groups allow other users to receive messages.
>  No groups allow other users to send messages.
> +
> +.SS Socket options
> +To set or get a netlink socket option, call
> +.BR getsockopt (2)
> +to read or
> +.BR setsockopt (2)
> +to write the option with the option level argument set to
> +.BR SOL_NETLINK .
> +Unless otherwise noted,
> +.I optval
> +is a pointer to an
> +.IR int .
> +.TP
> +.BR NETLINK_PKTINFO " (since Linux 2.6.14)"
> +Enable
> +.B nl_pktinfo
> +control messages for received packets to get the extended
> +destination group number.
> +.TP
> +.BR NETLINK_ADD_MEMBERSHIP ,\  NETLINK_DROP_MEMBERSHIP " (since Linux 2.6.14)"
> +Join/leave a group specified by
> +.IR optval .
> +.\"  commit 9a4595bc7e67962f13232ee55a64e063062c3a99
> +.\"  Author: Patrick McHardy 
> +.TP
> +.BR NETLINK_LIST_MEMBERSHIPS " (since Linux 4.2)"
> +Retrieve all groups a socket is a member of.
> +.I optval
> +is a pointer to
> +.B __u32
> +and
> +.I optlen
> +is the size of the array. The array is filled with the full membership set of the
> +socket, and the required array size is returned in
> +.I optlen.
> +.\"  commit b42be38b2778eda2237fc759e55e3b698b05b315
> +.\"  Author: David Herrmann 
> +.TP
> +.BR NETLINK_BROADCAST_ERROR " (since Linux 2.6.30)"
> +When not set,
> +.B netlink_broadcast()
> +only reports
> +.B ESRCH
> +errors and silently ignore
> +.B NOBUFS
> +errors.
> +.\"  commit be0c22a46cfb79ab2342bb28fde99afa94ef868e
> +.\"  Author: Pablo Neira Ayuso 
> +.TP
> +.BR NETLINK_NO_ENOBUFS " (since Linux 2.6.30)"
> +This flag can be used by unicast and broadcast listeners to avoid receiving
> +.B ENOBUFS
> +errors.
> +.\"  commit 38938bfe3489394e2eed5e40c9bb8f66a2ce1405
> +.\"  Author: Pablo Neira Ayuso 
> +.TP
> +.BR NETLINK_LISTEN_ALL_NSID " (since Linux 4.2)"
> +When set, this socket will receive netlink notifications from all network namespaces that
> +have an
> +.I nsid
> +assigned into the network namespace where the socket has been opened. The
> +.I nsid
> +is sent to user space via an ancillary data.
> +.\"  commit 59324cf35aba5336b611074028777838a963d03b
> +.\"  Author: Nicolas Dichtel 
> +.TP
> +.BR NETLINK_CAP_ACK " (since Linux 4.2)"
> +The kernel may fail to allocate the necessary room for the acknowledgment
> +message back to userspace. This option trims off the payload of the original
> +netlink message.
> +The netlink message header is still included, so the user can guess from the
> +sequence number what is the message that has triggered the acknowledgment.
> +.\"  commit 0a6a3a23ea6efde079a5b77688541a98bf202721
> +.\"  Author: Christophe Ricard 
> +
>  .SH VERSIONS
>  The socket interface to netlink is a new feature of Linux 2.2.
>  
> 


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: [PATCH] ip.7: Fix incorrect sockopt name

2016-03-25 Thread Michael Kerrisk (man-pages)
Hello Benjamin,

On 03/22/2016 09:28 AM, Benjamin Poirier wrote:
> "IP_LEAVE_GROUP" does not exist. It was perhaps a confusion with
> MCAST_LEAVE_GROUP. Change the text to IP_DROP_MEMBERSHIP which has the same
> function as MCAST_LEAVE_GROUP and is documented in the ip.7 man page.
> 
> Reference:
> Linux kernel net/ipv4/ip_sockglue.c do_ip_setsockopt()

Thanks! Applied.

Cheers,

Michael


> Cc: Radek Pazdera 
> Signed-off-by: Benjamin Poirier 
> ---
>  man7/ip.7 | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/man7/ip.7 b/man7/ip.7
> index 3905573..37e2c86 100644
> --- a/man7/ip.7
> +++ b/man7/ip.7
> @@ -376,7 +376,7 @@ a given multicast group that come from a given source.
>  If the application has subscribed to multiple sources within
>  the same group, data from the remaining sources will still be delivered.
>  To stop receiving data from all sources at once, use
> -.BR IP_LEAVE_GROUP .
> +.BR IP_DROP_MEMBERSHIP .
>  .IP
>  Argument is an
>  .I ip_mreq_source
> 


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: [PATCH v2] socket.7: Document some BPF-related socket options

2016-03-01 Thread Michael Kerrisk (man-pages)
On 03/01/2016 11:10 AM, Vincent Bernat wrote:
>  ❦  1 mars 2016 11:03 +0100, "Michael Kerrisk (man-pages)" 
>  :
> 
>>   Once   the   SO_LOCK_FILTER  option  has  been  enabled,
>>   attempts by an unprivileged process to change or  remove
>>   the  filter  attached  to  a  socket,  or to disable the
>>   SO_LOCK_FILTER option will fail with the error EPERM.
> 
> You should remove "unprivileged". I didn't try to check for permissions
> because I was just lazy (and I didn't have a need for it). As root, you
> can just recreate another socket.

Bother. That's what I meant to do, and then I omitted to do it! Done now.
And thanks for catching that, Vincent.

Revised text below, with another query.

   SO_LOCK_FILTER
  When set, this option will prevent changing the filters
  associated with the socket.  These filters include any
  set using the socket options SO_ATTACH_FILTER,
  SO_ATTACH_BPF, SO_ATTACH_REUSEPORT_CBPF, and
  SO_ATTACH_REUSEPORT_EBPF.

  The typical use case is for a privileged process to set
  up a socket with restrictive filters, set SO_LOCK_FILTER,
  and then either drop its privileges or pass the
  socket file descriptor to an unprivileged process.

  Once the SO_LOCK_FILTER option has been enabled,
  attempts to change or remove the filter attached to a
  socket, or to disable the SO_LOCK_FILTER option, will
  fail with the error EPERM.

I think the second paragraph should probably drop mention of privileges,
right? In fact, maybe just drop the paragraph altogether?

Cheers,

Michael
 


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: [PATCH v2] socket.7: Document some BPF-related socket options

2016-03-01 Thread Michael Kerrisk (man-pages)
Hi Craig,

On 02/29/2016 06:36 PM, Craig Gallek wrote:
> From: Craig Gallek 

Thanks for the improvements. I've applied the patch and tweaked things 
somewhat, but I have a few comments and queries below. I'd be 
grateful if you'd check these, in case I have introduced any errors.
(The tweaked version of the page can be found in the Git repo.)

> Document the behavior and the first kernel version for each of the
> following socket options:
> SO_ATTACH_FILTER
> SO_ATTACH_BPF
> SO_ATTACH_REUSEPORT_CBPF
> SO_ATTACH_REUSEPORT_EBPF
> SO_DETACH_FILTER
> SO_DETACH_BPF
> SO_LOCK_FILTER
> 
> Signed-off-by: Craig Gallek 
> ---
> v2 changes:
> - Content suggestions from Michael Kerrisk :
>   * Clarify socket filter return value semantics
>   * Clarify wording of minimal kernel versions
>   * Explain behavior of multiple calls using SO_ATTACH_[BPF|FILTER]
>   * Define 'reuseport groups' in SO_ATTACH_REUSEPORT_*
> - Include SO_LOCK_FILTER documentation mostly based off of the wording
>   in the commit message by Vincent Bernat 
>   d59577b6ffd3 ("sk-filter: Add ability to lock a socket filter program")
> 
> ---
>  man7/socket.7 | 136 +-
>  1 file changed, 115 insertions(+), 21 deletions(-)
> 
> diff --git a/man7/socket.7 b/man7/socket.7
> index db7cb8324dde..d22107cc47d7 100644
> --- a/man7/socket.7
> +++ b/man7/socket.7
> @@ -41,9 +41,6 @@
>  .\"  SO_GET_FILTER (3.8)
>  .\"  commit a8fc92778080c845eaadc369a0ecf5699a03bef0
>  .\"  Author: Pavel Emelyanov 
> -.\"  SO_LOCK_FILTER (3.9)
> -.\"  commit d59577b6ffd313d0ab3be39cb1ab47e29bdc9182
> -.\"  Author: Vincent Bernat 
>  .\"  SO_SELECT_ERR_QUEUE (3.10)
>  .\" commit 7d4c04fc170087119727119074e72445f2bb192b
>  .\"  Author: Keller, Jacob E 
> @@ -53,13 +50,6 @@
>  .\" SO_BPF_EXTENSIONS (3.14)
>  .\" commit ea02f9411d9faa3553ed09ce0ec9f00ceae9885e
>  .\"  Author: Michal Sekletar 
> -.\" SO_ATTACH_BPF (3.19)
> -.\" and SO_DETACH_BPF as synonym for SO_DETACH_FILTER
> -.\" commit 89aa075832b0da4402acebd698d0411dcc82d03e
> -.\"  Author: Alexei Starovoitov 
> -.\"  SO_ATTACH_REUSEPORT_CBPF, SO_ATTACH_REUSEPORT_EBPF (4.5)
> -.\"  commit 538950a1b7527a0a52ccd9337e3fcd304f027f13
> -.\"  Author: Craig Gallek 
>  .\"
>  .TH SOCKET 7 2015-05-07 Linux "Linux Programmer's Manual"
>  .SH NAME
> @@ -311,6 +301,90 @@ The value 0 indicates that this is not a listening socket,
>  the value 1 indicates that this is a listening socket.
>  This socket option is read-only.
>  .TP
> +.BR SO_ATTACH_FILTER " and " SO_ATTACH_BPF
> +Attach a classic or extended BPF program (respectively) to the socket
> +for use as a filter of incoming packets. A packet will be dropped if
> +the filter program returns zero.  If the filter program returns a
> +non-zero value which is less than the packet's data length, the packet
> +will be truncated to the length returned.  If the value returned by
> +the filter is greater than or equal to the packet's data length, the
> +packet is allowed to proceed unmodified.
> +
> +The argument for
> +.BR SO_ATTACH_FILTER
> +is a
> +.I sock_fprog
> +structure in
> +.B <linux/filter.h>.
> +.sp
> +.in +4n
> +.nf
> +struct sock_fprog {
> +unsigned short  len;
> +struct sock_filter *filter;
> +};
> +.fi
> +.in
> +.IP
> +The argument for
> +.BR SO_ATTACH_BPF
> +is a file descriptor returned by the
> +.BR bpf (2)
> +system call and must refer to a program of type
> +.BR BPF_PROG_TYPE_SOCKET_FILTER.
> +These options may be set multiple times for a given socket, each time
> +replacing the previous filter program.  The classic and extended
> +versions may be called on the same socket, but the previous filter
> +will always be replaced such that a socket never has more than one
> +filter defined.
> +
> +.BR SO_ATTACH_FILTER
> +is available since Linux 2.2.
> +.BR SO_ATTACH_BPF
> +is available since Linux 3.19.  Both classic and extended BPF are
> +explained in the kernel source file
> +.I Documentation/networking/filter.txt
> +.TP
> +.BR SO_ATTACH_REUSEPORT_CBPF " and " SO_ATTACH_REUSEPORT_EBPF " (since Linux 4.5)"
> +For use with the
> +.BR SO_REUSEPORT
> +option, these options allow the user to set a classic or extended
> +BPF program (respectively) which defines how packets are assigned to
> +the sockets in the reuseport group (that is, all sockets which have
> +.BR SO_REUSEPORT

Re: [PATCH] socket.7: Document some BPF-related socket options

2016-02-28 Thread Michael Kerrisk (man-pages)
behavior that seems like it could easily 
trip up users!

> +
> +These options may be set repeatedly at any time on any single socket
> +in the group to replace the current BPF program used by all sockets in
> +the group.
> +.BR SO_ATTACH_REUSEPORT_CBPF
> +takes the same socket argument type as
> +.BR SO_ATTACH_FILTER
> +and
> +.BR SO_ATTACH_REUSEPORT_EBPF
> +takes the same socket argument type as
> +.BR SO_ATTACH_BPF.
> +UDP support for this feature is available in Linux 4.5.

s/in/since/

> +TCP support for this feature is available in Linux 4.6.

s/in/since/

> +.TP
>  .B SO_BINDTODEVICE
>  Bind this socket to a particular device like \(lqeth0\(rq,
>  as specified in the passed interface name.
> @@ -368,6 +435,18 @@ Only allowed for processes with the
>  .B CAP_NET_ADMIN
>  capability or an effective user ID of 0.
>  .TP
> +.BR SO_DETACH_FILTER " and " SO_DETACH_BPF
> +These options may be used to remove the BPF program attached to the
> +socket with either
> +.BR SO_ATTACH_FILTER
> +or
> +.BR SO_ATTACH_BPF.
> +The option value is ignored.
> +.BR SO_DETACH_FILTER
> +is available in Linux 2.2.

s/in/since/

> +.BR SO_DETACH_BPF
> +is available in Linux 3.19.

s/in/since/

> +.TP
>  .BR SO_DOMAIN " (since Linux 2.6.32)"
>  Retrieves the socket domain as an integer, returning a value such as
>  .BR AF_INET6 .
> @@ -991,17 +1070,6 @@ where only the later program needs to set the
>  option.
>  Typically this difference is invisible, since, for example, a server
>  program is designed to always set this option.
> -.SH BUGS
> -The
> -.B CONFIG_FILTER
> -socket options
> -.B SO_ATTACH_FILTER
> -and
> -.B SO_DETACH_FILTER
> -.\" FIXME Document SO_ATTACH_FILTER and SO_DETACH_FILTER
> -are not documented.
> -The suggested interface to use them is via the libpcap
> -library.
>  .\" .SH AUTHORS
>  .\" This man page was written by Andi Kleen.
>  .SH SEE ALSO

Cheers,

Michael


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: [PATCH 1/1] include/uapi/linux/sockios.h: mark SIOCRTMSG unused

2015-12-30 Thread Michael Kerrisk (man-pages)
Hi Heinrich,

On 12/29/2015 11:22 PM, Heinrich Schuchardt wrote:
> IOCTL SIOCRTMSG does nothing but return EINVAL.
> 
> So comment it as unused.

Can you say something about how you confirmed this?
It's not immediately obvious from the code.

Cheers,

Michael


> Signed-off-by: Heinrich Schuchardt 
> ---
>  include/uapi/linux/sockios.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/include/uapi/linux/sockios.h b/include/uapi/linux/sockios.h
> index e888b1a..8e7890b 100644
> --- a/include/uapi/linux/sockios.h
> +++ b/include/uapi/linux/sockios.h
> @@ -27,7 +27,7 @@
>  /* Routing table calls. */
>  #define SIOCADDRT0x890B  /* add routing table entry  */
>  #define SIOCDELRT0x890C  /* delete routing table entry   */
> -#define SIOCRTMSG0x890D  /* call to routing system   */
> +#define SIOCRTMSG0x890D  /* unused   */
>  
>  /* Socket configuration controls. */
>  #define SIOCGIFNAME  0x8910  /* get iface name   */
> 


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch] poll.2: timeout_ts is a pointer, so use -> not . for member access

2015-12-23 Thread Michael Kerrisk (man-pages)
Hello Richard,

On 23 December 2015 at 20:30, richardvo...@gmail.com
 wrote:
> From the context, it is apparent that in the code explaining ppoll in
> terms of poll, timeout_ts must be a pointer.
>
> Usage #1:   ready = ppoll(&fds, nfds, timeout_ts, &sigmask);
>
> Usage #2:(timeout_ts == NULL)
>
> Thus member access in (timeout_ts.tv_sec * 1000 + timeout_ts.tv_nsec /
> 1000000) is an error.

Thanks. Patch applied.

Cheers,

Michael


> man2/poll.2 | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/man2/poll.2 b/man2/poll.2
> index bcbecad..34b55a6 100644
> --- a/man2/poll.2
> +++ b/man2/poll.2
> @@ -266,7 +266,7 @@ executing the following calls:
>  int timeout;
>
>  timeout = (timeout_ts == NULL) ? \-1 :
> -  (timeout_ts.tv_sec * 1000 + timeout_ts.tv_nsec / 1000000);
> +  (timeout_ts\->tv_sec * 1000 + timeout_ts\->tv_nsec / 1000000);
>  pthread_sigmask(SIG_SETMASK, &sigmask, &origmask);
>  ready = poll(&fds, nfds, timeout);
>  pthread_sigmask(SIG_SETMASK, &origmask, NULL);



-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: [PATCH v2 3/5] ebpf: add a way to dump an eBPF program

2015-09-11 Thread Michael Kerrisk (man-pages)
Hi Tycho,

On 11 September 2015 at 02:21, Tycho Andersen
 wrote:
> This commit adds a way to dump eBPF programs. The initial implementation
> doesn't support maps, and therefore only allows dumping seccomp ebpf
> programs which themselves don't currently support maps.

Same broken record :-).

Cheers,

Michael


> v2: don't export a prog_id for the filter
>
> Signed-off-by: Tycho Andersen 
> CC: Kees Cook 
> CC: Will Drewry 
> CC: Oleg Nesterov 
> CC: Andy Lutomirski 
> CC: Pavel Emelyanov 
> CC: Serge E. Hallyn 
> CC: Alexei Starovoitov 
> CC: Daniel Borkmann 
> ---
>  include/uapi/linux/bpf.h | 14 ++
>  kernel/bpf/syscall.c | 41 +
>  2 files changed, 55 insertions(+)
>
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 631cdee..e037a76 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -107,6 +107,13 @@ enum bpf_cmd {
>  * returns fd or negative error
>  */
> BPF_PROG_LOAD,
> +
> +   /* dump an existing bpf
> +* err = bpf(BPF_PROG_DUMP, union bpf_attr *attr, u32 size)
> +* Using attr->prog_fd, attr->dump_insn_cnt, attr->dump_insns
> +* returns zero or negative error
> +*/
> +   BPF_PROG_DUMP,
>  };
>
>  enum bpf_map_type {
> @@ -161,6 +168,13 @@ union bpf_attr {
> __aligned_u64   log_buf;/* user supplied buffer */
> __u32   kern_version;   /* checked when 
> prog_type=kprobe */
> };
> +
> +   struct { /* anonymous struct used by BPF_PROG_DUMP command */
> +   __u32   prog_fd;
> +   __u32   dump_insn_cnt;
> +   __aligned_u64   dump_insns; /* user supplied buffer */
> +   __u8gpl_compatible;
> +   };
>  } __attribute__((aligned(8)));
>
>  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index dc9b464..58ae9f4 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -586,6 +586,44 @@ free_prog:
> return err;
>  }
>
> +static int bpf_prog_dump(union bpf_attr *attr, union bpf_attr __user *uattr)
> +{
> +   int ufd = attr->prog_fd;
> +   struct fd f = fdget(ufd);
> +   struct bpf_prog *prog;
> +   int ret = -EINVAL;
> +
> +   prog = get_prog(f);
> +   if (IS_ERR(prog))
> +   return PTR_ERR(prog);
> +
> +   /* For now, let's refuse to dump anything that isn't a seccomp program.
> +* Other program types have support for maps, which our current dump
> +* code doesn't support.
> +*/
> +   if (prog->type != BPF_PROG_TYPE_SECCOMP)
> +   goto out;
> +
> +   ret = -EFAULT;
> +   if (put_user(prog->len, &uattr->dump_insn_cnt))
> +   goto out;
> +
> +   if (put_user((u8) prog->gpl_compatible, &uattr->gpl_compatible))
> +   goto out;
> +
> +   if (attr->dump_insns) {
> +   u32 len = prog->len * sizeof(struct bpf_insn);
> +
> +   if (copy_to_user(u64_to_ptr(attr->dump_insns),
> +prog->insns, len) != 0)
> +   goto out;
> +   }
> +
> +   ret = 0;
> +out:
> +   return ret;
> +}
> +
>  SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, size)
>  {
> union bpf_attr attr = {};
> @@ -650,6 +688,9 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, 
> uattr, unsigned int, siz
> case BPF_PROG_LOAD:
> err = bpf_prog_load(&attr);
> break;
> +   case BPF_PROG_DUMP:
> +   err = bpf_prog_dump(&attr, uattr);
> +   break;
> default:
> err = -EINVAL;
> break;
> --
> 2.1.4
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-api" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: [PATCH v2 5/5] seccomp: add a way to attach a filter via eBPF fd

2015-09-11 Thread Michael Kerrisk (man-pages)
c_set(&ret->usage, 1);
> +
> +   /* Intentionally don't bpf_prog_put() here, because the underlying 
> prog
> +* is refcounted too and we're holding a reference from the struct
> +* seccomp_filter object.
> +*/
> +   return ret;
> +}
> +
> +static long seccomp_ebpf_add_fd(struct seccomp_ebpf *ebpf)
> +{
> +   struct seccomp_filter *prepared;
> +
> +   prepared = seccomp_prepare_ebpf(ebpf->add_fd);
> +   if (IS_ERR(prepared))
> +   return PTR_ERR(prepared);
> +
> +   return seccomp_install_filter(ebpf->add_flags, prepared);
> +}
> +
> +static long seccomp_mode_filter_ebpf(unsigned int cmd, const char __user *uargs)
> +{
> +   const struct seccomp_ebpf __user *uebpf;
> +   struct seccomp_ebpf ebpf;
> +   unsigned int size;
> +   long ret = -EFAULT;
> +
> +   uebpf = (const struct seccomp_ebpf __user *) uargs;
> +
> +   if (get_user(size, &uebpf->size) != 0)
> +   return -EFAULT;
> +
> +   /* If we're handed a bigger struct than we know of,
> +* ensure all the unknown bits are 0 - i.e. new
> +* user-space does not rely on any kernel feature
> +* extensions we dont know about yet.
> +*/
> +   if (size > sizeof(ebpf)) {
> +   unsigned char __user *addr;
> +   unsigned char __user *end;
> +   unsigned char val;
> +
> +   addr = (void __user *)uebpf + sizeof(ebpf);
> +   end  = (void __user *)uebpf + size;
> +
> +   for (; addr < end; addr++) {
> +   int err = get_user(val, addr);
> +
> +   if (err)
> +   return err;
> +   if (val)
> +   return -E2BIG;
> +   }
> +   size = sizeof(ebpf);
> +   }
> +
> +   if (copy_from_user(&ebpf, uebpf, size) != 0)
> +   return -EFAULT;
> +
> +   switch (cmd) {
> +   case SECCOMP_EBPF_ADD_FD:
> +   ret = seccomp_ebpf_add_fd(&ebpf);
> +   break;
> +   }
> +
> +   return ret;
> +}
> +#else
> +static long seccomp_mode_filter_ebpf(unsigned int cmd, const char __user *uargs)
> +{
> +   return -EINVAL;
> +}
> +#endif
> +
>  /*
>   * Secure computing mode 1 allows only read/write/exit/sigreturn.
>   * To be fully secure this must be combined with rlimit
> @@ -760,9 +849,7 @@ out:
>  static long seccomp_set_mode_filter(unsigned int flags,
> const char __user *filter)
>  {
> -   const unsigned long seccomp_mode = SECCOMP_MODE_FILTER;
> struct seccomp_filter *prepared = NULL;
> -   long ret = -EINVAL;
>
> /* Validate flags. */
> if (flags & ~SECCOMP_FILTER_FLAG_MASK)
> @@ -773,6 +860,26 @@ static long seccomp_set_mode_filter(unsigned int flags,
> if (IS_ERR(prepared))
> return PTR_ERR(prepared);
>
> +   return seccomp_install_filter(flags, prepared);
> +}
> +
> +static long seccomp_install_filter(unsigned int flags,
> +  struct seccomp_filter *prepared)
> +{
> +   const unsigned long seccomp_mode = SECCOMP_MODE_FILTER;
> +   long ret = -EINVAL;
> +
> +   /*
> +* Installing a seccomp filter requires that the task has
> +* CAP_SYS_ADMIN in its namespace or be running with no_new_privs.
> +* This avoids scenarios where unprivileged tasks can affect the
> +* behavior of privileged children.
> +*/
> +   if (!task_no_new_privs(current) &&
> +   security_capable_noaudit(current_cred(), current_user_ns(),
> +CAP_SYS_ADMIN) != 0)
> +       return -EACCES;
> +
> /*
>  * Make sure we cannot change seccomp or nnp state via TSYNC
>  * while another thread is in the middle of calling exec.
> @@ -875,6 +982,8 @@ static long do_seccomp(unsigned int op, unsigned int 
> flags,
> return seccomp_set_mode_strict();
> case SECCOMP_SET_MODE_FILTER:
> return seccomp_set_mode_filter(flags, uargs);
> +   case SECCOMP_MODE_FILTER_EBPF:
> +   return seccomp_mode_filter_ebpf(flags, uargs);
> default:
> return -EINVAL;
> }
> --
> 2.1.4
>



-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: [PATCH v2 1/5] ebpf: add a seccomp program type

2015-09-11 Thread Michael Kerrisk (man-pages)
On 11 September 2015 at 02:20, Tycho Andersen
 wrote:
> seccomp uses eBPF as its underlying storage and execution format, and eBPF
> has features that seccomp would like to make use of in the future. This
> patch adds a formal seccomp type to the eBPF verifier.
>
> The current implementation of the seccomp eBPF type is very limited, and
> doesn't support some interesting features (notably, maps) of eBPF. However,
> the primary motivation for this patchset is to enable checkpoint/restore
> for seccomp filters later in the series, so this limited feature set is ok
> for now.

Hi Tycho,

Seems like a man-pages patch is warranted here also?

Cheers,

Michael


> v2: * don't allow seccomp eBPF programs to call any functions
> * get rid of superfluous seccomp_convert_ctx_access
>
> Signed-off-by: Tycho Andersen 
> CC: Kees Cook 
> CC: Will Drewry 
> CC: Oleg Nesterov 
> CC: Andy Lutomirski 
> CC: Pavel Emelyanov 
> CC: Serge E. Hallyn 
> CC: Alexei Starovoitov 
> CC: Daniel Borkmann 
> ---
>  include/uapi/linux/bpf.h |  1 +
>  net/core/filter.c| 31 +++
>  2 files changed, 32 insertions(+)
>
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 92a48e2..631cdee 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -123,6 +123,7 @@ enum bpf_prog_type {
> BPF_PROG_TYPE_KPROBE,
> BPF_PROG_TYPE_SCHED_CLS,
> BPF_PROG_TYPE_SCHED_ACT,
> +   BPF_PROG_TYPE_SECCOMP,
>  };
>
>  #define BPF_PSEUDO_MAP_FD  1
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 13079f0..faaae67 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -1612,6 +1612,15 @@ tc_cls_act_func_proto(enum bpf_func_id func_id)
> }
>  }
>
> +static const struct bpf_func_proto *
> +seccomp_func_proto(enum bpf_func_id func_id)
> +{
> +   /* At some point in the future seccomp filters may grow support for
> +* eBPF functions. For now, these are disabled.
> +*/
> +   return NULL;
> +}
> +
>  static bool __is_valid_access(int off, int size, enum bpf_access_type type)
>  {
> /* check bounds */
> @@ -1662,6 +1671,17 @@ static bool tc_cls_act_is_valid_access(int off, int 
> size,
> return __is_valid_access(off, size, type);
>  }
>
> +static bool seccomp_is_valid_access(int off, int size,
> +   enum bpf_access_type type)
> +{
> +   if (type == BPF_WRITE)
> +   return false;
> +
> +   if (off < 0 || off >= sizeof(struct seccomp_data) || off & 3)
> +   return false;
> +
> +   return true;
> +}
>  static u32 bpf_net_convert_ctx_access(enum bpf_access_type type, int dst_reg,
>   int src_reg, int ctx_off,
>   struct bpf_insn *insn_buf)
> @@ -1795,6 +1815,11 @@ static const struct bpf_verifier_ops tc_cls_act_ops = {
> .convert_ctx_access = bpf_net_convert_ctx_access,
>  };
>
> +static const struct bpf_verifier_ops seccomp_ops = {
> +   .get_func_proto = seccomp_func_proto,
> +   .is_valid_access = seccomp_is_valid_access,
> +};
> +
>  static struct bpf_prog_type_list sk_filter_type __read_mostly = {
> .ops = &sk_filter_ops,
> .type = BPF_PROG_TYPE_SOCKET_FILTER,
> @@ -1810,11 +1835,17 @@ static struct bpf_prog_type_list sched_act_type 
> __read_mostly = {
> .type = BPF_PROG_TYPE_SCHED_ACT,
>  };
>
> +static struct bpf_prog_type_list seccomp_type __read_mostly = {
> +   .ops = &seccomp_ops,
> +   .type = BPF_PROG_TYPE_SECCOMP,
> +};
> +
>  static int __init register_sk_filter_ops(void)
>  {
> bpf_register_prog_type(&sk_filter_type);
> bpf_register_prog_type(&sched_cls_type);
> bpf_register_prog_type(&sched_act_type);
> +   bpf_register_prog_type(&seccomp_type);
>
> return 0;
>  }
> --
> 2.1.4
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-api" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 4/5] seccomp: add a way to access filters via bpf fds

2015-09-11 Thread Michael Kerrisk (man-pages)
mp;bpf_prog_fops, prog, O_RDWR | 
> O_CLOEXEC);
> +   err = bpf_new_fd(prog, O_RDWR | O_CLOEXEC);
> if (err < 0)
> /* failed to allocate fd */
> goto free_used_maps;
> diff --git a/kernel/ptrace.c b/kernel/ptrace.c
> index c8e0e05..a151c35 100644
> --- a/kernel/ptrace.c
> +++ b/kernel/ptrace.c
> @@ -1003,6 +1003,13 @@ int ptrace_request(struct task_struct *child, long request,
> break;
> }
>  #endif
> +
> +   case PTRACE_SECCOMP_GET_FILTER_FD:
> +   return seccomp_get_filter_fd(child);
> +
> +   case PTRACE_SECCOMP_NEXT_FILTER:
> +   return seccomp_next_filter(child, data);
> +
> default:
> break;
> }
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index afaeddf..1856f69 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -26,6 +26,8 @@
>  #endif
>
>  #ifdef CONFIG_SECCOMP_FILTER
> +#include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -807,6 +809,61 @@ static inline long seccomp_set_mode_filter(unsigned int flags,
>  }
>  #endif
>
> +#if defined(CONFIG_SECCOMP_FILTER) && defined(CONFIG_CHECKPOINT_RESTORE)
> +long seccomp_get_filter_fd(struct task_struct *child)
> +{
> +   long fd;
> +   struct seccomp_filter *filter;
> +
> +   if (!capable(CAP_SYS_ADMIN))
> +   return -EACCES;
> +
> +   if (child->seccomp.mode != SECCOMP_MODE_FILTER)
> +   return -EINVAL;
> +
> +   filter = child->seccomp.filter;
> +
> +   fd = bpf_new_fd(filter->prog, O_RDONLY);
> +   if (fd > 0)
> +   atomic_inc(&filter->prog->aux->refcnt);
> +
> +   return fd;
> +}
> +
> +long seccomp_next_filter(struct task_struct *child, u32 fd)
> +{
> +   struct seccomp_filter *cur;
> +   struct bpf_prog *prog;
> +   long ret = -ESRCH;
> +
> +   if (!capable(CAP_SYS_ADMIN))
> +   return -EACCES;
> +
> +   if (child->seccomp.mode != SECCOMP_MODE_FILTER)
> +   return -EINVAL;
> +
> +   prog = bpf_prog_get(fd);
> +   if (IS_ERR(prog)) {
> +   ret = PTR_ERR(prog);
> +   goto out;
> +   }
> +
> +   for (cur = child->seccomp.filter; cur; cur = cur->prev) {
> +   if (cur->prog == prog) {
> +   if (!cur->prev)
> +   ret = -ENOENT;
> +   else
> +   ret = bpf_prog_set(fd, cur->prev->prog);
> +   break;
> +   }
> +   }
> +
> +out:
> +   bpf_prog_put(prog);
> +   return ret;
> +}
> +#endif
> +
>  /* Common entry point for both prctl and syscall. */
>  static long do_seccomp(unsigned int op, unsigned int flags,
>const char __user *uargs)
> --
> 2.1.4
>



-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: [PATCH 5/6] seccomp: add a way to attach a filter via eBPF fd

2015-09-05 Thread Michael Kerrisk (man-pages)
 placing 'fd' inside a struct avoids unpleasant
implication that would be made by passing a pointer to an fd as the
third argument.

Cheers,

Michael


> -Kees
> 
>> +   struct seccomp_filter *ret;
>> +   struct bpf_prog *prog;
>> +
>> +   prog = bpf_prog_get(fd);
>> +   if (IS_ERR(prog))
>> +   return (struct seccomp_filter *) prog;
>> +
>> +   if (prog->type != BPF_PROG_TYPE_SECCOMP) {
>> +   bpf_prog_put(prog);
>> +   return ERR_PTR(-EINVAL);
>> +   }
>> +
>> +   ret = kzalloc(sizeof(*ret), GFP_KERNEL | __GFP_NOWARN);
>> +   if (!ret) {
>> +   bpf_prog_put(prog);
>> +   return ERR_PTR(-ENOMEM);
>> +   }
>> +
>> +   ret->prog = prog;
>> +   atomic_set(&ret->usage, 1);
>> +
>> +   /* Intentionally don't bpf_prog_put() here, because the underlying prog
>> +* is refcounted too and we're holding a reference from the struct
>> +* seccomp_filter object.
>> +*/
>> +
>> +   return ret;
>> +}
>> +#else
>> +static struct seccomp_filter *seccomp_prepare_ebpf(const char __user *filter)
>> +{
>> +   return ERR_PTR(-EINVAL);
>> +}
>> +#endif
>>  #endif /* CONFIG_SECCOMP_FILTER */
>>
>>  /*
>> @@ -775,8 +806,23 @@ static long seccomp_set_mode_filter(unsigned int flags,
>> if (flags & ~SECCOMP_FILTER_FLAG_MASK)
>> return -EINVAL;
>>
>> +   /*
>> +* Installing a seccomp filter requires that the task has
>> +* CAP_SYS_ADMIN in its namespace or be running with no_new_privs.
>> +* This avoids scenarios where unprivileged tasks can affect the
>> +* behavior of privileged children.
>> +*/
>> +   if (!task_no_new_privs(current) &&
>> +   security_capable_noaudit(current_cred(), current_user_ns(),
>> +CAP_SYS_ADMIN) != 0)
>> +   return -EACCES;
>> +
>> /* Prepare the new filter before holding any locks. */
>> -   prepared = seccomp_prepare_user_filter(filter);
>> +   if (flags & SECCOMP_FILTER_FLAG_EBPF)
>> +   prepared = seccomp_prepare_ebpf(filter);
>> +   else
>> +   prepared = seccomp_prepare_user_filter(filter);
>> +
>> if (IS_ERR(prepared))
>> return PTR_ERR(prepared);
>>
>> --
>> 2.1.4
>>
> 
> 
> 


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: [patch] add tcp congestion control relevant parts

2008-01-02 Thread Michael Kerrisk


Stephen Hemminger wrote:
> On Fri, 14 Dec 2007 09:48:32 +0100
> Michael Kerrisk <[EMAIL PROTECTED]> wrote:
> 
>> Hello Linux networking folk,
>>
>> I received the patch below for the tcp.7 man page.  Would anybody here be
>> prepared to review the new material / double check the details?
>>
>> Cheers,
>>
>> Michael
>>
>>  Original Message 
>> Subject: [patch] add tcp congestion control relevant parts
>> Date: Wed, 12 Dec 2007 16:40:23 +0100
>> From: Thomas Egerer <[EMAIL PROTECTED]>
>> To: [EMAIL PROTECTED]
>> CC: [EMAIL PROTECTED]
>>
>> Hello *,
>>
>> man-pages version : 2.70 from http://www.kernel.org/pub/linux/docs/man-pages/
>> All required information were obtained by reading the kernel
>> code/documentation.
>> I'm not sure, whether it is completely bullet proof on when the sysctl
>> variables/socket option first appeared in the kernel, so you might as well
>> drop this information, but I'm pretty sure about how it works.
>> Here we go with my patch:
>>
>> diff -ru man-pages-2.70/man7/tcp.7 man-pages-2.70.new/man7/tcp.7
>> --- man-pages-2.70/man7/tcp.7   2007-11-24 14:33:34.0 +0100
>> +++ man-pages-2.70.new/man7/tcp.7   2007-12-12 16:34:52.0 +0100
>> @@ -177,8 +177,6 @@
>>  .\" FIXME As at Sept 2006, kernel 2.6.18-rc5, the following are
>>  .\"not yet documented (shown with default values):
>>  .\"
>> -.\" /proc/sys/net/ipv4/tcp_congestion_control (since 2.6.13)
>> -.\" bic
>>  .\" /proc/sys/net/ipv4/tcp_moderate_rcvbuf
>>  .\" 1
>>  .\" /proc/sys/net/ipv4/tcp_no_metrics_save
>> @@ -224,6 +222,20 @@
>>  are reserved for the application buffer.
>>  A value of 0
>>  implies that no amount is reserved.
>> +.TP
>> +.BR tcp_allowed_congestion_control \
>> +" (String; default: cubic reno) (since 2.6.13) "
>> +Show/set the congestion control choices available to non-privileged
>> +processes. The list is a subset of those listed in
>> +.IR tcp_available_congestion_control "."
>> +Default is "cubic reno" and the default setting
>> +.RI ( tcp_congestion_control ).
>> +.TP
>> +.BR tcp_available_congestion_control \
>> +" (String; default: cubic reno) (since 2.6.13) "
>> +Lists the TCP congestion control algorithms available on the system. This value
>> +can only be changed by loading/unloading modules responsible for congestion
>> +control.
>>  .\"
>>  .\" The following is from 2.6.12: Documentation/networking/ip-sysctl.txt
>>  .TP
>> @@ -257,6 +269,17 @@
>>  Allows two flows sharing the same connection to converge
>>  more rapidly.
>>  .TP
>> +.BR tcp_congestion_control " (String; default: cubic reno) (since 2.6.13) "
>> +Determines the congestion control algorithm used for newly created TCP
>> +sockets. By default Linux uses cubic with reno as fallback. If you want
>> +to have more control over the algorithm used, you must enable the symbol
>> +CONFIG_TCP_CONG_ADVANCED in your kernel config.
> 
> You can choose the default congestion control as well as part of the kernel
> configuration.

Hi Stephen,

Other than this, did the doc patch look okay?  (I'm not sure whether there
was an implied ACK in your message for the rest of the patch.)

Cheers,

Michael

-- 
Michael Kerrisk
Maintainer of the Linux man-pages project
http://www.kernel.org/doc/man-pages/
Want to report a man-pages bug?  Look here:
http://www.kernel.org/doc/man-pages/reporting_bugs.html




Re: [patch] add tcp congestion control relevant parts

2007-12-14 Thread Michael Kerrisk
Hello Linux networking folk,

I received the patch below for the tcp.7 man page.  Would anybody here be
prepared to review the new material / double check the details?

Cheers,

Michael

 Original Message 
Subject: [patch] add tcp congestion control relevant parts
Date: Wed, 12 Dec 2007 16:40:23 +0100
From: Thomas Egerer <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
CC: [EMAIL PROTECTED]

Hello *,

man-pages version : 2.70 from http://www.kernel.org/pub/linux/docs/man-pages/
All required information were obtained by reading the kernel
code/documentation.
I'm not sure, whether it is completely bullet proof on when the sysctl
variables/socket option first appeared in the kernel, so you might as well
drop this information, but I'm pretty sure about how it works.
Here we go with my patch:

diff -ru man-pages-2.70/man7/tcp.7 man-pages-2.70.new/man7/tcp.7
--- man-pages-2.70/man7/tcp.7   2007-11-24 14:33:34.0 +0100
+++ man-pages-2.70.new/man7/tcp.7   2007-12-12 16:34:52.0 +0100
@@ -177,8 +177,6 @@
 .\" FIXME As at Sept 2006, kernel 2.6.18-rc5, the following are
 .\"not yet documented (shown with default values):
 .\"
-.\" /proc/sys/net/ipv4/tcp_congestion_control (since 2.6.13)
-.\" bic
 .\" /proc/sys/net/ipv4/tcp_moderate_rcvbuf
 .\" 1
 .\" /proc/sys/net/ipv4/tcp_no_metrics_save
@@ -224,6 +222,20 @@
 are reserved for the application buffer.
 A value of 0
 implies that no amount is reserved.
+.TP
+.BR tcp_allowed_congestion_control \
+" (String; default: cubic reno) (since 2.6.13) "
+Show/set the congestion control choices available to non-privileged
+processes. The list is a subset of those listed in
+.IR tcp_available_congestion_control "."
+Default is "cubic reno" and the default setting
+.RI ( tcp_congestion_control ).
+.TP
+.BR tcp_available_congestion_control \
+" (String; default: cubic reno) (since 2.6.13) "
+Lists the TCP congestion control algorithms available on the system. This value
+can only be changed by loading/unloading modules responsible for congestion
+control.
 .\"
 .\" The following is from 2.6.12: Documentation/networking/ip-sysctl.txt
 .TP
@@ -257,6 +269,17 @@
 Allows two flows sharing the same connection to converge
 more rapidly.
 .TP
+.BR tcp_congestion_control " (String; default: cubic reno) (since 2.6.13) "
+Determines the congestion control algorithm used for newly created TCP
+sockets. By default Linux uses cubic with reno as fallback. If you want
+to have more control over the algorithm used, you must enable the symbol
+CONFIG_TCP_CONG_ADVANCED in your kernel config.
+You can use
+.BR setsockopt (2)
+to individually change the algorithm on a single socket.
+Requires CAP_NET_ADMIN or congestion algorithm to be listed in
+.IR tcp_allowed_congestion_control "."
+.TP
 .BR tcp_dsack " (Boolean; default: enabled)"
 Enable RFC\ 2883 TCP Duplicate SACK support.
 .TP
@@ -649,7 +672,21 @@
 socket options are valid on TCP sockets.
 For more information see
 .BR ip (7).
-.\" FIXME Document TCP_CONGESTION (new in 2.6.13)
+.TP
+.BR TCP_CONGESTION " (new since kernel version 2.6.13)"
+If set to the name of an available congestion control algorithm,
+it will henceforth be used for the socket. To get a list of
+available congestion control algorithms, consult the sysctl variable
+.IR net.ipv4.tcp_available_congestion_control "."
+The algorithm that is used by default for all newly created
+TCP sockets can be viewed/changed via the sysctl variable
+.IR net.ipv4.tcp_congestion_control "."
+If you feel, you are missing an algorithm in the list,
+you may try to load the corresponding module using
+.BR modprobe (8),
+or if your kernel is built with module autoloading support
+.RI ( CONFIG_KMOD )
+and the algorithm has been compiled as a module, it will be autoloaded.
 .TP
 .B TCP_CORK
 If set, don't send out partial frames.


-- 
Michael Kerrisk
Maintainer of the Linux man-pages project
http://www.kernel.org/doc/man-pages/
Want to report a man-pages bug?  Look here:
http://www.kernel.org/doc/man-pages/reporting_bugs.html



Re: Undocumented IPv6 options

2007-10-15 Thread Michael Kerrisk
> It really looks like time for major overhaul of that (and related) man-pages
> is needed...

Yes.  Andi Kleen did a good job of putting some pages together in
the 2.2 timeframe, but no-one else carried on the work since then,
and there is much that should be updated in the *.7 networking
pages.

Cheers,

Michael


Undocumented IPv6 options

2007-10-14 Thread Michael Kerrisk
Hello netdev,

Andrew McDonald kindly fixed the description of IPV6_ROUTER_ALERT in the
ipv6.7 man page.  As long as we're on the topic, I'll point out that the
following IPV6 options (and possibly others) are still not documented on
that page:

IPV6_CHECKSUM
IPV6_JOIN_ANYCAST
IPV6_LEAVE_ANYCAST
IPV6_V6ONLY
IPV6_RECVPKTINFO
IPV6_2292PKTINFO

Can anyone help with documenting any of these please?

Cheers,

Michael
-- 
Michael Kerrisk
maintainer of Linux man pages Sections 2, 3, 4, 5, and 7

Want to help with man page maintenance?  Grab the latest tarball at
http://www.kernel.org/pub/linux/docs/manpages/
read the HOWTOHELP file and grep the source files for 'FIXME'.




Re: [patch] ipv6.7: IPV6_ROUTER_ALERT sockopt correction

2007-10-14 Thread Michael Kerrisk
Hello Andrew,

> I discovered that the current description of the IPV6_ROUTER_ALERT
> sockopt in ipv6.7 is significantly wrong. A patch to fix the
> description is below. I sent a version of this earlier in the year to
> [EMAIL PROTECTED], but nothing happened with it at the time.

Hmmm -- somehow that message got dropped.  I found it in my trash -- sorry
about that.

I've applied your patch for 2.68.

Thanks,

Michael

> The correction is based on reading the relevant parts of the kernel
> source code, and backed up by some test programs. The main bits of code
> in the kernel (in case someone wants to double-check my update) are
> net/ipv6/ipv6_sockglue.c:ip6_ra_control() and
> net/ipv6/ip6_output.c:ip6_call_ra_chain().
> 
> The patch is against man-pages-2.66.

> --- man7/ipv6.7.orig  2007-10-14 11:59:46.0 +0100
> +++ man7/ipv6.7   2007-10-14 12:05:15.0 +0100
> @@ -233,10 +233,17 @@
>  Argument is a pointer to boolean.
>  .TP
>  .B IPV6_ROUTER_ALERT
> -Pass all forwarded packets containing an router alert option to
> +Pass forwarded packets containing a router alert hop-by-hop option to
>  this socket.
> -Only allowed for datagram sockets and for root.
> -Argument is a pointer to boolean.
> +Only allowed for SOCK_RAW sockets.
> +The tapped packets are not forwarded by the kernel, it is the
> +user's responsibility to send them out again.
> +Argument is a pointer to an integer.
> +A positive integer indicates a router alert option value to intercept.
> +Packets carrying a router alert option with a value field containing
> +this integer will be delivered to the socket.
> +A negative integer disables delivery of packets with router alert options
> +to this socket.
>  .TP
>  .B IPV6_UNICAST_HOPS
>  Set the unicast hop limit for the socket.
> 

-- 
Michael Kerrisk
maintainer of Linux man pages Sections 2, 3, 4, 5, and 7

Want to help with man page maintenance?  Grab the latest tarball at
http://www.kernel.org/pub/linux/docs/manpages/
read the HOWTOHELP file and grep the source files for 'FIXME'.



Re: [RFC] Zero-length write() does not generate a datagram on connected socket

2007-09-28 Thread Michael Kerrisk
On 9/28/07, Stephen Hemminger <[EMAIL PROTECTED]> wrote:
> On Thu, 27 Sep 2007 13:53:34 -0700 (PDT)
> David Miller <[EMAIL PROTECTED]> wrote:
>
> > From: Stephen Hemminger <[EMAIL PROTECTED]>
> > Date: Mon, 24 Sep 2007 15:34:35 -0700
> >
> > > The bug http://bugzilla.kernel.org/show_bug.cgi?id=5731
> > > describes an issue where write() can't be used to generate a zero-length
> > > datagram (but send, and sendto do work).
> > >
> > > I think the following is needed:
> > >
> > > --- a/net/socket.c  2007-08-20 09:54:28.0 -0700
> > > +++ b/net/socket.c  2007-09-24 15:31:25.0 -0700
> > > @@ -777,8 +777,11 @@ static ssize_t sock_aio_write(struct kio
> > > if (pos != 0)
> > > return -ESPIPE;
> > >
> > > -   if (iocb->ki_left == 0) /* Match SYS5 behaviour */
> > > -   return 0;
> > > +   if (unlikely(iocb->ki_left == 0)) {
> > > +   struct socket *sock = iocb->ki_filp->private_data;
> > > +   if (sock->type == SOCK_STREAM)
> > > +   return 0;
> > > +   }
> > >
> > > x = alloc_sock_iocb(iocb, &siocb);
> > > if (!x)
> >
> > We should simply remove the check completely.
> >
> > There is no need to add special code for different types of protocols
> > and sockets.
> >
> > As is hinted in the bugzilla, the exact same thing can happen with a
> > suitably constructed sendto() or sendmsg() call.  write() on a socket
> > is a sendmsg() with a NULL msg_control and a single entry iovec, plain
> > and simple.
> >
> > It's how BSD and many other systems behave, and I double checked
> > Steven's Volume 2 just to make sure.
> >
> > So I'm going to check in the following to fix this bugzilla.  There is
> > a similarly ugly test for len==0 in sys_read() on sockets.  If someone
> > would do some research on the validity of that thing I'd really
> > appreciate it :-)
>
> Read of zero length should be a no-op for SOCK_STREAM but
> for SOCK_DATAGRAM or SOCK_SEQPACKET it might be useful as a
> remote wait for event.

Hmm -- I hadn't checked the behavior for zero-length read() on other
systems.  I will try to do that soonish (probably only Monday or so).

Cheers,

Michael


Re: Problem with semantics?

2007-08-27 Thread Michael Kerrisk
Hi Andi,

Andi Kleen wrote:
> Shay Goikhman <[EMAIL PROTECTED]> writes:
> 
>> Dear Linux maintainers,
>>
>>  I'm doing :
>>
>>   setsockopt(s,  SO_RCVTIMEO, t1 );  // set time-out t1 on socket while block receiving on it
>>   select(,,, &fd_set_including(s), .., &errs, t2);  // block till receive or time-out t2 jointly on a set of sockets
>>
>> Apparently, I could no find reference on the coupled behavior of the two
>> above statements in Linux documentation.
>> As I understand the blocking semantics, I would expect that if t1 < t2,
>> select should return after t1 with the descriptor 's' in 'errs' if 's' does
>> not become readable in the t1 interval.
>>
>> It is not so in life -- select ignores t1 altogether.
>>
>> Do you have some enlightening knowledge on the matter?
> 
> RCVTIMEO only applies to recvmsg et.al., similar to SNDTIMEO only
> apply to sendmsg etc. But select/poll only report events, they
> do not actually send or receive by themselves.
> 
> Michael, perhaps you can clarify that in the manpages

I added the following to socket.7:

  Timeouts have effect for socket I/O calls (read(2), recv(2),
  recvfrom(2), recvmsg(2), write(2), send(2), sendto(2),
  sendmsg(2)); timeouts have no effect for select(2), poll(2),
  epoll_wait(2), etc.

The change will be in man-pages-2.65.

Thanks for your note.

Cheers,

Michael

-- 
Michael Kerrisk
maintainer of Linux man pages Sections 2, 3, 4, 5, and 7

Want to help with man page maintenance?  Grab the latest tarball at
http://www.kernel.org/pub/linux/docs/manpages/
read the HOWTOHELP file and grep the source files for 'FIXME'.


man-pages-2.45 and man-pages-2.46 are released

2007-04-29 Thread Michael Kerrisk
 FIXME Document the conf/*/* sysctls
 FIXME Document the route/* sysctls
 FIXME document them all

 FIXME Add a discussion of multicasting

==
man7/ipv6.7
 FIXME IPV6_CHECKSUM is not documented, and probably should be
 FIXME IPV6_JOIN_ANYCAST is not documented, and probably should be
 FIXME IPV6_LEAVE_ANYCAST is not documented, and probably should be
 FIXME IPV6_V6ONLY is not documented, and probably should be
 FIXME IPV6_RECVPKTINFO is not documented, and probably should be
 FIXME IPV6_2292PKTINFO is not documented, and probably should be
 FIXME there are probably many other IPV6_* socket options that
 should be documented

==
man7/netlink.7
 FIXME More details on NETLINK_INET_DIAG needed.

 FIXME More details on NETLINK_XFRM needed.

 FIXME More details on NETLINK_ISCSI needed.

 FIXME More details on NETLINK_AUDIT needed.

 FIXME More details on NETLINK_FIB_LOOKUP needed.

 FIXME More details on NETLINK_NETFILTER needed.

 FIXME More details on NETLINK_KOBJECT_UEVENT needed.

 FIXME NLM_F_ATOMIC is not used any more?

 FIXME Explain more about nlmsg_seq and nlmsg_pid.


==
man7/udp.7
 FIXME document UDP_ENCAP (new in kernel 2.5.67)


-- 
Michael Kerrisk
maintainer of Linux man pages Sections 2, 3, 4, 5, and 7

Want to help with man page maintenance?  Grab the latest tarball at
http://www.kernel.org/pub/linux/docs/manpages/
read the HOWTOHELP file and grep the source files for 'FIXME'.


Re: [PATCH] ip(7) IP_PMTUDISC_PROBE

2007-04-08 Thread Michael Kerrisk
> > Document new IP_PMTUDISC_PROBE value for IP_MTU_DISCOVERY.  (Going into
> > 2.6.22).

Hi John,

Thanks -- accepted -- fix will appear in man-pages-2.47.

Andi: thanks for pointing John in the right direction.

Cheers,

Michael


> > 
> >
> > diff -rU3 man-pages-2.43-a/man7/ip.7 man-pages-2.43-b/man7/ip.7
> > --- man-pages-2.43-a/man7/ip.7  2006-09-26 09:54:29.0 -0400
> > +++ man-pages-2.43-b/man7/ip.7  2007-03-27 15:46:18.0 -0400

-- 
Michael Kerrisk
maintainer of Linux man pages Sections 2, 3, 4, 5, and 7
Want to help with man page maintenance?
Grab the latest tarball at http://www.kernel.org/pub/linux/docs/manpages/
read the HOWTOHELP file and grep the source files for 'FIXME'.


Re: IP RECVTTL

2005-08-22 Thread Michael Kerrisk
Gidday,

Thanks for this patch.  I have a few questions:

> Andi Kleen wrote:
> > The man page was supposed to document the kernel, so it's probably
> > a bug in the manpage.  You should send a patch to the manpages
> > maintainers, with a warning in NOTES that the Linux behaviour
> > differs from other OS.
> 
> OK. Attached patch fixes this and adds comment to the NOTES. Also comment
> about SOL_IP portability added to the NOTES and duplicate IP_PKTINFO
> removed in the VERSIONS section.

[patch inlined...]

> -.I IP_RECVTTL
> +.I IP_TTL

So is it the case that this option was just wrongly named in the 
original page, or is the change here reflective of something that 
has changed in the kernel?  (It doesn't look like the latter is 
true, but I thought it better to check.)

>  control message with the time to live
>  field of the received packet as a byte. Not supported for
>  .B SOCK_STREAM
> @@ -789,6 +789,20 @@ received datagrams. Linux has the more g
>  .I IP_PKTINFO
>  for the same task.
>  .PP
> +Some BSD sockets implementations also provide
> +.I IP_RECVTTL
> +option, but ancillary message with type
> +.I IP_RECVTTL
> +is passed with incoming packet. It's different from
> +.I IP_TTL
> +used in Linux.

From reading the sources, Linux appears to have both 
IP_RECVTTL and IP_TTL.  So, does there not also need
to be some documentation of the "real" IP_RECVTTL?

> +.PP
> +Using
> +.I SOL_IP
> +socket options level isn't portable, BSD-based stacks use
> +.I IPPROTO_IP
> +level.

Recently (not yet published), I went through ip(7), tcp(7), udp(7) 
etc, and changed SOL_IP to IPPROTO_IP, SOL_TCP to IPPROTO_TCP, 
etc, on the basis that

-- the IPPROTO_* constants are what appear in POSIX, and
-- glibc defines the IPPROTO_* constants with the same values
   as the corresponding SOL_* constants.

Does anyone see a problem with this change in the docs?

Cheers,

Michael

-- 
Michael Kerrisk
maintainer of Linux man pages Sections 2, 3, 4, 5, and 7 

Want to help with man page maintenance?  Grab the latest
tarball at ftp://ftp.win.tue.nl/pub/linux-local/manpages/
and grep the source files for 'FIXME'.


Re: IP RECVTTL

2005-08-22 Thread Michael Kerrisk
> > >  control message with the time to live
> > >  field of the received packet as a byte. Not supported for
> > >  .B SOCK_STREAM
> > > @@ -789,6 +789,20 @@ received datagrams. Linux has the more g
> > >  .I IP_PKTINFO
> > >  for the same task.
> > >  .PP
> > > +Some BSD sockets implementations also provide
> > > +.I IP_RECVTTL
> > > +option, but ancillary message with type
> > > +.I IP_RECVTTL
> > > +is passed with incoming packet. It's different from
> > > +.I IP_TTL
> > > +used in Linux.
> > 
> > From reading the sources, Linux appears to have both 
> > IP_RECVTTL and IP_TTL.  So, does there not also need
> > to be some documentation of the "real" IP_RECVTTL?
> 
> You seem to be confused ;). In short:

Try "ignorant and in a hurry" ;-) (a holiday looms).
Thanks for the clarification.

> > > +.PP
> > > +Using
> > > +.I SOL_IP
> > > +socket options level isn't portable, BSD-based stacks use
> > > +.I IPPROTO_IP
> > > +level.
> > 
> > Recently (not yet published), I went though ip(7), tcp(7), udp(7) 
> > etc, and changed SOL_IP to IPPROTO_IP, SOL_TCP to IPPROTO_TCP, 
> > etc, on the basis that
> > 
> > -- the IPPROTO_* constants are what appear in POSIX, and
> 
> That's good enough reason, IMHO.

Thanks.  I'd be interested in input from others also, just
to see if there is some consensus.

Cheers,

Michael

-- 
Michael Kerrisk
maintainer of Linux man pages Sections 2, 3, 4, 5, and 7 

Want to help with man page maintenance?  Grab the latest
tarball at ftp://ftp.win.tue.nl/pub/linux-local/manpages/
and grep the source files for 'FIXME'.


Re: linux networking manpages

2005-07-25 Thread Michael Kerrisk
> > Since we're introducing some changes to the network stack
> > (SO_RCVBUFFORCE, ...), I would like to update the respective man pages.
> > 
> > I think Andi wrote most of the current ones, but said they're
> > unmaintained by now.  If this is still the case, can someone please 
> > send me the latest version?  
> 
> Actually they're kind of maintained in the manpages package on kernel.org

Yes.

> The maintainer is Michael Kerrisk <[EMAIL PROTECTED]>

Yes.

> ftp://ftp.kernel.org/pub/linux/docs/manpages/
> 
> Just I didn't think he or anybody did a systematic effort to update

Correct.  I have adjusted some details here and there, as I come
across new stuff.  But I am not intimate enough with the Linux
TCP/IP stack to do a thoroughgoing update (nor do I have the amount
of time that that would require -- the network pages probably
amount to about 1% of the total that I try to maintain).

> the network specific pages to 2.6, so a lot of stuff is missing and some
> of the notes are quite obsolete now. There are probably too many in the
> package for Michael to keep them all uptodate. 

Exactly.

> Basically somebody needs to go through
> the network code and make sure all the ioctls/socket options/cmsgs etc.
> are documented when they are stable enough.  Also some of the manpages I
> originally never completely finished 
> (like netlink or ipv6). These probably could take a overhaul or rewrite.
> Also some stuff like netfilter was always undocumented.  I would only
> bother implementing it if the interface is fairly stable though.
>
> If you do any improvements please send a patch to Michael.

Yes please.

Always work from the latest versions of the pages (available at the
location above) when sending updates to me (the networking pages 
have slowly changed since Andi wrote them).

Cheers,

Michael

-- 
Michael Kerrisk
maintainer of Linux man pages Sections 2, 3, 4, 5, and 7 

Want to help with man page maintenance?  Grab the latest
tarball at ftp://ftp.win.tue.nl/pub/linux-local/manpages/
and grep the source files for 'FIXME'.
