Solving ENOBUFS returned by write()

Michal Sojka Fri, 30 Sep 2011 05:38:23 -0700

Dear SocketCAN developers,

recently, we worked with Oliver on an evaluation of queuing disciplines
for use in CAN networks. We will publish our results soon, but before
that I would like to discuss with you one issue that has annoyed me for
a long time (and according to the mailing list, I'm not alone). Now we
may have a solution for it - see the description and the patch below.


The issue is that the write()/send() syscalls return ENOBUFS under
certain conditions. The easiest way to reproduce the problem is to
send some frames to an unconnected CAN interface (i.e. no frame can
leave the box, all stay queued somewhere). After attempting to send
10+(number of HW TX buffers) frames, the application (e.g. cangen)
gets ENOBUFS. cangen tries to overcome the problem by calling poll()
in this case, but this doesn't work. poll() never blocks and cangen
ends up busy waiting. Just run cangen .... -i -p 1000 and top. cangen
will eat 100% CPU.
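
The reproduction described above can be sketched as a small helper
that writes frames to a CAN_RAW socket until the first error. This is
only an illustration, not taken from cangen: the function name
count_frames_until_error(), the CAN ID 0x123 and the frame length are
arbitrary choices, and the interface name is supplied by the caller.

```c
#include <errno.h>
#include <net/if.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>
#include <linux/can.h>
#include <linux/can/raw.h>

/*
 * Hypothetical helper: write up to max_frames frames to ifname.
 * Returns the number of frames accepted before the first failing
 * write(), or -1 if the socket could not be set up. The errno of the
 * failing write() (0 if none failed) is stored in *err_out.
 */
int count_frames_until_error(const char *ifname, int max_frames, int *err_out)
{
    struct sockaddr_can addr;
    struct can_frame frame;
    unsigned int ifindex = if_nametoindex(ifname);
    int fd, sent;

    *err_out = 0;
    if (ifindex == 0)
        return -1;                      /* no such interface */

    fd = socket(PF_CAN, SOCK_RAW, CAN_RAW);
    if (fd < 0)
        return -1;

    memset(&addr, 0, sizeof(addr));
    addr.can_family = AF_CAN;
    addr.can_ifindex = (int)ifindex;
    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        close(fd);
        return -1;
    }

    memset(&frame, 0, sizeof(frame));
    frame.can_id = 0x123;               /* arbitrary ID for illustration */
    frame.can_dlc = 8;

    for (sent = 0; sent < max_frames; sent++) {
        if (write(fd, &frame, sizeof(frame)) != (ssize_t)sizeof(frame)) {
            *err_out = errno;           /* ENOBUFS in the case discussed */
            break;
        }
    }
    close(fd);
    return sent;
}
```

Called for an interface that is up but has nothing attached to the
bus, this would be expected to return roughly 10+(number of HW TX
buffers) with *err_out set to ENOBUFS; the exact count depends on the
driver.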

Such a problem usually doesn't appear in Ethernet networks.
Applications block automatically when there is a lack of resources for
sending. We dug into the source code to find out what is different in
CAN compared to Ethernet, and here is what we found.

So why is ENOBUFS typically returned? In the default configuration,
CAN interfaces have the pfifo_fast queuing discipline attached.
Therefore, dev_queue_xmit() calls pfifo_fast_enqueue(), which checks
dev->tx_queue_len (which is 10 for CAN devices by default). If the
number of queued frames has reached this limit, it drops the frame and
returns NET_XMIT_DROP. Then, can_send() calls net_xmit_errno(), which
translates NET_XMIT_DROP into -ENOBUFS, which is then returned to the
application.
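
The path just described can be modelled in a few lines of user-space
C. This is only a sketch: the NET_XMIT_* values and the
net_xmit_errno() macro are reproduced as they appear in the kernel's
include/linux/netdevice.h (worth checking against your tree), while
struct model_qdisc, model_enqueue() and model_can_send() are
hypothetical stand-ins for the qdisc, pfifo_fast_enqueue() and
can_send(), not kernel code.

```c
#include <errno.h>

/* Values as in include/linux/netdevice.h (assumed unchanged). */
#define NET_XMIT_SUCCESS 0x00
#define NET_XMIT_DROP    0x01
#define NET_XMIT_CN      0x02
#define net_xmit_errno(e) ((e) != NET_XMIT_CN ? -ENOBUFS : 0)

/* Hypothetical model of a tail-drop FIFO qdisc. */
struct model_qdisc {
    unsigned int qlen;          /* frames currently queued */
    unsigned int tx_queue_len;  /* limit, 10 for CAN by default */
};

/* Stand-in for pfifo_fast_enqueue(): queue or drop. */
int model_enqueue(struct model_qdisc *q)
{
    if (q->qlen < q->tx_queue_len) {
        q->qlen++;
        return NET_XMIT_SUCCESS;
    }
    return NET_XMIT_DROP;
}

/* Stand-in for can_send(): positive qdisc codes become an errno. */
int model_can_send(struct model_qdisc *q)
{
    int err = model_enqueue(q);

    if (err > 0)
        err = net_xmit_errno(err);
    return err;
}
```

With tx_queue_len set to 10 and nothing draining the queue, the first
ten calls to model_can_send() succeed and the eleventh returns
-ENOBUFS.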

The difference in Ethernet networks is that the default queue size is
1000, and the reason why this limit is not reached is that there is
another limit, which is lower and causes the application to block.
This limit is the SO_SNDBUF socket option.

In the case of CAN_RAW sockets, this limit is checked in
sock_alloc_send_skb(), roughly like this (simplified):

    if (sk->sk_wmem_alloc < sk->sk_sndbuf)
        alloc_skb();
    else
        sock_wait_for_wmem();  /* i.e. block */

sk->sk_wmem_alloc is increased by skb->truesize whenever the
application creates an skb belonging to the socket (i.e. on write) and
decreased by the same amount whenever the skb is passed to the driver.
The value of skb->truesize is sizeof(can_frame) + sizeof(skb), which
is 200 in my case (PowerPC).
 
The default value of sk->sk_sndbuf is 108544, which means that for
CAN, this limit is reached (and the application blocks) only when it
has about 542 CAN frames waiting to be sent to the driver. This is of
course far more than the 10 allowed by dev->tx_queue_len, so the qdisc
limit is hit first and the application gets ENOBUFS instead of
blocking.
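
The arithmetic can be checked with a small simulation of the
accounting described above. It is a sketch under the stated
assumptions only (a fixed truesize of 200 and the simplified
"wmem_alloc < sndbuf" check); frames_before_block() is a hypothetical
name, not a kernel function.

```c
/*
 * Hypothetical model: count how many skbs of the given truesize can
 * be allocated before the simplified check
 *     if (sk_wmem_alloc < sk_sndbuf) alloc; else block;
 * would make the application block, assuming nothing is ever handed
 * to the driver in the meantime.
 */
unsigned int frames_before_block(unsigned int sndbuf, unsigned int truesize)
{
    unsigned int wmem_alloc = 0;
    unsigned int frames = 0;

    if (truesize == 0)
        return 0;               /* avoid looping forever */

    while (wmem_alloc < sndbuf) {
        wmem_alloc += truesize; /* charged when the skb is created */
        frames++;
    }
    return frames;
}
```

Because the check is a strict "<", this model admits one frame more
than plain integer division suggests: frames_before_block(108544, 200)
gives 543 rather than 108544 / 200 = 542. Either way, the number is
far above the qdisc limit of 10.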

Therefore, we propose applying a patch like this:

diff --git a/drivers/net/can/dev.c b/drivers/net/can/dev.c
index d0f8c7e..4831c53 100644
--- a/drivers/net/can/dev.c
+++ b/drivers/net/can/dev.c
@@ -438,7 +438,7 @@ static void can_setup(struct net_device *dev)
        dev->mtu = sizeof(struct can_frame);
        dev->hard_header_len = 0;
        dev->addr_len = 0;
-       dev->tx_queue_len = 10;
+       dev->tx_queue_len = 22;
 
        /* New-style flags. */
        dev->flags = IFF_NOARP;
diff --git a/net/can/af_can.c b/net/can/af_can.c
index 094fc53..4cf10e7 100644
--- a/net/can/af_can.c
+++ b/net/can/af_can.c
@@ -190,6 +190,8 @@ static int can_create(struct net *net, struct socket *sock, int protocol,
        sock_init_data(sock, sk);
        sk->sk_destruct = can_sock_destruct;
 
+       sk->sk_sndbuf = SOCK_MIN_SNDBUF;
+
        if (sk->sk_prot->init)
                err = sk->sk_prot->init(sk);
 
This sets sk_sndbuf to the minimum possible value, i.e. 2048, which
allows 11 frames to be queued for a socket before the application
blocks. In my case, the driver (mpc5200) seems to utilize 3 TX buffers
and therefore cangen blocks when it tries to send the 15th frame (3
frames are buffered in the driver, 11 in the pfifo_fast qdisc). If the
application does not want to block, it can set the O_NONBLOCK flag on
the socket, and it then receives EAGAIN instead of ENOBUFS.
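
On the application side, the non-blocking case could be handled along
these lines. This is a sketch, not code from cangen: set_nonblocking()
and send_frame_wait() are hypothetical helpers, and it assumes that
with the patch applied poll() on the socket reports POLLOUT once send
buffer space is available again.

```c
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <sys/types.h>
#include <unistd.h>
#include <linux/can.h>

/* Hypothetical helper: add O_NONBLOCK to the descriptor's flags. */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);

    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/*
 * Hypothetical helper: write one frame; on EAGAIN wait up to
 * timeout_ms for the descriptor to become writable and retry.
 * Returns 0 on success, -1 on error (errno is ETIMEDOUT if the
 * wait timed out).
 */
int send_frame_wait(int fd, const struct can_frame *frame, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLOUT };

    for (;;) {
        ssize_t n = write(fd, frame, sizeof(*frame));
        int ready;

        if (n == (ssize_t)sizeof(*frame))
            return 0;
        if (n >= 0) {                   /* short write is unexpected */
            errno = EIO;
            return -1;
        }
        if (errno == EINTR)
            continue;
        if (errno != EAGAIN && errno != EWOULDBLOCK)
            return -1;                  /* e.g. ENOBUFS, ENETDOWN */

        ready = poll(&pfd, 1, timeout_ms);
        if (ready == 0) {
            errno = ETIMEDOUT;
            return -1;
        }
        if (ready < 0 && errno != EINTR)
            return -1;
    }
}
```

ENOBUFS is deliberately not retried here: as described earlier,
polling after ENOBUFS does not block and only leads to busy waiting.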

It is also necessary to slightly increase the default tx_queue_len.
Increasing it to 22 (i.e. 2 x 11 frames) allows using two applications
(or rather, two sockets) without seeing ENOBUFS. The third
application/socket then gets ENOBUFS just for its first write().

The situation described above is not the only way an application can
get ENOBUFS, but I think that in the case of PF_CAN this is the most
common one, and having blocking behavior as provided by this patch
would help the users a lot.

Thoughts?

Best regards,
Michal & Rosta
_______________________________________________
Socketcan-core mailing list
Socketcan-core@lists.berlios.de
https://lists.berlios.de/mailman/listinfo/socketcan-core
