On Wed, Feb 24, 2010 at 02:56:35PM -0800, Ira W. Snyder wrote:
> On Wed, Feb 24, 2010 at 11:18:22PM +0100, Marc Kleine-Budde wrote:
> > Ira W. Snyder wrote:
> > > On Wed, Feb 24, 2010 at 12:52:06PM -0800, Ira W. Snyder wrote:
> > >> On Wed, Feb 24, 2010 at 08:41:06PM +0100, Marc Kleine-Budde wrote:
> > >>> Ira W. Snyder wrote:
> > >>>> On Wed, Feb 24, 2010 at 08:08:15PM +0100, Marc Kleine-Budde wrote:
> > >>>>> Ira W. Snyder wrote:
> > >>>>>
> > >>>>> [...]
> > >>>>>
> > >>>>>>>>> But we want so solve the TX problem. You only have to stop the
> > >>>>>>>>> netif_queue if you TX ring is full. Then you have to deal with
> > >>>>>>>>> calling
> > >>>>>>>>> netif_wake_queue() eventually. But we obviously don't want to
> > >>>>>>>>> poll.
> > >>>>>>>>>
> > >>>>>>>>> The solution might be to look at the TX queue in the interrupt
> > >>>>>>>>> handler
> > >>>>>>>>> and/or the NAPI loop. And now we get back to my first question.
> > >>>>>>>>> If the
> > >>>>>>>>> controller does a loopback in hardware, i.e. you receive a frame
> > >>>>>>>>> for
> > >>>>>>>>> each frame send, you don't need to poll the TX queue length.
> > >>>>>>>>>
> > >>>>>>>>> If you receive a CAN frame it might be on of your's, which means
> > >>>>>>>>> the TX
> > >>>>>>>>> queue might have at least room for one frame. In pseudo code it
> > >>>>>>>>> might
> > >>>>>>>>> look like this:
> > >>>>>>>>>
> > >>>>>>>>> xmit()
> > >>>>>>>>> {
> > >>>>>>>>> send_frame_to_hardware();
> > >>>>>>>>> if (tx_ring_full) {
> > >>>>>>>>> netif_stop_queue();
> > >>>>>>>>> priv->flags |= TX_RING_FULL;
> > >>>>>>>>> }
> > >>>>>>>>> }
> > >>>>>>>>>
> > >>>>>>>>> irq() and netif_poll()
> > >>>>>>>>> {
> > >>>>>>>>> if (priv->flags & TX_RING_FULL) {
> > >>>>>>>>> if (room_in_ring()) {
> > >>>>>>>>> priv->flags &= ~TX_RING_FULL;
> > >>>>>>>>> netif_wake_queue();
> > >>>>>>>>> }
> > >>>>>>>>> }
> > >>>>>>>>> }
> > >>>>>>>>>
> > >>>>>>>>> Should be implemented race-free, i.e. you might have to use
> > >>>>>>>>> atomic_*
> > >>>>>>>>> operations and/or don't use a shared flag variable.
> > >>>>>>>> Cool, that should work. netif_queue_stopped() could be used
> > >>>>>>>> instead of
> > >>>>>>>> the TX_RING_FULL flag.
> > >>>>>>> right...good point
> > >>>>>>>
> > >>>>>>> I should have looked at my own driver :)
> > >>>>>>>
> > >>>>>>> If room_in_ring() is cheap, it boils down to:
> > >>>>>>>
> > >>>>>>>>> irq() and netif_poll()
> > >>>>>>>>> {
> > >>>>>>>>> if (room_in_ring()) {
> > >>>>>>>>> netif_wake_queue();
> > >>>>>>>>> }
> > >>>>>>> Otherwise:
> > >>>>>>>
> > >>>>>>>>> irq() and netif_poll()
> > >>>>>>>>> {
> > >>>>>>>>> if (netif_queue_stopped() && room_in_ring()) {
> > >>>>>>>>> netif_wake_queue();
> > >>>>>>>>> }
> > >>>>>> Hmm, this *almost* works. Running "cangen -g 0 can0" quickly gets the
> > >>>>>> queue into a hung state, though.
> > >>>>> what is a "hung state"?
> > >>>>>
> > >>>>>> I added printk() to the sites where I stop/wake the queue. It is
> > >>>>>> always
> > >>>>>> woken up correctly, but cangen always exits with "write: No buffer
> > >>>>>> space
> > >>>>>> available" after a few stop/wake cycles. Immediately running cangen
> > >>>>>> again works for a few more stop/wake cycles.
> > >>>>> let's look the the cangen source if there's propper error handling...
> > >>>>> http://svn.berlios.de/viewvc/socketcan/trunk/can-utils/cangen.c?revision=787&view=markup
> > >>>>> nope, there isn't
> > >>>>>
> > >>>>> There's the option "-i" to ignore ENOBUFS, which is....we don't want
> > >>>>> to
> > >>>>> do that....
> > >>>>>
> > >>>>> Better improbe the source: (from the pengutronix' canutils...)
> > >>>>>
> > >>>>> again:
> > >>>>> len = write(s, &frame, sizeof(frame));
> > >>>>> if (len == -1) {
> > >>>>> switch (errno) {
> > >>>>> case ENOBUFS: {
> > >>>>> int err;
> > >>>>> struct pollfd fds[] = {
> > >>>>> {
> > >>>>> .fd = s,
> > >>>>> .events = POLLOUT,
> > >>>>> },
> > >>>>> };
> > >>>>>
> > >>>>> err = poll(fds, 1, 1000);
> > >>>>> if (err == -1 && errno != -EINTR) {
> > >>>>> perror("poll()");
> > >>>>> exit(EXIT_FAILURE);
> > >>>>> } else if (err == 0) {
> > >>>>> printf("a timeout occured\n");
> > >>>>> exit(EXIT_FAILURE);
> > >>>>> }
> > >>>>> }
> > >>>>> case EINTR: /* fallthrough */
> > >>>>> goto again;
> > >>>>> default:
> > >>>>> perror("write");
> > >>>>> exit(EXIT_FAILURE);
> > >>>>> }
> > >>>>> }
> > >>>>>
> > >>>>> If you hit a timeout, which means you cannot send packages for 1000
> > >>>>> ms,
> > >>>>> your send queue probably didn't wake up...
> > >>>>>
> > >>>> Is there any chance you can send me the pengutronix version of
> > >>>> cangen/canutils? I can't find it on their website.
> > >>>
> > >>> http://www.pengutronix.de/software/socket-can/download/canutils/v4.0/
> > >>> http://www.pengutronix.de/software/libsocketcan/download/
> > >>>
> > >>> you need the libsocketcan, too
> > >>>
> > >>>>>> Here is the output of ifconfig:
> > >>>>>> can0 Link encap:UNSPEC HWaddr
> > >>>>>> 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
> > >>>>>> UP RUNNING NOARP MTU:16 Metric:1
> > >>>>>> RX packets:1902 errors:0 dropped:0 overruns:0 frame:0
> > >>>>>> TX packets:1902 errors:0 dropped:0 overruns:0 carrier:0
> > >>>>>> collisions:0 txqueuelen:10
> > >>>>>> RX bytes:10767 (10.5 KiB) TX bytes:10767 (10.5 KiB)
> > >>>>>> Interrupt:22
> > >>>>>>
> > >>>>>> can1 Link encap:UNSPEC HWaddr
> > >>>>>> 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
> > >>>>>> UP RUNNING NOARP MTU:16 Metric:1
> > >>>>>> RX packets:1902 errors:0 dropped:0 overruns:0 frame:0
> > >>>>>> TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
> > >>>>>> collisions:0 txqueuelen:10
> > >>>>>> RX bytes:10767 (10.5 KiB) TX bytes:0 (0.0 b)
> > >>>>>> Interrupt:22
> > >>>>>>
> > >>>>>> I'm stopping the queue correctly: the xmit() routine never complains
> > >>>>>> that it
> > >>>>>> tried to transmit but had no buffer space. You'll see that we didn't
> > >>>>>> drop any
> > >>>>>> packets in this run. In another run, we did drop packets. It is like
> > >>>>>> the
> > >>>>>> firmware didn't echo the packets back to me. Also note that the TX of
> > >>>>>> can0 always matches the RX of can1. They're cabled together, so this
> > >>>>>> is
> > >>>>>> expected.
> > >>>>> Hmm...it looks like this.
> > >>>>>
> > >>>>> Double check your RX path, so that you don't loose any frames in
> > >>>>> software. Is it possible that the RX and TX collide in your driver
> > >>>>> and/or in hardware?
> > >>>>>
> > >>>> I'm pretty sure that I do not lose RX frames in my driver. When I do,
> > >>>> stats->rx_dropped is always incremented. I do get the following warning
> > >>> I mean loosing RX frames due to an programming error...
> > >>>
> > >
> > > I've done some more investigation.
> > >
> > > 1) using a tasklet instead of the shared workqueue seems to help. We get
> > > many less rx packets lost
> >
> > Why don't you use NAPI for rx?
> >
>
> There are two kinds of messages from this firmware:
> 1) firmware control/communication messages (setup messages, error messages,
> etc.)
> 2) CAN messages
>
> In my first posting of the driver, I used a workqueue for #1, and NAPI
> for #2. Wolfgang suggested that I use just a workqueue. Is it
> legal/possible to trigger a NAPI poll() without having the netdevice
> open? Alternatively, I do all of the device setup during the netdevice
> open() instead of handling most of it at device probe time. I believe
> that NAPI can be triggered concurrently with netdevice open(), after
> napi_enable() is called.
>
> > > 2) the hangs seem to be due to the driver losing lots of interrupts. It
> > > spends all of it's time in the xmit() routine, and never gets a chance
> > > to run the recv() routine.
> >
> > You wake the tx_queue quite early in the work_queue. Try waking TX after you
> > have RX'ed all messages, not earlier.
> >
>
> Hmm, doesn't seem to help. I removed the netif_wake_queue() check from
> both ican3_recv_skb() and ican3_handle_interrupt() and moved it to the
> end of ican3_tasklet(). The TX queue can only be woken up after all
> packets have been recv'd now.
>
> I also tried checking the DESC_IVALID bit in addition to the DESC_VALID
> bit. You'll want to read the datasheet for yourself, but it basically
> says that you need to clear the IVALID bit before clearing the
> interrupt. Now process the buffer as you see fit. Once processing is
> finished, toggle the VALID bit. Now the controller can re-use the
> buffer.
>
> The datasheet says that this is so you do not lose interrupts, but I
> don't see how it can possibly work. You should always keep running
> around the ring until you're out of buffers, rather than count on not
> losing interrupts. Which is what my driver does.
>
> > > This is a problem, even using tasklet_hi_schedule(). The cansequence
> > > program is blasting the transmit queue so hard that we never get a
> > > chance to service the recv() queue. Therefore we start losing packets,
> > > and generating "rx overflow" error frames.
> >
> > Hmmm....if the tx_queue of your hardware is full,
> > the xmit functions shouldn't be called, because you halted it.
> >
>
> Right. The ican3_xmit() is not called again after netif_stop_queue() is
> called. That is working correctly. I think this is what is happening:
>
> Assume the RX/TX rings are 8 buffers long, instead of the 256 used in
> the driver. This will make the example easier.
>
> TX used: 7, almost full
> RX used: 8, full
>
> Ok, this is a bad state. We haven't been processing RX messages fast
> enough, for whatever reason.
>
> Now we TX a packet:
>
> TX used: 8, full, queue stopped
> RX used: 8, full
>
> We still haven't gotten around to processing the RX queue yet, but the
> firmware goes ahead and transmits the message. It works. Now it tries to
> do the hardware ECHO back, but the RX queue is full. So it drops the
> packet (bad). Then it sends us a control message through the control
> message queue, and we generate an "rx overflow" error frame.
>
> Does that make sense?
>
>
> I also tried tripling the length of the RX queue, without changing the
> length of the TX queue. No change in behavior. Given the above
> situation, that seems correct. :(
>
Here is another thing I just found. If I remove the handling of
DESC_IVALID in the ican3_handle_interrupt() function, we don't drop
packets, when only one of the CAN interfaces is active. (I put it
through >2 million packets per test, for >3 tests).
I wonder if the DESC_VALID and DESC_IVALID bits don't work exactly how
the datasheet describes them. The datasheet does contain two different,
mutually incompatible descriptions of them.
Once I ran cansequence on both CAN interfaces (connected together) the
driver started losing packets. Soon afterwards, the system hung with
nothing printed on the serial console. How am I going to track that bug
down?
Of course, once I'm not checking the DESC_IVALID bit anymore, my
interrupt handler is essentially nothing more than:
tasklet_schedule(&mod->task);
clear_interrupt();
return IRQ_HANDLED;
This is where NAPI (or something like it) would help cut down on the
interrupt load. I'll try enabling NAPI without having the netdevice
open, and see what happens.
Thanks,
Ira
_______________________________________________
Socketcan-core mailing list
[email protected]
https://lists.berlios.de/mailman/listinfo/socketcan-core