I think I mostly have the problem defined. As for a solution ...

The nbd protocol is prone to failure if it receives an nbd acknowledge
packet at the wrong moment - it completely falls over. This is with
swap over nbd; I don't know about "general operation".

The nbd protocol has a command packet followed by one or more data
packets, after which the machine receiving the data transmits an
acknowledge packet.
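
As a rough illustration of the exchange described above, here is a sketch
of the headers involved. It assumes the classic nbd wire format from the
kernel's nbd.h (big-endian fields, a 28-byte request header and a 16-byte
reply header); the nbd version on the netstation may differ. Treating the
"command packet" as the request header and the "acknowledge packet" as the
reply header is also an assumption, and the handle value is made up.

```python
import struct

# Assumed: classic nbd wire format, all fields big-endian.
NBD_REQUEST_MAGIC = 0x25609513
NBD_REPLY_MAGIC = 0x67446698
NBD_CMD_READ, NBD_CMD_WRITE = 0, 1

REQUEST = struct.Struct(">II8sQI")  # magic, type, handle, offset, length
REPLY = struct.Struct(">II8s")      # magic, error, handle


def pack_request(cmd, handle, offset, length):
    """Build the header that precedes the data packets."""
    return REQUEST.pack(NBD_REQUEST_MAGIC, cmd, handle, offset, length)


def unpack_reply(raw):
    """Parse an acknowledge header; returns (error, handle)."""
    magic, error, handle = REPLY.unpack(raw)
    if magic != NBD_REPLY_MAGIC:
        raise ValueError("bad reply magic")
    return error, handle


# Hypothetical 4096-byte write at offset 8192, and its acknowledge.
req = pack_request(NBD_CMD_WRITE, b"swap0001", 8192, 4096)
ack = REPLY.pack(NBD_REPLY_MAGIC, 0, b"swap0001")
print(len(req), len(ack), unpack_reply(ack))
```

The handle is echoed back in the reply, which is how a request and its
acknowledge are matched up.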

If, however, the nbd sending machine receives an acknowledge packet
before it has finished transmitting a stream of data packets, it goes
into never never land. An acknowledge arriving immediately after the
command packet, before the first data packet, is fine, as is one
arriving after the last data packet.
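
The rule above can be written down as a small trace checker. This is only
a model of the observed behaviour, not of the driver: the event names
"CMD", "DATA" and "ACK" and the function unsafe_acks are made up for the
illustration, and three data packets per command is taken from the
4096-byte case.

```python
def unsafe_acks(events, data_per_command=3):
    """Return indexes of ACK events that land in the middle of a data burst.

    events is the sequence seen by the sending machine: "CMD" and "DATA"
    are packets it transmits, "ACK" is an acknowledge it receives.
    """
    sent = data_per_command  # idle line: treat the last burst as complete
    bad = []
    for i, ev in enumerate(events):
        if ev == "CMD":
            sent = 0          # a new burst starts, no data sent yet
        elif ev == "DATA":
            sent += 1
        elif ev == "ACK" and 0 < sent < data_per_command:
            bad.append(i)     # some, but not all, data packets are out
    return bad


fine = ["CMD", "ACK", "DATA", "DATA", "DATA",
        "CMD", "DATA", "DATA", "DATA", "ACK"]
crunch = ["CMD", "DATA", "ACK", "DATA", "DATA"]
print(unsafe_acks(fine), unsafe_acks(crunch))
```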

I imagine something subtle is being locked along the way, and the
flag generated by a "there's a packet on your port" event really
wrenches things.

So, reducing the maximum number of data packets to one should solve the
problem. The system transmits 4096 bytes per nbd request, corresponding
to three data packets. With only one data packet per request, the
sender could never receive an acknowledge "in the middle" (or, at
least, I hope so).
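
The "three data packets" figure is consistent with simple arithmetic,
sketched below. The MTU and header sizes are assumptions (standard
Ethernet, IPv4 and TCP without options); only the 4096 bytes comes from
the observation above. segments_for is a made-up helper name.

```python
import math

MTU = 1500            # assumed Ethernet MTU
IP_HEADER = 20        # assumed IPv4 header without options
TCP_HEADER = 20       # assumed TCP header without options
REQUEST_BYTES = 4096  # bytes per nbd request, as observed

# Largest TCP payload that fits in one Ethernet frame.
mss = MTU - IP_HEADER - TCP_HEADER


def segments_for(nbytes):
    """Number of full-sized TCP segments needed to carry nbytes of data."""
    return math.ceil(nbytes / mss)


print(mss, segments_for(REQUEST_BYTES), segments_for(1024))
```

Under these assumptions a request would have to carry no more than 1460
bytes of data to fit in a single packet, so a 1024-byte request would,
while 4096 bytes needs three.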

With a pair of high speed connections which start properly, things fall
into a rhythm, where an nbd acknowledge packet is received after each
command packet, and everything is fine.

However, if the window fills up, then nbd acknowledge packets piggyback
on window-reducing TCP packets, and these packets end up being sent
at the wrong time, and ... crunch.

However, the window does not normally fill up if the server process is
running with reasonable priority - I had that particular trail wrong :(.

After pausing, however, there is a chance that as the line "starts up",
acknowledge packets will be received at the wrong time. This failure
is sporadic, but happens eventually.

The next problem is that if the netstation does not receive the nbd 
acknowledge packet within reasonable time, it will kill the process I'm
trying to start.

I eliminated the nbd acknowledge packet altogether, and of course this
problem resulted. But now I think I know what is causing the thing to
kill processes.

How to fix the problem ?

- Change the nbd version, if that really will fix the problem. It
  means changing the kernel and recompiling to work on the netstation.
- Eliminate nbd acknowledge packets from the picture. I'm vague on how
  the driver knows there is a reply packet, because nbd_reply does not
  appear in any obvious spot.
- Lower the number of ethernet packets in a data transmission from
  3 to 1. Still don't know how to do this.
- Increase the timeout of the swap function, and send a set of 
  acknowledge packets only when the line is clear. I cannot see where 
  the timeout is set in the swap function, or the networking stuff.
  The networking stuff may deal in generally defined "packets", so
  searching for "nbd" did not show up anything.
- Change the function transmitting the acknowledge so that it transmits
  only at the "right" times. I suspect this would involve
  driver modifications rather than something which could run in
  user space. Essentially, you'd transmit only if the line had been
  silent for a while, or a command packet had just been received.
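
The last option in the list above boils down to a small gating rule,
sketched here as a plain model. It is not driver code: the class AckGate,
its method names and the 50 ms quiet period are all made up for the
illustration, and timestamps are passed in by the caller so the logic can
be checked without a network.

```python
class AckGate:
    """Decide whether now is a "right" time to transmit an acknowledge."""

    def __init__(self, quiet_period=0.05):
        self.quiet_period = quiet_period  # seconds of silence required (assumed value)
        self.last_rx = None               # time the last packet was received
        self.last_was_command = False

    def on_packet(self, kind, now):
        """Record an incoming packet; kind is "CMD" or "DATA"."""
        self.last_rx = now
        self.last_was_command = (kind == "CMD")

    def may_send_ack(self, now):
        if self.last_rx is None:
            return True   # nothing received yet, so the line is silent
        if self.last_was_command:
            return True   # a command packet has just been received
        # Otherwise wait until the line has been silent for a while.
        return now - self.last_rx >= self.quiet_period


gate = AckGate(quiet_period=0.05)
gate.on_packet("CMD", now=0.000)
print(gate.may_send_ack(now=0.001))   # right after a command packet
gate.on_packet("DATA", now=0.002)
print(gate.may_send_ack(now=0.003))   # data burst in progress
print(gate.may_send_ack(now=0.060))   # line has gone quiet
```

Holding acknowledges back like this only helps if the swap timeout is
long enough to wait out the quiet period.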

Any suggestions on how to pursue these options ?

Thanks,

-- 
John August

Some of us are paying for sins we have committed.
Others are paying for sins we still have to commit.


--
SLUG - Sydney Linux User Group Mailing List - http://slug.org.au/
More Info: http://slug.org.au/lists/listinfo/slug
