I think I mostly have the problem defined. As for a solution ...

The nbd protocol is prone to failure if it receives an nbd acknowledge packet at the wrong moment: it completely falls over. This is for swap; I don't know about "general operation".

The nbd protocol has a command packet followed by one or more data packets, after which the machine receiving the data transmits an acknowledge packet. If, however, the nbd sending machine receives an acknowledge packet before it has finished transmitting a stream of data packets, it goes into never-never land. An acknowledge arriving immediately after the command packet and before the set of data packets is fine, as is one arriving after the set of data packets. I imagine something subtle is being locked along the way, and the flag generated by a "there is a packet on your port" event really wrenches things.

So, reducing the maximum number of data packets to one would solve the problem. The system transmits 4096 bytes per nbd request, corresponding to three data packets. Transmitting one data packet means it would never receive an acknowledge "in the middle" (or, at least, I hope so).

With a pair of high-speed connections which start properly, things fall into a rhythm, where an nbd acknowledge packet is received after each command packet, and everything is fine. However, if the window fills up, then nbd acknowledge packets piggyback on window-reducing TCP packets, and these packets end up being sent at the wrong time, and ... crunch. That said, the window does not normally fill up if the server process is running with reasonable priority - I had that particular trail wrong :(. After pausing, however, there is a chance that as the line "starts up", acknowledge packets will be received at the wrong time. This failure is sporadic, but happens eventually.

The next problem is that if the netstation does not receive the nbd acknowledge packet within reasonable time, it will kill the process I'm trying to start.
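As a back-of-envelope check on the three-packet figure and the "safe" windows for an acknowledge, here is a rough Python sketch. It assumes a standard 1500-byte Ethernet MTU and 40 bytes of IPv4 plus TCP headers with no options; neither number comes from my measurements. The function names data_packets and ack_is_safe are made up for illustration, and this only models the behaviour I observed, not the actual nbd driver.

```python
import math

# Assumed figures (not measured on the netstation):
ETHERNET_MTU = 1500                    # usual Ethernet payload size, bytes
IP_TCP_HEADERS = 40                    # 20-byte IPv4 + 20-byte TCP, no options
MSS = ETHERNET_MTU - IP_TCP_HEADERS    # TCP payload bytes per packet

def data_packets(request_bytes, mss=MSS):
    """Number of TCP segments needed to carry one request's data."""
    return math.ceil(request_bytes / mss)

def ack_is_safe(packets_sent, packets_total):
    """Hypothetical model of the observed behaviour: an acknowledge is
    harmless before the first data packet or after the last one, and
    fatal anywhere in between."""
    return packets_sent == 0 or packets_sent == packets_total

def unsafe_states(request_bytes):
    """Values of 'data packets sent so far' at which an acknowledge
    arriving would land in the middle of the data stream."""
    total = data_packets(request_bytes)
    return [sent for sent in range(total + 1) if not ack_is_safe(sent, total)]

if __name__ == "__main__":
    print(data_packets(4096))     # packets for one 4096-byte request
    print(unsafe_states(4096))    # points where an acknowledge is fatal
    print(unsafe_states(1024))    # a request that fits in one packet
```

With a request small enough for one packet (1024 bytes is an arbitrary example), the list of unsafe states is empty, which is the reasoning behind cutting the transfer down to a single data packet.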
I eliminated the nbd acknowledge packet altogether, and of course that problem (the netstation killing the process) resulted. But now I think I know what is causing the thing to kill processes.

How to fix the problem?

- Change the nbd version. Well, if it really will fix the problem. It means changing the kernel and recompiling to work on the netstation.
- Eliminate nbd acknowledge packets from the picture. I'm vague on how the driver knows there is a reply packet, because nbd_reply does not appear in any obvious spot.
- Lower the number of ethernet packets in a data transmission from 3 to 1. I still don't know how to do this.
- Increase the timeout of the swap function, and send a set of acknowledge packets only when the line is clear. I cannot see where the timeout is set in the swap function or in the networking code. The networking code may deal in generically defined "packets", so searching for "nbd" did not turn up anything.
- Change the function transmitting the acknowledge so that it transmits only at the "right" times. I suspect this would involve driver modifications rather than something which could run in user space. Essentially, you'd transmit if the line had been silent for a while, or a command packet had just been received.

Any suggestions on how to pursue these options?

Thanks,

--
John August

Some of us are paying for sins we have committed. Others are paying for sins we still have to commit.

--
SLUG - Sydney Linux User Group Mailing List - http://slug.org.au/
More Info: http://slug.org.au/lists/listinfo/slug
