Hi Scott,

> I'd like to hear more once you get it figured out.
We had two more outages in the meantime. I also learned that the class C network doesn't just hold the three OpenVZ nodes, but also 50 other (physical) Linux servers. So there is a certain level of noise and congestion in that subnet, which on average carries around 16Mbit of traffic. The pfSense in front of it used to be clustered, but the clustering had been disabled for troubleshooting. Still, the remaining pfSense flooded the network with VRRP requests, which made up the majority of the broadcast-related traffic in that class C. Which might (or might not) play a role here - you'd only see VRRP requests at that level in a datacenter with a similar setup. Further segmentation of that network seems prudent, but can't be done at the moment as it would be too disruptive. The client is considering it, though.

Forcing the OpenVZ nodes to do hourly arpsends looked like it had helped to improve the situation, but there were still two recorded outages. And as before, two of the three nodes would usually lose network connectivity at nearly the same time - just a few minutes apart. Not always the same nodes, but always two of three.

We had set up some monitoring to dump the ARP table and routes every minute and to diff them against the previous run. On the nodes that diff didn't indicate any changes at the time the failures happened. The ARP table on the pfSense also remained the same.

We ran a ping from inside a VPS to the outside world and from the outside to a VPS, with a tcpdump on both br0 and venet0 on the node and another tcpdump on venet0 inside the VPS. If I recall correctly, the ICMP packets arrived at the node's br0 and were routed to venet0, where the VPS could see them. But on the way back the ICMP response got lost between venet0 and br0.

We were kinda running out of options there (and patience, as far as the client's clients were concerned). So we ditched the bridges entirely and re-configured all nodes to use eth0 directly.
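For anyone wanting to try the hourly arpsend workaround mentioned above, it can be driven from cron roughly like this. This is only a sketch: the VPS IPs are placeholders, and the flags follow the usage vzctl itself employs (-U = unsolicited/gratuitous ARP, -c 1 = one packet, -i = the IP to announce).

```
# /etc/cron.d/arpsend-refresh (sketch; adjust IPs and interface)
# Announce each VPS IP on the node's public interface once an hour,
# so upstream ARP caches (e.g. the pfSense) get refreshed.
0 * * * * root for ip in 192.0.2.10 192.0.2.11; do arpsend -U -c 1 -i "$ip" eth0; done
```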
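The per-minute snapshot-and-diff monitoring described above can be sketched as a small cron-driven script. The state directory and log format are assumptions; the idea is just to keep last minute's ARP and routing tables and report any change.

```shell
#!/bin/sh
# Sketch: dump ARP (neighbour) table and routes, diff against the
# previous run, and note any change. Run once a minute from cron.
STATE=/var/tmp/netmon          # assumed state directory
mkdir -p "$STATE"

# Take the current snapshots.
ip neigh show > "$STATE/arp.new" 2>/dev/null
ip route show > "$STATE/route.new" 2>/dev/null

for f in arp route; do
    if [ -f "$STATE/$f.old" ]; then
        # Any difference against the previous snapshot gets logged.
        if ! diff -u "$STATE/$f.old" "$STATE/$f.new"; then
            echo "$(date) $f table changed"
        fi
    fi
    mv "$STATE/$f.new" "$STATE/$f.old"
done
```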
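For reference, the capture setup used in the ping test above looked roughly like the following (a fragment to run by hand as root; the VPS IP is a placeholder, interface names as in OpenVZ's default venet setup):

```
# On the hardware node: watch the ICMP echo/reply pass between the
# bridge and the venet device (192.0.2.10 stands in for the VPS IP).
tcpdump -ni br0    icmp and host 192.0.2.10
tcpdump -ni venet0 icmp and host 192.0.2.10

# Inside the VPS: confirm the packets actually arrive and are answered.
tcpdump -ni venet0 icmp
```

Comparing the three captures shows where a packet (or its reply) disappears - in our case the reply was visible on venet0 but never made it back to br0.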
As we don't (at least at the moment) use KVM, we can make do without the bridges. Open vSwitch sounds like a *very* interesting alternative, but we haven't yet found the time to experiment with it. So we chose the "devil we know", which is eth0. :p

It'll take a few days to see if that solves our issues, but so far it's looking good. /knocking on wood

Many thanks to all who offered suggestions. Much appreciated!

-- 
With best regards

Michael Stauber
_______________________________________________
Users mailing list
Users@openvz.org
https://lists.openvz.org/mailman/listinfo/users