Hi,

We have a setup with 2 unicast channels, and we recently ran into an issue where we lost a bunch of metrics submitted with gmetric because a DNS problem made one of the two channels unreachable. I traced this back to libgmond.c and Ganglia_udp_send_channels_create(...), where the code calls exit(1) as soon as it fails to create a socket (lines 323-344).

I'm not sure whether this is intended, but it certainly hurts redundant setups like ours, where we'd much prefer that some of the channels get data rather than all or nothing. I'd like to propose changing the behavior so that the error_msg() + exit() is replaced with a debug_msg() call, and then, outside the loop and before the return, we check whether any channel has been created at all and fail there if none has.

I would have gone ahead and attached a patch, but I'm not familiar with the APR API and was unsure of the best way to deal with the send_channels array, especially given that the code seems to preallocate space for num_udp_send_channels (line 291).
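To make the proposal a bit more concrete, here is a minimal sketch of the control flow in plain C, deliberately avoiding APR. It is not a patch against libgmond.c: channel_conf, open_fn, create_send_channels and fake_open are hypothetical names made up for illustration, and the real code would need to do the equivalent with its own socket creation and the send_channels array.

```c
#include <stdio.h>
#include <string.h>
#include <stddef.h>

/* Hypothetical stand-in for one configured udp_send_channel. */
typedef struct {
    const char *host;
    int port;
} channel_conf;

/* Hypothetical opener: returns a descriptor >= 0 on success, -1 on failure. */
typedef int (*open_fn)(const channel_conf *conf);

/*
 * Try to open every configured channel. A failure is logged and skipped
 * instead of aborting the process. Successfully opened descriptors are
 * stored contiguously in out[], which the caller preallocates with room
 * for n entries. Returns the number of channels actually created.
 */
size_t create_send_channels(const channel_conf *confs, size_t n,
                            open_fn open_channel, int *out)
{
    size_t created = 0;

    for (size_t i = 0; i < n; i++) {
        int fd = open_channel(&confs[i]);
        if (fd < 0) {
            /* Was: error message + exit(1). Now: note it and move on. */
            fprintf(stderr, "unable to create UDP channel %s:%d, skipping\n",
                    confs[i].host, confs[i].port);
            continue;
        }
        out[created++] = fd;
    }
    return created;
}

/*
 * Demo opener, for illustration only: pretends that any host whose name
 * starts with "bad" cannot be resolved, and otherwise returns the port
 * number as a fake descriptor.
 */
int fake_open(const channel_conf *conf)
{
    return strncmp(conf->host, "bad", 3) == 0 ? -1 : conf->port;
}
```

The caller would then treat a return value of 0 as the fatal case (error message and exit), and otherwise carry on with however many channels came up. Because only the successful entries are stored, preallocating room for all configured channels remains safe; the array is simply not filled completely.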
Thanks for your input,
Spike

--
"Behind every great man there's a great backpack" - B.