On Fri, Jun 3, 2011 at 5:43 PM, Steven McCoy <[email protected]> wrote:
> On 4 June 2011 03:49, Ladan Gharai <[email protected]> wrote:
>
>> On Wed, Jun 1, 2011 at 4:41 PM, Steven McCoy <[email protected]> wrote:
>>
>>> On 2 June 2011 04:17, Ladan Gharai <[email protected]> wrote:
>>>
>>>> I’ve turned on the openpgm trace/debug messages – afaict once the
>>>> epgm receiver sustains “a lot” of packet loss it's just not able to
>>>> start over again
>>>>
>>>
>>> Every time the receiver sees packet loss it closes the socket and
>>> schedules a new socket to be created to reconnect to the PGM stream.
>>>
>>
>> I am not sure I understand this - do you mean the zmq socket gets a new
>> zmq socket if the ePGM receiver experiences unrecoverable loss? (I don't
>> see any new socket opening, I just see the zmq recv not receiving anymore)
>>
>
> ZMQ creates a new PGM socket. PGM is a socket based API beneath ZMQ.
>

I see. But the new PGM socket does not seem to reconnect to the receiver?
Also, could you point out where in the zmq code this happens? (I'd like
to print out an error message or do something once this happens)

>>>>
>>>> My questions are:
>>>>
>>>> 1. Is there a way to reset the receiver once this happens?
>>>>
>>>
>>> Reconnects occur with the same engine as TCP reconnects.
>>>
>>>>
>>>> 2. Has anyone experimented with changing the size of the rxw (it
>>>> currently uses 33333) – and the various timers NAK_RB_IVL, NAK_RPT_IVL
>>>> and NAK_RDATA_IVL (something akin to TCP tuning?)
>>>>
>>>
>>> If you find PGM is non-productive you should investigate tightening the
>>> recovery settings so failure is raised sooner rather than later. The
>>> default settings are friendly towards 10mb networks and so running at
>>> high speed on 1gb networks may pose a problem with high data loss.
>>>
>>> For example, drop the retry count for DATA & NCF from the default 50
>>> to 2.
>>>
>>> ~line 211 in pgm_socket.cpp:
>>> nak_data_retries = 2,
>>> nak_ncf_retries = 2;
>>>
>>
>> Yes - this seems the most sensible approach, except now it crashes -
>> Segmentation fault - once it falls into a long series of packet losses.
>>
>
> Can you provide a trace? A coredump should make it more expedient to
> diagnose the bug.
>

Well, I tried to strip the code down to send you a simple test case, and
in the process realized I had somehow contaminated the OpenPGM code. With
a fresh OpenPGM my application no longer crashes with the reduced retry
values :)

But it seems even more of our loss problems were related to having set
ZMQ_RATE to a rather high number (initially 950Mbps and then 500Mbps) -
I have now reduced it to 100Mbps.

I am now seeing the following behaviors:

ps: thank you for the link to https://zeromq.jira.com/browse/LIBZMQ-205

> --
> Steve-o
>
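A note on units for the rates mentioned above: ZMQ_RATE takes its value in kilobits per second, so 100Mbps corresponds to 100000. The following is only a minimal sketch of that conversion; mbps_to_zmq_rate is a hypothetical helper (not part of libzmq), and the 64-bit option type is an assumption based on the 2.x API (later versions use a plain int).

```cpp
#include <cstdint>

// Hypothetical helper, not part of libzmq: convert a rate given in
// megabits per second into the kilobits-per-second value ZMQ_RATE expects.
constexpr std::int64_t mbps_to_zmq_rate(std::int64_t mbps) {
    return mbps * 1000;  // 1 Mbps = 1000 kbps
}

// Compile-time check of the three rates discussed in the thread.
static_assert(mbps_to_zmq_rate(100) == 100000, "100 Mbps is 100000 kbps");
static_assert(mbps_to_zmq_rate(500) == 500000, "500 Mbps is 500000 kbps");
static_assert(mbps_to_zmq_rate(950) == 950000, "950 Mbps is 950000 kbps");

// Assumed usage with libzmq 2.x, set before connecting the epgm socket:
//   std::int64_t rate = mbps_to_zmq_rate(100);
//   zmq_setsockopt(socket, ZMQ_RATE, &rate, sizeof rate);
```

Keeping the conversion in one place makes it harder to pass a value in the wrong unit when experimenting with different rates.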
_______________________________________________
zeromq-dev mailing list
[email protected]
http://lists.zeromq.org/mailman/listinfo/zeromq-dev
