In my application I have roughly 100 main workers distributed across 50 nodes - so there are 2 main workers per node.
I'm struggling with a pipeline stage that needs to complete quickly, in less than a second, but sometimes takes 5 or 6 seconds.

In this pipeline stage each of the 50 nodes has to send its piece of data to all of the 100 main workers, so 50 nodes each send 100 pieces of data, with each piece about 1.5M bytes. The main workers need to collect all of the pieces of data before the pipeline can advance to the next stage. During this pipeline stage each node sends out about 150M bytes of data (1.5M to each of 100 workers) and receives about 150M bytes of data (1.5M for each of its 2 workers from each of the 50 nodes).

I'm currently using REQ/REP sockets for this. Each of the 50 nodes has 100 REQ sockets, one connected to each worker, and 2 REP sockets, one for each of the 2 workers receiving data on the node. The request from a node to a worker contains the 1.5M piece of data. The reply from the worker back to the node is empty, but indicates that the worker has finished collecting the piece of data from the originating node. The reply is used to "block" the pipeline until all of the data collection is complete.

I'm not sure REQ/REP is the best pattern for this case. There are a lot of connections involved (50 x 100 = 5000 REQ connections in total), and it doesn't seem like this would scale well as more workers/nodes are introduced. Could someone recommend a better pattern/solution that would provide consistent behavior with low latency?
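
For reference, here is a rough back-of-the-envelope sketch of the numbers above, using only the Python standard library. It takes "1.5M" to mean 1.5e6 bytes, which is an assumption, and the link speeds in the loop are purely illustrative values, not measurements from my cluster. The per-node totals also count the pieces a node sends to its own 2 workers. The function names are just for this illustration.

```python
# Back-of-the-envelope numbers for the all-to-all stage described above.
NODES = 50
WORKERS_PER_NODE = 2
WORKERS = NODES * WORKERS_PER_NODE   # 100 main workers in total
PIECE_BYTES = 1_500_000              # "1.5M bytes", assumed to mean 1.5e6


def stage_volumes():
    """Connection counts and byte volumes for one run of the stage."""
    sent_per_node = WORKERS * PIECE_BYTES                       # one piece per worker
    received_per_node = WORKERS_PER_NODE * NODES * PIECE_BYTES  # 2 workers x 50 senders
    return {
        "req_sockets_per_node": WORKERS,
        "req_connections_total": NODES * WORKERS,
        "sent_per_node_bytes": sent_per_node,
        "received_per_node_bytes": received_per_node,
        "total_bytes": NODES * sent_per_node,
    }


def min_transfer_seconds(n_bytes, link_gbit_per_s):
    """Lower bound: time to push n_bytes through one link at full line rate."""
    return n_bytes * 8 / (link_gbit_per_s * 1e9)


if __name__ == "__main__":
    volumes = stage_volumes()
    for name, value in volumes.items():
        print(f"{name}: {value:,}")
    # Hypothetical link speeds, for illustration only.
    for gbit in (1, 10, 25):
        seconds = min_transfer_seconds(volumes["sent_per_node_bytes"], gbit)
        print(f"{gbit} Gbit/s link: at least {seconds:.3f} s to send one node's data")
```

Under these assumptions the payload alone needs about 1.2 s per node on a 1 Gbit/s link and about 0.12 s on a 10 Gbit/s link, before any protocol or REQ/REP round-trip overhead, so the sub-second target is only reachable if the links are faster than 1 Gbit/s.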
