2016-01-31 15:14 GMT+01:00 Pieter Hintjens <[email protected]>:
> On Thu, Jan 28, 2016 at 6:30 PM, Mario Steinhoff
> <[email protected]> wrote:
> > It would be great if someone could confirm them, or if statements do not
> > hold true, clarify the inner zproto workings.
>
> Your description is 100% accurate. Well done.
>
> Feel free to add to the README.txt if you feel you can make it clearer
> to newcomers.
Sure :)

> > With a single client, my implementation works just fine. But when there are
> > multiple clients connected, and a large file is sent to one of the clients,
> > other clients timeout.
>
> This is an interesting problem. I've not tried this test in the
> original code so perhaps the best is that you study it and figure out
> what's happening. Where is it blocking?

During the last few days, I mitigated the problem by moving the code that
sends heartbeats to the client side and raising the expiry timeout to 30s.

What I currently have is a system that distributes a set of 'records':

- One record is sent in one message, and records are limited in size (<1kb).
- One publisher distributes records to many (or should I say few, expected
  to be < 100) subscribers.
- The records are grouped via path-like names and transferred in batches
  with a clear start and end, using credit-based flow control.
- A client subscribes to the server for all record sets it is interested in.
- Server and clients calculate MD5 hashes for all record groups and
  transmit/check them during subscription, so only actual changes are sent.
- For now, clients can subscribe to a server without any form of
  authentication.

So I'd say it is very similar to FileMQ, although not exactly the same.

Today I created a little test case to demonstrate the problem:

1. Set the heartbeat interval to 1s and the expiry timeout to 3s.
2. Launch a large enough number of 'empty' client processes (clients that
   have not received any data during a previous run; 8 clients seem to be
   sufficient).
3. Launch the server process.

On the server side, I added logging code that measures the time it takes to
execute all actions within a poll loop iteration, e.g. execute the client
FSMs, remove stale connections, monitor the server, etc., and warns when
that time exceeds one second (rough sketch in the PS below).

On the client side, I added logging code that measures the time between a
heartbeat request and its response and warns if this exceeds 100ms.

Server logfile: http://pastebin.com/raw/rcpZEUwX
Client logfile: http://pastebin.com/raw/19KjirFz (similar to other client logs)

And now we can observe the following behavior:

- Client starts, connects to socket, sends hello, waits for response.
- Server starts, binds to socket, receives hello from all clients, sends
  hello ok to all clients.
- Clients receive hello ok, send subscriptions, send credit.
- Server receives subscribe from all clients, stores subscription requests,
  sends subscribe ok to all clients (subscription *requests*, because mounts
  can be added later on).
- Clients receive subscribe ok (no further action required).
- Server receives credit from all clients, but has nothing to send yet.
- Server receives a few heartbeats, sends heartbeat ok; everything is cool
  until:
- Server adds the first mount with record data.
- Server finds that there are pending subscription requests and adds them to
  the mount.
- Server finds that the MD5 sums from the subscription requests differ,
  finds it has credit for all clients, and starts sending records for that
  mount to _all_ clients.

In this case, sending the first record set to all clients takes ~7 seconds,
blocking the poll loop. During this time, the server queues up heartbeat
messages from clients but cannot send heartbeat ok back. When the expiry
timeout on the client is too low, a client will think the server is gone,
expire the connection, reset its internal state back to connecting, and
send a hello.
After the blocking action is done, the server still thinks it can send data
and happily sends a payload, while the client expects a hello ok and throws
a protocol error.

So I'd say this is not a problem with the zproto engines per se; all
single-threaded, event-driven systems will show such behavior when the
event thread is blocked.

Possible solutions I can think of:

1. Raise the expiry timeout on the client side.
   More of a workaround, because blocking still occurs but the clients
   won't care anymore. Will still cause timeouts if server actions block
   longer than the timeout.

2. Change the logic for how I notify clients about changed data.
   Currently my server actor receives a change event when a record set has
   changed and then sends records to all clients in one go. A better
   solution would be some sort of internal messaging where the engine
   consumes the change event, updates its mount, and then sends an internal
   message for each client to the poll loop. That way, blocking would still
   occur, but it is limited to one client and the time required to send one
   record set. Will still cause timeouts if sending a single record set
   takes longer than the timeout.

3. Make the server multi-threaded.
   Change the server thread to be a proxy that handles client connections
   and heartbeats, and offloads the data-sending logic to producer threads
   that can block as long as they want. Will avoid timeouts completely,
   because the server engine then only cares about protocol validation,
   heartbeating, and forwarding messages from producer threads to clients.
   Then we get something like this on the server side:
   http://zguide.zeromq.org/page:all#The-Asynchronous-Client-Server-Pattern
   (rough sketch in the PPS below)

Cheers
Mario
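PS: In case the timing numbers above look suspicious, the per-iteration check
on the server side is essentially the following. This is a stripped-down
sketch, not the real zproto-generated loop; the endpoint, the 1s tick, and
the message handling are placeholders:

    //  Sketch of the poll-loop timing warning (illustrative only)
    //  build roughly with: gcc sketch.c -lczmq -lzmq
    #include <czmq.h>

    int main (void)
    {
        zsock_t *router = zsock_new_router ("tcp://*:5555");
        assert (router);
        zpoller_t *poller = zpoller_new (router, NULL);

        while (!zsys_interrupted) {
            void *which = zpoller_wait (poller, 1000);   //  1s heartbeat tick
            int64_t started = zclock_mono ();            //  monotonic msecs

            if (which == router) {
                zmsg_t *msg = zmsg_recv (router);   //  hello, heartbeat, credit, ...
                //  ... run the client FSM for this message ...
                zmsg_destroy (&msg);
            }
            //  ... expire stale clients, send records while clients have
            //  credit (this is the part that can block for seconds) ...

            int64_t elapsed = zclock_mono () - started;
            if (elapsed > 1000)
                zsys_warning ("poll loop iteration took %d ms", (int) elapsed);
        }
        zpoller_destroy (&poller);
        zsock_destroy (&router);
        return 0;
    }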
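PPS: A rough sketch of what I mean by option 3, based on the zguide asyncsrv
pattern. Untested; the endpoints, socket choices, and actor names are mine,
and the real thing would keep the zproto engine running in the frontend
thread while the producer actors do the slow sending:

    //  Sketch only: asynchronous server skeleton for option 3
    #include <czmq.h>

    //  Producer thread: may block as long as it wants while building and
    //  sending one record set; heartbeats keep flowing in the main loop.
    static void
    producer_actor (zsock_t *pipe, void *args)
    {
        zsock_t *work = zsock_new_dealer ("inproc://producers");
        zsock_signal (pipe, 0);             //  tell zactor we are ready

        zpoller_t *poller = zpoller_new (pipe, work, NULL);
        while (true) {
            void *which = zpoller_wait (poller, -1);
            if (which != work)
                break;                      //  $TERM or interrupt: shut down
            zmsg_t *request = zmsg_recv (work);
            //  ... send the record set for one client here ...
            zmsg_destroy (&request);
        }
        zpoller_destroy (&poller);
        zsock_destroy (&work);
    }

    int main (void)
    {
        //  Frontend: speaks the protocol (hello, subscribe, credit, heartbeat)
        zsock_t *frontend = zsock_new_router ("tcp://*:5555");
        //  Backend: hands data-sending work to producer threads
        zsock_t *backend = zsock_new_router ("inproc://producers");
        zactor_t *producer = zactor_new (producer_actor, NULL);

        zpoller_t *poller = zpoller_new (frontend, backend, NULL);
        while (!zsys_interrupted) {
            void *which = zpoller_wait (poller, 1000);
            if (which == frontend) {
                //  ... protocol validation, heartbeat ok, dispatch work
                //  for a client to the backend ...
            }
            else
            if (which == backend) {
                //  ... forward a record from a producer to its client ...
            }
        }
        zpoller_destroy (&poller);
        zactor_destroy (&producer);
        zsock_destroy (&frontend);
        zsock_destroy (&backend);
        return 0;
    }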
