Multi-get implementation in binary protocol
Hello,

I'm currently evaluating the performance of a memcached server using several client workloads, and I have a question about how multi-get is implemented in the binary protocol.

As I understand it, in the ascii protocol we can send multiple keys in a single request packet to implement multi-get. In the binary protocol, however, it seems we have to send multiple request packets (one request packet per key). Even if we send a series of getQ requests followed by a plain get for the last key, we only save response packets for cache misses.

If I understand correctly, multi-get in the binary protocol cannot reduce the number of request packets, and it also cannot reduce the number of response packets when the hit ratio is very high (say, a 99% get hit rate). If the performance bottleneck is on the network side rather than the CPU, reducing the number of packets still seems very important, so I don't understand why the binary protocol doesn't address this. Did I miss something?

Thanks in advance,
Byungchul.
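For concreteness, here is a minimal sketch in C (with hypothetical helper names) of the pipelining pattern the question describes: a batch of quiet GETQ requests terminated by a NOOP (equivalent to ending with a plain get for the last key), flushed with a single write() so the whole batch can share packets. The 24-byte request header layout and the opcodes (GETQ = 0x09, NOOP = 0x0a) are from the binary protocol spec; bounds checking and partial-write handling are omitted.

    /* Pipelined binary-protocol "multi-get": queue N quiet GETQ requests
     * plus one NOOP terminator, then flush them with one write() so the
     * whole batch can leave in as few packets as possible. Sketch only:
     * no bounds checks, no partial-write loop. Real clients set a
     * distinct opaque per request (or use GETKQ, 0x0d) to match
     * responses back to keys. */
    #include <stdint.h>
    #include <string.h>
    #include <unistd.h>

    static size_t pack_req(uint8_t *out, uint8_t opcode,
                           const char *key, size_t klen) {
        memset(out, 0, 24);
        out[0] = 0x80;                    /* request magic */
        out[1] = opcode;
        out[2] = (uint8_t)(klen >> 8);    /* key length, big-endian */
        out[3] = (uint8_t)klen;
        out[8]  = (uint8_t)(klen >> 24);  /* total body len = key len */
        out[9]  = (uint8_t)(klen >> 16);
        out[10] = (uint8_t)(klen >> 8);
        out[11] = (uint8_t)klen;
        if (klen)
            memcpy(out + 24, key, klen);
        return 24 + klen;
    }

    /* GETQ sends no reply on a miss; the NOOP's response marks the end
     * of the batch on the wire. */
    static ssize_t send_multiget(int fd, const char **keys, int nkeys) {
        uint8_t buf[8192];
        size_t off = 0;
        for (int i = 0; i < nkeys; i++)
            off += pack_req(buf + off, 0x09, keys[i], strlen(keys[i]));
        off += pack_req(buf + off, 0x0a, NULL, 0);  /* NOOP terminator */
        return write(fd, buf, off);    /* one syscall for the batch */
    }

Note that this already answers half the question: the requests themselves can share packets, since nothing forces one write() per request. The per-key responses on a hit are the part the protocol does not let you suppress.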
Re: Multi-get implementation in binary protocol
At least in my experience at Facebook, 1 request != 1 packet. That is, if you send several/many requests to the same memcached box quickly, they tend to go out in the same packet or group of packets, so you still get the benefit of fewer packets. In fact, we take advantage of this because it is very important at very high request rates (e.g., over 1M gets per second). The same thing happens on the reply side: the results tend to come back in just one packet (or more, if the replies are larger than a packet).

At Facebook, our main way of talking to memcached (mcrouter) doesn't even support multi-gets on the client side, and it *doesn't matter*, because the batching happens anyway.

I don't have any experience with the memcached-defined binary protocol, but I suspect something similar is going on here. You can verify this by using a tool like tcpdump or ngrep to see what goes into each packet when you do a series of gets to the same box over the binary protocol. My bet is that you'll see them going into the same packet (as long as there aren't any delays in sending them out from your client application).

That said, I'd love to see what you learn if you run this experiment.

Cheers,
~Ryan
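As a concrete way to run the check suggested above (a plain example, assuming memcached on the default port with a client on the same host):

    # dump each packet's payload on loopback traffic to port 11211;
    # pipelined gets sent back-to-back should show up inside one packet
    sudo tcpdump -i lo -nn -X 'tcp port 11211'

Requests that coalesced will appear as several 24-byte binary headers inside a single captured TCP segment.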
Re: Issue 363 in memcached: MemcachePool::get(): Server 127.0.0.1 (tcp 11211, udp 0) failed with: Network timeout
Hey, try this branch: https://github.com/dormando/memcached/tree/double_close

So far as I can tell, that emulates the behavior in .17. To build:

./autogen.sh
./configure
make

Run it in screen like you were doing with the other tests, and see if it prints "ERROR: Double Close [somefd]". If it prints that once and then stops, I guess that's what .17 was doing; if it spams the message, then something else may have changed. I'm mostly convinced something about your OS or build is corrupt, but I have no idea what it is. The only other thing I can think of is to instrument .17 a bit more and have you try that (with the connection code laid out the old way, but with a conn_closed flag to detect a double-close attempt), and see if the old .17 still did it.

On Tue, 6 May 2014, notificati...@commando.io wrote:
> Changing from 4 threads to 1 seems to have resolved the problem. No timeouts since. Should I set it to 2 threads and see how things go?

On Tuesday, May 6, 2014 12:07:08 AM UTC-7, Dormando wrote:
> And how'd that work out? Still no other reports :/ and a few thousand more downloads of .19...

On Sun, 4 May 2014, notifi...@commando.io wrote:
> I'm going to try switching threads from 4 to 1. This host, web2, is the only one I'm seeing it on, but it's also the only host that gets any real traffic. Super frustrating.

On Sunday, May 4, 2014 10:12:08 AM UTC-7, Dormando wrote:
> I'm stumped (also, your e-mails aren't updating the ticket...). It's impossible for a connection to get into the closed state without having event_del() and close() called on the socket, and a socket slot isn't event_add()'ed again until after the state is reset to 'init_state'. There was no code path for event_del() to actually fail, so far as I could see. I've e-mailed Steven Grimm for ideas, but either that's not his e-mail anymore or he's not going to respond. I really don't know. I guess the old code would've just called conn_close() again by accident; I don't see how the logic changed in any significant way in .18. Then again, if this happened with any frequency, people's curr_conns stat would go negative. So either it always happened and we never noticed, or your particular OS is corrupt. There are probably 10,000+ installs of .18+ now and only one complaint, so I'm a little hesitant to spend a ton of time on this until we get more reports. You should downgrade to .17.

On Sun, 4 May 2014, notifi...@commando.io wrote:
> Damn it, got a network timeout. CPU 3 is at 100% from memcached. Here is the stats output to verify the new versions of memcached and libevent:
> STAT version 1.4.19
> STAT libevent 2.0.18-stable

On Saturday, May 3, 2014 11:55:31 PM UTC-7, notifi...@commando.io wrote:
> Just upgraded all 5 web servers to memcached 1.4.19 with libevent 2.0.18. Will advise if I see memcached timeouts. Should be good though. Thanks so much for all the help and patience; really appreciated.

On Friday, May 2, 2014 10:20:26 PM UTC-7, memc...@googlecode.com wrote:
> Updates:
> Status: Invalid
>
> Comment #20 on issue 363 by dorma...@rydia.net: MemcachePool::get(): Server 127.0.0.1 (tcp 11211, udp 0) failed with: Network timeout
> http://code.google.com/p/memcached/issues/detail?id=363
>
> Any repeat crashes? I'm going to close this. It looks like Remi shipped .19; reopen this, or open a new issue, if it hangs in the same way somehow. Well, .19 won't be printing anything, and it won't hang, but if it's actually our bug and not libevent, it would end up spinning the CPU. Keep an eye out, I guess.
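For reference, a concrete version of the build-and-run steps in the message above might look like this (the memcached flags shown are only an example; -p sets the port and -t the worker thread count):

    ./autogen.sh
    ./configure
    make
    # run in the foreground under screen and watch the output for
    # "ERROR: Double Close [somefd]"
    screen ./memcached -p 11211 -t 4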
Re: Multi-get implementation in binary protocol
> If I understand correctly, multi-get in the binary protocol cannot reduce the number of request packets, and it also cannot reduce the number of response packets when the hit ratio is very high (like a 99% get hit rate). [...] Did I miss something?

You're right, it sucks. I was never happy with it, but I haven't had time to add adjustments to the protocol for this. To note, .19 lifted some inefficiencies in the protocol, and most network cards are fast enough for most situations, even at one packet per response (and large enough responses split into multiple packets anyway).

The reason it was done this way is latency and streaming of responses:

- In an ascii multiget, I can send 10,000 keys, but then I'm forced to wait for the server to look up all of the keys before it sends its responses. This delay isn't typically very high, but there is some latency to it.

- In a binary multiget, the responses are sent back more or less as the server pulls the requests off the network. This reduces the latency until you start seeing responses, regardless of how large your multiget is, which is useful if you have the kind of client that can process responses in a streaming fashion. It potentially reduces the total time to render your response, since you can keep the CPU busy unmarshalling responses instead of sleeping.

However, it should have some tunables: one where it does at least one write per complete packet (TCP_CORK'ed or similar), and one where it buffers up to some size. In my tests I can get ascii multiget up to 16.2 million keys/sec, but (with the fixes in .19) binprot caps out at 4.6M and spends all of its time calling sendmsg(). Most people need far, far less than that, so binprot as-is should be okay.

The code isn't too friendly to this, and there are other, higher-priority things I'd like to get done sooner. The relatively small number of people who do 500,000+ requests per second in binprot (they're almost always on ascii at that scale) is the other reason.
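As an illustration of the TCP_CORK tunable suggested above (a minimal sketch assuming a Linux TCP socket; this is not memcached's actual response path), corking holds small writes in the kernel until uncorked, so many per-response writes leave as full-sized packets:

    /* Sketch of "one write per complete packet" batching via Linux's
     * TCP_CORK: cork the socket, emit many small response writes, then
     * uncork so the kernel flushes them as full-sized packets. */
    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>
    #include <sys/uio.h>
    #include <unistd.h>

    static void send_corked(int fd, const struct iovec *resps, int n) {
        int on = 1, off = 0;
        setsockopt(fd, IPPROTO_TCP, TCP_CORK, &on, sizeof(on));
        for (int i = 0; i < n; i++)      /* many small response writes */
            write(fd, resps[i].iov_base, resps[i].iov_len);
        /* uncorking flushes whatever is buffered as full packets */
        setsockopt(fd, IPPROTO_TCP, TCP_CORK, &off, sizeof(off));
    }

When the batch is already assembled, a single writev() over the same iovec array has a similar on-the-wire effect and also cuts the per-response syscall overhead mentioned above.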
Re: Issue 363 in memcached: MemcachePool::get(): Server 127.0.0.1 (tcp 11211, udp 0) failed with: Network timeout
Bumped up to 2 threads, and so far no timeout errors. I'm going to let it run for a few more days, then revert back to 4 threads and see if the timeout errors come back. That should tell us whether the problem lies in spawning more than 2 threads.

On Wednesday, May 7, 2014 5:19:13 PM UTC-7, Dormando wrote:
> Hey, try this branch: https://github.com/dormando/memcached/tree/double_close
> [...]
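For reference, the worker thread count being varied in this thread is memcached's -t option (the default is 4), so the two-thread test corresponds to an invocation like:

    # start memcached with 2 worker threads on the default port
    memcached -p 11211 -t 2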
Re: Issue 363 in memcached: MemcachePool::get(): Server 127.0.0.1 (tcp 11211, udp 0) failed with: Network timeout
That doesn't really tell us anything about the nature of the problem, though. With 2 threads it might still happen; it's just a lot less likely.

On Wed, 7 May 2014, notificati...@commando.io wrote:
> Bumped up to 2 threads, and so far no timeout errors. I'm going to let it run for a few more days, then revert back to 4 threads and see if the timeout errors come back.
> [...]