Multi-get implementation in binary protocol

2014-05-07 Thread Byung-chul Hong
Hello,

I'm currently evaluating the performance of the memcached server using 
several client workloads, and I have a question about the multi-get 
implementation in the binary protocol.
As I understand it, in the ascii protocol we can send multiple keys in a 
single request packet to implement a multi-get.

But in the binary protocol, it seems that we have to send multiple request 
packets (one request packet per key) to implement a multi-get.
Even if we send a series of getQ requests followed by a get for the last 
key, we only save response packets on cache misses.
If I understand correctly, a multi-get in the binary protocol cannot reduce 
the number of request packets, and it also cannot reduce the number of 
response packets when the hit ratio is very high (e.g., 99% get hits).
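
To make the pattern concrete, here is a rough sketch in Python of what I 
mean (host, port, and key names are just placeholders; the 24-byte header 
layout follows the binary protocol spec):

import socket, struct

# Request header fields: magic, opcode, key length, extras length,
# data type, vbucket, total body length, opaque, cas.
HEADER = ">BBHBBHIIQ"
GET, GETQ = 0x00, 0x09

def frame(opcode, key, opaque=0):
    k = key.encode()
    return struct.pack(HEADER, 0x80, opcode, len(k), 0, 0, 0,
                       len(k), opaque, 0) + k

keys = ["key1", "key2", "key3"]
# getQ for every key but the last, then a plain get to terminate the batch.
buf = b"".join(frame(GETQ, k, i) for i, k in enumerate(keys[:-1]))
buf += frame(GET, keys[-1], len(keys) - 1)

s = socket.create_connection(("127.0.0.1", 11211))
s.sendall(buf)          # still one request frame per key inside this buffer
print(s.recv(65536))    # hits are answered; quiet misses send nothing back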

If the performance bottleneck is on the network side rather than on the CPU, 
I think reducing the number of packets is still very important, but I don't 
understand why the binary protocol doesn't address this.
Am I missing something?

Thanks in advance,
Byungchul.



Re: Multi-get implementation in binary protocol

2014-05-07 Thread Ryan McElroy
At least in my experience at Facebook, 1 request != 1 packet. That is, if
you send several/many requests to the same memcached box quickly, they will
tend to go out in the same packet or group of packets, so you still get the
benefits of fewer packets (and in fact, we take advantage of this because
it is very important at very high request rates -- e.g., over 1M gets per
second). The same thing happens on the reply side -- the results tend to come back
in just one packet (or more, if the replies are larger than a packet). At
Facebook, our main way of talking to memcached (mcrouter) doesn't even
support multi-gets on the client side, and it *doesn't matter* because the
batching happens anyway.

I don't have any experience with the memcached-defined binary protocol, but
I think there's probably something similar going on here. You can verify by
using a tool like tcpdump or ngrep to see what goes into each packet when
you do a series of gets to the same box over the binary protocol. My bet is
that you'll see them going in the same packet (as long as there aren't any
delays in sending them out from your client application). That being said,
I'd love to see what you learn if you do this experiment.
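
For example, something like the sketch below should show the coalescing if 
you watch the connection with tcpdump -nn -i lo tcp port 11211 in another 
terminal. It's only a rough sketch: host, port, and key names are 
placeholders, and it uses the ascii protocol just to keep the framing short; 
the same idea applies to a stream of binary get frames.

import socket

s = socket.create_connection(("127.0.0.1", 11211))

# Hand 100 requests to the kernel in one go; they will typically leave the
# host as a handful of packets rather than 100.
burst = b"".join(b"get key%d\r\n" % i for i in range(100))
s.sendall(burst)

# Each ascii get response is terminated by END\r\n, hit or miss.
data = b""
while data.count(b"END\r\n") < 100:
    data += s.recv(65536)
print(len(data), "bytes of responses")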

Cheers,

~Ryan


On Wed, May 7, 2014 at 1:24 AM, Byung-chul Hong byungchul.h...@gmail.com wrote:

 Hello,

 I'm currently evaluating the performance of the memcached server using
 several client workloads, and I have a question about the multi-get
 implementation in the binary protocol.
 As I understand it, in the ascii protocol we can send multiple keys in a
 single request packet to implement a multi-get.

 But in the binary protocol, it seems that we have to send multiple request
 packets (one request packet per key) to implement a multi-get.
 Even if we send a series of getQ requests followed by a get for the last
 key, we only save response packets on cache misses.
 If I understand correctly, a multi-get in the binary protocol cannot reduce
 the number of request packets, and it also cannot reduce the number of
 response packets when the hit ratio is very high (e.g., 99% get hits).

 If the performance bottleneck is on the network side rather than on the
 CPU, I think reducing the number of packets is still very important, but I
 don't understand why the binary protocol doesn't address this.
 Am I missing something?

 Thanks in advance,
 Byungchul.





Re: Issue 363 in memcached: MemcachePool::get(): Server 127.0.0.1 (tcp 11211, udp 0) failed with: Network timeout

2014-05-07 Thread dormando
Hey,

try this branch:
https://github.com/dormando/memcached/tree/double_close

so far as I can tell that emulates the behavior in .17...

to build:
./autogen.sh && ./configure && make

run it in screen like you were doing with the other tests, and see if it
prints "ERROR: Double Close [somefd]". If it prints that once and then stops,
I guess that's what .17 was doing... if it spams that message, then something
else may have changed.

I'm mostly convinced something about your OS or build is corrupt, but I
have no idea what it is. The only other thing I can think of is to
instrument .17 a bit more and have you try that (with the connection code
laid out the old way, but with a conn_closed flag to detect a double close
attempt), and see if the old .17 still did it.

On Tue, 6 May 2014, notificati...@commando.io wrote:

 Changing from 4 threads to 1 seems to have resolved the problem. No
 timeouts since. Should I set to 2 threads and wait and see how things go?

 On Tuesday, May 6, 2014 12:07:08 AM UTC-7, Dormando wrote:

  and how'd that work out?

  Still no other reports :/ a few thousand more downloads of .19...

  On Sun, 4 May 2014, notifi...@commando.io wrote:

   I'm going to try switching threads from 4 to 1. This host web2 is the
   only one I am seeing it on, but it also is the only host that gets any
   real traffic. Super frustrating.

   On Sunday, May 4, 2014 10:12:08 AM UTC-7, Dormando wrote:

    I'm stumped. (also, your e-mails aren't updating the ticket...).

    It's impossible for a connection to get into the closed state without
    having event_del() and close() called on the socket. A socket slot
    isn't event_add()'ed again until after the state is reset to
    'init_state'.

    There was no code path for event_del to actually fail so far as I
    could see.

    I've e-mailed steven grimm for ideas but either that's not his e-mail
    anymore or he's not going to respond.

    I really don't know. I guess the old code would've just called
    conn_close again by accident... I don't see how the logic changed in
    any significant way in .18. Though again, if it happened with any
    frequency people's curr_conns stat would go negative.

    So... either that always happened and we never noticed, or your
    particular OS is corrupt. There're probably 10,000+ installs of .18+
    now and only one complaint, so I'm a little hesitant to spend a ton of
    time on this until we get more reports.

    You should downgrade to .17.

    On Sun, 4 May 2014, notifi...@commando.io wrote:

     Damn it, got a network timeout. CPU 3 is at 100% cpu from memcached.
     Here is the result of stats to verify the new versions of memcached
     and libevent:

     STAT version 1.4.19
     STAT libevent 2.0.18-stable

     On Saturday, May 3, 2014 11:55:31 PM UTC-7, notifi...@commando.io wrote:

      Just upgraded all 5 web-servers to memcached 1.4.19 with libevent
      2.0.18. Will advise if I see memcached timeouts. Should be good
      though.

     Thanks so much for all the help and patience. Really appreciated.

     On Friday, May 2, 2014 10:20:26 PM UTC-7, memc...@googlecode.com wrote:

      Updates:
      Status: Invalid

      Comment #20 on issue 363 by dorma...@rydia.net: MemcachePool::get():
      Server 127.0.0.1 (tcp 11211, udp 0) failed with: Network timeout
      http://code.google.com/p/memcached/issues/detail?id=363

      Any repeat crashes? I'm going to close this. It looks like remi
      shipped .19. Reopen or open a new one if it hangs in the same way
      somehow...

      Well, .19 won't be printing anything, and it won't hang, but if it's
      actually our bug and not libevent it would end up spinning CPU. Keep
      an eye out I guess.

Re: Multi-get implementation in binary protocol

2014-05-07 Thread dormando
 Hello,

 I'm currently evaluating the performance of the memcached server using
 several client workloads, and I have a question about the multi-get
 implementation in the binary protocol.
 As I understand it, in the ascii protocol we can send multiple keys in a
 single request packet to implement a multi-get.

 But in the binary protocol, it seems that we have to send multiple request
 packets (one request packet per key) to implement a multi-get.
 Even if we send a series of getQ requests followed by a get for the last
 key, we only save response packets on cache misses.
 If I understand correctly, a multi-get in the binary protocol cannot reduce
 the number of request packets, and it also cannot reduce the number of
 response packets when the hit ratio is very high (e.g., 99% get hits).

 If the performance bottleneck is on the network side rather than on the
 CPU, I think reducing the number of packets is still very important, but I
 don't understand why the binary protocol doesn't address this.
 Am I missing something?

you're right, it sucks. I was never happy with it, but I haven't had time to
add adjustments to the protocol for this. To note, .19 removed some
inefficiencies in the protocol, and most network cards are fast enough for
most situations, even if it's one packet per response (and large enough
responses split into multiple packets anyway).

The reason it was done this way is latency and streaming of responses:

- With an ascii multiget, I can send 10,000 keys, but then I'm forced to wait
for the server to look up all of the keys before it sends any responses. That
delay isn't typically very large, but there is some latency to it.

- With a binary multiget, responses are sent back more or less as the server
pulls the requests off the network. This reduces the latency until you start
seeing responses, regardless of how large your multiget is. It is useful if
you have a client that can start processing responses in a streaming fashion,
and it can reduce the total time to render your response, since you keep the
CPU busy unmarshalling responses instead of sleeping.
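
In other words, the ascii side of that comparison is literally one request
line and one terminator for the whole batch. A minimal sketch (host, port,
and key names are placeholders):

import socket

s = socket.create_connection(("127.0.0.1", 11211))
s.sendall(b"get key1 key2 key3\r\n")   # one line carries every key

# The server answers only after it has looked up all of the keys; a single
# END\r\n terminates the whole batch of responses.
buf = b""
while not buf.endswith(b"END\r\n"):
    buf += s.recv(65536)
print(buf)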

However, it should have some tunables: one where it at least does one write
per complete packet (TCP_CORK'ed, or similar), and one where it buffers up to
some size. In my tests I can get ascii multiget up to 16.2 million keys/sec,
but (with the fixes in .19) binprot caps out at 4.6m and spends all of its
time calling sendmsg(). Most people need far, far less than that, so binprot
as-is should be okay though.
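
To sketch the first of those tunables: on Linux the writer can cork the
socket, queue several small writes, then uncork so they leave in as few
packets as possible. This is only an illustration of the socket option over
a throwaway loopback pair, not memcached's actual write path:

import socket

# Loopback pair so the example is self-contained.
lst = socket.socket()
lst.bind(("127.0.0.1", 0))
lst.listen(1)
cli = socket.create_connection(lst.getsockname())
srv, _ = lst.accept()

srv.setsockopt(socket.IPPROTO_TCP, socket.TCP_CORK, 1)   # hold small writes
for i in range(10):
    srv.sendall(b"response %d\r\n" % i)                  # queued by the kernel
srv.setsockopt(socket.IPPROTO_TCP, socket.TCP_CORK, 0)   # flush the batch

print(cli.recv(65536))   # the ten small writes typically arrive together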

The code isn't too friendly to this and there're other higher priority
things I'd like to get done sooner. The relatively small number of people
who do 500,000+ requests per second in binprot (they're almost always
ascii at that scale) is the other reason.



Re: Issue 363 in memcached: MemcachePool::get(): Server 127.0.0.1 (tcp 11211, udp 0) failed with: Network timeout

2014-05-07 Thread notifications
Bumped up to 2 threads and so far no timeout errors. I'm going to let it 
run for a few more days, then revert back to 4 threads and see if the 
timeout errors come up again. That will tell us whether the problem lies in 
spawning more than 2 threads.

On Wednesday, May 7, 2014 5:19:13 PM UTC-7, Dormando wrote:

 Hey,

 try this branch:
 https://github.com/dormando/memcached/tree/double_close

 so far as I can tell that emulates the behavior in .17...

 to build:
 ./autogen.sh && ./configure && make

 run it in screen like you were doing with the other tests, and see if it
 prints "ERROR: Double Close [somefd]". If it prints that once and then
 stops, I guess that's what .17 was doing... if it spams that message, then
 something else may have changed.

 I'm mostly convinced something about your OS or build is corrupt, but I
 have no idea what it is. The only other thing I can think of is to
 instrument .17 a bit more and have you try that (with the connection code
 laid out the old way, but with a conn_closed flag to detect a double close
 attempt), and see if the old .17 still did it.

Re: Issue 363 in memcached: MemcachePool::get(): Server 127.0.0.1 (tcp 11211, udp 0) failed with: Network timeout

2014-05-07 Thread dormando
That doesn't really tell us anything about the nature of the problem,
though. With 2 threads it might still happen, but it is a lot less likely.

On Wed, 7 May 2014, notificati...@commando.io wrote:

 Bumped up to 2 threads and so far no timeout errors. I'm going to let it
 run for a few more days, then revert back to 4 threads and see if the
 timeout errors come up again. That will tell us whether the problem lies
 in spawning more than 2 threads.
