Re: Issue 106 in memcached: binary protocol parsing can cause memcached server lockup
Updates: Status: Fixed Comment #12 on issue 106 by dorma...@rydia.net: binary protocol parsing can cause memcached server lockup http://code.google.com/p/memcached/issues/detail?id=106 think this was merged up. closing.
Re: Issue 106 in memcached: binary protocol parsing can cause memcached server lockup
Comment #10 on issue 106 by dorma...@rydia.net: binary protocol parsing can cause memcached server lockup http://code.google.com/p/memcached/issues/detail?id=106 The incoming packet is consumed in try_read_udp(), so if you get into a conn_close state after reading one packet, calling an empty recvfrom will consume the *next*, unseen packet. Unless there is some case you're talking about which causes the next packet on the wire to be an error? I did push my tree though, so take a look at what I ended up with: https://github.com/dormando/memcached/commit/954f6dddc41c80d2e625bce63744df5556a425bb ... I can't break it with your tests anymore, and it doesn't hang before responding. If you look at how TCP connections are intialized, you'll see they're set to conn_new_cmd, but the UDP listeners are initialized to conn_read. The latter makes sense in the flow. Either way, can you try the patch and see if you can still break it? :)
Re: Issue 106 in memcached: binary protocol parsing can cause memcached server lockup
Comment #11 on issue 106 by pi3or...@gmail.com: binary protocol parsing can cause memcached server lockup http://code.google.com/p/memcached/issues/detail?id=106 I think you are right. In my first patch (Comment 1), the recvfrom() call is critical because in that situation, if a UDP socket is in conn_closing state, it will never be recovered, all incoming packets should be dropped. If we recover its state to conn_read, there's no need for the recvfrom() again.
Re: Issue 106 in memcached: binary protocol parsing can cause memcached server lockup
Updates: Status: Started Owner: trond.no...@gmail.com Comment #8 on issue 106 by dorma...@rydia.net: binary protocol parsing can cause memcached server lockup http://code.google.com/p/memcached/issues/detail?id=106 I just pulled a patch similar to this into my 1.4.7 tree. This was confusing me a bit... the extra recvfrom seemed pointless and it'd hang while running. The test from #158 would spin out of control without the recvfrom. But the initial state of a UDP connection is actually conn_read, not conn_new_cmd. Using the correct state makes it find the next UDP packet and throw it away properly. Still iffy on the rbytes bit, but so long as multiple command parsing works (which I don't think it does now anyway), that should be correct. Punting to trond in case he wants to port this properly to 1.6.0, or take a look at least. I'll be pushing my for_147 branch to github at some point.
Re: Issue 106 in memcached: binary protocol parsing can cause memcached server lockup
Comment #6 on issue 106 by dorma...@rydia.net: binary protocol parsing can cause memcached server lockup http://code.google.com/p/memcached/issues/detail?id=106 Can you issue a patch against 1.6.0-beta1? Or were you talking about 1.4.5, not 1.3.5?
Re: Issue 106 in memcached: binary protocol parsing can cause memcached server lockup
Comment #7 on issue 106 by pi3or...@gmail.com: binary protocol parsing can cause memcached server lockup http://code.google.com/p/memcached/issues/detail?id=106 Ok. 1.6.0-beta1 still suffer from this problem, the principle is similar. Following patch solve it (I thought). --- ./daemon/memcached.ori.c2011-06-28 09:43:45.12072 +0800 +++ ./daemon/memcached.c2011-06-28 09:57:18.39071 +0800 @@ -689,13 +689,19 @@ c-sasl_conn = NULL; } -c-engine_storage = NULL; -c-tap_iterator = NULL; -c-thread = NULL; -assert(c-next == NULL); -c-ascii_cmd = NULL; -c-sfd = INVALID_SOCKET; -c-tap_nack_mode = false; +if (IS_UDP(c-transport)) { +recvfrom(c-sfd, NULL, 0, 0, NULL, NULL); +conn_set_state(c, conn_new_cmd); +} else { +c-sfd = INVALID_SOCKET; +c-engine_storage = NULL; +c-tap_iterator = NULL; +c-thread = NULL; +assert(c-next == NULL); +c-ascii_cmd = NULL; +c-tap_nack_mode = false; +} + } void conn_close(conn *c) { @@ -4826,7 +4832,7 @@ res -= 8; memmove(c-rbuf, c-rbuf + 8, res); -c-rbytes += res; +c-rbytes = res; c-rcurr = c-rbuf; return READ_DATA_RECEIVED; }
Re: Issue 106 in memcached: binary protocol parsing can cause memcached server lockup
Comment #4 on issue 106 by airat.ha...@gmail.com: binary protocol parsing can cause memcached server lockup http://code.google.com/p/memcached/issues/detail?id=106 may be this specific issue is solved by the provided patch (haven't tried), but in general the problem remains. Even after applying the path suggested the following packet sent through udp hangs memcached udp: \x00\x00\x00\x00\x00\x01\x00\x00\x80\x05\x00\x01\x08\x00\x00\x00\x00\x00\x00\x0a\xef\xbe\xad\xde\x00\x00\x00\x00\x00\x00\x00\x00 (in fact any subsequent packet causes 100% usage) Is is a storage_command (in terms of testapp.c), but the cmd (\x05, increment) is of type 'arithmetic_command' (in terms of testapp.c). About this packet - all fields are 'correct' except for the type of cmd. In the attachment - script that reproduces the livelock. Attachments: udp_livelock.py 931 bytes
Re: Issue 106 in memcached: binary protocol parsing can cause memcached server lockup
Updates: Owner: eric.d.lambert Comment #3 on issue 106 by eric.d.lambert: binary protocol parsing can cause memcached server lockup http://code.google.com/p/memcached/issues/detail?id=106 this looks similar to the problem with issue 158, so i'll take a look at this
Re: Issue 106 in memcached: binary protocol parsing can cause memcached server lockup
Comment #1 on issue 106 by pi3orama: binary protocol parsing can cause memcached server lockup http://code.google.com/p/memcached/issues/detail?id=106 The root cause of this issue is try_read_udp() never reset c-rbytes. The processing of conn_nread minus c-rbytes by c-rlbytes, so if extra bytes exists, they will be 'left' in c-rbytes. However, each time when recvfrom()s UDP socket, the whole read buffer is overwritten, so 'c-rbytes += res' sets c-rbytes incorrectly. After the binary 'GET' command fail, the state of connection is transferred to 'conn_new_cmd' in write_bin_error(), then 'conn_parse_cmd' in reset_cmd_handler(). When c-rbytes less than sizeof(c-binary_header) (24 bytes), the server believes that there's no enough bytes to hold binary header so the processing continues normally. However, in the processing of attached test program, c-rbytes increases by 1 each time a request is completed. Therefore, after 24 requests, server (wrongly) thinks there's an extra binary header, and it finally gets wrong magic in try_read_command(). The dead lock is caused by the error processing code. If the UDP connection is in conn_closing state, it never get closed but only calls conn_cleanup(). Therefore, when there are incoming packets in the udp socket, they will never be consumed, so epoll_wait() (in libevent) triggers input events over and over again for a closing connection. To set c-rbytes correctly: --- ./memcached.ori.c 2010-08-23 22:59:12.0 +0800 +++ ./memcached.c 2010-08-23 22:59:17.0 +0800 @@ -3223,7 +3223,7 @@ res -= 8; memmove(c-rbuf, c-rbuf + 8, res); -c-rbytes += res; +c-rbytes = res; c-rcurr = c-rbuf; return READ_DATA_RECEIVED; } To consume incoming packets when cleanup: --- ./memcached.ori.c 2010-08-23 22:59:12.0 +0800 +++ ./memcached.c 2010-08-23 23:27:38.0 +0800 @@ -471,6 +471,9 @@ sasl_dispose(c-sasl_conn); c-sasl_conn = NULL; } + +/* consume incoming packets */ +recvfrom(c-sfd, NULL, 0, 0, NULL, NULL); } /*