Re: A barrage of unexplained timeouts
Eric Wong <normalper...@yhbt.net> said:
> Jimmy Soho <jimmy.s...@gmail.com> wrote:
>> fwiw, we experience this too, irregularly: about once every week or
>> two, for a few seconds, and sometimes up to a minute or two. for no
>> apparent reason indeed, and at random times.
>
> Wait, based on what you say below, is this happening to unicorn or nginx?
>
>> my suspicion so far has always been that this is due to bad (wifi)
>> networks of clients. what happens when a request is handled
>> successfully by rails, but for whatever reason the last bit of the
>> response can't be delivered because the receiving end is not
>> consuming it in a timely fashion?
>
> So nginx is complaining about disconnected clients, not unicorn, correct?
>
>> If i'm not mistaken, rails logs a 200 success message before the
>> response is actually sent back?
>
> Correct. rails/unicorn won't know the client disconnected until nginx
> breaks the connection to unicorn.
>
>> Is a response sent back by rails / unicorn fully buffered by nginx?
>> Or is a unicorn worker blocked until the response is confirmed to
>> have been successfully received by the client?
>
> nginx fully buffers both requests and responses by default. Which
> leads us to...
>
>> Aside: how do you protect your unicorn workers against a barrage of
>> client requests that eat the response veery slwly?
>
> That's the whole point of using nginx with unicorn: nginx shields
> unicorn from slow clients.

Is it possible that the behavior could be related to nginx? Nginx
hanging onto the connection too long from the worker, so that the
worker never makes it back into the pool and eventually gets killed
for not checking in?
Re: A barrage of unexplained timeouts
n...@auger.net wrote:
> Eric Wong <normalper...@yhbt.net> said:
>> Jimmy Soho <jimmy.s...@gmail.com> wrote:
>>> Is a response sent back by rails / unicorn fully buffered by nginx?
>>> Or is a unicorn worker blocked until the response is confirmed to
>>> have been successfully received by the client?
>>
>> nginx fully buffers both requests and responses by default. Which
>> leads us to...
>>
>>> Aside: how do you protect your unicorn workers against a barrage of
>>> client requests that eat the response veery slwly?
>>
>> That's the whole point of using nginx with unicorn: nginx shields
>> unicorn from slow clients.
>
> Is it possible that the behavior could be related to nginx? Nginx
> hanging onto the connection too long from the worker, so that the
> worker never makes it back into the pool and eventually gets killed
> for not checking in?

Unlikely. That would be a serious bug in nginx; it has done this
buffering since its earliest versions.

Which version of nginx are you running? Can you be sure "proxy_buffering
off" is _not_ set anywhere?
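If you want to check both at once, something like this works (assuming
your configs live under /etc/nginx; adjust the path if not):

    # print the nginx version (and its configure flags)
    nginx -V

    # look for any proxy_buffering override anywhere in the config tree
    grep -rn proxy_buffering /etc/nginx/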
Re: A barrage of unexplained timeouts
Jimmy Soho <jimmy.s...@gmail.com> wrote:
> fwiw, we experience this too, irregularly: about once every week or
> two, for a few seconds, and sometimes up to a minute or two. for no
> apparent reason indeed, and at random times.

Wait, based on what you say below, is this happening to unicorn or nginx?

> my suspicion so far has always been that this is due to bad (wifi)
> networks of clients. what happens when a request is handled
> successfully by rails, but for whatever reason the last bit of the
> response can't be delivered because the receiving end is not
> consuming it in a timely fashion?

So nginx is complaining about disconnected clients, not unicorn, correct?

> If i'm not mistaken, rails logs a 200 success message before the
> response is actually sent back?

Correct. rails/unicorn won't know the client disconnected until nginx
breaks the connection to unicorn.

> Is a response sent back by rails / unicorn fully buffered by nginx?
> Or is a unicorn worker blocked until the response is confirmed to
> have been successfully received by the client?

nginx fully buffers both requests and responses by default. Which leads
us to...

> Aside: how do you protect your unicorn workers against a barrage of
> client requests that eat the response veery slwly?

That's the whole point of using nginx with unicorn: nginx shields
unicorn from slow clients.
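For reference, the usual pairing looks roughly like this, a sketch in
the spirit of the example nginx.conf shipped with unicorn (the upstream
name and listen address here are illustrative):

    upstream unicorn_app {
        # matches the address unicorn listens on
        server 127.0.0.1:8080 fail_timeout=0;
    }

    server {
        listen 80;

        location / {
            # buffering is on by default: nginx absorbs the slow client
            # and only talks to unicorn at loopback/LAN speed
            proxy_buffering on;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header Host $http_host;
            proxy_pass http://unicorn_app;
        }
    }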
A barrage of unexplained timeouts
We've been running unicorn-3.6.2 on REE 1.8.7 2011.12 in production for
quite some time, and we use monit to monitor each unicorn worker.
Occasionally I'll get a notification that a worker has timed out and
been re-spawned. In all these cases, when I look at the rails logs, I
can see the last request that the worker handled, and they all appear
to have completed successfully from the client's perspective (rails and
nginx respond with 200), yet the unicorn log shows the worker was
killed due to timeout. This has always been relatively rare, and I
thought it was a non-problem. Until today.

Today, for about a 7 minute period, our workers would continually
report as having timed out and would be killed by the master. After
re-spawning, the workers would serve a handful of requests and then
eventually be killed again. During this time, our servers (web, PG DB,
and redis) were not under load and IO was normal. After the last monit
notification at 8:30, everything went back to normal.

I understand why unicorns would time out if they were waiting (120
secs) on IO, but there aren't any orphaned requests in the rails log:
for each request line, there's a corresponding completion line. No
long-running queries to blame on PG, either.

I know we're probably due for an upgrade, but I'm hoping to get to the
bottom of these unexplained timeouts. A rough sketch of our unicorn
setup follows my signature.

Thanks for your help!
-Nick
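(Sketch only; the listen address, worker count, and paths below are
placeholders, not our real values:)

    # config/unicorn.rb -- unicorn 3.6.2 under REE 1.8.7
    worker_processes 8            # illustrative count
    listen "127.0.0.1:8080"       # nginx proxies to this address
    timeout 120                   # master SIGKILLs any worker that
                                  # hasn't checked in for 120s, then
                                  # re-spawns it
    preload_app true
    pid "/var/run/unicorn.pid"    # master pidfile (illustrative path)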
Re: A barrage of unexplained timeouts
Eric Wong <normalper...@yhbt.net> said:
> Can you take a look at the nginx error and access logs? From what
> you're saying, there's a chance a request never even got to the Rails
> layer. However, nginx should be logging failed/long-running requests
> to unicorn.

The nginx access logs show frequent 499 responses. The error logs are
filled with:

    connect() failed (110: Connection timed out) while connecting to upstream
    upstream timed out (110: Connection timed out) while reading response header from upstream

What specific pieces of information should I be looking for in the logs?

Thanks again,
-Nick
Re: A barrage of unexplained timeouts
n...@auger.net wrote:
> The nginx access logs show frequent 499 responses. The error logs are
> filled with:
>
>     connect() failed (110: Connection timed out) while connecting to upstream
>     upstream timed out (110: Connection timed out) while reading response header from upstream
>
> What specific pieces of information should I be looking for in the logs?

Do you have any other requests in your logs which could be taking a
long time and hogging workers, but not long enough to trigger the
unicorn kill timeout? (Enable $request_time in the nginx access logs
if you haven't already; see the log_format sketch below.)

Is this with Unix or TCP sockets? If it's over a LAN, maybe there's
still a bad switch/port/cable somewhere (that happens often to me).
With Unix sockets, I don't recall encountering recent problems under
Linux. Which OS are you running? I hit several TCP bugs in the early
3.7 Linux kernels, too...
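Something along these lines in the http block works (the format name
and log path are illustrative):

    log_format timed '$remote_addr [$time_local] "$request" $status '
                     'rt=$request_time urt=$upstream_response_time';

    access_log /var/log/nginx/access.log timed;

rt is the total time nginx spent on the request; urt is how long
unicorn took. A large rt with a small urt points at a slow client, not
a stuck worker.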
Re: A barrage of unexplained timeouts
n...@auger.net wrote:
> Eric Wong <normalper...@yhbt.net> said:
>> Do you have any other requests in your logs which could be taking a
>> long time and hogging workers, but not long enough to trigger the
>> unicorn kill timeout?
>
> I don't *think* so. Most requests finish in under 300ms. We do have
> some more intensive code-paths, but they're administrative and called
> much less frequently. Most of those pages complete in under 3 seconds.
> For requests that made it to rails logging, the LAST processed request
> before the worker timed out all completed very quickly (and no real
> pattern in terms of which page may be triggering it).

This is really strange. This was only really bad for a 7s period? Has
it happened again?

Anything else going on with the system at that time? Swapping,
particularly... And if you're inside a VM, maybe your neighbors were
hogging things. Large PUT/POST requests which require filesystem I/O
are particularly sensitive to this.

>> Is this with Unix or TCP sockets? If it's over a LAN, maybe there's
>> still a bad switch/port/cable somewhere (that happens often to me).
>
> TCP sockets, with nginx and unicorn running on the same box.

OK, that probably rules out a bunch of problems. Just to be thorough,
anything interesting in dmesg or syslogs? (Something like the commands
below is usually enough to check.)

>> With Unix sockets, I don't recall encountering recent problems under
>> Linux. Which OS are you running?
>
> Stock RHEL 5, kernel 2.6.18.

RHEL 5.0 or 5.x? I can't remember /that/ far back to 5.0 (I don't think
I even tried it until 5.2), but I don't recall anything being obviously
broken in those...
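(Standard places to look; RHEL logs kernel and daemon messages to
/var/log/messages by default:)

    # kernel ring buffer: NIC resets, OOM kills, disk/controller errors
    dmesg | tail -100

    # system log around the incident window
    grep -iE 'oom|segfault|error|timed out' /var/log/messages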
Re: A barrage of unexplained timeouts
n...@auger.net wrote:
> Eric Wong <normalper...@yhbt.net> said:
>> This is really strange. This was only really bad for a 7s period?
>
> It was a 7 minute period. All of the workers would become busy and
> exceed their 120s timeout. Master would kill and re-spawn them, they'd
> start to respond to a handful of requests (anywhere from 5-50), after
> which they'd become busy again and get force-killed by master. This
> pattern happened 3 times over a 7 minute period.
>
>> Has it happened again?
>
> No.
>
>> Anything else going on with the system at that time? Swapping,
>> particularly...
>
> No swap activity or high load. Our munin graphs indicate a peak of
> web/app server disk latency around that time, although our graphs
> show many other similar peaks without incident. I'm stumped :(

I was afraid you'd say that :(.

Do you have any background threads running that could be hanging the
workers? This is Ruby 1.8, after all, so a single blocking call can
hang the entire process (sketch below). AFAIK, some monitoring software
runs a background thread in the unicorn worker, and maybe the OpenSSL
extension doesn't cope as well with network problems under Ruby 1.8.

Otherwise, I don't know...
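Here's a sketch of the failure mode (illustrative code, with a
deliberately bogus hostname):

    # Under MRI 1.8 all threads are green threads inside one OS
    # thread, so a call that blocks in C blocks every thread.
    require 'socket'

    heartbeat = Thread.new do
      loop { $stderr.puts "tick #{Time.now}"; sleep 1 }
    end

    # The libc resolver behind TCPSocket.new runs as one long C call;
    # while it waits (or times out), the heartbeat above stops
    # ticking, the same way a worker would stop checking in with the
    # unicorn master.
    begin
      TCPSocket.new('unreachable.invalid', 80)
    rescue SocketError, SystemCallError
    end

    heartbeat.kill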
Re: A barrage of unexplained timeouts
Eric Wong <normalper...@yhbt.net> said:
> n...@auger.net wrote:
>> I'm stumped :(
>
> I was afraid you'd say that :(.
>
> Do you have any background threads running that could be hanging the
> workers? This is Ruby 1.8, after all, so a single blocking call can
> hang the entire process. AFAIK, some monitoring software runs a
> background thread in the unicorn worker, and maybe the OpenSSL
> extension doesn't cope as well with network problems under Ruby 1.8.

We don't explicitly create any threads in our rails code. We do
communicate with backgroundrb worker processes, although none of the
strangeness today involved any routes that would hit backgroundrb
workers.

Is there any instrumentation I could add that might help debugging in
the future? ($request_time and $upstream_response_time are now in my
nginx logs.) We have noticed these unexplainable timeouts before, but
typically for a single worker. If there's some debugging that could be
added, I might be able to track it down during these one-off events.

Thoughts? I absolutely appreciate all your help!

-Nick
Re: A barrage of unexplained timeouts
n...@auger.net wrote:
> Eric Wong <normalper...@yhbt.net> said:
>> I was afraid you'd say that :(.

Actually, another potential issue is DNS lookups timing out. But they
shouldn't take *that* long...

>> Do you have any background threads running that could be hanging the
>> workers? This is Ruby 1.8, after all, so a single blocking call can
>> hang the entire process. AFAIK, some monitoring software runs a
>> background thread in the unicorn worker, and maybe the OpenSSL
>> extension doesn't cope as well with network problems under Ruby 1.8.
>
> We don't explicitly create any threads in our rails code. We do
> communicate with backgroundrb worker processes, although none of the
> strangeness today involved any routes that would hit backgroundrb
> workers.

I proactively audit every piece of code (including external
libraries/gems) loaded by an app for potentially blocking calls: hits
to the filesystem, blocking socket calls made without timeouts. I use
strace to help me find those sometimes...

> Is there any instrumentation I could add that might help debugging in
> the future? ($request_time and $upstream_response_time are now in my
> nginx logs.) We have noticed these unexplainable timeouts before, but
> typically for a single worker. If there's some debugging that could
> be added, I might be able to track it down during these one-off
> events.

As an experiment, can you replay traffic from a few minutes leading up
to and including that 7 minute period in a test setup with only one
straced worker? Run:

    strace -T -f -o $FILE -p $PID_OF_WORKER

and see if there are any unexpected/surprising dependencies: connect()
to unrecognized addresses, open() on networked filesystems, fcntl
locks, etc. You can also play around with some other strace options
(-v / -s SIZE / -e filters); a couple of variants are sketched below.

Maybe you'll find something there.
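(Filter variants of the same command; the output paths are
illustrative:)

    # only network-related syscalls, with per-call timing
    strace -T -f -e trace=network -o /tmp/worker-net.trace -p $PID_OF_WORKER

    # only filesystem access, printing longer strings
    strace -T -f -e trace=file -s 256 -o /tmp/worker-fs.trace -p $PID_OF_WORKER

In the -T output, any single call that took on the order of seconds
(the number in angle brackets at the end of each line) is a good place
to start.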