Re: A barrage of unexplained timeouts

2013-08-22 Thread nick


Eric Wong normalper...@yhbt.net said:

> Jimmy Soho jimmy.s...@gmail.com wrote:
> > fwiw, we experience this too irregularly, about once a week or 2 weeks
> > for a few seconds, and sometimes up to a minute or 2 minutes long. for
> > no apparent reasons indeed. and at random times.
>
> Wait, based on what you say below, is this happening to unicorn or
> nginx?
>
> > my suspicion so far has always been that this is due to bad (wifi)
> > networks of clients. what happens when a request is handled
> > successfully by rails, but for whatever reason the last bit of the
> > response can't be delivered because the receiving end is not consuming
> > in a timely fashion?
>
> So nginx is complaining about disconnected clients, not unicorn,
> correct?
>
> > If i'm not mistaken i think that rails logs a 200 success message
> > before the response is actually sent back?
>
> Correct, rails/unicorn won't know if the client disconnected until
> nginx breaks the connection to unicorn.
>
> > Is a response sent back by rails / unicorn fully buffered by nginx? Or
> > is a unicorn worker blocked until the response is confirmed to have
> > been successfully received by the client?
>
> nginx fully buffers both requests and responses by default.
> Which leads us to...
>
> > Aside: how do you protect your unicorn workers against a barrage of
> > client requests that eat the response very slowly?
>
> That's the whole point of using nginx with unicorn.  nginx shields
> unicorn from slow clients.

Is it possible that the behavior could be related to nginx?  Nginx hanging onto
the connection from the worker too long, so that the worker never makes it back
into the pool and eventually gets killed for not checking in?



Re: A barrage of unexplained timeouts

2013-08-22 Thread Eric Wong
n...@auger.net wrote:
> Eric Wong normalper...@yhbt.net said:
> > Jimmy Soho jimmy.s...@gmail.com wrote:
> > > Is a response sent back by rails / unicorn fully buffered by nginx? Or
> > > is a unicorn worker blocked until the response is confirmed to have
> > > been successfully received by the client?
> >
> > nginx fully buffers both requests and responses by default.
> > Which leads us to...
> >
> > > Aside: how do you protect your unicorn workers against a barrage of
> > > client requests that eat the response very slowly?
> >
> > That's the whole point of using nginx with unicorn.  nginx shields
> > unicorn from slow clients.
>
> Is it possible that the behavior could be related to nginx?  Nginx
> hanging onto the connection from the worker too long, so that the
> worker never makes it back into the pool and eventually gets killed
> for not checking in?

Unlikely.  That'd be a serious bug in nginx, since nginx was designed to do
this buffering from its earliest versions.  Which version of nginx are you
running?  Can you be sure proxy_buffering off is _not_ set anywhere?
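
(For reference, a minimal sketch of an nginx proxy section with buffering left
at its default; the upstream name, address and paths below are assumptions, not
taken from this thread.  A quick way to check for overrides is:
grep -ri proxy_buffering /etc/nginx/)

    # sketch only -- names and addresses are placeholders
    upstream unicorn_app {
        server 127.0.0.1:8080 fail_timeout=0;   # nick later says TCP on the same box
    }

    location / {
        proxy_pass http://unicorn_app;
        # proxy_buffering defaults to "on"; leaving it on is what lets nginx
        # absorb slow clients instead of tying up a unicorn worker
        proxy_buffering on;
    }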


Re: A barrage of unexplained timeouts

2013-08-21 Thread Eric Wong
Jimmy Soho jimmy.s...@gmail.com wrote:
> fwiw, we experience this too irregularly, about once a week or 2 weeks
> for a few seconds, and sometimes up to a minute or 2 minutes long. for
> no apparent reasons indeed. and at random times.

Wait, based on what you say below, is this happening to unicorn or
nginx?

> my suspicion so far has always been that this is due to bad (wifi)
> networks of clients. what happens when a request is handled
> successfully by rails, but for whatever reason the last bit of the
> response can't be delivered because the receiving end is not consuming
> in a timely fashion?

So nginx is complaining about disconnected clients, not unicorn,
correct?

> If i'm not mistaken i think that rails logs a 200 success message
> before the response is actually sent back?

Correct, rails/unicorn won't know if the client disconnected until
nginx breaks the connection to unicorn.

> Is a response sent back by rails / unicorn fully buffered by nginx? Or
> is a unicorn worker blocked until the response is confirmed to have
> been successfully received by the client?

nginx fully buffers both requests and responses by default.
Which leads us to...

> Aside: how do you protect your unicorn workers against a barrage of
> client requests that eat the response very slowly?

That's the whole point of using nginx with unicorn.  nginx shields
unicorn from slow clients.


A barrage of unexplained timeouts

2013-08-20 Thread nick
We've been running unicorn-3.6.2 on REE 1.8.7 2011.12 in production for quite 
some time and we use monit to monitor each unicorn worker.  Occasionally, I'll 
get a notification that a worker has timed out and been re-spawned.  In all 
these cases, when I look at the rails logs, I can see the last request that the 
worker handled, and they all appear to have completed successfully from the 
client's perspective (rails and nginx respond with 200), but the unicorn log 
shows that the worker was killed due to timeout.  This has always been 
relatively rare and I thought it was a non-problem.

Until today.

Today, for about a 7 minute period, our workers would continually report as 
having timed out and would be killed by the master.  After re-spawning, the 
workers would serve a handful of requests and then eventually be killed again.

During this time, our servers (Web, PG DB, and redis) were not under load and 
IO was normal.  After the last monit notification at 8:30, everything went back 
to normal.  I understand why unicorns would time out if they were waiting (120 
secs) on IO, but there aren't any orphaned requests in the rails log.  For each 
request line, there's a corresponding completion line.  No long-running queries 
to blame on PG, either.
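
(For context, a minimal sketch of the unicorn config this setup implies; only
the 120s timeout comes from the thread, the worker count and listen address
are assumptions:)

    # config/unicorn.rb -- sketch, not the poster's actual file
    worker_processes 8             # actual count not stated in the thread
    listen "127.0.0.1:8080"        # later messages say TCP sockets, nginx on the same box
    timeout 120                    # the kill timeout described above: the master
                                   #   SIGKILLs any worker stuck longer than this
    preload_app true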

I know we're probably due for an upgrade, but I'm hoping to get to the bottom 
of these unexplained timeouts.

Thanks for your help!

-Nick



Re: A barrage of unexplained timeouts

2013-08-20 Thread nick
Eric Wong normalper...@yhbt.net said:
> Can you take a look at the nginx error and access logs?  From what
> you're saying, there's a chance a request never even got to the Rails
> layer.  However, nginx should be logging failed/long-running requests to
> unicorn.

The nginx access logs show frequent 499 responses.  The error logs are filled 
with:

connect() failed (110: Connection timed out) while connecting to upstream
upstream timed out (110: Connection timed out) while reading response header from upstream

What specific pieces of information should I be looking for in the logs?

Thanks Again,

-Nick



Re: A barrage of unexplained timeouts

2013-08-20 Thread Eric Wong
n...@auger.net wrote:
> Eric Wong normalper...@yhbt.net said:
> > Can you take a look at the nginx error and access logs?  From what
> > you're saying, there's a chance a request never even got to the Rails
> > layer.  However, nginx should be logging failed/long-running requests to
> > unicorn.
>
> The nginx access logs show frequent 499 responses.  The error logs are
> filled with:
>
> connect() failed (110: Connection timed out) while connecting to upstream
> upstream timed out (110: Connection timed out) while reading response header from upstream
>
> What specific pieces of information should I be looking for in the logs?

Do you have any other requests in your logs which could be taking
a long time and hogging workers, but not long enough to trigger the
unicorn kill timeout?

(enable $request_time in nginx access logs if you haven't already)
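
(A sketch of one way to do that; the format name and log path are placeholders.
$upstream_response_time is worth adding alongside, since it separates time
spent in unicorn from time spent on the client:)

    log_format timed '$remote_addr - $remote_user [$time_local] "$request" '
                     '$status $body_bytes_sent '
                     'rt=$request_time urt=$upstream_response_time';

    access_log /var/log/nginx/access.log timed;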

Is this with Unix or TCP sockets?  If it's over a LAN, maybe there's
still a bad switch/port/cable somewhere (that happens often to me).

With Unix sockets, I don't recall encountering recent problems under
Linux.  Which OS are you running?

I hit several TCP bugs in the early 3.7 Linux kernels, too...


Re: A barrage of unexplained timeouts

2013-08-20 Thread Eric Wong
n...@auger.net wrote:
> Eric Wong normalper...@yhbt.net said:
> >
> > Do you have any other requests in your logs which could be taking
> > a long time and hogging workers, but not long enough to trigger the
> > unicorn kill timeout?
>
> I don't *think* so.  Most requests finish in under 300ms.  We do have some
> more intensive code-paths, but they're administrative and called much
> less frequently.  Most of these pages complete in under 3 seconds.
>
> For requests that made it to rails logging, the LAST processed request
> before the worker timed out always completed very quickly (and no real
> pattern in terms of which page may be triggering it.)

This is really strange.  This was only really bad for a 7s period?
Has it happened again?  Anything else going on with the system at that
time?  Swapping, particularly...

And if you're inside a VM, maybe your neighbors were hogging things.
Large PUT/POST requests which require filesystem I/O are particularly
sensitive to this.

> > Is this with Unix or TCP sockets?  If it's over a LAN, maybe there's
> > still a bad switch/port/cable somewhere (that happens often to me).
>
> TCP sockets, with nginx and unicorn running on the same box.

OK, that probably rules out a bunch of problems.
Just to be thorough, anything interesting in dmesg or syslogs?
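
(e.g., quick checks along these lines; /var/log/messages is the stock RHEL
syslog location:)

    dmesg | tail -n 50
    grep -i -E 'oom|segfault|eth' /var/log/messages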

> > With Unix sockets, I don't recall encountering recent problems under
> > Linux.  Which OS are you running?
>
> Stock RHEL 5, kernel 2.6.18.

RHEL 5.0 or 5.x?  I can't remember /that/ far back to 5.0 (I don't think
I even tried it until 5.2), but don't recall anything being obviously
broken in those...


Re: A barrage of unexplained timeouts

2013-08-20 Thread Eric Wong
n...@auger.net wrote:
> Eric Wong normalper...@yhbt.net said:
> > This is really strange.  This was only really bad for a 7s period?
>
> It was a 7 minute period.  All of the workers would become busy and
> exceed their 120s timeout.  Master would kill and re-spawn them,
> they'd start to respond to a handful of requests (anywhere from 5-50)
> after which they'd become busy again, and get force killed by
> master.  This pattern happened 3 times over a 7 minute period.
>
> > Has it happened again?
>
> No
>
> > Anything else going on with the system at that time?  Swapping,
> > particularly...
>
> No swap activity or high load.  Our munin graphs indicate a peak of
> web/app server disk latency around that time, although our graphs show
> many other similar peaks, without incident.

I'm stumped :

Do you have any background threads running that could be hanging the
workers?   This is Ruby 1.8, after all, so there's more likely to be
some blocking call hanging the entire process.  AFAIK, some monitoring
software runs a background thread in the unicorn worker, and maybe the
OpenSSL extension doesn't work as well if it encounters network
problems under Ruby 1.8.
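
(As an illustration of the kind of blocking call meant here -- the host and
path are hypothetical -- an HTTP call made without explicit timeouts can sit
in a read indefinitely and push a worker past its 120s kill timeout; bounding
it is cheap:)

    require 'net/http'

    # hypothetical background/monitoring call made from inside a unicorn worker;
    # without these two settings a silent peer can hang this call indefinitely
    http = Net::HTTP.new('metrics.example.com', 80)
    http.open_timeout = 5   # give up on connect() after 5 seconds
    http.read_timeout = 5   # give up on a stalled read after 5 seconds
    http.get('/report')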

Otherwise, I don't know...


Re: A barrage of unexplained timeouts

2013-08-20 Thread nick
Eric Wong normalper...@yhbt.net said:

> n...@auger.net wrote:
> > Eric Wong normalper...@yhbt.net said:
> > > This is really strange.  This was only really bad for a 7s period?
>
> > It was a 7 minute period.  All of the workers would become busy and
> > exceed their 120s timeout.  Master would kill and re-spawn them,
> > they'd start to respond to a handful of requests (anywhere from 5-50)
> > after which they'd become busy again, and get force killed by
> > master.  This pattern happened 3 times over a 7 minute period.
>
> > > Has it happened again?
>
> > No
>
> > > Anything else going on with the system at that time?  Swapping,
> > > particularly...
>
> > No swap activity or high load.  Our munin graphs indicate a peak of
> > web/app server disk latency around that time, although our graphs show
> > many other similar peaks, without incident.
>
> I'm stumped :

I was afraid you'd say that :(.

> Do you have any background threads running that could be hanging the
> workers?   This is Ruby 1.8, after all, so there's more likely to be
> some blocking call hanging the entire process.  AFAIK, some monitoring
> software runs a background thread in the unicorn worker, and maybe the
> OpenSSL extension doesn't work as well if it encounters network
> problems under Ruby 1.8.

We don't explicitly create any threads in our rails code.  We do communicate 
with backgroundrb worker processes, although none of the strangeness today 
involved any routes that would hit backgroundrb workers.

Is there any instrumentation that I could add that might help debugging in the 
future? ($request_time and $upstream_response_time are now in my nginx logs.)  
We have noticed these unexplainable timeouts before, but typically for a 
single worker.  If there's some debugging that could be added I might be able 
to track it down during these one-off events.

Thoughts?

I absolutely appreciate all your help!

-Nick



Re: A barrage of unexplained timeouts

2013-08-20 Thread Eric Wong
n...@auger.net wrote:
> Eric Wong normalper...@yhbt.net said:
> > n...@auger.net wrote:
> > > Eric Wong normalper...@yhbt.net said:
> > I'm stumped :
>
> I was afraid you'd say that :(.

Actually, another potential issue is DNS lookups timing out.  But they
shouldn't take *that* long...

> > Do you have any background threads running that could be hanging the
> > workers?   This is Ruby 1.8, after all, so there's more likely to be
> > some blocking call hanging the entire process.  AFAIK, some monitoring
> > software runs a background thread in the unicorn worker, and maybe the
> > OpenSSL extension doesn't work as well if it encounters network
> > problems under Ruby 1.8.
>
> We don't explicitly create any threads in our rails code.  We do
> communicate with backgroundrb worker processes, although none of the
> strangeness today involved any routes that would hit backgroundrb
> workers.

I proactively audit every piece of code (including external
libraries/gems) loaded by an app for potentially blocking calls (hits to
the filesystem, socket calls w/o timeout/blocking).   I use strace to
help me find that sometimes...

> Is there any instrumentation that I could add that might help
> debugging in the future? ($request_time and $upstream_response_time
> are now in my nginx logs.)  We have noticed these unexplainable
> timeouts before, but typically for a single worker.  If there's some
> debugging that could be added I might be able to track it down during
> these one-off events.

As an experiment, can you replay traffic a few minutes leading up to and
including that 7m period in a test setup with only one straced worker?

Run strace -T -f -o $FILE -p $PID_OF_WORKER and see if there are any
unexpected/surprising dependencies (connect() to unrecognized addresses,
open() to networked filesystems, fcntl locks, etc...).

You can play around with some other strace options (-v/-s SIZE/-e filters)
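
(For example -- the output file names are arbitrary -- narrowing the trace to
network or filesystem syscalls keeps the output readable:)

    # network-related syscalls only, with timestamps and per-call timing
    strace -T -tt -f -s 256 -e trace=network -o net.trace -p $PID_OF_WORKER

    # filesystem access only (open/stat/etc.)
    strace -T -tt -f -e trace=file -o fs.trace -p $PID_OF_WORKER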

Maybe you'll find something there.