Weird Unicorn Timeout Issues (Hibernation problem?)

Tony Devlin Mon, 04 Aug 2014 11:13:39 -0700

Current Setup:  (the problem existed before updating nginx, unicorn and
rails; but in an attempt to solve the problem I updated them).


CentOS 6.3 (x64)
Ruby 2.1.2p95
Rails 4.0.0
Nginx 1.6.0
Unicorn 4.8.3

====

We have an issue where if a site is not accessed for around (average) 30
minutes the next query will timeout, and it will timeout on all the workers
opened.  IE:   If I have two workers, then both of those workers will
timeout, even if the first one,
after timeout, starts to work.  As soon as the second worker is called upon
it will timeout.  Then everything runs perfectly good and great until the
site is not accessed for 30 minutes or more.  Then the timeout issue starts
all over again.

This happens on our dev machine on two different applications (only apps on
the server) and on our production machine (also running two apps).

I can't figure out exactly what is going on.  I tried strace the master
unicorn process, but I get no relevant information.   One of the
applications is also tied to a New Relic account and it gives no
information either.

Nginx logs show this: (sorry, obfuscated certain details due to the
internal enterprise nature)

2014/08/04 11:31:21 [error] 6353#0: *339 upstream prematurely closed
connection while reading response header from upstream, client: *.*.*.*,
server: ****.org, request: "GET /assets/application.css HTTP/1.1",
upstream:
"http://unix:/var/www/sites/****/shared/sockets/.unicorn.sock.0:/assets/application.css";,
host: "****.org", referrer: "http://****.org/outages";

2014/08/04 11:31:53 [error] 31058#0: *1 upstream prematurely closed
connection while reading response header from upstream, client: *.*.*.*,
server: ****.org, request: "GET /outages HTTP/1.1", upstream:
"http://unix:/var/www/sites/****/shared/sockets/.unicorn.sock.0:/outages";,
host: "****.org", referrer: "http://****.org/outages";

Unicorn stderr shows this: (sorry, obfuscated certain details due to the
internal enterprise nature)

E, [2014-08-04T11:31:21.620379 #11991] ERROR -- : worker=1 PID:29701
timeout (21s > 20s), killing
E, [2014-08-04T11:31:21.630521 #11991] ERROR -- : reaped #<Process::Status:
pid 29701 SIGKILL (signal 9)> worker=1
I, [2014-08-04T11:31:21.639881 #30521]  INFO -- : worker=1 ready

E, [2014-08-04T11:31:53.666300 #11991] ERROR -- : worker=0 PID:29705
timeout (21s > 20s), killing
E, [2014-08-04T11:31:53.676984 #11991] ERROR -- : reaped #<Process::Status:
pid 29705 SIGKILL (signal 9)> worker=0
I, [2014-08-04T11:31:53.687157 #30528]  INFO -- : worker=0 ready

Its interesting that the timeout is occurring on different stages, for
example if you look at the two nginx errors, one timeout at
/assets/application.css and the other at the root /outages/, I've had it be
small gifs as well.   I'd also like to point at
that I have changed the timeout from 30s to 60s to 300s to 20s and it
occurs at every interval.  So it is not a "true" timeout issue.    Also the
two apps on the server are completely different, with only Ruby and Rails
being the same process.  The
gems are different in both, the database is different in both, in fact one
of them has no assets, no css and no ui files at all.  The only
similarities are Ruby, Rails, Nginx and Unicorn.

It's like there is some sort of hibernation problem with the unicorn
workers.  The hibernation thought comes from these debug messages (I
dropped to recording debug messages trying to locate the problem).  I get
these every now and then:

D, [2014-08-03T22:14:49.776538 #11991] DEBUG -- : waiting 11.0s after
suspend/hibernation

But then they cease to occur for days, example is that message was the last
time that occurred (non in the logs for today).

I've been trying for the past 4 days to solve this, make a large number of
changes to no avail.  Any help would be GREATLY appreciated.

Weird Unicorn Timeout Issues (Hibernation problem?)

Reply via email to