I appreciate all your help, Eric and Daniel. I have not solved this yet, but I think I have narrowed it down to a firewall timeout issue. One app uses a database connection to Oracle; the other app uses a third-party API (still on location, but across the network). The ping times to both of these devices are extremely fast; however, 30 minutes of inactivity across the firewall seems to disconnect these connections. At least that appears to be what the strace is telling me. The place in the strace where the timeout occurs is consistent every time. For example, the strace of the app that connects to Oracle shows this:

[pid 7825] write(14, "\0\373\0\0\6\0\0\0\0\0\21iB\376\377\377\377\377\377\377\377\1\0\0\0\0\0\0\0\v\0\0\0\3^Ca\201\0\0\0\0\0\0\376\377\377\377\377\377\377\377\22\0\0\0\376\377\377\377\377\377\377\377\r\0\0\0\376\377\377\377\377\377\377\377\376\377\377\377\377\377\377\377\0\0\0\0d\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\376\377\377\377\377\377\377\377\0\0\0\0\0\0\0\0\376\377\377\377\377\377\377\377\376\377\377\377\377\377\377\377\376\377\377\377\377\377\377\377\0\0\0\0\0\0\0\0\376\377\377\377\377\377\377\377\376\377\377\377\377\377\377\377\22select 1 from dual\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 251) = 251
[pid 7825] read(14, <unfinished ...>
[pid 7827] +++ killed by SIGKILL +++
PANIC: handle_group_exit: 7827 leader 7825
[pid 7846] +++ killed by SIGKILL +++
PANIC: handle_group_exit: 7846 leader 7825
+++ killed by SIGKILL +++

Clearly that is a database query 'select 1 from dual'. It times out trying to read the response. At the same time if I watch the lsof -p <pid>, I see that the database connection drops after 30 minutes.

I'll update this thread again once it is solved, for historical and future issues (in case someone else experiences something similar).

Again thank you for your help Eric!

On Mon, Aug 4, 2014 at 4:46 PM, Eric Wong <e...@80x24.org> wrote:
> Eric Wong <e...@80x24.org> wrote:
> > Did you try strace-ing for 30 minutes and reproducing the error?
>
> You can also try setting the unicorn timeout to longer than 30
> minutes and get a longer/stalled strace.
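
For anyone who wants to check the idle-timeout theory outside the app, here is a minimal sketch in Python. It is not what either app does; it only opens a TCP connection, leaves it idle, and then sends a small payload to see whether a reply still comes back. The function name probe_idle_connection, the payload, and the return labels are all made up for illustration, and it assumes the target service answers a plain-text payload (it does not speak Oracle's wire protocol).

```python
import socket
import time


def probe_idle_connection(host, port, idle_seconds, payload=b"ping\n", reply_timeout=10.0):
    """Connect, stay idle, then send a payload and wait for any reply.

    Returns "alive" if a reply arrives, "timeout" if the read stalls
    (the symptom seen in the strace), or "closed" if the peer closed
    or reset the connection.
    """
    # The timeout passed here also applies to the later send/recv calls.
    with socket.create_connection((host, port), timeout=reply_timeout) as sock:
        # No traffic crosses the network path during this window.
        time.sleep(idle_seconds)
        try:
            sock.sendall(payload)
            data = sock.recv(4096)
        except socket.timeout:
            return "timeout"
        except OSError:
            return "closed"
        # An empty read means the peer closed the connection cleanly.
        return "alive" if data else "closed"
```

Running it twice against the same service across the firewall, once with a short idle period and once with something just over 30 minutes (for example idle_seconds=31 * 60), should show whether the idle period alone is enough to stall the read. The host and port to use are whatever test service is available on the far side; none is assumed here.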