Hi,
we have an oversized system setup with three SOGo front-end servers, a
pgpool PostgreSQL DB cluster, and a load balancer in front.
The system is designed for 10,000+ users. Currently we have fewer than 250.
The system has been running for a few months, and the user count rose slowly to
about 150.
When we activated about 80 new users yesterday, the system suddenly
became inaccessible.
We deactivated two of the three servers (I'm leaving out a few things
here*) and the system went back to normal.
The issue is now reproducible.
When the second SOGo server starts up,
- after a few minutes, requests appear that never terminate:
"pid x has been hanging in the same request for x minutes"
and their number keeps growing on both servers
- eventually all the handlers are busy, and there are none left:
"No child available to handle incoming request!"
- with no free handlers, SOGo becomes unreachable
- nginx returns 504 (Gateway Timeout)
- the load balancer drops both SOGo front ends and the whole system
is "gone".
When the second server shuts down,
- everything goes back to normal almost immediately
Load is more or less the same the whole time. The hardware is nearly idle.
When the two servers run together for longer, I get these errors in sogo.log:
Jan 22 13:28:12 sogod [20373]: [ERROR] <0x0x929998[GCSChannelManager]>
could not open channel <0x0xc23a78[PostgreSQL72Channel]: not-connected>
for postgresql://10.49.40.80/sogo/sogo_user_profile
Jan 22 13:28:12 sogod [20373]: [ERROR] <0x0xc1f1a8[SOGoSQLUserProfile]>
failed to acquire channel for URL:
postgresql://sogo:@10.49.40.80:5433/sogo/sogo_user_profile
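
One way to quantify how these messages build up would be to tally them per
minute. The following is only a minimal Python sketch: the patterns are assumed
from the message wording quoted above, the timestamp layout is assumed from the
sample lines, and the names tally, PATTERNS and STAMP are purely illustrative.

```python
import re
from collections import Counter

# Patterns assumed from the messages quoted in this post; the exact wording
# may differ between SOGo versions.
PATTERNS = {
    "hanging": re.compile(r"has been hanging in the same request for \d+ minutes"),
    "no_child": re.compile(r"No child available to handle incoming request"),
    "no_channel": re.compile(r"could not open channel"),
}

# Timestamp layout assumed from the sample lines, e.g. "Jan 22 13:28:12".
# The capture group keeps month, day, hour and minute (seconds are dropped).
STAMP = re.compile(r"^(\w{3}\s+\d+\s+\d{2}:\d{2}):\d{2}\s")


def tally(lines):
    """Count matching messages per (minute, kind) over an iterable of log lines."""
    counts = Counter()
    for line in lines:
        stamp = STAMP.match(line)
        if not stamp:
            continue  # lines without a leading timestamp are skipped
        for kind, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[(stamp.group(1), kind)] += 1
    return counts
```

Feeding it an open file handle for sogo.log and printing sorted(counts.items())
would show, minute by minute, whether the hanging requests start before or
after the channel errors.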
Any hint?
Marc
--
users@sogo.nu
https://inverse.ca/sogo/lists