Perhaps you could add the OpenSSL version to the puzzle, given the SSL errors you caught: did it change with the upgrade? That said, without looking further, I would tend not to associate a timeout with SSL issues at all.
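If it helps, here is a minimal Python sketch of a client-side probe that separates plain timeouts from TLS failures. The names classify_failure and probe and the 10-second default timeout are my own assumptions, not anything from your setup, and it uses only the standard socket and ssl modules. It handshakes with the probe host's own TLS library, so it will not necessarily negotiate the same ciphers as the customer's browsers.

```python
import socket
import ssl
import time


def classify_failure(exc):
    """Map an exception from a connection attempt to a coarse category."""
    if isinstance(exc, socket.timeout):
        # TCP connect or TLS handshake did not finish within the timeout.
        return "timeout"
    if isinstance(exc, ssl.SSLError):
        # TLS-level failure, e.g. a fatal alert from the peer.
        return "tls"
    if isinstance(exc, OSError):
        # Refused, reset, unreachable, DNS failure, and similar.
        return "network"
    return "other"


def probe(host, port=443, timeout=10.0):
    """Attempt one TCP connect plus TLS handshake.

    Returns a (category, elapsed_seconds) tuple, where category is "ok"
    or one of the values produced by classify_failure().
    """
    context = ssl.create_default_context()
    start = time.monotonic()
    try:
        # The timeout set here also applies to the handshake below.
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=host):
                pass  # the handshake completes inside wrap_socket()
        result = "ok"
    except Exception as exc:
        result = classify_failure(exc)
    return result, time.monotonic() - start
```

Calling probe() against your public hostname once a second from a machine on the customer's side during their peak hours, and logging each result with a timestamp, should show whether the failures are really timeouts, TLS errors, or something else, and the timestamps can then be matched against the packet captures.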
I'd also try tcpdump on the client side instead of the server.

On Wed, 9 Oct 2019 at 21:33, Franck Fallateuf < [email protected]> wrote:

> Hello everyone,
>
> We upgraded from Apache 2.4.12 to 2.4.18 on a public-facing webserver
> which proxies requests to backend servers. Initially, when we cut over to
> the webserver running the newer version (2.4.18), all traffic seemed to
> flow normally. But a few days later, we received a report from one of
> our customers that they were experiencing random outages. The outage would
> manifest itself as a browser page saying "This site can't be reached",
> "ERR_CONNECTION_TIMED_OUT". As far as we are aware, this is the only
> customer experiencing this issue or reporting it. After looking through
> all available logs for Apache and otherwise, we could not identify what was
> causing this nor where it was occurring. So we decided to set up some
> packet captures (tcpdump) at both ends between us and this customer.
> What we observed was the following:
>
> Packet captures on the border firewall showed the SSL handshake failing
> during ECDH negotiation, after the server hello message was received on
> the client. The return packet was a ‘bad_record_mac’ alert message, alert
> code 20.
>
> Because of this, we decided to make the following changes:
>
> During troubleshooting, the TIME_WAIT value was increased on the firewall
> to allow enough time for a response; this did not resolve the issue. The
> firewall was then configured for TCP bypass for the IP addresses having
> the communication issues; this did not resolve the issue either. The
> firewall is a Cisco ASA 5545 running v 9.8(3)29.
>
> While comparing the Apache setups we had running 2.4.12 and 2.4.18, we
> found out that we were running the "event" mpm on 2.4.18 vs the "worker"
> mpm on 2.4.12. Reading up on the differences between these two mpm types,
> we immediately thought this could have played a part because of how
> sockets are handled. We reverted the mpm to "worker" on the newer
> Apache version. We tested again and this customer still experienced the
> same random issues.
>
> Additional information:
> - The customer uses one single source IP address, from which all of
> their employees' requests to our application come.
> - There seems to be a correlation between peak traffic times for this
> customer and the likelihood of these events. As stated, all traffic
> comes from that one single IP address, and there could be 200+ users
> on our system at a given time.
> - The customer reports fewer occurrences of this issue outside of their
> peak traffic times.
> - We've tuned ListenBacklog to 99999 with no noticeable impact on
> this issue, although we believe it could have played a part in a separate
> issue not within this scope.
>
> Any help would be greatly appreciated, as we are out of ideas and this
> customer has not been very friendly in helping us help them with this
> issue. We've had to revert to running Apache 2.4.12, which we would
> like to upgrade from.
>
> Thank you,
> Franck
