It may be worth adding the OpenSSL version to the puzzle, given the SSL
errors you caught. Did it change with the upgrade? That said, without
having looked closer, I would not normally associate a timeout with SSL
issues at all.

I'd also try tcpdump on the client side instead of the server.
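
If it helps to tell the two failure modes apart from the client side, here
is a minimal Python sketch (mine, not something taken from your setup). It
makes one TCP connect plus TLS handshake and reports where the attempt
stopped: a TCP timeout, a TLS timeout, or a TLS error. The function names,
the port and the 5-second timeout are assumptions for illustration.

```python
import socket
import ssl
import time
from collections import Counter


def classify_attempt(host, port=443, timeout=5.0):
    """Try one TCP connect plus TLS handshake and report where it stopped."""
    try:
        raw = socket.create_connection((host, port), timeout=timeout)
    except socket.timeout:
        return "tcp_timeout"
    except OSError as exc:
        return "tcp_error: %s" % exc
    try:
        ctx = ssl.create_default_context()
        # The TLS socket inherits the timeout set on the TCP socket above.
        with ctx.wrap_socket(raw, server_hostname=host) as tls:
            return "ok: %s %s" % (tls.version(), tls.cipher()[0])
    except socket.timeout:
        return "tls_timeout"
    except ssl.SSLError as exc:
        return "tls_error: %s" % exc
    except OSError as exc:
        return "tls_io_error: %s" % exc
    finally:
        raw.close()


def probe(host, attempts=20, pause=1.0, **kwargs):
    """Repeat classify_attempt and count outcomes by their leading label."""
    counts = Counter()
    for _ in range(attempts):
        outcome = classify_attempt(host, **kwargs)
        counts[outcome.split(":", 1)[0]] += 1
        time.sleep(pause)
    return counts
```

Something like probe("your.public.hostname", attempts=50), run from behind
the customer's address during their peak hours, would show whether the
failures are mostly timeouts or mostly TLS errors. The hostname here is a
placeholder.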

On Wed, Oct 9, 2019, 21:33, Franck Fallateuf <
[email protected]> wrote:

> Hello everyone,
>
> We upgraded from Apache 2.4.12 to 2.4.18 on a public facing webserver
> which proxies requests to backend servers. Initially when we cut-over to
> the webserver running the newer version (2.4.18), all traffic seemed to
> flow normally.  But a few days later, we received a report from one of
> our customers that they were experiencing random outages. The outage would
> manifest itself as a browser error page: "This site can't be reached",
> "ERR_CONNECTION_TIMED_OUT".  As far as we are aware, this is the only
> customer experiencing this issue, and the only one to report it. After
> looking through all available logs, for Apache and otherwise, we could not
> identify what was causing this or where it was occurring.  So we decided to
> set up packet capturing (tcpdump) at both ends between us and this customer.
> What we observed was the following:
>
> Packet captures on the border firewall showed the SSL handshake failing
> during ECDH negotiation, after the server hello message was received by the
> client. The return packet was a ‘bad_record_mac’ alert message, alert code
> 20.
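
For reading the capture: in TLS 1.2 (RFC 5246), bad_record_mac means the
sender of the alert received a record whose integrity check failed. A
rough Python sketch for decoding an unencrypted alert record copied out of
a capture follows. The helper name is hypothetical and the description
table is only a subset of the RFC's list. An alert sent after encryption
has started will not have a 2-byte body and is rejected here.

```python
import struct

ALERT_LEVELS = {1: "warning", 2: "fatal"}

# Subset of the alert descriptions defined in RFC 5246, section 7.2.
ALERT_DESCRIPTIONS = {
    0: "close_notify",
    10: "unexpected_message",
    20: "bad_record_mac",
    40: "handshake_failure",
    50: "decode_error",
    51: "decrypt_error",
    70: "protocol_version",
    80: "internal_error",
}


def decode_plaintext_alert(record):
    """Decode a 7-byte plaintext TLS alert record into (level, description)."""
    if len(record) < 7:
        raise ValueError("record too short for a plaintext alert")
    # Record header: content type, version major/minor, payload length.
    content_type, _major, _minor, length = struct.unpack("!BBBH", record[:5])
    if content_type != 21:
        raise ValueError("not an alert record")
    if length != 2:
        raise ValueError("alert body is not 2 bytes (encrypted or malformed)")
    level, description = record[5], record[6]
    return (
        ALERT_LEVELS.get(level, "unknown(%d)" % level),
        ALERT_DESCRIPTIONS.get(description, "unknown(%d)" % description),
    )
```

For example, the bytes 15 03 03 00 02 02 14 decode to a fatal
bad_record_mac alert, since 0x14 is 20.
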
>
> Because of this, we decided to make the following changes:
>
> During troubleshooting, the TIME_WAIT value was increased on the firewall
> to allow enough time for a response; this did not resolve the issue. The
> firewall was then configured for TCP bypass for the IP addresses having
> the communication issues; this did not resolve the issue either. The
> firewall is a Cisco ASA 5545 running v 9.8(3)29.
>
> While comparing our Apache 2.4.12 and 2.4.18 setups, we found that we
> were running the "event" MPM on 2.4.18 versus the "worker" MPM on 2.4.12.
> After reading up on the differences between these two MPM types, we
> immediately suspected this could have played a part because of how
> sockets are handled. We switched the newer Apache version back to the
> "worker" MPM and tested again, but this customer still experienced the
> same random issues.
>
> Additional information:
>   - Customer uses one single source IP address, from which all of these
> requests come for all of their employees' traffic to access our
> application.
>   - There seems to be a correlation between peak traffic times for this
> customer and the likelihood of these events.  As stated, all traffic is
> coming from one single source IP address, and there could be 200+ users
> on our system at any given time.
>   - Customer reports fewer occurrences of this issue outside of their
> peak traffic times.
>   - We've tuned the ListenBacklog to 99999 with no noticeable impact on
> this issue, although we believe it could have played a part in a separate
> issue not within this scope.
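
On the ListenBacklog value: assuming the server runs Linux (the message
does not say), the kernel silently truncates the backlog passed to listen()
to the value of net.core.somaxconn, so 99999 may not be what is actually in
effect. A small Python sketch to check, with hypothetical helper names:

```python
from pathlib import Path

# Linux-only location of the accept-queue cap; an assumption about the host.
SOMAXCONN_PATH = Path("/proc/sys/net/core/somaxconn")


def read_somaxconn(path=SOMAXCONN_PATH):
    """Return the kernel's somaxconn value, or None if it cannot be read."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


def effective_backlog(requested, somaxconn):
    """Backlog Linux actually applies: the request capped at somaxconn."""
    return min(requested, somaxconn)
```

With a somaxconn of 128, for instance, effective_backlog(99999, 128) gives
128, so raising ListenBacklog alone would change nothing on such a host.
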
>
> Any help would be greatly appreciated, as we are out of ideas and this
> customer has not been very cooperative in helping us help them with this
> issue. We've had to revert to running Apache 2.4.12, which we would
> like to upgrade from.
>
> Thank you,
> Franck
