Hello all,

*Bottom line:*

We have clients connecting with RDP via Guacamole to Windows Server 2016 instances running on AWS. We experience:

a) Freezes of the client UI (which are resolved by closing the client and re-establishing the connection, or sometimes by waiting a while).
b) Failures to connect in the client.

Roughly speaking, 1 out of 10 clients experiences one of these problems.

*Some background to the setup:*

I have guacd version 1.0.0 running in a Docker container on an AWS EC2 instance.

I have my own web server in Java, using the guacamole SDK (also version 1.0.0). This one also runs in a Docker container, on a different EC2 instance. It is pretty straightforward: it creates a tunnel wrapping the socket to guacd for every client.

The web server and guacd reach each other using HashiCorp's Consul. A local Consul agent runs on both EC2 instances, and when the web server accesses guacd, it in fact accesses a DNS name which the Consul service resolves to the actual address of the guacd instance, which is in the same VPC and subnet as the web server.

We wrote our own client, using guacamole-common-js (also version 1.0.0).

The WebSockets that the client app opens against the web server go through an AWS ALB and then through eBay's Fabio load balancer (they also expose the web server as HTTPS).

It's also worth mentioning that we have multiple instances of guacd and the web server in multiple containers, with the ALB and Fabio load balancing them.

*More info I have:*

1. We ran guacd at debug log level, and saw many errors like:

   INFO: Guacamole connection closed during handshake
   DEBUG: Error reading "select": End of stream reached while reading instruction

2. We also see these errors in guacd when trying to establish the RDP connection for the first time:

   certificate_store_open: error opening [/root/.config/freerdp/known_hosts] for writing
   unexpected pubKeyAuth buffer size :0
   Could not verify public key echo!
   Authentication failure, check credentials.
   If credentials are valid, the NTLMSSP implementation may be to blame.
   Error: protocol security negotiation or connection failure

3. We also see this line in the guacd logs when establishing the RDP connection:

   Unable to find a match for unix timezone: Etc/UTC

4. In the Guac configuration for the connection, we use:

   security: any
   ignore-cert: true

5. We also use enable-drive: true. The drive's path is an AWS EFS drive we mount on the guacd instance.

6. On the client we see 514 error codes. We can't match them with the symptoms we see, i.e. the disconnections or freezes.

7. We do see that upon connecting, both the guacd container and the web server container show CPU load (it can exceed 100% utilization when opening ~20 connections). The specs of the containers themselves: guacd runs with 2000 MHz / 6 GB / 30 mbits. The web server runs with 2000 MHz / 2 GB / 10 mbits.

8. We don't see any load on the web traffic, either in the client or on the servers.

9. In our app we have different clients connecting to the same machine's desktop using the same guac connection. Each client has a tunnel of its own to the web server, which then shares the guacd connection between the clients. Not sure if this is problematic, but worth mentioning.

10. We made a change to our IT to have only one instance each of the guacd and web server containers, to reduce possible friction points.

*To sum up:*

I think this is pretty much all of the information I have. As you can see, we have failed to pinpoint where the problem could be. I'd appreciate it if you could offer some direction: what we can check, how we can check it, whether you ran into something similar, or whether something looks suspicious in this whole setup.

Many thanks!
Daniel
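P.S. To make the error in item 1 easier to read: as far as I understand the Guacamole protocol, every instruction is a comma-separated list of LENGTH.VALUE elements terminated by a semicolon, and "select" is the first instruction guacd expects during the handshake. So the log line seems to say that the connection was closed before a complete "select" arrived. Below is a minimal sketch of that wire format in Python. The function names are my own and are not part of any Guacamole library, and guacd's real parser is certainly more involved; this only illustrates what a complete versus a truncated instruction looks like.

```python
def encode_instruction(opcode, *args):
    """Encode one instruction as LENGTH.VALUE elements joined by ',' and ended by ';'."""
    elements = [opcode, *args]
    return ",".join(f"{len(e)}.{e}" for e in elements) + ";"


def decode_instruction(data):
    """Decode a single instruction from a string.

    Raises EOFError if the data ends before the terminating ';' is seen,
    which is the situation the guacd log message appears to describe.
    """
    elements = []
    pos = 0
    while True:
        dot = data.find(".", pos)
        if dot == -1:
            raise EOFError("end of stream reached while reading instruction")
        length = int(data[pos:dot])      # declared length of the element
        start = dot + 1
        end = start + length
        if end >= len(data):             # value or terminator is missing
            raise EOFError("end of stream reached while reading instruction")
        elements.append(data[start:end])
        terminator = data[end]
        if terminator == ";":            # end of the instruction
            return elements
        if terminator != ",":            # anything else is malformed
            raise ValueError(f"unexpected character {terminator!r} after element")
        pos = end + 1


print(encode_instruction("select", "rdp"))  # prints: 6.select,3.rdp;
```

Calling decode_instruction("6.select,3.rdp;") returns ["select", "rdp"], while an empty or cut-off string raises EOFError, which is how I read the "End of stream reached" message in our logs.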
