Dear All,

We have run into an issue where, after an extended uptime, both the Kylin
query server and the jobs running on EMR stop working. The root cause on
both sides is this exception:

Caused by: java.io.IOException: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.SdkClientException: Unable to execute HTTP request: Timeout waiting for connection from pool
        at com.amazon.ws.emr.hadoop.fs.s3n2.S3NativeFileSystem2.getFileStatus(S3NativeFileSystem2.java:257) ~[emrfs-hadoop-assembly-2.37.0.jar:?]

In our setup, S3 is used both for intermediate data storage and as the
persistence layer under HBase.

Following the advice in
https://aws.amazon.com/premiumsupport/knowledge-center/emr-timeout-connection-wait/
increasing the connection pool size (the fs.s3.maxConnections property) to
10 000 only delays the issue, so the underlying cause is likely a
connection leak.
The fact that restarting the Kylin service solves the problem also points
to a leak.
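
To illustrate the reasoning, here is a minimal, self-contained sketch of why
a larger pool only postpones the failure when connections are never
returned. It is a simplified simulation, not EMRFS, AWS SDK, or Kylin code:
the pool is modelled as a plain semaphore, and the names LeakDemo and
successfulRequests are made up for this example.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

public class LeakDemo {
    /**
     * Simulates `requests` sequential requests against a pool of `poolSize`
     * connections and returns how many of them obtained a connection.
     * When `leak` is true, a borrowed connection is never returned.
     */
    public static int successfulRequests(int poolSize, int requests, boolean leak) {
        Semaphore pool = new Semaphore(poolSize);
        int served = 0;
        try {
            for (int i = 0; i < requests; i++) {
                // Borrow with a timeout, analogous to
                // "Timeout waiting for connection from pool".
                if (!pool.tryAcquire(10, TimeUnit.MILLISECONDS)) {
                    break;
                }
                served++;
                if (!leak) {
                    pool.release(); // well-behaved caller returns the connection
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return served;
    }

    public static void main(String[] args) {
        System.out.println(successfulRequests(50, 200, true));      // small pool, leaking
        System.out.println(successfulRequests(10000, 20000, true)); // large pool, leaking
        System.out.println(successfulRequests(50, 200, false));     // small pool, no leak
    }
}
```

With a leak, the number of requests served is capped by the pool size no
matter how large the pool is. Without one, even a small pool serves every
request.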

We opened a ticket about the issue:
https://issues.apache.org/jira/browse/KYLIN-4500
A full stack trace from the QueryService is attached to the ticket.

Since this is seriously affecting our production service, any hint would be
much appreciated. Is there any chance someone could look into this?

Many thanks,
Andras
