[jira] [Comment Edited] (KYLIN-4500) Timeout waiting for connection from pool

2021-10-04 Thread Gabor Arki (Jira)


[ 
https://issues.apache.org/jira/browse/KYLIN-4500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17423927#comment-17423927
 ] 

Gabor Arki edited comment on KYLIN-4500 at 10/4/21, 1:14 PM:
-

This has happened to our production environment today, now with Kylin 3.1.0 
running on EMR 5.28. Restarting the query+job server released the connections 
again and resolved the issue.

I assume there is another potential leak somewhere, similar to KYLIN-4396, that 
is still unfixed, at least in v3.1.0.
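Purely to illustrate the kind of bug I suspect (a hypothetical sketch, not a 
known Kylin call site): an {{FSDataInputStream}} opened through EMRFS holds an 
HTTP connection from the S3 pool until it is closed, so any code path that 
opens streams without closing them will eventually hit "Timeout waiting for 
connection from pool".

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ConnectionLeakSketch {

    // Leaky variant: the stream is never closed, so its pooled HTTP
    // connection is never returned; repeated calls drain the pool.
    static byte[] readHeaderLeaky(FileSystem fs, Path path) throws Exception {
        FSDataInputStream in = fs.open(path);
        byte[] header = new byte[16];
        in.readFully(header);
        return header; // leak: no in.close()
    }

    // Safe variant: try-with-resources always releases the connection.
    static byte[] readHeaderSafe(FileSystem fs, Path path) throws Exception {
        try (FSDataInputStream in = fs.open(path)) {
            byte[] header = new byte[16];
            in.readFully(header);
            return header;
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("s3://example-bucket/example-object"); // hypothetical
        FileSystem fs = path.getFileSystem(conf);
        readHeaderSafe(fs, path);
    }
}
{code}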


was (Author: arkigabor):
This has happened to our production environment today, now with Kylin 3.1.0 
running on EMR 5.28. Restarting the query server released the connections again 
and resolved the issue.

> Timeout waiting for connection from pool
> 
>
> Key: KYLIN-4500
> URL: https://issues.apache.org/jira/browse/KYLIN-4500
> Project: Kylin
>  Issue Type: Bug
>Affects Versions: v3.0.0, v3.1.0
>Reporter: Gabor Arki
>Priority: Major
> Attachments: kylin-connection-timeout.txt
>
>
> h4. Environment
>  * Kylin server 3.0.0
>  * EMR 5.28
> h4. Issue
> After an extended uptime, both the Kylin query server and the jobs running on 
> EMR stop working. The root cause in both cases is:
> {noformat}
> Caused by: java.io.IOException: 
> com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.SdkClientException: Unable 
> to execute HTTP request: Timeout waiting for connection from pool
> at 
> com.amazon.ws.emr.hadoop.fs.s3n2.S3NativeFileSystem2.getFileStatus(S3NativeFileSystem2.java:257)
>  ~[emrfs-hadoop-assembly-2.37.0.jar:?]{noformat}
> Based on 
> [https://aws.amazon.com/premiumsupport/knowledge-center/emr-timeout-connection-wait/],
>  increasing the fs.s3.maxConnections setting only delays the issue, so the 
> underlying problem is likely a connection leak. That restarting the Kylin 
> service solves the problem also points to a leak.
> A full stack trace from the QueryService is attached.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (KYLIN-4500) Timeout waiting for connection from pool

2021-10-04 Thread Gabor Arki (Jira)


[ 
https://issues.apache.org/jira/browse/KYLIN-4500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17423927#comment-17423927
 ] 

Gabor Arki edited comment on KYLIN-4500 at 10/4/21, 1:08 PM:
-

This has happened to our production environment today, now with Kylin 3.1.0 
running on EMR 5.28. Restarting the query server released the connections again 
and resolved the issue.


was (Author: arkigabor):
This has happened to our production environment today, now with Kylin 3.1.0 
running on EMR 5.28. Restarting the query server released the connections again 
and resolved the issue.

> Timeout waiting for connection from pool
> 
>
> Key: KYLIN-4500
> URL: https://issues.apache.org/jira/browse/KYLIN-4500
> Project: Kylin
>  Issue Type: Bug
>Affects Versions: v3.0.0, v3.1.0
>Reporter: Gabor Arki
>Priority: Major
> Attachments: kylin-connection-timeout.txt
>
>
> h4. Environment
>  * Kylin server 3.0.0
>  * EMR 5.28
> h4. Issue
> After an extended uptime, both the Kylin query server and the jobs running on 
> EMR stop working. The root cause in both cases is:
> {noformat}
> Caused by: java.io.IOException: 
> com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.SdkClientException: Unable 
> to execute HTTP request: Timeout waiting for connection from pool
> at 
> com.amazon.ws.emr.hadoop.fs.s3n2.S3NativeFileSystem2.getFileStatus(S3NativeFileSystem2.java:257)
>  ~[emrfs-hadoop-assembly-2.37.0.jar:?]{noformat}
> Based on 
> [https://aws.amazon.com/premiumsupport/knowledge-center/emr-timeout-connection-wait/],
>  increasing the fs.s3.maxConnections setting only delays the issue, so the 
> underlying problem is likely a connection leak. That restarting the Kylin 
> service solves the problem also points to a leak.
> A full stack trace from the QueryService is attached.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KYLIN-4500) Timeout waiting for connection from pool

2021-10-04 Thread Gabor Arki (Jira)


 [ 
https://issues.apache.org/jira/browse/KYLIN-4500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Arki updated KYLIN-4500:
--
Affects Version/s: v3.0.0
   v3.1.0

> Timeout waiting for connection from pool
> 
>
> Key: KYLIN-4500
> URL: https://issues.apache.org/jira/browse/KYLIN-4500
> Project: Kylin
>  Issue Type: Bug
>Affects Versions: v3.0.0, v3.1.0
>Reporter: Gabor Arki
>Priority: Major
> Attachments: kylin-connection-timeout.txt
>
>
> h4. Environment
>  * Kylin server 3.0.0
>  * EMR 5.28
> h4. Issue
> After an extended uptime, both the Kylin query server and the jobs running on 
> EMR stop working. The root cause in both cases is:
> {noformat}
> Caused by: java.io.IOException: 
> com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.SdkClientException: Unable 
> to execute HTTP request: Timeout waiting for connection from pool
> at 
> com.amazon.ws.emr.hadoop.fs.s3n2.S3NativeFileSystem2.getFileStatus(S3NativeFileSystem2.java:257)
>  ~[emrfs-hadoop-assembly-2.37.0.jar:?]{noformat}
> Based on 
> [https://aws.amazon.com/premiumsupport/knowledge-center/emr-timeout-connection-wait/],
>  increasing the fs.s3.maxConnections setting only delays the issue, so the 
> underlying problem is likely a connection leak. That restarting the Kylin 
> service solves the problem also points to a leak.
> A full stack trace from the QueryService is attached.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KYLIN-4500) Timeout waiting for connection from pool

2021-10-04 Thread Gabor Arki (Jira)


[ 
https://issues.apache.org/jira/browse/KYLIN-4500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17423927#comment-17423927
 ] 

Gabor Arki commented on KYLIN-4500:
---

This has happened to our production environment today, now with Kylin 3.1.0 
running on EMR 5.28. Restarting the query server released the connections again 
and resolved the issue.

> Timeout waiting for connection from pool
> 
>
> Key: KYLIN-4500
> URL: https://issues.apache.org/jira/browse/KYLIN-4500
> Project: Kylin
>  Issue Type: Bug
>Reporter: Gabor Arki
>Priority: Major
> Attachments: kylin-connection-timeout.txt
>
>
> h4. Environment
>  * Kylin server 3.0.0
>  * EMR 5.28
> h4. Issue
> After an extended uptime, both the Kylin query server and the jobs running on 
> EMR stop working. The root cause in both cases is:
> {noformat}
> Caused by: java.io.IOException: 
> com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.SdkClientException: Unable 
> to execute HTTP request: Timeout waiting for connection from pool
> at 
> com.amazon.ws.emr.hadoop.fs.s3n2.S3NativeFileSystem2.getFileStatus(S3NativeFileSystem2.java:257)
>  ~[emrfs-hadoop-assembly-2.37.0.jar:?]{noformat}
> Based on 
> [https://aws.amazon.com/premiumsupport/knowledge-center/emr-timeout-connection-wait/],
>  increasing the fs.s3.maxConnections setting only delays the issue, so the 
> underlying problem is likely a connection leak. That restarting the Kylin 
> service solves the problem also points to a leak.
> A full stack trace from the QueryService is attached.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KYLIN-5022) Kylin upgraded to a new version - large numbers of kylin-coprocessor files generated under /mnt/tmp/hbase-hbase/local/jars/tmp

2021-07-21 Thread Gabor Arki (Jira)


[ 
https://issues.apache.org/jira/browse/KYLIN-5022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17384695#comment-17384695
 ] 

Gabor Arki commented on KYLIN-5022:
---

Maybe using 
[checksum|https://hadoop.apache.org/docs/r2.8.2/api/org/apache/hadoop/fs/FileSystem.html#getFileChecksum(org.apache.hadoop.fs.Path)],
although I couldn't tell whether it is applicable to the S3 filesystem. Also, 
the implemented algorithm depends on the filesystem used.

Another workaround could be to extend the upload method with the solution I am 
currently using: after the file has been copied to the 
{{hdfsWorkingDirectory}}, copy it back to the local file system.

Or maybe an even easier solution is to reverse the setting of the last-modified 
timestamp: instead of trying to set it on the remote filesystem, take the 
last-modified timestamp of the newly uploaded file and set that value as the 
last-modified timestamp of the original jar on the local file system.
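To make that last idea concrete, a minimal sketch against the Hadoop 
{{FileSystem}} API (paths are hypothetical; this assumes the local filesystem 
honors {{setTimes}} even though S3 does not):

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReverseTimestampSync {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical locations of the local jar and its uploaded copy.
        Path localJar = new Path("file:///opt/kylin/lib/kylin-coprocessor.jar");
        Path remoteJar = new Path("s3://example-bucket/kylin/coprocessor/kylin-coprocessor.jar");

        FileSystem localFs = localJar.getFileSystem(conf);
        FileSystem remoteFs = remoteJar.getFileSystem(conf);

        // Instead of pushing the local timestamp to S3 (which is silently
        // ignored), pull the uploaded object's timestamp onto the local jar
        // so the subsequent timestamp comparison passes.
        FileStatus uploaded = remoteFs.getFileStatus(remoteJar);
        localFs.setTimes(localJar, uploaded.getModificationTime(), -1);
    }
}
{code}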

> Kylin upgraded to a new version - large numbers of kylin-coprocessor files 
> generated under /mnt/tmp/hbase-hbase/local/jars/tmp
> --
>
> Key: KYLIN-5022
> URL: https://issues.apache.org/jira/browse/KYLIN-5022
> Project: Kylin
>  Issue Type: Bug
>  Components: Metadata, Storage - HBase
>Affects Versions: v3.1.1
>Reporter: star_dev
>Priority: Major
> Attachments: Capture.PNG, 屏幕快照1.png, 屏幕快照2.png, 日志.log
>
>
> The Kylin version was upgraded from 3.0.2 to 3.1.1, still using the original 
> metadata.
> We found that a large number of kylin-coprocessor files are generated on the 
> EMR core nodes (see attachment 屏幕快照1.png), occupying a lot of space and 
> reducing the free space of the HDFS file system. The path is 
> /mnt/tmp/hbase-hbase/local/jars/tmp
> We consulted the official documentation 
> [http://kylin.apache.org/docs/howto/howto_update_coprocessor.html]
> and running the following command still did not help; see the attached log
> -
>  
> {{$KYLIN_HOME/bin/kylin.sh 
> org.apache.kylin.storage.hbase.util.DeployCoprocessorCLI default all}}
> {{-}}
> We also found a large number of kylin-coprocessor-3.1.1-*.jar files under the 
> Kylin metadata path kylin_metadata/coprocessor/, see attachment 屏幕快照2.png
>  
> What is causing this behavior?
> How can we stop large numbers of files from being generated under the 
> /mnt/tmp/hbase-hbase/local/jars/tmp path?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KYLIN-5022) Kylin upgraded to a new version - large numbers of kylin-coprocessor files generated under /mnt/tmp/hbase-hbase/local/jars/tmp

2021-07-19 Thread Gabor Arki (Jira)


[ 
https://issues.apache.org/jira/browse/KYLIN-5022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17383245#comment-17383245
 ] 

Gabor Arki commented on KYLIN-5022:
---

[~xxyu], I did some additional investigation and after finding the 
aforementioned {{isSame}} method I also found the root cause.

We are using S3 instead of HDFS, and that is what causes the problem. S3 does 
not support setting the last-modified timestamp: after the coprocessor jar has 
been copied to the remote file system, the invoked {{setTimes}} is silently 
ignored in the case of an S3 filesystem. Because of that, {{isSame}} will 
always return false, and a new coprocessor jar is uploaded for each and every 
table.

As a manual workaround, I copied the coprocessor jar manually to S3 and then 
copied it back from S3 to the local file system. This way the last-modified 
timestamps match and only one coprocessor jar is used.
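For reference, a simplified stand-in for the comparison that breaks (not the 
exact Kylin {{isSame}} implementation): the sizes match, but because 
{{setTimes}} is a no-op on S3, the modification times never do.

{code:java}
import org.apache.hadoop.fs.FileStatus;

public class IsSameSketch {
    // On S3, setTimes() after the upload is silently ignored, so the remote
    // mtime is always the upload time, this comparison always fails, and a
    // fresh coprocessor jar is uploaded for every table.
    static boolean isSame(FileStatus local, FileStatus remote) {
        return local.getLen() == remote.getLen()
                && local.getModificationTime() == remote.getModificationTime();
    }
}
{code}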

> Kylin upgraded to a new version - large numbers of kylin-coprocessor files 
> generated under /mnt/tmp/hbase-hbase/local/jars/tmp
> --
>
> Key: KYLIN-5022
> URL: https://issues.apache.org/jira/browse/KYLIN-5022
> Project: Kylin
>  Issue Type: Bug
>  Components: Metadata, Storage - HBase
>Affects Versions: v3.1.1
>Reporter: star_dev
>Priority: Major
> Attachments: Capture.PNG, 屏幕快照1.png, 屏幕快照2.png, 日志.log
>
>
> The Kylin version was upgraded from 3.0.2 to 3.1.1, still using the original 
> metadata.
> We found that a large number of kylin-coprocessor files are generated on the 
> EMR core nodes (see attachment 屏幕快照1.png), occupying a lot of space and 
> reducing the free space of the HDFS file system. The path is 
> /mnt/tmp/hbase-hbase/local/jars/tmp
> We consulted the official documentation 
> [http://kylin.apache.org/docs/howto/howto_update_coprocessor.html]
> and running the following command still did not help; see the attached log
> -
>  
> {{$KYLIN_HOME/bin/kylin.sh 
> org.apache.kylin.storage.hbase.util.DeployCoprocessorCLI default all}}
> {{-}}
> We also found a large number of kylin-coprocessor-3.1.1-*.jar files under the 
> Kylin metadata path kylin_metadata/coprocessor/, see attachment 屏幕快照2.png
>  
> What is causing this behavior?
> How can we stop large numbers of files from being generated under the 
> /mnt/tmp/hbase-hbase/local/jars/tmp path?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (KYLIN-5022) Kylin upgraded to a new version - large numbers of kylin-coprocessor files generated under /mnt/tmp/hbase-hbase/local/jars/tmp

2021-07-16 Thread Gabor Arki (Jira)


[ 
https://issues.apache.org/jira/browse/KYLIN-5022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17381978#comment-17381978
 ] 

Gabor Arki edited comment on KYLIN-5022 at 7/16/21, 10:44 AM:
--

The root cause seems to be that Kylin is creating and configuring a unique jar 
per HBase table, thus HBase region servers are downloading this 5.5 MB jar for 
each table separately. In our case, the ~11000 tables result in 50+ GB of space 
needed on our HBase region servers (~11000 × 5.5 MB ≈ 60 GB).

To make this issue worse, it seems that over time HBase starts to delete these 
jars (maybe when a table is cleaned up, maybe it does so occasionally anyway). 
But given that the HBase region server process keeps running, the disk space 
occupied by these deleted jars is not freed up either unless the region server 
is shut down. Only then are these deleted files released and removed from the 
disk.


was (Author: arkigabor):
The root cause seems to be that Kylin is creating and configuring a unique jar 
per HBase table, thus HBase region servers are downloading this 5.5 MB jar for 
each table separately. In our case, the ~11000 tables result in 50+ GB of space 
needed on our HBase region servers (~11000 × 5.5 MB ≈ 60 GB).

To make this issue worse, it seems that over time HBase starts to delete these 
jars (maybe when a table is cleaned up, maybe it does so occasionally anyway). 
But given that the HBase region server process keeps running, the disk space 
occupied by these deleted jars is not freed up unless the region server is shut 
down. Only then are these deleted files released and removed from the disk.

> Kylin upgraded to a new version - large numbers of kylin-coprocessor files 
> generated under /mnt/tmp/hbase-hbase/local/jars/tmp
> --
>
> Key: KYLIN-5022
> URL: https://issues.apache.org/jira/browse/KYLIN-5022
> Project: Kylin
>  Issue Type: Bug
>  Components: Metadata, Storage - HBase
>Affects Versions: v3.1.1
>Reporter: star_dev
>Priority: Major
> Attachments: Capture.PNG, 屏幕快照1.png, 屏幕快照2.png, 日志.log
>
>
> The Kylin version was upgraded from 3.0.2 to 3.1.1, still using the original 
> metadata.
> We found that a large number of kylin-coprocessor files are generated on the 
> EMR core nodes (see attachment 屏幕快照1.png), occupying a lot of space and 
> reducing the free space of the HDFS file system. The path is 
> /mnt/tmp/hbase-hbase/local/jars/tmp
> We consulted the official documentation 
> [http://kylin.apache.org/docs/howto/howto_update_coprocessor.html]
> and running the following command still did not help; see the attached log
> -
>  
> {{$KYLIN_HOME/bin/kylin.sh 
> org.apache.kylin.storage.hbase.util.DeployCoprocessorCLI default all}}
> {{-}}
> We also found a large number of kylin-coprocessor-3.1.1-*.jar files under the 
> Kylin metadata path kylin_metadata/coprocessor/, see attachment 屏幕快照2.png
>  
> What is causing this behavior?
> How can we stop large numbers of files from being generated under the 
> /mnt/tmp/hbase-hbase/local/jars/tmp path?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KYLIN-5022) Kylin upgraded to a new version - large numbers of kylin-coprocessor files generated under /mnt/tmp/hbase-hbase/local/jars/tmp

2021-07-16 Thread Gabor Arki (Jira)


[ 
https://issues.apache.org/jira/browse/KYLIN-5022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17381978#comment-17381978
 ] 

Gabor Arki commented on KYLIN-5022:
---

The root cause seems to be that Kylin is creating and configuring a unique jar 
per HBase table, thus HBase region servers are downloading this 5.5 MB jar for 
each table separately. In our case, the ~11000 tables result in 50+ GB of space 
needed on our HBase region servers (~11000 × 5.5 MB ≈ 60 GB).

To make this issue worse, it seems that over time HBase starts to delete these 
jars (maybe when a table is cleaned up, maybe it does so occasionally anyway). 
But given that the HBase region server process keeps running, the disk space 
occupied by these deleted jars is not freed up unless the region server is shut 
down. Only then are these deleted files released and removed from the disk.
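As a side note, the underlying filesystem behavior is easy to reproduce in 
isolation; a toy sketch (hypothetical file name, not Kylin code):

{code:java}
import java.io.File;
import java.io.FileInputStream;
import java.nio.file.Files;

// A deleted file whose descriptor is still open keeps its blocks allocated
// until the holding process closes it or exits. This is why du and df
// disagree on the region servers, and why only a region server shutdown
// reclaims the space.
public class DeletedButOpen {
    public static void main(String[] args) throws Exception {
        File jar = new File("/tmp/doomed.jar");
        Files.write(jar.toPath(), new byte[5 * 1024 * 1024]); // ~5 MB, like a coprocessor jar
        try (FileInputStream in = new FileInputStream(jar)) {
            jar.delete();          // unlinks the name; lsof now shows "(deleted)"
            Thread.sleep(60_000L); // df still counts the space during this window
        }                          // close() finally releases the blocks
    }
}
{code}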

> Kylin upgraded to a new version - large numbers of kylin-coprocessor files 
> generated under /mnt/tmp/hbase-hbase/local/jars/tmp
> --
>
> Key: KYLIN-5022
> URL: https://issues.apache.org/jira/browse/KYLIN-5022
> Project: Kylin
>  Issue Type: Bug
>  Components: Metadata, Storage - HBase
>Affects Versions: v3.1.1
>Reporter: star_dev
>Priority: Major
> Attachments: Capture.PNG, 屏幕快照1.png, 屏幕快照2.png, 日志.log
>
>
> The Kylin version was upgraded from 3.0.2 to 3.1.1, still using the original 
> metadata.
> We found that a large number of kylin-coprocessor files are generated on the 
> EMR core nodes (see attachment 屏幕快照1.png), occupying a lot of space and 
> reducing the free space of the HDFS file system. The path is 
> /mnt/tmp/hbase-hbase/local/jars/tmp
> We consulted the official documentation 
> [http://kylin.apache.org/docs/howto/howto_update_coprocessor.html]
> and running the following command still did not help; see the attached log
> -
>  
> {{$KYLIN_HOME/bin/kylin.sh 
> org.apache.kylin.storage.hbase.util.DeployCoprocessorCLI default all}}
> {{-}}
> We also found a large number of kylin-coprocessor-3.1.1-*.jar files under the 
> Kylin metadata path kylin_metadata/coprocessor/, see attachment 屏幕快照2.png
>  
> What is causing this behavior?
> How can we stop large numbers of files from being generated under the 
> /mnt/tmp/hbase-hbase/local/jars/tmp path?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KYLIN-5022) Kylin upgraded to a new version - large numbers of kylin-coprocessor files generated under /mnt/tmp/hbase-hbase/local/jars/tmp

2021-07-16 Thread Gabor Arki (Jira)


 [ 
https://issues.apache.org/jira/browse/KYLIN-5022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Arki updated KYLIN-5022:
--
Attachment: Capture.PNG

> Kylin upgraded to a new version - large numbers of kylin-coprocessor files 
> generated under /mnt/tmp/hbase-hbase/local/jars/tmp
> --
>
> Key: KYLIN-5022
> URL: https://issues.apache.org/jira/browse/KYLIN-5022
> Project: Kylin
>  Issue Type: Bug
>  Components: Metadata, Storage - HBase
>Affects Versions: v3.1.1
>Reporter: star_dev
>Priority: Major
> Attachments: Capture.PNG, 屏幕快照1.png, 屏幕快照2.png, 日志.log
>
>
> The Kylin version was upgraded from 3.0.2 to 3.1.1, still using the original 
> metadata.
> We found that a large number of kylin-coprocessor files are generated on the 
> EMR core nodes (see attachment 屏幕快照1.png), occupying a lot of space and 
> reducing the free space of the HDFS file system. The path is 
> /mnt/tmp/hbase-hbase/local/jars/tmp
> We consulted the official documentation 
> [http://kylin.apache.org/docs/howto/howto_update_coprocessor.html]
> and running the following command still did not help; see the attached log
> -
>  
> {{$KYLIN_HOME/bin/kylin.sh 
> org.apache.kylin.storage.hbase.util.DeployCoprocessorCLI default all}}
> {{-}}
> We also found a large number of kylin-coprocessor-3.1.1-*.jar files under the 
> Kylin metadata path kylin_metadata/coprocessor/, see attachment 屏幕快照2.png
>  
> What is causing this behavior?
> How can we stop large numbers of files from being generated under the 
> /mnt/tmp/hbase-hbase/local/jars/tmp path?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (KYLIN-5022) Kylin upgraded to a new version - large numbers of kylin-coprocessor files generated under /mnt/tmp/hbase-hbase/local/jars/tmp

2021-07-09 Thread Gabor Arki (Jira)


[ 
https://issues.apache.org/jira/browse/KYLIN-5022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17378143#comment-17378143
 ] 

Gabor Arki edited comment on KYLIN-5022 at 7/9/21, 4:01 PM:


We are encountering a similar issue with v3.1.0. Apart from the 
{{kylin-coprocessor-*.jar}} files present in {{/mnt/tmp}}, this is also causing 
a disk space leak. There are a lot of deleted files still referenced by the 
HBase region server process, thus the disk space cannot actually be freed. This 
causes a significant discrepancy between the {{du}} and {{df}} calculations. In 
our case, having run the EMR cluster for Kylin for a few months now, 50+ GB of 
such deleted {{kylin-coprocessor-3.1.0-*.jar}} files are still occupying disk 
space on each core node:
 [hadoop@ip-23-0-3-131 mnt]$ sudo lsof| grep "/mnt" | grep delete | more
 java 16611 hbase 666r REG 259,3 5592785 143780992 
/mnt/tmp/hbase-hbase/local/jars/tmp/.1369857329.kylin-coprocessor-3.1.0-SNAPSHOT-1597.jar
 .1602171232484.jar (deleted)
 java 16611 hbase 679r REG 259,3 5592785 144773588 
/mnt/tmp/hbase-hbase/local/jars/tmp/.1170685576.kylin-coprocessor-3.1.0-SNAPSHOT-129.jar.
 1602182959858.jar (deleted)
 java 16611 hbase 680r REG 259,3 5592785 144321908 
/mnt/tmp/hbase-hbase/local/jars/tmp/.-1329141342.kylin-coprocessor-3.1.0-SNAPSHOT-3653.ja
 r.1602180128061.jar (deleted)
 java 16611 hbase 681r REG 259,3 5592785 144248531 
/mnt/tmp/hbase-hbase/local/jars/tmp/.-832621882.kylin-coprocessor-3.1.0-SNAPSHOT-3651.jar
 .1602179699713.jar (deleted)
 ...
 ...


was (Author: arkigabor):
We are encountering a similar issue with v3.1.0. Apart from the 
{{kylin-coprocessor-*.jar}} files present in {{/mnt/tmp}}, this is also causing 
a disk space leak. There are a lot of deleted files still referenced by the 
HBase region server process, thus the disk space cannot actually be freed. This 
causes a significant discrepancy between the {{du}} and {{df}} calculations. In 
our case, having run the EMR cluster for Kylin for a few months now, 50+ GB of 
such deleted {{kylin-coprocessor-3.1.0-*.jar}} files are still occupying disk 
space on each core node:
[hadoop@ip-23-0-3-131 mnt]$ sudo lsof| grep "/mnt"  | grep delete | more
java  16611  hbase  666r  REG  259,3  5592785  143780992 
/mnt/tmp/hbase-hbase/local/jars/tmp/.1369857329.kylin-coprocessor-3.1.0-SNAPSHOT-1597.jar
.1602171232484.jar (deleted)
java  16611  hbase  679r  REG  259,3  5592785  144773588 
/mnt/tmp/hbase-hbase/local/jars/tmp/.1170685576.kylin-coprocessor-3.1.0-SNAPSHOT-129.jar.
1602182959858.jar (deleted)
java  16611  hbase  680r  REG  259,3  5592785  144321908 
/mnt/tmp/hbase-hbase/local/jars/tmp/.-1329141342.kylin-coprocessor-3.1.0-SNAPSHOT-3653.ja
r.1602180128061.jar (deleted)
java  16611  hbase  681r  REG  259,3  5592785  144248531 
/mnt/tmp/hbase-hbase/local/jars/tmp/.-832621882.kylin-coprocessor-3.1.0-SNAPSHOT-3651.jar
.1602179699713.jar (deleted)
...
...

> Kylin upgraded to a new version - large numbers of kylin-coprocessor files 
> generated under /mnt/tmp/hbase-hbase/local/jars/tmp
> --
>
> Key: KYLIN-5022
> URL: https://issues.apache.org/jira/browse/KYLIN-5022
> Project: Kylin
>  Issue Type: Bug
>  Components: Metadata, Storage - HBase
>Affects Versions: v3.1.1
>Reporter: star_dev
>Priority: Major
> Attachments: 屏幕快照1.png, 屏幕快照2.png, 日志.log
>
>
> The Kylin version was upgraded from 3.0.2 to 3.1.1, still using the original 
> metadata.
> We found that a large number of kylin-coprocessor files are generated on the 
> EMR core nodes (see attachment 屏幕快照1.png), occupying a lot of space and 
> reducing the free space of the HDFS file system. The path is 
> /mnt/tmp/hbase-hbase/local/jars/tmp
> We consulted the official documentation 
> [http://kylin.apache.org/docs/howto/howto_update_coprocessor.html]
> and running the following command still did not help; see the attached log
> -
>  
> {{$KYLIN_HOME/bin/kylin.sh 
> org.apache.kylin.storage.hbase.util.DeployCoprocessorCLI default all}}
> {{-}}
> We also found a large number of kylin-coprocessor-3.1.1-*.jar files under the 
> Kylin metadata path kylin_metadata/coprocessor/, see attachment 屏幕快照2.png
>  
> What is causing this behavior?
> How can we stop large numbers of files from being generated under the 
> /mnt/tmp/hbase-hbase/local/jars/tmp path?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (KYLIN-5022) Kylin upgraded to a new version - large numbers of kylin-coprocessor files generated under /mnt/tmp/hbase-hbase/local/jars/tmp

2021-07-09 Thread Gabor Arki (Jira)


[ 
https://issues.apache.org/jira/browse/KYLIN-5022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17378143#comment-17378143
 ] 

Gabor Arki edited comment on KYLIN-5022 at 7/9/21, 4:01 PM:


We are encountering a similar issue with v3.1.0. Apart from the 
{{kylin-coprocessor-*.jar}} files present in {{/mnt/tmp}}, this is also causing 
a disk space leak. There are a lot of deleted files still referenced by the 
HBase region server process, thus the disk space cannot actually be freed. This 
causes a significant discrepancy between the {{du}} and {{df}} calculations. In 
our case, having run the EMR cluster for Kylin for a few months now, 50+ GB of 
such deleted {{kylin-coprocessor-3.1.0-*.jar}} files are still occupying disk 
space on each core node:
{code:java}
[hadoop@ip-23-0-3-131 mnt]$ sudo lsof| grep "/mnt" | grep delete | more
 java 16611 hbase 666r REG 259,3 5592785 143780992 
/mnt/tmp/hbase-hbase/local/jars/tmp/.1369857329.kylin-coprocessor-3.1.0-SNAPSHOT-1597.jar
 .1602171232484.jar (deleted)
 java 16611 hbase 679r REG 259,3 5592785 144773588 
/mnt/tmp/hbase-hbase/local/jars/tmp/.1170685576.kylin-coprocessor-3.1.0-SNAPSHOT-129.jar.
 1602182959858.jar (deleted)
 java 16611 hbase 680r REG 259,3 5592785 144321908 
/mnt/tmp/hbase-hbase/local/jars/tmp/.-1329141342.kylin-coprocessor-3.1.0-SNAPSHOT-3653.ja
 r.1602180128061.jar (deleted)
 java 16611 hbase 681r REG 259,3 5592785 144248531 
/mnt/tmp/hbase-hbase/local/jars/tmp/.-832621882.kylin-coprocessor-3.1.0-SNAPSHOT-3651.jar
 .1602179699713.jar (deleted)
...{code}


was (Author: arkigabor):
We are encountering a similar issue with v3.1.0. Apart from the 
{{kylin-coprocessor-*.jar}} files present in {{/mnt/tmp}}, this is also causing 
a disk space leak. There are a lot of deleted files still referenced by the 
HBase region server process, thus the disk space cannot actually be freed. This 
causes a significant discrepancy between the {{du}} and {{df}} calculations. In 
our case, having run the EMR cluster for Kylin for a few months now, 50+ GB of 
such deleted {{kylin-coprocessor-3.1.0-*.jar}} files are still occupying disk 
space on each core node:
 [hadoop@ip-23-0-3-131 mnt]$ sudo lsof| grep "/mnt" | grep delete | more
 java 16611 hbase 666r REG 259,3 5592785 143780992 
/mnt/tmp/hbase-hbase/local/jars/tmp/.1369857329.kylin-coprocessor-3.1.0-SNAPSHOT-1597.jar
 .1602171232484.jar (deleted)
 java 16611 hbase 679r REG 259,3 5592785 144773588 
/mnt/tmp/hbase-hbase/local/jars/tmp/.1170685576.kylin-coprocessor-3.1.0-SNAPSHOT-129.jar.
 1602182959858.jar (deleted)
 java 16611 hbase 680r REG 259,3 5592785 144321908 
/mnt/tmp/hbase-hbase/local/jars/tmp/.-1329141342.kylin-coprocessor-3.1.0-SNAPSHOT-3653.ja
 r.1602180128061.jar (deleted)
 java 16611 hbase 681r REG 259,3 5592785 144248531 
/mnt/tmp/hbase-hbase/local/jars/tmp/.-832621882.kylin-coprocessor-3.1.0-SNAPSHOT-3651.jar
 .1602179699713.jar (deleted)
 ...
 ...

> Kylin upgraded to a new version - large numbers of kylin-coprocessor files 
> generated under /mnt/tmp/hbase-hbase/local/jars/tmp
> --
>
> Key: KYLIN-5022
> URL: https://issues.apache.org/jira/browse/KYLIN-5022
> Project: Kylin
>  Issue Type: Bug
>  Components: Metadata, Storage - HBase
>Affects Versions: v3.1.1
>Reporter: star_dev
>Priority: Major
> Attachments: 屏幕快照1.png, 屏幕快照2.png, 日志.log
>
>
> The Kylin version was upgraded from 3.0.2 to 3.1.1, still using the original 
> metadata.
> We found that a large number of kylin-coprocessor files are generated on the 
> EMR core nodes (see attachment 屏幕快照1.png), occupying a lot of space and 
> reducing the free space of the HDFS file system. The path is 
> /mnt/tmp/hbase-hbase/local/jars/tmp
> We consulted the official documentation 
> [http://kylin.apache.org/docs/howto/howto_update_coprocessor.html]
> and running the following command still did not help; see the attached log
> -
>  
> {{$KYLIN_HOME/bin/kylin.sh 
> org.apache.kylin.storage.hbase.util.DeployCoprocessorCLI default all}}
> {{-}}
> We also found a large number of kylin-coprocessor-3.1.1-*.jar files under the 
> Kylin metadata path kylin_metadata/coprocessor/, see attachment 屏幕快照2.png
>  
> What is causing this behavior?
> How can we stop large numbers of files from being generated under the 
> /mnt/tmp/hbase-hbase/local/jars/tmp path?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KYLIN-5022) Kylin upgraded to a new version - large numbers of kylin-coprocessor files generated under /mnt/tmp/hbase-hbase/local/jars/tmp

2021-07-09 Thread Gabor Arki (Jira)


[ 
https://issues.apache.org/jira/browse/KYLIN-5022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17378143#comment-17378143
 ] 

Gabor Arki commented on KYLIN-5022:
---

We are encountering a similar issue with v3.1.0. Apart from the 
{{kylin-coprocessor-*.jar}} files present in {{/mnt/tmp}}, this is also causing 
a disk space leak. There are a lot of deleted files still referenced by the 
HBase region server process, thus the disk space cannot actually be freed. This 
causes a significant discrepancy between the {{du}} and {{df}} calculations. In 
our case, having run the EMR cluster for Kylin for a few months now, 50+ GB of 
such deleted {{kylin-coprocessor-3.1.0-*.jar}} files are still occupying disk 
space on each core node:
[hadoop@ip-23-0-3-131 mnt]$ sudo lsof| grep "/mnt"  | grep delete | more
java  16611  hbase  666r  REG  259,3  5592785  143780992 
/mnt/tmp/hbase-hbase/local/jars/tmp/.1369857329.kylin-coprocessor-3.1.0-SNAPSHOT-1597.jar
.1602171232484.jar (deleted)
java  16611  hbase  679r  REG  259,3  5592785  144773588 
/mnt/tmp/hbase-hbase/local/jars/tmp/.1170685576.kylin-coprocessor-3.1.0-SNAPSHOT-129.jar.
1602182959858.jar (deleted)
java  16611  hbase  680r  REG  259,3  5592785  144321908 
/mnt/tmp/hbase-hbase/local/jars/tmp/.-1329141342.kylin-coprocessor-3.1.0-SNAPSHOT-3653.ja
r.1602180128061.jar (deleted)
java  16611  hbase  681r  REG  259,3  5592785  144248531 
/mnt/tmp/hbase-hbase/local/jars/tmp/.-832621882.kylin-coprocessor-3.1.0-SNAPSHOT-3651.jar
.1602179699713.jar (deleted)
...
...

> Kylin upgraded to a new version - large numbers of kylin-coprocessor files 
> generated under /mnt/tmp/hbase-hbase/local/jars/tmp
> --
>
> Key: KYLIN-5022
> URL: https://issues.apache.org/jira/browse/KYLIN-5022
> Project: Kylin
>  Issue Type: Bug
>  Components: Metadata, Storage - HBase
>Affects Versions: v3.1.1
>Reporter: star_dev
>Priority: Major
> Attachments: 屏幕快照1.png, 屏幕快照2.png, 日志.log
>
>
> The Kylin version was upgraded from 3.0.2 to 3.1.1, still using the original 
> metadata.
> We found that a large number of kylin-coprocessor files are generated on the 
> EMR core nodes (see attachment 屏幕快照1.png), occupying a lot of space and 
> reducing the free space of the HDFS file system. The path is 
> /mnt/tmp/hbase-hbase/local/jars/tmp
> We consulted the official documentation 
> [http://kylin.apache.org/docs/howto/howto_update_coprocessor.html]
> and running the following command still did not help; see the attached log
> -
>  
> {{$KYLIN_HOME/bin/kylin.sh 
> org.apache.kylin.storage.hbase.util.DeployCoprocessorCLI default all}}
> {{-}}
> We also found a large number of kylin-coprocessor-3.1.1-*.jar files under the 
> Kylin metadata path kylin_metadata/coprocessor/, see attachment 屏幕快照2.png
>  
> What is causing this behavior?
> How can we stop large numbers of files from being generated under the 
> /mnt/tmp/hbase-hbase/local/jars/tmp path?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KYLIN-4689) Deadlock in Kylin job execution

2020-09-29 Thread Gabor Arki (Jira)


[ 
https://issues.apache.org/jira/browse/KYLIN-4689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17203901#comment-17203901
 ] 

Gabor Arki commented on KYLIN-4689:
---

[~xxyu], I might be wrong, but I think this issue is happening due to a race 
condition and not a broken lock:
 * Even though the {{kylin.job.max-concurrent-jobs}} config defines 10, Kylin is 
submitting more stream jobs than this limit (in my experience around 30 with a 
3-cube setup)
 * 30 or so jobs are competing for 10 slots and 1 lock per cube
 * At first, this is not a problem. Let's say *jobX* starts running, acquires 
the lock, and runs the _Build Dimension Dictionaries For Steaming Job_ step
 * But after each step finishes, the jobs compete again for the resources
 * If *jobX* gets one of the 10 slots, everything works as expected: it runs 
the _Save Cube Dictionaries_ step and releases the lock
 * But if *jobX* does not get a slot, it is a deadlock: the other jobs occupy 
all 10 slots while waiting for the lock indefinitely, and *jobX* holds the 
lock while waiting for a free slot indefinitely

Removing the lock of *jobX* at this point fixes the deadlock but re-introduces 
the possibility of running into 
https://issues.apache.org/jira/browse/KYLIN-4165.
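As a toy model of this race (plain {{java.util.concurrent}}, not Kylin code): 
progress requires both a scheduler slot and the cube lock, but the lock is held 
across steps while slots are re-acquired per step, so the two resources end up 
being claimed in opposite orders.

{code:java}
import java.util.concurrent.Semaphore;
import java.util.concurrent.locks.ReentrantLock;

public class SchedulerDeadlockModel {
    static final Semaphore slots = new Semaphore(10); // kylin.job.max-concurrent-jobs
    static final ReentrantLock cubeLock = new ReentrantLock();

    // Most jobs: take a slot, then wait for the cube lock.
    static void ordinaryStep() throws InterruptedException {
        slots.acquire();
        try {
            cubeLock.lock(); // blocks forever while jobX below holds the lock
            try { /* run build step */ } finally { cubeLock.unlock(); }
        } finally {
            slots.release();
        }
    }

    // jobX: keeps the cube lock across steps and re-competes for a slot,
    // inverting the acquisition order. If all 10 slots are held by jobs
    // blocked in ordinaryStep(), this acquire() never returns: deadlock.
    static void jobXNextStep() throws InterruptedException {
        cubeLock.lock();
        try {
            slots.acquire();
            try { /* e.g. Save Cube Dictionaries */ } finally { slots.release(); }
        } finally {
            cubeLock.unlock();
        }
    }
}
{code}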

> Deadlock in Kylin job execution
> ---
>
> Key: KYLIN-4689
> URL: https://issues.apache.org/jira/browse/KYLIN-4689
> Project: Kylin
>  Issue Type: Bug
>  Components: Job Engine
>Affects Versions: v3.0.0, v3.1.0, v3.0.1, v3.0.2
>Reporter: Gabor Arki
>Assignee: Xiaoxiang Yu
>Priority: Critical
> Fix For: v3.1.1
>
>
> h4. Reproduction steps
>  * Install Kylin 3.1.0
>  * Deploy a streaming cube
>  * Enable the cube while historical data is present in the Kafka topic
>  * Note: in our case, we had 3 cubes deployed, each consuming ~20 hourly 
> segments from Kafka when the cubes were enabled
> h4. Expected result
>  * Kylin is starting to process stream segments with stream jobs, eventually 
> processing the older segments and catching up with the stream
> h4. Actual result
>  * A short time after the stream jobs have started (37 successful stream 
> jobs), all jobs are completely stuck without any progress. Some in running 
> state, some in pending state.
>  * The following logs are continuously written:
> {code:java}
> 2020-08-06 06:16:22 INFO  [Scheduler 116797841 Job 
> 12750aea-3b96-c817-64e8-bf893d8c120f-254] MapReduceExecutable:409 - 
> 12750aea-3b96-c817-64e8-bf893d8c120f-00, parent lock 
> path(/cube_job_lock/cube_vm) is locked by other job result is true ,ephemeral 
> lock path :/cube_job_ephemeral_lock/cube_vm is locked by other job result is 
> true,will try after one minute
> 2020-08-06 06:16:33 WARN  [FetcherRunner 787667774-43] FetcherRunner:56 - 
> There are too many jobs running, Job Fetch will wait until next schedule time
> {code}
>  * Zookeeper indicates the following locks are in place:
> {code:java}
> ls /kylin/kylin_metadata/cube_job_ephemeral_lock
> [cube_cm, cube_vm, cube_jm]
> ls /kylin/kylin_metadata/cube_job_ephemeral_lock/cube_cm 
> []
> ls /kylin/kylin_metadata/cube_job_ephemeral_lock/cube_vm
> []
> ls /kylin/kylin_metadata/cube_job_ephemeral_lock/cube_jm
> []
> ls /kylin/kylin_metadata/cube_job_lock
> [cube_cm, cube_vm, cube_jm]
> ls /kylin/kylin_metadata/cube_job_lock/cube_cm
> [f888380e-9ff4-98f5-2df4-1ae71e045f93]
> ls /kylin/kylin_metadata/cube_job_lock/cube_vm
> [fc186bd9-1186-6ed4-e58c-bbbf6dd8ef74]
> ls /kylin/kylin_metadata/cube_job_lock/cube_jm
> [d1a6475a-9ab2-5ee4-6714-f395e20cfc01]
> {code}
>  * The job IDs for the running jobs:
>  ** 169f75fa-a02f-221b-fc48-037bc7a842d0
>  ** 0b5dae1b-6faf-66c5-71dc-86f5b820f1c4
>  ** 00924699-8b51-8091-6e71-34ccfeba3a98
>  ** 4620192a-71e1-16dd-3b05-44d7f9144ad4
>  ** 416355c2-a3d7-57eb-55c6-c042aa256510
>  ** 12750aea-3b96-c817-64e8-bf893d8c120f
>  ** 42819dde-5857-fd6b-b075-439952f47140
>  ** 00128937-bd4a-d6c1-7a4e-744dee946f67
>  ** 46a0233f-217e-9155-725b-c815ad77ba2c
>  ** 062150ba-bacd-6644-4801-3a51b260d1c5
> As you can see, the 10 jobs that are actually running do not possess the 
> locks thus cannot actually do anything (these all were stuck at step Build 
> Dimension Dictionaries For Steaming Job). On the other hand, the 3 jobs 
> possessing the locks cannot resume running because there are already 10 jobs 
> in running state, thus cannot proceed and release the locks. This is a 
> deadlock and the cluster is completely stuck.
> We have been observing this behavior in 3.0.0 (where rolling back 
> https://issues.apache.org/jira/browse/KYLIN-4165 resolved the issue), and now 
> in 3.1.0 as well. It has been originally reported in the comments of 
> https://issues.apache.org/jira/browse/KYLIN-4348 but I'm not sure that it's 
> related to that bug/epic.



--

[jira] [Updated] (KYLIN-4689) Deadlock in Kylin job execution

2020-08-06 Thread Gabor Arki (Jira)


 [ 
https://issues.apache.org/jira/browse/KYLIN-4689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Arki updated KYLIN-4689:
--
Description: 
h4. Reproduction steps
 * Install Kylin 3.1.0
 * Deploy a streaming cube
 * Enable the cube while historical data is present in the Kafka topic
 * Note: in our case, we had 3 cubes deployed, each consuming ~20 hourly 
segments from Kafka when the cubes were enabled

h4. Expected result
 * Kylin is starting to process stream segments with stream jobs, eventually 
processing the older segments and catching up with the stream

h4. Actual result
 * A short time after the stream jobs have started (37 successful stream 
jobs), all jobs are completely stuck without any progress. Some in running 
state, some in pending state.
 * The following logs are continuously written:

{code:java}
2020-08-06 06:16:22 INFO  [Scheduler 116797841 Job 
12750aea-3b96-c817-64e8-bf893d8c120f-254] MapReduceExecutable:409 - 
12750aea-3b96-c817-64e8-bf893d8c120f-00, parent lock 
path(/cube_job_lock/cube_vm) is locked by other job result is true ,ephemeral 
lock path :/cube_job_ephemeral_lock/cube_vm is locked by other job result is 
true,will try after one minute
2020-08-06 06:16:33 WARN  [FetcherRunner 787667774-43] FetcherRunner:56 - There 
are too many jobs running, Job Fetch will wait until next schedule time
{code}
 * Zookeeper indicates the following locks are in place:

{code:java}
ls /kylin/kylin_metadata/cube_job_ephemeral_lock
[cube_cm, cube_vm, cube_jm]
ls /kylin/kylin_metadata/cube_job_ephemeral_lock/cube_cm 
[]
ls /kylin/kylin_metadata/cube_job_ephemeral_lock/cube_vm
[]
ls /kylin/kylin_metadata/cube_job_ephemeral_lock/cube_jm
[]
ls /kylin/kylin_metadata/cube_job_lock
[cube_cm, cube_vm, cube_jm]
ls /kylin/kylin_metadata/cube_job_lock/cube_cm
[f888380e-9ff4-98f5-2df4-1ae71e045f93]
ls /kylin/kylin_metadata/cube_job_lock/cube_vm
[fc186bd9-1186-6ed4-e58c-bbbf6dd8ef74]
ls /kylin/kylin_metadata/cube_job_lock/cube_jm
[d1a6475a-9ab2-5ee4-6714-f395e20cfc01]
{code}
 * The job IDs for the running jobs:
 ** 169f75fa-a02f-221b-fc48-037bc7a842d0
 ** 0b5dae1b-6faf-66c5-71dc-86f5b820f1c4
 ** 00924699-8b51-8091-6e71-34ccfeba3a98
 ** 4620192a-71e1-16dd-3b05-44d7f9144ad4
 ** 416355c2-a3d7-57eb-55c6-c042aa256510
 ** 12750aea-3b96-c817-64e8-bf893d8c120f
 ** 42819dde-5857-fd6b-b075-439952f47140
 ** 00128937-bd4a-d6c1-7a4e-744dee946f67
 ** 46a0233f-217e-9155-725b-c815ad77ba2c
 ** 062150ba-bacd-6644-4801-3a51b260d1c5

As you can see, the 10 jobs that are actually running do not possess the locks 
thus cannot actually do anything (these all were stuck at step Build Dimension 
Dictionaries For Steaming Job). On the other hand, the 3 jobs possessing the 
locks cannot resume running because there are already 10 jobs in running state, 
thus cannot proceed and release the locks. This is a deadlock and the cluster 
is completely stuck.

We have been observing this behavior in 3.0.0 (where rolling back 
https://issues.apache.org/jira/browse/KYLIN-4165 resolved the issue), and now 
in 3.1.0 as well. It has been originally reported in the comments of 
https://issues.apache.org/jira/browse/KYLIN-4348 but I'm not sure that it's 
related to that bug/epic.

  was:
h4. Reproduction steps
 * Install Kylin 3.1.0
 * Deploy a streaming cube
 * Enable the cube while historical data is present in the Kafka topic
 * Note: in our case, we had 3 cubes deployed, each consuming ~20 hourly 
segments from Kafka when the cubes were enabled

h4. Expected result
 * Kylin is starting to process stream segments with stream jobs, eventually 
processing the older segments and catching up with the stream

h4. Actual result
 * After a short time, all jobs are completely stuck without any progress. Some 
in running state, some in pending state.
 * The following logs are continuously written:

{code:java}
2020-08-06 06:16:22 INFO  [Scheduler 116797841 Job 
12750aea-3b96-c817-64e8-bf893d8c120f-254] MapReduceExecutable:409 - 
12750aea-3b96-c817-64e8-bf893d8c120f-00, parent lock 
path(/cube_job_lock/cube_vm) is locked by other job result is true ,ephemeral 
lock path :/cube_job_ephemeral_lock/cube_vm is locked by other job result is 
true,will try after one minute
2020-08-06 06:16:33 WARN  [FetcherRunner 787667774-43] FetcherRunner:56 - There 
are too many jobs running, Job Fetch will wait until next schedule time
{code}
 * Zookeeper indicates the following locks are in place:

{code:java}
ls /kylin/kylin_metadata/cube_job_ephemeral_lock
[cube_cm, cube_vm, cube_jm]
ls /kylin/kylin_metadata/cube_job_ephemeral_lock/cube_cm 
[]
ls /kylin/kylin_metadata/cube_job_ephemeral_lock/cube_vm
[]
ls /kylin/kylin_metadata/cube_job_ephemeral_lock/cube_jm
[]
ls /kylin/kylin_metadata/cube_job_lock
[cube_cm, cube_vm, cube_jm]
ls /kylin/kylin_metadata/cube_job_lock/cube_cm
[f888380e-9ff4-98f5-2df4-1ae71e045f93]
ls /kylin/kylin_metadata/cube_job_lock/cube_vm
[fc186bd9-1186-6ed4-e58c-bbbf6dd8ef74]

[jira] [Updated] (KYLIN-4689) Deadlock in Kylin job execution

2020-08-06 Thread Gabor Arki (Jira)


 [ 
https://issues.apache.org/jira/browse/KYLIN-4689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Arki updated KYLIN-4689:
--
Affects Version/s: v3.0.1
   v3.0.2

> Deadlock in Kylin job execution
> ---
>
> Key: KYLIN-4689
> URL: https://issues.apache.org/jira/browse/KYLIN-4689
> Project: Kylin
>  Issue Type: Bug
>  Components: Job Engine
>Affects Versions: v3.0.0, v3.1.0, v3.0.1, v3.0.2
>Reporter: Gabor Arki
>Priority: Critical
>
> h4. Reproduction steps
>  * Install Kylin 3.1.0
>  * Deploy a streaming cube
>  * Enable the cube while historical data is present in the Kafka topic
>  * Note: in our case, we had 3 cubes deployed, each consuming ~20 hourly 
> segments from Kafka when the cubes were enabled
> h4. Expected result
>  * Kylin is starting to process stream segments with stream jobs, eventually 
> processing the older segments and catching up with the stream
> h4. Actual result
>  * After a short time, all jobs are completely stuck without any progress. 
> Some in running state, some in pending state.
>  * The following logs are continuously written:
> {code:java}
> 2020-08-06 06:16:22 INFO  [Scheduler 116797841 Job 
> 12750aea-3b96-c817-64e8-bf893d8c120f-254] MapReduceExecutable:409 - 
> 12750aea-3b96-c817-64e8-bf893d8c120f-00, parent lock 
> path(/cube_job_lock/cube_vm) is locked by other job result is true ,ephemeral 
> lock path :/cube_job_ephemeral_lock/cube_vm is locked by other job result is 
> true,will try after one minute
> 2020-08-06 06:16:33 WARN  [FetcherRunner 787667774-43] FetcherRunner:56 - 
> There are too many jobs running, Job Fetch will wait until next schedule time
> {code}
>  * Zookeeper indicates the following locks are in place:
> {code:java}
> ls /kylin/kylin_metadata/cube_job_ephemeral_lock
> [cube_cm, cube_vm, cube_jm]
> ls /kylin/kylin_metadata/cube_job_ephemeral_lock/cube_cm 
> []
> ls /kylin/kylin_metadata/cube_job_ephemeral_lock/cube_vm
> []
> ls /kylin/kylin_metadata/cube_job_ephemeral_lock/cube_jm
> []
> ls /kylin/kylin_metadata/cube_job_lock
> [cube_cm, cube_vm, cube_jm]
> ls /kylin/kylin_metadata/cube_job_lock/cube_cm
> [f888380e-9ff4-98f5-2df4-1ae71e045f93]
> ls /kylin/kylin_metadata/cube_job_lock/cube_vm
> [fc186bd9-1186-6ed4-e58c-bbbf6dd8ef74]
> ls /kylin/kylin_metadata/cube_job_lock/cube_jm
> [d1a6475a-9ab2-5ee4-6714-f395e20cfc01]
> {code}
>  * The job IDs for the running jobs:
>  * 
>  ** 169f75fa-a02f-221b-fc48-037bc7a842d0
>  ** 0b5dae1b-6faf-66c5-71dc-86f5b820f1c4
>  ** 00924699-8b51-8091-6e71-34ccfeba3a98
>  ** 4620192a-71e1-16dd-3b05-44d7f9144ad4
>  ** 416355c2-a3d7-57eb-55c6-c042aa256510
>  ** 12750aea-3b96-c817-64e8-bf893d8c120f
>  ** 42819dde-5857-fd6b-b075-439952f47140
>  ** 00128937-bd4a-d6c1-7a4e-744dee946f67
>  ** 46a0233f-217e-9155-725b-c815ad77ba2c
>  ** 062150ba-bacd-6644-4801-3a51b260d1c5
> As you can see, the 10 jobs that are actually running do not possess the 
> locks thus cannot actually do anything. On the other hand, the 3 jobs 
> possessing the locks cannot resume running because there are already 10 jobs 
> in running state, thus cannot proceed and release the locks. This is a 
> deadlock and the cluster is completely stuck.
> We have been observing this behavior in 3.0.0 (where rolling back 
> https://issues.apache.org/jira/browse/KYLIN-4165 resolved the issue), and now 
> in 3.1.0 as well. It has been originally reported in the comments of 
> https://issues.apache.org/jira/browse/KYLIN-4348 but I'm not sure that it's 
> related to that bug/epic.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KYLIN-4689) Deadlock in Kylin job execution

2020-08-06 Thread Gabor Arki (Jira)


 [ 
https://issues.apache.org/jira/browse/KYLIN-4689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Arki updated KYLIN-4689:
--
Description: 
h4. Reproduction steps
 * Install Kylin 3.1.0
 * Deploy a streaming cube
 * Enable the cube while historical data is present in the Kafka topic
 * Note: in our case, we had 3 cubes deployed, each consuming ~20 hourly 
segments from Kafka when the cubes were enabled

h4. Expected result
 * Kylin is starting to process stream segments with stream jobs, eventually 
processing the older segments and catching up with the stream

h4. Actual result
 * After a short time, all jobs are completely stuck without any progress. Some 
in running state, some in pending state.
 * The following logs are continuously written:

{code:java}
2020-08-06 06:16:22 INFO  [Scheduler 116797841 Job 
12750aea-3b96-c817-64e8-bf893d8c120f-254] MapReduceExecutable:409 - 
12750aea-3b96-c817-64e8-bf893d8c120f-00, parent lock 
path(/cube_job_lock/cube_vm) is locked by other job result is true ,ephemeral 
lock path :/cube_job_ephemeral_lock/cube_vm is locked by other job result is 
true,will try after one minute
2020-08-06 06:16:33 WARN  [FetcherRunner 787667774-43] FetcherRunner:56 - There 
are too many jobs running, Job Fetch will wait until next schedule time
{code}
 * Zookeeper indicates the following locks are in place:

{code:java}
ls /kylin/kylin_metadata/cube_job_ephemeral_lock
[cube_cm, cube_vm, cube_jm]
ls /kylin/kylin_metadata/cube_job_ephemeral_lock/cube_cm 
[]
ls /kylin/kylin_metadata/cube_job_ephemeral_lock/cube_vm
[]
ls /kylin/kylin_metadata/cube_job_ephemeral_lock/cube_jm
[]
ls /kylin/kylin_metadata/cube_job_lock
[cube_cm, cube_vm, cube_jm]
ls /kylin/kylin_metadata/cube_job_lock/cube_cm
[f888380e-9ff4-98f5-2df4-1ae71e045f93]
ls /kylin/kylin_metadata/cube_job_lock/cube_vm
[fc186bd9-1186-6ed4-e58c-bbbf6dd8ef74]
ls /kylin/kylin_metadata/cube_job_lock/cube_jm
[d1a6475a-9ab2-5ee4-6714-f395e20cfc01]
{code}
 * The job IDs for the running jobs:

 * 
 ** 169f75fa-a02f-221b-fc48-037bc7a842d0
 ** 0b5dae1b-6faf-66c5-71dc-86f5b820f1c4
 ** 00924699-8b51-8091-6e71-34ccfeba3a98
 ** 4620192a-71e1-16dd-3b05-44d7f9144ad4
 ** 416355c2-a3d7-57eb-55c6-c042aa256510
 ** 12750aea-3b96-c817-64e8-bf893d8c120f
 ** 42819dde-5857-fd6b-b075-439952f47140
 ** 00128937-bd4a-d6c1-7a4e-744dee946f67
 ** 46a0233f-217e-9155-725b-c815ad77ba2c
 ** 062150ba-bacd-6644-4801-3a51b260d1c5

As you can see, the 10 jobs that are actually running do not possess the locks 
thus cannot actually do anything. On the other hand, the 3 jobs possessing the 
locks cannot resume running because there are already 10 jobs in running state, 
thus cannot proceed and release the locks. This is a deadlock and the cluster 
is completely stuck.

We have been observing this behavior in 3.0.0 (where rolling back 
https://issues.apache.org/jira/browse/KYLIN-4165 resolved the issue), and now 
in 3.1.0 as well. It has been originally reported in the comments of 
https://issues.apache.org/jira/browse/KYLIN-4348 but I'm not sure that it's 
related to that bug/epic.

  was:
h4. Reproduction steps
 * Install Kylin 3.1.0
 * Deploy a streaming cube
 * Enable the cube while historical data is present in the Kafka topic
 * Note: in our case, we had 3 cubes deployed, each consuming ~20 hourly 
segments from Kafka when the cubes were enabled

h4. Expected result
 * Kylin is starting to process stream segments with stream jobs, eventually 
processing the older segments and catching up with the stream

h4. Actual result
 * After a short time, all jobs are completely stuck without any progress. Some 
in running state, some in pending state.
 * The following logs are continuously written:

{code:java}
2020-08-06 06:16:22 INFO  [Scheduler 116797841 Job 
12750aea-3b96-c817-64e8-bf893d8c120f-254] MapReduceExecutable:409 - 
12750aea-3b96-c817-64e8-bf893d8c120f-00, parent lock 
path(/cube_job_lock/cube_vm) is locked by other job result is true ,ephemeral 
lock path :/cube_job_ephemeral_lock/cube_vm is locked by other job result is 
true,will try after one minute
2020-08-06 06:16:33 WARN  [FetcherRunner 787667774-43] FetcherRunner:56 - There 
are too many jobs running, Job Fetch will wait until next schedule time
{code}
 * Zookeeper indicates the following locks are in place:

{code:java}
ls /kylin/kylin_metadata/cube_job_ephemeral_lock
[cube_cm, cube_vm, cube_jm]
ls /kylin/kylin_metadata/cube_job_ephemeral_lock/cube_cm 
[]
ls /kylin/kylin_metadata/cube_job_ephemeral_lock/cube_vm
[]
ls /kylin/kylin_metadata/cube_job_ephemeral_lock/cube_jm
[]
ls /kylin/kylin_metadata/cube_job_lock
[cube_cm, cube_vm, cube_jm]
ls /kylin/kylin_metadata/cube_job_lock/cube_cm
[f888380e-9ff4-98f5-2df4-1ae71e045f93]
ls /kylin/kylin_metadata/cube_job_lock/cube_vm
[fc186bd9-1186-6ed4-e58c-bbbf6dd8ef74]
ls /kylin/kylin_metadata/cube_job_lock/cube_jm
[d1a6475a-9ab2-5ee4-6714-f395e20cfc01]
{code}
 * The job IDs for the running jobs:

 * 
 

[jira] [Updated] (KYLIN-4689) Deadlock in Kylin job execution

2020-08-06 Thread Gabor Arki (Jira)


 [ 
https://issues.apache.org/jira/browse/KYLIN-4689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Arki updated KYLIN-4689:
--
Affects Version/s: v3.0.0
   v3.1.0

> Deadlock in Kylin job execution
> ---
>
> Key: KYLIN-4689
> URL: https://issues.apache.org/jira/browse/KYLIN-4689
> Project: Kylin
>  Issue Type: Bug
>  Components: Job Engine
>Affects Versions: v3.0.0, v3.1.0
>Reporter: Gabor Arki
>Priority: Critical
>
> h4. Reproduction steps
>  * Install Kylin 3.1.0
>  * Deploy a streaming cube
>  * Enable the cube while historical data is present in the Kafka topic
>  * Note: in our case, we had 3 cubes deployed, each consuming ~20 hourly 
> segments from Kafka when the cubes were enabled
> h4. Expected result
>  * Kylin is starting to process stream segments with stream jobs, eventually 
> processing the older segments and catching up with the stream
> h4. Actual result
>  * After a short time, all jobs are completely stuck without any progress. 
> Some in running state, some in pending state.
>  * The following logs are continuously written:
> {code:java}
> 2020-08-06 06:16:22 INFO  [Scheduler 116797841 Job 
> 12750aea-3b96-c817-64e8-bf893d8c120f-254] MapReduceExecutable:409 - 
> 12750aea-3b96-c817-64e8-bf893d8c120f-00, parent lock 
> path(/cube_job_lock/cube_vm) is locked by other job result is true ,ephemeral 
> lock path :/cube_job_ephemeral_lock/cube_vm is locked by other job result is 
> true,will try after one minute
> 2020-08-06 06:16:33 WARN  [FetcherRunner 787667774-43] FetcherRunner:56 - 
> There are too many jobs running, Job Fetch will wait until next schedule time
> {code}
>  * Zookeeper indicates the following locks are in place:
> {code:java}
> ls /kylin/kylin_metadata/cube_job_ephemeral_lock
> [cube_cm, cube_vm, cube_jm]
> ls /kylin/kylin_metadata/cube_job_ephemeral_lock/cube_cm 
> []
> ls /kylin/kylin_metadata/cube_job_ephemeral_lock/cube_vm
> []
> ls /kylin/kylin_metadata/cube_job_ephemeral_lock/cube_jm
> []
> ls /kylin/kylin_metadata/cube_job_lock
> [cube_cm, cube_vm, cube_jm]
> ls /kylin/kylin_metadata/cube_job_lock/cube_cm
> [f888380e-9ff4-98f5-2df4-1ae71e045f93]
> ls /kylin/kylin_metadata/cube_job_lock/cube_vm
> [fc186bd9-1186-6ed4-e58c-bbbf6dd8ef74]
> ls /kylin/kylin_metadata/cube_job_lock/cube_jm
> [d1a6475a-9ab2-5ee4-6714-f395e20cfc01]
> {code}
>  * The job IDs for the running jobs:
>  * 
>  ** 169f75fa-a02f-221b-fc48-037bc7a842d0
>  ** 0b5dae1b-6faf-66c5-71dc-86f5b820f1c4
>  ** 00924699-8b51-8091-6e71-34ccfeba3a98
>  ** 4620192a-71e1-16dd-3b05-44d7f9144ad4
>  ** 416355c2-a3d7-57eb-55c6-c042aa256510
>  ** 12750aea-3b96-c817-64e8-bf893d8c120f
>  ** 42819dde-5857-fd6b-b075-439952f47140
>  ** 00128937-bd4a-d6c1-7a4e-744dee946f67
>  ** 46a0233f-217e-9155-725b-c815ad77ba2c
>  ** 062150ba-bacd-6644-4801-3a51b260d1c5
> As you can see, the 10 jobs that are actually running do not possess the 
> locks thus cannot actually do anything. On the other hand, the 3 jobs 
> possessing the locks cannot resume running because there are already 10 jobs 
> in running state, thus cannot proceed and release the locks. This is a 
> deadlock and the cluster is completely stuck.
> We have been observing this behavior in 3.0.0 (where rolling back 
> https://issues.apache.org/jira/browse/KYLIN-4165 resolved the issue), and now 
> in 3.1.0 as well. It has been originally reported in the comments of 
> https://issues.apache.org/jira/browse/KYLIN-4348.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KYLIN-4689) Deadlock in Kylin job execution

2020-08-06 Thread Gabor Arki (Jira)


 [ 
https://issues.apache.org/jira/browse/KYLIN-4689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Arki updated KYLIN-4689:
--
Description: 
h4. Reproduction steps
 * Install Kylin 3.1.0
 * Deploy a streaming cube
 * Enable the cube while historical data is present in the Kafka topic
 * Note: in our case, we had 3 cubes deployed, each consuming ~20 hourly 
segments from Kafka when the cubes were enabled

h4. Expected result
 * Kylin is starting to process stream segments with stream jobs, eventually 
processing the older segments and catching up with the stream

h4. Actual result
 * After a short time, all jobs are completely stuck without any progress. Some 
in running state, some in pending state.
 * The following logs are continuously written:

{code:java}
2020-08-06 06:16:22 INFO  [Scheduler 116797841 Job 
12750aea-3b96-c817-64e8-bf893d8c120f-254] MapReduceExecutable:409 - 
12750aea-3b96-c817-64e8-bf893d8c120f-00, parent lock 
path(/cube_job_lock/cube_vm) is locked by other job result is true ,ephemeral 
lock path :/cube_job_ephemeral_lock/cube_vm is locked by other job result is 
true,will try after one minute
2020-08-06 06:16:33 WARN  [FetcherRunner 787667774-43] FetcherRunner:56 - There 
are too many jobs running, Job Fetch will wait until next schedule time
{code}
 * Zookeeper indicates the following locks are in place:

{code:java}
ls /kylin/kylin_metadata/cube_job_ephemeral_lock
[cube_cm, cube_vm, cube_jm]
ls /kylin/kylin_metadata/cube_job_ephemeral_lock/cube_cm 
[]
ls /kylin/kylin_metadata/cube_job_ephemeral_lock/cube_vm
[]
ls /kylin/kylin_metadata/cube_job_ephemeral_lock/cube_jm
[]
ls /kylin/kylin_metadata/cube_job_lock
[cube_cm, cube_vm, cube_jm]
ls /kylin/kylin_metadata/cube_job_lock/cube_cm
[f888380e-9ff4-98f5-2df4-1ae71e045f93]
ls /kylin/kylin_metadata/cube_job_lock/cube_vm
[fc186bd9-1186-6ed4-e58c-bbbf6dd8ef74]
ls /kylin/kylin_metadata/cube_job_lock/cube_jm
[d1a6475a-9ab2-5ee4-6714-f395e20cfc01]
{code}
 * The job IDs for the running jobs:

 * 
 ** 169f75fa-a02f-221b-fc48-037bc7a842d0
 ** 0b5dae1b-6faf-66c5-71dc-86f5b820f1c4
 ** 00924699-8b51-8091-6e71-34ccfeba3a98
 ** 4620192a-71e1-16dd-3b05-44d7f9144ad4
 ** 416355c2-a3d7-57eb-55c6-c042aa256510
 ** 12750aea-3b96-c817-64e8-bf893d8c120f
 ** 42819dde-5857-fd6b-b075-439952f47140
 ** 00128937-bd4a-d6c1-7a4e-744dee946f67
 ** 46a0233f-217e-9155-725b-c815ad77ba2c
 ** 062150ba-bacd-6644-4801-3a51b260d1c5

As you can see, the 10 jobs that are actually running do not possess the locks 
thus cannot actually do anything. On the other hand, the 3 jobs possessing the 
locks cannot resume running because there are already 10 jobs in running state, 
thus cannot proceed and release the locks. This is a deadlock that leaves the 
cluster completely stuck.

We have been observing this behavior in 3.0.0 (where rolling back 
https://issues.apache.org/jira/browse/KYLIN-4165 resolved the issue), and now 
in 3.1.0 as well. It has been originally reported in the comments of 
https://issues.apache.org/jira/browse/KYLIN-4348.

  was:
h4. Reproduction steps
 * Install Kylin 3.1.0
 * Deploy a streaming cube
 * Enable the cube while historical data is present in the Kafka topic
 * Note: in our case, we had 3 cubes deployed, each consuming ~20 hourly 
segments from Kafka when the cubes were enabled

h4. Expected result
 * Kylin is starting to process stream segments with stream jobs, eventually 
processing the older segments and catching up with the stream

h4. Actual result
 * After a short time, all jobs are completely stuck without any progress. Some 
in running state, some in pending state.
 * The following logs are continuously written:

{code:java}
2020-08-06 06:16:22 INFO  [Scheduler 116797841 Job 
12750aea-3b96-c817-64e8-bf893d8c120f-254] MapReduceExecutable:409 - 
12750aea-3b96-c817-64e8-bf893d8c120f-00, parent lock 
path(/cube_job_lock/cube_vm) is locked by other job result is true ,ephemeral 
lock path :/cube_job_ephemeral_lock/cube_vm is locked by other job result is 
true,will try after one minute
2020-08-06 06:16:33 WARN  [FetcherRunner 787667774-43] FetcherRunner:56 - There 
are too many jobs running, Job Fetch will wait until next schedule time
{code}
 * Zookeeper indicates the following locks are in place:

{code:java}
ls /kylin/kylin_metadata/cube_job_ephemeral_lock
[cube_cm, cube_vm, cube_jm]
ls /kylin/kylin_metadata/cube_job_ephemeral_lock/cube_cm 
[]
ls /kylin/kylin_metadata/cube_job_ephemeral_lock/cube_vm
[]
ls /kylin/kylin_metadata/cube_job_ephemeral_lock/cube_jm
[]
ls /kylin/kylin_metadata/cube_job_lock
[cube_cm, cube_vm, cube_jm]
ls /kylin/kylin_metadata/cube_job_lock/cube_cm
[f888380e-9ff4-98f5-2df4-1ae71e045f93]
ls /kylin/kylin_metadata/cube_job_lock/cube_vm
[fc186bd9-1186-6ed4-e58c-bbbf6dd8ef74]
ls /kylin/kylin_metadata/cube_job_lock/cube_jm
[d1a6475a-9ab2-5ee4-6714-f395e20cfc01]
{code}
 * The job IDs for the running jobs:

 * 
 ** 169f75fa-a02f-221b-fc48-037bc7a842d0
 ** 

[jira] [Updated] (KYLIN-4689) Deadlock in Kylin job execution

2020-08-06 Thread Gabor Arki (Jira)


 [ 
https://issues.apache.org/jira/browse/KYLIN-4689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Arki updated KYLIN-4689:
--
Description: 
h4. Reproduction steps
 * Install Kylin 3.1.0
 * Deploy a streaming cube
 * Enable the cube while historical data is present in the Kafka topic
 * Note: in our case, we had 3 cubes deployed, each consuming ~20 hourly 
segments from Kafka when the cubes were enabled

h4. Expected result
 * Kylin is starting to process stream segments with stream jobs, eventually 
processing the older segments and catching up with the stream

h4. Actual result
 * After a short time, all jobs are completely stuck without any progress. Some 
in running state, some in pending state.
 * The following logs are continuously written:

{code:java}
2020-08-06 06:16:22 INFO  [Scheduler 116797841 Job 
12750aea-3b96-c817-64e8-bf893d8c120f-254] MapReduceExecutable:409 - 
12750aea-3b96-c817-64e8-bf893d8c120f-00, parent lock 
path(/cube_job_lock/cube_vm) is locked by other job result is true ,ephemeral 
lock path :/cube_job_ephemeral_lock/cube_vm is locked by other job result is 
true,will try after one minute
2020-08-06 06:16:33 WARN  [FetcherRunner 787667774-43] FetcherRunner:56 - There 
are too many jobs running, Job Fetch will wait until next schedule time
{code}
 * Zookeeper indicates the following locks are in place:

{code:java}
ls /kylin/kylin_metadata/cube_job_ephemeral_lock
[cube_cm, cube_vm, cube_jm]
ls /kylin/kylin_metadata/cube_job_ephemeral_lock/cube_cm 
[]
ls /kylin/kylin_metadata/cube_job_ephemeral_lock/cube_vm
[]
ls /kylin/kylin_metadata/cube_job_ephemeral_lock/cube_jm
[]
ls /kylin/kylin_metadata/cube_job_lock
[cube_cm, cube_vm, cube_jm]
ls /kylin/kylin_metadata/cube_job_lock/cube_cm
[f888380e-9ff4-98f5-2df4-1ae71e045f93]
ls /kylin/kylin_metadata/cube_job_lock/cube_vm
[fc186bd9-1186-6ed4-e58c-bbbf6dd8ef74]
ls /kylin/kylin_metadata/cube_job_lock/cube_jm
[d1a6475a-9ab2-5ee4-6714-f395e20cfc01]
{code}
 * The job IDs for the running jobs:

 * 
 ** 169f75fa-a02f-221b-fc48-037bc7a842d0
 ** 0b5dae1b-6faf-66c5-71dc-86f5b820f1c4
 ** 00924699-8b51-8091-6e71-34ccfeba3a98
 ** 4620192a-71e1-16dd-3b05-44d7f9144ad4
 ** 416355c2-a3d7-57eb-55c6-c042aa256510
 ** 12750aea-3b96-c817-64e8-bf893d8c120f
 ** 42819dde-5857-fd6b-b075-439952f47140
 ** 00128937-bd4a-d6c1-7a4e-744dee946f67
 ** 46a0233f-217e-9155-725b-c815ad77ba2c
 ** 062150ba-bacd-6644-4801-3a51b260d1c5

As you can see, the 10 jobs that are actually running do not possess the locks 
and thus cannot make any progress. On the other hand, the 3 jobs holding the 
locks cannot resume running because there are already 10 jobs in running state, 
so they cannot proceed and release the locks. This is a deadlock, and the 
cluster is completely stuck.
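
To make the cycle concrete, here is a minimal, self-contained model of the two 
wait conditions (an illustrative sketch with made-up names such as 
MAX_RUNNING_JOBS, not Kylin's actual scheduler code):

{code:java}
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model of the reported cycle: a fixed quota of running slots plus a
// per-cube lock, standing in for Kylin's job quota and the Zookeeper locks
// under /cube_job_lock. All names are illustrative.
public class DeadlockModel {
    static final int MAX_RUNNING_JOBS = 10;

    public static void main(String[] args) {
        // 10 jobs occupy every running slot, but none of them holds a cube lock.
        Set<String> runningJobs = new HashSet<>();
        for (int i = 0; i < MAX_RUNNING_JOBS; i++) {
            runningJobs.add("running-job-" + i);
        }

        // 3 pending jobs hold the per-cube locks.
        Map<String, String> lockOwner = new HashMap<>();
        lockOwner.put("cube_cm", "pending-job-a");
        lockOwner.put("cube_vm", "pending-job-b");
        lockOwner.put("cube_jm", "pending-job-c");

        // Wait condition 1: no running job owns a lock, so none can progress.
        boolean runningJobHoldsLock = runningJobs.stream()
                .anyMatch(lockOwner::containsValue);

        // Wait condition 2: the quota is exhausted, so no lock holder can start.
        boolean freeSlotAvailable = runningJobs.size() < MAX_RUNNING_JOBS;

        if (!runningJobHoldsLock && !freeSlotAvailable) {
            System.out.println("Deadlock: running jobs wait for locks held by"
                    + " pending jobs; pending jobs wait for a free running slot.");
        }
    }
}
{code}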

We have been observing this behavior in 3.0.0 (where rolling back 
https://issues.apache.org/jira/browse/KYLIN-4165 resolved the issue), and now 
in 3.1.0 as well. It was originally reported in the comments of 
https://issues.apache.org/jira/browse/KYLIN-4348.

  was:
h4. Reproduction steps
 * Install Kylin 3.1.0
 * Deploy a streaming cube
 * Enable the cube having historical data present in the Kafka topic
 * Note: in our case, we had 3 cubes deployed, each consuming ~20 hourly 
segments from Kafka when the cubes were enabled

h4. Expected result
 * Kylin starts processing stream segments with stream jobs, eventually 
processing the older segments and catching up with the stream

h4. Actual result
 * After a short time, all jobs are completely stuck without any progress, some 
in running state, some in pending state.
 * The following logs are continuously written:

{code:java}
2020-08-06 06:16:22 INFO  [Scheduler 116797841 Job 
12750aea-3b96-c817-64e8-bf893d8c120f-254] MapReduceExecutable:409 - 
12750aea-3b96-c817-64e8-bf893d8c120f-00, parent lock 
path(/cube_job_lock/cube_vm) is locked by other job result is true ,ephemeral 
lock path :/cube_job_ephemeral_lock/cube_vm is locked by other job result is 
true,will try after one minute
2020-08-06 06:16:33 WARN  [FetcherRunner 787667774-43] FetcherRunner:56 - There 
are too many jobs running, Job Fetch will wait until next schedule time
{code}
 * Zookeeper indicates the following locks are in place:

{code:java}
ls /kylin/kylin_metadata/cube_job_ephemeral_lock
[cube_cm, cube_vm, cube_jm]
ls /kylin/kylin_metadata/cube_job_ephemeral_lock/cube_cm 
[]
ls /kylin/kylin_metadata/cube_job_ephemeral_lock/cube_vm
[]
ls /kylin/kylin_metadata/cube_job_ephemeral_lock/cube_jm
[]
ls /kylin/kylin_metadata/cube_job_lock
[cube_cm, cube_vm, cube_jm]
ls /kylin/kylin_metadata/cube_job_lock/cube_cm
[f888380e-9ff4-98f5-2df4-1ae71e045f93]
ls /kylin/kylin_metadata/cube_job_lock/cube_vm
[fc186bd9-1186-6ed4-e58c-bbbf6dd8ef74]
ls /kylin/kylin_metadata/cube_job_lock/cube_jm
[d1a6475a-9ab2-5ee4-6714-f395e20cfc01]
{code}
 * The job IDs for the running jobs:

 * 
 ** 169f75fa-a02f-221b-fc48-037bc7a842d0
 ** 

[jira] [Updated] (KYLIN-4689) Deadlock in Kylin job execution

2020-08-06 Thread Gabor Arki (Jira)


 [ 
https://issues.apache.org/jira/browse/KYLIN-4689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Arki updated KYLIN-4689:
--
Description: 
h4. Reproduction steps
 * Install Kylin 3.1.0
 * Deploy a streaming cube
 * Enable the cube having historical data present in the Kafka topic
 * Note: in our case, we had 3 cubes deployed, each consuming ~20 hourly 
segments from Kafka when the cubes were enabled

h4. Expected result
 * Kylin starts processing stream segments with stream jobs, eventually 
processing the older segments and catching up with the stream

h4. Actual result
 * After a short time, all jobs are completely stuck without any progress, some 
in running state, some in pending state.
 * The following logs are continuously written:

{code:java}
2020-08-06 06:16:22 INFO  [Scheduler 116797841 Job 
12750aea-3b96-c817-64e8-bf893d8c120f-254] MapReduceExecutable:409 - 
12750aea-3b96-c817-64e8-bf893d8c120f-00, parent lock 
path(/cube_job_lock/cube_vm) is locked by other job result is true ,ephemeral 
lock path :/cube_job_ephemeral_lock/cube_vm is locked by other job result is 
true,will try after one minute
2020-08-06 06:16:33 WARN  [FetcherRunner 787667774-43] FetcherRunner:56 - There 
are too many jobs running, Job Fetch will wait until next schedule time
{code}
 * Zookeeper indicates the following locks are in place:

{code:java}
ls /kylin/kylin_metadata/cube_job_ephemeral_lock
[cube_cm, cube_vm, cube_jm]
ls /kylin/kylin_metadata/cube_job_ephemeral_lock/cube_cm 
[]
ls /kylin/kylin_metadata/cube_job_ephemeral_lock/cube_vm
[]
ls /kylin/kylin_metadata/cube_job_ephemeral_lock/cube_jm
[]
ls /kylin/kylin_metadata/cube_job_lock
[cube_cm, cube_vm, cube_jm]
ls /kylin/kylin_metadata/cube_job_lock/cube_cm
[f888380e-9ff4-98f5-2df4-1ae71e045f93]
ls /kylin/kylin_metadata/cube_job_lock/cube_vm
[fc186bd9-1186-6ed4-e58c-bbbf6dd8ef74]
ls /kylin/kylin_metadata/cube_job_lock/cube_jm
[d1a6475a-9ab2-5ee4-6714-f395e20cfc01]
{code}
 * The job IDs for the running jobs:

 * 
 ** 169f75fa-a02f-221b-fc48-037bc7a842d0
 ** 0b5dae1b-6faf-66c5-71dc-86f5b820f1c4
 ** 00924699-8b51-8091-6e71-34ccfeba3a98
 ** 4620192a-71e1-16dd-3b05-44d7f9144ad4
 ** 416355c2-a3d7-57eb-55c6-c042aa256510
 ** 12750aea-3b96-c817-64e8-bf893d8c120f
 ** 42819dde-5857-fd6b-b075-439952f47140
 ** 00128937-bd4a-d6c1-7a4e-744dee946f67
 ** 46a0233f-217e-9155-725b-c815ad77ba2c
 ** 062150ba-bacd-6644-4801-3a51b260d1c5

As you can see, the 10 jobs that are actually running do not possess the locks 
and thus cannot make any progress. On the other hand, the 3 jobs holding the 
locks cannot resume running because there are already 10 jobs in running state, 
so they cannot proceed and release the locks. This is a deadlock, and the 
cluster is completely stuck.

We have been observing this behavior in 3.0.0 (where rolling back 
https://issues.apache.org/jira/browse/KYLIN-4165 resolved the issue), and now 
in 3.1.0 as well. It was originally reported in the comments of 
https://issues.apache.org/jira/browse/KYLIN-4348.

  was:
h4. Reproduction steps
 * Install Kylin 3.1.0
 * Deploy a streaming cube
 * Enable the cube having historical data present in the Kafka topic
 * Note: in our case, we had 3 cubes deployed, each consuming ~20 hourly 
segments from Kafka when the cubes were enabled

h4. Expected result
 * Kylin starts processing stream segments with stream jobs, eventually 
processing the older segments and catching up with the stream

h4. Actual result
 * After a short time, all jobs are completely stuck without any progress, some 
in running state, some in pending state.
 * The following logs are continuously written:

{code:java}
2020-08-06 06:16:22 INFO  [Scheduler 116797841 Job 
12750aea-3b96-c817-64e8-bf893d8c120f-254] MapReduceExecutable:409 - 
12750aea-3b96-c817-64e8-bf893d8c120f-00, parent lock 
path(/cube_job_lock/cube_vm) is locked by other job result is true ,ephemeral 
lock path :/cube_job_ephemeral_lock/cube_vm is locked by other job result is 
true,will try after one minute
2020-08-06 06:16:33 WARN  [FetcherRunner 787667774-43] FetcherRunner:56 - There 
are too many jobs running, Job Fetch will wait until next schedule time
{code}
 * Zookeeper indicates the following locks are in place:

{code:java}
ls /kylin/kylin_metadata/cube_job_ephemeral_lock
[cube_cm, cube_vm, cube_jm]
ls /kylin/kylin_metadata/cube_job_ephemeral_lock/cube_cm 
[]
ls /kylin/kylin_metadata/cube_job_ephemeral_lock/cube_vm
[]
ls /kylin/kylin_metadata/cube_job_ephemeral_lock/cube_jm
[]
ls /kylin/kylin_metadata/cube_job_lock
[cube_cm, cube_vm, cube_jm]
ls /kylin/kylin_metadata/cube_job_lock/cube_cm
[f888380e-9ff4-98f5-2df4-1ae71e045f93]
ls /kylin/kylin_metadata/cube_job_lock/cube_vm
[fc186bd9-1186-6ed4-e58c-bbbf6dd8ef74]
ls /kylin/kylin_metadata/cube_job_lock/cube_jm
[d1a6475a-9ab2-5ee4-6714-f395e20cfc01]
{code}
 * The job IDs for the running jobs:

 * 
 ** 169f75fa-a02f-221b-fc48-037bc7a842d0
 ** 

[jira] [Updated] (KYLIN-4689) Deadlock in Kylin job execution

2020-08-06 Thread Gabor Arki (Jira)


 [ 
https://issues.apache.org/jira/browse/KYLIN-4689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Arki updated KYLIN-4689:
--
Description: 
h4. Reproduction steps
 * Install Kylin 3.1.0
 * Deploy a streaming cube
 * Enable the cube having historical data present in the Kafka topic
 * Note: in our case, we had 3 cubes deployed, each consuming ~20 hourly 
segments from Kafka when the cubes were enabled

h4. Expected result
 * Kylin starts processing stream segments with stream jobs, eventually 
processing the older segments and catching up with the stream

h4. Actual result
 * After a short time, all jobs are completely stuck without any progress, some 
in running state, some in pending state.
 * The following logs are continuously written:

{code:java}
2020-08-06 06:16:22 INFO  [Scheduler 116797841 Job 
12750aea-3b96-c817-64e8-bf893d8c120f-254] MapReduceExecutable:409 - 
12750aea-3b96-c817-64e8-bf893d8c120f-00, parent lock 
path(/cube_job_lock/cube_vm) is locked by other job result is true ,ephemeral 
lock path :/cube_job_ephemeral_lock/cube_vm is locked by other job result is 
true,will try after one minute
2020-08-06 06:16:33 WARN  [FetcherRunner 787667774-43] FetcherRunner:56 - There 
are too many jobs running, Job Fetch will wait until next schedule time
{code}
 * Zookeeper indicates the following locks are in place:

{code:java}
ls /kylin/kylin_metadata/cube_job_ephemeral_lock
[cube_cm, cube_vm, cube_jm]
ls /kylin/kylin_metadata/cube_job_ephemeral_lock/cube_cm 
[]
ls /kylin/kylin_metadata/cube_job_ephemeral_lock/cube_vm
[]
ls /kylin/kylin_metadata/cube_job_ephemeral_lock/cube_jm
[]
ls /kylin/kylin_metadata/cube_job_lock
[cube_cm, cube_vm, cube_jm]
ls /kylin/kylin_metadata/cube_job_lock/cube_cm
[f888380e-9ff4-98f5-2df4-1ae71e045f93]
ls /kylin/kylin_metadata/cube_job_lock/cube_vm
[fc186bd9-1186-6ed4-e58c-bbbf6dd8ef74]
ls /kylin/kylin_metadata/cube_job_lock/cube_jm
[d1a6475a-9ab2-5ee4-6714-f395e20cfc01]
{code}
 * The job IDs for the running jobs:

 * 
 ** 169f75fa-a02f-221b-fc48-037bc7a842d0
 ** 0b5dae1b-6faf-66c5-71dc-86f5b820f1c4
 ** 00924699-8b51-8091-6e71-34ccfeba3a98
 ** 4620192a-71e1-16dd-3b05-44d7f9144ad4
 ** 416355c2-a3d7-57eb-55c6-c042aa256510
 ** 12750aea-3b96-c817-64e8-bf893d8c120f
 ** 42819dde-5857-fd6b-b075-439952f47140
 ** 00128937-bd4a-d6c1-7a4e-744dee946f67
 ** 46a0233f-217e-9155-725b-c815ad77ba2c
 ** 062150ba-bacd-6644-4801-3a51b260d1c5

As you can see, the 10 jobs that are actually running do not possess the locks 
and thus cannot make any progress. On the other hand, the 3 jobs holding the 
locks are not running, so they cannot proceed and release them. This is a 
deadlock that leaves the cluster completely stuck.

We have been observing this behavior in 3.0.0 (where rolling back 
https://issues.apache.org/jira/browse/KYLIN-4165 resolved the issue), and now 
in 3.1.0 as well. It was originally reported in the comments of 
https://issues.apache.org/jira/browse/KYLIN-4348.

  was:
h4. Reproduction steps
 * Install Kylin 3.1.0
 * Deploy a streaming cube
 * Enable the cube having historical data present in the Kafka topic
 * Note: in our case, we had 3 cubes deployed, each consuming ~20 hourly 
segments from Kafka when the cubes were enabled

h4. Expected result
 * Kylin starts processing stream segments with stream jobs, eventually 
processing the older segments and catching up with the stream

h4. Actual result
 * After a short time, all jobs are completely stuck without any progress, some 
in running state, some in pending state.
 * The following logs are continuously written:

{code:java}
2020-08-06 06:16:22 INFO  [Scheduler 116797841 Job 
12750aea-3b96-c817-64e8-bf893d8c120f-254] MapReduceExecutable:409 - 
12750aea-3b96-c817-64e8-bf893d8c120f-00, parent lock 
path(/cube_job_lock/cube_vm) is locked by other job result is true ,ephemeral 
lock path :/cube_job_ephemeral_lock/cube_vm is locked by other job result is 
true,will try after one minute
2020-08-06 06:16:33 WARN  [FetcherRunner 787667774-43] FetcherRunner:56 - There 
are too many jobs running, Job Fetch will wait until next schedule time
{code}
 * Zookeeper indicates the following locks are in place:

{code:java}
ls /kylin/kylin_metadata/cube_job_ephemeral_lock
[cube_cm, cube_vm, cube_jm]
ls /kylin/kylin_metadata/cube_job_ephemeral_lock/cube_cm 
[]
ls /kylin/kylin_metadata/cube_job_ephemeral_lock/cube_vm
[]
ls /kylin/kylin_metadata/cube_job_ephemeral_lock/cube_jm
[]
ls /kylin/kylin_metadata/cube_job_lock
[cube_cm, cube_vm, cube_jm]
ls /kylin/kylin_metadata/cube_job_lock/cube_cm
[f888380e-9ff4-98f5-2df4-1ae71e045f93]
ls /kylin/kylin_metadata/cube_job_lock/cube_vm
[fc186bd9-1186-6ed4-e58c-bbbf6dd8ef74]
ls /kylin/kylin_metadata/cube_job_lock/cube_jm
[d1a6475a-9ab2-5ee4-6714-f395e20cfc01]
{code}
 * The job IDs for the 10 running jobs, none of which shows any progress:

 * 
 ** 169f75fa-a02f-221b-fc48-037bc7a842d0
 ** 

[jira] [Updated] (KYLIN-4689) Deadlock in Kylin job execution

2020-08-06 Thread Gabor Arki (Jira)


 [ 
https://issues.apache.org/jira/browse/KYLIN-4689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Arki updated KYLIN-4689:
--
Description: 
h4. Reproduction steps
 * Install Kylin 3.1.0
 * Deploy a streaming cube
 * Enable the cube having historical data present in the Kafka topic
 * Note: in our case, we had 3 cubes deployed, each consuming ~20 hourly 
segments from Kafka when the cubes were enabled

h4. Expected result
 * Kylin starts processing stream segments with stream jobs, eventually 
processing the older segments and catching up with the stream

h4. Actual result
 * After a short time, all jobs are completely stuck without any progress, some 
in running state, some in pending state.
 * The following logs are continuously written:

{code:java}
2020-08-06 06:16:22 INFO  [Scheduler 116797841 Job 
12750aea-3b96-c817-64e8-bf893d8c120f-254] MapReduceExecutable:409 - 
12750aea-3b96-c817-64e8-bf893d8c120f-00, parent lock 
path(/cube_job_lock/cube_vm) is locked by other job result is true ,ephemeral 
lock path :/cube_job_ephemeral_lock/cube_vm is locked by other job result is 
true,will try after one minute
2020-08-06 06:16:33 WARN  [FetcherRunner 787667774-43] FetcherRunner:56 - There 
are too many jobs running, Job Fetch will wait until next schedule time
{code}
 * Zookeeper indicates the following locks are in place:

{code:java}
ls /kylin/kylin_metadata/cube_job_ephemeral_lock
[cube_cm, cube_vm, cube_jm]
ls /kylin/kylin_metadata/cube_job_ephemeral_lock/cube_cm 
[]
ls /kylin/kylin_metadata/cube_job_ephemeral_lock/cube_vm
[]
ls /kylin/kylin_metadata/cube_job_ephemeral_lock/cube_jm
[]
ls /kylin/kylin_metadata/cube_job_lock
[cube_cm, cube_vm, cube_jm]
ls /kylin/kylin_metadata/cube_job_lock/cube_cm
[f888380e-9ff4-98f5-2df4-1ae71e045f93]
ls /kylin/kylin_metadata/cube_job_lock/cube_vm
[fc186bd9-1186-6ed4-e58c-bbbf6dd8ef74]
ls /kylin/kylin_metadata/cube_job_lock/cube_jm
[d1a6475a-9ab2-5ee4-6714-f395e20cfc01]
{code}
 * The job IDs for the 10 running jobs, none of which shows any progress:

 * 
 ** 169f75fa-a02f-221b-fc48-037bc7a842d0
 ** 0b5dae1b-6faf-66c5-71dc-86f5b820f1c4
 ** 00924699-8b51-8091-6e71-34ccfeba3a98
 ** 4620192a-71e1-16dd-3b05-44d7f9144ad4
 ** 416355c2-a3d7-57eb-55c6-c042aa256510
 ** 12750aea-3b96-c817-64e8-bf893d8c120f
 ** 42819dde-5857-fd6b-b075-439952f47140
 ** 00128937-bd4a-d6c1-7a4e-744dee946f67
 ** 46a0233f-217e-9155-725b-c815ad77ba2c
 ** 062150ba-bacd-6644-4801-3a51b260d1c5

As you can see, the 10 jobs that are actually running do not possess the locks 
and thus cannot make any progress. On the other hand, the 3 jobs holding the 
locks are not running, so they cannot proceed and release them. This is a 
deadlock that leaves the cluster completely stuck.

We have been observing this behavior in 3.0.0 (where rolling back 
https://issues.apache.org/jira/browse/KYLIN-4165 resolved the issue), and now 
in 3.1.0 as well. It was originally reported in the comments of 
https://issues.apache.org/jira/browse/KYLIN-4348.

  was:
h4. Reproduction steps
 * Install Kylin 3.1.0
 * Deploy a streaming cube
 * Enable the cube having historical data present in the Kafka topic
 * Note: in our case, we had 3 cubes deployed, each consuming ~20 hourly 
segments from Kafka when the cubes were enabled

h4. Expected result
 * Kylin starts processing stream segments with stream jobs, eventually 
processing the older segments and catching up with the stream

h4. Actual result
 * After a short time, all jobs are completely stuck without any progress, some 
in running state, some in pending state.
 * The following logs are continuously written:

{code:java}
2020-08-06 06:16:22 INFO  [Scheduler 116797841 Job 
12750aea-3b96-c817-64e8-bf893d8c120f-254] MapReduceExecutable:409 - 
12750aea-3b96-c817-64e8-bf893d8c120f-00, parent lock 
path(/cube_job_lock/cube_vm) is locked by other job result is true ,ephemeral 
lock path :/cube_job_ephemeral_lock/cube_vm is locked by other job result is 
true,will try after one minute
2020-08-06 06:16:33 WARN  [FetcherRunner 787667774-43] FetcherRunner:56 - There 
are too many jobs running, Job Fetch will wait until next schedule time
{code}
 * Zookeeper indicates the following locks are in place:

{code:java}
ls /kylin/kylin_metadata/cube_job_ephemeral_lock
[cube_cm, cube_vm, cube_jm]
ls /kylin/kylin_metadata/cube_job_ephemeral_lock/cube_cm 
[]
ls /kylin/kylin_metadata/cube_job_ephemeral_lock/cube_vm
[]
ls /kylin/kylin_metadata/cube_job_ephemeral_lock/cube_jm
[]
ls /kylin/kylin_metadata/cube_job_lock
[cube_cm, cube_vm, cube_jm]
ls /kylin/kylin_metadata/cube_job_lock/cube_cm
[f888380e-9ff4-98f5-2df4-1ae71e045f93]
ls /kylin/kylin_metadata/cube_job_lock/cube_vm
[fc186bd9-1186-6ed4-e58c-bbbf6dd8ef74]
ls /kylin/kylin_metadata/cube_job_lock/cube_jm
[d1a6475a-9ab2-5ee4-6714-f395e20cfc01]
{code}
 * The job IDs for the 10 running jobs, none of which shows any progress:

 * 
 ** 

[jira] [Updated] (KYLIN-4689) Deadlock in Kylin job execution

2020-08-06 Thread Gabor Arki (Jira)


 [ 
https://issues.apache.org/jira/browse/KYLIN-4689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Arki updated KYLIN-4689:
--
Description: 
h4. Reproduction steps
 * Install Kylin 3.1.0
 * Deploy a streaming cube
 * Enable the cube having historical data present in the Kafka topic
 * Note: in our case, we had 3 cubes deployed, each consuming ~20 hourly 
segments from Kafka when the cubes were enabled

h4. Expected result
 * Kylin starts processing stream segments with stream jobs, eventually 
processing the older segments and catching up with the stream

h4. Actual result
 * After a short time, all jobs are completely stuck without any progress, some 
in running state, some in pending state.
 * The following logs are continuously written:

{code:java}
2020-08-06 06:16:22 INFO  [Scheduler 116797841 Job 
12750aea-3b96-c817-64e8-bf893d8c120f-254] MapReduceExecutable:409 - 
12750aea-3b96-c817-64e8-bf893d8c120f-00, parent lock 
path(/cube_job_lock/cube_vm) is locked by other job result is true ,ephemeral 
lock path :/cube_job_ephemeral_lock/cube_vm is locked by other job result is 
true,will try after one minute
2020-08-06 06:16:33 WARN  [FetcherRunner 787667774-43] FetcherRunner:56 - There 
are too many jobs running, Job Fetch will wait until next schedule time
{code}
 * Zookeeper indicates the following locks are in place:

{code:java}
ls /kylin/kylin_metadata/cube_job_ephemeral_lock
[cube_cm, cube_vm, cube_jm]
ls /kylin/kylin_metadata/cube_job_ephemeral_lock/cube_cm 
[]
ls /kylin/kylin_metadata/cube_job_ephemeral_lock/cube_vm
[]
ls /kylin/kylin_metadata/cube_job_ephemeral_lock/cube_jm
[]
ls /kylin/kylin_metadata/cube_job_lock
[cube_cm, cube_vm, cube_jm]
ls /kylin/kylin_metadata/cube_job_lock/cube_cm
[f888380e-9ff4-98f5-2df4-1ae71e045f93]
ls /kylin/kylin_metadata/cube_job_lock/cube_vm
[fc186bd9-1186-6ed4-e58c-bbbf6dd8ef74]
ls /kylin/kylin_metadata/cube_job_lock/cube_jm
[d1a6475a-9ab2-5ee4-6714-f395e20cfc01]
{code}
 * The job IDs for the 10 running jobs, none of which shows any progress:

 * 
 ** 169f75fa-a02f-221b-fc48-037bc7a842d0
 ** 0b5dae1b-6faf-66c5-71dc-86f5b820f1c4
 ** 00924699-8b51-8091-6e71-34ccfeba3a98
 ** 4620192a-71e1-16dd-3b05-44d7f9144ad4
 ** 416355c2-a3d7-57eb-55c6-c042aa256510
 ** 12750aea-3b96-c817-64e8-bf893d8c120f
 ** 42819dde-5857-fd6b-b075-439952f47140
 ** 00128937-bd4a-d6c1-7a4e-744dee946f67
 ** 46a0233f-217e-9155-725b-c815ad77ba2c
 ** 062150ba-bacd-6644-4801-3a51b260d1c5

As you can see, the 10 jobs that are actually running do not possess the locks 
and thus cannot make any progress. On the other hand, the 3 jobs holding the 
locks are not running, so they cannot proceed and release them. This is a 
deadlock that leaves the cluster completely stuck.

We have been observing this behavior in 3.0.0 (where rolling back 
https://issues.apache.org/jira/browse/KYLIN-4165 resolved the issue), and now 
in 3.1.0 as well. It was originally reported in the comments of 
https://issues.apache.org/jira/browse/KYLIN-4348.

  was:
h4. Reproduction steps
 * Install Kylin 3.1.0
 * Deploy a streaming cube
 * Enable the cube having historical data present in the Kafka topic
 * Note: in our case, we had 3 cubes deployed, each consuming ~20 hourly 
segments from Kafka when the cubes were enabled

h4. Expected result
 * Kylin starts processing stream segments with stream jobs, eventually 
processing the older segments and catching up with the stream

h4. Actual result
 * After a short time, all jobs are completely stuck without any progress, some 
in running state, some in pending state.
 * The following logs are continuously written:

{code:java}
2020-08-06 06:16:22 INFO  [Scheduler 116797841 Job 
12750aea-3b96-c817-64e8-bf893d8c120f-254] MapReduceExecutable:409 - 
12750aea-3b96-c817-64e8-bf893d8c120f-00, parent lock 
path(/cube_job_lock/cube_vm) is locked by other job result is true ,ephemeral 
lock path :/cube_job_ephemeral_lock/cube_vm is locked by other job result is 
true,will try after one minute
2020-08-06 06:16:33 WARN  [FetcherRunner 787667774-43] FetcherRunner:56 - There 
are too many jobs running, Job Fetch will wait until next schedule time
{code}

 * Zookeeper indicates the following locks are in place:

{code:java}
ls /kylin/kylin_metadata/cube_job_ephemeral_lock
[cube_cm, cube_vm, cube_jm]
ls /kylin/kylin_metadata/cube_job_ephemeral_lock/cube_cm 
[]
ls /kylin/kylin_metadata/cube_job_ephemeral_lock/cube_vm
[]
ls /kylin/kylin_metadata/cube_job_ephemeral_lock/cube_jm
[]
ls /kylin/kylin_metadata/cube_job_lock
[cube_cm, cube_vm, cube_jm]
ls /kylin/kylin_metadata/cube_job_lock/cube_cm
[f888380e-9ff4-98f5-2df4-1ae71e045f93]
ls /kylin/kylin_metadata/cube_job_lock/cube_vm
[fc186bd9-1186-6ed4-e58c-bbbf6dd8ef74]
ls /kylin/kylin_metadata/cube_job_lock/cube_jm
[d1a6475a-9ab2-5ee4-6714-f395e20cfc01]
{code}

 * The job IDs for the 10 running jobs, none of which shows any progress:

 ** 

[jira] [Created] (KYLIN-4689) Deadlock in Kylin job execution

2020-08-06 Thread Gabor Arki (Jira)
Gabor Arki created KYLIN-4689:
-

 Summary: Deadlock in Kylin job execution
 Key: KYLIN-4689
 URL: https://issues.apache.org/jira/browse/KYLIN-4689
 Project: Kylin
  Issue Type: Bug
  Components: Job Engine
Reporter: Gabor Arki


h4. Reproduction steps
 * Install Kylin 3.1.0
 * Deploy a streaming cube
 * Enable the cube having historical data present in the Kafka topic
 * Note: in our case, we had 3 cubes deployed, each consuming ~20 hourly 
segments from Kafka when the cubes were enabled

h4. Expected result
 * Kylin starts processing stream segments with stream jobs, eventually 
processing the older segments and catching up with the stream

h4. Actual result
 * After a short time, all jobs are completely stuck without any progress, some 
in running state, some in pending state.
 * The following logs are continuously written:

{code:java}
2020-08-06 06:16:22 INFO  [Scheduler 116797841 Job 
12750aea-3b96-c817-64e8-bf893d8c120f-254] MapReduceExecutable:409 - 
12750aea-3b96-c817-64e8-bf893d8c120f-00, parent lock 
path(/cube_job_lock/cube_vm) is locked by other job result is true ,ephemeral 
lock path :/cube_job_ephemeral_lock/cube_vm is locked by other job result is 
true,will try after one minute
2020-08-06 06:16:33 WARN  [FetcherRunner 787667774-43] FetcherRunner:56 - There 
are too many jobs running, Job Fetch will wait until next schedule time
{code}

 * Zookeeper indicates the following locks are in place:

{code:java}
ls /kylin/kylin_metadata/cube_job_ephemeral_lock
[cube_cm, cube_vm, cube_jm]
ls /kylin/kylin_metadata/cube_job_ephemeral_lock/cube_cm 
[]
ls /kylin/kylin_metadata/cube_job_ephemeral_lock/cube_vm
[]
ls /kylin/kylin_metadata/cube_job_ephemeral_lock/cube_jm
[]
ls /kylin/kylin_metadata/cube_job_lock
[cube_cm, cube_vm, cube_jm]
ls /kylin/kylin_metadata/cube_job_lock/cube_cm
[f888380e-9ff4-98f5-2df4-1ae71e045f93]
ls /kylin/kylin_metadata/cube_job_lock/cube_vm
[fc186bd9-1186-6ed4-e58c-bbbf6dd8ef74]
ls /kylin/kylin_metadata/cube_job_lock/cube_jm
[d1a6475a-9ab2-5ee4-6714-f395e20cfc01]
{code}

 * The job IDs for the 10 running jobs, none of which shows any progress:

 ** 169f75fa-a02f-221b-fc48-037bc7a842d0
 ** 0b5dae1b-6faf-66c5-71dc-86f5b820f1c4
 ** 00924699-8b51-8091-6e71-34ccfeba3a98
 ** 4620192a-71e1-16dd-3b05-44d7f9144ad4
 ** 416355c2-a3d7-57eb-55c6-c042aa256510
 ** 12750aea-3b96-c817-64e8-bf893d8c120f
 ** 42819dde-5857-fd6b-b075-439952f47140
 ** 00128937-bd4a-d6c1-7a4e-744dee946f67
 ** 46a0233f-217e-9155-725b-c815ad77ba2c
 ** 062150ba-bacd-6644-4801-3a51b260d1c5

As you can see, the 10 jobs that are actually running do not possess the locks 
and thus cannot make any progress. On the other hand, the 3 jobs holding the 
locks are not running, so they cannot proceed and release them. This is a 
deadlock that leaves the cluster completely stuck.

We have been observing this behavior in 3.0.0 (where rolling back 
https://issues.apache.org/jira/browse/KYLIN-4165 resolved the issue), and now 
in 3.1.0 as well. It was originally reported in the comments of 
https://issues.apache.org/jira/browse/KYLIN-4348.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KYLIN-4348) Fix distributed concurrency lock bug

2020-08-06 Thread Gabor Arki (Jira)


[ 
https://issues.apache.org/jira/browse/KYLIN-4348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17172255#comment-17172255
 ] 

Gabor Arki commented on KYLIN-4348:
---

Not sure if it is actually related to this ticket/epic, so I created a separate 
bug: https://issues.apache.org/jira/browse/KYLIN-4689

> Fix distributed concurrency lock bug
> 
>
> Key: KYLIN-4348
> URL: https://issues.apache.org/jira/browse/KYLIN-4348
> Project: Kylin
>  Issue Type: Sub-task
>Reporter: wangxiaojing
>Assignee: wangxiaojing
>Priority: Major
> Fix For: v3.1.0
>
> Attachments: image-2020-02-03-10-54-21-976.png, 
> image-2020-02-03-10-54-53-468.png
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KYLIN-4348) Fix distributed concurrency lock bug

2020-08-06 Thread Gabor Arki (Jira)


[ 
https://issues.apache.org/jira/browse/KYLIN-4348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17172237#comment-17172237
 ] 

Gabor Arki commented on KYLIN-4348:
---

Altogether we have 10 running jobs in the cluster which show no progress:
 * 169f75fa-a02f-221b-fc48-037bc7a842d0
 * 0b5dae1b-6faf-66c5-71dc-86f5b820f1c4
 * 00924699-8b51-8091-6e71-34ccfeba3a98
 * 4620192a-71e1-16dd-3b05-44d7f9144ad4
 * 416355c2-a3d7-57eb-55c6-c042aa256510
 * 12750aea-3b96-c817-64e8-bf893d8c120f
 * 42819dde-5857-fd6b-b075-439952f47140
 * 00128937-bd4a-d6c1-7a4e-744dee946f67
 * 46a0233f-217e-9155-725b-c815ad77ba2c
 * 062150ba-bacd-6644-4801-3a51b260d1c5

However, the ones possessing the locks are all pending:
 * f888380e-9ff4-98f5-2df4-1ae71e045f93
 * fc186bd9-1186-6ed4-e58c-bbbf6dd8ef74
 * d1a6475a-9ab2-5ee4-6714-f395e20cfc01

So, essentially the jobs that are in the running state cannot make any progress 
because they are unable to acquire a lock. However, the ones that hold the 
locks cannot continue because there are already 10 running jobs. This seems to 
be a deadlock to me.

> Fix distributed concurrency lock bug
> 
>
> Key: KYLIN-4348
> URL: https://issues.apache.org/jira/browse/KYLIN-4348
> Project: Kylin
>  Issue Type: Sub-task
>Reporter: wangxiaojing
>Assignee: wangxiaojing
>Priority: Major
> Fix For: v3.1.0
>
> Attachments: image-2020-02-03-10-54-21-976.png, 
> image-2020-02-03-10-54-53-468.png
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KYLIN-4348) Fix distributed concurrency lock bug

2020-08-06 Thread Gabor Arki (Jira)


[ 
https://issues.apache.org/jira/browse/KYLIN-4348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17172216#comment-17172216
 ] 

Gabor Arki commented on KYLIN-4348:
---

Zookeeper:
{code:java}
ls /kylin/kylin_metadata/cube_job_ephemeral_lock
[cube_cm, cube_vm, cube_jm]
ls /kylin/kylin_metadata/cube_job_ephemeral_lock/cube_cm 
[]
ls /kylin/kylin_metadata/cube_job_ephemeral_lock/cube_vm
[]
ls /kylin/kylin_metadata/cube_job_ephemeral_lock/cube_jm
[]
ls /kylin/kylin_metadata/cube_job_lock
[cube_cm, cube_vm, cube_jm]
ls /kylin/kylin_metadata/cube_job_lock/cube_cm
[f888380e-9ff4-98f5-2df4-1ae71e045f93]
ls /kylin/kylin_metadata/cube_job_lock/cube_vm
[fc186bd9-1186-6ed4-e58c-bbbf6dd8ef74]
ls /kylin/kylin_metadata/cube_job_lock/cube_jm
[d1a6475a-9ab2-5ee4-6714-f395e20cfc01]{code}
State of {{f888380e-9ff4-98f5-2df4-1ae71e045f93}} job:
Build Dimension Dictionaries For Steaming Job: FINISHED
Save Cube Dictionaries: PENDING
Overall job status: PENDING

Last logs:
{code:java}
2020-08-05 22:44:44 INFO  [Scheduler 116797841 Job 
f888380e-9ff4-98f5-2df4-1ae71e045f93-354] ExecutableManager:479 - job 
id:f888380e-9ff4-98f5-2df4-1ae71e045f93-00 from RUNNING to SUCCEED
2020-08-05 22:44:44 INFO  [Scheduler 116797841 Job 
f888380e-9ff4-98f5-2df4-1ae71e045f93-354] ExecutableManager:479 - job 
id:f888380e-9ff4-98f5-2df4-1ae71e045f93 from RUNNING to READY{code}
Since then, I can only see the following for this job ID:

 
{code:java}
2020-08-05 23:18:01 INFO  [streaming_job_submitter-thread-1] 
BuildJobSubmitter:287 - No left quota to build segments for cube:cube_cm at 0
2020-08-05 23:18:01 INFO  [streaming_job_submitter-thread-1] 
BuildJobSubmitter:462 - Job:b730dd18-173c-53d9-250b-ab9fb30a83b8 is in running, 
job state: READY.
2020-08-05 23:18:01 INFO  [streaming_job_submitter-thread-1] 
BuildJobSubmitter:287 - No left quota to build segments for cube:cube_cm at 0
2020-08-05 23:18:01 INFO  [streaming_job_submitter-thread-1] 
BuildJobSubmitter:462 - Job:46a0233f-217e-9155-725b-c815ad77ba2c is in running, 
job state: RUNNING.
2020-08-05 23:18:01 INFO  [streaming_job_submitter-thread-1] 
BuildJobSubmitter:287 - No left quota to build segments for cube:cube_cm at 0
2020-08-05 23:18:01 INFO  [streaming_job_submitter-thread-1] 
BuildJobSubmitter:287 - No left quota to build segments for cube:cube_cm at 0
2020-08-05 23:18:01 INFO  [streaming_job_submitter-thread-1] 
BuildJobSubmitter:462 - Job:f888380e-9ff4-98f5-2df4-1ae71e045f93 is in running, 
job state: READY.
2020-08-05 23:18:01 INFO  [streaming_job_submitter-thread-1] 
BuildJobSubmitter:287 - No left quota to build segments for cube:cube_cm at 0
2020-08-05 23:18:01 INFO  [streaming_job_submitter-thread-1] 
BuildJobSubmitter:462 - Job:70a52f13-e401-4f2f-8a33-b35b5ef955c4 is in running, 
job state: READY.
2020-08-05 23:18:01 INFO  [streaming_job_submitter-thread-1] 
BuildJobSubmitter:287 - No left quota to build segments for cube:cube_cm at 0
2020-08-05 23:18:01 INFO  [streaming_job_submitter-thread-1] 
BuildJobSubmitter:462 - Job:4620192a-71e1-16dd-3b05-44d7f9144ad4 is in running, 
job state: RUNNING.
2020-08-05 23:18:01 INFO  [streaming_job_submitter-thread-1] 
BuildJobSubmitter:287 - No left quota to build segments for cube:cube_cm at 0
2020-08-05 23:18:01 INFO  [streaming_job_submitter-thread-1] 
BuildJobSubmitter:462 - Job:00924699-8b51-8091-6e71-34ccfeba3a98 is in running, 
job state: RUNNING.
2020-08-05 23:18:01 INFO  [streaming_job_submitter-thread-1] 
BuildJobSubmitter:287 - No left quota to build segments for cube:cube_cm at 0
2020-08-05 23:18:01 INFO  [streaming_job_submitter-thread-1] 
BuildJobSubmitter:462 - Job:169f75fa-a02f-221b-fc48-037bc7a842d0 is in running, 
job state: RUNNING.
2020-08-05 23:18:01 INFO  [streaming_job_submitter-thread-1] 
BuildJobSubmitter:287 - No left quota to build segments for cube:cube_cm at 0
2020-08-05 23:18:01 INFO  [streaming_job_submitter-thread-1] 
BuildJobSubmitter:462 - Job:456ebd63-9202-f142-eee6-2156846e5c11 is in running, 
job state: READY.
2020-08-05 23:18:01 INFO  [streaming_job_submitter-thread-1] 
BuildJobSubmitter:287 - No left quota to build segments for cube:cube_cm at 0
2020-08-05 23:18:01 INFO  [streaming_job_submitter-thread-1] 
BuildJobSubmitter:462 - Job:522b8b86-5f89-cffb-3423-cca8d6908613 is in running, 
job state: READY.
2020-08-05 23:18:01 INFO  [streaming_job_submitter-thread-1] 
BuildJobSubmitter:287 - No left quota to build segments for cube:cube_cm at 0
2020-08-05 23:18:01 INFO  [streaming_job_submitter-thread-1] 
BuildJobSubmitter:287 - No left quota to build segments for cube:cube_cm at 0
{code}
The rest of the jobs for this cube are:
 * pending on Calculate Statistics from Base Cuboid (one, 
{{b730dd18-173c-53d9-250b-ab9fb30a83b8}})
 * pending on Build Dimension Dictionaries For Steaming Job
 * running, but stuck on Build Dimension Dictionaries For Steaming Job (four as 
in the list above)

The other 2 cubes we have are similarly stuck.

> Fix 

[jira] [Commented] (KYLIN-4500) Timeout waiting for connection from pool

2020-08-04 Thread Gabor Arki (Jira)


[ 
https://issues.apache.org/jira/browse/KYLIN-4500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17170903#comment-17170903
 ] 

Gabor Arki commented on KYLIN-4500:
---

The issue has happened again today on one of our test environments. I checked 
the open connections and the count was exceeding 10000:
{code:java}
[hadoop@ip-24-0-1-221 ~]$ netstat -anp | grep 21053 | grep CLOSE_WAIT | wc -l
10007{code}
Based on this, it seems indeed highly likely that 
https://issues.apache.org/jira/browse/KYLIN-4396 is causing the issue; it just 
manifests with a different error in our case. I will be monitoring the 3.1.0 
version once we make the upgrade and will provide an update.
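
For reference, the generic shape of a leak that accumulates CLOSE_WAIT sockets 
is an input stream that is opened on the EMRFS file system and never closed: 
each open stream pins an HTTP connection from the pool. A minimal sketch of the 
defensive pattern, using only the standard Hadoop FileSystem API (a generic 
illustration with a made-up path, not the actual KYLIN-4396 patch):

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class S3ReadWithoutLeak {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("s3://example-bucket/example/key"); // hypothetical
        FileSystem fs = path.getFileSystem(conf);

        // try-with-resources returns the underlying HTTP connection to the
        // pool even when the read throws; forgetting close() leaves the
        // socket in CLOSE_WAIT, which is what the netstat output above shows.
        try (FSDataInputStream in = fs.open(path)) {
            byte[] buffer = new byte[8192];
            while (in.read(buffer) != -1) {
                // process the bytes
            }
        }
    }
}
{code}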

> Timeout waiting for connection from pool
> 
>
> Key: KYLIN-4500
> URL: https://issues.apache.org/jira/browse/KYLIN-4500
> Project: Kylin
>  Issue Type: Bug
>Reporter: Gabor Arki
>Priority: Major
> Attachments: kylin-connection-timeout.txt
>
>
> h4. Environment
>  * Kylin server 3.0.0
>  * EMR 5.28
> h4. Issue
> After an extended uptime, both Kylin query server and jobs running on EMR 
> stop working. The root cause in both cases is:
> {noformat}
> Caused by: java.io.IOException: 
> com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.SdkClientException: Unable 
> to execute HTTP request: Timeout waiting for connection from pool
> at 
> com.amazon.ws.emr.hadoop.fs.s3n2.S3NativeFileSystem2.getFileStatus(S3NativeFileSystem2.java:257)
>  ~[emrfs-hadoop-assembly-2.37.0.jar:?]{noformat}
> Based on 
> [https://aws.amazon.com/premiumsupport/knowledge-center/emr-timeout-connection-wait/]
> increasing the fs.s3.maxConnections setting to 10000 is just delaying the 
> issue thus the underlying issue is likely a connection leak. It also 
> indicates a leak that restarting the kylin service solves the problem.
> A full stack trace from the QueryService is attached.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (KYLIN-4656) Guava classpath conflict caused by kylin-jdbc 3.1.0 jar

2020-07-23 Thread Gabor Arki (Jira)


[ 
https://issues.apache.org/jira/browse/KYLIN-4656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17163407#comment-17163407
 ] 

Gabor Arki edited comment on KYLIN-4656 at 7/23/20, 10:24 AM:
--

[~zhangyaqian] because you are excluding it from the dependencies of the 
released pom, but not from the shading itself.
 * Instead of providing an exclude list and shading everything else into the 
jdbc jar:
{noformat}
<artifactSet>
    <excludes>
        <exclude>org.slf4j:jcl-over-slf4j:*</exclude>
    </excludes>
</artifactSet>
{noformat}
You should define an include list and shade only what you actually intend to 
shade:

{noformat}
<artifactSet>
    <includes>
        <include>org.apache.kylin:kylin-shaded-guava</include>
        <include>org.apache.calcite.avatica:avatica-core</include>
        <include>org.apache.calcite.avatica:avatica-metrics</include>
        <include>com.fasterxml.jackson.core:jackson-annotations</include>
        <include>com.fasterxml.jackson.core:jackson-core</include>
        <include>com.fasterxml.jackson.core:jackson-databind</include>
        <include>com.google.protobuf:protobuf-java</include>
        <include>org.apache.httpcomponents:httpclient</include>
        <include>org.apache.httpcomponents:httpcore</include>
        <include>commons-codec:commons-codec</include>
        <include>commons-logging:commons-logging</include>
    </includes>
</artifactSet>
{noformat}
This is the set that seems to be included in 3.0.2; I am not sure whether 
everything in it is actually needed.

 
 * Also, the relocation in the kylin-shaded-guava module is not complete:
{noformat}
<relocation>
    <pattern>com.google.common</pattern>
    <shadedPattern>${shadeBase}.com.google.common</shadedPattern>
</relocation>
{noformat}
It relocates only the {{com.google.common}} package of guava but keeps 
{{com.google.thirdparty}} as is, again allowing classpath conflicts. It should 
be:

{noformat}
<relocation>
    <pattern>com.google</pattern>
    <shadedPattern>${shadeBase}.com.google</shadedPattern>
</relocation>
{noformat}
 
 * These workarounds seem to fix the issue locally, but defining dependencies 
in the [parent pom|https://github.com/apache/kylin/blob/master/pom.xml#L1095] 
is still a bad practice, because they are then transitively defined as 
dependencies for anyone who uses your public libraries such as kylin-jdbc.
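
For completeness, the result can be verified by listing what the published jar 
actually ships under com/google; with a correct include list and a full 
relocation this should print nothing. A small sketch using only the JDK (the 
jar file name is a placeholder):

{code:java}
import java.util.jar.JarFile;

public class ListShadedGuava {
    public static void main(String[] args) throws Exception {
        // Print every entry the jdbc jar ships under com/google; anything
        // listed here is an un-relocated copy that can clash on the classpath.
        try (JarFile jar = new JarFile("kylin-jdbc-3.1.0.jar")) { // placeholder
            jar.stream()
               .map(entry -> entry.getName())
               .filter(name -> name.startsWith("com/google/"))
               .forEach(System.out::println);
        }
    }
}
{code}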

 


was (Author: arkigabor):
[~zhangyaqian] because you are excluding it from the dependencies of the 
released pom, but not from the shading itself.

* Instead of providing an exclude list and shading everything else:
{noformat}
<artifactSet>
    <excludes>
        <exclude>org.slf4j:jcl-over-slf4j:*</exclude>
    </excludes>
</artifactSet>
{noformat}
You should define an include list and shade only what you actually intend to 
shade:

{noformat}
<artifactSet>
    <includes>
        <include>org.apache.kylin:kylin-shaded-guava</include>
        <include>org.apache.calcite.avatica:avatica-core</include>
        <include>org.apache.calcite.avatica:avatica-metrics</include>
        <include>com.fasterxml.jackson.core:jackson-annotations</include>
        <include>com.fasterxml.jackson.core:jackson-core</include>
        <include>com.fasterxml.jackson.core:jackson-databind</include>
        <include>com.google.protobuf:protobuf-java</include>
        <include>org.apache.httpcomponents:httpclient</include>
        <include>org.apache.httpcomponents:httpcore</include>
        <include>commons-codec:commons-codec</include>
        <include>commons-logging:commons-logging</include>
    </includes>
</artifactSet>
{noformat}
This is the set that seems to be included in 3.0.2; I am not sure whether 
everything in it is actually needed.

 

* Also, the relocation in the kylin-shaded-guava module is not complete:
{noformat}
<relocation>
    <pattern>com.google.common</pattern>
    <shadedPattern>${shadeBase}.com.google.common</shadedPattern>
</relocation>
{noformat}
It relocates only the {{com.google.common}} package of guava but keeps 
{{com.google.thirdparty}} as is, again allowing classpath conflicts. It should 
be:

{noformat}
<relocation>
    <pattern>com.google</pattern>
    <shadedPattern>${shadeBase}.com.google</shadedPattern>
</relocation>
{noformat}
 

* These workarounds seem to fix the issue locally, but defining dependencies in 
the [parent pom|https://github.com/apache/kylin/blob/master/pom.xml#L1095] is 
still a bad practice, because they are then transitively defined as 
dependencies for anyone who uses your public libraries such as kylin-jdbc.

 

> Guava classpath conflict caused by kylin-jdbc 3.1.0 jar
> ---
>
> Key: KYLIN-4656
> URL: https://issues.apache.org/jira/browse/KYLIN-4656
> Project: Kylin
>  Issue Type: Bug
>  Components: Driver - JDBC
>Affects Versions: v3.1.0
>Reporter: Gabor Arki
>Priority: Critical
> Attachments: image-2020-07-23-07-44-40-675.png
>
>
> The newly released kylin-jdbc 3.1.0 jar contains a shaded, non-repackaged 
> version of the Guava library. This is causing class duplication with the 
> original guava jar if it is also on the classpath which results in 
> non-deterministic, runtime errors depending on which version of a certain 
> guava class has been picked up by the class-loader from the 2 versions. Based 
> on the runtime errors of the missing classes and methods, it seems to be a 
> very old version, probably <=14.
>  
> Either implement a proper shading with package relocation or rely on 
> transitive dependency, but do not shade non-repackaged versions of libraries.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KYLIN-4656) Guava classpath conflict caused by kylin-jdbc 3.1.0 jar

2020-07-23 Thread Gabor Arki (Jira)


[ 
https://issues.apache.org/jira/browse/KYLIN-4656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17163407#comment-17163407
 ] 

Gabor Arki commented on KYLIN-4656:
---

[~zhangyaqian] because you are excluding it from the dependencies of the 
released pom, but not from the shading itself.

* Instead of providing an exclude list and shading everything else:
{noformat}
<artifactSet>
    <excludes>
        <exclude>org.slf4j:jcl-over-slf4j:*</exclude>
    </excludes>
</artifactSet>
{noformat}
You should define an include list and shade only what you actually intend to 
shade:

{noformat}
<artifactSet>
    <includes>
        <include>org.apache.kylin:kylin-shaded-guava</include>
        <include>org.apache.calcite.avatica:avatica-core</include>
        <include>org.apache.calcite.avatica:avatica-metrics</include>
        <include>com.fasterxml.jackson.core:jackson-annotations</include>
        <include>com.fasterxml.jackson.core:jackson-core</include>
        <include>com.fasterxml.jackson.core:jackson-databind</include>
        <include>com.google.protobuf:protobuf-java</include>
        <include>org.apache.httpcomponents:httpclient</include>
        <include>org.apache.httpcomponents:httpcore</include>
        <include>commons-codec:commons-codec</include>
        <include>commons-logging:commons-logging</include>
    </includes>
</artifactSet>
{noformat}
This is the set that seems to be included in 3.0.2; I am not sure whether 
everything in it is actually needed.

 

* Also, the relocation in the kylin-shaded-guava module is not complete:
{noformat}
<relocation>
    <pattern>com.google.common</pattern>
    <shadedPattern>${shadeBase}.com.google.common</shadedPattern>
</relocation>
{noformat}
It relocates only the {{com.google.common}} package of guava but keeps 
{{com.google.thirdparty}} as is, again allowing classpath conflicts. It should 
be:

{noformat}
<relocation>
    <pattern>com.google</pattern>
    <shadedPattern>${shadeBase}.com.google</shadedPattern>
</relocation>
{noformat}
 

* These workarounds seem to fix the issue locally, but defining dependencies in 
the [parent pom|https://github.com/apache/kylin/blob/master/pom.xml#L1095] is 
still a bad practice, because they are then transitively defined as 
dependencies for anyone who uses your public libraries such as kylin-jdbc.

 

> Guava classpath conflict caused by kylin-jdbc 3.1.0 jar
> ---
>
> Key: KYLIN-4656
> URL: https://issues.apache.org/jira/browse/KYLIN-4656
> Project: Kylin
>  Issue Type: Bug
>  Components: Driver - JDBC
>Affects Versions: v3.1.0
>Reporter: Gabor Arki
>Priority: Critical
> Attachments: image-2020-07-23-07-44-40-675.png
>
>
> The newly released kylin-jdbc 3.1.0 jar contains a shaded, non-repackaged 
> version of the Guava library. This is causing class duplication with the 
> original guava jar if it is also on the classpath which results in 
> non-deterministic, runtime errors depending on which version of a certain 
> guava class has been picked up by the class-loader from the 2 versions. Based 
> on the runtime errors of the missing classes and methods, it seems to be a 
> very old version, probably <=14.
>  
> Either implement a proper shading with package relocation or rely on 
> transitive dependency, but do not shade non-repackaged versions of libraries.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (KYLIN-4656) Guava classpath conflict caused by kylin-jdbc 3.1.0 jar

2020-07-23 Thread Gabor Arki (Jira)


[ 
https://issues.apache.org/jira/browse/KYLIN-4656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17163352#comment-17163352
 ] 

Gabor Arki edited comment on KYLIN-4656 at 7/23/20, 9:01 AM:
-

Based on the dependency tree of the jdbc module, guava 14 is a transitive 
dependency of the kylin-shaded-guava module:
{noformat}
[INFO] < org.apache.kylin:kylin-jdbc >-
[INFO] Building Apache Kylin - JDBC Driver 3.1.1-SNAPSHOT [31/35]
[INFO] [ jar ]-
[INFO] 
[INFO] --- maven-dependency-plugin:2.10:tree (default-cli) @ kylin-jdbc ---
[INFO] org.apache.kylin:kylin-jdbc:jar:3.1.1-SNAPSHOT
[INFO] +- org.apache.kylin:kylin-shaded-guava:jar:3.1.1-SNAPSHOT:compile
[INFO] | \- com.google.guava:guava:jar:14.0:compile{noformat}
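(As an aside, whether this copy actually wins on a given classpath can be 
checked by asking the class-loader for every location that provides a guava 
class; more than one URL printed means two competing copies. An illustrative 
diagnostic, not part of Kylin:)

{code:java}
import java.net.URL;
import java.util.Enumeration;

public class FindDuplicateGuava {
    public static void main(String[] args) throws Exception {
        // Every classpath entry containing this class is listed; seeing both
        // the kylin-jdbc jar and the real guava jar confirms the duplication.
        Enumeration<URL> urls = Thread.currentThread().getContextClassLoader()
                .getResources("com/google/common/base/Preconditions.class");
        while (urls.hasMoreElements()) {
            System.out.println(urls.nextElement());
        }
    }
}
{code}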
The dependency tree of the external module:
{noformat}
[INFO] --< org.apache.kylin:kylin-external >---
[INFO] Building Apache Kylin - kylin External 3.1.1-SNAPSHOT [2/35]
[INFO] [ pom ]-
[INFO] 
[INFO] --- maven-dependency-plugin:2.10:tree (default-cli) @ kylin-external ---
[INFO] org.apache.kylin:kylin-external:pom:3.1.1-SNAPSHOT
[INFO] +- log4j:log4j:jar:1.2.17:provided
[INFO] +- org.slf4j:slf4j-log4j12:jar:1.7.21:provided
[INFO] +- org.slf4j:jcl-over-slf4j:jar:1.7.21:compile
[INFO] +- org.slf4j:slf4j-api:jar:1.7.21:compile
[INFO] \- org.apache.hadoop:hadoop-common:jar:2.7.1:provided
[INFO] +- org.apache.hadoop:hadoop-annotations:jar:2.7.1:provided
[INFO] | \- jdk.tools:jdk.tools:jar:1.8:system
[INFO] +- com.google.guava:guava:jar:14.0:provided
[INFO] +- commons-cli:commons-cli:jar:1.2:provided
[INFO] +- org.apache.commons:commons-math3:jar:3.1.1:provided
[INFO] +- xmlenc:xmlenc:jar:0.52:provided
[INFO] +- commons-httpclient:commons-httpclient:jar:3.1:provided
[INFO] +- commons-codec:commons-codec:jar:1.4:provided
[INFO] +- commons-io:commons-io:jar:2.4:provided
[INFO] +- commons-net:commons-net:jar:3.1:provided
[INFO] +- commons-collections:commons-collections:jar:3.2.2:provided
[INFO] +- org.mortbay.jetty:jetty:jar:6.1.26:provided
[INFO] +- org.mortbay.jetty:jetty-util:jar:6.1.26:provided
[INFO] +- com.sun.jersey:jersey-core:jar:1.9:provided
[INFO] +- com.sun.jersey:jersey-json:jar:1.9:provided
[INFO] | +- org.codehaus.jettison:jettison:jar:1.1:provided
[INFO] | +- com.sun.xml.bind:jaxb-impl:jar:2.2.3-1:provided
[INFO] | | \- javax.xml.bind:jaxb-api:jar:2.2.2:provided
[INFO] | | +- javax.xml.stream:stax-api:jar:1.0-2:provided
[INFO] | | \- javax.activation:activation:jar:1.1:provided
[INFO] | +- org.codehaus.jackson:jackson-jaxrs:jar:1.8.3:provided
[INFO] | \- org.codehaus.jackson:jackson-xc:jar:1.8.3:provided
[INFO] +- com.sun.jersey:jersey-server:jar:1.9:provided
[INFO] | \- asm:asm:jar:3.1:provided
[INFO] +- commons-logging:commons-logging:jar:1.1.3:provided
[INFO] +- commons-lang:commons-lang:jar:2.6:provided
[INFO] +- commons-configuration:commons-configuration:jar:1.6:provided
[INFO] | +- commons-digester:commons-digester:jar:1.8:provided
[INFO] | | \- commons-beanutils:commons-beanutils:jar:1.7.0:provided
[INFO] | \- commons-beanutils:commons-beanutils-core:jar:1.8.0:provided
[INFO] +- org.codehaus.jackson:jackson-core-asl:jar:1.9.13:provided
[INFO] +- org.codehaus.jackson:jackson-mapper-asl:jar:1.9.13:provided
[INFO] +- org.apache.avro:avro:jar:1.7.4:provided
[INFO] | +- com.thoughtworks.paranamer:paranamer:jar:2.3:provided
[INFO] | \- org.xerial.snappy:snappy-java:jar:1.0.4.1:provided
[INFO] +- com.google.protobuf:protobuf-java:jar:2.5.0:provided
[INFO] +- com.google.code.gson:gson:jar:2.2.4:provided
[INFO] +- org.apache.hadoop:hadoop-auth:jar:2.7.1:provided
[INFO] | +- org.apache.httpcomponents:httpclient:jar:4.3.6:provided
[INFO] | | \- org.apache.httpcomponents:httpcore:jar:4.3.3:provided
[INFO] | +- 
org.apache.directory.server:apacheds-kerberos-codec:jar:2.0.0-M15:provided
[INFO] | | +- org.apache.directory.server:apacheds-i18n:jar:2.0.0-M15:provided
[INFO] | | +- org.apache.directory.api:api-asn1-api:jar:1.0.0-M20:provided
[INFO] | | \- org.apache.directory.api:api-util:jar:1.0.0-M20:provided
[INFO] | \- org.apache.curator:curator-framework:jar:2.12.0:provided
[INFO] +- com.jcraft:jsch:jar:0.1.54:provided
[INFO] +- org.apache.curator:curator-client:jar:2.12.0:provided
[INFO] +- org.apache.curator:curator-recipes:jar:2.12.0:provided
[INFO] +- com.google.code.findbugs:jsr305:jar:3.0.1:provided
[INFO] +- org.apache.htrace:htrace-core:jar:3.1.0-incubating:provided
[INFO] +- org.apache.zookeeper:zookeeper:jar:3.4.14:provided
[INFO] | +- com.github.spotbugs:spotbugs-annotations:jar:3.1.9:provided
[INFO] | +- org.apache.yetus:audience-annotations:jar:0.5.0:provided
[INFO] | \- io.netty:netty:jar:3.10.6.Final:provided
[INFO] \- org.apache.commons:commons-compress:jar:1.19:provided{noformat}
It is inheriting all 

[jira] [Commented] (KYLIN-4656) Guava classpath conflict caused by kylin-jdbc 3.1.0 jar

2020-07-23 Thread Gabor Arki (Jira)


[ 
https://issues.apache.org/jira/browse/KYLIN-4656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17163352#comment-17163352
 ] 

Gabor Arki commented on KYLIN-4656:
---

Based on the dependency tree of the jdbc module, guava 14 is a transitive 
dependency of the kylin-shaded-guava module:
{noformat}

[INFO] < org.apache.kylin:kylin-jdbc >-
[INFO] Building Apache Kylin - JDBC Driver 3.1.1-SNAPSHOT [31/35]
[INFO] [ jar ]-
[INFO] 
[INFO] --- maven-dependency-plugin:2.10:tree (default-cli) @ kylin-jdbc ---
[INFO] org.apache.kylin:kylin-jdbc:jar:3.1.1-SNAPSHOT
[INFO] +- org.apache.kylin:kylin-shaded-guava:jar:3.1.1-SNAPSHOT:compile
[INFO] | \- com.google.guava:guava:jar:14.0:compile{noformat}

> Guava classpath conflict caused by kylin-jdbc 3.1.0 jar
> ---
>
> Key: KYLIN-4656
> URL: https://issues.apache.org/jira/browse/KYLIN-4656
> Project: Kylin
>  Issue Type: Bug
>  Components: Driver - JDBC
>Affects Versions: v3.1.0
>Reporter: Gabor Arki
>Priority: Critical
> Attachments: image-2020-07-23-07-44-40-675.png
>
>
> The newly released kylin-jdbc 3.1.0 jar contains a shaded, non-repackaged 
> version of the Guava library. This is causing class duplication with the 
> original guava jar if it is also on the classpath which results in 
> non-deterministic, runtime errors depending on which version of a certain 
> guava class has been picked up by the class-loader from the 2 versions. Based 
> on the runtime errors of the missing classes and methods, it seems to be a 
> very old version, probably <=14.
>  
> Either implement a proper shading with package relocation or rely on 
> transitive dependency, but do not shade non-repackaged versions of libraries.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (KYLIN-4656) Guava classpath conflict caused by kylin-jdbc 3.1.0 jar

2020-07-22 Thread Gabor Arki (Jira)


[ 
https://issues.apache.org/jira/browse/KYLIN-4656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17163235#comment-17163235
 ] 

Gabor Arki edited comment on KYLIN-4656 at 7/23/20, 5:52 AM:
-

Download the published 3.1.0 jar: 
[https://mvnrepository.com/artifact/org.apache.kylin/kylin-jdbc/3.1.0]

Then unzip the file. You will see the com.google package containing the guava 
library inside the kylin-jdbc jar. From my findings, this is likely version 14 
of guava.

!image-2020-07-23-07-44-40-675.png!

 

If you also have the real guava jar on your classpath, you end up having 2 
separate and highly incompatible versions of the guava classes present, and the 
class-loader loads one of the competing classes at random, causing numerous 
runtime issues.

Also, the shaded version you are referring to is present in the jar, properly 
relocated to the org.apache.kylin.shaded.com.google package, and is fine as is.
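
To see which of the two copies the class-loader actually picked in a failing 
JVM, printing the code source of a guava class is usually enough (an 
illustrative diagnostic, not part of Kylin):

{code:java}
import com.google.common.base.Preconditions;

public class WhichGuavaWon {
    public static void main(String[] args) {
        // Prints the jar this class was loaded from: the kylin-jdbc jar if
        // its un-relocated copy shadowed the real guava artifact, or the
        // real guava jar otherwise.
        System.out.println(Preconditions.class
                .getProtectionDomain().getCodeSource().getLocation());
    }
}
{code}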


was (Author: arkigabor):
Download the published 3.1.0 jar: 
[https://mvnrepository.com/artifact/org.apache.kylin/kylin-jdbc/3.1.0]

Then unzip the file. You will see the com.google package containing the guava 
library inside the kylin-jdbc jar. From my findings, this is likely version 14 
of guava.

!image-2020-07-23-07-44-40-675.png!

 

If you also have the real guava jar on your classpath, you end up having 2 
separate and highly incompatible versions of the guava classes present, and the 
class-loader loads one of the competing classes at random, causing numerous 
runtime issues.

> Guava classpath conflict caused by kylin-jdbc 3.1.0 jar
> ---
>
> Key: KYLIN-4656
> URL: https://issues.apache.org/jira/browse/KYLIN-4656
> Project: Kylin
>  Issue Type: Bug
>  Components: Driver - JDBC
>Affects Versions: v3.1.0
>Reporter: Gabor Arki
>Priority: Critical
> Attachments: image-2020-07-23-07-44-40-675.png
>
>
> The newly released kylin-jdbc 3.1.0 jar contains a shaded, non-repackaged 
> version of the Guava library. This is causing class duplication with the 
> original guava jar if it is also on the classpath which results in 
> non-deterministic, runtime errors depending on which version of a certain 
> guava class has been picked up by the class-loader from the 2 versions. Based 
> on the runtime errors of the missing classes and methods, it seems to be a 
> very old version, probably <=14.
>  
> Either implement a proper shading with package relocation or rely on 
> transitive dependency, but do not shade non-repackaged versions of libraries.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (KYLIN-4656) Guava classpath conflict caused by kylin-jdbc 3.1.0 jar

2020-07-22 Thread Gabor Arki (Jira)


[ 
https://issues.apache.org/jira/browse/KYLIN-4656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17163235#comment-17163235
 ] 

Gabor Arki edited comment on KYLIN-4656 at 7/23/20, 5:48 AM:
-

Download the published 3.1.0 jar: 
[https://mvnrepository.com/artifact/org.apache.kylin/kylin-jdbc/3.1.0]

Then unzip the file. You will see the com.google package containing the guava 
library inside the kylin-jdbc jar. From my findings, this is likely version 14 
of guava.

!image-2020-07-23-07-44-40-675.png!

 

If you also have the real guava jar on your classpath, you end up having 2 
separate and highly incompatible versions of the guava classes present, and the 
class-loader loads one of the competing classes at random, causing numerous 
runtime issues.


was (Author: arkigabor):
Download the published 3.1.0 jar: 
[https://mvnrepository.com/artifact/org.apache.kylin/kylin-jdbc/3.1.0]

Then unzip the file. You will see the com.google package containing the guava 
library inside the kylin-jdbc jar. From my findings, this is likely version 14 
of guava.

!image-2020-07-23-07-44-40-675.png!

 

If you also have the real guava jar on your classpath, you end up having 2 
separate and highly incompatible versions of the guava classes present, and the 
class-loader loads one of them at random, causing numerous runtime issues.

> Guava classpath conflict caused by kylin-jdbc 3.1.0 jar
> ---
>
> Key: KYLIN-4656
> URL: https://issues.apache.org/jira/browse/KYLIN-4656
> Project: Kylin
>  Issue Type: Bug
>  Components: Driver - JDBC
>Affects Versions: v3.1.0
>Reporter: Gabor Arki
>Priority: Critical
> Attachments: image-2020-07-23-07-44-40-675.png
>
>
> The newly released kylin-jdbc 3.1.0 jar contains a shaded, non-repackaged 
> version of the Guava library. This is causing class duplication with the 
> original guava jar if it is also on the classpath which results in 
> non-deterministic, runtime errors depending on which version of a certain 
> guava class has been picked up by the class-loader from the 2 versions. Based 
> on the runtime errors of the missing classes and methods, it seems to be a 
> very old version, probably <=14.
>  
> Either implement a proper shading with package relocation or rely on 
> transitive dependency, but do not shade non-repackaged versions of libraries.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (KYLIN-4656) Guava classpath conflict caused by kylin-jdbc 3.1.0 jar

2020-07-22 Thread Gabor Arki (Jira)


[ 
https://issues.apache.org/jira/browse/KYLIN-4656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17163235#comment-17163235
 ] 

Gabor Arki edited comment on KYLIN-4656 at 7/23/20, 5:47 AM:
-

Download the published 3.1.0 jar: 
[https://mvnrepository.com/artifact/org.apache.kylin/kylin-jdbc/3.1.0]

Then unzip the file. You will see the com.google package containing the guava 
library inside the kylin-jdbc jar. From my findings, this is likely version 14 
of guava.

!image-2020-07-23-07-44-40-675.png!

 

If you also have the real guava jar on your classpath, you end up having 2 
separate and highly incompatible versions of the guava classes present, and the 
class-loader loads one of them at random, causing numerous runtime issues.


was (Author: arkigabor):
Download the published 3.1.0 jar: 
[https://mvnrepository.com/artifact/org.apache.kylin/kylin-jdbc/3.1.0]

Then unzip the file. You will see the com.google package inside the jar 
containing the guava library within the kylin-jdbc jar.

!image-2020-07-23-07-44-40-675.png!

 

If you also have the real guava jar on your classpath, you end up having 2 
separate and highly incompatible versions of guava classes present, and the 
class-loader is loading one of them randomly causing numerous runtime issues.

> Guava classpath conflict caused by kylin-jdbc 3.1.0 jar
> ---
>
> Key: KYLIN-4656
> URL: https://issues.apache.org/jira/browse/KYLIN-4656
> Project: Kylin
>  Issue Type: Bug
>  Components: Driver - JDBC
>Affects Versions: v3.1.0
>Reporter: Gabor Arki
>Priority: Critical
> Attachments: image-2020-07-23-07-44-40-675.png
>
>
> The newly released kylin-jdbc 3.1.0 jar contains a shaded, non-repackaged 
> version of the Guava library. This is causing class duplication with the 
> original guava jar if it is also on the classpath which results in 
> non-deterministic, runtime errors depending on which version of a certain 
> guava class has been picked up by the class-loader from the 2 versions. Based 
> on the runtime errors of the missing classes and methods, it seems to be a 
> very old version, probably <=14.
>  
> Either implement a proper shading with package relocation or rely on 
> transitive dependency, but do not shade non-repackaged versions of libraries.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KYLIN-4656) Guava classpath conflict caused by kylin-jdbc 3.1.0 jar

2020-07-22 Thread Gabor Arki (Jira)


[ 
https://issues.apache.org/jira/browse/KYLIN-4656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17163235#comment-17163235
 ] 

Gabor Arki commented on KYLIN-4656:
---

Download the published 3.1.0 jar: 
[https://mvnrepository.com/artifact/org.apache.kylin/kylin-jdbc/3.1.0]

Then unzip the file. You will see the com.google package inside the jar 
containing the guava library within the kylin-jdbc jar.

!image-2020-07-23-07-44-40-675.png!

 

If you also have the real guava jar on your classpath, you end up having 2 
separate and highly incompatible versions of guava classes present, and the 
class-loader is loading one of them randomly causing numerous runtime issues.
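
One way to confirm the duplication at runtime is to ask the class-loader where 
it can find a well-known Guava class; with both jars on the classpath it 
reports two origins. A quick diagnostic sketch (an illustration, not taken 
from the ticket):

{code:java}
import java.net.URL;
import java.util.Enumeration;

public class FindGuavaCopies {
    public static void main(String[] args) throws Exception {
        // Every jar that bundles this class shows up as a separate URL;
        // two results mean two competing Guava copies on the classpath.
        Enumeration<URL> copies = FindGuavaCopies.class.getClassLoader()
                .getResources("com/google/common/collect/ImmutableList.class");
        while (copies.hasMoreElements()) {
            System.out.println(copies.nextElement());
        }
    }
}
{code}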

> Guava classpath conflict caused by kylin-jdbc 3.1.0 jar
> ---
>
> Key: KYLIN-4656
> URL: https://issues.apache.org/jira/browse/KYLIN-4656
> Project: Kylin
>  Issue Type: Bug
>  Components: Driver - JDBC
>Affects Versions: v3.1.0
>Reporter: Gabor Arki
>Priority: Critical
> Attachments: image-2020-07-23-07-44-40-675.png
>
>
> The newly released kylin-jdbc 3.1.0 jar contains a shaded, non-repackaged 
> version of the Guava library. This is causing class duplication with the 
> original guava jar if it is also on the classpath which results in 
> non-deterministic, runtime errors depending on which version of a certain 
> guava class has been picked up by the class-loader from the 2 versions. Based 
> on the runtime errors of the missing classes and methods, it seems to be a 
> very old version, probably <=14.
>  
> Either implement a proper shading with package relocation or rely on 
> transitive dependency, but do not shade non-repackaged versions of libraries.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KYLIN-4656) Guava classpath conflict caused by kylin-jdbc 3.1.0 jar

2020-07-22 Thread Gabor Arki (Jira)


 [ 
https://issues.apache.org/jira/browse/KYLIN-4656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Arki updated KYLIN-4656:
--
Attachment: image-2020-07-23-07-44-40-675.png

> Guava classpath conflict caused by kylin-jdbc 3.1.0 jar
> ---
>
> Key: KYLIN-4656
> URL: https://issues.apache.org/jira/browse/KYLIN-4656
> Project: Kylin
>  Issue Type: Bug
>  Components: Driver - JDBC
>Affects Versions: v3.1.0
>Reporter: Gabor Arki
>Priority: Critical
> Attachments: image-2020-07-23-07-44-40-675.png
>
>
> The newly released kylin-jdbc 3.1.0 jar contains a shaded, non-repackaged 
> version of the Guava library. This is causing class duplication with the 
> original guava jar if it is also on the classpath which results in 
> non-deterministic, runtime errors depending on which version of a certain 
> guava class has been picked up by the class-loader from the 2 versions. Based 
> on the runtime errors of the missing classes and methods, it seems to be a 
> very old version, probably <=14.
>  
> Either implement a proper shading with package relocation or rely on 
> transitive dependency, but do not shade non-repackaged versions of libraries.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KYLIN-4500) Timeout waiting for connection from pool

2020-07-22 Thread Gabor Arki (Jira)


[ 
https://issues.apache.org/jira/browse/KYLIN-4500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17162925#comment-17162925
 ] 

Gabor Arki commented on KYLIN-4500:
---

For now, I will keep monitoring our server with netstat and try to determine 
whether there is any correlation with the S3 pool exhaustion (a small counter 
sketch follows below). We will also try to upgrade to 3.1.0, but it will 
probably take some time to tell whether the issue is still reproducible with 
that version. I will post an update with our findings once I have them.
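
A trivial counter for that monitoring, assuming netstat is available on the 
server (illustrative only):

{code:java}
import java.io.BufferedReader;
import java.io.InputStreamReader;

public class CloseWaitCounter {
    public static void main(String[] args) throws Exception {
        // Run netstat and count sockets stuck in CLOSE_WAIT; a steady
        // ramp-up between Kylin restarts is consistent with a leak.
        Process netstat = new ProcessBuilder("netstat", "-tan").start();
        long closeWait = 0;
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(netstat.getInputStream()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.contains("CLOSE_WAIT")) {
                    closeWait++;
                }
            }
        }
        System.out.println("CLOSE_WAIT sockets: " + closeWait);
    }
}
{code}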

> Timeout waiting for connection from pool
> 
>
> Key: KYLIN-4500
> URL: https://issues.apache.org/jira/browse/KYLIN-4500
> Project: Kylin
>  Issue Type: Bug
>Reporter: Gabor Arki
>Priority: Major
> Attachments: kylin-connection-timeout.txt
>
>
> h4. Environment
>  * Kylin server 3.0.0
>  * EMR 5.28
> h4. Issue
> After an extended uptime, both Kylin query server and jobs running on EMR 
> stop working. The root cause in both cases is:
> {noformat}
> Caused by: java.io.IOException: 
> com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.SdkClientException: Unable 
> to execute HTTP request: Timeout waiting for connection from pool
> at 
> com.amazon.ws.emr.hadoop.fs.s3n2.S3NativeFileSystem2.getFileStatus(S3NativeFileSystem2.java:257)
>  ~[emrfs-hadoop-assembly-2.37.0.jar:?]{noformat}
> Based on 
> [https://aws.amazon.com/premiumsupport/knowledge-center/emr-timeout-connection-wait/]
>  increasing the fs.s3.maxConnections setting to 1 is just delaying the 
> issue thus the underlying issue is likely a connection leak. It also 
> indicates a leak that restarting the kylin service solves the problem.
> A full stack trace from the QueryService is attached.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (KYLIN-4500) Timeout waiting for connection from pool

2020-07-22 Thread Gabor Arki (Jira)


[ 
https://issues.apache.org/jira/browse/KYLIN-4500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17162898#comment-17162898
 ] 

Gabor Arki edited comment on KYLIN-4500 at 7/22/20, 4:21 PM:
-

[~hit_lacus], the linked issue seems to be unrelated. It could be related if 
the dictionary is stored on the cluster; however, I do not see any 
FileNotFoundException logs when we hit this issue. I do see the slow ramp-up 
of CLOSE_WAIT connections on the server, though.

We are running Kylin on an AWS EMR cluster and use S3 (EMRFS) for data storage 
instead of HDFS to keep the cluster stateless. However, after some continuous 
uptime we always hit this issue, where both the query server and the Kylin MR 
jobs suddenly start failing with the aforementioned exception. The root cause 
of these failures is that the EMR cluster's connection pool to S3 is 
exhausted: new operations fail to acquire a connection and time out while 
waiting for one.

No matter how large a pool size we configure via the fs.s3.maxConnections 
value, this keeps happening. The underlying issue is very likely a connection 
leak, where some code is not properly closing and returning connections to the 
pool. Given that a query server restart resolves the issue, I suspect the pool 
is being exhausted somewhere in the Kylin query server code.
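
A leak of this kind is usually an input stream or similar handle that is 
opened against the file system but never closed; each open handle keeps one S3 
connection checked out of the pool. A minimal sketch of the safe pattern, 
assuming the standard Hadoop FileSystem API (bucket and path are made up):

{code:java}
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SafeS3Read {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical bucket/path; on EMR the s3:// scheme resolves to EMRFS.
        FileSystem fs = FileSystem.get(URI.create("s3://my-bucket"), conf);
        Path path = new Path("s3://my-bucket/kylin/some-file");
        // try-with-resources guarantees the stream, and with it the pooled
        // S3 connection, is returned even if an exception is thrown. A bare
        // fs.open(path) that is never closed keeps one connection checked
        // out per call and eventually exhausts fs.s3.maxConnections.
        try (FSDataInputStream in = fs.open(path)) {
            byte[] buffer = new byte[4096];
            int read = in.read(buffer);
            System.out.println("read " + read + " bytes");
        }
    }
}
{code}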


was (Author: arkigabor):
[~hit_lacus], the linked issue seems to be unrelated.

We are running Kylin on AWS EMR cluster and use S3 (EMRFS) for data storage 
instead of HDFS to make the cluster stateless. However, after some continuous 
uptime, we are always facing this issue where both the query server and the 
Kylin MR jobs are suddenly failing with the aforementioned Exception. The root 
cause of these failures is that the connection pool of the EMR cluster to S3 is 
exhausted and new operations fail to acquire a connection and time out while 
waiting for an S3 connection.

No matter how much of a pool size we are configuring for the 
fs.s3.maxConnections value, this keeps happening. The underlying issue is very 
likely a connection leak where some code is not properly closing and returning 
a connection to the pool. Given a query server restart is solving the issue, I 
suspect the pool is exhausted somewhere in the Kylin query server code.

> Timeout waiting for connection from pool
> 
>
> Key: KYLIN-4500
> URL: https://issues.apache.org/jira/browse/KYLIN-4500
> Project: Kylin
>  Issue Type: Bug
>Reporter: Gabor Arki
>Priority: Major
> Attachments: kylin-connection-timeout.txt
>
>
> h4. Environment
>  * Kylin server 3.0.0
>  * EMR 5.28
> h4. Issue
> After an extended uptime, both Kylin query server and jobs running on EMR 
> stop working. The root cause in both cases is:
> {noformat}
> Caused by: java.io.IOException: 
> com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.SdkClientException: Unable 
> to execute HTTP request: Timeout waiting for connection from pool
> at 
> com.amazon.ws.emr.hadoop.fs.s3n2.S3NativeFileSystem2.getFileStatus(S3NativeFileSystem2.java:257)
>  ~[emrfs-hadoop-assembly-2.37.0.jar:?]{noformat}
> Based on 
> [https://aws.amazon.com/premiumsupport/knowledge-center/emr-timeout-connection-wait/]
>  increasing the fs.s3.maxConnections setting to 1 is just delaying the 
> issue thus the underlying issue is likely a connection leak. It also 
> indicates a leak that restarting the kylin service solves the problem.
> A full stack trace from the QueryService is attached.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KYLIN-4656) Guava classpath conflict caused by kylin-jdbc 3.1.0 jar

2020-07-22 Thread Gabor Arki (Jira)


 [ 
https://issues.apache.org/jira/browse/KYLIN-4656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Arki updated KYLIN-4656:
--
Description: 
The newly released kylin-jdbc 3.1.0 jar contains a shaded, non-repackaged 
version of the Guava library. This causes class duplication with the original 
Guava jar if it is also on the classpath, resulting in non-deterministic 
runtime errors depending on which version of a given Guava class the 
class-loader picks up from the two copies. Based on the runtime errors about 
missing classes and methods, the bundled copy seems to be a very old version, 
probably <=14.

 

Either implement proper shading with package relocation or rely on the 
transitive dependency, but do not shade non-repackaged versions of libraries.

  was:
The newly released kylin-jdbc 3.1.0 jar contains a shaded, non-repackaged 
version of the Guava library. This is causing class duplication with the 
original guava jar if it is also on the classpath which results in 
non-deterministic, runtime errors depending on which version of a certain guava 
class has been picked up by the class-loader from the 2 versions. Based on the 
runtime errors of the missing classes and methods, it seems to be a very old 
version, probably <=14. Because of the, 

 

Either implement a proper shading with package relocation or rely on transitive 
dependency, but do not shade non-repackaged versions of libraries.


> Guava classpath conflict caused by kylin-jdbc 3.1.0 jar
> ---
>
> Key: KYLIN-4656
> URL: https://issues.apache.org/jira/browse/KYLIN-4656
> Project: Kylin
>  Issue Type: Bug
>  Components: Driver - JDBC
>Affects Versions: v3.1.0
>Reporter: Gabor Arki
>Priority: Critical
>
> The newly released kylin-jdbc 3.1.0 jar contains a shaded, non-repackaged 
> version of the Guava library. This is causing class duplication with the 
> original guava jar if it is also on the classpath which results in 
> non-deterministic, runtime errors depending on which version of a certain 
> guava class has been picked up by the class-loader from the 2 versions. Based 
> on the runtime errors of the missing classes and methods, it seems to be a 
> very old version, probably <=14.
>  
> Either implement a proper shading with package relocation or rely on 
> transitive dependency, but do not shade non-repackaged versions of libraries.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KYLIN-4656) Guava classpath conflict caused by kylin-jdbc 3.1.0 jar

2020-07-22 Thread Gabor Arki (Jira)


 [ 
https://issues.apache.org/jira/browse/KYLIN-4656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Arki updated KYLIN-4656:
--
Description: 
The newly released kylin-jdbc 3.1.0 jar contains a shaded, non-repackaged 
version of the Guava library. This is causing class duplication with the 
original guava jar if it is also on the classpath which results in 
non-deterministic, runtime errors depending on which version of a certain guava 
class has been picked up by the class-loader from the 2 versions. Based on the 
runtime errors of the missing classes and methods, it seems to be a very old 
version, probably <=14. Because of the, 

 

Either implement a proper shading with package relocation or rely on transitive 
dependency, but do not shade non-repackaged versions of libraries.

  was:
The newly released kylin-jdbc 3.1.0 jar contains a shaded, non-repackaged 
version of the Guava library. This is causing class duplication with the 
original guava jar if it is also on the classpath which results in 
non-deterministic, runtime errors depending on which version of a certain guava 
class has been picked up by the class-loader from the 2 versions. Based on the 
runtime errors of the missing classes and methods, it seems to be a very old 
version, probably <=14.

 

Either implement a proper shading with package relocation or rely on transitive 
dependency, but do not shade non-repackaged versions of libraries.


> Guava classpath conflict caused by kylin-jdbc 3.1.0 jar
> ---
>
> Key: KYLIN-4656
> URL: https://issues.apache.org/jira/browse/KYLIN-4656
> Project: Kylin
>  Issue Type: Bug
>  Components: Driver - JDBC
>Affects Versions: v3.1.0
>Reporter: Gabor Arki
>Priority: Critical
>
> The newly released kylin-jdbc 3.1.0 jar contains a shaded, non-repackaged 
> version of the Guava library. This is causing class duplication with the 
> original guava jar if it is also on the classpath which results in 
> non-deterministic, runtime errors depending on which version of a certain 
> guava class has been picked up by the class-loader from the 2 versions. Based 
> on the runtime errors of the missing classes and methods, it seems to be a 
> very old version, probably <=14. Because of the, 
>  
> Either implement a proper shading with package relocation or rely on 
> transitive dependency, but do not shade non-repackaged versions of libraries.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KYLIN-4656) Guava classpath conflict caused by kylin-jdbc 3.1.0 jar

2020-07-22 Thread Gabor Arki (Jira)


 [ 
https://issues.apache.org/jira/browse/KYLIN-4656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Arki updated KYLIN-4656:
--
Description: 
The newly released kylin-jdbc 3.1.0 jar contains a shaded, non-repackaged 
version of the Guava library. This is causing class duplication with the 
original guava jar if it is also on the classpath which results in 
non-deterministic, runtime errors depending on which version of a certain guava 
class has been picked up by the class-loader from the 2 versions. Based on the 
runtime errors of the missing classes and methods, it seems to be a very old 
version, probably <=14.

 

Either implement a proper shading with package relocation or rely on transitive 
dependency, but do not shade non-repackaged versions of libraries.

  was:
The newly released kylin-jdbc 3.1.0 jar contains a shaded, non-repackaged 
version of the Guava library. This is causing class duplications with the 
original guava jar if it is also on the classpath which results in 
non-deterministic, runtime errors depending on which version of a certain guava 
class has been picked up by the classloader from the 2 versions.

 

Either implement a proper shading with package relocation or rely on transitive 
dependency, but do not shade non-repackaged versions of libraries.


> Guava classpath conflict caused by kylin-jdbc 3.1.0 jar
> ---
>
> Key: KYLIN-4656
> URL: https://issues.apache.org/jira/browse/KYLIN-4656
> Project: Kylin
>  Issue Type: Bug
>  Components: Driver - JDBC
>Affects Versions: v3.1.0
>Reporter: Gabor Arki
>Priority: Critical
>
> The newly released kylin-jdbc 3.1.0 jar contains a shaded, non-repackaged 
> version of the Guava library. This is causing class duplication with the 
> original guava jar if it is also on the classpath which results in 
> non-deterministic, runtime errors depending on which version of a certain 
> guava class has been picked up by the class-loader from the 2 versions. Based 
> on the runtime errors of the missing classes and methods, it seems to be a 
> very old version, probably <=14.
>  
> Either implement a proper shading with package relocation or rely on 
> transitive dependency, but do not shade non-repackaged versions of libraries.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KYLIN-4656) Guava classpath conflict caused by kylin-jdbc 3.1.0 jar

2020-07-22 Thread Gabor Arki (Jira)
Gabor Arki created KYLIN-4656:
-

 Summary: Guava classpath conflict caused by kylin-jdbc 3.1.0 jar
 Key: KYLIN-4656
 URL: https://issues.apache.org/jira/browse/KYLIN-4656
 Project: Kylin
  Issue Type: Bug
  Components: Driver - JDBC
Affects Versions: v3.1.0
Reporter: Gabor Arki


The newly released kylin-jdbc 3.1.0 jar contains a shaded, non-repackaged 
version of the Guava library. This is causing class duplications with the 
original guava jar if it is also on the classpath which results in 
non-deterministic, runtime errors depending on which version of a certain guava 
class has been picked up by the classloader from the 2 versions.

 

Either implement a proper shading with package relocation or rely on transitive 
dependency, but do not shade non-repackaged versions of libraries.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KYLIN-4500) Timeout waiting for connection from pool

2020-07-22 Thread Gabor Arki (Jira)


[ 
https://issues.apache.org/jira/browse/KYLIN-4500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17162898#comment-17162898
 ] 

Gabor Arki commented on KYLIN-4500:
---

[~hit_lacus], the linked issue seems to be unrelated.

We are running Kylin on AWS EMR cluster and use S3 (EMRFS) for data storage 
instead of HDFS to make the cluster stateless. However, after some continuous 
uptime, we are always facing this issue where both the query server and the 
Kylin MR jobs are suddenly failing with the aforementioned Exception. The root 
cause of these failures is that the connection pool of the EMR cluster to S3 is 
exhausted and new operations fail to acquire a connection and time out while 
waiting for an S3 connection.

No matter how much of a pool size we are configuring for the 
fs.s3.maxConnections value, this keeps happening. The underlying issue is very 
likely a connection leak where some code is not properly closing and returning 
a connection to the pool. Given a query server restart is solving the issue, I 
suspect the pool is exhausted somewhere in the Kylin query server code.

> Timeout waiting for connection from pool
> 
>
> Key: KYLIN-4500
> URL: https://issues.apache.org/jira/browse/KYLIN-4500
> Project: Kylin
>  Issue Type: Bug
>Reporter: Gabor Arki
>Priority: Major
> Attachments: kylin-connection-timeout.txt
>
>
> h4. Environment
>  * Kylin server 3.0.0
>  * EMR 5.28
> h4. Issue
> After an extended uptime, both Kylin query server and jobs running on EMR 
> stop working. The root cause in both cases is:
> {noformat}
> Caused by: java.io.IOException: 
> com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.SdkClientException: Unable 
> to execute HTTP request: Timeout waiting for connection from pool
> at 
> com.amazon.ws.emr.hadoop.fs.s3n2.S3NativeFileSystem2.getFileStatus(S3NativeFileSystem2.java:257)
>  ~[emrfs-hadoop-assembly-2.37.0.jar:?]{noformat}
> Based on 
> [https://aws.amazon.com/premiumsupport/knowledge-center/emr-timeout-connection-wait/]
>  increasing the fs.s3.maxConnections setting to 1 is just delaying the 
> issue thus the underlying issue is likely a connection leak. It also 
> indicates a leak that restarting the kylin service solves the problem.
> A full stack trace from the QueryService is attached.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KYLIN-4500) Timeout waiting for connection from pool

2020-05-18 Thread Gabor Arki (Jira)


 [ 
https://issues.apache.org/jira/browse/KYLIN-4500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Arki updated KYLIN-4500:
--
Description: 
h4. Environment
 * Kylin server 3.0.0
 * EMR 5.28

h4. Issue

After an extended uptime, both Kylin query server and jobs running on EMR stop 
working. The root cause in both cases is:
{noformat}
Caused by: java.io.IOException: 
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.SdkClientException: Unable to 
execute HTTP request: Timeout waiting for connection from pool
at 
com.amazon.ws.emr.hadoop.fs.s3n2.S3NativeFileSystem2.getFileStatus(S3NativeFileSystem2.java:257)
 ~[emrfs-hadoop-assembly-2.37.0.jar:?]{noformat}
Based on 
[https://aws.amazon.com/premiumsupport/knowledge-center/emr-timeout-connection-wait/], 
increasing the fs.s3.maxConnections setting to 1 merely delays the issue, so 
the underlying problem is likely a connection leak. The fact that restarting 
the Kylin service resolves the problem also points to a leak.

A full stack trace from the QueryService is attached.
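
For completeness, the workaround from the linked AWS article is simply a 
larger pool. A sketch of applying the key through the Hadoop Configuration API 
(on EMR the setting normally lives in the cluster's emrfs-site configuration; 
the value below is only an example and, per the above, merely postpones the 
failure):

{code:java}
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class RaiseS3Pool {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // EMRFS pool size key from the AWS article; example value only.
        conf.setInt("fs.s3.maxConnections", 1000);
        // Hypothetical bucket; on EMR the s3:// scheme resolves to EMRFS.
        FileSystem fs = FileSystem.get(URI.create("s3://my-bucket"), conf);
        System.out.println("fs.s3.maxConnections = " + conf.get("fs.s3.maxConnections"));
    }
}
{code}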

 

  was:
h4. Environment
 * Kylin server 3.0.0
 * EMR 5.28

h4. Issue

After an extended uptime, both Kylin query server and jobs running on EMR stop 
working. The root cause in both cases is:
{noformat}
Caused by: java.io.IOException: 
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.SdkClientException: Unable to 
execute HTTP request: Timeout waiting for connection from pool
at 
com.amazon.ws.emr.hadoop.fs.s3n2.S3NativeFileSystem2.getFileStatus(S3NativeFileSystem2.java:257)
 ~[emrfs-hadoop-assembly-2.37.0.jar:?]{noformat}
{{Based on 
[https://aws.amazon.com/premiumsupport/knowledge-center/emr-timeout-connection-wait/]
 increasing the *fs.s3.maxConnections* setting to 1 is just delaying the 
issue thus the underlying issue is likely a connection leak. It also indicates 
a leak that restarting the kylin service solves the problem.}}

{{A full stack trace from the QueryService is attached.}}

 


> Timeout waiting for connection from pool
> 
>
> Key: KYLIN-4500
> URL: https://issues.apache.org/jira/browse/KYLIN-4500
> Project: Kylin
>  Issue Type: Bug
>Reporter: Gabor Arki
>Priority: Major
> Attachments: kylin-connection-timeout.txt
>
>
> h4. Environment
>  * Kylin server 3.0.0
>  * EMR 5.28
> h4. Issue
> After an extended uptime, both Kylin query server and jobs running on EMR 
> stop working. The root cause in both cases is:
> {noformat}
> Caused by: java.io.IOException: 
> com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.SdkClientException: Unable 
> to execute HTTP request: Timeout waiting for connection from pool
> at 
> com.amazon.ws.emr.hadoop.fs.s3n2.S3NativeFileSystem2.getFileStatus(S3NativeFileSystem2.java:257)
>  ~[emrfs-hadoop-assembly-2.37.0.jar:?]{noformat}
> Based on 
> [https://aws.amazon.com/premiumsupport/knowledge-center/emr-timeout-connection-wait/]
>  increasing the fs.s3.maxConnections setting to 1 is just delaying the 
> issue thus the underlying issue is likely a connection leak. It also 
> indicates a leak that restarting the kylin service solves the problem.
> A full stack trace from the QueryService is attached.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KYLIN-4500) Timeout waiting for connection from pool

2020-05-18 Thread Gabor Arki (Jira)


 [ 
https://issues.apache.org/jira/browse/KYLIN-4500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Arki updated KYLIN-4500:
--
Description: 
h4. Environment
 * Kylin server 3.0.0
 * EMR 5.28

h4. Issue

After an extended uptime, both Kylin query server and jobs running on EMR stop 
working. The root cause in both cases is:
{noformat}
Caused by: java.io.IOException: 
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.SdkClientException: Unable to 
execute HTTP request: Timeout waiting for connection from pool
at 
com.amazon.ws.emr.hadoop.fs.s3n2.S3NativeFileSystem2.getFileStatus(S3NativeFileSystem2.java:257)
 ~[emrfs-hadoop-assembly-2.37.0.jar:?]{noformat}
{{Based on 
[https://aws.amazon.com/premiumsupport/knowledge-center/emr-timeout-connection-wait/]
 increasing the *fs.s3.maxConnections* setting to 1 is just delaying the 
issue thus the underlying issue is likely a connection leak. It also indicates 
a leak that restarting the kylin service solves the problem.}}

{{A full stack trace from the QueryService is attached.}}

 

  was:
h4. Environment
 * Kylin server 3.0.0
 * EMR 5.28

h4. Issue

After an extended uptime, both Kylin query server and jobs running on EMR stop 
working. The root cause is both cases is:
{noformat}
Caused by: java.io.IOException: 
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.SdkClientException: Unable to 
execute HTTP request: Timeout waiting for connection from pool
at 
com.amazon.ws.emr.hadoop.fs.s3n2.S3NativeFileSystem2.getFileStatus(S3NativeFileSystem2.java:257)
 ~[emrfs-hadoop-assembly-2.37.0.jar:?]{noformat}
{{Based on 
[https://aws.amazon.com/premiumsupport/knowledge-center/emr-timeout-connection-wait/]
 increasing the *fs.s3.maxConnections* setting to 1 is just delaying the 
issue thus the underlying issue is likely a connection leak. It also indicates 
a leak that restarting the kylin service solves the problem.}}

{{A full stack trace from the QueryService is attached.}}

 


> Timeout waiting for connection from pool
> 
>
> Key: KYLIN-4500
> URL: https://issues.apache.org/jira/browse/KYLIN-4500
> Project: Kylin
>  Issue Type: Bug
>Reporter: Gabor Arki
>Priority: Major
> Attachments: kylin-connection-timeout.txt
>
>
> h4. Environment
>  * Kylin server 3.0.0
>  * EMR 5.28
> h4. Issue
> After an extended uptime, both Kylin query server and jobs running on EMR 
> stop working. The root cause in both cases is:
> {noformat}
> Caused by: java.io.IOException: 
> com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.SdkClientException: Unable 
> to execute HTTP request: Timeout waiting for connection from pool
> at 
> com.amazon.ws.emr.hadoop.fs.s3n2.S3NativeFileSystem2.getFileStatus(S3NativeFileSystem2.java:257)
>  ~[emrfs-hadoop-assembly-2.37.0.jar:?]{noformat}
> {{Based on 
> [https://aws.amazon.com/premiumsupport/knowledge-center/emr-timeout-connection-wait/]
>  increasing the *fs.s3.maxConnections* setting to 1 is just delaying the 
> issue thus the underlying issue is likely a connection leak. It also 
> indicates a leak that restarting the kylin service solves the problem.}}
> {{A full stack trace from the QueryService is attached.}}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KYLIN-4500) Timeout waiting for connection from pool

2020-05-18 Thread Gabor Arki (Jira)


 [ 
https://issues.apache.org/jira/browse/KYLIN-4500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Arki updated KYLIN-4500:
--
Description: 
h4. Environment
 * Kylin server 3.0.0
 * EMR 5.28

h4. Issue

After an extended uptime, both Kylin query server and jobs running on EMR stop 
working. The root cause is both cases is:
{noformat}
Caused by: java.io.IOException: 
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.SdkClientException: Unable to 
execute HTTP request: Timeout waiting for connection from pool
at 
com.amazon.ws.emr.hadoop.fs.s3n2.S3NativeFileSystem2.getFileStatus(S3NativeFileSystem2.java:257)
 ~[emrfs-hadoop-assembly-2.37.0.jar:?]{noformat}
{{Based on 
[https://aws.amazon.com/premiumsupport/knowledge-center/emr-timeout-connection-wait/]
 increasing the *fs.s3.maxConnections* setting to 1 is just delaying the 
issue thus the underlying issue is likely a connection leak. It also indicates 
a leak that restarting the kylin service solves the problem.}}

{{A full stack trace from the QueryService is attached.}}

 

  was:
h4. Environment
 * Kylin server 3.0.0
 * EMR 5.28

h4. Issue

After an extended uptime, both Kylin query server and jobs running on EMR stop 
working. The root cause is both cases is:

{{Caused by: java.io.IOException: 
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.SdkClientException: Unable to 
execute HTTP request: Timeout waiting for connection from pool}}
{{ at 
com.amazon.ws.emr.hadoop.fs.s3n2.S3NativeFileSystem2.getFileStatus(S3NativeFileSystem2.java:257)
 ~[emrfs-hadoop-assembly-2.37.0.jar:?]}}

{{Based on 
[https://aws.amazon.com/premiumsupport/knowledge-center/emr-timeout-connection-wait/]
 increasing the *fs.s3.maxConnections* setting to 1 is just delaying the 
issue thus the underlying issue is likely a connection leak. It also indicates 
a leak that restarting the kylin service solves the problem.}}

{{A full stack trace from the QueryService is attached.}}

 


> Timeout waiting for connection from pool
> 
>
> Key: KYLIN-4500
> URL: https://issues.apache.org/jira/browse/KYLIN-4500
> Project: Kylin
>  Issue Type: Bug
>Reporter: Gabor Arki
>Priority: Major
> Attachments: kylin-connection-timeout.txt
>
>
> h4. Environment
>  * Kylin server 3.0.0
>  * EMR 5.28
> h4. Issue
> After an extended uptime, both Kylin query server and jobs running on EMR 
> stop working. The root cause is both cases is:
> {noformat}
> Caused by: java.io.IOException: 
> com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.SdkClientException: Unable 
> to execute HTTP request: Timeout waiting for connection from pool
> at 
> com.amazon.ws.emr.hadoop.fs.s3n2.S3NativeFileSystem2.getFileStatus(S3NativeFileSystem2.java:257)
>  ~[emrfs-hadoop-assembly-2.37.0.jar:?]{noformat}
> {{Based on 
> [https://aws.amazon.com/premiumsupport/knowledge-center/emr-timeout-connection-wait/]
>  increasing the *fs.s3.maxConnections* setting to 1 is just delaying the 
> issue thus the underlying issue is likely a connection leak. It also 
> indicates a leak that restarting the kylin service solves the problem.}}
> {{A full stack trace from the QueryService is attached.}}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KYLIN-4500) Timeout waiting for connection from pool

2020-05-18 Thread Gabor Arki (Jira)


 [ 
https://issues.apache.org/jira/browse/KYLIN-4500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Arki updated KYLIN-4500:
--
Attachment: kylin-connection-timeout.txt

> Timeout waiting for connection from pool
> 
>
> Key: KYLIN-4500
> URL: https://issues.apache.org/jira/browse/KYLIN-4500
> Project: Kylin
>  Issue Type: Bug
>Reporter: Gabor Arki
>Priority: Major
> Attachments: kylin-connection-timeout.txt
>
>
> h4. Environment
>  * Kylin server 3.0.0
>  * EMR 5.28
> h4. Issue
> After an extended uptime, both Kylin query server and jobs running on EMR 
> stop working. The root cause is both cases is:
> {noformat}
> Caused by: java.io.IOException: 
> com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.SdkClientException: Unable 
> to execute HTTP request: Timeout waiting for connection from pool
> at 
> com.amazon.ws.emr.hadoop.fs.s3n2.S3NativeFileSystem2.getFileStatus(S3NativeFileSystem2.java:257)
>  ~[emrfs-hadoop-assembly-2.37.0.jar:?]{noformat}
> {{Based on 
> [https://aws.amazon.com/premiumsupport/knowledge-center/emr-timeout-connection-wait/]
>  increasing the *fs.s3.maxConnections* setting to 1 is just delaying the 
> issue thus the underlying issue is likely a connection leak. It also 
> indicates a leak that restarting the kylin service solves the problem.}}
> {{A full stack trace from the QueryService is attached.}}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KYLIN-4500) Timeout waiting for connection from pool

2020-05-18 Thread Gabor Arki (Jira)
Gabor Arki created KYLIN-4500:
-

 Summary: Timeout waiting for connection from pool
 Key: KYLIN-4500
 URL: https://issues.apache.org/jira/browse/KYLIN-4500
 Project: Kylin
  Issue Type: Bug
Reporter: Gabor Arki


h4. Environment
 * Kylin server 3.0.0
 * EMR 5.28

h4. Issue

After an extended uptime, both Kylin query server and jobs running on EMR stop 
working. The root cause is both cases is:

{{Caused by: java.io.IOException: 
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.SdkClientException: Unable to 
execute HTTP request: Timeout waiting for connection from pool}}
{{ at 
com.amazon.ws.emr.hadoop.fs.s3n2.S3NativeFileSystem2.getFileStatus(S3NativeFileSystem2.java:257)
 ~[emrfs-hadoop-assembly-2.37.0.jar:?]}}

{{Based on 
[https://aws.amazon.com/premiumsupport/knowledge-center/emr-timeout-connection-wait/]
 increasing the *fs.s3.maxConnections* setting to 1 is just delaying the 
issue thus the underlying issue is likely a connection leak. It also indicates 
a leak that restarting the kylin service solves the problem.}}

{{A full stack trace from the QueryService is attached.}}

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KYLIN-4382) Unable to use DATE type in prepared statements

2020-02-18 Thread Gabor Arki (Jira)


 [ 
https://issues.apache.org/jira/browse/KYLIN-4382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Arki updated KYLIN-4382:
--
Description: 
h4. Environment
 * Kylin JDBC driver: 3.0.0
 * Kylin server: 3.0.0

h4. Reproduction steps
 * Use a cube with a DATE column (like the derived day_start)
 * Create a prepared statement and try to filter on this column in a where 
clause
 * Pass the values as java.sql.Date (see the JDBC sketch below)

h4. Expected result
 * The query returns the proper response for the specified date(s)

h4. Actual result
 * No data is returned
 * StreamStorageQuery's _Skip cube segment_ log message contains the filter 
with an epoch-day value, for example: {{DAY_START GTE [18231]}}
 * Executing the same query from the web UI returns the expected response; 
there, the same log message contains the filter in epoch-millis format, for 
example: {{DAY_START IN [158077440, 158086080]}}
 * Passing the value as String instead of java.sql.Date fails on the server 
side with: {{exception while executing query: java.lang.String cannot be cast 
to java.lang.Integer}}
 * Passing the value as java.sql.Timestamp or java.util.Date fails on the 
server side with: {{exception while executing query: java.lang.Long cannot be 
cast to java.lang.Integer}}
 * Trying to CAST a String to DATE fails with the error described here: 
https://issues.apache.org/jira/browse/CALCITE-3100
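
The reproduction steps above amount to only a few lines of JDBC. A minimal 
sketch against the Kylin driver (host, project, and table names are 
placeholders); note that java.time.LocalDate.ofEpochDay(18231) is 2019-12-01, 
matching the epoch-day encoding seen in the log:

{code:java}
import java.sql.Connection;
import java.sql.Date;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class DateFilterRepro {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.kylin.jdbc.Driver");
        // Placeholder host, project, and table; the URL format is Kylin's.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:kylin://localhost:7070/my_project", "ADMIN", "KYLIN");
             PreparedStatement ps = conn.prepareStatement(
                "SELECT COUNT(*) FROM my_fact_table WHERE DAY_START >= ?")) {
            // Bound as java.sql.Date; per the ticket the filter reaches the
            // server as an epoch day (e.g. 18231 = 2019-12-01) and matches
            // no segments, so no data comes back.
            ps.setDate(1, Date.valueOf("2019-12-01"));
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getLong(1));
                }
            }
        }
    }
}
{code}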

 

  was:
h4. Environment
 * Kylin JDBC driver: 3.0.0
 * Kylin server: 3.0.0

h4. Reproduction steps
 * Use a cube with a DATE column (like the derived day_start)
 * Create a prepared statement and try to filter with this column in a where 
clause
 * Pass the values as java.sql.Date type

h4. Expected result
 * The proper response is provided for the query with the values for the 
specified date(s)

h4. Actual result
 * No data is returned
 * StreamStorageQuery's _Skip cube segment_ log message is containing the 
filter with an epoch day value, for example: {{DAY_START GTE [18231]}}
 * Executing the same query from the web UI you get the expected response. Now 
the same log message is containing the filter in epoch millis format, for 
example: {{DAY_START IN [158077440, 158086080]}}

 * Passing the value as String instead of java.sql.Date fails on server-side 
with: {{exception while executing query: java.lang.String cannot be cast to 
java.lang.Integer}}
 * Passing the value as java.sql.Timestamp or java.util.Date fails on 
server-side with: {{exception while executing query: java.lang.Long cannot be 
cast to java.lang.Integer}}

 


> Unable to use DATE type in prepared statements
> --
>
> Key: KYLIN-4382
> URL: https://issues.apache.org/jira/browse/KYLIN-4382
> Project: Kylin
>  Issue Type: Bug
>  Components: Query Engine
>Reporter: Gabor Arki
>Priority: Major
>
> h4. Environment
>  * Kylin JDBC driver: 3.0.0
>  * Kylin server: 3.0.0
> h4. Reproduction steps
>  * Use a cube with a DATE column (like the derived day_start)
>  * Create a prepared statement and try to filter with this column in a where 
> clause
>  * Pass the values as java.sql.Date type
> h4. Expected result
>  * The proper response is provided for the query with the values for the 
> specified date(s)
> h4. Actual result
>  * No data is returned
>  * StreamStorageQuery's _Skip cube segment_ log message is containing the 
> filter with an epoch day value, for example: {{DAY_START GTE [18231]}}
>  * Executing the same query from the web UI you get the expected response. 
> Now the same log message is containing the filter in epoch millis format, for 
> example: {{DAY_START IN [158077440, 158086080]}}
>  * Passing the value as String instead of java.sql.Date fails on server-side 
> with: {{exception while executing query: java.lang.String cannot be cast to 
> java.lang.Integer}}
>  * Passing the value as java.sql.Timestamp or java.util.Date fails on 
> server-side with: {{exception while executing query: java.lang.Long cannot be 
> cast to java.lang.Integer}}
>  * Trying to CAST a String to DATE fails with the error described here: 
> https://issues.apache.org/jira/browse/CALCITE-3100
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KYLIN-4382) Unable to use DATE type in prepared statements

2020-02-18 Thread Gabor Arki (Jira)


 [ 
https://issues.apache.org/jira/browse/KYLIN-4382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Arki updated KYLIN-4382:
--
Description: 
h4. Environment
 * Kylin JDBC driver: 3.0.0
 * Kylin server: 3.0.0

h4. Reproduction steps
 * Use a cube with a DATE column (like the derived day_start)
 * Create a prepared statement and try to filter with this column in a where 
clause
 * Pass the values as java.sql.Date type

h4. Expected result
 * The proper response is provided for the query with the values for the 
specified date(s)

h4. Actual result
 * No data is returned
 * StreamStorageQuery's _Skip cube segment_ log message is containing the 
filter with an epoch day value, for example: {{DAY_START GTE [18231]}}
 * Executing the same query from the web UI you get the expected response. Now 
the same log message is containing the filter in epoch millis format, for 
example: {{DAY_START IN [158077440, 158086080]}}

 * Passing the value as String instead of java.sql.Date fails on server-side 
with: {{exception while executing query: java.lang.String cannot be cast to 
java.lang.Integer}}
 * Passing the value as java.sql.Timestamp or java.util.Date fails on 
server-side with: {{exception while executing query: java.lang.Long cannot be 
cast to java.lang.Integer}}

 

  was:
h4. Environment
 * Kylin JDBC driver: 3.0.0
 * Kylin server: 3.0.0

h4. Reproduction steps
 * Use a cube with a DATE column (like the derived day_start)
 * Create a prepared statement and try to filter with this column in a where 
clause
 * Pass the values as java.sql.Date type

h4. Expected result
 * The proper response is provided for the query with the values for the 
specified date(s)

h4. Actual result
 * No data is returned
 * StreamStorageQuery's _Skip cube segment_ log message is containing the 
filter with an epoch day value, for example: {{}}{{DAY_START GTE [18231]}}
 * Executing the same query from the web UI you get the expected response. Now 
the same log message is containing the filter in epoch millis format, for 
example: {{DAY_START IN [158077440, 158086080]}}

 * Passing the value as String instead of java.sql.Date fails on server-side 
with: {{exception while executing query: java.lang.String cannot be cast to 
java.lang.Integer}}
 * Passing the value as java.sql.Timestamp or java.util.Date fails on 
server-side with: {{exception while executing query: java.lang.Long cannot be 
cast to java.lang.Integer}}

 


> Unable to use DATE type in prepared statements
> --
>
> Key: KYLIN-4382
> URL: https://issues.apache.org/jira/browse/KYLIN-4382
> Project: Kylin
>  Issue Type: Bug
>  Components: Query Engine
>Reporter: Gabor Arki
>Priority: Major
>
> h4. Environment
>  * Kylin JDBC driver: 3.0.0
>  * Kylin server: 3.0.0
> h4. Reproduction steps
>  * Use a cube with a DATE column (like the derived day_start)
>  * Create a prepared statement and try to filter with this column in a where 
> clause
>  * Pass the values as java.sql.Date type
> h4. Expected result
>  * The proper response is provided for the query with the values for the 
> specified date(s)
> h4. Actual result
>  * No data is returned
>  * StreamStorageQuery's _Skip cube segment_ log message is containing the 
> filter with an epoch day value, for example: {{DAY_START GTE [18231]}}
>  * Executing the same query from the web UI you get the expected response. 
> Now the same log message is containing the filter in epoch millis format, for 
> example: {{DAY_START IN [158077440, 158086080]}}
>  * Passing the value as String instead of java.sql.Date fails on server-side 
> with: {{exception while executing query: java.lang.String cannot be cast to 
> java.lang.Integer}}
>  * Passing the value as java.sql.Timestamp or java.util.Date fails on 
> server-side with: {{exception while executing query: java.lang.Long cannot be 
> cast to java.lang.Integer}}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KYLIN-4382) Unable to use DATE type in prepared statements

2020-02-18 Thread Gabor Arki (Jira)


 [ 
https://issues.apache.org/jira/browse/KYLIN-4382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Arki updated KYLIN-4382:
--
Description: 
h4. Environment
 * Kylin JDBC driver: 3.0.0
 * Kylin server: 3.0.0

h4. Reproduction steps
 * Use a cube with a DATE column (like the derived day_start)
 * Create a prepared statement and try to filter with this column in a where 
clause
 * Pass the values as java.sql.Date type

h4. Expected result
 * The proper response is provided for the query with the values for the 
specified date(s)

h4. Actual result
 * No data is returned
 * StreamStorageQuery's _Skip cube segment_ log message is containing the 
filter with an epoch day value, for example: {{}}{{DAY_START GTE [18231]}}
 * Executing the same query from the web UI you get the expected response. Now 
the same log message is containing the filter in epoch millis format, for 
example: {{DAY_START IN [158077440, 158086080]}}

 * Passing the value as String instead of java.sql.Date fails on server-side 
with: {{exception while executing query: java.lang.String cannot be cast to 
java.lang.Integer}}
 * Passing the value as java.sql.Timestamp or java.util.Date fails on 
server-side with: {{exception while executing query: java.lang.Long cannot be 
cast to java.lang.Integer}}

 

  was:
h4. Environment
 * Kylin JDBC driver: 3.0.0
 * Kylin server: 3.0.0

h4. Reproduction steps
 * Use a cube with a DATE column (like the derived day_start)
 * Create a prepared statement and try to filter with this column in a where 
clause
 * Pass the values as java.sql.Date type

h4. Expected result
 * The proper response is provided for the query with the values for the 
specified date(s)

h4. Actual result
 * No data is returned
 * 
StreamStorageQuery's _Skip cube segment_ log message is containing the filter 
with an epoch day value, for example: {{}}{{DAY_START GTE [18231]}}
 * Executing the same query from the web UI you get the expected response. Now 
the same log message is containing the filter in epoch millis format, for 
example: {{DAY_START IN [158077440, 158086080]}}

 * Passing the value as String instead of java.sql.Date fails on server-side 
with: {{exception while executing query: java.lang.String cannot be cast to 
java.lang.Integer}}
 * Passing the value as java.sql.Timestamp or java.util.Date fails on 
server-side with: {{exception while executing query: java.lang.Long cannot be 
cast to java.lang.Integer}}

 


> Unable to use DATE type in prepared statements
> --
>
> Key: KYLIN-4382
> URL: https://issues.apache.org/jira/browse/KYLIN-4382
> Project: Kylin
>  Issue Type: Bug
>  Components: Query Engine
>Reporter: Gabor Arki
>Priority: Major
>
> h4. Environment
>  * Kylin JDBC driver: 3.0.0
>  * Kylin server: 3.0.0
> h4. Reproduction steps
>  * Use a cube with a DATE column (like the derived day_start)
>  * Create a prepared statement and try to filter with this column in a where 
> clause
>  * Pass the values as java.sql.Date type
> h4. Expected result
>  * The proper response is provided for the query with the values for the 
> specified date(s)
> h4. Actual result
>  * No data is returned
>  * StreamStorageQuery's _Skip cube segment_ log message is containing the 
> filter with an epoch day value, for example: {{}}{{DAY_START GTE [18231]}}
>  * Executing the same query from the web UI you get the expected response. 
> Now the same log message is containing the filter in epoch millis format, for 
> example: {{DAY_START IN [158077440, 158086080]}}
>  * Passing the value as String instead of java.sql.Date fails on server-side 
> with: {{exception while executing query: java.lang.String cannot be cast to 
> java.lang.Integer}}
>  * Passing the value as java.sql.Timestamp or java.util.Date fails on 
> server-side with: {{exception while executing query: java.lang.Long cannot be 
> cast to java.lang.Integer}}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KYLIN-4382) Unable to use DATE type in prepared statements

2020-02-18 Thread Gabor Arki (Jira)
Gabor Arki created KYLIN-4382:
-

 Summary: Unable to use DATE type in prepared statements
 Key: KYLIN-4382
 URL: https://issues.apache.org/jira/browse/KYLIN-4382
 Project: Kylin
  Issue Type: Bug
  Components: Query Engine
Reporter: Gabor Arki


h4. Environment
 * Kylin JDBC driver: 3.0.0
 * Kylin server: 3.0.0

h4. Reproduction steps
 * Use a cube with a DATE column (like the derived day_start)
 * Create a prepared statement and try to filter with this column in a where 
clause
 * Pass the values as java.sql.Date type

h4. Expected result
 * The proper response is provided for the query with the values for the 
specified date(s)

h4. Actual result
 * No data is returned
 * 
StreamStorageQuery's _Skip cube segment_ log message is containing the filter 
with an epoch day value, for example: {{}}{{DAY_START GTE [18231]}}
 * Executing the same query from the web UI you get the expected response. Now 
the same log message is containing the filter in epoch millis format, for 
example: {{DAY_START IN [158077440, 158086080]}}

 * Passing the value as String instead of java.sql.Date fails on server-side 
with: {{exception while executing query: java.lang.String cannot be cast to 
java.lang.Integer}}
 * Passing the value as java.sql.Timestamp or java.util.Date fails on 
server-side with: {{exception while executing query: java.lang.Long cannot be 
cast to java.lang.Integer}}

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)