[jira] [Commented] (HIVE-24569) LLAP daemon leaks file descriptors/log4j appenders

2021-01-18 Thread Stamatis Zampetakis (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-24569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17267193#comment-17267193
 ] 

Stamatis Zampetakis commented on HIVE-24569:


[~prasanth_j] indicated in the PR that removing the '.done' suffix in log files 
(which happens with the current PR) may cause some problems with respect to 
yarn log aggregation and tez ui. I searched a bit around the history of this 
naming pattern and it seems that indeed there are some dependencies with yarn 
and tez (relevant issues are HIVE-14224, HIVE-14225, SLIDER-116, TEZ-3629) so 
using plainly the {{IdlePurgePolicy}} may not be possible.

> LLAP daemon leaks file descriptors/log4j appenders
> --
>
> Key: HIVE-24569
> URL: https://issues.apache.org/jira/browse/HIVE-24569
> Project: Hive
>  Issue Type: Bug
>  Components: llap
>Affects Versions: 2.2.0
>Reporter: Stamatis Zampetakis
>Assignee: Stamatis Zampetakis
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: llap-appender-gc-roots.png
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> With HIVE-9756 query logs in LLAP are directed to different files (file per 
> query) using a Log4j2 routing appender. Without a purge policy in place, 
> appenders are created dynamically by the routing appender, one for each 
> query, and remain in memory forever. The dynamic appenders write to files so 
> each appender holds to a file descriptor. 
> Further work HIVE-14224 has mitigated the issue by introducing a custom 
> purging policy (LlapRoutingAppenderPurgePolicy) which deletes the dynamic 
> appenders (and closes the respective files) when the query is completed 
> (org.apache.hadoop.hive.llap.daemon.impl.QueryTracker#handleLogOnQueryCompletion).
>  
> However, in the presence of multiple threads appending to the logs there are 
> race conditions. In an internal Hive cluster the number of file descriptors 
> started going up approx one descriptor leaking per query. After some 
> debugging it turns out that one thread (running the 
> QueryTracker#handleLogOnQueryCompletion) signals that the query has finished 
> and thus the purge policy should get rid of the respective appender (and 
> close the file) while another (Task-Executor-0) attempts to append another 
> log message for the same query. The initial appender is closed after the 
> request from the query tracker but a new one is created to accomodate the 
> message from the task executor and the latter is never removed thus creating 
> a leak. 
> Similar leaks have been identified and fixed for HS2 with the most similar 
> one being that described 
> [here|https://issues.apache.org/jira/browse/HIVE-22753?focusedCommentId=17021041=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17021041].
>  
> The problem relies on the timing of threads so it may not manifestate in all 
> versions between 2.2.0 and 4.0.0. Usually the leak can be seen either via 
> lsof (or other similar command) with the following output:
> {noformat}
> # 1494391 is the PID of the LLAP daemon process
> ls -ltr /proc/1494391/fd
> ...
> lrwx-- 1 hive hadoop 64 Dec 24 12:08 978 -> 
> /hadoop/yarn/log/application_1608659125567_0006/container_e04_1608659125567_0006_01_02/hive_20201224121724_66ce273d-54a9-4dcd-a9fb-20cb5691cef7-dag_1608659125567_0008_194.log
> lrwx-- 1 hive hadoop 64 Dec 24 12:08 977 -> 
> /hadoop/yarn/log/application_1608659125567_0006/container_e04_1608659125567_0006_01_02/hive_20201224121804_ce53eeb5-c73f-4999-b7a4-b4dd04d4e4de-dag_1608659125567_0008_197.log
> lrwx-- 1 hive hadoop 64 Dec 24 12:08 974 -> 
> /hadoop/yarn/log/application_1608659125567_0006/container_e04_1608659125567_0006_01_02/hive_20201224122002_1693bd7d-2f0e-4673-a8d1-b7cb14a02204-dag_1608659125567_0008_204.log
> lrwx-- 1 hive hadoop 64 Dec 24 12:08 989 -> 
> /hadoop/yarn/log/application_1608659125567_0006/container_e04_1608659125567_0006_01_02/hive_20201224121909_6a56218f-06c7-4906-9907-4b6dd824b100-dag_1608659125567_0008_201.log
> lrwx-- 1 hive hadoop 64 Dec 24 12:08 984 -> 
> /hadoop/yarn/log/application_1608659125567_0006/container_e04_1608659125567_0006_01_02/hive_20201224121754_78ef49a0-bc23-478f-9a16-87fa25e7a287-dag_1608659125567_0008_196.log
> lrwx-- 1 hive hadoop 64 Dec 24 12:08 983 -> 
> /hadoop/yarn/log/application_1608659125567_0006/container_e04_1608659125567_0006_01_02/hive_20201224121855_e65b9ebf-b2ec-4159-9570-1904442b7048-dag_1608659125567_0008_200.log
> lrwx-- 1 hive hadoop 64 Dec 24 12:08 981 -> 
> /hadoop/yarn/log/application_1608659125567_0006/container_e04_1608659125567_0006_01_02/hive_20201224121818_e9051ae3-1316-46af-aabb-22c53ed2fda7-dag_1608659125567_0008_198.log
> lrwx-- 1 

[jira] [Commented] (HIVE-24569) LLAP daemon leaks file descriptors/log4j appenders

2021-01-18 Thread Stamatis Zampetakis (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-24569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17267187#comment-17267187
 ] 

Stamatis Zampetakis commented on HIVE-24569:


It turns out that the race condition that leads to the leak of file 
descriptors/appenders is already logged in HIVE-14300.

> LLAP daemon leaks file descriptors/log4j appenders
> --
>
> Key: HIVE-24569
> URL: https://issues.apache.org/jira/browse/HIVE-24569
> Project: Hive
>  Issue Type: Bug
>  Components: llap
>Affects Versions: 2.2.0
>Reporter: Stamatis Zampetakis
>Assignee: Stamatis Zampetakis
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: llap-appender-gc-roots.png
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> With HIVE-9756 query logs in LLAP are directed to different files (file per 
> query) using a Log4j2 routing appender. Without a purge policy in place, 
> appenders are created dynamically by the routing appender, one for each 
> query, and remain in memory forever. The dynamic appenders write to files so 
> each appender holds to a file descriptor. 
> Further work HIVE-14224 has mitigated the issue by introducing a custom 
> purging policy (LlapRoutingAppenderPurgePolicy) which deletes the dynamic 
> appenders (and closes the respective files) when the query is completed 
> (org.apache.hadoop.hive.llap.daemon.impl.QueryTracker#handleLogOnQueryCompletion).
>  
> However, in the presence of multiple threads appending to the logs there are 
> race conditions. In an internal Hive cluster the number of file descriptors 
> started going up approx one descriptor leaking per query. After some 
> debugging it turns out that one thread (running the 
> QueryTracker#handleLogOnQueryCompletion) signals that the query has finished 
> and thus the purge policy should get rid of the respective appender (and 
> close the file) while another (Task-Executor-0) attempts to append another 
> log message for the same query. The initial appender is closed after the 
> request from the query tracker but a new one is created to accomodate the 
> message from the task executor and the latter is never removed thus creating 
> a leak. 
> Similar leaks have been identified and fixed for HS2 with the most similar 
> one being that described 
> [here|https://issues.apache.org/jira/browse/HIVE-22753?focusedCommentId=17021041=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17021041].
>  
> The problem relies on the timing of threads so it may not manifestate in all 
> versions between 2.2.0 and 4.0.0. Usually the leak can be seen either via 
> lsof (or other similar command) with the following output:
> {noformat}
> # 1494391 is the PID of the LLAP daemon process
> ls -ltr /proc/1494391/fd
> ...
> lrwx-- 1 hive hadoop 64 Dec 24 12:08 978 -> 
> /hadoop/yarn/log/application_1608659125567_0006/container_e04_1608659125567_0006_01_02/hive_20201224121724_66ce273d-54a9-4dcd-a9fb-20cb5691cef7-dag_1608659125567_0008_194.log
> lrwx-- 1 hive hadoop 64 Dec 24 12:08 977 -> 
> /hadoop/yarn/log/application_1608659125567_0006/container_e04_1608659125567_0006_01_02/hive_20201224121804_ce53eeb5-c73f-4999-b7a4-b4dd04d4e4de-dag_1608659125567_0008_197.log
> lrwx-- 1 hive hadoop 64 Dec 24 12:08 974 -> 
> /hadoop/yarn/log/application_1608659125567_0006/container_e04_1608659125567_0006_01_02/hive_20201224122002_1693bd7d-2f0e-4673-a8d1-b7cb14a02204-dag_1608659125567_0008_204.log
> lrwx-- 1 hive hadoop 64 Dec 24 12:08 989 -> 
> /hadoop/yarn/log/application_1608659125567_0006/container_e04_1608659125567_0006_01_02/hive_20201224121909_6a56218f-06c7-4906-9907-4b6dd824b100-dag_1608659125567_0008_201.log
> lrwx-- 1 hive hadoop 64 Dec 24 12:08 984 -> 
> /hadoop/yarn/log/application_1608659125567_0006/container_e04_1608659125567_0006_01_02/hive_20201224121754_78ef49a0-bc23-478f-9a16-87fa25e7a287-dag_1608659125567_0008_196.log
> lrwx-- 1 hive hadoop 64 Dec 24 12:08 983 -> 
> /hadoop/yarn/log/application_1608659125567_0006/container_e04_1608659125567_0006_01_02/hive_20201224121855_e65b9ebf-b2ec-4159-9570-1904442b7048-dag_1608659125567_0008_200.log
> lrwx-- 1 hive hadoop 64 Dec 24 12:08 981 -> 
> /hadoop/yarn/log/application_1608659125567_0006/container_e04_1608659125567_0006_01_02/hive_20201224121818_e9051ae3-1316-46af-aabb-22c53ed2fda7-dag_1608659125567_0008_198.log
> lrwx-- 1 hive hadoop 64 Dec 24 12:08 980 -> 
> /hadoop/yarn/log/application_1608659125567_0006/container_e04_1608659125567_0006_01_02/hive_20201224121744_fcf37921-4351-4368-95ee-b5be2592d89a-dag_1608659125567_0008_195.log
> lrwx-- 1 hive hadoop 64 Dec 24 12:08 979 -> 
> 

[jira] [Commented] (HIVE-24569) LLAP daemon leaks file descriptors/log4j appenders

2021-01-15 Thread Stamatis Zampetakis (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-24569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17266421#comment-17266421
 ] 

Stamatis Zampetakis commented on HIVE-24569:


Thanks [~prasanth_j], I replied in the PR. If you think that 5 sec of 
inactivity is very low then we can increase this value; we may keep the 
descriptor open for a bit longer but it is better if it is not on us to manage 
the lifecycle of appenders.

> LLAP daemon leaks file descriptors/log4j appenders
> --
>
> Key: HIVE-24569
> URL: https://issues.apache.org/jira/browse/HIVE-24569
> Project: Hive
>  Issue Type: Bug
>  Components: llap
>Affects Versions: 2.2.0
>Reporter: Stamatis Zampetakis
>Assignee: Stamatis Zampetakis
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: llap-appender-gc-roots.png
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> With HIVE-9756 query logs in LLAP are directed to different files (file per 
> query) using a Log4j2 routing appender. Without a purge policy in place, 
> appenders are created dynamically by the routing appender, one for each 
> query, and remain in memory forever. The dynamic appenders write to files so 
> each appender holds to a file descriptor. 
> Further work HIVE-14224 has mitigated the issue by introducing a custom 
> purging policy (LlapRoutingAppenderPurgePolicy) which deletes the dynamic 
> appenders (and closes the respective files) when the query is completed 
> (org.apache.hadoop.hive.llap.daemon.impl.QueryTracker#handleLogOnQueryCompletion).
>  
> However, in the presence of multiple threads appending to the logs there are 
> race conditions. In an internal Hive cluster the number of file descriptors 
> started going up approx one descriptor leaking per query. After some 
> debugging it turns out that one thread (running the 
> QueryTracker#handleLogOnQueryCompletion) signals that the query has finished 
> and thus the purge policy should get rid of the respective appender (and 
> close the file) while another (Task-Executor-0) attempts to append another 
> log message for the same query. The initial appender is closed after the 
> request from the query tracker but a new one is created to accomodate the 
> message from the task executor and the latter is never removed thus creating 
> a leak. 
> Similar leaks have been identified and fixed for HS2 with the most similar 
> one being that described 
> [here|https://issues.apache.org/jira/browse/HIVE-22753?focusedCommentId=17021041=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17021041].
>  
> The problem relies on the timing of threads so it may not manifestate in all 
> versions between 2.2.0 and 4.0.0. Usually the leak can be seen either via 
> lsof (or other similar command) with the following output:
> {noformat}
> # 1494391 is the PID of the LLAP daemon process
> ls -ltr /proc/1494391/fd
> ...
> lrwx-- 1 hive hadoop 64 Dec 24 12:08 978 -> 
> /hadoop/yarn/log/application_1608659125567_0006/container_e04_1608659125567_0006_01_02/hive_20201224121724_66ce273d-54a9-4dcd-a9fb-20cb5691cef7-dag_1608659125567_0008_194.log
> lrwx-- 1 hive hadoop 64 Dec 24 12:08 977 -> 
> /hadoop/yarn/log/application_1608659125567_0006/container_e04_1608659125567_0006_01_02/hive_20201224121804_ce53eeb5-c73f-4999-b7a4-b4dd04d4e4de-dag_1608659125567_0008_197.log
> lrwx-- 1 hive hadoop 64 Dec 24 12:08 974 -> 
> /hadoop/yarn/log/application_1608659125567_0006/container_e04_1608659125567_0006_01_02/hive_20201224122002_1693bd7d-2f0e-4673-a8d1-b7cb14a02204-dag_1608659125567_0008_204.log
> lrwx-- 1 hive hadoop 64 Dec 24 12:08 989 -> 
> /hadoop/yarn/log/application_1608659125567_0006/container_e04_1608659125567_0006_01_02/hive_20201224121909_6a56218f-06c7-4906-9907-4b6dd824b100-dag_1608659125567_0008_201.log
> lrwx-- 1 hive hadoop 64 Dec 24 12:08 984 -> 
> /hadoop/yarn/log/application_1608659125567_0006/container_e04_1608659125567_0006_01_02/hive_20201224121754_78ef49a0-bc23-478f-9a16-87fa25e7a287-dag_1608659125567_0008_196.log
> lrwx-- 1 hive hadoop 64 Dec 24 12:08 983 -> 
> /hadoop/yarn/log/application_1608659125567_0006/container_e04_1608659125567_0006_01_02/hive_20201224121855_e65b9ebf-b2ec-4159-9570-1904442b7048-dag_1608659125567_0008_200.log
> lrwx-- 1 hive hadoop 64 Dec 24 12:08 981 -> 
> /hadoop/yarn/log/application_1608659125567_0006/container_e04_1608659125567_0006_01_02/hive_20201224121818_e9051ae3-1316-46af-aabb-22c53ed2fda7-dag_1608659125567_0008_198.log
> lrwx-- 1 hive hadoop 64 Dec 24 12:08 980 -> 
> 

[jira] [Commented] (HIVE-24569) LLAP daemon leaks file descriptors/log4j appenders

2021-01-15 Thread Prasanth Jayachandran (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-24569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17266271#comment-17266271
 ] 

Prasanth Jayachandran commented on HIVE-24569:
--

Left a question about how the Idle purging is triggered (we wanted to avoid 
files being closed too frequently). Looks good otherwise, +1. Thanks for adding 
tests for it! 

> LLAP daemon leaks file descriptors/log4j appenders
> --
>
> Key: HIVE-24569
> URL: https://issues.apache.org/jira/browse/HIVE-24569
> Project: Hive
>  Issue Type: Bug
>  Components: llap
>Affects Versions: 2.2.0
>Reporter: Stamatis Zampetakis
>Assignee: Stamatis Zampetakis
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: llap-appender-gc-roots.png
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> With HIVE-9756 query logs in LLAP are directed to different files (file per 
> query) using a Log4j2 routing appender. Without a purge policy in place, 
> appenders are created dynamically by the routing appender, one for each 
> query, and remain in memory forever. The dynamic appenders write to files so 
> each appender holds to a file descriptor. 
> Further work HIVE-14224 has mitigated the issue by introducing a custom 
> purging policy (LlapRoutingAppenderPurgePolicy) which deletes the dynamic 
> appenders (and closes the respective files) when the query is completed 
> (org.apache.hadoop.hive.llap.daemon.impl.QueryTracker#handleLogOnQueryCompletion).
>  
> However, in the presence of multiple threads appending to the logs there are 
> race conditions. In an internal Hive cluster the number of file descriptors 
> started going up approx one descriptor leaking per query. After some 
> debugging it turns out that one thread (running the 
> QueryTracker#handleLogOnQueryCompletion) signals that the query has finished 
> and thus the purge policy should get rid of the respective appender (and 
> close the file) while another (Task-Executor-0) attempts to append another 
> log message for the same query. The initial appender is closed after the 
> request from the query tracker but a new one is created to accomodate the 
> message from the task executor and the latter is never removed thus creating 
> a leak. 
> Similar leaks have been identified and fixed for HS2 with the most similar 
> one being that described 
> [here|https://issues.apache.org/jira/browse/HIVE-22753?focusedCommentId=17021041=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17021041].
>  
> The problem relies on the timing of threads so it may not manifestate in all 
> versions between 2.2.0 and 4.0.0. Usually the leak can be seen either via 
> lsof (or other similar command) with the following output:
> {noformat}
> # 1494391 is the PID of the LLAP daemon process
> ls -ltr /proc/1494391/fd
> ...
> lrwx-- 1 hive hadoop 64 Dec 24 12:08 978 -> 
> /hadoop/yarn/log/application_1608659125567_0006/container_e04_1608659125567_0006_01_02/hive_20201224121724_66ce273d-54a9-4dcd-a9fb-20cb5691cef7-dag_1608659125567_0008_194.log
> lrwx-- 1 hive hadoop 64 Dec 24 12:08 977 -> 
> /hadoop/yarn/log/application_1608659125567_0006/container_e04_1608659125567_0006_01_02/hive_20201224121804_ce53eeb5-c73f-4999-b7a4-b4dd04d4e4de-dag_1608659125567_0008_197.log
> lrwx-- 1 hive hadoop 64 Dec 24 12:08 974 -> 
> /hadoop/yarn/log/application_1608659125567_0006/container_e04_1608659125567_0006_01_02/hive_20201224122002_1693bd7d-2f0e-4673-a8d1-b7cb14a02204-dag_1608659125567_0008_204.log
> lrwx-- 1 hive hadoop 64 Dec 24 12:08 989 -> 
> /hadoop/yarn/log/application_1608659125567_0006/container_e04_1608659125567_0006_01_02/hive_20201224121909_6a56218f-06c7-4906-9907-4b6dd824b100-dag_1608659125567_0008_201.log
> lrwx-- 1 hive hadoop 64 Dec 24 12:08 984 -> 
> /hadoop/yarn/log/application_1608659125567_0006/container_e04_1608659125567_0006_01_02/hive_20201224121754_78ef49a0-bc23-478f-9a16-87fa25e7a287-dag_1608659125567_0008_196.log
> lrwx-- 1 hive hadoop 64 Dec 24 12:08 983 -> 
> /hadoop/yarn/log/application_1608659125567_0006/container_e04_1608659125567_0006_01_02/hive_20201224121855_e65b9ebf-b2ec-4159-9570-1904442b7048-dag_1608659125567_0008_200.log
> lrwx-- 1 hive hadoop 64 Dec 24 12:08 981 -> 
> /hadoop/yarn/log/application_1608659125567_0006/container_e04_1608659125567_0006_01_02/hive_20201224121818_e9051ae3-1316-46af-aabb-22c53ed2fda7-dag_1608659125567_0008_198.log
> lrwx-- 1 hive hadoop 64 Dec 24 12:08 980 -> 
> /hadoop/yarn/log/application_1608659125567_0006/container_e04_1608659125567_0006_01_02/hive_20201224121744_fcf37921-4351-4368-95ee-b5be2592d89a-dag_1608659125567_0008_195.log
> lrwx-- 1 hive hadoop 64 Dec 24 12:08 979 -> 
> 

[jira] [Commented] (HIVE-24569) LLAP daemon leaks file descriptors/log4j appenders

2021-01-15 Thread Jesus Camacho Rodriguez (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-24569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17266250#comment-17266250
 ] 

Jesus Camacho Rodriguez commented on HIVE-24569:


[~prasanthj], [~mustafaiman], could any of you review this patch? Thanks

> LLAP daemon leaks file descriptors/log4j appenders
> --
>
> Key: HIVE-24569
> URL: https://issues.apache.org/jira/browse/HIVE-24569
> Project: Hive
>  Issue Type: Bug
>  Components: llap
>Affects Versions: 2.2.0
>Reporter: Stamatis Zampetakis
>Assignee: Stamatis Zampetakis
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: llap-appender-gc-roots.png
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> With HIVE-9756 query logs in LLAP are directed to different files (file per 
> query) using a Log4j2 routing appender. Without a purge policy in place, 
> appenders are created dynamically by the routing appender, one for each 
> query, and remain in memory forever. The dynamic appenders write to files so 
> each appender holds to a file descriptor. 
> Further work HIVE-14224 has mitigated the issue by introducing a custom 
> purging policy (LlapRoutingAppenderPurgePolicy) which deletes the dynamic 
> appenders (and closes the respective files) when the query is completed 
> (org.apache.hadoop.hive.llap.daemon.impl.QueryTracker#handleLogOnQueryCompletion).
>  
> However, in the presence of multiple threads appending to the logs there are 
> race conditions. In an internal Hive cluster the number of file descriptors 
> started going up approx one descriptor leaking per query. After some 
> debugging it turns out that one thread (running the 
> QueryTracker#handleLogOnQueryCompletion) signals that the query has finished 
> and thus the purge policy should get rid of the respective appender (and 
> close the file) while another (Task-Executor-0) attempts to append another 
> log message for the same query. The initial appender is closed after the 
> request from the query tracker but a new one is created to accomodate the 
> message from the task executor and the latter is never removed thus creating 
> a leak. 
> Similar leaks have been identified and fixed for HS2 with the most similar 
> one being that described 
> [here|https://issues.apache.org/jira/browse/HIVE-22753?focusedCommentId=17021041=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17021041].
>  
> The problem relies on the timing of threads so it may not manifestate in all 
> versions between 2.2.0 and 4.0.0. Usually the leak can be seen either via 
> lsof (or other similar command) with the following output:
> {noformat}
> # 1494391 is the PID of the LLAP daemon process
> ls -ltr /proc/1494391/fd
> ...
> lrwx-- 1 hive hadoop 64 Dec 24 12:08 978 -> 
> /hadoop/yarn/log/application_1608659125567_0006/container_e04_1608659125567_0006_01_02/hive_20201224121724_66ce273d-54a9-4dcd-a9fb-20cb5691cef7-dag_1608659125567_0008_194.log
> lrwx-- 1 hive hadoop 64 Dec 24 12:08 977 -> 
> /hadoop/yarn/log/application_1608659125567_0006/container_e04_1608659125567_0006_01_02/hive_20201224121804_ce53eeb5-c73f-4999-b7a4-b4dd04d4e4de-dag_1608659125567_0008_197.log
> lrwx-- 1 hive hadoop 64 Dec 24 12:08 974 -> 
> /hadoop/yarn/log/application_1608659125567_0006/container_e04_1608659125567_0006_01_02/hive_20201224122002_1693bd7d-2f0e-4673-a8d1-b7cb14a02204-dag_1608659125567_0008_204.log
> lrwx-- 1 hive hadoop 64 Dec 24 12:08 989 -> 
> /hadoop/yarn/log/application_1608659125567_0006/container_e04_1608659125567_0006_01_02/hive_20201224121909_6a56218f-06c7-4906-9907-4b6dd824b100-dag_1608659125567_0008_201.log
> lrwx-- 1 hive hadoop 64 Dec 24 12:08 984 -> 
> /hadoop/yarn/log/application_1608659125567_0006/container_e04_1608659125567_0006_01_02/hive_20201224121754_78ef49a0-bc23-478f-9a16-87fa25e7a287-dag_1608659125567_0008_196.log
> lrwx-- 1 hive hadoop 64 Dec 24 12:08 983 -> 
> /hadoop/yarn/log/application_1608659125567_0006/container_e04_1608659125567_0006_01_02/hive_20201224121855_e65b9ebf-b2ec-4159-9570-1904442b7048-dag_1608659125567_0008_200.log
> lrwx-- 1 hive hadoop 64 Dec 24 12:08 981 -> 
> /hadoop/yarn/log/application_1608659125567_0006/container_e04_1608659125567_0006_01_02/hive_20201224121818_e9051ae3-1316-46af-aabb-22c53ed2fda7-dag_1608659125567_0008_198.log
> lrwx-- 1 hive hadoop 64 Dec 24 12:08 980 -> 
> /hadoop/yarn/log/application_1608659125567_0006/container_e04_1608659125567_0006_01_02/hive_20201224121744_fcf37921-4351-4368-95ee-b5be2592d89a-dag_1608659125567_0008_195.log
> lrwx-- 1 hive hadoop 64 Dec 24 12:08 979 -> 
> 

[jira] [Commented] (HIVE-24569) LLAP daemon leaks file descriptors/log4j appenders

2020-12-24 Thread Stamatis Zampetakis (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-24569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17254632#comment-17254632
 ] 

Stamatis Zampetakis commented on HIVE-24569:


I believe relying on {{LlapRoutingAppenderPurgePolicy}} is very prone to errors 
and hard to reproduce bugs due to race conditions. I think the safest way to 
move forward and avoid a lot of this small fixes in the future is to rely on a 
less precise but more robust {{PurgePolicy}} such as the {{IdlePurgePolicy}}. I 
tested a config with {{IdlePurgePolicy}} on a test cluster and the problem no 
longer appears. Possibly as part of this fix I will try to get rid of 
{{LlapRoutingAppenderPurgePolicy}} and related classes also for HS2 and not 
only for LLAP.

> LLAP daemon leaks file descriptors/log4j appenders
> --
>
> Key: HIVE-24569
> URL: https://issues.apache.org/jira/browse/HIVE-24569
> Project: Hive
>  Issue Type: Bug
>  Components: llap
>Affects Versions: 2.2.0
>Reporter: Stamatis Zampetakis
>Assignee: Stamatis Zampetakis
>Priority: Major
> Fix For: 4.0.0
>
> Attachments: llap-appender-gc-roots.png
>
>
> With HIVE-9756 query logs in LLAP are directed to different files (file per 
> query) using a Log4j2 routing appender. Without a purge policy in place, 
> appenders are created dynamically by the routing appender, one for each 
> query, and remain in memory forever. The dynamic appenders write to files so 
> each appender holds to a file descriptor. 
> Further work HIVE-14224 has mitigated the issue by introducing a custom 
> purging policy (LlapRoutingAppenderPurgePolicy) which deletes the dynamic 
> appenders (and closes the respective files) when the query is completed 
> (org.apache.hadoop.hive.llap.daemon.impl.QueryTracker#handleLogOnQueryCompletion).
>  
> However, in the presence of multiple threads appending to the logs there are 
> race conditions. In an internal Hive cluster the number of file descriptors 
> started going up approx one descriptor leaking per query. After some 
> debugging it turns out that one thread (running the 
> QueryTracker#handleLogOnQueryCompletion) signals that the query has finished 
> and thus the purge policy should get rid of the respective appender (and 
> close the file) while another (Task-Executor-0) attempts to append another 
> log message for the same query. The initial appender is closed after the 
> request from the query tracker but a new one is created to accomodate the 
> message from the task executor and the latter is never removed thus creating 
> a leak. 
> Similar leaks have been identified and fixed for HS2 with the most similar 
> one being that described 
> [here|https://issues.apache.org/jira/browse/HIVE-22753?focusedCommentId=17021041=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17021041].
>  
> The problem relies on the timing of threads so it may not manifestate in all 
> versions between 2.2.0 and 4.0.0. Usually the leak can be seen either via 
> lsof (or other similar command) with the following output:
> {noformat}
> # 1494391 is the PID of the LLAP daemon process
> ls -ltr /proc/1494391/fd
> ...
> lrwx-- 1 hive hadoop 64 Dec 24 12:08 978 -> 
> /hadoop/yarn/log/application_1608659125567_0006/container_e04_1608659125567_0006_01_02/hive_20201224121724_66ce273d-54a9-4dcd-a9fb-20cb5691cef7-dag_1608659125567_0008_194.log
> lrwx-- 1 hive hadoop 64 Dec 24 12:08 977 -> 
> /hadoop/yarn/log/application_1608659125567_0006/container_e04_1608659125567_0006_01_02/hive_20201224121804_ce53eeb5-c73f-4999-b7a4-b4dd04d4e4de-dag_1608659125567_0008_197.log
> lrwx-- 1 hive hadoop 64 Dec 24 12:08 974 -> 
> /hadoop/yarn/log/application_1608659125567_0006/container_e04_1608659125567_0006_01_02/hive_20201224122002_1693bd7d-2f0e-4673-a8d1-b7cb14a02204-dag_1608659125567_0008_204.log
> lrwx-- 1 hive hadoop 64 Dec 24 12:08 989 -> 
> /hadoop/yarn/log/application_1608659125567_0006/container_e04_1608659125567_0006_01_02/hive_20201224121909_6a56218f-06c7-4906-9907-4b6dd824b100-dag_1608659125567_0008_201.log
> lrwx-- 1 hive hadoop 64 Dec 24 12:08 984 -> 
> /hadoop/yarn/log/application_1608659125567_0006/container_e04_1608659125567_0006_01_02/hive_20201224121754_78ef49a0-bc23-478f-9a16-87fa25e7a287-dag_1608659125567_0008_196.log
> lrwx-- 1 hive hadoop 64 Dec 24 12:08 983 -> 
> /hadoop/yarn/log/application_1608659125567_0006/container_e04_1608659125567_0006_01_02/hive_20201224121855_e65b9ebf-b2ec-4159-9570-1904442b7048-dag_1608659125567_0008_200.log
> lrwx-- 1 hive hadoop 64 Dec 24 12:08 981 -> 
> /hadoop/yarn/log/application_1608659125567_0006/container_e04_1608659125567_0006_01_02/hive_20201224121818_e9051ae3-1316-46af-aabb-22c53ed2fda7-dag_1608659125567_0008_198.log
> 

[jira] [Commented] (HIVE-24569) LLAP daemon leaks file descriptors/log4j appenders

2020-12-24 Thread Stamatis Zampetakis (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-24569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17254631#comment-17254631
 ] 

Stamatis Zampetakis commented on HIVE-24569:


HIVE-22127 also concerns HS2 and most likely it was solved by HIVE-17128.

> LLAP daemon leaks file descriptors/log4j appenders
> --
>
> Key: HIVE-24569
> URL: https://issues.apache.org/jira/browse/HIVE-24569
> Project: Hive
>  Issue Type: Bug
>  Components: llap
>Affects Versions: 2.2.0
>Reporter: Stamatis Zampetakis
>Assignee: Stamatis Zampetakis
>Priority: Major
> Fix For: 4.0.0
>
> Attachments: llap-appender-gc-roots.png
>
>
> With HIVE-9756 query logs in LLAP are directed to different files (file per 
> query) using a Log4j2 routing appender. Without a purge policy in place, 
> appenders are created dynamically by the routing appender, one for each 
> query, and remain in memory forever. The dynamic appenders write to files so 
> each appender holds to a file descriptor. 
> Further work HIVE-14224 has mitigated the issue by introducing a custom 
> purging policy (LlapRoutingAppenderPurgePolicy) which deletes the dynamic 
> appenders (and closes the respective files) when the query is completed 
> (org.apache.hadoop.hive.llap.daemon.impl.QueryTracker#handleLogOnQueryCompletion).
>  
> However, in the presence of multiple threads appending to the logs there are 
> race conditions. In an internal Hive cluster the number of file descriptors 
> started going up approx one descriptor leaking per query. After some 
> debugging it turns out that one thread (running the 
> QueryTracker#handleLogOnQueryCompletion) signals that the query has finished 
> and thus the purge policy should get rid of the respective appender (and 
> close the file) while another (Task-Executor-0) attempts to append another 
> log message for the same query. The initial appender is closed after the 
> request from the query tracker but a new one is created to accomodate the 
> message from the task executor and the latter is never removed thus creating 
> a leak. 
> Similar leaks have been identified and fixed for HS2 with the most similar 
> one being that described 
> [here|https://issues.apache.org/jira/browse/HIVE-22753?focusedCommentId=17021041=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17021041].
>  
> The problem relies on the timing of threads so it may not manifestate in all 
> versions between 2.2.0 and 4.0.0. Usually the leak can be seen either via 
> lsof (or other similar command) with the following output:
> {noformat}
> # 1494391 is the PID of the LLAP daemon process
> ls -ltr /proc/1494391/fd
> ...
> lrwx-- 1 hive hadoop 64 Dec 24 12:08 978 -> 
> /hadoop/yarn/log/application_1608659125567_0006/container_e04_1608659125567_0006_01_02/hive_20201224121724_66ce273d-54a9-4dcd-a9fb-20cb5691cef7-dag_1608659125567_0008_194.log
> lrwx-- 1 hive hadoop 64 Dec 24 12:08 977 -> 
> /hadoop/yarn/log/application_1608659125567_0006/container_e04_1608659125567_0006_01_02/hive_20201224121804_ce53eeb5-c73f-4999-b7a4-b4dd04d4e4de-dag_1608659125567_0008_197.log
> lrwx-- 1 hive hadoop 64 Dec 24 12:08 974 -> 
> /hadoop/yarn/log/application_1608659125567_0006/container_e04_1608659125567_0006_01_02/hive_20201224122002_1693bd7d-2f0e-4673-a8d1-b7cb14a02204-dag_1608659125567_0008_204.log
> lrwx-- 1 hive hadoop 64 Dec 24 12:08 989 -> 
> /hadoop/yarn/log/application_1608659125567_0006/container_e04_1608659125567_0006_01_02/hive_20201224121909_6a56218f-06c7-4906-9907-4b6dd824b100-dag_1608659125567_0008_201.log
> lrwx-- 1 hive hadoop 64 Dec 24 12:08 984 -> 
> /hadoop/yarn/log/application_1608659125567_0006/container_e04_1608659125567_0006_01_02/hive_20201224121754_78ef49a0-bc23-478f-9a16-87fa25e7a287-dag_1608659125567_0008_196.log
> lrwx-- 1 hive hadoop 64 Dec 24 12:08 983 -> 
> /hadoop/yarn/log/application_1608659125567_0006/container_e04_1608659125567_0006_01_02/hive_20201224121855_e65b9ebf-b2ec-4159-9570-1904442b7048-dag_1608659125567_0008_200.log
> lrwx-- 1 hive hadoop 64 Dec 24 12:08 981 -> 
> /hadoop/yarn/log/application_1608659125567_0006/container_e04_1608659125567_0006_01_02/hive_20201224121818_e9051ae3-1316-46af-aabb-22c53ed2fda7-dag_1608659125567_0008_198.log
> lrwx-- 1 hive hadoop 64 Dec 24 12:08 980 -> 
> /hadoop/yarn/log/application_1608659125567_0006/container_e04_1608659125567_0006_01_02/hive_20201224121744_fcf37921-4351-4368-95ee-b5be2592d89a-dag_1608659125567_0008_195.log
> lrwx-- 1 hive hadoop 64 Dec 24 12:08 979 -> 
> /hadoop/yarn/log/application_1608659125567_0006/container_e04_1608659125567_0006_01_02/hive_20201224121837_e80c0024-f6bc-4b3c-85ed-5c0c85c55787-dag_1608659125567_0008_199.log
> {noformat}
> or in the heap 

[jira] [Commented] (HIVE-24569) LLAP daemon leaks file descriptors/log4j appenders

2020-12-24 Thread Stamatis Zampetakis (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-24569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17254630#comment-17254630
 ] 

Stamatis Zampetakis commented on HIVE-24569:


The fix for HIVE-17128 deals with more or less the same problem but in HS2 
process.

> LLAP daemon leaks file descriptors/log4j appenders
> --
>
> Key: HIVE-24569
> URL: https://issues.apache.org/jira/browse/HIVE-24569
> Project: Hive
>  Issue Type: Bug
>  Components: llap
>Affects Versions: 2.2.0
>Reporter: Stamatis Zampetakis
>Assignee: Stamatis Zampetakis
>Priority: Major
> Fix For: 4.0.0
>
> Attachments: llap-appender-gc-roots.png
>
>
> With HIVE-9756 query logs in LLAP are directed to different files (file per 
> query) using a Log4j2 routing appender. Without a purge policy in place, 
> appenders are created dynamically by the routing appender, one for each 
> query, and remain in memory forever. The dynamic appenders write to files so 
> each appender holds to a file descriptor. 
> Further work HIVE-14224 has mitigated the issue by introducing a custom 
> purging policy (LlapRoutingAppenderPurgePolicy) which deletes the dynamic 
> appenders (and closes the respective files) when the query is completed 
> (org.apache.hadoop.hive.llap.daemon.impl.QueryTracker#handleLogOnQueryCompletion).
>  
> However, in the presence of multiple threads appending to the logs there are 
> race conditions. In an internal Hive cluster the number of file descriptors 
> started going up approx one descriptor leaking per query. After some 
> debugging it turns out that one thread (running the 
> QueryTracker#handleLogOnQueryCompletion) signals that the query has finished 
> and thus the purge policy should get rid of the respective appender (and 
> close the file) while another (Task-Executor-0) attempts to append another 
> log message for the same query. The initial appender is closed after the 
> request from the query tracker but a new one is created to accomodate the 
> message from the task executor and the latter is never removed thus creating 
> a leak. 
> Similar leaks have been identified and fixed for HS2 with the most similar 
> one being that described 
> [here|https://issues.apache.org/jira/browse/HIVE-22753?focusedCommentId=17021041=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17021041].
>  
> The problem relies on the timing of threads so it may not manifestate in all 
> versions between 2.2.0 and 4.0.0. Usually the leak can be seen either via 
> lsof (or other similar command) with the following output:
> {noformat}
> # 1494391 is the PID of the LLAP daemon process
> ls -ltr /proc/1494391/fd
> ...
> lrwx-- 1 hive hadoop 64 Dec 24 12:08 978 -> 
> /hadoop/yarn/log/application_1608659125567_0006/container_e04_1608659125567_0006_01_02/hive_20201224121724_66ce273d-54a9-4dcd-a9fb-20cb5691cef7-dag_1608659125567_0008_194.log
> lrwx-- 1 hive hadoop 64 Dec 24 12:08 977 -> 
> /hadoop/yarn/log/application_1608659125567_0006/container_e04_1608659125567_0006_01_02/hive_20201224121804_ce53eeb5-c73f-4999-b7a4-b4dd04d4e4de-dag_1608659125567_0008_197.log
> lrwx-- 1 hive hadoop 64 Dec 24 12:08 974 -> 
> /hadoop/yarn/log/application_1608659125567_0006/container_e04_1608659125567_0006_01_02/hive_20201224122002_1693bd7d-2f0e-4673-a8d1-b7cb14a02204-dag_1608659125567_0008_204.log
> lrwx-- 1 hive hadoop 64 Dec 24 12:08 989 -> 
> /hadoop/yarn/log/application_1608659125567_0006/container_e04_1608659125567_0006_01_02/hive_20201224121909_6a56218f-06c7-4906-9907-4b6dd824b100-dag_1608659125567_0008_201.log
> lrwx-- 1 hive hadoop 64 Dec 24 12:08 984 -> 
> /hadoop/yarn/log/application_1608659125567_0006/container_e04_1608659125567_0006_01_02/hive_20201224121754_78ef49a0-bc23-478f-9a16-87fa25e7a287-dag_1608659125567_0008_196.log
> lrwx-- 1 hive hadoop 64 Dec 24 12:08 983 -> 
> /hadoop/yarn/log/application_1608659125567_0006/container_e04_1608659125567_0006_01_02/hive_20201224121855_e65b9ebf-b2ec-4159-9570-1904442b7048-dag_1608659125567_0008_200.log
> lrwx-- 1 hive hadoop 64 Dec 24 12:08 981 -> 
> /hadoop/yarn/log/application_1608659125567_0006/container_e04_1608659125567_0006_01_02/hive_20201224121818_e9051ae3-1316-46af-aabb-22c53ed2fda7-dag_1608659125567_0008_198.log
> lrwx-- 1 hive hadoop 64 Dec 24 12:08 980 -> 
> /hadoop/yarn/log/application_1608659125567_0006/container_e04_1608659125567_0006_01_02/hive_20201224121744_fcf37921-4351-4368-95ee-b5be2592d89a-dag_1608659125567_0008_195.log
> lrwx-- 1 hive hadoop 64 Dec 24 12:08 979 -> 
> /hadoop/yarn/log/application_1608659125567_0006/container_e04_1608659125567_0006_01_02/hive_20201224121837_e80c0024-f6bc-4b3c-85ed-5c0c85c55787-dag_1608659125567_0008_199.log
> {noformat}
> or in