[jira] [Commented] (KYLIN-5745) The historical garbage cleanup task was not completed, causing the subsequent scheduled garbage cleanup task cannot be executed normally

2024-03-31 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/KYLIN-5745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17832725#comment-17832725
 ] 

ASF subversion and git services commented on KYLIN-5745:


Commit 180ff07afeaf64da8e565b7aa5882c3a89f1c268 in kylin's branch 
refs/heads/kylin5 from fengguangyuan
[ https://gitbox.apache.org/repos/asf?p=kylin.git;h=180ff07afe ]

KYLIN-5745 Using a global thread pool to clean underlying storages

1. Using a global thread pool to clean underlying storages;
2. Launching cleaning tasks in the local thread and to ignore
FileNotFoundException while collecting HDFS files.

Co-authored-by: Guangyuan Feng 


> The historical garbage cleanup task was not completed, causing the subsequent 
> scheduled garbage cleanup task cannot be executed normally
> 
>
> Key: KYLIN-5745
> URL: https://issues.apache.org/jira/browse/KYLIN-5745
> Project: Kylin
>  Issue Type: Bug
>Affects Versions: 5.0-beta
>Reporter: zhong.zhu
>Assignee: zhong.zhu
>Priority: Major
> Fix For: 5.0.0
>
>
> {*}Problem description{*}: 
> The scheduled garbage cleanup cannot complete successfully.
> {*}Background{*}: 
> The customer found a large number of small Kylin files occupying HDFS 
> storage that needed to be cleaned up. Checking the customer's environment 
> showed that the scheduled garbage cleanup had never completed properly and 
> kept timing out.
> *Troubleshooting:*
> Kylin was restarted on the night of April 5, and garbage cleanup was first 
> triggered in the early morning of April 6. Once that cleanup was triggered, 
> the query history cleanup thread kept deleting records and never finished, 
> so the subsequent periodic garbage cleanup tasks could not complete.
> Query histories are deleted 2,000 rows at a time, and one of the customer's 
> projects needed to delete 550,000 query history records. According to 
> kylin.log, table locking made the deletes so slow that a single delete 
> operation took more than 20 minutes.
> The following log shows the main garbage cleanup thread waiting for the 
> query history cleanup to complete. The cleanup did not finish, so the main 
> thread timed out and exited.
> {code:shell}
> 2023-04-06T00:00:00,015 INFO  [RoutineOpsWorker-287] service.ScheduleService 
> : execute task MetadataBackup with remaining time: 1435 ms
> 2023-04-06T00:01:52,649 INFO  [RoutineOpsWorker-287] service.ScheduleService 
> : execute task QueryHistoriesCleanup with remaining time: 14287361 ms
> ...
> 2023-04-06T04:00:00,012 WARN  [DefaultTaskScheduler-3] 
> service.ScheduleService : Routine task execution timeout
> java.util.concurrent.TimeoutException: null
>   at java.util.concurrent.FutureTask.get(FutureTask.java:205) 
> ~[?:1.8.0_242]
>   at 
> org.apache.kylin.rest.service.ScheduleService.executeTask(ScheduleService.java:107)
>  ~[kylin-job-service-5.0.0-ke-4.6.2.0.jar:?]
>   at 
> org.apache.kylin.rest.service.ScheduleService.routineTask(ScheduleService.java:77)
>  ~[kylin-job-service-5.0.0-ke-4.6.2.0.jar:?]
>   at 
> org.apache.kylin.rest.service.ScheduleService$$FastClassBySpringCGLIB$$afbfc46c.invoke()
>  ~[kylin-job-service-5.0.0-ke-4.6.2.0.jar:?]
> {code}
> The following log runs up to the latest time covered by the log file. After 
> 09:00 the query history cleanup was still deleting records; it did not stop 
> when the main thread terminated.
> {code:shell}
> 2023-04-06T00:08:43,015 DEBUG [QueryHistoryCleanWorker-23145] 
> QueryHistoryMapper.selectByProject : <==  Total: 12
> 2023-04-06T00:08:43,016 INFO  [QueryHistoryCleanWorker-23145] 
> util.QueryHisStoreUtil : Query histories of project is less than 
> the maximum limit, so skip it.
> 2023-04-06T00:08:43,016 INFO  [QueryHistoryCleanWorker-23145] 
> util.QueryHisStoreUtil : Query histories of project is less than the 
> maximum limit, so skip it.
> 2023-04-06T00:08:43,016 INFO  [QueryHistoryCleanWorker-23145] 
> util.QueryHisStoreUtil : Query histories of project is less than the 
> maximum limit, so skip it.
> 2023-04-06T00:08:43,016 INFO  [QueryHistoryCleanWorker-23145] 
> util.QueryHisStoreUtil : Query histories of project is less than the 
> maximum limit, so skip it.
> 2023-04-06T00:08:43,017 INFO  [QueryHistoryCleanWorker-23145] 
> util.QueryHisStoreUtil : Start to delete query histories that are beyond max 
> size for project, records:1551669
> ...
> 2023-04-06T09:03:54,974 INFO  [QueryHistoryCleanWorker-23145] 
> query.JdbcQueryHistoryStore : Delete 2000 row query history for project 
> [CXCZH] takes 938060 ms
> 2023-04-06T09:03:54,975 DEBUG [QueryHistoryCleanWorker-23145] 
> 

[jira] [Commented] (KYLIN-5745) The historical garbage cleanup task was not completed, causing the subsequent scheduled garbage cleanup task cannot be executed normally

2023-12-11 Thread zhong.zhu (Jira)


[ 
https://issues.apache.org/jira/browse/KYLIN-5745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17795247#comment-17795247
 ] 

zhong.zhu commented on KYLIN-5745:
--

h1. Root Cause

Spring controllers are singletons, so every call to the service method below 
cleans up the underlying HDFS files serially. Compared with cleaning up 
metadata and query histories, this step is particularly time-consuming. 
*_MetadataToolHelper_* is also designed to run serially, so other callers of 
this method are serialized as well.
{code:java}
MetadataToolHelper::cleanStorage{code}
There is a second design problem: Kylin uses two kinds of thread pools for the 
various types of data cleanup. One is the Route service's single-threaded 
pool; the other is Spring's built-in task pool, which has 5 threads by 
default. Tasks submitted to either pool can end up cleaning HDFS files, so a 
slow cleanup can occupy these shared threads and make the whole Kylin instance 
unavailable. A separate thread pool is therefore needed for cleaning up HDFS 
files.
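
The blocking effect described above can be reproduced in isolation. The following is a minimal sketch, not Kylin code: the class and method names are hypothetical, and a sleep stands in for a slow HDFS cleanup. It shows a trivial task timing out only because it is queued behind a slow task on a single-threaded pool.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical demo class; not part of Kylin.
class SingleThreadStarvationDemo {

    // Stand-in for a slow HDFS cleanup (assumption: the real work is I/O bound).
    static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /**
     * Submits a slow task and then a trivial one to a single-threaded pool.
     * Returns true if the trivial task could not finish within waitMillis.
     */
    static boolean secondTaskTimesOut(long slowTaskMillis, long waitMillis) {
        ExecutorService singleThreadPool = Executors.newSingleThreadExecutor();
        try {
            // The slow task occupies the only worker thread.
            singleThreadPool.submit(() -> sleepQuietly(slowTaskMillis));
            // The trivial task has to wait in the queue behind it.
            Future<String> second = singleThreadPool.submit(() -> "metadata cleaned");
            second.get(waitMillis, TimeUnit.MILLISECONDS);
            return false;
        } catch (TimeoutException e) {
            return true;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            singleThreadPool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("blocked behind slow task: " + secondTaskTimesOut(500, 50));
        System.out.println("blocked behind fast task: " + secondTaskTimesOut(10, 2000));
    }
}
```

Giving the slow work its own pool, as proposed here, keeps the shared pools free for the short tasks.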
h1. Dev Design

Move the logic for cleaning up HDFS files onto a thread pool and add timeout 
handling.

Build a task manager for storage garbage cleanup on top of a thread pool whose 
task queue is a *PriorityBlockingQueue*. The task manager decouples the 
previously serial cleanup of Class I, II, and III garbage by running the steps 
asynchronously.
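
A thread pool backed by a priority queue can be sketched as follows. This is an illustration under assumptions, not the code from the commit: the class names are hypothetical, only the task types SERVICE, CLI and ROUTINE come from this comment, and a plain one-thread ThreadPoolExecutor is used so that the queue order is observable.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.PriorityBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch; names other than SERVICE/CLI/ROUTINE are not from Kylin.
class PriorityCleanPoolSketch {

    // Declared highest priority first, following SERVICE > CLI > ROUTINE.
    enum TaskType { SERVICE, CLI, ROUTINE }

    static class CleanTask implements Runnable, Comparable<CleanTask> {
        final TaskType type;
        final List<String> executionLog;

        CleanTask(TaskType type, List<String> executionLog) {
            this.type = type;
            this.executionLog = executionLog;
        }

        @Override
        public void run() {
            executionLog.add(type.name()); // stand-in for the real cleanup work
        }

        @Override
        public int compareTo(CleanTask other) {
            // Smaller ordinal means higher priority, so it leaves the queue first.
            return Integer.compare(type.ordinal(), other.type.ordinal());
        }
    }

    static void awaitQuietly(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Queues three tasks in reverse priority order and returns the order in which they ran. */
    static List<String> runOrder() {
        List<String> executionLog = Collections.synchronizedList(new ArrayList<>());
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS, new PriorityBlockingQueue<>());
        CountDownLatch gate = new CountDownLatch(1);
        try {
            // Keep the only worker busy so the next three tasks pile up in the queue.
            pool.execute(() -> awaitQuietly(gate));
            // execute() is used instead of submit(): submit() wraps the task in a
            // FutureTask, which is not Comparable and would break the priority queue.
            pool.execute(new CleanTask(TaskType.ROUTINE, executionLog));
            pool.execute(new CleanTask(TaskType.CLI, executionLog));
            pool.execute(new CleanTask(TaskType.SERVICE, executionLog));
            gate.countDown();
            pool.shutdown();
            pool.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            pool.shutdownNow();
        }
        return new ArrayList<>(executionLog);
    }

    public static void main(String[] args) {
        System.out.println(runOrder());
    }
}
```

Although the tasks are queued as ROUTINE, CLI, SERVICE, the worker picks them up by priority once it is free.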
h1. Major Changes

1. Define the task type: {*}SERVICE > CLI > ROUTINE{*}
2. Abstract task class with weights:
{code:java}
public abstract class AbstractComparableCleanTask implements Runnable, Comparable{code}
3. Custom thread pools and task cache queues: configurable 
{_}*CachedThreadPool*{_}, specified _*PriorityBlockingQueue*_
4. Added Task Manager:
{code:java}
public class CleanStoragesHelper implements Closeable{code}
5. New timeout mechanism: the core logic is built on Java's 
*_CompletableFuture / Future_*. The new parameters are as follows:
{*}_kylin.storage.clean-timeout=1h_{*}: the default timeout of a cleanup task. 
It applies only to query histories/storage cleanup, and {color:#de350b}it 
takes effect only in scenarios that are neither Routine nor 
CLI;{color}
{*}_kylin.storage.clean-tasks-concurrency=5_{*}: the number of threads that run 
query histories/HDFS garbage cleanup tasks, i.e., the maximum number of tasks 
of these two types that execute at the same time. Tasks submitted beyond that 
wait in the task cache queue.
6. Asynchronous/synchronous mechanism: {_}*CleanStoragesHelper*{_} provides 
both synchronous and asynchronous methods, and the caller chooses between them 
according to the scenario.

7. Initialization timing for global classes: initialize the tool class 
*_CleanStoragesHelper_* in *_AppInitializer_* to avoid problems caused by the 
method *_KylinConfig.getInstanceFromEn_*, which may return a non-system KylinConfig.
8. Track the life cycle of a task: {*}CREATE => SUBMIT => SUCCEED/FAILED{*}
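
Items 5 and 8 can be illustrated together. The sketch below is an assumption-laden example rather than the actual CleanStoragesHelper: all class, method and field names are hypothetical, and only the state names CREATE, SUBMIT, SUCCEED and FAILED are taken from the list above. It waits on a CompletableFuture with a timeout and records the resulting state.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch; only the state names come from the design notes.
class CleanTimeoutSketch {

    enum State { CREATE, SUBMIT, SUCCEED, FAILED }

    static class TrackedTask {
        volatile State state = State.CREATE;
        final Runnable body;

        TrackedTask(Runnable body) {
            this.body = body;
        }
    }

    static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Runs the task on the pool, waits at most timeoutMillis, and returns the final state. */
    static State runWithTimeout(TrackedTask task, ExecutorService pool, long timeoutMillis) {
        task.state = State.SUBMIT;
        CompletableFuture<Void> future = CompletableFuture.runAsync(task.body, pool);
        try {
            future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            task.state = State.SUCCEED;
        } catch (TimeoutException | ExecutionException e) {
            // The timeout only ends the caller's wait; it does not by itself
            // stop the worker thread that is still running the task.
            task.state = State.FAILED;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            task.state = State.FAILED;
        }
        return task.state;
    }

    /** Runs a fast task, a task that exceeds its timeout, and a task that throws. */
    static State[] demo() {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            State fast = runWithTimeout(new TrackedTask(() -> { }), pool, 1000);
            State slow = runWithTimeout(new TrackedTask(() -> sleepQuietly(500)), pool, 50);
            State broken = runWithTimeout(new TrackedTask(() -> {
                throw new IllegalStateException("cleanup failed");
            }), pool, 1000);
            return new State[] {fast, slow, broken};
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        for (State state : demo()) {
            System.out.println(state);
        }
    }
}
```

In a real deployment the timeout and pool size would presumably come from {*}_kylin.storage.clean-timeout_{*} and {*}_kylin.storage.clean-tasks-concurrency_{*} rather than from hard-coded values.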

> The historical garbage cleanup task was not completed, causing the subsequent 
> scheduled garbage cleanup task cannot be executed normally
> 
>
> Key: KYLIN-5745
> URL: https://issues.apache.org/jira/browse/KYLIN-5745
> Project: Kylin
>  Issue Type: Bug
>Affects Versions: 5.0-beta
>Reporter: zhong.zhu
>Assignee: zhong.zhu
>Priority: Major
> Fix For: 5.0.0
>
>
> {*}Problem description{*}: 
> The scheduled garbage cleanup cannot complete successfully.
> {*}Background{*}: 
> The customer found a large number of small Kylin files occupying HDFS 
> storage that needed to be cleaned up. Checking the customer's environment 
> showed that the scheduled garbage cleanup had never completed properly and 
> kept timing out.
> *Troubleshooting:*
> KE was restarted on the night of April 5, and garbage cleanup was first 
> triggered in the early morning of April 6. Once that cleanup was triggered, 
> the query history cleanup thread kept deleting records and never finished, 
> so the subsequent periodic garbage cleanup tasks could not complete.
> Query histories are deleted 2,000 rows at a time, and one of the customer's 
> projects needed to delete 550,000 query history records. According to 
> kylin.log, table locking made the deletes so slow that a single delete 
> operation took more than 20 minutes.
> The following log shows the main garbage cleanup thread waiting for the 
> query history cleanup to complete. The cleanup did not finish, so the main 
> thread timed out and exited.
> {code:shell}
> 2023-04-06T00:00:00,015 INFO  [RoutineOpsWorker-287] service.ScheduleService 
> : execute task