[jira] [Commented] (KYLIN-5731) When there is a null value in the Kafka source data, the build job reports an error
[ https://issues.apache.org/jira/browse/KYLIN-5731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17795598#comment-17795598 ] ASF GitHub Bot commented on KYLIN-5731: --- thy950523 opened a new pull request, #2161: URL: https://github.com/apache/kylin/pull/2161 ## Proposed changes Describe the big picture of your changes here to communicate to the maintainers why we should accept this pull request. If it fixes a bug or resolves a feature request, be sure to link to that issue. ## Branch to commit * [ ] Branch **kylin3** for v2.x to v3.x * [ ] Branch **kylin4** for v4.x * [x] Branch **kylin5** for v5.x ## Types of changes What types of changes does your code introduce to Kylin? _Put an `x` in the boxes that apply_ * [x] Bugfix (non-breaking change which fixes an issue) * [ ] New feature (non-breaking change which adds functionality) * [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected) * [ ] Documentation Update (if none of the other choices apply) ## Checklist _Put an `x` in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! 
This is simply a reminder of what we are going to look for before merging your code._ * [x] I have created an issue on [Kylin's jira](https://issues.apache.org/jira/browse/KYLIN), and have described the bug/feature there in detail * [x] Commit messages in my PR start with the related jira ID, like "KYLIN- Make Kylin project open-source" * [x] Compiling and unit tests pass locally with my changes * [x] I have added tests that prove my fix is effective or that my feature works * [x] I have added necessary documentation (if appropriate) * [x] Any dependent changes have been merged ## Further comments If this is a relatively large or complex change, kick off the discussion at [u...@kylin.apache.org](mailto:u...@kylin.apache.org) or [d...@kylin.apache.org](mailto:d...@kylin.apache.org) by explaining why you chose the solution you did and what alternatives you considered, etc... > When there is a null value in the Kafka source data, the build job reports an > error > --- > > Key: KYLIN-5731 > URL: https://issues.apache.org/jira/browse/KYLIN-5731 > Project: Kylin > Issue Type: Bug >Affects Versions: 5.0-beta >Reporter: zhong.zhu >Assignee: zhong.zhu >Priority: Minor > Fix For: 5.0.0 > > > If the field value in kafka json data is null, the task will report an error. > Null value field "clue_source_2_name":null > Field type is varchar > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[PR] KYLIN-5731 ~ KYLIN-5747 merge code into kylin5 [kylin]
thy950523 opened a new pull request, #2161: URL: https://github.com/apache/kylin/pull/2161 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@kylin.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (KYLIN-5731) When there is a null value in the Kafka source data, the build job reports an error
[ https://issues.apache.org/jira/browse/KYLIN-5731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhong.zhu updated KYLIN-5731: - Description: If the field value in kafka json data is null, the task will report an error. Null value field "clue_source_2_name":null Field type is varchar was: If the field value in kafka json data is null, the task will report an error. Null value field "clue_source_2_name":null Field type is varchar > When there is a null value in the Kafka source data, the build job reports an > error > --- > > Key: KYLIN-5731 > URL: https://issues.apache.org/jira/browse/KYLIN-5731 > Project: Kylin > Issue Type: Bug >Affects Versions: 5.0-beta >Reporter: zhong.zhu >Assignee: zhong.zhu >Priority: Minor > Fix For: 5.0.0 > > > If the field value in kafka json data is null, the task will report an error. > Null value field "clue_source_2_name":null > Field type is varchar > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (KYLIN-5743) Set kylin.query.convert-sum-expression-enabled=true, fail to completely hit the aggregate index when the query contains sum (case when) expressions
[ https://issues.apache.org/jira/browse/KYLIN-5743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17795303#comment-17795303 ] zhong.zhu commented on KYLIN-5743: --
h1. Root Cause
With the conversion switch _*kylin.query.convert-sum-expression-enabled=true*_ turned on, the original SQL generates the following execution plan:
{code:shell}
KapOLAPToEnumerableConverter
  KapLimitRel(ctx=[], fetch=[500])
    KapAggregateRel(group-set=[[]], groups=[null], EXPR$0=[SUM($0)], ctx=[])
      KapProjectRel($f0=[$1], ctx=[])
        KapJoinRel(condition=[=($0, $2)], joinType=[inner], ctx=[])
          KapProjectRel(LO_COMMITDATE=[$15], CASE=[CASE(=($15, CAST('20230501'):DATE NOT NULL), $11, null)], ctx=[])
            KapTableScan(table=[[SSB, LINEORDER]], ctx=[], fields=[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]])
          KapProjectRel(EXPR$0=[$0], ctx=[])
            KapAggregateRel(group-set=[[]], groups=[null], EXPR$0=[MAX($0)], ctx=[])
              KapProjectRel(LO_COMMITDATE=[$15], ctx=[])
                KapTableScan(table=[[SSB, LINEORDER]], ctx=[], fields=[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]])
{code}
The CASE WHEN expression has been pushed down to sit directly above the TableScan (in combination with another rule, ProjectMergeRule, this yields the plan above), so SumExpressionRule no longer applies. The outcome is not stable: the optimizer may also produce a different execution plan:
{code:shell}
KapOLAPToEnumerableConverter
  KapLimitRel(ctx=[], fetch=[500])
    KapAggregateRel(group-set=[[]], groups=[null], AGG$0=[SUM($0)], ctx=[])
      KapProjectRel($f0=[CASE(=($0, CAST('20230501'):DATE NOT NULL), $1, null)], ctx=[])
        KapAggregateRel(group-set=[[0]], groups=[null], TOP_AGG$0=[SUM($1)], TOP_AGG$1=[SUM($2)], ctx=[])
          KapProjectRel(LO_COMMITDATE=[$0], SUM_CASE$0$0=[$1], $f2=[*(0, $2)], ctx=[])
            KapAggregateRel(group-set=[[0]], groups=[null], SUM_CASE$0$0=[SUM($1)], SUM_CONST$1=[COUNT()], ctx=[])
              KapProjectRel(LO_COMMITDATE=[$0], LO_DISCOUNT=[$1], ctx=[])
                KapProjectRel(LO_COMMITDATE=[$0], LO_DISCOUNT=[$1], LO_ORDERDATE=[$2], ctx=[])
                  KapJoinRel(condition=[=($0, $3)], joinType=[inner], ctx=[])
                    KapProjectRel(LO_COMMITDATE=[$15], LO_DISCOUNT=[$11], LO_ORDERDATE=[$5], ctx=[])
                      KapTableScan(table=[[SSB, LINEORDER]], ctx=[], fields=[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]])
                    KapAggregateRel(group-set=[[]], groups=[null], EXPR$0=[MAX($0)], ctx=[])
                      KapProjectRel(LO_COMMITDATE=[$15], ctx=[])
                        KapTableScan(table=[[SSB, LINEORDER]], ctx=[], fields=[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]])
{code}
h1. Fix Design
When *_kylin.query.convert-sum-expression-enabled=true_*, skip _*KapProjectJoinTransposeRule*_. This makes the second execution plan above the stable outcome, so the query hits two aggregate indexes instead of one aggregate index and one detail index.
> Set kylin.query.convert-sum-expression-enabled=true, fail to completely hit > the aggregate index when the query contains sum (case when) expressions > --- > > Key: KYLIN-5743 > URL: https://issues.apache.org/jira/browse/KYLIN-5743 > Project: Kylin > Issue Type: Bug >Affects Versions: 5.0-beta >Reporter: zhong.zhu >Assignee: zhong.zhu >Priority: Major > Fix For: 5.0.0 > > > {code:sql} > select > sum( > case > when LO_COMMITDATE = '20230501' then LO_DISCOUNT > end > ) > from > ( > select > LO_COMMITDATE, > LO_DISCOUNT, > LINEORDER.LO_ORDERDATE > from > ssb.LINEORDER > ) a > where > LO_COMMITDATE = ( > select > max(LO_COMMITDATE) > from > ssb.LINEORDER > ) > LIMIT > 500 > {code} > Fix the sum case when in this scenario so that it hits aggregated indexes -- This message was sent by Atlassian Jira (v8.20.10#820010)
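The second plan in the comment above works because SUM(CASE WHEN LO_COMMITDATE = x THEN LO_DISCOUNT END) can be computed from SUM(LO_DISCOUNT) grouped by LO_COMMITDATE, which is the shape an aggregate index can answer. The following is a minimal sketch of that equivalence on in-memory rows. It is not Kylin or Calcite code: the class, the methods and the sample data are hypothetical, and the join against max(LO_COMMITDATE) is left out for brevity.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative only: hypothetical names, not Kylin or Calcite code. */
final class SumCaseRewriteSketch {

    /** A LINEORDER row reduced to the two columns used by the CASE expression. */
    static final class Row {
        final String commitDate;
        final long discount;

        Row(String commitDate, long discount) {
            this.commitDate = commitDate;
            this.discount = discount;
        }
    }

    /** First plan: evaluate CASE on every detail row, then SUM. */
    static Long sumCaseOnDetailRows(List<Row> rows, String target) {
        Long sum = null; // SQL SUM yields NULL when every input is NULL
        for (Row r : rows) {
            if (r.commitDate.equals(target)) {
                sum = (sum == null ? 0L : sum) + r.discount;
            }
        }
        return sum;
    }

    /** Second plan, inner step: SUM(LO_DISCOUNT) GROUP BY LO_COMMITDATE. */
    static Map<String, Long> sumByCommitDate(List<Row> rows) {
        Map<String, Long> byDate = new TreeMap<>();
        for (Row r : rows) {
            byDate.merge(r.commitDate, r.discount, Long::sum);
        }
        return byDate;
    }

    /** Second plan, outer step: apply CASE to each group, then SUM the groups. */
    static Long sumCaseOnGroups(Map<String, Long> byDate, String target) {
        Long sum = null;
        for (Map.Entry<String, Long> e : byDate.entrySet()) {
            if (e.getKey().equals(target)) {
                sum = (sum == null ? 0L : sum) + e.getValue();
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        List<Row> rows = Arrays.asList(
                new Row("20230501", 3), new Row("20230501", 4), new Row("20230430", 9));
        // Both forms print the same value.
        System.out.println(sumCaseOnDetailRows(rows, "20230501"));
        System.out.println(sumCaseOnGroups(sumByCommitDate(rows), "20230501"));
    }
}
```

The grouped form touches one row per distinct LO_COMMITDATE rather than one row per order line, which is why keeping the CASE above the aggregate matters for index matching.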
[jira] [Commented] (KYLIN-5741) When using the API in project settings API to update the linking relationship between projects and job engines, an error is reported when the projects parameter is empt
[ https://issues.apache.org/jira/browse/KYLIN-5741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17795300#comment-17795300 ] zhong.zhu commented on KYLIN-5741: -- h1.Dev Design When the projects parameter is empty, the epochs of all projects are updated by default. > When using the API in project settings API to update the linking relationship > between projects and job engines, an error is reported when the projects > parameter is empty > - > > Key: KYLIN-5741 > URL: https://issues.apache.org/jira/browse/KYLIN-5741 > Project: Kylin > Issue Type: Bug >Affects Versions: 5.0-beta >Reporter: zhong.zhu >Assignee: zhong.zhu >Priority: Critical > Fix For: 5.0.0 > > Attachments: image-2023-12-11-14-44-04-923.png > > > !image-2023-12-11-14-44-04-923.png! -- This message was sent by Atlassian Jira (v8.20.10#820010)
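The default described in the Dev Design above (an empty projects parameter means all projects) can be sketched as follows. This is only an illustration under assumptions: the class and method names are hypothetical, and the real Kylin API and its epoch update call are not shown.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Illustrative only: hypothetical names, not the Kylin project settings API. */
final class EpochUpdateTargetsSketch {

    /**
     * Returns the projects whose epochs should be updated: the requested ones,
     * or every known project when the request leaves the parameter null or empty.
     */
    static List<String> resolveProjects(List<String> requested, List<String> allProjects) {
        if (requested == null || requested.isEmpty()) {
            return new ArrayList<>(allProjects);
        }
        return new ArrayList<>(requested);
    }

    public static void main(String[] args) {
        List<String> all = Arrays.asList("p1", "p2", "p3");
        System.out.println(resolveProjects(Collections.<String>emptyList(), all));
        System.out.println(resolveProjects(Arrays.asList("p2"), all));
    }
}
```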
[jira] [Commented] (KYLIN-5734) Problems with task scheduling logic
[ https://issues.apache.org/jira/browse/KYLIN-5734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17795299#comment-17795299 ] zhong.zhu commented on KYLIN-5734: --
h1. Fix Design
In _*org.apache.kylin.jobs.execution.ExecutableContext#addRunningJob*_, split out the logic that records the current thread, {*}runningJobThreads.put(executable.getId(), Thread.currentThread()){*}, so that it is performed independently. addRunningJob should only record which tasks have been scheduled, so that they are not scheduled again. It runs on the scheduling thread, so it should not register the current thread; the thread should be registered when the task is actually executed.
> Problems with task scheduling logic > --- > > Key: KYLIN-5734 > URL: https://issues.apache.org/jira/browse/KYLIN-5734 > Project: Kylin > Issue Type: Bug >Affects Versions: 5.0-beta >Reporter: zhong.zhu >Assignee: zhong.zhu >Priority: Major > Fix For: 5.0.0 > > > When a task is scheduled, the task is logged into runningJobs and the current > thread is logged into runningJobThreads, which is expected to be the thread > executing the task, but is actually the thread of the scheduler, which leads > to subsequent attempts to interrupt the scheduler FetcherRunner when the task > is killed. -- This message was sent by Atlassian Jira (v8.20.10#820010)
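The separation described in the Fix Design can be sketched like this: the scheduler only marks a job as scheduled, and the worker registers its own thread once the job body starts. This is a minimal illustration, not the Kylin implementation; apart from the names runningJobs, runningJobThreads and addRunningJob taken from the issue, every class and method here is hypothetical.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Illustrative only: a stand-in for the bookkeeping in ExecutableContext. */
final class RunningJobRegistrySketch {
    private final Set<String> runningJobs = ConcurrentHashMap.newKeySet();
    private final Map<String, Thread> runningJobThreads = new ConcurrentHashMap<>();

    /** Called on the scheduler thread: records that the job was scheduled, nothing else. */
    boolean addRunningJob(String jobId) {
        return runningJobs.add(jobId); // false means "already scheduled, do not schedule again"
    }

    /** Called from inside the job, so Thread.currentThread() is the executing thread. */
    void addRunningJobThread(String jobId) {
        runningJobThreads.put(jobId, Thread.currentThread());
    }

    Thread getJobThread(String jobId) {
        return runningJobThreads.get(jobId);
    }

    void removeRunningJob(String jobId) {
        runningJobs.remove(jobId);
        runningJobThreads.remove(jobId);
    }

    /** Schedules one job and returns the thread that was recorded for it while it ran. */
    static Thread scheduleAndRun(RunningJobRegistrySketch registry, String jobId) {
        if (!registry.addRunningJob(jobId)) {
            return null; // duplicate scheduling attempt
        }
        ExecutorService workers = Executors.newSingleThreadExecutor();
        try {
            return workers.submit(() -> {
                registry.addRunningJobThread(jobId); // registered at execution time
                try {
                    return registry.getJobThread(jobId);
                } finally {
                    registry.removeRunningJob(jobId);
                }
            }).get();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            workers.shutdown();
        }
    }

    public static void main(String[] args) {
        Thread recorded = scheduleAndRun(new RunningJobRegistrySketch(), "job-1");
        System.out.println("recorded thread is the scheduler: " + (recorded == Thread.currentThread()));
    }
}
```

With this split, interrupting the thread stored for a killed job reaches the worker rather than the scheduler.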
[jira] [Commented] (KYLIN-5733) Export model TDS file in English interface, including Chinese when opening the file in text mode
[ https://issues.apache.org/jira/browse/KYLIN-5733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17795297#comment-17795297 ] zhong.zhu commented on KYLIN-5733: --
h1. Root Cause
When the TDS file is exported, the tableau.template.xml template is used. The template contains "USA" written in Chinese characters, so the exported TDS file contains those Chinese characters. This has nothing to do with whether Chinese or English is selected in the Kylin interface.
{code:xml} {code}
h1. Dev Design
To minimize changes and avoid interfering with existing functionality, change "USA" to "US" in the template to meet the customer's needs. This content was carried over from Kylin 3.x and dates from 2018, so the original purpose of the label can no longer be determined. In testing, the file still displays normally in Tableau after the semantic-values label is removed, and it also displays and queries normally in Tableau after "USA" is changed to "US". For now, only the Chinese characters in the template are changed.
> Export model TDS file in English interface, including Chinese when opening > the file in text mode > > > Key: KYLIN-5733 > URL: https://issues.apache.org/jira/browse/KYLIN-5733 > Project: Kylin > Issue Type: Bug >Affects Versions: 5.0-beta >Reporter: zhong.zhu >Assignee: zhong.zhu >Priority: Critical > Fix For: 5.0.0 > > > *Steps to reproduce the issue:* > # Go to a model and click “Export TDS”. > # Open up the file in a text editor and look at the bottom. There are > Chinese characters. See attachment. > I confirmed this issue is present in 4.5.4 and .11, and likely exists in > other versions. The TDS file seems to work fine though. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (KYLIN-5747) Calcite constant folding, adding strings to numbers, results not as expected when multiple plus signs are used together
[ https://issues.apache.org/jira/browse/KYLIN-5747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17795292#comment-17795292 ] zhong.zhu commented on KYLIN-5747: --
h1. Root Cause
When several plus signs are chained and all the arguments are constants, Calcite performs constant folding. To compute the value it converts the expression into Java code, and for each plus sign it decides whether to call a custom plus function or to use Java's native addition directly. A plus sign is converted into the custom plus function when any one of three conditions is satisfied:
1. the left operand is not a primitive type;
2. the right operand is of type BigDecimal;
3. (a more complex condition).
For '1' + 3 + '3', the result of '1' + 3 is of type double and '3' is of type string, so the second plus sign satisfies none of the three conditions. The expression is therefore translated into the Java code plus('1', 3) + '3'. In Java, + between a double and a String is string concatenation, which is consistent with the reported result 4.03 (4.0 followed by 3).
h1. Dev Design
1. Where Calcite generates the Java code and decides whether a + sign is converted to plus() or uses Java's native addition, add a condition: when the operands on both sides of the plus sign are of string or numeric type, use plus() directly. This fixes '1' + 3 + '3' from the SQL above: before the fix the Java code is plus('1', 3) + '3'; after the fix it is plus(plus('1', 3), '3').
2. Change the return type of the custom plus function in Calcite from Double to BigDecimal. The reasons:
- When Calcite generates Java code for constant folding, it derives isNullable, and when a call is derived as non-nullable it unboxes the result automatically.
- Calcite considers the expression 'a' + 3 non-nullable, because both arguments are constants and neither of them is null, so the whole expression is considered non-null.
- In Spark, 'a' + 3 evaluates to null, so in our implementation of the plus method 'a' + 3 also evaluates to null.
- In summary, when plus(string, number) returns Double, the Java code for 'a' + 3 + '3' is actually plus(plus('a', 3).doubleValue(), '3'), which throws an NPE when it is evaluated.
The isNullable logic has a wide scope and modifying it directly is risky, so the return type of plus is changed to BigDecimal instead, which avoids the unboxing.
> Calcite constant folding, adding strings to numbers, results not as expected > when multiple plus signs are used together > --- > > Key: KYLIN-5747 > URL: https://issues.apache.org/jira/browse/KYLIN-5747 > Project: Kylin > Issue Type: Bug >Affects Versions: 5.0-beta >Reporter: zhong.zhu >Assignee: zhong.zhu >Priority: Critical > Fix For: 5.0.0 > > > Phenomenon: > When more than one plus sign is used in a row and the parameters on both > sides of the plus sign are constants, the result is not as expected > '1' + 3 + 3 → 7 (correct) > '1' + 3 + '3' → 4.03 (wrong result) > '1' + '3' + 'a' → error > When multiple plus signs are used in a row, and the arguments on both sides > of the plus sign are constants, and the first plus sign results in null, the > use of plus signs in a row is not supported. > e.g. 'q' + 1 + 1 -> error -- This message was sent by Atlassian Jira (v8.20.10#820010)
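The behaviour discussed in the comment above can be reproduced in plain Java. The sketch below is an assumption-laden illustration, not the Calcite or Kylin code: the class and its methods are hypothetical, and plus() here simply treats a non-numeric string as null, as the comment says Spark does. It shows a BigDecimal-returning plus, the string concatenation that matches the reported 4.03, and the NullPointerException raised when a null Double is unboxed.

```java
import java.math.BigDecimal;

/** Illustrative only: hypothetical stand-in for the custom plus function. */
final class PlusFoldingSketch {

    /** Converts a constant operand to BigDecimal; non-numeric input becomes null. */
    static BigDecimal toDecimal(Object value) {
        if (value == null) {
            return null;
        }
        if (value instanceof BigDecimal) {
            return (BigDecimal) value;
        }
        try {
            return new BigDecimal(value.toString().trim());
        } catch (NumberFormatException e) {
            return null; // mirrors 'a' + 3 evaluating to null
        }
    }

    /** Null-propagating plus that returns a boxed BigDecimal, so nothing is unboxed. */
    static BigDecimal plus(Object left, Object right) {
        BigDecimal l = toDecimal(left);
        BigDecimal r = toDecimal(right);
        return (l == null || r == null) ? null : l.add(r);
    }

    /** Java's native + with a String operand concatenates instead of adding. */
    static String nativePlusWithString(double left, String right) {
        return left + right;
    }

    /** Unboxing a null Double throws, which is the NPE described for the Double-returning plus. */
    static boolean unboxingNullThrows() {
        Double boxed = null;
        try {
            double unboxed = boxed; // automatic unboxing of null
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(plus(plus("1", 3), "3"));          // nested plus, numeric result
        System.out.println(nativePlusWithString(4.0, "3"));   // concatenation
        System.out.println(plus(plus("a", 3), "3"));          // null propagates, no exception
        System.out.println(unboxingNullThrows());
    }
}
```

Returning a reference type lets the inner null flow into the outer call, where it is handled explicitly instead of being dereferenced.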
[jira] [Created] (KYLIN-5747) Calcite constant folding, adding strings to numbers, results not as expected when multiple plus signs are used together
zhong.zhu created KYLIN-5747: Summary: Calcite constant folding, adding strings to numbers, results not as expected when multiple plus signs are used together Key: KYLIN-5747 URL: https://issues.apache.org/jira/browse/KYLIN-5747 Project: Kylin Issue Type: Bug Affects Versions: 5.0-beta Reporter: zhong.zhu Assignee: zhong.zhu Fix For: 5.0.0 Phenomenon: When more than one plus sign is used in a row and the parameters on both sides of the plus sign are constants, the result is not as expected '1' + 3 + 3 → 7 (correct) '1' + 3 + '3' → 4.03 (wrong result) '1' + '3' + 'a' → error When multiple plus signs are used in a row, and the arguments on both sides of the plus sign are constants, and the first plus sign results in null, the use of plus signs in a row is not supported. e.g. 'q' + 1 + 1 -> error -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KYLIN-5746) On the page, select online model operation offline, click the model online again, and put the model online button into ash.
zhong.zhu created KYLIN-5746: Summary: On the page, select online model operation offline, click the model online again, and put the model online button into ash. Key: KYLIN-5746 URL: https://issues.apache.org/jira/browse/KYLIN-5746 Project: Kylin Issue Type: Bug Affects Versions: 5.0-beta Reporter: zhong.zhu Assignee: zhong.zhu Fix For: 5.0.0 Attachments: image-2023-12-11-17-33-31-889.png
Steps to reproduce:
1. Create a model, build data, and bring the model online.
2. Take the model offline.
3. Click the model again and choose to bring it online.
Actual result: the model online button is grayed out, so the model cannot be brought online.
!image-2023-12-11-17-33-31-889.png! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KYLIN-5745) The historical garbage cleanup task was not completed, causing the subsequent scheduled garbage cleanup task cannot be executed normally
[ https://issues.apache.org/jira/browse/KYLIN-5745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhong.zhu updated KYLIN-5745: - Description:
{*}Problem description{*}: The scheduled garbage cleanup operation cannot complete successfully.
{*}Background{*}: The customer found that Kylin had a large number of small files occupying HDFS storage that needed to be cleaned up. We checked the customer's environment and found that the scheduled garbage cleanup had not been completing properly and kept timing out.
*Troubleshooting:* The check showed that the customer's garbage cleanup was first triggered on the morning of 4.6, after Kylin was restarted on the night of 4.5. Since that cleanup was triggered, the query history cleanup thread has been deleting continuously, so subsequent periodic garbage cleanup tasks cannot complete. Query history is deleted 2,000 rows at a time, and one of the customer's projects needs to delete 550,000 query history records. According to kylin.log, table locking makes the deletes slow: a single delete operation took more than 20 minutes.
The following records show the main garbage cleanup thread waiting for the query history cleanup to complete. The query history cleanup does not finish, so the main thread times out and exits. {code:shell} 2023-04-06T00:00:00,015 INFO [RoutineOpsWorker-287] service.ScheduleService : execute task MetadataBackup with remaining time: 1435 ms 2023-04-06T00:01:52,649 INFO [RoutineOpsWorker-287] service.ScheduleService : execute task QueryHistoriesCleanup with remaining time: 14287361 ms ... 
2023-04-06T04:00:00,012 WARN [DefaultTaskScheduler-3] service.ScheduleService : Routine task execution timeout java.util.concurrent.TimeoutException: null at java.util.concurrent.FutureTask.get(FutureTask.java:205) ~[?:1.8.0_242] at org.apache.kylin.rest.service.ScheduleService.executeTask(ScheduleService.java:107) ~[kylin-job-service-5.0.0-ke-4.6.2.0.jar:?] at org.apache.kylin.rest.service.ScheduleService.routineTask(ScheduleService.java:77) ~[kylin-job-service-5.0.0-ke-4.6.2.0.jar:?] at org.apache.kylin.rest.service.ScheduleService$$FastClassBySpringCGLIB$$afbfc46c.invoke() ~[kylin-job-service-5.0.0-ke-4.6.2.0.jar:?] {code} The following record is until the latest time provided by the log, after 9:00 pm the query history is still processing deletion, not with the termination of the main thread {code:shell} 2023-04-06T00:08:43,015 DEBUG [QueryHistoryCleanWorker-23145] QueryHistoryMapper.selectByProject : <== Total: 12 2023-04-06T00:08:43,016 INFO [QueryHistoryCleanWorker-23145] util.QueryHisStoreUtil : Query histories of project is less than the maximum limit, so skip it. 2023-04-06T00:08:43,016 INFO [QueryHistoryCleanWorker-23145] util.QueryHisStoreUtil : Query histories of project is less than the maximum limit, so skip it. 2023-04-06T00:08:43,016 INFO [QueryHistoryCleanWorker-23145] util.QueryHisStoreUtil : Query histories of project is less than the maximum limit, so skip it. 2023-04-06T00:08:43,016 INFO [QueryHistoryCleanWorker-23145] util.QueryHisStoreUtil : Query histories of project is less than the maximum limit, so skip it. 2023-04-06T00:08:43,017 INFO [QueryHistoryCleanWorker-23145] util.QueryHisStoreUtil : Start to delete query histories that are beyond max size for project, records:1551669 ... 
2023-04-06T09:03:54,974 INFO [QueryHistoryCleanWorker-23145] query.JdbcQueryHistoryStore : Delete 2000 row query history for project [CXCZH] takes 938060 ms 2023-04-06T09:03:54,975 DEBUG [QueryHistoryCleanWorker-23145] QueryHistoryMapper.delete : ==> Preparing: delete from ke4_instance_query_history_realization where query_time < ? and project_name = ? 2023-04-06T09:03:54,975 DEBUG [QueryHistoryCleanWorker-23145] QueryHistoryMapper.delete : ==> Parameters: 1678863450091(Long), CXCZH(String) {code} was: {*}Problem description{*}: Timed garbage cleanup operation cannot be completed successfully {*}Background{*}: The customer found that Kylin has a large number of small files occupying hdfs storage, we need to clean up, we check the customer's environment and found that the timed garbage cleanup has not been completed properly, has been timeout! *Troubleshooting:* After the check, it is found that the customer's garbage clearing is triggered for the first time in the morning of 4.6 after KE is restarted on the night of 4.5. After this clearing operation is triggered, the thread of query history has been deleted since then. As a result, subsequent periodic garbage clearing tasks cannot be completed Delete 2,000 rows of data at a time, one of the customer's projects need to delete 550,000 query history, look at the kylin.log record, delete time-consuming because of table locking problems lead to a delete operation even reached
[jira] [Commented] (KYLIN-5745) The historical garbage cleanup task was not completed, causing the subsequent scheduled garbage cleanup task cannot be executed normally
[ https://issues.apache.org/jira/browse/KYLIN-5745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17795247#comment-17795247 ] zhong.zhu commented on KYLIN-5745: --
h1. Root Cause
Controllers in Spring are singletons, so every call to the Service method below cleans up the underlying HDFS files serially. Compared with cleaning up metadata and query histories, this step is particularly time-consuming. In addition, *_MetadataToolHelper_* is designed to run serially, so other callers of this method are serialized as well.
{code:java} MetadataToolHelper::cleanStorage{code}
There is also a design issue. To perform the various kinds of data cleanup, Kylin uses two thread pools: one is the Route service's single-threaded pool, and the other is Spring's built-in task pool, which has 5 threads by default. The tasks that both pools accept include cleaning up HDFS files, so a stuck cleanup can make the whole of Kylin unavailable. A separate thread pool is therefore needed for cleaning up HDFS files.
h1. Dev Design
Move the logic for cleaning up HDFS files into a dedicated thread pool and add timeout logic. A task manager for storage garbage cleanup is built on a thread pool backed by a *PriorityBlockingQueue* priority task queue. It decouples the serial cleanup steps for Class I, II, and III garbage and runs them asynchronously.
h1. Major Changes
1. Define the task types and their priority: {*}SERVICE > CLI > ROUTINE{*}
2. Abstract task class with weights: {code:java} public abstract class AbstractComparableCleanTask implements Runnable, Comparable{code}
3. Custom thread pool and task cache queue: a configurable {_}*CachedThreadPool*{_} with a specified _*PriorityBlockingQueue*_
4. New task manager: {code:java} public class CleanStoragesHelper implements Closeable{code}
5. New timeout mechanism: the core logic is based on Java's *_CompletableFuture / Future_*. The new parameters are:
{*}_kylin.storage.clean-timeout=1h_{*}: the default timeout of a cleanup task, applied only to query histories/storage cleanup; {color:#de350b}this parameter takes effect only in non-Routine, non-CLI scenarios;{color}
{*}_kylin.storage.clean-tasks-concurrency=5_{*}: the number of threads that run query histories/HDFS garbage cleanup tasks, i.e. the maximum number of these two types of tasks executed at the same time; later submissions wait in the task cache queue.
6. Asynchronous/synchronous mechanism: {_}*CleanStoragesHelper*{_} provides both synchronous and asynchronous methods, and the caller chooses between them according to the scenario.
7. Initialization timing for global classes: initialize the tool class *_CleanStoragesHelper_* in *_AppInitializer_* to avoid problems caused by the method *_KylinConfig.getInstanceFromEn_*, which may return a non-system KylinConfig.
8. Track the life cycle of a task: {*}CREATE => SUBMIT => SUCCEED/FAILED{*}
> The historical garbage cleanup task was not completed, causing the subsequent > scheduled garbage cleanup task cannot be executed normally > > > Key: KYLIN-5745 > URL: https://issues.apache.org/jira/browse/KYLIN-5745 > Project: Kylin > Issue Type: Bug >Affects Versions: 5.0-beta >Reporter: zhong.zhu >Assignee: zhong.zhu >Priority: Major > Fix For: 5.0.0 > > > {*}Problem description{*}: > Timed garbage cleanup operation cannot be completed successfully > {*}Background{*}: > The customer found that Kylin has a large number of small files occupying > hdfs storage, we need to clean up, we check the customer's environment and > found that the timed garbage cleanup has not been completed properly, has > been timeout! 
> *Troubleshooting:* > After the check, it is found that the customer's garbage clearing is > triggered for the first time in the morning of 4.6 after KE is restarted on > the night of 4.5. After this clearing operation is triggered, the thread of > query history has been deleted since then. As a result, subsequent periodic > garbage clearing tasks cannot be completed > Delete 2,000 rows of data at a time, one of the customer's projects need to > delete 550,000 query history, look at the kylin.log record, delete > time-consuming because of table locking problems lead to a delete operation > even reached more than 20 minutes! > The following record is that the main thread of garbage collection is waiting > for the query history cleaning to complete, but the query history cleaning > has not been completed, and then the main thread timeout and exit. > {code:shell} > 2023-04-06T00:00:00,015 INFO [RoutineOpsWorker-287] service.ScheduleService > : execute task
[jira] [Created] (KYLIN-5745) The historical garbage cleanup task was not completed, causing the subsequent scheduled garbage cleanup task cannot be executed normally
zhong.zhu created KYLIN-5745: Summary: The historical garbage cleanup task was not completed, causing the subsequent scheduled garbage cleanup task cannot be executed normally Key: KYLIN-5745 URL: https://issues.apache.org/jira/browse/KYLIN-5745 Project: Kylin Issue Type: Bug Affects Versions: 5.0-beta Reporter: zhong.zhu Assignee: zhong.zhu Fix For: 5.0.0
{*}Problem description{*}: The scheduled garbage cleanup operation cannot complete successfully.
{*}Background{*}: The customer found that Kylin had a large number of small files occupying HDFS storage that needed to be cleaned up. We checked the customer's environment and found that the scheduled garbage cleanup had not been completing properly and kept timing out.
*Troubleshooting:* The check showed that the customer's garbage cleanup was first triggered on the morning of 4.6, after KE was restarted on the night of 4.5. Since that cleanup was triggered, the query history cleanup thread has been deleting continuously, so subsequent periodic garbage cleanup tasks cannot complete. Query history is deleted 2,000 rows at a time, and one of the customer's projects needs to delete 550,000 query history records. According to kylin.log, table locking makes the deletes slow: a single delete operation took more than 20 minutes.
The following records show the main garbage cleanup thread waiting for the query history cleanup to complete. The query history cleanup does not finish, so the main thread times out and exits. {code:shell} 2023-04-06T00:00:00,015 INFO [RoutineOpsWorker-287] service.ScheduleService : execute task MetadataBackup with remaining time: 1435 ms 2023-04-06T00:01:52,649 INFO [RoutineOpsWorker-287] service.ScheduleService : execute task QueryHistoriesCleanup with remaining time: 14287361 ms ... 
2023-04-06T04:00:00,012 WARN [DefaultTaskScheduler-3] service.ScheduleService : Routine task execution timeout java.util.concurrent.TimeoutException: null at java.util.concurrent.FutureTask.get(FutureTask.java:205) ~[?:1.8.0_242] at org.apache.kylin.rest.service.ScheduleService.executeTask(ScheduleService.java:107) ~[kylin-job-service-5.0.0-ke-4.6.2.0.jar:?] at org.apache.kylin.rest.service.ScheduleService.routineTask(ScheduleService.java:77) ~[kylin-job-service-5.0.0-ke-4.6.2.0.jar:?] at org.apache.kylin.rest.service.ScheduleService$$FastClassBySpringCGLIB$$afbfc46c.invoke() ~[kylin-job-service-5.0.0-ke-4.6.2.0.jar:?] {code} The following record is until the latest time provided by the log, after 9:00 pm the query history is still processing deletion, not with the termination of the main thread {code:shell} 2023-04-06T00:08:43,015 DEBUG [QueryHistoryCleanWorker-23145] QueryHistoryMapper.selectByProject : <== Total: 12 2023-04-06T00:08:43,016 INFO [QueryHistoryCleanWorker-23145] util.QueryHisStoreUtil : Query histories of project is less than the maximum limit, so skip it. 2023-04-06T00:08:43,016 INFO [QueryHistoryCleanWorker-23145] util.QueryHisStoreUtil : Query histories of project is less than the maximum limit, so skip it. 2023-04-06T00:08:43,016 INFO [QueryHistoryCleanWorker-23145] util.QueryHisStoreUtil : Query histories of project is less than the maximum limit, so skip it. 2023-04-06T00:08:43,016 INFO [QueryHistoryCleanWorker-23145] util.QueryHisStoreUtil : Query histories of project is less than the maximum limit, so skip it. 2023-04-06T00:08:43,017 INFO [QueryHistoryCleanWorker-23145] util.QueryHisStoreUtil : Start to delete query histories that are beyond max size for project, records:1551669 ... 
2023-04-06T09:03:54,974 INFO [QueryHistoryCleanWorker-23145] query.JdbcQueryHistoryStore : Delete 2000 row query history for project [CXCZH] takes 938060 ms 2023-04-06T09:03:54,975 DEBUG [QueryHistoryCleanWorker-23145] QueryHistoryMapper.delete : ==> Preparing: delete from ke4_instance_query_history_realization where query_time < ? and project_name = ? 2023-04-06T09:03:54,975 DEBUG [QueryHistoryCleanWorker-23145] QueryHistoryMapper.delete : ==> Parameters: 1678863450091(Long), CXCZH(String) {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
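The design proposed in the KYLIN-5745 comment (a dedicated pool fed by a PriorityBlockingQueue, task types ordered SERVICE > CLI > ROUTINE, and a timeout on the caller side) can be sketched as below. This is a minimal illustration under assumptions and not the Kylin implementation: CleanTaskPoolSketch and its members are hypothetical, and only the task-type names come from the comment. Tasks are handed to the pool with execute() rather than submit(), because submit() would wrap them in a FutureTask that is not Comparable and the priority queue could not order it.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.PriorityBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Illustrative only: hypothetical names, not the Kylin CleanStoragesHelper. */
final class CleanTaskPoolSketch {

    /** Declaration order is the priority order: SERVICE runs before CLI, CLI before ROUTINE. */
    enum TaskType { SERVICE, CLI, ROUTINE }

    /** A runnable that the priority queue can order by task type. */
    static final class PrioritizedTask implements Runnable, Comparable<PrioritizedTask> {
        final TaskType type;
        final Runnable body;

        PrioritizedTask(TaskType type, Runnable body) {
            this.type = type;
            this.body = body;
        }

        @Override public void run() { body.run(); }

        @Override public int compareTo(PrioritizedTask other) { return type.compareTo(other.type); }
    }

    static ThreadPoolExecutor newPool(int threads) {
        return new ThreadPoolExecutor(threads, threads, 0L, TimeUnit.MILLISECONDS,
                new PriorityBlockingQueue<Runnable>());
    }

    /** Queues a task and returns a future the caller can wait on with a timeout. */
    static CompletableFuture<Void> submit(ThreadPoolExecutor pool, TaskType type, Runnable body) {
        CompletableFuture<Void> done = new CompletableFuture<>();
        pool.execute(new PrioritizedTask(type, () -> {
            try {
                body.run();
                done.complete(null);
            } catch (RuntimeException e) {
                done.completeExceptionally(e);
            }
        }));
        return done;
    }

    /** Holds the single worker busy, queues three tasks, and reports the order they ran in. */
    static List<String> demoOrder() {
        ThreadPoolExecutor pool = newPool(1);
        List<String> order = new CopyOnWriteArrayList<>();
        CountDownLatch gate = new CountDownLatch(1);
        try {
            submit(pool, TaskType.SERVICE, () -> awaitQuietly(gate)); // occupies the only worker
            CompletableFuture<Void> a = submit(pool, TaskType.ROUTINE, () -> order.add("ROUTINE"));
            CompletableFuture<Void> b = submit(pool, TaskType.CLI, () -> order.add("CLI"));
            CompletableFuture<Void> c = submit(pool, TaskType.SERVICE, () -> order.add("SERVICE"));
            gate.countDown();
            CompletableFuture.allOf(a, b, c).join();
            return order;
        } finally {
            pool.shutdown();
        }
    }

    /** Returns true when the caller gives up waiting before the task finishes. */
    static boolean timesOut(long taskMillis, long timeoutMillis) {
        ThreadPoolExecutor pool = newPool(1);
        try {
            submit(pool, TaskType.ROUTINE, () -> sleepQuietly(taskMillis))
                    .get(timeoutMillis, TimeUnit.MILLISECONDS);
            return false;
        } catch (TimeoutException e) {
            return true;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdownNow(); // interrupts the task that is still running
        }
    }

    private static void awaitQuietly(CountDownLatch latch) {
        try { latch.await(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }

    private static void sleepQuietly(long millis) {
        try { Thread.sleep(millis); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }

    public static void main(String[] args) {
        System.out.println(demoOrder());
        System.out.println(timesOut(500, 50));
    }
}
```

Because the waiting caller stops at the timeout while cleanup runs on its own pool, a slow query history delete no longer holds the scheduling thread for hours as in the log above.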