[jira] [Updated] (KYLIN-1768) NDCuboidMapper throws ArrayIndexOutOfBoundsException when dimension is fixed length encoded to more than 256 bytes
[ https://issues.apache.org/jira/browse/KYLIN-1768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dayue Gao updated KYLIN-1768:
- Description: When a user defines a dimension that is fixed-length encoded to more than 256 bytes, the "Build N-Dimension Cuboid Data" step fails in the map phase. The stack trace is shown below:
{noformat}
Error: java.lang.ArrayIndexOutOfBoundsException
at java.lang.System.arraycopy(Native Method)
at org.apache.kylin.cube.common.RowKeySplitter.split(RowKeySplitter.java:103)
at org.apache.kylin.engine.mr.steps.NDCuboidMapper.map(NDCuboidMapper.java:125)
at org.apache.kylin.engine.mr.steps.NDCuboidMapper.map(NDCuboidMapper.java:49)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
{noformat}
The reason is that `RowKeySplitter` is hardcoded to 65 splits of 256 bytes each; copying a larger encoded dimension into a split buffer throws ArrayIndexOutOfBoundsException.

> NDCuboidMapper throws ArrayIndexOutOfBoundsException when dimension is fixed length encoded to more than 256 bytes
> Key: KYLIN-1768
> URL: https://issues.apache.org/jira/browse/KYLIN-1768
> Project: Kylin
> Issue Type: Bug
> Components: Job Engine
> Affects Versions: v1.5.2
> Reporter: Dayue Gao
> Assignee: Dayue Gao
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
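The failure mode can be illustrated with a minimal sketch. The 256-byte buffer size mirrors the hardcoded limit described above, but this is illustrative code, not the actual RowKeySplitter implementation:

```java
public class SplitBufferSketch {
    // RowKeySplitter pre-allocates fixed-size split buffers (256 bytes each, 65 splits).
    private static final int SPLIT_BUFFER_LEN = 256;

    // Copying an encoded dimension longer than the buffer overflows, producing the
    // same ArrayIndexOutOfBoundsException from System.arraycopy seen in the trace.
    static byte[] copyIntoSplit(byte[] encodedDim) {
        byte[] split = new byte[SPLIT_BUFFER_LEN];
        System.arraycopy(encodedDim, 0, split, 0, encodedDim.length);
        return split;
    }
}
```

A dimension encoded to 300 bytes overflows the 256-byte split buffer, while a shorter one fits.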
[jira] [Created] (KYLIN-1768) NDCuboidMapper throws ArrayIndexOutOfBoundsException when dimension is fixed length encoded to more than 256 bytes
Dayue Gao created KYLIN-1768:
Summary: NDCuboidMapper throws ArrayIndexOutOfBoundsException when dimension is fixed length encoded to more than 256 bytes
Key: KYLIN-1768
URL: https://issues.apache.org/jira/browse/KYLIN-1768
Project: Kylin
Issue Type: Bug
Components: Job Engine
Affects Versions: v1.5.2
Reporter: Dayue Gao
Assignee: Dayue Gao
When a user defines a dimension that is fixed-length encoded to more than 256 bytes, the "Build N-Dimension Cuboid Data" step fails in the map phase. The stack trace is shown below:
{noformat}
Error: java.lang.ArrayIndexOutOfBoundsException
at java.lang.System.arraycopy(Native Method)
at org.apache.kylin.cube.common.RowKeySplitter.split(RowKeySplitter.java:103)
at org.apache.kylin.engine.mr.steps.NDCuboidMapper.map(NDCuboidMapper.java:125)
at org.apache.kylin.engine.mr.steps.NDCuboidMapper.map(NDCuboidMapper.java:49)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
{noformat}
The reason is that `RowKeySplitter` is hardcoded to 65 splits of 256 bytes each; copying a larger encoded dimension into a split buffer throws ArrayIndexOutOfBoundsException.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (KYLIN-1770) Can't use PreparedStatement with "between and" expression
Dayue Gao created KYLIN-1770:
Summary: Can't use PreparedStatement with "between and" expression
Key: KYLIN-1770
URL: https://issues.apache.org/jira/browse/KYLIN-1770
Project: Kylin
Issue Type: Bug
Components: Driver - JDBC
Affects Versions: v1.5.2, v1.5.1
Reporter: Dayue Gao
Sample code to reproduce:
{code:java}
final String sql = "select count(*) from kylin_sales where LSTG_SITE_ID between ? and ?";
try (PreparedStatement stmt = conn.prepareStatement(sql)) {
    stmt.setInt(1, 0);
    stmt.setInt(2, 5);
    try (ResultSet rs = stmt.executeQuery()) {
        printResultSet(rs);
    }
}
{code}
Exception stack trace from server log:
{noformat}
java.sql.SQLException: Error while preparing statement [select count(*) from kylin_sales where LSTG_SITE_ID between ? and ?]
at org.apache.calcite.avatica.Helper.createException(Helper.java:56)
at org.apache.calcite.avatica.Helper.createException(Helper.java:41)
at org.apache.calcite.jdbc.CalciteConnectionImpl.prepareStatement_(CalciteConnectionImpl.java:203)
at org.apache.calcite.jdbc.CalciteConnectionImpl.prepareStatement(CalciteConnectionImpl.java:184)
at org.apache.calcite.jdbc.CalciteConnectionImpl.prepareStatement(CalciteConnectionImpl.java:85)
at org.apache.calcite.avatica.AvaticaConnection.prepareStatement(AvaticaConnection.java:153)
at org.apache.kylin.rest.service.QueryService.execute(QueryService.java:353)
at org.apache.kylin.rest.service.QueryService.queryWithSqlMassage(QueryService.java:274)
at org.apache.kylin.rest.service.QueryService.query(QueryService.java:120)
at org.apache.kylin.rest.service.QueryService$$FastClassByCGLIB$$4957273f.invoke()
at net.sf.cglib.proxy.MethodProxy.invoke(MethodProxy.java:204)
at org.springframework.aop.framework.Cglib2AopProxy$DynamicAdvisedInterceptor.intercept(Cglib2AopProxy.java:618)
at org.apache.kylin.rest.service.QueryService$$EnhancerByCGLIB$$8610374f.query()
at org.apache.kylin.rest.controller.QueryController.doQueryWithCache(QueryController.java:192)
at org.apache.kylin.rest.controller.QueryController.prepareQuery(QueryController.java:101)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.springframework.web.method.support.InvocableHandlerMethod.invoke(InvocableHandlerMethod.java:213)
at org.springframework.web.method.support.InvocableHandlerMethod.invokeForRequest(InvocableHandlerMethod.java:126)
at org.springframework.web.servlet.mvc.method.annotation.ServletInvocableHandlerMethod.invokeAndHandle(ServletInvocableHandlerMethod.java:96)
at org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerAdapter.invokeHandlerMethod(RequestMappingHandlerAdapter.java:617)
at org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerAdapter.handleInternal(RequestMappingHandlerAdapter.java:578)
at org.springframework.web.servlet.mvc.method.AbstractHandlerMethodAdapter.handle(AbstractHandlerMethodAdapter.java:80)
at org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:923)
at org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:852)
at org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:882)
at org.springframework.web.servlet.FrameworkServlet.doPost(FrameworkServlet.java:789)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:646)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:727)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:303)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
at org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:52)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
at org.springframework.security.web.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:330)
at org.springframework.security.web.access.intercept.FilterSecurityInterceptor.invoke(FilterSecurityInterceptor.java:118)
at org.springframework.security.web.access.intercept.FilterSecurityInterceptor.doFilter(FilterSecurityInterceptor.java:84)
at
[jira] [Updated] (KYLIN-1752) Add an option to fail cube build job when source table is empty
[ https://issues.apache.org/jira/browse/KYLIN-1752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dayue Gao updated KYLIN-1752:
- Attachment: KYLIN-1752.patch.1
Uploaded KYLIN-1752.patch.1 per Shaofeng's suggestion.
> Add an option to fail cube build job when source table is empty
> Key: KYLIN-1752
> URL: https://issues.apache.org/jira/browse/KYLIN-1752
> Project: Kylin
> Issue Type: New Feature
> Components: Job Engine
> Affects Versions: v1.5.2
> Reporter: Dayue Gao
> Assignee: Dayue Gao
> Priority: Trivial
> Attachments: KYLIN-1752.patch, KYLIN-1752.patch.1
>
> For a non-incrementally built cube, it's valuable to be able to fail the build job when the source table is empty. Otherwise, a mistake in the upstream ETL that results in an empty source table will produce an empty cube. Often in this situation, users still want to be able to query the cube's historical data until they fix their ETL and rebuild the cube.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
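The option proposed in this issue amounts to a guard before the build proceeds. A minimal sketch, with hypothetical method and flag names (not Kylin's actual configuration keys):

```java
public class EmptySourceGuard {
    // Hypothetical guard: when failOnEmptySource is enabled and the source table
    // has no rows, abort the build instead of producing an empty cube segment.
    static void checkSourceRowCount(long rowCount, boolean failOnEmptySource) {
        if (failOnEmptySource && rowCount == 0) {
            throw new IllegalStateException("Source table is empty, failing cube build");
        }
    }
}
```

With the option off, the build proceeds as before even on an empty table, preserving the old behavior for users who prefer an empty (still queryable) segment.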
[jira] [Commented] (KYLIN-1677) Distribute source data by certain columns when creating flat table
[ https://issues.apache.org/jira/browse/KYLIN-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15319955#comment-15319955 ] Dayue Gao commented on KYLIN-1677:
Hi Shaofeng,
Here's the test result of using a hive view as the fact table:
|| KYLIN-1677 step || Time(min) || KYLIN-1656 step || Time(min) ||
| Count Source Table | 9.02 | Create Intermediate Flat Hive Table | 8.12 |
| Create Intermediate Flat Hive Table | 12.89 | Redistribute Intermediate Flat Hive | 2.39 |
As expected, KYLIN-1677 took more time because it materializes the view twice instead of once as in KYLIN-1656. To be fair, I also tested a cube that uses a non-view fact table:
|| KYLIN-1677 step || Time(min) || KYLIN-1656 step || Time(min) ||
| Count Source Table | 1.10 | Create Intermediate Flat Hive Table | 3.74 |
| Create Intermediate Flat Hive Table | 1.70 | Redistribute Intermediate Flat Hive | 5.13 |
In this case, KYLIN-1677 performs better than KYLIN-1656 because it avoids one round of MR. In general, I'm +1 to release KYLIN-1677 as a refinement of KYLIN-1656.
> Distribute source data by certain columns when creating flat table
> Key: KYLIN-1677
> URL: https://issues.apache.org/jira/browse/KYLIN-1677
> Project: Kylin
> Issue Type: Improvement
> Components: Job Engine
> Reporter: Shaofeng SHI
> Assignee: Shaofeng SHI
> Fix For: v1.5.3
>
> Inspired by KYLIN-1656, Kylin can distribute the source data by certain columns when creating the flat hive table. The data assigned to a mapper will then have more similarity, more aggregation can happen at the mapper side, and less shuffle and reduce work is needed.
> Columns that can be used for the distribution include: ultra-high-cardinality columns, mandatory columns, the partition date/time column, etc.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (KYLIN-1758) createLookupHiveViewMaterializationStep will create intermediate table for fact table
[ https://issues.apache.org/jira/browse/KYLIN-1758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15315319#comment-15315319 ] Dayue Gao commented on KYLIN-1758:
Hi Shaofeng,
Please also note that `createLookupHiveViewMaterializationStep` should use `JoinedFlatTable.generateHiveSetStatements` when building the hive command; otherwise it will not pick up the correct queue configuration.
> createLookupHiveViewMaterializationStep will create intermediate table for fact table
> Key: KYLIN-1758
> URL: https://issues.apache.org/jira/browse/KYLIN-1758
> Project: Kylin
> Issue Type: Bug
> Components: Job Engine
> Affects Versions: v1.5.2
> Environment: hadoop2.4, hbase1.1.2
> Reporter: Xingxing Di
> Assignee: Shaofeng SHI
> Priority: Critical
>
> In our model, the fact table is a hive view. When I built the cube (I selected one partition for one day's data), the job tried to create the intermediate table with this SQL:
> DROP TABLE IF EXISTS kylin_intermediate_OLAP_OLAP_FLW_PCM_MAU;
> CREATE TABLE IF NOT EXISTS kylin_intermediate_OLAP_OLAP_FLW_PCM_MAU
> LOCATION '/team/db/kylin15/kylin_metadata15/kylin-65c7ee9a-3024-4633-927c-19e992ed155a/kylin_intermediate_OLAP_OLAP_FLW_PCM_MAU'
> AS SELECT * FROM OLAP.OLAP_FLW_PCM_MAU;
> 1. There is no partition predicate in the WHERE clause (even though I selected only one partition), which causes a very, very big MR job.
> 2. Note that our lookup table is not a view.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (KYLIN-1656) Improve performance of MRv2 engine by making each mapper handles a configured number of records
[ https://issues.apache.org/jira/browse/KYLIN-1656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15300145#comment-15300145 ] Dayue Gao commented on KYLIN-1656:
Hi Shaofeng,
We chose 500K to increase parallelism and reduce total build time. As we have a very big cluster, we don't see the problem of pending tasks, but you made a good point about it. In your test you had 5000+ mappers, which means the input has 2.5B+ rows; I'm not sure that's the common case. Most of our cubes in production have ~100M rows per segment, so 500K leads to 200 mappers, which looks like a reasonable parallelism to me. If the setting were increased to 5M, then 20 mappers for the "Build Cube" step would be far too few, ultimately leading to step timeouts. I do find an input split of 500K rows to be small, but I don't see it as a problem.
> Improve performance of MRv2 engine by making each mapper handles a configured number of records
> Key: KYLIN-1656
> URL: https://issues.apache.org/jira/browse/KYLIN-1656
> Project: Kylin
> Issue Type: Improvement
> Components: Job Engine
> Affects Versions: v1.5.0, v1.5.1
> Reporter: Dayue Gao
> Assignee: Dayue Gao
> Fix For: v1.5.3
> Attachments: KYLIN-1656.patch
>
> In the current version of the MRv2 build engine, each mapper handles one block of the flat hive table (stored as sequence files). This has two major problems:
> # It's difficult for the user to control the parallelism of mappers for each cube. Users can change "dfs.block.size" in kylin_hive_conf.xml, but that is a global configuration and cannot be overridden using the "override_kylin_properties" introduced in [KYLIN-1534|https://issues.apache.org/jira/browse/KYLIN-1534].
> # Mapper execution may skew due to a skewed distribution of record counts across blocks. This is the more severe problem, since the FactDistinctColumn and InMemCubing steps of MRv2 are very CPU intensive in the map task. To give you a sense of how bad it is, one of our cube's FactDistinctColumnStep takes ~100min in total with an average mapper time of only 11min, because several skewed map tasks handled 10x the records of the average map task. And the InMemCubing steps failed because the skewed mapper tasks hit "mapred.task.timeout".
> To avoid the skew, *we'd better make each mapper handle a configurable number of records instead of one sequence-file block.* The way we achieved this is to add a `RedistributeFlatHiveTableStep` right after "FlatHiveTableStep".
> Here's what RedistributeFlatHiveTableStep does:
> 1. Run {{select count(1) from intermediate_table}} to determine the `input_rowcount` of this build.
> 2. Run {{insert overwrite table intermediate_table select * from intermediate_table distribute by rand()}} to evenly distribute records to reducers.
> The number of reducers is "input_rowcount / mapper_input_rows", where `mapper_input_rows` is a new parameter that lets the user specify how many records each mapper should handle. Since each reducer writes its records into one file, we're guaranteed that after RedistributeFlatHiveTableStep, each sequence file of the FlatHiveTable contains around mapper_input_rows records. And since the follow-up job's mapper handles one block of each sequence file, it won't handle more than mapper_input_rows records.
> The added RedistributeFlatHiveTableStep usually takes a small amount of time compared to the other steps, but the benefit it brings is remarkable. Here's the performance improvement we saw:
> || cube || FactDistinctColumn before || RedistributeFlatHiveTableStep || FactDistinctColumn after ||
> | case#1 | 51.78min | 8.40min | 13.06min |
> | case#2 | 95.65min | 2.46min | 26.37min |
> And since mapper_input_rows is a kylin configuration, users can override it per cube.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
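The reducer-count rule described above ("input_rowcount / mapper_input_rows") can be sketched as follows; rounding up and clamping to at least one reducer are my assumptions, not details stated in the issue:

```java
public class RedistributeSketch {
    // numReducers = ceil(inputRowCount / mapperInputRows), at least 1, so each
    // reducer's output file (and hence each downstream mapper) holds about
    // mapperInputRows records.
    static int reducerCount(long inputRowCount, long mapperInputRows) {
        long n = (inputRowCount + mapperInputRows - 1) / mapperInputRows;
        return (int) Math.max(1, n);
    }
}
```

With the numbers from the comment above, 100M input rows at 500K rows per mapper yields 200 reducers, and 2.5B rows yields 5000.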
[jira] [Commented] (KYLIN-1677) Distribute source data by certain columns when creating flat table
[ https://issues.apache.org/jira/browse/KYLIN-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15300164#comment-15300164 ] Dayue Gao commented on KYLIN-1677:
Thanks for the reply. I'll test the master branch on a hive view tomorrow to see how it performs. Our internal version is still using KYLIN-1656, and from the performance numbers given there, the cost of RedistributeFlatHiveTableStep is usually negligible.
> Distribute source data by certain columns when creating flat table
> Key: KYLIN-1677
> URL: https://issues.apache.org/jira/browse/KYLIN-1677
> Project: Kylin
> Issue Type: Improvement
> Components: Job Engine
> Reporter: Shaofeng SHI
> Assignee: Shaofeng SHI
> Fix For: v1.5.3
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (KYLIN-1752) Add an option to fail cube build job when source table is empty
[ https://issues.apache.org/jira/browse/KYLIN-1752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dayue Gao updated KYLIN-1752:
- Description: For non-incremental build cube, it's valuable to be able to fail the build job as long as the source table is empty. Otherwise, a mistake in upstream ETL which results in empty source table will lead to an empty cube. Often in this situation, user wants to still be able to query history data in the cube before they fix their ETL and rebuild the cube.
(was: For non-incremental build cube, it's valuable to be able to fail the build job as long as the source table is empty. Otherwise, a mistake in upper ETL which results in empty source table will lead to an empty cube. Often in this situation, user wants to still be able to query history data in the cube before they fix their ETL and rebuild the cube.)
> Add an option to fail cube build job when source table is empty
> Key: KYLIN-1752
> URL: https://issues.apache.org/jira/browse/KYLIN-1752
> Project: Kylin
> Issue Type: New Feature
> Components: Job Engine
> Affects Versions: v1.5.2
> Reporter: Dayue Gao
> Assignee: Dayue Gao
> Priority: Trivial
> Attachments: KYLIN-1752.patch
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (KYLIN-1752) Add an option to fail cube build job when source table is empty
[ https://issues.apache.org/jira/browse/KYLIN-1752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dayue Gao updated KYLIN-1752: - Attachment: KYLIN-1752.patch Here's the patch, tested internally. [~Shaofengshi] Could you please take a look at this? > Add an option to fail cube build job when source table is empty > --- > > Key: KYLIN-1752 > URL: https://issues.apache.org/jira/browse/KYLIN-1752 > Project: Kylin > Issue Type: New Feature > Components: Job Engine >Affects Versions: v1.5.2 >Reporter: Dayue Gao >Assignee: Dayue Gao >Priority: Trivial > Attachments: KYLIN-1752.patch > > > For non-incremental build cube, it's valuable to be able to fail the build > job as long as the source table is empty. Otherwise, a mistake in upper ETL > which results in empty source table will lead to an empty cube. Often in this > situation, user wants to still be able to query history data in the cube > before they fix their ETL and rebuild the cube. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (KYLIN-1706) Allow cube to override MR job configuration by properties
[ https://issues.apache.org/jira/browse/KYLIN-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15315482#comment-15315482 ] Dayue Gao commented on KYLIN-1706:
This is one of the small but very helpful improvements I'd like to have in Kylin. There are times when I want to adjust the mapper/reducer heap size and "mapreduce.task.io.sort.mb" for a specific cube to accelerate building; that becomes feasible with this jira. Thanks for bringing it up!
In addition, I'd like to note that this jira alone is not sufficient to support cube/project level queue isolation. To the best of my knowledge, there are still two problems to deal with:
# This jira only allows the user to override the MR queue configuration in kylin_job_conf.xml. However, queue configuration also appears in kylin_hive_conf.xml, and we should address that too.
# A queue can have ACLs specifying who may submit applications to it. Therefore, Kylin needs to know not only which queue to submit the job to, but also which hadoop user to impersonate.
> Allow cube to override MR job configuration by properties
> Key: KYLIN-1706
> URL: https://issues.apache.org/jira/browse/KYLIN-1706
> Project: Kylin
> Issue Type: Improvement
> Reporter: liyang
> Assignee: liyang
> Fix For: v1.5.3
>
> Currently a cube can specify MR job configuration via a job_conf.xml file under conf/. This is not sufficient: for example, to specify 50+ different job queues, users would have to maintain 50+ different job_conf.xml files.
> By allowing config override from kylin properties, the 50+ job queue case becomes a lot easier.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (KYLIN-1656) Improve performance of MRv2 engine by making each mapper handles a configured number of records
[ https://issues.apache.org/jira/browse/KYLIN-1656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15309184#comment-15309184 ] Dayue Gao commented on KYLIN-1656:
Hi Shaofeng,
{quote} did you observe the source file of the intermediate file size in your side? {quote}
Yes, the intermediate file size is indeed very small.
{quote} My concern is it may generate many small files on HDFS, adding NN's memory footprint. {quote}
It would be a very good concern if these files were left on HDFS "forever". But since the files of the intermediate table are garbage collected as soon as the build succeeds, I don't think it's a big issue. What's your opinion?
> Improve performance of MRv2 engine by making each mapper handles a configured number of records
> Key: KYLIN-1656
> URL: https://issues.apache.org/jira/browse/KYLIN-1656
> Project: Kylin
> Issue Type: Improvement
> Components: Job Engine
> Affects Versions: v1.5.0, v1.5.1
> Reporter: Dayue Gao
> Assignee: Dayue Gao
> Fix For: v1.5.3
> Attachments: KYLIN-1656.patch
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (KYLIN-1752) Add an option to fail cube build job when source table is empty
Dayue Gao created KYLIN-1752:
Summary: Add an option to fail cube build job when source table is empty
Key: KYLIN-1752
URL: https://issues.apache.org/jira/browse/KYLIN-1752
Project: Kylin
Issue Type: New Feature
Components: Job Engine
Affects Versions: v1.5.2
Reporter: Dayue Gao
Assignee: Dayue Gao
Priority: Trivial
For a non-incrementally built cube, it's valuable to be able to fail the build job when the source table is empty. Otherwise, a mistake in the upper ETL that results in an empty source table will lead to an empty cube. Often in this situation, users still want to be able to query the cube's historical data until they fix their ETL and rebuild the cube.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (KYLIN-1657) Add new configuration kylin.job.mapreduce.min.reducer.number
[ https://issues.apache.org/jira/browse/KYLIN-1657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15303443#comment-15303443 ] Dayue Gao commented on KYLIN-1657:
Can we merge this?
> Add new configuration kylin.job.mapreduce.min.reducer.number
> Key: KYLIN-1657
> URL: https://issues.apache.org/jira/browse/KYLIN-1657
> Project: Kylin
> Issue Type: Improvement
> Components: Job Engine
> Affects Versions: v1.5.1
> Reporter: Dayue Gao
> Assignee: Dayue Gao
> Priority: Minor
> Attachments: KYLIN-1657.patch
>
> We have "kylin.job.mapreduce.max.reducer.number" to limit the max number of reducers for cubing jobs, but the min reducer count is hard-coded to 1. We should make it configurable as well; this could be helpful in some circumstances.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (KYLIN-1694) make multiply coefficient configurable when estimating cuboid size
[ https://issues.apache.org/jira/browse/KYLIN-1694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15303446#comment-15303446 ] Dayue Gao commented on KYLIN-1694:
Can we merge this?
> make multiply coefficient configurable when estimating cuboid size
> Key: KYLIN-1694
> URL: https://issues.apache.org/jira/browse/KYLIN-1694
> Project: Kylin
> Issue Type: Bug
> Components: Job Engine
> Affects Versions: v1.5.0, v1.5.1
> Reporter: kangkaisen
> Assignee: Dong Li
> Attachments: KYLIN-1694.patch
>
> In the current version of the MRv2 build engine, when CubeStatsReader estimates cuboid size, the current rule is: "if the cube is memory hungry, multiply the storage size estimation by 0.05; otherwise, multiply it by 0.25".
> This has one major problem: the default multiply coefficient is small, which makes the estimated cuboid size much less than the actual cuboid size. As a result, both the number of HBase regions and the number of CubeHFileJob reducers are smaller than they should be, which clearly makes the CubeHFileJob much slower.
> After we removed the default multiply coefficient, the CubeHFileJob became much faster.
> We'd better make the multiply coefficient configurable; that would be friendlier for users.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
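The estimation rule under discussion can be sketched as follows. The 0.05 and 0.25 coefficients come from the report above; the method shape and the configurable override are illustrative, not CubeStatsReader's actual API:

```java
public class CuboidSizeSketch {
    // Estimated cuboid size = rowCount * rowSize * coefficient, where the default
    // coefficient is 0.05 for memory-hungry cubes and 0.25 otherwise. A configurable
    // override (the proposed change) lets users avoid under-estimation, which
    // otherwise yields too few HBase regions and CubeHFileJob reducers.
    static double estimateMB(long rowCount, double rowSizeMB, boolean memoryHungry, Double override) {
        double coefficient = (override != null) ? override : (memoryHungry ? 0.05 : 0.25);
        return rowCount * rowSizeMB * coefficient;
    }
}
```

For example, 1000 rows of 0.01 MB each are estimated at 2.5 MB with the default non-memory-hungry coefficient, but at the full 10 MB with an override of 1.0.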
[jira] [Commented] (KYLIN-1656) Improve performance of MRv2 engine by making each mapper handles a configured number of records
[ https://issues.apache.org/jira/browse/KYLIN-1656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15309376#comment-15309376 ] Dayue Gao commented on KYLIN-1656: -- no problem :-) > Improve performance of MRv2 engine by making each mapper handles a configured > number of records > --- > > Key: KYLIN-1656 > URL: https://issues.apache.org/jira/browse/KYLIN-1656 > Project: Kylin > Issue Type: Improvement > Components: Job Engine >Affects Versions: v1.5.0, v1.5.1 >Reporter: Dayue Gao >Assignee: Dayue Gao > Fix For: v1.5.3 > > Attachments: KYLIN-1656.patch > > > In the current version of MRv2 build engine, each mapper handles one block of > the flat hive table (stored in sequence file). This has two major problems: > # It's difficult for user to control the parallelism of mappers for each cube. > User can change "dfs.block.size" in kylin_hive_conf.xml, however it's a > global configuration and cannot be override using "override_kylin_properties" > introduced in [KYLIN-1534|https://issues.apache.org/jira/browse/KYLIN-1534]. > # May encounter mapper execution skew due to a skew distribution of each > block's records number. > This is a more severe problem since FactDistinctColumn and InMemCubing step > of MRv2 is very cpu intensive in map task. To give you a sense of how bad it > is, one of our cube's FactDistinctColumnStep takes ~100min in total with > average mapper time only 11min. This is because there exists several skewed > map tasks which handled 10x records than average map task. And the > InMemCubing steps failed because the skewed mapper tasks hit > "mapred.task.timeout". > To avoid skew to happen, *we'd better make each mapper handles a configurable > number of records instead of handles a sequence file block.* The way we > achieved this is to add a `RedistributeFlatHiveTableStep` right after > "FlatHiveTableStep". > Here's what RedistributeFlatHiveTableStep do: > 1. 
we run a {{select count(1) from intermediate_table}} to determine the > `input_rowcount` of this build > 2. we run a {{insert overwrite table intermediate_table select * from > intermediate_table distribute by rand()}} to evenly distribute records to > reducers. > The number of reducers is specified as "input_rowcount / mapper_input_rows" > where `mapper_input_rows` is a new parameter for user to specify how many > records each mapper should handle. Since each reducer will write out its > records into one file, we're guaranteed that after > RedistributeFlatHiveTableStep, each sequence file of FlatHiveTable contains > around mapper_input_rows. And since the followed up job's mapper handles one > block of each sequence file, they won't handle more than mapper_input_rows. > The added RedistributeFlatHiveTableStep usually takes a small amount of time > compared to other steps, but the benefit it brings is remarkable. Here's what > performance improvement we saw: > || cube || FactDistinctColumn before || RedistributeFlatHiveTableStep || > FactDistinctColumn after|| > | case#1 | 51.78min | 8.40min | 13.06min | > | case#2 | 95.65min | 2.46min | 26.37min | > And since mapper_input_rows is a kylin configuration, user can override it > for each cube. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (KYLIN-1694) make multiply coefficient configurable when estimating cuboid size
[ https://issues.apache.org/jira/browse/KYLIN-1694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15284236#comment-15284236 ] Dayue Gao commented on KYLIN-1694: -- Hi Shaofeng, we cherry-picked that patch and verified the problem is not related to it. > make multiply coefficient configurable when estimating cuboid size > -- > > Key: KYLIN-1694 > URL: https://issues.apache.org/jira/browse/KYLIN-1694 > Project: Kylin > Issue Type: Bug > Components: Job Engine >Affects Versions: v1.5.0, v1.5.1 >Reporter: kangkaisen >Assignee: Dong Li > > In the current version of the MRv2 build engine, when CubeStatsReader estimates cuboid size, the current method is "cube is memory hungry, multiply the storage size estimation by 0.05" and "cube is not memory hungry, multiply the storage size estimation by 0.25". > This has one major problem: the default multiply coefficients are too small, which makes the estimated cuboid size much less than the actual cuboid size, so both the number of HBase regions and the number of CubeHFileJob reducers end up too small. Obviously, the current method makes the CubeHFileJob much slower. > After we removed the default multiply coefficient, the CubeHFileJob became much faster. > We'd better make the multiply coefficient configurable; this would be more user-friendly. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (KYLIN-1323) Improve performance of converting data to hfile
[ https://issues.apache.org/jira/browse/KYLIN-1323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15272072#comment-15272072 ] Dayue Gao commented on KYLIN-1323: -- Hi [~Shaofengshi], what's the progress of this on 1.5.x? > Improve performance of converting data to hfile > --- > > Key: KYLIN-1323 > URL: https://issues.apache.org/jira/browse/KYLIN-1323 > Project: Kylin > Issue Type: Improvement > Components: Job Engine >Affects Versions: v1.2 >Reporter: Yerui Sun >Assignee: Shaofeng SHI > Fix For: v1.4.0, v1.3.0 > > Attachments: KYLIN-1323-1.x-staging.2.patch, > KYLIN-1323-1.x-staging.patch, KYLIN-1323-2.x-staging.2.patch > > > Suppose we get 100GB of data after cuboid building, with a setting of 10GB per region. Currently, 10 split keys are calculated, 10 regions created, and 10 reducers used in the 'convert to hfile' step. > With this optimization, we could calculate 100 (or more) split keys and use all of them in the 'convert to hfile' step, but sample only 10 of them to create regions. The result is still 10 regions, but 100 reducers in the 'convert to hfile' step. Of course, 100 hfiles are created, and 10 files are loaded per region. That should be fine and doesn't affect query performance dramatically. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (KYLIN-1657) Add new configuration kylin.job.mapreduce.min.reducer.number
Dayue Gao created KYLIN-1657: Summary: Add new configuration kylin.job.mapreduce.min.reducer.number Key: KYLIN-1657 URL: https://issues.apache.org/jira/browse/KYLIN-1657 Project: Kylin Issue Type: Improvement Components: Job Engine Affects Versions: v1.5.1 Reporter: Dayue Gao Assignee: Dayue Gao Priority: Minor We have "kylin.job.mapreduce.max.reducer.number" to limit the maximum number of reducers for cubing jobs, but the minimum is hard-coded to 1. We should make it configurable as well; this could be helpful in some circumstances. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (KYLIN-1657) Add new configuration kylin.job.mapreduce.min.reducer.number
[ https://issues.apache.org/jira/browse/KYLIN-1657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dayue Gao updated KYLIN-1657: - Attachment: KYLIN-1657.patch Added a new kylin configuration "kylin.job.mapreduce.min.reducer.number" which defaults to "1". > Add new configuration kylin.job.mapreduce.min.reducer.number > > > Key: KYLIN-1657 > URL: https://issues.apache.org/jira/browse/KYLIN-1657 > Project: Kylin > Issue Type: Improvement > Components: Job Engine >Affects Versions: v1.5.1 >Reporter: Dayue Gao >Assignee: Dayue Gao >Priority: Minor > Attachments: KYLIN-1657.patch > > > We have "kylin.job.mapreduce.max.reducer.number" to limit the max number of > reducers for cubing job, but min reducer is hard coded to 1. We should make > it also configurable and this could be helpful in some circumstances. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (KYLIN-1656) Improve performance of MRv2 engine by making each mapper handles a configured number of records
Dayue Gao created KYLIN-1656: Summary: Improve performance of MRv2 engine by making each mapper handles a configured number of records Key: KYLIN-1656 URL: https://issues.apache.org/jira/browse/KYLIN-1656 Project: Kylin Issue Type: Improvement Components: Job Engine Affects Versions: v1.5.1, v1.5.0 Reporter: Dayue Gao Assignee: Dayue Gao In the current version of the MRv2 build engine, each mapper handles one block of the flat Hive table (stored as sequence files). This has two major problems: # It's difficult for the user to control the parallelism of mappers for each cube. The user can change "dfs.block.size" in kylin_hive_conf.xml, however it's a global configuration and cannot be overridden using "override_kylin_properties" introduced in [KYLIN-1534|https://issues.apache.org/jira/browse/KYLIN-1534]. # Mapper execution skew may occur due to a skewed distribution of record counts across blocks. This is a more severe problem since the FactDistinctColumn and InMemCubing steps of MRv2 are very CPU-intensive in the map task. To give you a sense of how bad it is, one of our cube's FactDistinctColumnStep takes ~100min in total with an average mapper time of only 11min. This is because several skewed map tasks handled 10x more records than the average map task. And the InMemCubing step failed because the skewed mapper tasks hit "mapred.task.timeout". To avoid skew, *we'd better make each mapper handle a configurable number of records instead of one sequence file block.* The way we achieved this is to add a `RedistributeFlatHiveTableStep` right after "FlatHiveTableStep". Here's what RedistributeFlatHiveTableStep does: 1. we run a {{select count(1) from intermediate_table}} to determine the `input_rowcount` of this build 2. we run an {{insert overwrite table intermediate_table select * from intermediate_table distribute by rand()}} to evenly distribute records to reducers.
The number of reducers is specified as "input_rowcount / mapper_input_rows", where `mapper_input_rows` is a new parameter that lets the user specify how many records each mapper should handle. Since each reducer writes its records into one file, we're guaranteed that after RedistributeFlatHiveTableStep, each sequence file of the FlatHiveTable contains around mapper_input_rows records. And since each follow-up job's mapper handles one block of a sequence file, it won't handle more than mapper_input_rows records. The added RedistributeFlatHiveTableStep usually takes a small amount of time compared to other steps, but the benefit it brings is remarkable. Here's the performance improvement we saw: || cube || FactDistinctColumn before || RedistributeFlatHiveTableStep || FactDistinctColumn after|| | case#1 | 51.78min | 8.40min | 13.06min | | case#2 | 95.65min | 2.46min | 26.37min | And since mapper_input_rows is a Kylin configuration, users can override it for each cube. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
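The reducer-count arithmetic above is simple enough to sketch in a few lines of Java. This is a hedged illustration, not Kylin's actual code: the method and parameter names are invented, and the clamping against the min/max reducer settings (per KYLIN-1657's "kylin.job.mapreduce.min.reducer.number" and the existing "kylin.job.mapreduce.max.reducer.number") is an assumption about how the pieces would fit together.

```java
public class RedistributeStepSketch {
    // reducers = ceil(input_rowcount / mapper_input_rows),
    // clamped to [minReducers, maxReducers] (illustrative clamping).
    static long reducerCount(long inputRowCount, long mapperInputRows,
                             long minReducers, long maxReducers) {
        long n = (inputRowCount + mapperInputRows - 1) / mapperInputRows;
        return Math.max(minReducers, Math.min(maxReducers, n));
    }

    // The two Hive statements the step runs, per the description above.
    static String countSql(String table) {
        return "select count(1) from " + table;
    }

    static String redistributeSql(String table) {
        return "insert overwrite table " + table
             + " select * from " + table + " distribute by rand()";
    }
}
```

For example, an input of 1,000,000 rows with mapper_input_rows = 500,000 yields 2 reducers, hence 2 output files of roughly 500,000 rows each.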
[jira] [Updated] (KYLIN-1656) Improve performance of MRv2 engine by making each mapper handles a configured number of records
[ https://issues.apache.org/jira/browse/KYLIN-1656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dayue Gao updated KYLIN-1656: - Attachment: KYLIN-1656.patch Please review the patch. This adds a new kylin configuration named "kylin.job.mapreduce.mapper.input.rows" and defaults to "50". > Improve performance of MRv2 engine by making each mapper handles a configured > number of records > --- > > Key: KYLIN-1656 > URL: https://issues.apache.org/jira/browse/KYLIN-1656 > Project: Kylin > Issue Type: Improvement > Components: Job Engine >Affects Versions: v1.5.0, v1.5.1 >Reporter: Dayue Gao >Assignee: Dayue Gao > Attachments: KYLIN-1656.patch > > > In the current version of MRv2 build engine, each mapper handles one block of > the flat hive table (stored in sequence file). This has two major problems: > # It's difficult for user to control the parallelism of mappers for each cube. > User can change "dfs.block.size" in kylin_hive_conf.xml, however it's a > global configuration and cannot be override using "override_kylin_properties" > introduced in [KYLIN-1534|https://issues.apache.org/jira/browse/KYLIN-1534]. > # May encounter mapper execution skew due to a skew distribution of each > block's records number. > This is a more severe problem since FactDistinctColumn and InMemCubing step > of MRv2 is very cpu intensive in map task. To give you a sense of how bad it > is, one of our cube's FactDistinctColumnStep takes ~100min in total with > average mapper time only 11min. This is because there exists several skewed > map tasks which handled 10x records than average map task. And the > InMemCubing steps failed because the skewed mapper tasks hit > "mapred.task.timeout". > To avoid skew to happen, *we'd better make each mapper handles a configurable > number of records instead of handles a sequence file block.* The way we > achieved this is to add a `RedistributeFlatHiveTableStep` right after > "FlatHiveTableStep". 
> Here's what RedistributeFlatHiveTableStep do: > 1. we run a {{select count(1) from intermediate_table}} to determine the > `input_rowcount` of this build > 2. we run a {{insert overwrite table intermediate_table select * from > intermediate_table distribute by rand()}} to evenly distribute records to > reducers. > The number of reducers is specified as "input_rowcount / mapper_input_rows" > where `mapper_input_rows` is a new parameter for user to specify how many > records each mapper should handle. Since each reducer will write out its > records into one file, we're guaranteed that after > RedistributeFlatHiveTableStep, each sequence file of FlatHiveTable contains > around mapper_input_rows. And since the followed up job's mapper handles one > block of each sequence file, they won't handle more than mapper_input_rows. > The added RedistributeFlatHiveTableStep usually takes a small amount of time > compared to other steps, but the benefit it brings is remarkable. Here's what > performance improvement we saw: > || cube || FactDistinctColumn before || RedistributeFlatHiveTableStep || > FactDistinctColumn after|| > | case#1 | 51.78min | 8.40min | 13.06min | > | case#2 | 95.65min | 2.46min | 26.37min | > And since mapper_input_rows is a kylin configuration, user can override it > for each cube. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (KYLIN-1662) tableName got truncated during request mapping for /tables/tableName
[ https://issues.apache.org/jira/browse/KYLIN-1662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dayue Gao updated KYLIN-1662: - Attachment: KYLIN-1662.patch Attached the patch. > tableName got truncated during request mapping for /tables/tableName > > > Key: KYLIN-1662 > URL: https://issues.apache.org/jira/browse/KYLIN-1662 > Project: Kylin > Issue Type: Bug > Components: REST Service >Affects Versions: v1.5.1 >Reporter: Dayue Gao >Assignee: Dayue Gao > Attachments: KYLIN-1662.patch > > > Requesting '/tables/default.kylin_sales' for table metadata returns an empty string. > This is because Spring by default treats ".kylin_sales" as a file extension, so the path variable {{tableName}} receives the value "default" rather than "default.kylin_sales". As a result, Kylin searches metadata for table "default.default". > An easy fix is to use "/\{tableName:.+\}" in the request mapping, as suggested in > http://stackoverflow.com/questions/16332092/spring-mvc-pathvariable-with-dot-is-getting-truncated -- This message was sent by Atlassian JIRA (v6.3.4#6332)
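The truncation can be illustrated with plain java.util.regex, outside of Spring. This is only a rough sketch: Spring's actual suffix-pattern matching is more involved, and the two patterns below are invented approximations of "default mapping" versus a "/{tableName:.+}" mapping.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PathVarSketch {
    // Roughly how the default mapping behaves: the variable stops
    // before a trailing ".xxx", which is treated as a file extension.
    static String defaultMatch(String path) {
        Matcher m = Pattern.compile("/tables/([^/]+?)(?:\\.\\w+)?$").matcher(path);
        return m.matches() ? m.group(1) : null;
    }

    // With "/{tableName:.+}" the variable's own pattern is ".+",
    // so the dot and everything after it are kept.
    static String greedyMatch(String path) {
        Matcher m = Pattern.compile("/tables/(.+)$").matcher(path);
        return m.matches() ? m.group(1) : null;
    }
}
```

With "/tables/default.kylin_sales", the first pattern captures only "default" while the second captures the full "default.kylin_sales", mirroring the bug and the fix.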
[jira] [Commented] (KYLIN-1656) Improve performance of MRv2 engine by making each mapper handles a configured number of records
[ https://issues.apache.org/jira/browse/KYLIN-1656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15276344#comment-15276344 ] Dayue Gao commented on KYLIN-1656: -- Didn't get the time to create the branch, thank you Shaofeng! > Improve performance of MRv2 engine by making each mapper handles a configured > number of records > --- > > Key: KYLIN-1656 > URL: https://issues.apache.org/jira/browse/KYLIN-1656 > Project: Kylin > Issue Type: Improvement > Components: Job Engine >Affects Versions: v1.5.0, v1.5.1 >Reporter: Dayue Gao >Assignee: Dayue Gao > Attachments: KYLIN-1656.patch > > > In the current version of MRv2 build engine, each mapper handles one block of > the flat hive table (stored in sequence file). This has two major problems: > # It's difficult for user to control the parallelism of mappers for each cube. > User can change "dfs.block.size" in kylin_hive_conf.xml, however it's a > global configuration and cannot be override using "override_kylin_properties" > introduced in [KYLIN-1534|https://issues.apache.org/jira/browse/KYLIN-1534]. > # May encounter mapper execution skew due to a skew distribution of each > block's records number. > This is a more severe problem since FactDistinctColumn and InMemCubing step > of MRv2 is very cpu intensive in map task. To give you a sense of how bad it > is, one of our cube's FactDistinctColumnStep takes ~100min in total with > average mapper time only 11min. This is because there exists several skewed > map tasks which handled 10x records than average map task. And the > InMemCubing steps failed because the skewed mapper tasks hit > "mapred.task.timeout". > To avoid skew to happen, *we'd better make each mapper handles a configurable > number of records instead of handles a sequence file block.* The way we > achieved this is to add a `RedistributeFlatHiveTableStep` right after > "FlatHiveTableStep". > Here's what RedistributeFlatHiveTableStep do: > 1. 
we run a {{select count(1) from intermediate_table}} to determine the > `input_rowcount` of this build > 2. we run a {{insert overwrite table intermediate_table select * from > intermediate_table distribute by rand()}} to evenly distribute records to > reducers. > The number of reducers is specified as "input_rowcount / mapper_input_rows" > where `mapper_input_rows` is a new parameter for user to specify how many > records each mapper should handle. Since each reducer will write out its > records into one file, we're guaranteed that after > RedistributeFlatHiveTableStep, each sequence file of FlatHiveTable contains > around mapper_input_rows. And since the followed up job's mapper handles one > block of each sequence file, they won't handle more than mapper_input_rows. > The added RedistributeFlatHiveTableStep usually takes a small amount of time > compared to other steps, but the benefit it brings is remarkable. Here's what > performance improvement we saw: > || cube || FactDistinctColumn before || RedistributeFlatHiveTableStep || > FactDistinctColumn after|| > | case#1 | 51.78min | 8.40min | 13.06min | > | case#2 | 95.65min | 2.46min | 26.37min | > And since mapper_input_rows is a kylin configuration, user can override it > for each cube. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (KYLIN-1898) Upgrade to Avatica 1.8 or higher
[ https://issues.apache.org/jira/browse/KYLIN-1898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15385511#comment-15385511 ] Dayue Gao commented on KYLIN-1898: -- I've relocated kylin-jdbc dependencies in KYLIN-1846, hopefully it will solve the problem mentioned. > Upgrade to Avatica 1.8 or higher > > > Key: KYLIN-1898 > URL: https://issues.apache.org/jira/browse/KYLIN-1898 > Project: Kylin > Issue Type: Bug >Reporter: Julian Hyde > Attachments: KYLIN-1898.patch > > > A [stackoverflow > question|http://stackoverflow.com/questions/38369871/how-to-install-two-different-version-of-a-specific-package-in-maven] > reports problems when mixing Avatica 1.6 (used by Kylin) and Avatica 1.8 > (used by some unspecified other database). It appears that 1.6 and 1.8 are > not compatible, probably due to CALCITE-836 or CALCITE-1213. The solution is > for Kylin to upgrade to 1.8 or higher. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (KYLIN-1849) add basic search capability at model UI
Dayue Gao created KYLIN-1849: Summary: add basic search capability at model UI Key: KYLIN-1849 URL: https://issues.apache.org/jira/browse/KYLIN-1849 Project: Kylin Issue Type: New Feature Components: Web Affects Versions: v1.5.2 Reporter: Dayue Gao Assignee: Zhong,Jason To make it easier to work with dozens of cubes, could we add a search box on the "Model" page, just like the one on the "Monitor" page? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (KYLIN-1848) Can't sort cubes by any field in Web UI
Dayue Gao created KYLIN-1848: Summary: Can't sort cubes by any field in Web UI Key: KYLIN-1848 URL: https://issues.apache.org/jira/browse/KYLIN-1848 Project: Kylin Issue Type: Bug Components: Web Affects Versions: v1.5.2 Reporter: Dayue Gao Assignee: Zhong,Jason In a project containing dozens of cubes, it's helpful to sort cubes by fields like "Create Time", "Status", and so on. I tried it today but found it doesn't work; could we fix it? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (KYLIN-1846) minimize dependencies of JDBC driver
[ https://issues.apache.org/jira/browse/KYLIN-1846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dayue Gao updated KYLIN-1846: - Attachment: KYLIN-1846.patch upload patch, which was tested and solved several classloading problems encountered in our environment. > minimize dependencies of JDBC driver > > > Key: KYLIN-1846 > URL: https://issues.apache.org/jira/browse/KYLIN-1846 > Project: Kylin > Issue Type: Improvement > Components: Driver - JDBC >Affects Versions: v1.5.2 >Reporter: Dayue Gao >Assignee: Dayue Gao > Attachments: KYLIN-1846.patch > > > kylin-jdbc packages many dependencies (calcite-core, guava, jackson, etc) > into an uber jar, which could cause problems when user tries to integrate > kylin-jdbc into their own application. > I suggest making the following changes to packaging: > # remove calcite-core dependency > calcite-avatica is sufficient as far as I know. > # remove guava dependency > The only place kylin-jdbc uses guava is {{ImmutableList.of(metaResultSet)}} > in KylinMeta.java, which can be simply replaced with > {{Collections.singletonList(metaResultSet)}}. > # remove log4j, slf4j-log4j12 dependencies > As a library, kylin-jdbc [should only depend on > slf4j-api|http://slf4j.org/manual.html#libraries]. Which underlying logging > framework to use should be a deployment-time choice made by user. This means > we should revert https://issues.apache.org/jira/browse/KYLIN-1160 > # relocate all dependencies to "org.apache.kylin.jdbc.shaded" using > maven-shade-plugin > This includes calcite-avatica, jackson, commons-httpclient and commons-codec. > Relocating should help to avoid class version conflicts. > I'll submit a patch for this, discussions are welcome~ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (KYLIN-1846) minimize dependencies of JDBC driver
[ https://issues.apache.org/jira/browse/KYLIN-1846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15370115#comment-15370115 ] Dayue Gao commented on KYLIN-1846: -- [~Shaofengshi] I committed 9c578e7 to shade httpcomponents. > minimize dependencies of JDBC driver > > > Key: KYLIN-1846 > URL: https://issues.apache.org/jira/browse/KYLIN-1846 > Project: Kylin > Issue Type: Improvement > Components: Driver - JDBC >Affects Versions: v1.5.2 >Reporter: Dayue Gao >Assignee: Dayue Gao > Attachments: KYLIN-1846.patch > > > kylin-jdbc packages many dependencies (calcite-core, guava, jackson, etc) > into an uber jar, which could cause problems when user tries to integrate > kylin-jdbc into their own application. > I suggest making the following changes to packaging: > # remove calcite-core dependency > calcite-avatica is sufficient as far as I know. > # remove guava dependency > The only place kylin-jdbc uses guava is {{ImmutableList.of(metaResultSet)}} > in KylinMeta.java, which can be simply replaced with > {{Collections.singletonList(metaResultSet)}}. > # remove log4j, slf4j-log4j12 dependencies > As a library, kylin-jdbc [should only depend on > slf4j-api|http://slf4j.org/manual.html#libraries]. Which underlying logging > framework to use should be a deployment-time choice made by user. This means > we should revert https://issues.apache.org/jira/browse/KYLIN-1160 > # relocate all dependencies to "org.apache.kylin.jdbc.shaded" using > maven-shade-plugin > This includes calcite-avatica, jackson, commons-httpclient and commons-codec. > Relocating should help to avoid class version conflicts. > I'll submit a patch for this, discussions are welcome~ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (KYLIN-1846) minimize dependencies of JDBC driver
Dayue Gao created KYLIN-1846: Summary: minimize dependencies of JDBC driver Key: KYLIN-1846 URL: https://issues.apache.org/jira/browse/KYLIN-1846 Project: Kylin Issue Type: Improvement Components: Driver - JDBC Affects Versions: v1.5.2 Reporter: Dayue Gao Assignee: Dayue Gao kylin-jdbc packages many dependencies (calcite-core, guava, jackson, etc) into an uber jar, which could cause problems when users try to integrate kylin-jdbc into their own applications. I suggest making the following changes to packaging: # remove calcite-core dependency calcite-avatica is sufficient as far as I know. # remove guava dependency The only place kylin-jdbc uses guava is {{ImmutableList.of(metaResultSet)}} in KylinMeta.java, which can be simply replaced with {{Collections.singletonList(metaResultSet)}}. # remove log4j, slf4j-log4j12 dependencies As a library, kylin-jdbc [should only depend on slf4j-api|http://slf4j.org/manual.html#libraries]. Which underlying logging framework to use should be a deployment-time choice made by the user. This means we should revert https://issues.apache.org/jira/browse/KYLIN-1160 # relocate all dependencies to "org.apache.kylin.jdbc.shaded" using maven-shade-plugin This includes calcite-avatica, jackson, commons-httpclient and commons-codec. Relocating should help avoid class version conflicts. I'll submit a patch for this, discussions are welcome~ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
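The Guava removal in item 2 rests on Collections.singletonList being a drop-in replacement for ImmutableList.of with one element: both return an immutable single-element list. A minimal JDK-only sketch (the wrap helper and its parameter name are illustrative, not KylinMeta's actual code):

```java
import java.util.Collections;
import java.util.List;

public class SingletonListSketch {
    // JDK equivalent of Guava's ImmutableList.of(metaResultSet):
    // a one-element list that rejects mutation, with no extra dependency.
    static <T> List<T> wrap(T metaResultSet) {
        return Collections.singletonList(metaResultSet);
    }
}
```

Calling add() or remove() on the returned list throws UnsupportedOperationException, matching the immutability the Guava version provided.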
[jira] [Created] (KYLIN-2437) collect number of bytes scanned to query metrics
Dayue Gao created KYLIN-2437: Summary: collect number of bytes scanned to query metrics Key: KYLIN-2437 URL: https://issues.apache.org/jira/browse/KYLIN-2437 Project: Kylin Issue Type: Improvement Components: Storage - HBase Affects Versions: v1.6.0 Reporter: Dayue Gao Assignee: Dayue Gao Besides the scanned row count, it's useful to know how many bytes are scanned from HBase to fulfill a query. It is perhaps a better indicator than row count of how much pressure a query puts on HBase. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (KYLIN-2436) add a configuration knob to disable spilling of aggregation cache
Dayue Gao created KYLIN-2436: Summary: add a configuration knob to disable spilling of aggregation cache Key: KYLIN-2436 URL: https://issues.apache.org/jira/browse/KYLIN-2436 Project: Kylin Issue Type: Improvement Components: Storage - HBase Affects Versions: v1.6.0 Reporter: Dayue Gao Assignee: Dayue Gao Kylin's aggregation operator can spill intermediate results to disk when its estimated memory usage exceeds some threshold (kylin.query.coprocessor.mem.gb, to be specific). While it's a useful feature in general to prevent RegionServers from OOM, there are times when aborting such a memory-hungry query immediately is a more suitable choice for users. To accommodate this requirement, I suggest adding a new configuration named "kylin.storage.hbase.coprocessor-spill-enabled". The default value would be true, which keeps the same behavior as before. If changed to false, a query that uses more aggregation memory than the threshold will fail immediately. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
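The proposed decision logic can be sketched in a few lines. This is a hypothetical illustration of the behavior described above, not Kylin's coprocessor code; the method and its return values are invented for clarity.

```java
public class SpillPolicySketch {
    // When the estimated aggregation-cache size is within budget, aggregate
    // in memory. Over budget: spill to disk if spilling is enabled
    // (the pre-KYLIN-2436 behavior), otherwise abort the query immediately.
    static String decide(long estimatedBytes, long budgetBytes, boolean spillEnabled) {
        if (estimatedBytes <= budgetBytes) {
            return "in-memory";
        }
        return spillEnabled ? "spill" : "abort";
    }
}
```

With the knob left at its default of true nothing changes; flipping it to false turns the "spill" branch into a fast failure.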
[jira] [Resolved] (KYLIN-2058) Make Kylin more resilient to bad queries
[ https://issues.apache.org/jira/browse/KYLIN-2058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dayue Gao resolved KYLIN-2058. -- Resolution: Fixed Fix Version/s: v1.6.0 There is still work to be done to defend Kylin against bad queries. However, since 1.6.0 has been released, I'll file new JIRAs to continue. > Make Kylin more resilient to bad queries > > > Key: KYLIN-2058 > URL: https://issues.apache.org/jira/browse/KYLIN-2058 > Project: Kylin > Issue Type: Improvement > Components: Query Engine, Storage - HBase >Affects Versions: v1.6.0 >Reporter: Dayue Gao >Assignee: Dayue Gao > Fix For: v1.6.0 > > > Bad/big queries are a huge threat to the overall performance and stability of > Kylin. We occasionally saw some of these queries either causing heavy GC > activity or crashing RegionServers. I'd like to start a series of work to > make Kylin more resilient to bad queries. > This is an umbrella JIRA for related work. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Closed] (KYLIN-1455) HBase ScanMetrics are not properly logged in query log
[ https://issues.apache.org/jira/browse/KYLIN-1455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dayue Gao closed KYLIN-1455. Resolution: Won't Fix > HBase ScanMetrics are not properly logged in query log > -- > > Key: KYLIN-1455 > URL: https://issues.apache.org/jira/browse/KYLIN-1455 > Project: Kylin > Issue Type: Improvement > Components: Storage - HBase >Affects Versions: v1.2 >Reporter: Dayue Gao >Assignee: Dayue Gao >Priority: Minor > Attachments: KYLIN-1455-1.x-staging.patch > > > HBase's ScanMetrics provide users valuable information when troubleshooting > query performance issues. But I found it was not properly logged, sometimes > missing from the log, sometimes duplicated. > Below is an example of duplicated scan log, this is due to > {{CubeSegmentTupleIterator#closeScanner()}} method is invoked two times, > first in hasNext(), second in close(). > {noformat} > [http-bio-8080-exec-8]:[2016-02-26 > 17:31:50,227][DEBUG][org.apache.kylin.storage.hbase.CubeSegmentTupleIterator.closeScanner(CubeSegmentTupleIterator.java:146)] > - Scan > {"loadColumnFamiliesOnDemand":null,"filter":"FuzzyRowFilter{fuzzyKeysData={\\x00\\x00\\x00\\x00\\x00\\x00\\x09\\xC7\\x00\\x00\\x00\\x17\\x00\\x00\\x17\\x00:\\xFF\\xFF\\xFF\\xFF\\xFF\\xFF\\xFF\\xFF\\x00\\x00\\x00\\xFF\\x00\\x00\\xFF\\x00}}, > > ","startRow":"\\x00\\x00\\x00\\x00\\x00\\x00\\x09\\xC7\\x00\\x00\\x00\\x17\\x00\\x00\\x17\\x00","stopRow":"\\x00\\x00\\x00\\x00\\x00\\x00\\x09\\xC7\\x08\\x8F\\xFF\\x17\\xFF\\xFF\\x17\\xFF\\x00","batch":-1,"cacheBlocks":true,"totalColumns":1,"maxResultSize":5242880,"families":{"F1":["M"]},"caching":1024,"maxVersions":1,"timeRange":[0,9223372036854775807]} > [http-bio-8080-exec-8]:[2016-02-26 > 17:31:50,229][DEBUG][org.apache.kylin.storage.hbase.CubeSegmentTupleIterator.closeScanner(CubeSegmentTupleIterator.java:150)] > - HBase Metrics: count=17357, ms=3194, bytes=905594, remote_bytes=905594, > regions=1, not_serving_region=0, rpc=19, rpc_retries=0, remote_rpc=19, > 
remote_rpc_retries=0 > [http-bio-8080-exec-8]:[2016-02-26 > 17:32:58,016][DEBUG][org.apache.kylin.storage.hbase.CubeSegmentTupleIterator.closeScanner(CubeSegmentTupleIterator.java:146)] > - Scan > {"loadColumnFamiliesOnDemand":null,"filter":"FuzzyRowFilter{fuzzyKeysData={\\x00\\x00\\x00\\x00\\x00\\x00\\x09\\xC7\\x00\\x00\\x00\\x17\\x00\\x00\\x17\\x00:\\xFF\\xFF\\xFF\\xFF\\xFF\\xFF\\xFF\\xFF\\x00\\x00\\x00\\xFF\\x00\\x00\\xFF\\x00}}, > > ","startRow":"\\x00\\x00\\x00\\x00\\x00\\x00\\x09\\xC7\\x00\\x00\\x00\\x17\\x00\\x00\\x17\\x00","stopRow":"\\x00\\x00\\x00\\x00\\x00\\x00\\x09\\xC7\\x08\\x8F\\xFF\\x17\\xFF\\xFF\\x17\\xFF\\x00","batch":-1,"cacheBlocks":true,"totalColumns":1,"maxResultSize":5242880,"families":{"F1":["M"]},"caching":1024,"maxVersions":1,"timeRange":[0,9223372036854775807]} > [http-bio-8080-exec-8]:[2016-02-26 > 17:33:04,443][DEBUG][org.apache.kylin.storage.hbase.CubeSegmentTupleIterator.closeScanner(CubeSegmentTupleIterator.java:150)] > - HBase Metrics: count=17357, ms=3194, bytes=905594, remote_bytes=905594, > regions=1, not_serving_region=0, rpc=19, rpc_retries=0, remote_rpc=19, > remote_rpc_retries=0 > {noformat} > And sometimes ScanMetrics is missing from the log, showed below. I think this > is due to {{CubeSegmentTupleIterator#closeScanner()}} trying to get > ScanMetrics before close the current ResultScanner. After looking into HBase > client source, I found that ScanMetrics will not be written out until the > scanner is closed or exhausted (no cache entries). So it'd be better to get > ScanMetrics after closing the scanner. 
> {noformat} > [http-bio-8080-exec-2]:[2016-02-26 > 17:18:43,928][DEBUG][org.apache.kylin.storage.hbase.CubeSegmentTupleIterator.closeScanner(CubeSegmentTupleIterator.java:146)] > - Scan > {"loadColumnFamiliesOnDemand":null,"filter":"FuzzyRowFilter{fuzzyKeysData={\\x00\\x00\\x00\\x00\\x00\\x00\\x09\\xC7\\x00\\x00\\x00\\x17\\x00\\x00\\x17\\x00:\\xFF\\xFF\\xFF\\xFF\\xFF\\xFF\\xFF\\xFF\\x00\\x00\\x00\\xFF\\x00\\x00\\xFF\\x00}}, > > ","startRow":"\\x00\\x00\\x00\\x00\\x00\\x00\\x09\\xC7\\x01\\x1C\\x04\\x0Cx\\x03\\x08Y","stopRow":"\\x00\\x00\\x00\\x00\\x00\\x00\\x09\\xC7\\x08\\x8F\\xFF\\x17\\xFF\\xFF\\x17\\xFF\\x00","batch":-1,"cacheBlocks":true,"totalColumns":1,"maxResultSize":5242880,"families":{"F1":["M"]},"caching":1024,"maxVersions":1,"timeRange":[0,9223372036854775807]} > [http-bio-8080-exec-2]:[2016-02-26 > 17:19:38,228][INFO][org.apache.kylin.rest.service.QueryService.logQuery(QueryService.java:242)] > - > ==[QUERY]=== > {noformat} > This should be easy to fix, I will submit a patch for this. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
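The ordering fix suggested in KYLIN-1455 above (read ScanMetrics only after the scanner is closed or exhausted) can be sketched with a hypothetical stand-in for the HBase scanner; FakeScanner and its method names are illustrative, not the real HBase client API:

```java
// Sketch of the get-metrics-after-close ordering. Mimics the HBase client
// behavior described above: scan metrics are not published until the
// scanner is closed or exhausted, so reading them before close() yields nothing.
class FakeScanner implements AutoCloseable {
    private boolean closed = false;
    private final long rows;

    FakeScanner(long rows) { this.rows = rows; }

    @Override
    public void close() { closed = true; }

    // Metrics are unavailable (null) until the scanner is closed.
    Long scanMetricsRowCount() { return closed ? rows : null; }
}

public class MetricsOrderSketch {
    public static void main(String[] args) {
        FakeScanner scanner = new FakeScanner(17357);
        System.out.println(scanner.scanMetricsRowCount()); // null: too early
        scanner.close();
        System.out.println(scanner.scanMetricsRowCount()); // 17357: after close
    }
}
```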
[jira] [Created] (KYLIN-2438) replace scan threshold with max scan bytes
Dayue Gao created KYLIN-2438: Summary: replace scan threshold with max scan bytes Key: KYLIN-2438 URL: https://issues.apache.org/jira/browse/KYLIN-2438 Project: Kylin Issue Type: Improvement Components: Query Engine, Storage - HBase Affects Versions: v1.6.0 Reporter: Dayue Gao Assignee: Dayue Gao In order to guard against bad queries that can consume too much memory and then crash the kylin / hbase server, kylin limits the maximum number of rows a query can scan. The maximum value is determined by two configs # *kylin.query.scan.threshold* is used if the query doesn't contain memory-hungry metrics # otherwise, *kylin.query.mem.budget* / estimated_row_size is used as the maximum per region. This approach however has several deficiencies: * It doesn't work very well with complex, variable length metrics. The estimated threshold could be either too small or too large. If it's too small, good queries are killed. If it's too large, bad queries are not banned. * Row count doesn't correspond to memory consumption, so it's difficult to determine how large the scan threshold should be. * kylin.query.scan.threshold can't be overridden at cube level. In this JIRA, I propose to replace the current row count based threshold with a more intuitive size based threshold * KYLIN-2437 will collect the number of bytes scanned at both region and query level * A new configuration *kylin.query.max-scan-bytes* will be added to limit the maximum number of bytes a query can scan in total * *kylin.query.mem.budget* will be renamed to *kylin.storage.hbase.coprocessor-max-scan-bytes*, which applies the limit at region level * the old *kylin.query.scan.threshold* will be deprecated -- This message was sent by Atlassian JIRA (v6.3.15#6346)
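The contrast drawn in KYLIN-2438 can be sketched as follows; class and method names are illustrative, not Kylin's actual code, and the numbers are made up for the example:

```java
// Hypothetical sketch of the old row-count limit versus the proposed
// size-based limit. The config names in comments are the real Kylin
// properties discussed above; the code itself is illustrative only.
public class ScanLimitSketch {

    // Old behavior: derive a row-count cap, possibly from an estimated row size.
    static long oldRowThreshold(boolean memoryHungry, long scanThreshold,
                                long memBudgetBytes, long estimatedRowSize) {
        if (!memoryHungry) {
            return scanThreshold;                 // kylin.query.scan.threshold
        }
        // kylin.query.mem.budget / estimated_row_size: a wrong estimate
        // makes this cap swing too small or too large.
        return memBudgetBytes / estimatedRowSize;
    }

    // Proposed behavior: compare actual bytes scanned against a byte cap,
    // with no row-size estimation involved.
    static boolean exceedsMaxScanBytes(long bytesScanned, long maxScanBytes) {
        return bytesScanned > maxScanBytes;       // kylin.query.max-scan-bytes
    }

    public static void main(String[] args) {
        // 3 GB budget / 1 KB estimated rows -> ~3.1M row cap
        System.out.println(oldRowThreshold(true, 1_000_000L, 3L << 30, 1024));
        // 5 GB actually scanned against a 3 GB byte cap -> query is rejected
        System.out.println(exceedsMaxScanBytes(5L << 30, 3L << 30));
    }
}
```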
[jira] [Updated] (KYLIN-2438) replace scan threshold with max scan bytes
[ https://issues.apache.org/jira/browse/KYLIN-2438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dayue Gao updated KYLIN-2438: - Description: In order to guard against bad queries that can consume lots of memory and potentially crash the kylin / hbase server, kylin limits the maximum number of rows a query can scan. The maximum value is chosen based on two configs # *kylin.query.scan.threshold* is used if the query doesn't contain memory-hungry metrics # *kylin.query.mem.budget* / estimated_row_size is used otherwise as the per region maximum. This approach however has several deficiencies: * It doesn't work very well with complex, variable-length metrics. The estimated threshold could be either too small or too large. If it's too small, good queries are killed. If it's too large, bad queries are not banned. * Row count doesn't correspond to memory consumption, so it's difficult to determine how large the scan threshold should be. * kylin.query.scan.threshold can't be overridden at cube level. In this JIRA, I propose to replace the current row count based threshold with a more intuitive size based threshold * KYLIN-2437 will collect the number of bytes scanned at both region and query level * A new configuration *kylin.query.max-scan-bytes* will be added to limit the maximum number of bytes a query can scan * *kylin.query.mem.budget* will be renamed to *kylin.storage.hbase.coprocessor-max-scan-bytes*, which limits at region level. No need to rely on estimations about row size any more. * The above two configs can be overridden at cube level * the old *kylin.query.scan.threshold* will be deprecated was: In order to guard against bad queries that can consume lots of memory and potentially crash the kylin / hbase server, kylin limits the maximum number of rows a query can scan.
The maximum value is chosen based on two configs # *kylin.query.scan.threshold* is used if the query doesn't contain memory-hungry metrics # *kylin.query.mem.budget* / estimated_row_size is used otherwise as the per region maximum. This approach however has several deficiencies: * It doesn't work very well with complex, variable-length metrics. The estimated threshold could be either too small or too large. If it's too small, good queries are killed. If it's too large, bad queries are not banned. * Row count doesn't correspond to memory consumption, so it's difficult to determine how large the scan threshold should be. * kylin.query.scan.threshold can't be overridden at cube level. In this JIRA, I propose to replace the current row count based threshold with a more intuitive size based threshold * KYLIN-2437 will collect the number of bytes scanned at both region and query level * A new configuration *kylin.query.max-scan-bytes* will be added to limit the maximum number of bytes a query can scan * *kylin.query.mem.budget* will be renamed to *kylin.storage.hbase.coprocessor-max-scan-bytes*, which limits at region level. We don't need to rely on estimations about row size any more. * The above two configs can be overridden at cube level * the old *kylin.query.scan.threshold* will be deprecated > replace scan threshold with max scan bytes > -- > > Key: KYLIN-2438 > URL: https://issues.apache.org/jira/browse/KYLIN-2438 > Project: Kylin > Issue Type: Improvement > Components: Query Engine, Storage - HBase >Affects Versions: v1.6.0 >Reporter: Dayue Gao >Assignee: Dayue Gao > > In order to guard against bad queries that can consume lots of memory and > potentially crash the kylin / hbase server, kylin limits the maximum number of > rows a query can scan.
The maximum value is chosen based on two configs > # *kylin.query.scan.threshold* is used if the query doesn't contain > memory-hungry metrics > # *kylin.query.mem.budget* / estimated_row_size is used otherwise as the per > region maximum. > This approach however has several deficiencies: > * It doesn't work very well with complex, variable-length metrics. The estimated > threshold could be either too small or too large. If it's too small, good > queries are killed. If it's too large, bad queries are not banned. > * Row count doesn't correspond to memory consumption, so it's difficult to > determine how large the scan threshold should be. > * kylin.query.scan.threshold can't be overridden at cube level. > In this JIRA, I propose to replace the current row count based threshold with > a more intuitive size based threshold > * KYLIN-2437 will collect the number of bytes scanned at both region and > query level > * A new configuration *kylin.query.max-scan-bytes* will be added to limit > the maximum number of bytes a query can scan > * *kylin.query.mem.budget* will be renamed to > *kylin.storage.hbase.coprocessor-max-scan-bytes*, which limits at region > level. No need to rely on estimations about row size any more. > * The above two configs can be overridden at cube level > * the old *kylin.query.scan.threshold* will be deprecated
[jira] [Resolved] (KYLIN-2412) Unclosed DataOutputStream in RoaringBitmapCounter#write()
[ https://issues.apache.org/jira/browse/KYLIN-2412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dayue Gao resolved KYLIN-2412. -- Resolution: Fixed commit https://github.com/apache/kylin/commit/d264339b1c16c195ffafc2217b793d81bdbd6434 > Unclosed DataOutputStream in RoaringBitmapCounter#write() > - > > Key: KYLIN-2412 > URL: https://issues.apache.org/jira/browse/KYLIN-2412 > Project: Kylin > Issue Type: Bug >Reporter: Ted Yu >Assignee: Dayue Gao >Priority: Minor > > {code} > bitmap.serialize(new DataOutputStream(new > ByteBufferOutputStream(out))); > {code} > Upon return from the method, DataOutputStream should be closed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
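The issue above is the classic unclosed-stream pattern, and the usual fix is try-with-resources. A minimal sketch, with java.io stand-ins: ByteArrayOutputStream takes the place of Kylin's ByteBufferOutputStream, and the writeInt loop stands in for the real bitmap.serialize(dos) call:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Illustrative fix for KYLIN-2412: wrap the DataOutputStream in
// try-with-resources so it is flushed and closed on every exit path.
public class WriteSketch {
    static byte[] write(int[] values) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (DataOutputStream dos = new DataOutputStream(out)) {
            for (int v : values) {
                dos.writeInt(v);   // stands in for bitmap.serialize(dos)
            }
        } catch (IOException e) {
            // Cannot happen for an in-memory stream, but the API declares it.
            throw new UncheckedIOException(e);
        }                          // dos is closed here, even if serialize throws
        return out.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(write(new int[]{1, 2, 3}).length); // 3 ints x 4 bytes = 12
    }
}
```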
[jira] [Commented] (KYLIN-2457) Should copy the latest dictionaries on dimension tables in a batch merge job
[ https://issues.apache.org/jira/browse/KYLIN-2457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15878043#comment-15878043 ] Dayue Gao commented on KYLIN-2457: -- +1. Hi [~zhengd], it would be better if you also updated the comments of {{makeDictForNewSegment}} and {{makeSnapshotForNewSegment}}. > Should copy the latest dictionaries on dimension tables in a batch merge job > > > Key: KYLIN-2457 > URL: https://issues.apache.org/jira/browse/KYLIN-2457 > Project: Kylin > Issue Type: Bug >Reporter: zhengdong >Priority: Critical > Attachments: > KYLIN-2457-Should-copy-the-latest-dictionaries-on-di.patch > > > In a batch merge job, we need to create dictionaries for all dimensions for > the new segment. For those dictionaries on dimension tables, we currently just > copy them from the earliest segment of the merging segments. > However, we should select the newest dictionary for the new segment, since > incremental dimension tables are allowed. The older dictionary can't find > the records corresponding to new keys added to a dimension table, which leads > to wrong query results. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (KYLIN-2451) Set HBASE_RPC_TIMEOUT according to kylin.storage.hbase.coprocessor-timeout-seconds
[ https://issues.apache.org/jira/browse/KYLIN-2451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15874279#comment-15874279 ] Dayue Gao commented on KYLIN-2451: -- LGTM > Set HBASE_RPC_TIMEOUT according to > kylin.storage.hbase.coprocessor-timeout-seconds > -- > > Key: KYLIN-2451 > URL: https://issues.apache.org/jira/browse/KYLIN-2451 > Project: Kylin > Issue Type: Improvement >Reporter: liyang >Assignee: liyang > Fix For: v2.0.0 > > > Currently if HBASE_RPC_TIMEOUT is shorter than > "kylin.storage.hbase.coprocessor-timeout-seconds", the HBase RPC call will > timeout before coprocessor gives up. Shall let RPC wait longer. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (KYLIN-2451) Set HBASE_RPC_TIMEOUT according to kylin.storage.hbase.coprocessor-timeout-seconds
[ https://issues.apache.org/jira/browse/KYLIN-2451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15871127#comment-15871127 ] Dayue Gao commented on KYLIN-2451: -- Hi [~liyang.g...@gmail.com], how is this possible? Haven't we already upper bounded coprocessor timeout to 0.9 x HBASE_RPC_TIMEOUT? Please take a look at CubeHBaseRPC.getCoprocessorTimeoutMillis. > Set HBASE_RPC_TIMEOUT according to > kylin.storage.hbase.coprocessor-timeout-seconds > -- > > Key: KYLIN-2451 > URL: https://issues.apache.org/jira/browse/KYLIN-2451 > Project: Kylin > Issue Type: Improvement >Reporter: liyang >Assignee: liyang > Fix For: v2.0.0 > > > Currently if HBASE_RPC_TIMEOUT is shorter than > "kylin.storage.hbase.coprocessor-timeout-seconds", the HBase RPC call will > timeout before coprocessor gives up. Shall let RPC wait longer. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (KYLIN-2451) Set HBASE_RPC_TIMEOUT according to kylin.storage.hbase.coprocessor-timeout-seconds
[ https://issues.apache.org/jira/browse/KYLIN-2451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15874211#comment-15874211 ] Dayue Gao commented on KYLIN-2451: -- Hi [~liyang.g...@gmail.com], I see the difference now. Our goal is the same: make the rpc timeout longer than coprocessor-timeout-seconds. The difference is that previously we overrode coprocessor-timeout-seconds according to HBASE_RPC_TIMEOUT (please see the comment about coprocessor-timeout-seconds in kylin.properties), and now you want to do the opposite: set HBASE_RPC_TIMEOUT according to coprocessor-timeout-seconds, right? But I would prefer the previous approach, because coprocessor-timeout-seconds is a cube level config while HBASE_RPC_TIMEOUT is a global one. With your approach, users can't choose a larger value at cube level. > Set HBASE_RPC_TIMEOUT according to > kylin.storage.hbase.coprocessor-timeout-seconds > -- > > Key: KYLIN-2451 > URL: https://issues.apache.org/jira/browse/KYLIN-2451 > Project: Kylin > Issue Type: Improvement >Reporter: liyang >Assignee: liyang > Fix For: v2.0.0 > > > Currently if HBASE_RPC_TIMEOUT is shorter than > "kylin.storage.hbase.coprocessor-timeout-seconds", the HBase RPC call will > timeout before coprocessor gives up. Shall let RPC wait longer. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
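The clamping behavior referenced in this thread (coprocessor timeout upper-bounded to 0.9 x HBASE_RPC_TIMEOUT) can be sketched as below; this is a simplified stand-in for CubeHBaseRPC.getCoprocessorTimeoutMillis, not its actual code, and the 60s/90s values are made up:

```java
// Sketch of the existing approach discussed above: the coprocessor deadline
// is clamped below the HBase RPC timeout so the coprocessor always gives up
// before the RPC layer times out.
public class TimeoutSketch {
    static long coprocessorTimeoutMillis(long hbaseRpcTimeoutMillis,
                                         long configuredMillis) {
        // Leave a 10% margin below HBASE_RPC_TIMEOUT.
        long upperBound = (long) (hbaseRpcTimeoutMillis * 0.9);
        return Math.min(configuredMillis, upperBound);
    }

    public static void main(String[] args) {
        // A 90s cube-level setting is clamped to 54s under a 60s HBASE_RPC_TIMEOUT...
        System.out.println(coprocessorTimeoutMillis(60_000, 90_000)); // 54000
        // ...while a 30s setting fits within the bound and is used as-is.
        System.out.println(coprocessorTimeoutMillis(60_000, 30_000)); // 30000
    }
}
```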
[jira] [Commented] (KYLIN-2443) Report coprocessor error information back to client
[ https://issues.apache.org/jira/browse/KYLIN-2443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15862616#comment-15862616 ] Dayue Gao commented on KYLIN-2443: -- Commit https://github.com/apache/kylin/commit/43c0566728092d537201d751d3e8f6e3c0d5f051 Change highlights * Updated the CubeVisitResponse message with ErrorInfo and report the error message back to the end user * Renamed GTScanTimeoutException to KylinTimeoutException, and GTScanExceedThresholdException to ResourceLimitExceededException. Deleted GTScanSelfTerminatedException. * Made SQLResponse#totalScanCount reflect the hbase scan count rather than the query server scan count. Renamed StorageContext#totalScanCount to processedRowCount [~mahongbin], could you peer review the commit? > Report coprocessor error information back to client > --- > > Key: KYLIN-2443 > URL: https://issues.apache.org/jira/browse/KYLIN-2443 > Project: Kylin > Issue Type: Improvement > Components: Storage - HBase >Affects Versions: v1.6.0 >Reporter: Dayue Gao >Assignee: Dayue Gao > > When a query aborts in the coprocessor, the current error message (listed below) > doesn't carry any concrete reason. The user has to check the regionserver's log in > order to figure out what's happening, which is tedious and not always > possible in a cloud environment. > {noformat} > 4d65f9bf>The coprocessor thread stopped itself due to scan timeout or scan > threshold(check region server log), failing current query... > {noformat} > It would be better to report the error message to the client. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Resolved] (KYLIN-2436) add a configuration knob to disable spilling of aggregation cache
[ https://issues.apache.org/jira/browse/KYLIN-2436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dayue Gao resolved KYLIN-2436. -- Resolution: Fixed Fix Version/s: v2.0.0 > add a configuration knob to disable spilling of aggregation cache > - > > Key: KYLIN-2436 > URL: https://issues.apache.org/jira/browse/KYLIN-2436 > Project: Kylin > Issue Type: Improvement > Components: Storage - HBase >Affects Versions: v1.6.0 >Reporter: Dayue Gao >Assignee: Dayue Gao > Fix For: v2.0.0 > > > Kylin's aggregation operator can spill intermediate results to disk when its > estimated memory usage exceeds some threshold (kylin.query.coprocessor.mem.gb > to be specific). While it's a useful feature in general to prevent the > RegionServer from OOM, there are times when aborting this kind of > memory-hungry query immediately is a more suitable choice for users. > To accommodate this requirement, I suggest adding a new configuration named > *kylin.storage.hbase.coprocessor-spill-enabled*. The default value would be > true, which will keep the same behavior as before. If changed to false, a query > that uses more aggregation memory than the threshold will fail immediately. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (KYLIN-2436) add a configuration knob to disable spilling of aggregation cache
[ https://issues.apache.org/jira/browse/KYLIN-2436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15860703#comment-15860703 ] Dayue Gao commented on KYLIN-2436: -- commit https://github.com/apache/kylin/commit/ecf6a69fece7cbda3a9bd8d678c928224ce677aa > add a configuration knob to disable spilling of aggregation cache > - > > Key: KYLIN-2436 > URL: https://issues.apache.org/jira/browse/KYLIN-2436 > Project: Kylin > Issue Type: Improvement > Components: Storage - HBase >Affects Versions: v1.6.0 >Reporter: Dayue Gao >Assignee: Dayue Gao > Fix For: v2.0.0 > > > Kylin's aggregation operator can spill intermediate results to disk when its > estimated memory usage exceeds some threshold (kylin.query.coprocessor.mem.gb > to be specific). While it's a useful feature in general to prevent the > RegionServer from OOM, there are times when aborting this kind of > memory-hungry query immediately is a more suitable choice for users. > To accommodate this requirement, I suggest adding a new configuration named > *kylin.storage.hbase.coprocessor-spill-enabled*. The default value would be > true, which will keep the same behavior as before. If changed to false, a query > that uses more aggregation memory than the threshold will fail immediately. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Resolved] (KYLIN-2443) Report coprocessor error information back to client
[ https://issues.apache.org/jira/browse/KYLIN-2443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dayue Gao resolved KYLIN-2443. -- Resolution: Fixed Fix Version/s: v2.0.0 > Report coprocessor error information back to client > --- > > Key: KYLIN-2443 > URL: https://issues.apache.org/jira/browse/KYLIN-2443 > Project: Kylin > Issue Type: Improvement > Components: Storage - HBase >Affects Versions: v1.6.0 >Reporter: Dayue Gao >Assignee: Dayue Gao > Fix For: v2.0.0 > > > When query aborts in coprocessor, the current error message (list below) > doesn't carry any concrete reason. User has to check regionserver's log in > order to figure out what's happening, which is a tedious work and not always > possible in a cloud environment. > {noformat} > 4d65f9bf>The coprocessor thread stopped itself due to scan timeout or scan > threshold(check region server log), failing current query... > {noformat} > It would be better to report error message to client. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Resolved] (KYLIN-2437) collect number of bytes scanned to query metrics
[ https://issues.apache.org/jira/browse/KYLIN-2437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dayue Gao resolved KYLIN-2437. -- Resolution: Fixed commit https://github.com/apache/kylin/commit/e09338b34c0b07a7167096e45bf9185aa0d0cbd5 > collect number of bytes scanned to query metrics > > > Key: KYLIN-2437 > URL: https://issues.apache.org/jira/browse/KYLIN-2437 > Project: Kylin > Issue Type: Improvement > Components: Storage - HBase >Affects Versions: v1.6.0 >Reporter: Dayue Gao >Assignee: Dayue Gao > > Besides scanned row count, it's useful to know how many bytes are scanned > from HBase to fulfil a query. It is perhaps a better indicator than row count > that shows how much pressure a query puts on HBase. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (KYLIN-2437) collect number of bytes scanned to query metrics
[ https://issues.apache.org/jira/browse/KYLIN-2437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dayue Gao updated KYLIN-2437: - Fix Version/s: v2.0.0 > collect number of bytes scanned to query metrics > > > Key: KYLIN-2437 > URL: https://issues.apache.org/jira/browse/KYLIN-2437 > Project: Kylin > Issue Type: Improvement > Components: Storage - HBase >Affects Versions: v1.6.0 >Reporter: Dayue Gao >Assignee: Dayue Gao > Fix For: v2.0.0 > > > Besides scanned row count, it's useful to know how many bytes are scanned > from HBase to fulfil a query. It is perhaps a better indicator than row count > that shows how much pressure a query puts on HBase. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (KYLIN-2387) A new BitmapCounter with better performance
[ https://issues.apache.org/jira/browse/KYLIN-2387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15829485#comment-15829485 ] Dayue Gao commented on KYLIN-2387: -- Choosing mutable or immutable bitmaps is an implementation detail and shouldn't affect the way clients use BitmapCounter. Hence I added the mutate operations back to the BitmapCounter interface, and merged the two subclasses into one RoaringBitmapCounter. Commit here https://github.com/apache/kylin/commit/38c3e7bf691ecdfd0f8d42fcc97065a0596be018 > A new BitmapCounter with better performance > --- > > Key: KYLIN-2387 > URL: https://issues.apache.org/jira/browse/KYLIN-2387 > Project: Kylin > Issue Type: Improvement > Components: Metadata, Query Engine, Storage - HBase >Affects Versions: v2.0.0 >Reporter: Dayue Gao >Assignee: Dayue Gao > > We found the old BitmapCounter does not perform very well on very large > bitmaps. The inefficiency comes from > * Poor serialize implementation: instead of serializing the bitmap directly to > ByteBuffer, it uses ByteArrayOutputStream as temporary storage, which causes > superfluous memory allocations > * Poor peekLength implementation: the whole bitmap is deserialized in order > to retrieve its serialized size > * Extra deserialize cost: even if only cardinality info is needed to answer a > query, the whole bitmap is deserialized into MutableRoaringBitmap > A new BitmapCounter is designed to solve these problems > * It comes in two flavors, mutable and immutable, based on > Mutable/Immutable RoaringBitmap respectively > * ImmutableBitmapCounter has lower deserialize cost, as it just maps to a > copied buffer.
So we always deserialize to ImmutableBitmapCounter first, > and convert it to MutableBitmapCounter only when necessary > * peekLength is implemented using > ImmutableRoaringBitmap.serializedSizeInBytes, which is very fast since only > the header of the roaring format is examined > * It serializes directly to ByteBuffer; no intermediate buffer is > allocated > * The wire format is the same as before > ([RoaringFormatSpec|https://github.com/RoaringBitmap/RoaringFormatSpec/]). > Therefore no cube rebuild is needed -- This message was sent by Atlassian JIRA (v6.3.4#6332)
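The peekLength idea in KYLIN-2387 can be illustrated with a toy length-prefixed format (not the actual Roaring format): when the serialized size is recoverable from a small header, the reader never needs to deserialize the whole payload just to learn how many bytes it occupies. The format below is invented for the example:

```java
import java.nio.ByteBuffer;

// Toy illustration of the header-only peekLength pattern: a 4-byte element
// count header is enough to compute the full serialized size.
public class PeekLengthSketch {
    // Serialize as: [int count][count x int payload].
    static ByteBuffer serialize(int[] values) {
        ByteBuffer buf = ByteBuffer.allocate(4 + 4 * values.length);
        buf.putInt(values.length);          // header: element count
        for (int v : values) buf.putInt(v);
        buf.flip();
        return buf;
    }

    // Examine only the header with an absolute read; the buffer's
    // position is left untouched, mirroring a cheap peekLength.
    static int peekLength(ByteBuffer buf) {
        return 4 + 4 * buf.getInt(buf.position());
    }

    public static void main(String[] args) {
        ByteBuffer buf = serialize(new int[]{7, 8, 9});
        System.out.println(peekLength(buf)); // 16
        System.out.println(buf.remaining()); // 16: position unchanged
    }
}
```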
[jira] [Closed] (KYLIN-2398) CubeSegmentScanner generated inaccurate
[ https://issues.apache.org/jira/browse/KYLIN-2398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dayue Gao closed KYLIN-2398. Resolution: Duplicate Fix Version/s: (was: Future) > CubeSegmentScanner generated inaccurate > --- > > Key: KYLIN-2398 > URL: https://issues.apache.org/jira/browse/KYLIN-2398 > Project: Kylin > Issue Type: Improvement > Components: Query Engine >Affects Versions: v1.5.4.1 >Reporter: WangSheng >Assignee: liyang > > My project has three segments: > 2016060100_2016060200, > 2016060200_2016060300, > 2016060300_2016060400 > When I used a filter condition like this: day>='2016-06-01' and > day<'2016-06-02', > Kylin generated three CubeSegmentScanners, and each CubeSegmentScanner's > GTScanRequest was not empty! > When I changed the filter condition to: day>='2016-06-01' and > day<='2016-06-02', > Kylin also generated three CubeSegmentScanners, but the last > CubeSegmentScanner's GTScanRequest was empty! -- This message was sent by Atlassian JIRA (v6.3.4#6332)
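The reported behaviour can be reproduced with a toy version of the segment-pruning predicate. This is a hypothetical sketch, not Kylin's actual CubeSegmentScanner logic: segments are treated as half-open date ranges `[segStart, segEnd)`, and whether a segment can be pruned depends on whether the query's upper bound is inclusive.

```java
// Hypothetical sketch of pruning over half-open segments [segStart, segEnd);
// not Kylin's actual code. Dates compare correctly as ISO strings.
public class SegmentPruner {
    // Does the query range [qStart, qEnd] (or [qStart, qEnd) when the upper
    // bound is exclusive) overlap the segment [segStart, segEnd)?
    public static boolean overlaps(String segStart, String segEnd,
                                   String qStart, String qEnd, boolean qEndInclusive) {
        int cmp = qEnd.compareTo(segStart);
        boolean reachesSegment = qEndInclusive ? cmp >= 0 : cmp > 0;
        return reachesSegment && qStart.compareTo(segEnd) < 0;
    }
}
```

With `day >= '2016-06-01' and day < '2016-06-02'` only the first segment overlaps; changing the upper bound to `<=` additionally touches the second segment at its boundary. In neither case does the third segment overlap, so an empty GTScanRequest for it indicates a scanner that need not have been created.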
[jira] [Closed] (KYLIN-2399) CubeSegmentScanner generated inaccurate
[ https://issues.apache.org/jira/browse/KYLIN-2399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dayue Gao closed KYLIN-2399. Resolution: Duplicate > CubeSegmentScanner generated inaccurate > --- > > Key: KYLIN-2399 > URL: https://issues.apache.org/jira/browse/KYLIN-2399 > Project: Kylin > Issue Type: Improvement > Components: Query Engine >Affects Versions: v1.5.4.1 >Reporter: WangSheng >Assignee: liyang > Fix For: Future > > > My project has three segments: > 2016060100_2016060200, > 2016060200_2016060300, > 2016060300_2016060400 > When I used a filter condition like this: day>='2016-06-01' and > day<'2016-06-02', > Kylin generated three CubeSegmentScanners, and each CubeSegmentScanner's > GTScanRequest was not empty! > When I changed the filter condition to: day>='2016-06-01' and > day<='2016-06-02', > Kylin also generated three CubeSegmentScanners, but the last > CubeSegmentScanner's GTScanRequest was empty! -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (KYLIN-2387) A new BitmapCounter with better performance
[ https://issues.apache.org/jira/browse/KYLIN-2387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15825703#comment-15825703 ] Dayue Gao commented on KYLIN-2387: -- ImmutableRoaringBitmap.bitmapOf is only used in test, so it's possible to remove the usage of it. But my question is, why does kylin load RoaringBitmap class from spark? Is it a classpath issue? > A new BitmapCounter with better performance > --- > > Key: KYLIN-2387 > URL: https://issues.apache.org/jira/browse/KYLIN-2387 > Project: Kylin > Issue Type: Improvement > Components: Metadata, Query Engine, Storage - HBase >Affects Versions: v2.0.0 >Reporter: Dayue Gao >Assignee: Dayue Gao -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (KYLIN-2387) A new BitmapCounter with better performance
[ https://issues.apache.org/jira/browse/KYLIN-2387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15825814#comment-15825814 ] Dayue Gao commented on KYLIN-2387: -- Commit https://github.com/apache/kylin/commit/e894465007f422d619ddeab2acd87e38fa093fd9 removes the usage of ImmutableRoaringBitmap.bitmapOf. > A new BitmapCounter with better performance > --- > > Key: KYLIN-2387 > URL: https://issues.apache.org/jira/browse/KYLIN-2387 > Project: Kylin > Issue Type: Improvement > Components: Metadata, Query Engine, Storage - HBase >Affects Versions: v2.0.0 >Reporter: Dayue Gao >Assignee: Dayue Gao -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (KYLIN-2387) A new BitmapCounter with better performance
[ https://issues.apache.org/jira/browse/KYLIN-2387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15825749#comment-15825749 ] Dayue Gao commented on KYLIN-2387: -- OK, I'll remove the usage of ImmutableRoaringBitmap.bitmapOf. But I'm not sure if there are any other incompatible methods. > A new BitmapCounter with better performance > --- > > Key: KYLIN-2387 > URL: https://issues.apache.org/jira/browse/KYLIN-2387 > Project: Kylin > Issue Type: Improvement > Components: Metadata, Query Engine, Storage - HBase >Affects Versions: v2.0.0 >Reporter: Dayue Gao >Assignee: Dayue Gao -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (KYLIN-2457) Should copy the latest dictionaries on dimension tables in a batch merge job
[ https://issues.apache.org/jira/browse/KYLIN-2457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dayue Gao resolved KYLIN-2457. -- Resolution: Fixed Assignee: zhengdong > Should copy the latest dictionaries on dimension tables in a batch merge job > > > Key: KYLIN-2457 > URL: https://issues.apache.org/jira/browse/KYLIN-2457 > Project: Kylin > Issue Type: Bug >Affects Versions: v1.6.0 >Reporter: zhengdong >Assignee: zhengdong >Priority: Critical > Fix For: v2.0.0 > > Attachments: > 0001-KYLIN-2457-Should-copy-the-latest-dictionaries-on-di.patch > > > In a batch merge job, we need to create dictionaries for all dimensions for > the new segment. For the dictionaries on dimension tables, we currently just > copy them from the earliest segment of the merging segments. > However, we should select the newest dictionary for the new segment, since > incremental dimension tables are allowed. The older dictionary can't find > the records corresponding to new keys added to a dimension table, which leads > to wrong query results. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
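The fix can be sketched as follows. This is a hypothetical illustration, not the actual patch, and the Segment class below is illustrative: among the segments being merged, the dictionary for a dimension-table column is taken from the newest segment that has one, instead of the earliest.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch of dictionary selection during a batch merge;
// Segment and its fields are illustrative, not Kylin's real classes.
public class MergeDictPicker {
    public static class Segment {
        final long endTime;                     // segment end timestamp
        final Map<String, String> dictByColumn; // column -> dictionary resource path
        public Segment(long endTime, Map<String, String> dictByColumn) {
            this.endTime = endTime;
            this.dictByColumn = dictByColumn;
        }
    }

    // Pick the dictionary from the newest merging segment that has one,
    // so keys added to an incremental dimension table are not lost.
    public static Optional<String> latestDict(List<Segment> merging, String column) {
        return merging.stream()
                .filter(s -> s.dictByColumn.containsKey(column))
                .max(Comparator.comparingLong((Segment s) -> s.endTime))
                .map(s -> s.dictByColumn.get(column));
    }
}
```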
[jira] [Commented] (KYLIN-2457) Should copy the latest dictionaries on dimension tables in a batch merge job
[ https://issues.apache.org/jira/browse/KYLIN-2457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15887881#comment-15887881 ] Dayue Gao commented on KYLIN-2457: -- Merged to master https://github.com/apache/kylin/commit/a8001226b2a07cd553e680b7e14de9bf8c9981f3 [~zhengd], nice work! Thank you for your contribution! > Should copy the latest dictionaries on dimension tables in a batch merge job > > > Key: KYLIN-2457 > URL: https://issues.apache.org/jira/browse/KYLIN-2457 > Project: Kylin > Issue Type: Bug >Reporter: zhengdong >Priority: Critical > Attachments: > 0001-KYLIN-2457-Should-copy-the-latest-dictionaries-on-di.patch -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (KYLIN-2457) Should copy the latest dictionaries on dimension tables in a batch merge job
[ https://issues.apache.org/jira/browse/KYLIN-2457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dayue Gao updated KYLIN-2457: - Fix Version/s: v2.0.0 > Should copy the latest dictionaries on dimension tables in a batch merge job > > > Key: KYLIN-2457 > URL: https://issues.apache.org/jira/browse/KYLIN-2457 > Project: Kylin > Issue Type: Bug >Affects Versions: v1.6.0 >Reporter: zhengdong >Priority: Critical > Fix For: v2.0.0 > > Attachments: > 0001-KYLIN-2457-Should-copy-the-latest-dictionaries-on-di.patch -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (KYLIN-2457) Should copy the latest dictionaries on dimension tables in a batch merge job
[ https://issues.apache.org/jira/browse/KYLIN-2457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dayue Gao updated KYLIN-2457: - Affects Version/s: v1.6.0 > Should copy the latest dictionaries on dimension tables in a batch merge job > > > Key: KYLIN-2457 > URL: https://issues.apache.org/jira/browse/KYLIN-2457 > Project: Kylin > Issue Type: Bug >Affects Versions: v1.6.0 >Reporter: zhengdong >Priority: Critical > Fix For: v2.0.0 > > Attachments: > 0001-KYLIN-2457-Should-copy-the-latest-dictionaries-on-di.patch -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (KYLIN-2007) CUBOID_CACHE is not cleared when rebuilding ALL cache
[ https://issues.apache.org/jira/browse/KYLIN-2007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dayue Gao updated KYLIN-2007: - Attachment: KYLIN-2007.patch patch uploaded > CUBOID_CACHE is not cleared when rebuilding ALL cache > - > > Key: KYLIN-2007 > URL: https://issues.apache.org/jira/browse/KYLIN-2007 > Project: Kylin > Issue Type: Bug > Components: Query Engine >Affects Versions: v1.5.3 >Reporter: Dayue Gao >Assignee: Dayue Gao >Priority: Minor > Attachments: KYLIN-2007.patch > > > CubeMigrationCLI tool sends requests to wipe ALL cache of kylin instances to > invalidate possibly stale cache. However we forgot to clear > Cuboid.CUBOID_CACHE in CacheService#rebuildCache, which can lead to incorrect > query results. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (KYLIN-2007) CUBOID_CACHE is not cleared when rebuilding ALL cache
Dayue Gao created KYLIN-2007: Summary: CUBOID_CACHE is not cleared when rebuilding ALL cache Key: KYLIN-2007 URL: https://issues.apache.org/jira/browse/KYLIN-2007 Project: Kylin Issue Type: Bug Components: Query Engine Affects Versions: v1.5.3 Reporter: Dayue Gao Assignee: Dayue Gao Priority: Minor CubeMigrationCLI tool sends requests to wipe ALL cache of kylin instances to invalidate possibly stale cache. However we forgot to clear Cuboid.CUBOID_CACHE in CacheService#rebuildCache, which can lead to incorrect query results. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
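The bug class is easy to sketch. This is hypothetical code, not the actual CacheService: a wipe-all operation that forgets one process-wide static cache leaves stale entries behind that later produce wrong answers.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical illustration of KYLIN-2007: an ALL-cache rebuild must clear
// every static cache, including the cuboid cache that was originally missed.
public class CacheHolder {
    static final ConcurrentMap<String, Object> CUBE_CACHE = new ConcurrentHashMap<>();
    static final ConcurrentMap<String, Object> CUBOID_CACHE = new ConcurrentHashMap<>();

    public static void rebuildAllCache() {
        CUBE_CACHE.clear();
        CUBOID_CACHE.clear(); // the line whose absence left stale cuboid lookups
    }
}
```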
[jira] [Created] (KYLIN-2013) more robust approach to hive schema changes
Dayue Gao created KYLIN-2013: Summary: more robust approach to hive schema changes Key: KYLIN-2013 URL: https://issues.apache.org/jira/browse/KYLIN-2013 Project: Kylin Issue Type: Bug Components: Metadata, REST Service, Web Affects Versions: v1.5.3 Reporter: Dayue Gao Assignee: Dayue Gao Our users occasionally want to change their existing cubes, such as adding/renaming/removing a dimension. Some of these changes require modifications to the source hive table. When a user changes the table schema and reloads its metadata in Kylin, several issues can happen depending on what was changed. I did some schema-change tests based on 1.5.3; the results after reloading the table are listed below || type of changes || fact table || lookup table || | *minor* | both query and build still work | query can fail or return a wrong answer | | *major* | fail to load related cube | fail to load related cube | {{minor}} changes are those that don't change columns used in cubes, such as inserting/appending a new column or removing/changing an unused column. {{major}} changes are the opposite, like removing/renaming/changing the type of a used column. Clearly from the table, reloading a changed table is problematic in certain cases. KYLIN-1536 reports a similar problem. So what can we do to support this kind of iterative development process (load -> define cube -> build -> reload -> change cube -> rebuild)? My first thought was to simply detect and prohibit reloading a used table. The user should be able to know which cube is preventing the reload, and could then drop and recreate the cube after reloading. However, defining a cube is not an easy task (consider editing 100 measures). Forcing users to recreate their cube over and over again will certainly not make them happy. A better idea is to allow a cube to remain editable even if it's broken because some columns changed after reloading. A broken cube can't be built or queried; it can only be edited or dropped. 
In fact, there is a cube status called {{RealizationStatusEnum.DESCBROKEN}} in the code that was never used. We should take advantage of it. An enabled cube shouldn't allow schema changes, otherwise an unintentional reload could make it unavailable. Similarly, a disabled but unpurged cube shouldn't allow schema changes since it still has data in it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
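The proposed lifecycle can be condensed into a small state check. Only {{RealizationStatusEnum.DESCBROKEN}} comes from the issue text; the enum and rules below are a hypothetical sketch of the behaviour described, not Kylin's actual code.

```java
// Hypothetical sketch of the cube lifecycle proposed in KYLIN-2013; only the
// DESCBROKEN name comes from Kylin's RealizationStatusEnum.
public class CubeLifecycle {
    public enum Status { READY, DISABLED, DESCBROKEN }

    // A broken cube can't be built or queried; it can only be edited or dropped.
    public static boolean canBuild(Status s) { return s != Status.DESCBROKEN; }
    public static boolean canQuery(Status s) { return s == Status.READY; }

    // An enabled cube, or a disabled cube that still holds data, must not be
    // broken by a schema reload; only an empty, non-ready cube may break.
    public static boolean allowBreakingReload(Status s, boolean cubeIsEmpty) {
        return s != Status.READY && cubeIsEmpty;
    }

    // After the user fixes the cube (and model), it goes back to DISABLED.
    public static Status afterFix(Status s) {
        return s == Status.DESCBROKEN ? Status.DISABLED : s;
    }
}
```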
[jira] [Created] (KYLIN-2012) more robust approach to hive schema changes
Dayue Gao created KYLIN-2012: Summary: more robust approach to hive schema changes Key: KYLIN-2012 URL: https://issues.apache.org/jira/browse/KYLIN-2012 Project: Kylin Issue Type: Bug Components: Metadata, REST Service, Web Affects Versions: v1.5.3 Reporter: Dayue Gao Assignee: Dayue Gao Our users occasionally want to change their existing cubes, such as adding/renaming/removing a dimension. Some of these changes require modifications to the source hive table. When a user changes the table schema and reloads its metadata in Kylin, several issues can happen depending on what was changed. I did some schema-change tests based on 1.5.3; the results after reloading the table are listed below || type of changes || fact table || lookup table || | *minor* | both query and build still work | query can fail or return a wrong answer | | *major* | fail to load related cube | fail to load related cube | {{minor}} changes are those that don't change columns used in cubes, such as inserting/appending a new column or removing/changing an unused column. {{major}} changes are the opposite, like removing/renaming/changing the type of a used column. Clearly from the table, reloading a changed table is problematic in certain cases. KYLIN-1536 reports a similar problem. So what can we do to support this kind of iterative development process (load -> define cube -> build -> reload -> change cube -> rebuild)? My first thought was to simply detect and prohibit reloading a used table. The user should be able to know which cube is preventing the reload, and could then drop and recreate the cube after reloading. However, defining a cube is not an easy task (consider editing 100 measures). Forcing users to recreate their cube over and over again will certainly not make them happy. A better idea is to allow a cube to remain editable even if it's broken because some columns changed after reloading. A broken cube can't be built or queried; it can only be edited or dropped. 
In fact, there is a cube status called {{RealizationStatusEnum.DESCBROKEN}} in the code that was never used. We should take advantage of it. An enabled cube shouldn't allow schema changes, otherwise an unintentional reload could make it unavailable. Similarly, a disabled but unpurged cube shouldn't allow schema changes since it still has data in it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (KYLIN-2013) more robust approach to hive schema changes
[ https://issues.apache.org/jira/browse/KYLIN-2013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15489140#comment-15489140 ] Dayue Gao commented on KYLIN-2013: -- Ah... I have no idea why it got created twice. > more robust approach to hive schema changes > --- > > Key: KYLIN-2013 > URL: https://issues.apache.org/jira/browse/KYLIN-2013 > Project: Kylin > Issue Type: Bug > Components: Metadata, REST Service, Web >Affects Versions: v1.5.3 >Reporter: Dayue Gao >Assignee: Dayue Gao -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (KYLIN-2013) more robust approach to hive schema changes
[ https://issues.apache.org/jira/browse/KYLIN-2013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15489097#comment-15489097 ] Dayue Gao commented on KYLIN-2013: -- Hi [~yimingliu], could you point me to the jira it duplicates? Has this issue already been fixed? I was just about to submit a patch for it. > more robust approach to hive schema changes > --- > > Key: KYLIN-2013 > URL: https://issues.apache.org/jira/browse/KYLIN-2013 > Project: Kylin > Issue Type: Bug > Components: Metadata, REST Service, Web >Affects Versions: v1.5.3 >Reporter: Dayue Gao >Assignee: Dayue Gao -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (KYLIN-2007) CUBOID_CACHE is not cleared when rebuilding ALL cache
[ https://issues.apache.org/jira/browse/KYLIN-2007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15489189#comment-15489189 ] Dayue Gao commented on KYLIN-2007: -- committed to master > CUBOID_CACHE is not cleared when rebuilding ALL cache > - > > Key: KYLIN-2007 > URL: https://issues.apache.org/jira/browse/KYLIN-2007 > Project: Kylin > Issue Type: Bug > Components: Query Engine >Affects Versions: v1.5.3 >Reporter: Dayue Gao >Assignee: Dayue Gao >Priority: Minor > Attachments: KYLIN-2007.patch > > > CubeMigrationCLI tool sends requests to wipe ALL cache of kylin instances to > invalidate possibly stale cache. However we forgot to clear > Cuboid.CUBOID_CACHE in CacheService#rebuildCache, which can lead to incorrect > query results. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (KYLIN-2012) more robust approach to hive schema changes
[ https://issues.apache.org/jira/browse/KYLIN-2012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15489712#comment-15489712 ] Dayue Gao commented on KYLIN-2012: -- commit 17569f6 to master. SchemaChecker is the main workhorse; it prevents dangerous reloads according to the following rules: * if the table has been used as a fact table, the columns used in cubes can't be changed. It means ** removing/renaming a used column is not allowed ** changing the type of a used column is generally not allowed, except {{float<=>double}} and {{tinyint<=>smallint<=>integer<=>bigint}} ** adding/removing/changing an unused column is ok * if the table has been used as a lookup table, the old and new schemas should be the same, except for the type changes above. It means ** adding/removing/renaming/reordering columns is not allowed (PS: I'm aware that KYLIN-1985 could allow some degree of schema change on lookup tables, so the above rule for lookup tables may be too strict) When a non-empty cube violates these rules, no reloading will be performed. An error message containing details about the violation is shown. When only empty cubes violate these rules, reloading will succeed. All violating cubes are changed to {{DESCBROKEN}} status (see CubeManager#reloadCubeLocalAt). The status is shown in an orange warning label at the front-end so that users can easily find all broken cubes. A user can edit or drop a broken cube, but can't disable/enable/build/copy it. After the user fixes all the problems in the cube (and model), the cube goes back to DISABLED status. Trying to save a still-broken cube won't succeed, as before. Therefore, DESCBROKEN status can only appear after reloading a changed table. [~Shaofengshi] [~yimingliu] Do you have time to review the code? [~zhongjian] I'm not an expert on the front-end, could you also review the front-end changes? 
> more robust approach to hive schema changes > --- > > Key: KYLIN-2012 > URL: https://issues.apache.org/jira/browse/KYLIN-2012 > Project: Kylin > Issue Type: Bug > Components: Metadata, REST Service, Web >Affects Versions: v1.5.3 >Reporter: Dayue Gao >Assignee: Dayue Gao -- This message was sent by Atlassian JIRA (v6.3.4#6332)
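The type-change rule in the comment above can be condensed into a small predicate. This is a hypothetical sketch, not the actual SchemaChecker code: a type change is tolerated only within the {{float<=>double}} family or the {{tinyint<=>smallint<=>integer<=>bigint}} family.

```java
import java.util.Set;

// Hypothetical condensation of the SchemaChecker type rule described above;
// not Kylin's actual implementation.
public class TypeCompat {
    private static final Set<String> FLOAT_FAMILY = Set.of("float", "double");
    private static final Set<String> INT_FAMILY =
            Set.of("tinyint", "smallint", "integer", "bigint");

    public static boolean isCompatibleChange(String oldType, String newType) {
        if (oldType.equals(newType)) return true;
        return (FLOAT_FAMILY.contains(oldType) && FLOAT_FAMILY.contains(newType))
            || (INT_FAMILY.contains(oldType) && INT_FAMILY.contains(newType));
    }
}
```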
[jira] [Comment Edited] (KYLIN-2012) more robust approach to hive schema changes
[ https://issues.apache.org/jira/browse/KYLIN-2012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15489712#comment-15489712 ] Dayue Gao edited comment on KYLIN-2012 at 9/14/16 7:52 AM: --- commit 17569f6 to master. SchemaChecker is the main workhorse; it prevents dangerous reloads according to the following rules: * if the table has been used as a fact table, the columns used in cubes can't be changed. It means ** removing/renaming a used column is not allowed ** changing the type of a used column is generally not allowed, except {{float<=>double}} and {{tinyint<=>smallint<=>integer<=>bigint}} ** adding/removing/changing an unused column is ok * if the table has been used as a lookup table, the old and new schemas should be the same, except for the type changes above. It means ** adding/removing/renaming/reordering columns is not allowed (PS: I'm aware that after KYLIN-1985 we can allow some schema changes on lookup tables, so the above rule for lookup tables may be too strict) {color:red}When a non-empty cube violates these rules, no reloading will be performed{color}. An error message containing details about the violation is shown. {color:blue}When only empty cubes violate these rules, reloading will succeed{color}. All violating cubes are then changed to {{DESCBROKEN}} status (see CubeManager#reloadCubeLocalAt). The status is shown in an orange warning label at the front-end so that users can easily find all broken cubes. *A user can edit or drop a broken cube, but can't disable/enable/build/copy it.* After the user fixes all the problems in the cube (and model), the cube goes back to DISABLED status. Trying to save a still-broken cube won't succeed, as before. Therefore, DESCBROKEN status can only appear after reloading a changed table. [~Shaofengshi] [~yimingliu] Do you have time to review the code? [~zhongjian] I'm not an expert on the front-end, could you also review the front-end changes? was (Author: gaodayue): commit 17569f6 to master. 
SchemaChecker is the main workhorse, it prevents danger reloads according to the following rules: * if table has been used as fact table, all columns used in cube can't be changed. It means ** remove/rename used column is not allowed ** type change of used column is generally not allowed, except {{float<=>double}} and {{tinyint<=>smallint<=>integer<=>bigint}} ** add/remove/change unused column is ok * if table has been used as lookup table, the old and new schema should be the same, except these type change {{float<=>double}} and {{tinyint<=>smallint<=>integer<=>bigint}}, It means ** add/remove/rename/reorder column is not allowed (PS: I'm aware that KYLIN-1985 could allow some degree of schema changes on lookup table, so the above rule for lookup table may be too strict) When a non-empty cube violates these rules, no reloading will be performed. An error message containing details about the violation is shown. When only empty cube violates these rules, reloading will success. All violating cubes are changed to {{DESCBROKEN}} status (see CubeManager#reloadCubeLocalAt). The status is shown in an orange warning label at front-end so that user can easily find out all broken cubes. User can edit or drop broken cube, but can't disable/enable/build/copy it. After user fixes all the problems in his cube (and model), the cube will back to DISABLE status. Trying to save a broken cube won't success like always. Therefore, DESCBROKEN status can only appear after reloading a changed table. [~Shaofengshi] [~yimingliu] Do you have time to review the code? [~zhongjian] I'm not an expert on front-end, could you also review the front-end changes? 
> more robust approach to hive schema changes > --- > > Key: KYLIN-2012 > URL: https://issues.apache.org/jira/browse/KYLIN-2012 > Project: Kylin > Issue Type: Bug > Components: Metadata, REST Service, Web >Affects Versions: v1.5.3 >Reporter: Dayue Gao >Assignee: Dayue Gao > > Our users occasionally want to change their existing cube, such as > adding/renaming/removing a dimension. Some of these changes require > modifications to its source hive table. So our user changed the table schema > and reloaded its metadata in Kylin, then several issues can happen depends on > what he changed. > I did some schema changing tests based on 1.5.3, the results after reloading > table are listed below > || type of changes || fact table || lookup table || > | *minor* | both query and build still works | query can fail or return wrong > answer | > | *major* | fail to load related cube | fail to load related cube | > {{minor}} changes refer to those doesn't change columns used in cubes, such > as insert/append new column, remove/change unused column. > {{major}} changes are the opposite, like remove/rename/change type of used > column. > Clearly from the table,
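The type-change whitelist above ({{float<=>double}} and {{tinyint<=>smallint<=>integer<=>bigint}}) amounts to a small compatibility check. The sketch below is illustrative only; the class and method names are made up and do not come from the actual SchemaChecker code:

```java
import java.util.Arrays;
import java.util.List;

public class TypeCompat {
    // The two families of mutually interchangeable Hive types described above.
    private static final List<String> FLOAT_FAMILY = Arrays.asList("float", "double");
    private static final List<String> INT_FAMILY = Arrays.asList("tinyint", "smallint", "integer", "bigint");

    /** A used column may only change type within one family (or not change at all). */
    public static boolean isCompatibleChange(String from, String to) {
        String f = from.toLowerCase();
        String t = to.toLowerCase();
        if (f.equals(t)) {
            return true; // no change at all
        }
        return (FLOAT_FAMILY.contains(f) && FLOAT_FAMILY.contains(t))
                || (INT_FAMILY.contains(f) && INT_FAMILY.contains(t));
    }
}
```

Any change that fails this check on a used column would be rejected for non-empty cubes per the rules above.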
[jira] [Closed] (KYLIN-2007) CUBOID_CACHE is not cleared when rebuilding ALL cache
[ https://issues.apache.org/jira/browse/KYLIN-2007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dayue Gao closed KYLIN-2007. Resolution: Fixed Fix Version/s: v1.6.0 > CUBOID_CACHE is not cleared when rebuilding ALL cache > - > > Key: KYLIN-2007 > URL: https://issues.apache.org/jira/browse/KYLIN-2007 > Project: Kylin > Issue Type: Bug > Components: Query Engine >Affects Versions: v1.5.3 >Reporter: Dayue Gao >Assignee: Dayue Gao >Priority: Minor > Fix For: v1.6.0 > > Attachments: KYLIN-2007.patch > > > CubeMigrationCLI tool sends requests to wipe ALL cache of kylin instances to > invalidate possibly stale cache. However we forgot to clear > Cuboid.CUBOID_CACHE in CacheService#rebuildCache, which can lead to incorrect > query results. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (KYLIN-2058) Make Kylin more resilient to bad queries
Dayue Gao created KYLIN-2058: Summary: Make Kylin more resilient to bad queries Key: KYLIN-2058 URL: https://issues.apache.org/jira/browse/KYLIN-2058 Project: Kylin Issue Type: Improvement Components: Query Engine, Storage - HBase Affects Versions: v1.6.0 Reporter: Dayue Gao Assignee: Dayue Gao Bad/big queries are a huge threat to the overall performance and stability of Kylin. We occasionally saw some of these queries either causing heavy GC activity or crashing regionservers. I'd like to start a series of work to make Kylin more resilient to bad queries. This is an umbrella JIRA for the related work. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (KYLIN-2173) push down limit leads to wrong answer when filter is loosened
Dayue Gao created KYLIN-2173: Summary: push down limit leads to wrong answer when filter is loosened Key: KYLIN-2173 URL: https://issues.apache.org/jira/browse/KYLIN-2173 Project: Kylin Issue Type: Bug Components: Storage - HBase Affects Versions: v1.5.4.1 Reporter: Dayue Gao Assignee: Dayue Gao To reproduce:
{noformat}
select
  test_kylin_fact.cal_dt
  ,sum(test_kylin_fact.price) as GMV
FROM test_kylin_fact
left JOIN edw.test_cal_dt as test_cal_dt
ON test_kylin_fact.cal_dt = test_cal_dt.cal_dt
where test_cal_dt.week_beg_dt in ('2012-01-01', '2012-01-20')
group by test_kylin_fact.cal_dt
limit 12
{noformat}
Kylin returns 5 rows; 12 rows are expected. Root cause: the filter condition may be loosened when we translate a derived filter in DerivedFilterTranslator. If we push down the limit, the query server won't get enough valid records from storage. In the above example, 24 rows are returned from storage, but only 5 are valid. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
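The root cause generalizes: when the storage-side filter is only a loosened superset of the real predicate, taking the limit before the exact filter can starve the query server of valid rows. A stdlib-only sketch (hypothetical, not Kylin code) of why the order matters, using divisibility by 5 as a stand-in for the exact predicate:

```java
import java.util.List;
import java.util.stream.Collectors;

public class LimitPushdown {

    /** Wrong order: the limit is taken under the loosened filter, then the exact predicate drops rows. */
    public static List<Integer> limitThenFilter(List<Integer> rows, int limit) {
        return rows.stream()
                .limit(limit)                 // storage stops after `limit` rows...
                .filter(r -> r % 5 == 0)      // ...but the exact filter later invalidates most of them
                .collect(Collectors.toList());
    }

    /** Right order: apply the exact predicate first, then the limit. */
    public static List<Integer> filterThenLimit(List<Integer> rows, int limit) {
        return rows.stream()
                .filter(r -> r % 5 == 0)
                .limit(limit)
                .collect(Collectors.toList());
    }
}
```

With rows 0..99 and limit 12, the wrong order yields only 3 rows, mirroring the 5-of-24 shortfall described in the issue.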
[jira] [Commented] (KYLIN-1609) Push down undefined Count Distinct aggregation to storage
[ https://issues.apache.org/jira/browse/KYLIN-1609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15650757#comment-15650757 ] Dayue Gao commented on KYLIN-1609: -- Hi [~lidong_sjtu], is there further plan on this? What's the motivation for pushdown count(distinct dim)? > Push down undefined Count Distinct aggregation to storage > - > > Key: KYLIN-1609 > URL: https://issues.apache.org/jira/browse/KYLIN-1609 > Project: Kylin > Issue Type: New Feature > Components: Query Engine >Affects Versions: v1.5.1 >Reporter: Dong Li >Assignee: Dong Li >Priority: Minor > > KYLIN-1016 already enabled count distinct aggregation on dimension which are > not defined as COUNT_DISTINCT measures. But it's only in query engine level. > This JIRA will push it deeper. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (KYLIN-2159) Redistribution Hive Table Step always requires row_count filename as 000000_0
[ https://issues.apache.org/jira/browse/KYLIN-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15639870#comment-15639870 ] Dayue Gao commented on KYLIN-2159: -- We also ran into this problem once; we should find the target file by the pattern "000000_*". > Redistribution Hive Table Step always requires row_count filename as 000000_0 > -- > > Key: KYLIN-2159 > URL: https://issues.apache.org/jira/browse/KYLIN-2159 > Project: Kylin > Issue Type: Bug >Reporter: Dong Li > > In some case, the filename is not 000000_0. > For example, the output of second attempt of mr job might become 000000_01000. > java.io.FileNotFoundException: File does not exist: > /kylin/kylin_metadata/kylin-xxx/row_count/000000_0 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
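Matching by pattern rather than by a hardcoded name could look like the sketch below. The class name and the listing-based interface are hypothetical; a real fix would list the directory through the Hadoop FileSystem API instead:

```java
import java.util.List;
import java.util.stream.Collectors;

public class RowCountFileFinder {

    /** Return every part file matching "000000_*" instead of requiring exactly "000000_0". */
    public static List<String> findRowCountFiles(List<String> dirListing) {
        return dirListing.stream()
                .filter(name -> name.startsWith("000000_"))
                .collect(Collectors.toList());
    }
}
```

Retried MR attempts then still match, while marker files like _SUCCESS are ignored.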
[jira] [Created] (KYLIN-2115) some extended column query returns wrong answer
Dayue Gao created KYLIN-2115: Summary: some extended column query returns wrong answer Key: KYLIN-2115 URL: https://issues.apache.org/jira/browse/KYLIN-2115 Project: Kylin Issue Type: Bug Components: General Affects Versions: v1.5.4, v1.5.4.1 Reporter: Dayue Gao Assignee: Dayue Gao Priority: Critical KYLIN-1979 introduced a bug which can cause extended column queries to return wrong results if the user defines more than one extended column metric.
{noformat}
Example: let's define two extended columns
1. metricA(host=h1, extend=e1)
2. metricB(host=h2, extend=e2)
"select h1, e1 ... group by h1, e1" correct.
"select h1, e1, h2, e2 ... group by h1, e1, h2, e2" correct.
"select h2, e2 ... group by h2, e2" wrong. (column e2 is empty)
{noformat}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (KYLIN-2115) some extended column query returns wrong answer
[ https://issues.apache.org/jira/browse/KYLIN-2115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dayue Gao resolved KYLIN-2115. -- Resolution: Fixed Fix Version/s: v1.6.0 fixed in https://github.com/apache/kylin/commit/bec4a888cbacf1db1dca81a50919c935b7cb1d96 > some extended column query returns wrong answer > --- > > Key: KYLIN-2115 > URL: https://issues.apache.org/jira/browse/KYLIN-2115 > Project: Kylin > Issue Type: Bug > Components: General >Affects Versions: v1.5.4, v1.5.4.1 >Reporter: Dayue Gao >Assignee: Dayue Gao >Priority: Critical > Fix For: v1.6.0 > > > KYLIN-1979 introduces a bug, which can cause extended column query returns > wrong result if user defines more than one extended column metrics. > {noformat} > Example: let's define two extended columns > 1. metricA(host=h1, extend=e1) > 2. metricB(host=h2, extend=e2). > "select h1, e1 ... group by h1,e1" correct. > "select h1, e1, h2, e2 ... group by h1,e1, h2, e2" correct. > "select h2, e2 ... group by h2, e2" wrong. (column e2 is empty) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (KYLIN-2105) add QueryId
[ https://issues.apache.org/jira/browse/KYLIN-2105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15591776#comment-15591776 ] Dayue Gao commented on KYLIN-2105: -- [~liyang.g...@gmail.com], thank you for your suggestions. The motivation for adding project to query ID (also thread name) is that user can quickly tell which project is putting the most load on HBase (by tailing regionserver log). Timestamp is redundant, I include it mainly for improving uniqueness of query ID. I just did a quick search on query ID format used by other projects: * Presto includes timestamp in query ID * Druid and Drill use UUID. Druid also includes datasource name (similar to Kylin cube name) in query thread name. For code simplicity and log cleanliness, UUID seems a good choice. What do you think? > add QueryId > --- > > Key: KYLIN-2105 > URL: https://issues.apache.org/jira/browse/KYLIN-2105 > Project: Kylin > Issue Type: Sub-task > Components: Query Engine, Storage - HBase >Affects Versions: v1.6.0 >Reporter: Dayue Gao >Assignee: Dayue Gao > Labels: patch > Fix For: v1.6.0 > > > * for each query, generate an unique id. > * the id could describe some context information about the query, like start > time, project name, etc. > * for query thread, we could use query id as the name of the thread. As long > as user logs thread's name, he can grep query log by query id afterwards. > * pass query id to coprocessor, so that query id gets logged both in query > server and region server. > * BadQueryDetector should also log query id -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (KYLIN-2105) add QueryId
Dayue Gao created KYLIN-2105: Summary: add QueryId Key: KYLIN-2105 URL: https://issues.apache.org/jira/browse/KYLIN-2105 Project: Kylin Issue Type: Sub-task Components: Query Engine, Storage - HBase Affects Versions: v1.6.0 Reporter: Dayue Gao Assignee: Dayue Gao Fix For: v1.6.0 * for each query, generate a unique id. * the id could describe some context information about the query, like start time, project name, etc. * for the query thread, we could use the query id as the name of the thread. As long as the user logs the thread's name, he can grep the query log by query id afterwards. * pass the query id to the coprocessor, so that the query id gets logged both in the query server and the region server. * BadQueryDetector should also log the query id -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (KYLIN-2105) add QueryId
[ https://issues.apache.org/jira/browse/KYLIN-2105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15587203#comment-15587203 ] Dayue Gao commented on KYLIN-2105: -- commit https://github.com/apache/kylin/commit/db09f5f9cae5a3d3ff731221cbb1c026da4f4e41 to master. QueryID format: "MMdd_HHmmss_project_xx" * "MMdd_HHmmss": query submit time * "project": the project the query is submitted to * "xx": six randomly generated base-26 (a-z) characters We need a thread-local context to pass the QueryID to the storage layer; use BackdoorToggles for now. Maybe we should rename BackdoorToggles to QueryContext? Added a field to the coprocessor protobuf interface, which is backward compatible with 1.5.4. > add QueryId > --- > > Key: KYLIN-2105 > URL: https://issues.apache.org/jira/browse/KYLIN-2105 > Project: Kylin > Issue Type: Sub-task > Components: Query Engine, Storage - HBase >Affects Versions: v1.6.0 >Reporter: Dayue Gao >Assignee: Dayue Gao > Labels: patch > Fix For: v1.6.0 > > > * for each query, generate an unique id. > * the id could describe some context information about the query, like start > time, project name, etc. > * for query thread, we could use query id as the name of the thread. As long > as user logs thread's name, he can grep query log by query id afterwards. > * pass query id to coprocessor, so that query id gets logged both in query > server and region server. > * BadQueryDetector should also log query id -- This message was sent by Atlassian JIRA (v6.3.4#6332)
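A sketch of the id generation described in that comment; the class name is hypothetical, and the timestamp pattern follows the quoted "MMdd_HHmmss" format rather than the actual Kylin source:

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Random;

public class QueryIdGenerator {
    private static final Random RANDOM = new Random();

    /** Builds an id of the form "<MMdd_HHmmss>_<project>_<six base-26 chars>". */
    public static String nextQueryId(String project) {
        String time = new SimpleDateFormat("MMdd_HHmmss").format(new Date());
        StringBuilder suffix = new StringBuilder(6);
        for (int i = 0; i < 6; i++) {
            suffix.append((char) ('a' + RANDOM.nextInt(26))); // one random a-z character
        }
        return time + "_" + project + "_" + suffix;
    }
}
```

Setting the id as the query thread's name then lets a user grep logs by query id, as the issue proposes.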
[jira] [Commented] (KYLIN-2173) push down limit leads to wrong answer when filter is loosened
[ https://issues.apache.org/jira/browse/KYLIN-2173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15663899#comment-15663899 ] Dayue Gao commented on KYLIN-2173: -- commit to master and v1.6.0-rc2. UT seems to be broken by previous commit, will add test case later > push down limit leads to wrong answer when filter is loosened > - > > Key: KYLIN-2173 > URL: https://issues.apache.org/jira/browse/KYLIN-2173 > Project: Kylin > Issue Type: Bug > Components: Storage - HBase >Affects Versions: v1.5.4.1 >Reporter: Dayue Gao >Assignee: Dayue Gao > > To reproduce: > {noformat} > select > test_kylin_fact.cal_dt > ,sum(test_kylin_fact.price) as GMV > FROM test_kylin_fact > left JOIN edw.test_cal_dt as test_cal_dt > ON test_kylin_fact.cal_dt = test_cal_dt.cal_dt > where test_cal_dt.week_beg_dt in ('2012-01-01', '2012-01-20') > group by test_kylin_fact.cal_dt > limit 12 > {noformat} > Kylin returns 5 rows, expect 12 rows. > Root cause: filter condition may be loosened when we translate derived filter > in DerivedFilterTranslator. If we push down limit, query server won't get > enough valid records from storage. In the above example, 24 rows returned > from storage, only 5 are valid. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (KYLIN-2221) rethink on KYLIN-1684
[ https://issues.apache.org/jira/browse/KYLIN-2221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15685855#comment-15685855 ] Dayue Gao commented on KYLIN-2221: -- +1. If we can improve the way empty segments are determined, even "kylin.query.skip-empty-segments" becomes superfluous. > rethink on KYLIN-1684 > - > > Key: KYLIN-2221 > URL: https://issues.apache.org/jira/browse/KYLIN-2221 > Project: Kylin > Issue Type: Improvement >Reporter: hongbin ma >Assignee: hongbin ma > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (KYLIN-2200) CompileException on UNION ALL query when result only contains one column
[ https://issues.apache.org/jira/browse/KYLIN-2200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dayue Gao updated KYLIN-2200: - Attachment: KYLIN-2200.patch patch uploaded. There is no ITs for UNION/UNION_ALL right now. Let me add test cases once my sandbox env is fixed. > CompileException on UNION ALL query when result only contains one column > > > Key: KYLIN-2200 > URL: https://issues.apache.org/jira/browse/KYLIN-2200 > Project: Kylin > Issue Type: Bug > Components: Query Engine >Affects Versions: v1.5.4.1 >Reporter: Dayue Gao >Assignee: Dayue Gao > Attachments: KYLIN-2200.patch > > > {code:sql} > select count(*) from kylin_sales > union all > select count(*) from kylin_sales > {code} > got following exception > {noformat} > Caused by: org.codehaus.commons.compiler.CompileException: Line 82, Column > 32: Cannot determine simple type name "Record11_1" > at > org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:10092) > at > org.codehaus.janino.UnitCompiler.getReferenceType(UnitCompiler.java:5375) > at > org.codehaus.janino.UnitCompiler.getReferenceType(UnitCompiler.java:5184) > at org.codehaus.janino.UnitCompiler.getType2(UnitCompiler.java:5165) > at > org.codehaus.janino.UnitCompiler.access$12600(UnitCompiler.java:183) > at > org.codehaus.janino.UnitCompiler$16.visitReferenceType(UnitCompiler.java:5096) > at org.codehaus.janino.Java$ReferenceType.accept(Java.java:2880) > at org.codehaus.janino.UnitCompiler.getType(UnitCompiler.java:5136) > at org.codehaus.janino.UnitCompiler.getType2(UnitCompiler.java:5598) > at > org.codehaus.janino.UnitCompiler.access$13300(UnitCompiler.java:183) > at > org.codehaus.janino.UnitCompiler$16.visitCast(UnitCompiler.java:5104) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (KYLIN-2200) CompileException on UNION ALL query when result only contains one column
[ https://issues.apache.org/jira/browse/KYLIN-2200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15673043#comment-15673043 ] Dayue Gao commented on KYLIN-2200: -- Union also fails to remove duplicates.
{code:sql}
select count(*), sum(price) from kylin_sales
union
select count(*), sum(price) from kylin_sales
{code}
When the input row format is Array, EnumerableUnion should use ExtendedEnumerable.union(source, Functions.arrayComparer()) instead of ExtendedEnumerable.union(source). > CompileException on UNION ALL query when result only contains one column > > > Key: KYLIN-2200 > URL: https://issues.apache.org/jira/browse/KYLIN-2200 > Project: Kylin > Issue Type: Bug > Components: Query Engine >Affects Versions: v1.5.4.1 >Reporter: Dayue Gao >Assignee: Dayue Gao > > {code:sql} > select count(*) from kylin_sales > union all > select count(*) from kylin_sales > {code} > got following exception > {noformat} > Caused by: org.codehaus.commons.compiler.CompileException: Line 82, Column > 32: Cannot determine simple type name "Record11_1" > at > org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:10092) > at > org.codehaus.janino.UnitCompiler.getReferenceType(UnitCompiler.java:5375) > at > org.codehaus.janino.UnitCompiler.getReferenceType(UnitCompiler.java:5184) > at org.codehaus.janino.UnitCompiler.getType2(UnitCompiler.java:5165) > at > org.codehaus.janino.UnitCompiler.access$12600(UnitCompiler.java:183) > at > org.codehaus.janino.UnitCompiler$16.visitReferenceType(UnitCompiler.java:5096) > at org.codehaus.janino.Java$ReferenceType.accept(Java.java:2880) > at org.codehaus.janino.UnitCompiler.getType(UnitCompiler.java:5136) > at org.codehaus.janino.UnitCompiler.getType2(UnitCompiler.java:5598) > at > org.codehaus.janino.UnitCompiler.access$13300(UnitCompiler.java:183) > at > org.codehaus.janino.UnitCompiler$16.visitCast(UnitCompiler.java:5104) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
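The need for an array comparer follows from Java array semantics: Object[] inherits reference equality, so hash-based deduplication of array rows never merges two distinct arrays with equal elements. A stdlib-only illustration (not the Calcite code) of the two behaviors:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ArrayRowDedup {

    /** Reference-equality dedup: distinct array objects with equal contents are NOT merged. */
    public static int dedupByIdentity(Object[][] rows) {
        Set<Object[]> seen = new HashSet<>();
        for (Object[] row : rows) {
            seen.add(row); // Object[] uses Object.equals/hashCode (identity)
        }
        return seen.size();
    }

    /** Element-wise dedup, the effect an array comparer provides. */
    public static int dedupByElements(Object[][] rows) {
        Set<List<Object>> seen = new HashSet<>();
        for (Object[] row : rows) {
            seen.add(Arrays.asList(row)); // List equality compares elements
        }
        return seen.size();
    }
}
```

With array rows and no comparer, UNION behaves like the first method and duplicates survive.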
[jira] [Commented] (KYLIN-2200) CompileException on UNION ALL query when result only contains one column
[ https://issues.apache.org/jira/browse/KYLIN-2200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15672620#comment-15672620 ] Dayue Gao commented on KYLIN-2200: -- Still got a problem with the following query
{code:sql}
select count(*) from kylin_sales where lstg_format_name='FP-GTC'
union all
select count(*) from kylin_sales where lstg_format_name='FP-GTC'
{code}
Exception:
{noformat}
Caused by: org.codehaus.commons.compiler.CompileException: Line 138, Column 23: No applicable constructor/method found for actual parameters "int, long"; candidates are: "Baz$Record2_1()"
{noformat}
Generated Code:
{code:java}
/* 137 */ public Object current() {
/* 138 */ return new Record2_1(
/* 139 */ 0,
/* 140 */ org.apache.calcite.runtime.SqlFunctions.toLong(((Object[]) inputEnumerator.current())[8]));
/* 141 */ }
{code}
Interestingly, EnumerableCalc uses a constructor to initialize the SyntheticRecordType, but EnumerableRelImplementor doesn't generate a with-args constructor for it. [~julianhyde] Is this a calcite bug, or is Kylin not using it in the right way? 
> CompileException on UNION ALL query when result only contains one column > > > Key: KYLIN-2200 > URL: https://issues.apache.org/jira/browse/KYLIN-2200 > Project: Kylin > Issue Type: Bug > Components: Query Engine >Affects Versions: v1.5.4.1 >Reporter: Dayue Gao >Assignee: Dayue Gao > > {code:sql} > select count(*) from kylin_sales > union all > select count(*) from kylin_sales > {code} > got following exception > {noformat} > Caused by: org.codehaus.commons.compiler.CompileException: Line 82, Column > 32: Cannot determine simple type name "Record11_1" > at > org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:10092) > at > org.codehaus.janino.UnitCompiler.getReferenceType(UnitCompiler.java:5375) > at > org.codehaus.janino.UnitCompiler.getReferenceType(UnitCompiler.java:5184) > at org.codehaus.janino.UnitCompiler.getType2(UnitCompiler.java:5165) > at > org.codehaus.janino.UnitCompiler.access$12600(UnitCompiler.java:183) > at > org.codehaus.janino.UnitCompiler$16.visitReferenceType(UnitCompiler.java:5096) > at org.codehaus.janino.Java$ReferenceType.accept(Java.java:2880) > at org.codehaus.janino.UnitCompiler.getType(UnitCompiler.java:5136) > at org.codehaus.janino.UnitCompiler.getType2(UnitCompiler.java:5598) > at > org.codehaus.janino.UnitCompiler.access$13300(UnitCompiler.java:183) > at > org.codehaus.janino.UnitCompiler$16.visitCast(UnitCompiler.java:5104) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (KYLIN-2200) CompileException on UNION ALL query when result only contains one column
[ https://issues.apache.org/jira/browse/KYLIN-2200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15669839#comment-15669839 ] Dayue Gao commented on KYLIN-2200: -- I believe it's not a bug in calcite, because OLAPTableScan returns Enumerable
[jira] [Commented] (KYLIN-2200) CompileException on UNION ALL query when result only contains one column
[ https://issues.apache.org/jira/browse/KYLIN-2200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15669833#comment-15669833 ] Dayue Gao commented on KYLIN-2200: -- Here is calcite generated code
{code:java}
/* 1 */ public static class Record1_0 implements java.io.Serializable {
/* 2 */ public long f0;
/* 3 */ public Record1_0() {}
/* 4 */ public boolean equals(Object o) {
/* 5 */ if (this == o) {
/* 6 */ return true;
/* 7 */ }
/* 8 */ if (!(o instanceof Record1_0)) {
/* 9 */ return false;
/* 10 */ }
/* 11 */ return this.f0 == ((Record1_0) o).f0;
/* 12 */ }
/* 13 */
/* 14 */ public int hashCode() {
/* 15 */ int h = 0;
/* 16 */ h = org.apache.calcite.runtime.Utilities.hash(h, this.f0);
/* 17 */ return h;
/* 18 */ }
/* 19 */
/* 20 */ public int compareTo(Record1_0 that) {
/* 21 */ final int c;
/* 22 */ c = org.apache.calcite.runtime.Utilities.compare(this.f0, that.f0);
/* 23 */ if (c != 0) {
/* 24 */ return c;
/* 25 */ }
/* 26 */ return 0;
/* 27 */ }
/* 28 */
/* 29 */ public String toString() {
/* 30 */ return "{f0=" + this.f0 + "}";
/* 31 */ }
/* 32 */
/* 33 */ }
/* 34 */
/* 35 */ org.apache.calcite.DataContext root;
/* 36 */
/* 37 */ public org.apache.calcite.linq4j.Enumerable bind(final org.apache.calcite.DataContext root0) {
/* 38 */ root = root0;
/* 39 */ final org.apache.calcite.linq4j.Enumerable _inputEnumerable = ((org.apache.kylin.query.schema.OLAPTable) root.getRootSchema().getSubSchema("DEFAULT").getTable("KYLIN_SALES")).executeOLAPQuery(root, 0);
/* 40 */ final org.apache.calcite.linq4j.AbstractEnumerable child = new org.apache.calcite.linq4j.AbstractEnumerable(){
/* 41 */ public org.apache.calcite.linq4j.Enumerator enumerator() {
/* 42 */ return new org.apache.calcite.linq4j.Enumerator(){
/* 43 */ public final org.apache.calcite.linq4j.Enumerator inputEnumerator = _inputEnumerable.enumerator();
/* 44 */ public void reset() {
/* 45 */ inputEnumerator.reset();
/* 46 */ }
/* 47 */
/* 48 */ public boolean moveNext() {
/* 49 */ return inputEnumerator.moveNext();
/* 50 */ }
/* 51 */
/* 52 */ public void close() {
/* 53 */ inputEnumerator.close();
/* 54 */ }
/* 55 */
/* 56 */ public Object current() {
/* 57 */ return org.apache.calcite.runtime.SqlFunctions.toLong(((Object[]) inputEnumerator.current())[8]);
/* 58 */ }
/* 59 */
/* 60 */ };
/* 61 */ }
/* 62 */
/* 63 */ };
/* 64 */ final org.apache.calcite.linq4j.Enumerable _inputEnumerable0 = ((org.apache.kylin.query.schema.OLAPTable) root.getRootSchema().getSubSchema("DEFAULT").getTable("KYLIN_SALES")).executeOLAPQuery(root, 1);
/* 65 */ final org.apache.calcite.linq4j.AbstractEnumerable child1 = new org.apache.calcite.linq4j.AbstractEnumerable(){
/* 66 */ public org.apache.calcite.linq4j.Enumerator enumerator() {
/* 67 */ return new org.apache.calcite.linq4j.Enumerator(){
/* 68 */ public final org.apache.calcite.linq4j.Enumerator inputEnumerator = _inputEnumerable0.enumerator();
/* 69 */ public void reset() {
/* 70 */ inputEnumerator.reset();
/* 71 */ }
/* 72 */
/* 73 */ public boolean moveNext() {
/* 74 */ return inputEnumerator.moveNext();
/* 75 */ }
/* 76 */
/* 77 */ public void close() {
/* 78 */ inputEnumerator.close();
/* 79 */ }
/* 80 */
/* 81 */ public Object current() {
/* 82 */ return ((Record11_1) inputEnumerator.current()).COUNT__;
/* 83 */ }
/* 84 */
/* 85 */ };
/* 86 */ }
/* 87 */
/* 88 */ };
/* 89 */ return org.apache.calcite.linq4j.Linq4j.singletonEnumerable(child.aggregate(new org.apache.calcite.linq4j.function.Function0() {
/* 90 */ public Object apply() {
/* 91 */ long $SUM0a0s0;
/* 92 */ $SUM0a0s0 = 0;
/* 93 */ Record1_0 record0;
/* 94 */ record0 = new Record1_0();
/* 95 */ record0.f0 = $SUM0a0s0;
/* 96 */ return record0;
/* 97 */ }
/* 98 */ }
/* 99 */ .apply(), new org.apache.calcite.linq4j.function.Function2() {
/* 100 */ public Record1_0 apply(Record1_0 acc, long in) {
/* 101 */ acc.f0 = acc.f0 + in;
/* 102 */ return acc;
/* 103 */ }
/* 104 */ public Record1_0 apply(Record1_0 acc, Long in) {
/* 105 */ return apply(
/* 106 */ acc,
/* 107 */ in.longValue());
/* 108 */ }
/* 109 */ public Record1_0 apply(Object acc, Object in) {
/* 110 */ return apply(
/* 111 */ (Record1_0) acc,
/* 112 */ (Long) in);
/* 113 */ }
/* 114 */ }
/* 115 */
[jira] [Created] (KYLIN-2200) CompileException on UNION ALL query when result only contains one column
Dayue Gao created KYLIN-2200: Summary: CompileException on UNION ALL query when result only contains one column Key: KYLIN-2200 URL: https://issues.apache.org/jira/browse/KYLIN-2200 Project: Kylin Issue Type: Bug Components: Query Engine Affects Versions: v1.5.4.1 Reporter: Dayue Gao Assignee: Dayue Gao {code:sql} select count(*) from kylin_sales union all select count(*) from kylin_sales {code} got following exception {noformat} Caused by: org.codehaus.commons.compiler.CompileException: Line 82, Column 32: Cannot determine simple type name "Record11_1" at org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:10092) at org.codehaus.janino.UnitCompiler.getReferenceType(UnitCompiler.java:5375) at org.codehaus.janino.UnitCompiler.getReferenceType(UnitCompiler.java:5184) at org.codehaus.janino.UnitCompiler.getType2(UnitCompiler.java:5165) at org.codehaus.janino.UnitCompiler.access$12600(UnitCompiler.java:183) at org.codehaus.janino.UnitCompiler$16.visitReferenceType(UnitCompiler.java:5096) at org.codehaus.janino.Java$ReferenceType.accept(Java.java:2880) at org.codehaus.janino.UnitCompiler.getType(UnitCompiler.java:5136) at org.codehaus.janino.UnitCompiler.getType2(UnitCompiler.java:5598) at org.codehaus.janino.UnitCompiler.access$13300(UnitCompiler.java:183) at org.codehaus.janino.UnitCompiler$16.visitCast(UnitCompiler.java:5104) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (KYLIN-2173) push down limit leads to wrong answer when filter is loosened
[ https://issues.apache.org/jira/browse/KYLIN-2173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15678992#comment-15678992 ] Dayue Gao commented on KYLIN-2173: -- {quote} how about the test result? {quote} Problem should be fixed. To avoid regressions, let me add a few test cases to ITs. {quote} Besides I viewed the commit, you removed the check for empty segment, was that related with this JIRA? The skip for empty segment was for another JIRA, are you sure it is safe to remove that? {quote} Are you talking about KYLIN-1967? It's been fixed before my commit. I just removed several duplicated log (since skipZeroInputSegment always returns false). Please correct me if I was wrong. {quote} BTW, when I run "mvn clean package -DskipTests", checkstyle reported there is unused import (see below). Please check and ensure you have checkstyle plugin installed in IDEA: {quote} What a stupid mistake! Sorry for that. > push down limit leads to wrong answer when filter is loosened > - > > Key: KYLIN-2173 > URL: https://issues.apache.org/jira/browse/KYLIN-2173 > Project: Kylin > Issue Type: Bug > Components: Storage - HBase >Affects Versions: v1.5.4.1 >Reporter: Dayue Gao >Assignee: Dayue Gao > Fix For: v1.6.0 > > > To reproduce: > {noformat} > select > test_kylin_fact.cal_dt > ,sum(test_kylin_fact.price) as GMV > FROM test_kylin_fact > left JOIN edw.test_cal_dt as test_cal_dt > ON test_kylin_fact.cal_dt = test_cal_dt.cal_dt > where test_cal_dt.week_beg_dt in ('2012-01-01', '2012-01-20') > group by test_kylin_fact.cal_dt > limit 12 > {noformat} > Kylin returns 5 rows, expect 12 rows. > Root cause: filter condition may be loosened when we translate derived filter > in DerivedFilterTranslator. If we push down limit, query server won't get > enough valid records from storage. In the above example, 24 rows returned > from storage, only 5 are valid. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (KYLIN-2200) CompileException on UNION ALL query when result only contains one column
[ https://issues.apache.org/jira/browse/KYLIN-2200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15678953#comment-15678953 ] Dayue Gao commented on KYLIN-2200: -- Thanks Julian. I'll figure out whether it's a calcite bug or not. And if it is, submit a patch for it. > CompileException on UNION ALL query when result only contains one column > > > Key: KYLIN-2200 > URL: https://issues.apache.org/jira/browse/KYLIN-2200 > Project: Kylin > Issue Type: Bug > Components: Query Engine >Affects Versions: v1.5.4.1 >Reporter: Dayue Gao >Assignee: Dayue Gao > Attachments: KYLIN-2200.patch > > > {code:sql} > select count(*) from kylin_sales > union all > select count(*) from kylin_sales > {code} > got following exception > {noformat} > Caused by: org.codehaus.commons.compiler.CompileException: Line 82, Column > 32: Cannot determine simple type name "Record11_1" > at > org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:10092) > at > org.codehaus.janino.UnitCompiler.getReferenceType(UnitCompiler.java:5375) > at > org.codehaus.janino.UnitCompiler.getReferenceType(UnitCompiler.java:5184) > at org.codehaus.janino.UnitCompiler.getType2(UnitCompiler.java:5165) > at > org.codehaus.janino.UnitCompiler.access$12600(UnitCompiler.java:183) > at > org.codehaus.janino.UnitCompiler$16.visitReferenceType(UnitCompiler.java:5096) > at org.codehaus.janino.Java$ReferenceType.accept(Java.java:2880) > at org.codehaus.janino.UnitCompiler.getType(UnitCompiler.java:5136) > at org.codehaus.janino.UnitCompiler.getType2(UnitCompiler.java:5598) > at > org.codehaus.janino.UnitCompiler.access$13300(UnitCompiler.java:183) > at > org.codehaus.janino.UnitCompiler$16.visitCast(UnitCompiler.java:5104) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (KYLIN-2173) push down limit leads to wrong answer when filter is loosened
[ https://issues.apache.org/jira/browse/KYLIN-2173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15678999#comment-15678999 ] Dayue Gao commented on KYLIN-2173: -- BTW, could we build a shared CI infrastructure? Running IT in my local sandbox is super slow. > push down limit leads to wrong answer when filter is loosened > - > > Key: KYLIN-2173 > URL: https://issues.apache.org/jira/browse/KYLIN-2173 > Project: Kylin > Issue Type: Bug > Components: Storage - HBase >Affects Versions: v1.5.4.1 >Reporter: Dayue Gao >Assignee: Dayue Gao > Fix For: v1.6.0 > > > To reproduce: > {noformat} > select > test_kylin_fact.cal_dt > ,sum(test_kylin_fact.price) as GMV > FROM test_kylin_fact > left JOIN edw.test_cal_dt as test_cal_dt > ON test_kylin_fact.cal_dt = test_cal_dt.cal_dt > where test_cal_dt.week_beg_dt in ('2012-01-01', '2012-01-20') > group by test_kylin_fact.cal_dt > limit 12 > {noformat} > Kylin returns 5 rows, expect 12 rows. > Root cause: filter condition may be loosened when we translate derived filter > in DerivedFilterTranslator. If we push down limit, query server won't get > enough valid records from storage. In the above example, 24 rows returned > from storage, only 5 are valid. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (KYLIN-2200) CompileException on UNION ALL query when result only contains one column
[ https://issues.apache.org/jira/browse/KYLIN-2200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15669874#comment-15669874 ] Dayue Gao commented on KYLIN-2200: -- The root cause is that the second OLAPTableScan returns CUSTOM as its JavaRowFormat by mistake. Here's a dump of the row formats for each operator. {noformat} OLAPToEnumerableConverter EnumerableLimit(format=SCALA) EnumerableUnion(format=SCALA) EnumerableAggregate(format=SCALA) EnumerableCalc(format=SCALA) OLAPTableScan(format=ARRAY) EnumerableAggregate(format=SCALA) EnumerableCalc(format=SCALA) OLAPTableScan(format=CUSTOM) {noformat} Because EnumerableAggregate returns SCALA, EnumerableUnion changes the Prefer from ARRAY to CUSTOM for the second half, and OLAPTableScan honors the preference in the following code: {code:java} public Result implement(EnumerableRelImplementor implementor, Prefer pref) { // ... PhysType physType = PhysTypeImpl.of(implementor.getTypeFactory(), this.rowType, pref.preferArray()); // ... } {code} To fix it, we should use ARRAY for OLAPTableScan regardless of pref. 
> CompileException on UNION ALL query when result only contains one column > > > Key: KYLIN-2200 > URL: https://issues.apache.org/jira/browse/KYLIN-2200 > Project: Kylin > Issue Type: Bug > Components: Query Engine >Affects Versions: v1.5.4.1 >Reporter: Dayue Gao >Assignee: Dayue Gao > > {code:sql} > select count(*) from kylin_sales > union all > select count(*) from kylin_sales > {code} > got following exception > {noformat} > Caused by: org.codehaus.commons.compiler.CompileException: Line 82, Column > 32: Cannot determine simple type name "Record11_1" > at > org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:10092) > at > org.codehaus.janino.UnitCompiler.getReferenceType(UnitCompiler.java:5375) > at > org.codehaus.janino.UnitCompiler.getReferenceType(UnitCompiler.java:5184) > at org.codehaus.janino.UnitCompiler.getType2(UnitCompiler.java:5165) > at > org.codehaus.janino.UnitCompiler.access$12600(UnitCompiler.java:183) > at > org.codehaus.janino.UnitCompiler$16.visitReferenceType(UnitCompiler.java:5096) > at org.codehaus.janino.Java$ReferenceType.accept(Java.java:2880) > at org.codehaus.janino.UnitCompiler.getType(UnitCompiler.java:5136) > at org.codehaus.janino.UnitCompiler.getType2(UnitCompiler.java:5598) > at > org.codehaus.janino.UnitCompiler.access$13300(UnitCompiler.java:183) > at > org.codehaus.janino.UnitCompiler$16.visitCast(UnitCompiler.java:5104) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (KYLIN-2085) PrepareStatement return incorrect result in some cases
[ https://issues.apache.org/jira/browse/KYLIN-2085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15625281#comment-15625281 ] Dayue Gao commented on KYLIN-2085: -- Encountered the same problem after upgrading to 1.5.4.1. Found this jira when I was about to submit a patch for it, so I just committed UTs for this; see cd9423fb5b6e88f7451d0684f7d9598b7e06c381. > PrepareStatement return incorrect result in some cases > -- > > Key: KYLIN-2085 > URL: https://issues.apache.org/jira/browse/KYLIN-2085 > Project: Kylin > Issue Type: Bug > Components: Query Engine >Reporter: Dong Li >Assignee: Dong Li > Fix For: v1.6.0 > > > With Kylin sample data, executing the following SQL gets a result: > select count(*) from kylin_sales where lstg_format_name>='ABIN' and > lstg_format_name<='' > Result: 4054 > Send post with prestate: > POST http://localhost:7070/kylin/api/query/prestate > {"sql":"select count(*) from kylin_sales where lstg_format_name>=? and > lstg_format_name<=?","offset":0,"limit":5,"acceptPartial":true,"project":"learn_kylin","params":[{"className":"java.lang.String", > "value":"ABIN"},{"className":"java.lang.String", "value":""}]} > Result: 0 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (KYLIN-2079) add explicit configuration knob for coprocessor timeout
[ https://issues.apache.org/jira/browse/KYLIN-2079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15619993#comment-15619993 ] Dayue Gao commented on KYLIN-2079: -- [~mahongbin] , the way Kylin avoids retrying coprocessor call in these cases is by response successfully with flag normalComplete set to false, not by throwing DoNotRetryException. We just have to response before hbase.rpc.timeout, this is why I make upper bound of kylin.query.coprocessor.timeout.seconds to hbase.rpc.timeout x 0.9. I have tried using DoNotRetryException before I realized this fact. > add explicit configuration knob for coprocessor timeout > --- > > Key: KYLIN-2079 > URL: https://issues.apache.org/jira/browse/KYLIN-2079 > Project: Kylin > Issue Type: Sub-task > Components: Storage - HBase >Affects Versions: v1.5.4.1 >Reporter: Dayue Gao >Assignee: Dayue Gao > Fix For: v1.6.0 > > Attachments: KYLIN-2079.patch > > > Current self-termination timeout for CubeVisitService is calculated as the > product of three parameters: > * hbase.rpc.timeout > * hbase.client.retries.number (hardcode to 5) > * kylin.query.cube.visit.timeout.times > It has a few problems: > # due to this timeout being longer than hbase.rpc.timeout, user sees "Error > in coprocessor" instead of more descriptive GTScanSelfTerminatedException. > moreover, the request (probably a bad query) will be retried 5 times, > increasing pressure on regionserver > # it's not intuitive to set coprocessor timeout by adjusting > kylin.query.cube.visit.timeout.times > I propose the following changes: > # add a new kylin configuration "kylin.query.coprocessor.timeout.seconds" to > explicitly set coprocessor timeout. It defaults to 0, which means no value, > use hbase.rpc.timeout x 0.9 instead. When user sets it to a positive number, > kylin will use min(hbase.rpc.timeout x 0.9, > kylin.query.coprocessor.timeout.seconds) as coprocessor timeout > # remove "kylin.query.cube.visit.timeout.times". 
For the cube visit timeout > (ExpectedSizeIterator), it's really a last resort in case the coprocessor didn't > terminate itself. I don't see much need for users to control it; setting it > to 10x the coprocessor timeout should be large enough. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (KYLIN-2079) add explicit configuration knob for coprocessor timeout
[ https://issues.apache.org/jira/browse/KYLIN-2079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dayue Gao resolved KYLIN-2079. -- Resolution: Fixed Committed to master and v1.6.0-rc1. The default coprocessor timeout is set to (hbase.rpc.timeout * 0.9) / 1000 seconds. You can use "kylin.query.coprocessor.timeout.seconds" to set a lower value, 0 means default behavior. The older configuration "kylin.query.cube.visit.timeout.times" is removed in favor of the new one. > add explicit configuration knob for coprocessor timeout > --- > > Key: KYLIN-2079 > URL: https://issues.apache.org/jira/browse/KYLIN-2079 > Project: Kylin > Issue Type: Sub-task > Components: Storage - HBase >Affects Versions: v1.5.4.1 >Reporter: Dayue Gao >Assignee: Dayue Gao > Fix For: v1.6.0 > > Attachments: KYLIN-2079.patch > > > Current self-termination timeout for CubeVisitService is calculated as the > product of three parameters: > * hbase.rpc.timeout > * hbase.client.retries.number (hardcode to 5) > * kylin.query.cube.visit.timeout.times > It has a few problems: > # due to this timeout being longer than hbase.rpc.timeout, user sees "Error > in coprocessor" instead of more descriptive GTScanSelfTerminatedException. > moreover, the request (probably a bad query) will be retried 5 times, > increasing pressure on regionserver > # it's not intuitive to set coprocessor timeout by adjusting > kylin.query.cube.visit.timeout.times > I propose the following changes: > # add a new kylin configuration "kylin.query.coprocessor.timeout.seconds" to > explicitly set coprocessor timeout. It defaults to 0, which means no value, > use hbase.rpc.timeout x 0.9 instead. When user sets it to a positive number, > kylin will use min(hbase.rpc.timeout x 0.9, > kylin.query.coprocessor.timeout.seconds) as coprocessor timeout > # remove "kylin.query.cube.visit.timeout.times". For cube visit timeout > (ExpectedSizeIterator), it's really a last resort, in case coprocessor didn't > terminate itself. 
I don't see much need for users to control it; setting it > to 10x the coprocessor timeout should be large enough. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
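The timeout rule resolved above can be sketched as follows (a minimal illustration with a hypothetical method name, not Kylin's actual API; inputs assumed in milliseconds and seconds as named):

```java
public class CoprocessorTimeoutSketch {
    /**
     * Effective coprocessor timeout in milliseconds.
     * The upper bound is hbase.rpc.timeout x 0.9, so the coprocessor can
     * respond (with normalComplete=false) before HBase's own RPC deadline.
     * A configured value of 0 means "use the default", i.e. the upper bound.
     */
    public static long effectiveTimeoutMs(long hbaseRpcTimeoutMs, long configuredSeconds) {
        long upperBoundMs = (long) (hbaseRpcTimeoutMs * 0.9);
        if (configuredSeconds <= 0) {
            return upperBoundMs;                       // default behavior
        }
        return Math.min(upperBoundMs, configuredSeconds * 1000L);
    }
}
```

For example, with hbase.rpc.timeout of 60000 ms, the default is 54000 ms; a configured value of 30 seconds lowers it to 30000 ms, while a configured 120 seconds is capped at 54000 ms.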
[jira] [Created] (KYLIN-2083) more RAM estimation test for MeasureAggregator and GTAggregateScanner
Dayue Gao created KYLIN-2083: Summary: more RAM estimation test for MeasureAggregator and GTAggregateScanner Key: KYLIN-2083 URL: https://issues.apache.org/jira/browse/KYLIN-2083 Project: Kylin Issue Type: Sub-task Components: Tools, Build and Test Affects Versions: v1.5.4.1 Reporter: Dayue Gao Assignee: Dayue Gao Fix For: v1.6.0 Current RAM estimations of MeasureAggregator and GTAggregateScanner are based on test results from AggregationCacheMemSizeTest. I'd like to see if there is room for improvement, and if there is, how much. Points I'm considering are: # CompressedOops on vs off: when CompressedOops is off on large heap, each reference takes 8 bytes. I was wondering how much it will affect size of AggregationCache. # variable length aggregator: does the current estimation works well on var-len aggregator like BitmapAggregator # heap usage count via GC vs Instrumentation: the current approach to obtain the actual heap usage of objects seems fine, however, I was wondering if using Java instrumentation agent will give us more precise number. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (KYLIN-2083) more RAM estimation test for MeasureAggregator and GTAggregateScanner
[ https://issues.apache.org/jira/browse/KYLIN-2083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dayue Gao updated KYLIN-2083: - Description: Current RAM estimations for MeasureAggregator and GTAggregateScanner are based on test results from AggregationCacheMemSizeTest. I'd like to see if there is room for improvement, and if there is, how much we can improve. Points I'm interested in are: # *CompressedOops ON v.s OFF*: when CompressedOops is off on large heap, each reference takes 8 bytes. I was wondering how much it will affect the RAM of AggregationCache. # *Variable Length Aggregator*: does the current estimation works well on varlen aggregator like BitmapAggregator? # *Real Heap Usage Count via Instrumentation*: the current approach to obtain the actual heap usage of objects looks fine, however, I was wondering if using Java instrumentation agent will give us a more precise number. was: Current RAM estimations of MeasureAggregator and GTAggregateScanner are based on test results from AggregationCacheMemSizeTest. I'd like to see if there is room for improvement, and if there is, how much. Points I'm considering are: # CompressedOops on vs off: when CompressedOops is off on large heap, each reference takes 8 bytes. I was wondering how much it will affect size of AggregationCache. # variable length aggregator: does the current estimation works well on var-len aggregator like BitmapAggregator # heap usage count via GC vs Instrumentation: the current approach to obtain the actual heap usage of objects seems fine, however, I was wondering if using Java instrumentation agent will give us more precise number. 
> more RAM estimation test for MeasureAggregator and GTAggregateScanner > - > > Key: KYLIN-2083 > URL: https://issues.apache.org/jira/browse/KYLIN-2083 > Project: Kylin > Issue Type: Sub-task > Components: Tools, Build and Test >Affects Versions: v1.5.4.1 >Reporter: Dayue Gao >Assignee: Dayue Gao > Fix For: v1.6.0 > > > Current RAM estimations for MeasureAggregator and GTAggregateScanner are > based on test results from AggregationCacheMemSizeTest. I'd like to see if > there is room for improvement, and if there is, how much we can improve. > Points I'm interested in are: > # *CompressedOops ON v.s OFF*: when CompressedOops is off on large heap, each > reference takes 8 bytes. I was wondering how much it will affect the RAM of > AggregationCache. > # *Variable Length Aggregator*: does the current estimation works well on > varlen aggregator like BitmapAggregator? > # *Real Heap Usage Count via Instrumentation*: the current approach to obtain > the actual heap usage of objects looks fine, however, I was wondering if > using Java instrumentation agent will give us a more precise number. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
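The first point above (reference size under CompressedOops) can be made concrete with a back-of-the-envelope sketch (illustrative model and numbers, not Kylin's actual estimator):

```java
public class RefSizeSketch {
    /**
     * Rough per-entry cost of a map-backed aggregation cache entry that
     * holds `refsPerEntry` object references plus `payloadBytes` of data.
     * With CompressedOops on, a reference costs 4 bytes; off, 8 bytes,
     * so reference-heavy caches grow noticeably on large heaps.
     */
    public static long estimateBytes(long entries, int refsPerEntry,
                                     int payloadBytes, boolean compressedOops) {
        int refSize = compressedOops ? 4 : 8;
        return entries * (refsPerEntry * (long) refSize + payloadBytes);
    }
}
```

For a cache of 1,000,000 entries with 10 references and 100 payload bytes each, this model predicts roughly 140 MB with CompressedOops on versus 180 MB with it off, which is the kind of gap the proposed tests would quantify.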
[jira] [Commented] (KYLIN-2083) more RAM estimation test for MeasureAggregator and GTAggregateScanner
[ https://issues.apache.org/jira/browse/KYLIN-2083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15565404#comment-15565404 ] Dayue Gao commented on KYLIN-2083: -- Play with it for a while, refactored AggregationCacheMemSizeTest * use [jamm|https://github.com/jbellis/jamm] to replace the previous way for obtaining object's actual heap usage * move estimation test for individual aggregators to AggregatorMemEstimateTest * test different setups for aggregation cache * test different setups for bitmap aggregator * test both +UseCompressedOops and -UseCompressedOops Below is how to run the test and what I've found. Group 1: CompressedOops Enabled -- {noformat} $ mvn test -Dtest=AggregationCacheMemSizeTest#testEstimateMemSize -pl 'core-cube' -DargLine='-Xms2g -Xmx2g' -Dscale=10 {noformat} 1)WITHOUT_MEM_HUNGRY:contain three basic aggregators: longSum, doubleSum and bigdecimalSum {noformat} Size Estimate(bytes) Actual(bytes)Estimate(ms) Actual(ms) 100,000 32,400,000 31,200,080 1 1,174 200,000 64,800,000 62,400,080 1 2,899 300,000 97,200,000 93,600,080 1 5,779 400,000 129,600,000 124,800,080 1 9,338 500,000 162,000,000 156,000,080 1 13,547 600,000 194,400,000 187,200,080 1 19,555 700,000 226,800,000 218,400,080 1 26,240 800,000 259,200,000 249,600,080 1 33,895 900,000 291,600,000 280,800,080 1 42,416 1,000,000 324,000,000 312,000,080 1 50,853 {noformat} 2) WITH_HLLC: contain three basic aggregators and one HyperLogLog(14) aggregator {noformat} Size Estimate(bytes) Actual(bytes)Estimate(ms) Actual(ms) 5,000 83,840,000 83,840,096 0 51 10,000 167,680,000 167,680,096 0 148 15,000 251,520,000 251,520,096 0 303 20,000 335,360,000 335,360,096 0 486 25,000 419,200,000 419,200,096 0 717 30,000 503,040,000 503,040,096 0 1,008 35,000 586,880,000 586,880,096 0 1,334 40,000 670,720,000 670,720,096 0 1,711 45,000 754,560,000 754,560,096 0 2,120 50,000 838,400,000 838,400,096 0 2,648 {noformat} 3) WITH_LOW_CARD_BITMAP: contain three basic aggregators and one sparse 
bitmap aggregator (1 million bits but only 100 bits on). {noformat} Size Estimate(bytes) Actual(bytes)Estimate(ms) Actual(ms) 10,000 5,920,000 23,200,080 1 452 20,000 11,840,000 46,400,080 1 1,330 30,000 17,760,000 69,600,080 1 2,716 40,000 23,680,000 92,800,080 1 4,531 50,000 29,600,000 116,000,080 1 6,973 60,000 35,520,000 139,200,080 1 9,915 70,000 41,440,000 162,400,080 1 13,289 80,000 47,360,000 185,600,080 1 17,037 90,000 53,280,000 208,800,080 1 21,923 100,000 59,200,000 232,000,080 1 28,140 {noformat} 4) WITH_HIGH_CARD_BITMAP: contain three basic aggregators and one dense bitmap aggregator (1 million bits, 99.99% on) {noformat} Size Estimate(bytes) Actual(bytes)Estimate(ms) Actual(ms) 1,000 131,464,000 133,096,080 0 49 2,000 262,928,000 266,192,080 0 138 3,000 394,392,000 399,288,080 0 319 4,000 525,856,000 532,384,080 0 503 5,000 657,320,000 665,480,080 0 739 6,000 788,784,000 798,576,080 0 1,101 7,000 920,248,000 931,672,080 0 1,473 8,000 1,051,712,000 1,064,768,080 0 1,895 9,000 1,183,176,000 1,197,864,080 0 2,311 10,000 1,314,640,000 1,330,960,080 0 2,969 {noformat} Group 2: CompressedOops Disabled
[jira] [Commented] (KYLIN-2012) more robust approach to hive schema changes
[ https://issues.apache.org/jira/browse/KYLIN-2012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15571375#comment-15571375 ] Dayue Gao commented on KYLIN-2012: -- https://github.com/apache/kylin/commit/36cf99ef77486c1361a31f3e1f748bb277eca217 refines rules on lookup table. https://github.com/apache/kylin/commit/5974fc0870be17a8801c55c7496093d42dbb7c4f renames RealizationStatusEnum.DESCBROKEN to RealizationStatusEnum.BROKEN. It's difficult for user to understand what does DESCBROKEN means. > more robust approach to hive schema changes > --- > > Key: KYLIN-2012 > URL: https://issues.apache.org/jira/browse/KYLIN-2012 > Project: Kylin > Issue Type: Bug > Components: Metadata, REST Service, Web >Affects Versions: v1.5.3 >Reporter: Dayue Gao >Assignee: Dayue Gao > Fix For: v1.6.0 > > > Our users occasionally want to change their existing cube, such as > adding/renaming/removing a dimension. Some of these changes require > modifications to its source hive table. So our user changed the table schema > and reloaded its metadata in Kylin, then several issues can happen depends on > what he changed. > I did some schema changing tests based on 1.5.3, the results after reloading > table are listed below > || type of changes || fact table || lookup table || > | *minor* | both query and build still works | query can fail or return wrong > answer | > | *major* | fail to load related cube | fail to load related cube | > {{minor}} changes refer to those doesn't change columns used in cubes, such > as insert/append new column, remove/change unused column. > {{major}} changes are the opposite, like remove/rename/change type of used > column. > Clearly from the table, reload a changed table is problematic in certain > cases. KYLIN-1536 reports a similar problem. > So what can we do to support this kind of iterative development process (load > -> define cube -> build -> reload -> change cube -> rebuild)? 
> My first thought is simply detect-and-prohibit reloading used table. User > should be able to know which cube is preventing him from reloading, and then > he could drop and recreate cube after reloading. However, defining a cube is > not an easy task (consider editing 100 measures). Force users to recreate > their cube over and over again will certainly not make them happy. > A better idea is to allow cube to be editable even if it's broken due to some > columns changed after reloading. Broken cube can't be built or queried, it > can only be edit or dropped. In fact, there is a cube status called > {{RealizationStatusEnum.DESCBROKEN}} in code, but was never used. We should > take advantage of it. > An enabled cube shouldn't allow schema changes, otherwise an unintentional > reload could make it unavailable. Similarly, a disabled but unpurged cube > shouldn't allow schema changes since it still has data in it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (KYLIN-2012) more robust approach to hive schema changes
[ https://issues.apache.org/jira/browse/KYLIN-2012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dayue Gao resolved KYLIN-2012. -- Resolution: Fixed > more robust approach to hive schema changes > --- > > Key: KYLIN-2012 > URL: https://issues.apache.org/jira/browse/KYLIN-2012 > Project: Kylin > Issue Type: Bug > Components: Metadata, REST Service, Web >Affects Versions: v1.5.3 >Reporter: Dayue Gao >Assignee: Dayue Gao > Fix For: v1.6.0 > > > Our users occasionally want to change their existing cube, such as > adding/renaming/removing a dimension. Some of these changes require > modifications to its source hive table. So our user changed the table schema > and reloaded its metadata in Kylin, then several issues can happen depends on > what he changed. > I did some schema changing tests based on 1.5.3, the results after reloading > table are listed below > || type of changes || fact table || lookup table || > | *minor* | both query and build still works | query can fail or return wrong > answer | > | *major* | fail to load related cube | fail to load related cube | > {{minor}} changes refer to those doesn't change columns used in cubes, such > as insert/append new column, remove/change unused column. > {{major}} changes are the opposite, like remove/rename/change type of used > column. > Clearly from the table, reload a changed table is problematic in certain > cases. KYLIN-1536 reports a similar problem. > So what can we do to support this kind of iterative development process (load > -> define cube -> build -> reload -> change cube -> rebuild)? > My first thought is simply detect-and-prohibit reloading used table. User > should be able to know which cube is preventing him from reloading, and then > he could drop and recreate cube after reloading. However, defining a cube is > not an easy task (consider editing 100 measures). Force users to recreate > their cube over and over again will certainly not make them happy. 
> A better idea is to allow cube to be editable even if it's broken due to some > columns changed after reloading. Broken cube can't be built or queried, it > can only be edit or dropped. In fact, there is a cube status called > {{RealizationStatusEnum.DESCBROKEN}} in code, but was never used. We should > take advantage of it. > An enabled cube shouldn't allow schema changes, otherwise an unintentional > reload could make it unavailable. Similarly, a disabled but unpurged cube > shouldn't allow schema changes since it still has data in it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (KYLIN-2012) more robust approach to hive schema changes
[ https://issues.apache.org/jira/browse/KYLIN-2012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15571341#comment-15571341 ] Dayue Gao commented on KYLIN-2012: -- I found that even after KYLIN-1985, we can only allow user to append columns to lookup table, the reasons are: * LookupTable use ColumnDesc's zerobasedindex to find key columns in SnapshotTable, if users insert/drop column in the middle of hive table, the indexes of ColumnDesc are not aligned with hive. * If users drop trailing unused column of lookup table, query can fail with ArrayIndexOutOfBoundsException at LookupStringTable#convertRow. That's because #columns of SnapshotTable is larger than length(LookupStringTable.colIsDateTime). > more robust approach to hive schema changes > --- > > Key: KYLIN-2012 > URL: https://issues.apache.org/jira/browse/KYLIN-2012 > Project: Kylin > Issue Type: Bug > Components: Metadata, REST Service, Web >Affects Versions: v1.5.3 >Reporter: Dayue Gao >Assignee: Dayue Gao > Fix For: v1.6.0 > > > Our users occasionally want to change their existing cube, such as > adding/renaming/removing a dimension. Some of these changes require > modifications to its source hive table. So our user changed the table schema > and reloaded its metadata in Kylin, then several issues can happen depends on > what he changed. > I did some schema changing tests based on 1.5.3, the results after reloading > table are listed below > || type of changes || fact table || lookup table || > | *minor* | both query and build still works | query can fail or return wrong > answer | > | *major* | fail to load related cube | fail to load related cube | > {{minor}} changes refer to those doesn't change columns used in cubes, such > as insert/append new column, remove/change unused column. > {{major}} changes are the opposite, like remove/rename/change type of used > column. > Clearly from the table, reload a changed table is problematic in certain > cases. 
KYLIN-1536 reports a similar problem. > So what can we do to support this kind of iterative development process (load > -> define cube -> build -> reload -> change cube -> rebuild)? > My first thought is simply detect-and-prohibit reloading used table. User > should be able to know which cube is preventing him from reloading, and then > he could drop and recreate cube after reloading. However, defining a cube is > not an easy task (consider editing 100 measures). Force users to recreate > their cube over and over again will certainly not make them happy. > A better idea is to allow cube to be editable even if it's broken due to some > columns changed after reloading. Broken cube can't be built or queried, it > can only be edit or dropped. In fact, there is a cube status called > {{RealizationStatusEnum.DESCBROKEN}} in code, but was never used. We should > take advantage of it. > An enabled cube shouldn't allow schema changes, otherwise an unintentional > reload could make it unavailable. Similarly, a disabled but unpurged cube > shouldn't allow schema changes since it still has data in it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
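The first issue in the comment above — key columns resolved by zero-based index — can be illustrated with a small sketch (hypothetical data and class, not the real LookupTable/SnapshotTable): inserting a column in the middle of the hive table shifts indexes, so a stale index silently reads the wrong column.

```java
public class ColumnIndexSketch {
    // A snapshot row is just an array of column values addressed by
    // zero-based index, which is how the lookup key column is found.
    // If the hive schema gains a column before the key, a stale index
    // now points at a different column and lookups return wrong data.
    public static String keyColumn(String[] row, int keyIndex) {
        return row[keyIndex];
    }
}
```

With metadata recording the key at index 1, the old row {"2012-01-01", "W1", "desc"} resolves to "W1"; after a column is inserted in the middle, the same index on {"2012-01-01", "extra", "W1", "desc"} resolves to "extra" instead, which is why only appending columns is safe.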
[jira] [Comment Edited] (KYLIN-2012) more robust approach to hive schema changes
[ https://issues.apache.org/jira/browse/KYLIN-2012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15571511#comment-15571511 ] Dayue Gao edited comment on KYLIN-2012 at 10/13/16 10:17 AM: - oops, didn't think about the migration. Please revert it and keep the old name. was (Author: gaodayue): oops, didn't think about the migration. Please revert it. > more robust approach to hive schema changes > --- > > Key: KYLIN-2012 > URL: https://issues.apache.org/jira/browse/KYLIN-2012 > Project: Kylin > Issue Type: Bug > Components: Metadata, REST Service, Web >Affects Versions: v1.5.3 >Reporter: Dayue Gao >Assignee: Dayue Gao > Fix For: v1.6.0 > > > Our users occasionally want to change their existing cube, such as > adding/renaming/removing a dimension. Some of these changes require > modifications to its source hive table. So our user changed the table schema > and reloaded its metadata in Kylin, then several issues can happen depends on > what he changed. > I did some schema changing tests based on 1.5.3, the results after reloading > table are listed below > || type of changes || fact table || lookup table || > | *minor* | both query and build still works | query can fail or return wrong > answer | > | *major* | fail to load related cube | fail to load related cube | > {{minor}} changes refer to those doesn't change columns used in cubes, such > as insert/append new column, remove/change unused column. > {{major}} changes are the opposite, like remove/rename/change type of used > column. > Clearly from the table, reload a changed table is problematic in certain > cases. KYLIN-1536 reports a similar problem. > So what can we do to support this kind of iterative development process (load > -> define cube -> build -> reload -> change cube -> rebuild)? > My first thought is simply detect-and-prohibit reloading used table. 
User > should be able to know which cube is preventing him from reloading, and then > he could drop and recreate cube after reloading. However, defining a cube is > not an easy task (consider editing 100 measures). Force users to recreate > their cube over and over again will certainly not make them happy. > A better idea is to allow cube to be editable even if it's broken due to some > columns changed after reloading. Broken cube can't be built or queried, it > can only be edit or dropped. In fact, there is a cube status called > {{RealizationStatusEnum.DESCBROKEN}} in code, but was never used. We should > take advantage of it. > An enabled cube shouldn't allow schema changes, otherwise an unintentional > reload could make it unavailable. Similarly, a disabled but unpurged cube > shouldn't allow schema changes since it still has data in it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)