[jira] [Updated] (KYLIN-1768) NDCuboidMapper throws ArrayIndexOutOfBoundsException when dimension is fixed length encoded to more than 256 bytes

2016-06-06 Thread Dayue Gao (JIRA)

 [ 
https://issues.apache.org/jira/browse/KYLIN-1768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dayue Gao updated KYLIN-1768:
-
Description: 
When a user defines a dimension that is fixed-length encoded to more than 256
bytes, the "Build N-Dimension Cuboid Data" step fails in the map phase. The
stack trace is shown below:

{noformat}
Error: java.lang.ArrayIndexOutOfBoundsException
at java.lang.System.arraycopy(Native Method)
at org.apache.kylin.cube.common.RowKeySplitter.split(RowKeySplitter.java:103)
at org.apache.kylin.engine.mr.steps.NDCuboidMapper.map(NDCuboidMapper.java:125)
at org.apache.kylin.engine.mr.steps.NDCuboidMapper.map(NDCuboidMapper.java:49)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
{noformat}

The reason is that `RowKeySplitter` is hardcoded to 65 splits of 256 bytes
each; copying a larger encoded dimension into a split buffer therefore throws
an ArrayIndexOutOfBoundsException.
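The overflow can be reproduced in isolation. The sketch below is not Kylin's actual RowKeySplitter code; it only mirrors the hardcoded limits described above (65 split buffers of 256 bytes each) to show why System.arraycopy throws once an encoded dimension exceeds the buffer size.

```java
/**
 * Minimal sketch of the failure mode, assuming fixed split buffers
 * like the ones RowKeySplitter is described to allocate.
 */
public class RowKeySplitterSketch {

    // Hardcoded limits described in the issue (illustrative constants).
    static final int MAX_SPLITS = 65;
    static final int BYTES_PER_SPLIT = 256;

    static final byte[][] SPLIT_BUFFERS = new byte[MAX_SPLITS][BYTES_PER_SPLIT];

    /** Returns true if copying an encoded dimension of the given length fails. */
    public static boolean overflows(int encodedLength) {
        byte[] encodedDimension = new byte[encodedLength];
        try {
            // System.arraycopy throws ArrayIndexOutOfBoundsException when the
            // copy length exceeds the 256-byte destination buffer.
            System.arraycopy(encodedDimension, 0, SPLIT_BUFFERS[0], 0,
                    encodedDimension.length);
            return false;
        } catch (ArrayIndexOutOfBoundsException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("256 bytes fits: " + !overflows(256));
        System.out.println("300 bytes overflows: " + overflows(300));
    }
}
```

In this sketch a dimension encoded to exactly 256 bytes still fits the buffer, while anything larger triggers the exception seen in the stack trace.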

  was:
When a user defines a dimension that is fixed-length encoded to more than 256
bytes, the "Build N-Dimension Cuboid Data" step fails in the map phase. The
stack trace is shown below:

{noformat}
Error: java.lang.ArrayIndexOutOfBoundsException
at java.lang.System.arraycopy(Native Method)
at org.apache.kylin.cube.common.RowKeySplitter.split(RowKeySplitter.java:103)
at org.apache.kylin.engine.mr.steps.NDCuboidMapper.map(NDCuboidMapper.java:125)
at org.apache.kylin.engine.mr.steps.NDCuboidMapper.map(NDCuboidMapper.java:49)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
{noformat}

The reason is that `RowKeySplitter` is hardcoded to 65 splits of 256 bytes
each; copying a larger encoded dimension into a split buffer therefore throws
an ArrayIndexOutOfBoundsException.


> NDCuboidMapper throws ArrayIndexOutOfBoundsException when dimension is fixed 
> length encoded to more than 256 bytes
> --
>
> Key: KYLIN-1768
> URL: https://issues.apache.org/jira/browse/KYLIN-1768
> Project: Kylin
>  Issue Type: Bug
>  Components: Job Engine
>Affects Versions: v1.5.2
>Reporter: Dayue Gao
>Assignee: Dayue Gao
>
> When a user defines a dimension that is fixed-length encoded to more than 256
> bytes, the "Build N-Dimension Cuboid Data" step fails in the map phase. The
> stack trace is shown below:
> {noformat}
> Error: java.lang.ArrayIndexOutOfBoundsException
> at java.lang.System.arraycopy(Native Method)
> at org.apache.kylin.cube.common.RowKeySplitter.split(RowKeySplitter.java:103)
> at org.apache.kylin.engine.mr.steps.NDCuboidMapper.map(NDCuboidMapper.java:125)
> at org.apache.kylin.engine.mr.steps.NDCuboidMapper.map(NDCuboidMapper.java:49)
> at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
> at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
> at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
> {noformat}
> The reason is that `RowKeySplitter` is hardcoded to 65 splits of 256 bytes
> each; copying a larger encoded dimension into a split buffer therefore throws
> an ArrayIndexOutOfBoundsException.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (KYLIN-1768) NDCuboidMapper throws ArrayIndexOutOfBoundsException when dimension is fixed length encoded to more than 256 bytes

2016-06-06 Thread Dayue Gao (JIRA)
Dayue Gao created KYLIN-1768:


 Summary: NDCuboidMapper throws ArrayIndexOutOfBoundsException when 
dimension is fixed length encoded to more than 256 bytes
 Key: KYLIN-1768
 URL: https://issues.apache.org/jira/browse/KYLIN-1768
 Project: Kylin
  Issue Type: Bug
  Components: Job Engine
Affects Versions: v1.5.2
Reporter: Dayue Gao
Assignee: Dayue Gao


When a user defines a dimension that is fixed-length encoded to more than 256
bytes, the "Build N-Dimension Cuboid Data" step fails in the map phase. The
stack trace is shown below:

{noformat}
Error: java.lang.ArrayIndexOutOfBoundsException
at java.lang.System.arraycopy(Native Method)
at org.apache.kylin.cube.common.RowKeySplitter.split(RowKeySplitter.java:103)
at org.apache.kylin.engine.mr.steps.NDCuboidMapper.map(NDCuboidMapper.java:125)
at org.apache.kylin.engine.mr.steps.NDCuboidMapper.map(NDCuboidMapper.java:49)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
{noformat}

The reason is that `RowKeySplitter` is hardcoded to 65 splits of 256 bytes
each; copying a larger encoded dimension into a split buffer therefore throws
an ArrayIndexOutOfBoundsException.





[jira] [Created] (KYLIN-1770) Can't use PreparedStatement with "between and" expression

2016-06-07 Thread Dayue Gao (JIRA)
Dayue Gao created KYLIN-1770:


 Summary: Can't use PreparedStatement with "between and" expression
 Key: KYLIN-1770
 URL: https://issues.apache.org/jira/browse/KYLIN-1770
 Project: Kylin
  Issue Type: Bug
  Components: Driver - JDBC
Affects Versions: v1.5.2, v1.5.1
Reporter: Dayue Gao


Sample code to reproduce:

{code:java}
final String sql = "select count(*) from kylin_sales where LSTG_SITE_ID between ? and ?";

try (PreparedStatement stmt = conn.prepareStatement(sql)) {
    stmt.setInt(1, 0);
    stmt.setInt(2, 5);

    try (ResultSet rs = stmt.executeQuery()) {
        printResultSet(rs);
    }
}
{code}

Exception stack trace from server log:
{noformat}
java.sql.SQLException: Error while preparing statement [select count(*) from kylin_sales where LSTG_SITE_ID between ? and ?]
at org.apache.calcite.avatica.Helper.createException(Helper.java:56)
at org.apache.calcite.avatica.Helper.createException(Helper.java:41)
at org.apache.calcite.jdbc.CalciteConnectionImpl.prepareStatement_(CalciteConnectionImpl.java:203)
at org.apache.calcite.jdbc.CalciteConnectionImpl.prepareStatement(CalciteConnectionImpl.java:184)
at org.apache.calcite.jdbc.CalciteConnectionImpl.prepareStatement(CalciteConnectionImpl.java:85)
at org.apache.calcite.avatica.AvaticaConnection.prepareStatement(AvaticaConnection.java:153)
at org.apache.kylin.rest.service.QueryService.execute(QueryService.java:353)
at org.apache.kylin.rest.service.QueryService.queryWithSqlMassage(QueryService.java:274)
at org.apache.kylin.rest.service.QueryService.query(QueryService.java:120)
at org.apache.kylin.rest.service.QueryService$$FastClassByCGLIB$$4957273f.invoke()
at net.sf.cglib.proxy.MethodProxy.invoke(MethodProxy.java:204)
at org.springframework.aop.framework.Cglib2AopProxy$DynamicAdvisedInterceptor.intercept(Cglib2AopProxy.java:618)
at org.apache.kylin.rest.service.QueryService$$EnhancerByCGLIB$$8610374f.query()
at org.apache.kylin.rest.controller.QueryController.doQueryWithCache(QueryController.java:192)
at org.apache.kylin.rest.controller.QueryController.prepareQuery(QueryController.java:101)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.springframework.web.method.support.InvocableHandlerMethod.invoke(InvocableHandlerMethod.java:213)
at org.springframework.web.method.support.InvocableHandlerMethod.invokeForRequest(InvocableHandlerMethod.java:126)
at org.springframework.web.servlet.mvc.method.annotation.ServletInvocableHandlerMethod.invokeAndHandle(ServletInvocableHandlerMethod.java:96)
at org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerAdapter.invokeHandlerMethod(RequestMappingHandlerAdapter.java:617)
at org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerAdapter.handleInternal(RequestMappingHandlerAdapter.java:578)
at org.springframework.web.servlet.mvc.method.AbstractHandlerMethodAdapter.handle(AbstractHandlerMethodAdapter.java:80)
at org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:923)
at org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:852)
at org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:882)
at org.springframework.web.servlet.FrameworkServlet.doPost(FrameworkServlet.java:789)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:646)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:727)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:303)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
at org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:52)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
at org.springframework.security.web.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:330)
at org.springframework.security.web.access.intercept.FilterSecurityInterceptor.invoke(FilterSecurityInterceptor.java:118)
at org.springframework.security.web.access.intercept.FilterSecurityInterceptor.doFilter(FilterSecurityInterceptor.java:84)
at

[jira] [Updated] (KYLIN-1752) Add an option to fail cube build job when source table is empty

2016-06-06 Thread Dayue Gao (JIRA)

 [ 
https://issues.apache.org/jira/browse/KYLIN-1752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dayue Gao updated KYLIN-1752:
-
Attachment: KYLIN-1752.patch.1

Uploaded KYLIN-1752.patch.1 according to Shaofeng's suggestion.

> Add an option to fail cube build job when source table is empty
> ---
>
> Key: KYLIN-1752
> URL: https://issues.apache.org/jira/browse/KYLIN-1752
> Project: Kylin
>  Issue Type: New Feature
>  Components: Job Engine
>Affects Versions: v1.5.2
>Reporter: Dayue Gao
>Assignee: Dayue Gao
>Priority: Trivial
> Attachments: KYLIN-1752.patch, KYLIN-1752.patch.1
>
>
> For a non-incrementally built cube, it's valuable to be able to fail the
> build job when the source table is empty. Otherwise, a mistake in the
> upstream ETL that results in an empty source table will lead to an empty
> cube. In this situation, users often still want to be able to query the
> historical data in the cube until they fix their ETL and rebuild the cube.





[jira] [Commented] (KYLIN-1677) Distribute source data by certain columns when creating flat table

2016-06-07 Thread Dayue Gao (JIRA)

[ 
https://issues.apache.org/jira/browse/KYLIN-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15319955#comment-15319955
 ] 

Dayue Gao commented on KYLIN-1677:
--

Hi Shaofeng,

Here's the test result of using a Hive view as the fact table:

|| KYLIN-1677 || Time(min) || KYLIN-1656 || Time(min) ||
| Count Source Table | 9.02 | Create Intermediate Flat Hive Table | 8.12 |
| Create Intermediate Flat Hive Table | 12.89 | Redistribute Intermediate Flat Hive | 2.39 |

As expected, KYLIN-1677 took more time because it materializes the view twice
instead of once as in KYLIN-1656.

To be fair, I also tested a cube that uses a non-view fact table:

|| KYLIN-1677 || Time(min) || KYLIN-1656 || Time(min) ||
| Count Source Table | 1.10 | Create Intermediate Flat Hive Table | 3.74 |
| Create Intermediate Flat Hive Table | 1.70 | Redistribute Intermediate Flat Hive | 5.13 |

In this case, KYLIN-1677 performs better than KYLIN-1656 because it avoids one
round of MR.

In general, I'm +1 to release KYLIN-1677 as a refinement of KYLIN-1656.


> Distribute source data by certain columns when creating flat table
> --
>
> Key: KYLIN-1677
> URL: https://issues.apache.org/jira/browse/KYLIN-1677
> Project: Kylin
>  Issue Type: Improvement
>  Components: Job Engine
>Reporter: Shaofeng SHI
>Assignee: Shaofeng SHI
> Fix For: v1.5.3
>
>
> Inspired by KYLIN-1656, Kylin can distribute the source data by certain
> columns when creating the flat Hive table. The data assigned to a mapper will
> then have more similarity, more aggregation can happen on the mapper side,
> and less shuffle and reduce work is needed.
> Columns that can be used for the distribution include: ultra-high-cardinality
> columns, mandatory columns, the partition date/time column, etc.





[jira] [Commented] (KYLIN-1758) createLookupHiveViewMaterializationStep will create intermediate table for fact table

2016-06-03 Thread Dayue Gao (JIRA)

[ 
https://issues.apache.org/jira/browse/KYLIN-1758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15315319#comment-15315319
 ] 

Dayue Gao commented on KYLIN-1758:
--

Hi Shaofeng,

Please also note that `createLookupHiveViewMaterializationStep` should use
`JoinedFlatTable.generateHiveSetStatements` when building the hive command;
otherwise it will not pick up the correct queue configuration.

> createLookupHiveViewMaterializationStep will create intermediate table for 
> fact table 
> --
>
> Key: KYLIN-1758
> URL: https://issues.apache.org/jira/browse/KYLIN-1758
> Project: Kylin
>  Issue Type: Bug
>  Components: Job Engine
>Affects Versions: v1.5.2
> Environment: hadoop2.4, hbase1.1.2
>Reporter: Xingxing Di
>Assignee: Shaofeng SHI
>Priority: Critical
>
> In our model, the fact table is a hive view. When I build the cube (I
> selected one partition for one day's data), the job tries to create the
> intermediate table with the SQL:
> DROP TABLE IF EXISTS kylin_intermediate_OLAP_OLAP_FLW_PCM_MAU;
> CREATE TABLE IF NOT EXISTS kylin_intermediate_OLAP_OLAP_FLW_PCM_MAU
> LOCATION 
> '/team/db/kylin15/kylin_metadata15/kylin-65c7ee9a-3024-4633-927c-19e992ed155a/kylin_intermediate_OLAP_OLAP_FLW_PCM_MAU'
> AS SELECT * FROM OLAP.OLAP_FLW_PCM_MAU;
> 1. There is no partition predicate in the WHERE clause (I selected only one
> partition), which causes a very, very big MR job.
> 2. Note that our lookup table is not a view.





[jira] [Commented] (KYLIN-1656) Improve performance of MRv2 engine by making each mapper handles a configured number of records

2016-05-25 Thread Dayue Gao (JIRA)

[ 
https://issues.apache.org/jira/browse/KYLIN-1656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15300145#comment-15300145
 ] 

Dayue Gao commented on KYLIN-1656:
--

Hi Shaofeng,

We chose 500K to increase parallelism and reduce total build time. As we have
a very big cluster, we don't see the problem of pending tasks, but you made a
good point about it.

In your test you had 5000+ mappers, which means the input has 2.5B+ rows. I'm
not sure that's the common case. Most of our cubes in production have ~100M
rows per segment, so 500K leads to 200 mappers, which looks like reasonable
parallelism to me. If the setting were increased to 5M, then 20 mappers for
the "Build Cube" step would be way too few, ultimately leading to a step
timeout.

I do find an input split of 500K rows to be small, but I don't see it as a
problem.
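The mapper-count arithmetic in the comment above (rows per segment divided by rows per mapper) is a ceiling division; a minimal sketch, with an illustrative class name:

```java
/** Illustrative sketch of the mapper-count arithmetic discussed above. */
public class MapperCount {

    /** Number of mappers when each handles at most mapperInputRows of totalRows. */
    public static long mappers(long totalRows, long mapperInputRows) {
        // Ceiling division: any remainder needs one extra mapper.
        return (totalRows + mapperInputRows - 1) / mapperInputRows;
    }

    public static void main(String[] args) {
        // ~100M rows per segment at 500K rows per mapper -> 200 mappers.
        System.out.println(mappers(100_000_000L, 500_000L));
        // The same segment at 5M rows per mapper -> only 20 mappers.
        System.out.println(mappers(100_000_000L, 5_000_000L));
    }
}
```

At 500K rows per mapper, 5000+ mappers indeed implies an input of 2.5B+ rows, matching the figures in the comment.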



> Improve performance of MRv2 engine by making each mapper handles a configured 
> number of records
> ---
>
> Key: KYLIN-1656
> URL: https://issues.apache.org/jira/browse/KYLIN-1656
> Project: Kylin
>  Issue Type: Improvement
>  Components: Job Engine
>Affects Versions: v1.5.0, v1.5.1
>Reporter: Dayue Gao
>Assignee: Dayue Gao
> Fix For: v1.5.3
>
> Attachments: KYLIN-1656.patch
>
>
> In the current version of the MRv2 build engine, each mapper handles one
> block of the flat hive table (stored in sequence files). This has two major
> problems:
> # It's difficult for the user to control the parallelism of mappers for each
> cube. The user can change "dfs.block.size" in kylin_hive_conf.xml, however
> it's a global configuration and cannot be overridden using
> "override_kylin_properties" introduced in
> [KYLIN-1534|https://issues.apache.org/jira/browse/KYLIN-1534].
> # Mapper execution may skew due to a skewed distribution of record counts
> across blocks.
> The second is the more severe problem, since the FactDistinctColumn and
> InMemCubing steps of MRv2 are very CPU-intensive in the map task. To give you
> a sense of how bad it is, one of our cubes' FactDistinctColumnStep takes
> ~100min in total with an average mapper time of only 11min. This is because
> several skewed map tasks handled 10x more records than the average map task.
> And the InMemCubing steps failed because the skewed map tasks hit
> "mapred.task.timeout".
> To avoid the skew, *we'd better make each mapper handle a configurable number
> of records instead of one sequence file block.* The way we achieved this is
> to add a `RedistributeFlatHiveTableStep` right after "FlatHiveTableStep".
> Here's what RedistributeFlatHiveTableStep does:
> 1. Run a {{select count(1) from intermediate_table}} to determine the
> `input_rowcount` of this build.
> 2. Run an {{insert overwrite table intermediate_table select * from
> intermediate_table distribute by rand()}} to evenly distribute records to
> reducers.
> The number of reducers is specified as "input_rowcount / mapper_input_rows",
> where `mapper_input_rows` is a new parameter that lets the user specify how
> many records each mapper should handle. Since each reducer writes out its
> records into one file, we're guaranteed that after
> RedistributeFlatHiveTableStep, each sequence file of the FlatHiveTable
> contains around mapper_input_rows records. And since each mapper of the
> follow-up job handles one block of a sequence file, it won't handle more than
> mapper_input_rows.
> The added RedistributeFlatHiveTableStep usually takes a small amount of time
> compared to other steps, but the benefit it brings is remarkable. Here's the
> performance improvement we saw:
> || cube || FactDistinctColumn before || RedistributeFlatHiveTableStep || FactDistinctColumn after ||
> | case#1 | 51.78min | 8.40min | 13.06min |
> | case#2 | 95.65min | 2.46min | 26.37min |
> And since mapper_input_rows is a kylin configuration, the user can override
> it for each cube.





[jira] [Commented] (KYLIN-1677) Distribute source data by certain columns when creating flat table

2016-05-25 Thread Dayue Gao (JIRA)

[ 
https://issues.apache.org/jira/browse/KYLIN-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15300164#comment-15300164
 ] 

Dayue Gao commented on KYLIN-1677:
--

Thanks for the reply. I'll test the master branch on a Hive view tomorrow to
see how it performs.

Our internal version is still using KYLIN-1656, and from the performance
numbers given in 1656, the cost of RedistributeFlatHiveTableStep is usually
negligible.

> Distribute source data by certain columns when creating flat table
> --
>
> Key: KYLIN-1677
> URL: https://issues.apache.org/jira/browse/KYLIN-1677
> Project: Kylin
>  Issue Type: Improvement
>  Components: Job Engine
>Reporter: Shaofeng SHI
>Assignee: Shaofeng SHI
> Fix For: v1.5.3
>
>
> Inspired by KYLIN-1656, Kylin can distribute the source data by certain
> columns when creating the flat Hive table. The data assigned to a mapper will
> then have more similarity, more aggregation can happen on the mapper side,
> and less shuffle and reduce work is needed.
> Columns that can be used for the distribution include: ultra-high-cardinality
> columns, mandatory columns, the partition date/time column, etc.





[jira] [Updated] (KYLIN-1752) Add an option to fail cube build job when source table is empty

2016-06-05 Thread Dayue Gao (JIRA)

 [ 
https://issues.apache.org/jira/browse/KYLIN-1752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dayue Gao updated KYLIN-1752:
-
Description: For non-incremental build cube, it's valuable to be able to 
fail the build job as long as the source table is empty. Otherwise, a mistake 
in upstream ETL which results in empty source table will lead to an empty cube. 
Often in this situation, user wants to still be able to query history data in 
the cube before they fix their ETL and rebuild the cube.  (was: For 
non-incremental build cube, it's valuable to be able to fail the build job as 
long as the source table is empty. Otherwise, a mistake in upper ETL which 
results in empty source table will lead to an empty cube. Often in this 
situation, user wants to still be able to query history data in the cube before 
they fix their ETL and rebuild the cube.)

> Add an option to fail cube build job when source table is empty
> ---
>
> Key: KYLIN-1752
> URL: https://issues.apache.org/jira/browse/KYLIN-1752
> Project: Kylin
>  Issue Type: New Feature
>  Components: Job Engine
>Affects Versions: v1.5.2
>Reporter: Dayue Gao
>Assignee: Dayue Gao
>Priority: Trivial
> Attachments: KYLIN-1752.patch
>
>
> For a non-incrementally built cube, it's valuable to be able to fail the
> build job when the source table is empty. Otherwise, a mistake in the
> upstream ETL that results in an empty source table will lead to an empty
> cube. In this situation, users often still want to be able to query the
> historical data in the cube until they fix their ETL and rebuild the cube.





[jira] [Updated] (KYLIN-1752) Add an option to fail cube build job when source table is empty

2016-06-05 Thread Dayue Gao (JIRA)

 [ 
https://issues.apache.org/jira/browse/KYLIN-1752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dayue Gao updated KYLIN-1752:
-
Attachment: KYLIN-1752.patch

Here's the patch, tested internally.

[~Shaofengshi] Could you please take a look at this?

> Add an option to fail cube build job when source table is empty
> ---
>
> Key: KYLIN-1752
> URL: https://issues.apache.org/jira/browse/KYLIN-1752
> Project: Kylin
>  Issue Type: New Feature
>  Components: Job Engine
>Affects Versions: v1.5.2
>Reporter: Dayue Gao
>Assignee: Dayue Gao
>Priority: Trivial
> Attachments: KYLIN-1752.patch
>
>
> For a non-incrementally built cube, it's valuable to be able to fail the
> build job when the source table is empty. Otherwise, a mistake in the upper
> ETL that results in an empty source table will lead to an empty cube. In
> this situation, users often still want to be able to query the historical
> data in the cube until they fix their ETL and rebuild the cube.





[jira] [Commented] (KYLIN-1706) Allow cube to override MR job configuration by properties

2016-06-04 Thread Dayue Gao (JIRA)

[ 
https://issues.apache.org/jira/browse/KYLIN-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15315482#comment-15315482
 ] 

Dayue Gao commented on KYLIN-1706:
--

This is one of the small but very helpful improvements I'd like to have in
Kylin. There have been times when I wanted to adjust the mapper/reducer heap
size and "mapreduce.task.io.sort.mb" for a specific cube to accelerate
building. That becomes feasible with this jira. Thanks for bringing it up!

In addition, I'd like to note that this jira is not sufficient to support
cube/project level queue isolation. To the best of my knowledge, there are
still two problems to deal with:
# This jira only allows the user to override the MR queue configuration in
kylin_job_conf.xml. However, the queue configuration also appears in
kylin_hive_conf.xml, and we should address that too.
# A queue can have ACLs specifying who may submit applications to it.
Therefore, Kylin needs to know not only which queue to submit the job to, but
also which hadoop user to impersonate.

> Allow cube to override MR job configuration by properties
> -
>
> Key: KYLIN-1706
> URL: https://issues.apache.org/jira/browse/KYLIN-1706
> Project: Kylin
>  Issue Type: Improvement
>Reporter: liyang
>Assignee: liyang
> Fix For: v1.5.3
>
>
> Currently a cube can specify its MR job configuration via a job_conf.xml
> file under conf/. This is still not sufficient: for example, to specify 50+
> different job queues, a user would have to maintain 50+ different
> job_conf.xml files.
> By allowing config override from kylin properties, the 50+ job queue case
> becomes a lot easier.





[jira] [Commented] (KYLIN-1656) Improve performance of MRv2 engine by making each mapper handles a configured number of records

2016-05-31 Thread Dayue Gao (JIRA)

[ 
https://issues.apache.org/jira/browse/KYLIN-1656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15309184#comment-15309184
 ] 

Dayue Gao commented on KYLIN-1656:
--

Hi Shaofeng,

{quote}
did you observe the source file of the intermediate file size in your side?
{quote}
Yes, the intermediate file size is indeed very small.

{quote}
My concern is it may generate many small files on HDFS, adding NN's memory 
footprint.
{quote}
It would be a very good concern if these files were left in HDFS "forever".
But since the files of the intermediate table are garbage collected as soon as
the build succeeds, I don't think it's a big issue.

What's your opinion?

> Improve performance of MRv2 engine by making each mapper handles a configured 
> number of records
> ---
>
> Key: KYLIN-1656
> URL: https://issues.apache.org/jira/browse/KYLIN-1656
> Project: Kylin
>  Issue Type: Improvement
>  Components: Job Engine
>Affects Versions: v1.5.0, v1.5.1
>Reporter: Dayue Gao
>Assignee: Dayue Gao
> Fix For: v1.5.3
>
> Attachments: KYLIN-1656.patch
>
>
> In the current version of the MRv2 build engine, each mapper handles one
> block of the flat hive table (stored in sequence files). This has two major
> problems:
> # It's difficult for the user to control the parallelism of mappers for each
> cube. The user can change "dfs.block.size" in kylin_hive_conf.xml, however
> it's a global configuration and cannot be overridden using
> "override_kylin_properties" introduced in
> [KYLIN-1534|https://issues.apache.org/jira/browse/KYLIN-1534].
> # Mapper execution may skew due to a skewed distribution of record counts
> across blocks.
> The second is the more severe problem, since the FactDistinctColumn and
> InMemCubing steps of MRv2 are very CPU-intensive in the map task. To give you
> a sense of how bad it is, one of our cubes' FactDistinctColumnStep takes
> ~100min in total with an average mapper time of only 11min. This is because
> several skewed map tasks handled 10x more records than the average map task.
> And the InMemCubing steps failed because the skewed map tasks hit
> "mapred.task.timeout".
> To avoid the skew, *we'd better make each mapper handle a configurable number
> of records instead of one sequence file block.* The way we achieved this is
> to add a `RedistributeFlatHiveTableStep` right after "FlatHiveTableStep".
> Here's what RedistributeFlatHiveTableStep does:
> 1. Run a {{select count(1) from intermediate_table}} to determine the
> `input_rowcount` of this build.
> 2. Run an {{insert overwrite table intermediate_table select * from
> intermediate_table distribute by rand()}} to evenly distribute records to
> reducers.
> The number of reducers is specified as "input_rowcount / mapper_input_rows",
> where `mapper_input_rows` is a new parameter that lets the user specify how
> many records each mapper should handle. Since each reducer writes out its
> records into one file, we're guaranteed that after
> RedistributeFlatHiveTableStep, each sequence file of the FlatHiveTable
> contains around mapper_input_rows records. And since each mapper of the
> follow-up job handles one block of a sequence file, it won't handle more than
> mapper_input_rows.
> The added RedistributeFlatHiveTableStep usually takes a small amount of time
> compared to other steps, but the benefit it brings is remarkable. Here's the
> performance improvement we saw:
> || cube || FactDistinctColumn before || RedistributeFlatHiveTableStep || FactDistinctColumn after ||
> | case#1 | 51.78min | 8.40min | 13.06min |
> | case#2 | 95.65min | 2.46min | 26.37min |
> And since mapper_input_rows is a kylin configuration, the user can override
> it for each cube.





[jira] [Created] (KYLIN-1752) Add an option to fail cube build job when source table is empty

2016-05-31 Thread Dayue Gao (JIRA)
Dayue Gao created KYLIN-1752:


 Summary: Add an option to fail cube build job when source table is 
empty
 Key: KYLIN-1752
 URL: https://issues.apache.org/jira/browse/KYLIN-1752
 Project: Kylin
  Issue Type: New Feature
  Components: Job Engine
Affects Versions: v1.5.2
Reporter: Dayue Gao
Assignee: Dayue Gao
Priority: Trivial


For a non-incrementally built cube, it's valuable to be able to fail the build
job when the source table is empty. Otherwise, a mistake in the upper ETL that
results in an empty source table will lead to an empty cube. In this
situation, users often still want to be able to query the historical data in
the cube until they fix their ETL and rebuild the cube.





[jira] [Commented] (KYLIN-1657) Add new configuration kylin.job.mapreduce.min.reducer.number

2016-05-26 Thread Dayue Gao (JIRA)

[ 
https://issues.apache.org/jira/browse/KYLIN-1657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15303443#comment-15303443
 ] 

Dayue Gao commented on KYLIN-1657:
--

Can we merge this?

> Add new configuration kylin.job.mapreduce.min.reducer.number
> 
>
> Key: KYLIN-1657
> URL: https://issues.apache.org/jira/browse/KYLIN-1657
> Project: Kylin
>  Issue Type: Improvement
>  Components: Job Engine
>Affects Versions: v1.5.1
>Reporter: Dayue Gao
>Assignee: Dayue Gao
>Priority: Minor
> Attachments: KYLIN-1657.patch
>
>
> We have "kylin.job.mapreduce.max.reducer.number" to limit the max number of
> reducers for cubing jobs, but the min reducer number is hardcoded to 1. We
> should make it configurable as well; this could be helpful in some
> circumstances.





[jira] [Commented] (KYLIN-1694) make multiply coefficient configurable when estimating cuboid size

2016-05-26 Thread Dayue Gao (JIRA)

[ 
https://issues.apache.org/jira/browse/KYLIN-1694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15303446#comment-15303446
 ] 

Dayue Gao commented on KYLIN-1694:
--

can we merge this?

> make multiply coefficient configurable when estimating cuboid size
> --
>
> Key: KYLIN-1694
> URL: https://issues.apache.org/jira/browse/KYLIN-1694
> Project: Kylin
>  Issue Type: Bug
>  Components: Job Engine
>Affects Versions: v1.5.0, v1.5.1
>Reporter: kangkaisen
>Assignee: Dong Li
> Attachments: KYLIN-1694.patch
>
>
> In the current version of the MRv2 build engine, when estimating cuboid size, 
> CubeStatsReader currently multiplies the storage size estimation by 0.05 if 
> the cube is memory hungry, and by 0.25 if it is not.
> This has one major problem: the default multiply coefficients are too small, 
> which makes the estimated cuboid size much less than the actual cuboid size. 
> As a result, both the number of HBase regions and the number of CubeHFileJob 
> reducers are too small. Obviously, the current method makes the CubeHFileJob 
> step much slower.
> After we removed the default multiply coefficient, the CubeHFileJob step 
> became much faster.
> We'd better make the multiply coefficient configurable, which would be more 
> user friendly.
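The proposed change amounts to replacing the hard-coded factors with a configurable one, roughly as in this sketch (function and parameter names are illustrative, not the actual CubeStatsReader API):

```python
def estimate_cuboid_size(raw_size_mb, memory_hungry, coefficient=None):
    # Illustrative sketch: fall back to the current hard-coded factors
    # (0.05 memory-hungry / 0.25 otherwise) when no coefficient is configured.
    if coefficient is None:
        coefficient = 0.05 if memory_hungry else 0.25
    return raw_size_mb * coefficient
```

A larger configured coefficient yields a larger size estimate, and hence more HBase regions and more CubeHFileJob reducers.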





[jira] [Commented] (KYLIN-1656) Improve performance of MRv2 engine by making each mapper handles a configured number of records

2016-06-01 Thread Dayue Gao (JIRA)

[ 
https://issues.apache.org/jira/browse/KYLIN-1656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15309376#comment-15309376
 ] 

Dayue Gao commented on KYLIN-1656:
--

no problem :-)

> Improve performance of MRv2 engine by making each mapper handles a configured 
> number of records
> ---
>
> Key: KYLIN-1656
> URL: https://issues.apache.org/jira/browse/KYLIN-1656
> Project: Kylin
>  Issue Type: Improvement
>  Components: Job Engine
>Affects Versions: v1.5.0, v1.5.1
>Reporter: Dayue Gao
>Assignee: Dayue Gao
> Fix For: v1.5.3
>
> Attachments: KYLIN-1656.patch
>
>
> In the current version of the MRv2 build engine, each mapper handles one 
> block of the flat Hive table (stored as sequence files). This has two major 
> problems:
> # It's difficult for users to control the parallelism of mappers for each 
> cube. Users can change "dfs.block.size" in kylin_hive_conf.xml, however it's 
> a global configuration and cannot be overridden using 
> "override_kylin_properties" introduced in 
> [KYLIN-1534|https://issues.apache.org/jira/browse/KYLIN-1534].
> # Mapper execution may be skewed due to a skewed distribution of record 
> counts across blocks. This is the more severe problem, since the 
> FactDistinctColumn and InMemCubing steps of MRv2 are very CPU intensive in 
> the map task. To give you a sense of how bad it is, one of our cubes' 
> FactDistinctColumnStep takes ~100min in total with an average mapper time of 
> only 11min. This is because several skewed map tasks handled 10x more 
> records than the average map task. And the InMemCubing step failed because 
> the skewed mapper tasks hit "mapred.task.timeout".
> To avoid skew, *we'd better make each mapper handle a configurable number of 
> records instead of one sequence file block.* The way we achieved this is to 
> add a `RedistributeFlatHiveTableStep` right after "FlatHiveTableStep".
> Here's what RedistributeFlatHiveTableStep does:
> 1. Run a {{select count(1) from intermediate_table}} to determine the 
> `input_rowcount` of this build.
> 2. Run an {{insert overwrite table intermediate_table select * from 
> intermediate_table distribute by rand()}} to evenly distribute records to 
> reducers.
> The number of reducers is specified as "input_rowcount / mapper_input_rows", 
> where `mapper_input_rows` is a new parameter that lets users specify how 
> many records each mapper should handle. Since each reducer writes its 
> records into one file, we're guaranteed that after 
> RedistributeFlatHiveTableStep, each sequence file of the FlatHiveTable 
> contains around mapper_input_rows records. And since the follow-up job's 
> mappers handle one block of each sequence file, they won't handle more than 
> mapper_input_rows.
> The added RedistributeFlatHiveTableStep usually takes a small amount of time 
> compared to other steps, but the benefit it brings is remarkable. Here's the 
> performance improvement we saw:
> || cube || FactDistinctColumn before || RedistributeFlatHiveTableStep || 
> FactDistinctColumn after ||
> | case#1 | 51.78min | 8.40min | 13.06min |
> | case#2 | 95.65min | 2.46min | 26.37min |
> And since mapper_input_rows is a Kylin configuration, users can override it 
> for each cube.





[jira] [Commented] (KYLIN-1694) make multiply coefficient configurable when estimating cuboid size

2016-05-16 Thread Dayue Gao (JIRA)

[ 
https://issues.apache.org/jira/browse/KYLIN-1694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15284236#comment-15284236
 ] 

Dayue Gao commented on KYLIN-1694:
--

Hi Shaofeng, we have cherry-picked that patch and verified that this issue is 
not related to it.

> make multiply coefficient configurable when estimating cuboid size
> --
>
> Key: KYLIN-1694
> URL: https://issues.apache.org/jira/browse/KYLIN-1694
> Project: Kylin
>  Issue Type: Bug
>  Components: Job Engine
>Affects Versions: v1.5.0, v1.5.1
>Reporter: kangkaisen
>Assignee: Dong Li
>
> In the current version of the MRv2 build engine, when estimating cuboid size, 
> CubeStatsReader currently multiplies the storage size estimation by 0.05 if 
> the cube is memory hungry, and by 0.25 if it is not.
> This has one major problem: the default multiply coefficients are too small, 
> which makes the estimated cuboid size much less than the actual cuboid size. 
> As a result, both the number of HBase regions and the number of CubeHFileJob 
> reducers are too small. Obviously, the current method makes the CubeHFileJob 
> step much slower.
> After we removed the default multiply coefficient, the CubeHFileJob step 
> became much faster.
> We'd better make the multiply coefficient configurable, which would be more 
> user friendly.





[jira] [Commented] (KYLIN-1323) Improve performance of converting data to hfile

2016-05-05 Thread Dayue Gao (JIRA)

[ 
https://issues.apache.org/jira/browse/KYLIN-1323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15272072#comment-15272072
 ] 

Dayue Gao commented on KYLIN-1323:
--

Hi [~Shaofengshi], what's the progress of this on 1.5.x?

> Improve performance of converting data to hfile
> ---
>
> Key: KYLIN-1323
> URL: https://issues.apache.org/jira/browse/KYLIN-1323
> Project: Kylin
>  Issue Type: Improvement
>  Components: Job Engine
>Affects Versions: v1.2
>Reporter: Yerui Sun
>Assignee: Shaofeng SHI
> Fix For: v1.4.0, v1.3.0
>
> Attachments: KYLIN-1323-1.x-staging.2.patch, 
> KYLIN-1323-1.x-staging.patch, KYLIN-1323-2.x-staging.2.patch
>
>
> Suppose we get 100GB of data after cuboid building, with a setting of 10GB 
> per region. Currently, 10 split keys are calculated, 10 regions are created, 
> and 10 reducers are used in the 'convert to hfile' step.
> With this optimization, we could calculate 100 (or more) split keys and use 
> all of them in the 'convert to hfile' step, but sample 10 keys among them to 
> create regions. The result is still 10 regions created, but 100 reducers 
> used in the 'convert to hfile' step. Of course, 100 hfiles are also created, 
> loading 10 files per region. That should be fine and doesn't affect query 
> performance dramatically.
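Under the numbers in the example above, the sampling step could be sketched as follows (function and variable names are illustrative, not the actual Kylin code):

```python
def sample_region_split_keys(reducer_split_keys, num_regions):
    # Illustrative sketch: keep all fine-grained split keys for the
    # 'convert to hfile' reducers, but take every Nth one as an actual
    # HBase region split key.
    step = len(reducer_split_keys) // num_regions
    return reducer_split_keys[step - 1::step]

# 100 reducer split keys sampled down to 10 region split keys.
reducer_keys = ["key%03d" % i for i in range(100)]
region_keys = sample_region_split_keys(reducer_keys, 10)
assert len(region_keys) == 10
```

Each region then receives the hfiles of the 10 reducers whose key ranges fall inside it.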





[jira] [Created] (KYLIN-1657) Add new configuration kylin.job.mapreduce.min.reducer.number

2016-05-05 Thread Dayue Gao (JIRA)
Dayue Gao created KYLIN-1657:


 Summary: Add new configuration 
kylin.job.mapreduce.min.reducer.number
 Key: KYLIN-1657
 URL: https://issues.apache.org/jira/browse/KYLIN-1657
 Project: Kylin
  Issue Type: Improvement
  Components: Job Engine
Affects Versions: v1.5.1
Reporter: Dayue Gao
Assignee: Dayue Gao
Priority: Minor


We have "kylin.job.mapreduce.max.reducer.number" to limit the maximum number of 
reducers for cubing jobs, but the minimum number of reducers is hard-coded to 1. 
We should make it configurable as well, which could be helpful in some 
circumstances.





[jira] [Updated] (KYLIN-1657) Add new configuration kylin.job.mapreduce.min.reducer.number

2016-05-05 Thread Dayue Gao (JIRA)

 [ 
https://issues.apache.org/jira/browse/KYLIN-1657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dayue Gao updated KYLIN-1657:
-
Attachment: KYLIN-1657.patch

Added a new kylin configuration "kylin.job.mapreduce.min.reducer.number" which 
defaults to "1".

> Add new configuration kylin.job.mapreduce.min.reducer.number
> 
>
> Key: KYLIN-1657
> URL: https://issues.apache.org/jira/browse/KYLIN-1657
> Project: Kylin
>  Issue Type: Improvement
>  Components: Job Engine
>Affects Versions: v1.5.1
>Reporter: Dayue Gao
>Assignee: Dayue Gao
>Priority: Minor
> Attachments: KYLIN-1657.patch
>
>
> We have "kylin.job.mapreduce.max.reducer.number" to limit the maximum number 
> of reducers for cubing jobs, but the minimum number of reducers is hard-coded 
> to 1. We should make it configurable as well, which could be helpful in some 
> circumstances.





[jira] [Created] (KYLIN-1656) Improve performance of MRv2 engine by making each mapper handles a configured number of records

2016-05-04 Thread Dayue Gao (JIRA)
Dayue Gao created KYLIN-1656:


 Summary: Improve performance of MRv2 engine by making each mapper 
handles a configured number of records
 Key: KYLIN-1656
 URL: https://issues.apache.org/jira/browse/KYLIN-1656
 Project: Kylin
  Issue Type: Improvement
  Components: Job Engine
Affects Versions: v1.5.1, v1.5.0
Reporter: Dayue Gao
Assignee: Dayue Gao


In the current version of the MRv2 build engine, each mapper handles one block 
of the flat Hive table (stored as sequence files). This has two major problems:

# It's difficult for users to control the parallelism of mappers for each cube.
Users can change "dfs.block.size" in kylin_hive_conf.xml, however it's a global 
configuration and cannot be overridden using "override_kylin_properties" 
introduced in [KYLIN-1534|https://issues.apache.org/jira/browse/KYLIN-1534].
# Mapper execution may be skewed due to a skewed distribution of record counts 
across blocks.
This is the more severe problem, since the FactDistinctColumn and InMemCubing 
steps of MRv2 are very CPU intensive in the map task. To give you a sense of 
how bad it is, one of our cubes' FactDistinctColumnStep takes ~100min in total 
with an average mapper time of only 11min. This is because several skewed map 
tasks handled 10x more records than the average map task. And the InMemCubing 
step failed because the skewed mapper tasks hit "mapred.task.timeout".

To avoid skew, *we'd better make each mapper handle a configurable number of 
records instead of one sequence file block.* The way we achieved this is to add 
a `RedistributeFlatHiveTableStep` right after "FlatHiveTableStep".

Here's what RedistributeFlatHiveTableStep does:
1. Run a {{select count(1) from intermediate_table}} to determine the 
`input_rowcount` of this build.

2. Run an {{insert overwrite table intermediate_table select * from 
intermediate_table distribute by rand()}} to evenly distribute records to 
reducers.

The number of reducers is specified as "input_rowcount / mapper_input_rows", 
where `mapper_input_rows` is a new parameter that lets users specify how many 
records each mapper should handle. Since each reducer writes its records into 
one file, we're guaranteed that after RedistributeFlatHiveTableStep, each 
sequence file of the FlatHiveTable contains around mapper_input_rows records. 
And since the follow-up job's mappers handle one block of each sequence file, 
they won't handle more than mapper_input_rows.

The added RedistributeFlatHiveTableStep usually takes a small amount of time 
compared to other steps, but the benefit it brings is remarkable. Here's the 
performance improvement we saw:

|| cube || FactDistinctColumn before || RedistributeFlatHiveTableStep || 
FactDistinctColumn after ||
| case#1 | 51.78min | 8.40min | 13.06min |
| case#2 | 95.65min | 2.46min | 26.37min |

And since mapper_input_rows is a Kylin configuration, users can override it for 
each cube.
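The reducer-count arithmetic of the two steps above can be sketched as follows (the clamping bounds and function name are illustrative assumptions, not the exact Kylin implementation):

```python
def redistribute_reducer_count(input_rowcount, mapper_input_rows,
                               min_reducers=1, max_reducers=500):
    # Step 1 yields input_rowcount via:
    #   select count(1) from intermediate_table
    # Step 2 then runs:
    #   insert overwrite table intermediate_table
    #   select * from intermediate_table distribute by rand()
    # with this many reducers, so each reducer writes one file of roughly
    # mapper_input_rows records.
    needed = (input_rowcount + mapper_input_rows - 1) // mapper_input_rows
    return max(min_reducers, min(max_reducers, needed))

# 1 billion input rows at 5 million rows per mapper -> 200 reducers.
assert redistribute_reducer_count(1_000_000_000, 5_000_000) == 200
```

Because the follow-up job assigns one block per mapper, this caps every mapper's input near mapper_input_rows regardless of how skewed the original blocks were.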






[jira] [Updated] (KYLIN-1656) Improve performance of MRv2 engine by making each mapper handles a configured number of records

2016-05-05 Thread Dayue Gao (JIRA)

 [ 
https://issues.apache.org/jira/browse/KYLIN-1656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dayue Gao updated KYLIN-1656:
-
Attachment: KYLIN-1656.patch

Please review the patch.

It adds a new Kylin configuration named 
"kylin.job.mapreduce.mapper.input.rows", which defaults to "50".

> Improve performance of MRv2 engine by making each mapper handles a configured 
> number of records
> ---
>
> Key: KYLIN-1656
> URL: https://issues.apache.org/jira/browse/KYLIN-1656
> Project: Kylin
>  Issue Type: Improvement
>  Components: Job Engine
>Affects Versions: v1.5.0, v1.5.1
>Reporter: Dayue Gao
>Assignee: Dayue Gao
> Attachments: KYLIN-1656.patch
>
>
> In the current version of the MRv2 build engine, each mapper handles one 
> block of the flat Hive table (stored as sequence files). This has two major 
> problems:
> # It's difficult for users to control the parallelism of mappers for each 
> cube. Users can change "dfs.block.size" in kylin_hive_conf.xml, however it's 
> a global configuration and cannot be overridden using 
> "override_kylin_properties" introduced in 
> [KYLIN-1534|https://issues.apache.org/jira/browse/KYLIN-1534].
> # Mapper execution may be skewed due to a skewed distribution of record 
> counts across blocks. This is the more severe problem, since the 
> FactDistinctColumn and InMemCubing steps of MRv2 are very CPU intensive in 
> the map task. To give you a sense of how bad it is, one of our cubes' 
> FactDistinctColumnStep takes ~100min in total with an average mapper time of 
> only 11min. This is because several skewed map tasks handled 10x more 
> records than the average map task. And the InMemCubing step failed because 
> the skewed mapper tasks hit "mapred.task.timeout".
> To avoid skew, *we'd better make each mapper handle a configurable number of 
> records instead of one sequence file block.* The way we achieved this is to 
> add a `RedistributeFlatHiveTableStep` right after "FlatHiveTableStep".
> Here's what RedistributeFlatHiveTableStep does:
> 1. Run a {{select count(1) from intermediate_table}} to determine the 
> `input_rowcount` of this build.
> 2. Run an {{insert overwrite table intermediate_table select * from 
> intermediate_table distribute by rand()}} to evenly distribute records to 
> reducers.
> The number of reducers is specified as "input_rowcount / mapper_input_rows", 
> where `mapper_input_rows` is a new parameter that lets users specify how 
> many records each mapper should handle. Since each reducer writes its 
> records into one file, we're guaranteed that after 
> RedistributeFlatHiveTableStep, each sequence file of the FlatHiveTable 
> contains around mapper_input_rows records. And since the follow-up job's 
> mappers handle one block of each sequence file, they won't handle more than 
> mapper_input_rows.
> The added RedistributeFlatHiveTableStep usually takes a small amount of time 
> compared to other steps, but the benefit it brings is remarkable. Here's the 
> performance improvement we saw:
> || cube || FactDistinctColumn before || RedistributeFlatHiveTableStep || 
> FactDistinctColumn after ||
> | case#1 | 51.78min | 8.40min | 13.06min |
> | case#2 | 95.65min | 2.46min | 26.37min |
> And since mapper_input_rows is a Kylin configuration, users can override it 
> for each cube.





[jira] [Updated] (KYLIN-1662) tableName got truncated during request mapping for /tables/tableName

2016-05-05 Thread Dayue Gao (JIRA)

 [ 
https://issues.apache.org/jira/browse/KYLIN-1662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dayue Gao updated KYLIN-1662:
-
Attachment: KYLIN-1662.patch

attach patch.

> tableName got truncated during request mapping for /tables/tableName
> 
>
> Key: KYLIN-1662
> URL: https://issues.apache.org/jira/browse/KYLIN-1662
> Project: Kylin
>  Issue Type: Bug
>  Components: REST Service
>Affects Versions: v1.5.1
>Reporter: Dayue Gao
>Assignee: Dayue Gao
> Attachments: KYLIN-1662.patch
>
>
> A request to '/tables/default.kylin_sales' for table metadata returns an 
> empty string. This is because Spring by default treats ".kylin_sales" as a 
> file extension, so the path variable {{tableName}} receives the value 
> "default" rather than "default.kylin_sales". As a result, Kylin searches 
> metadata for table "default.default".
> An easy fix is to use "/\{tableName:.+\}" in the request mapping, as 
> suggested in 
> http://stackoverflow.com/questions/16332092/spring-mvc-pathvariable-with-dot-is-getting-truncated
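The truncation can be reproduced with a small regex sketch (an illustration of the matching behavior only, not Spring's actual implementation):

```python
import re

# Hypothetical simplification of Spring MVC's default behavior: the trailing
# ".kylin_sales" is treated as a file extension, so the path variable stops
# at the first dot. The "{tableName:.+}" mapping makes the variable greedy.
default_pattern = re.compile(r"^/tables/(?P<tableName>[^/.]+)")  # stops at '.'
fixed_pattern = re.compile(r"^/tables/(?P<tableName>.+)$")       # whole name

path = "/tables/default.kylin_sales"
truncated = default_pattern.match(path).group("tableName")  # "default"
full = fixed_pattern.match(path).group("tableName")         # "default.kylin_sales"
```

The `:.+` suffix in the mapping plays the role of the greedy pattern here, so the dot-containing table name survives intact.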





[jira] [Commented] (KYLIN-1656) Improve performance of MRv2 engine by making each mapper handles a configured number of records

2016-05-09 Thread Dayue Gao (JIRA)

[ 
https://issues.apache.org/jira/browse/KYLIN-1656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15276344#comment-15276344
 ] 

Dayue Gao commented on KYLIN-1656:
--

Didn't get the time to create the branch, thank you Shaofeng!

> Improve performance of MRv2 engine by making each mapper handles a configured 
> number of records
> ---
>
> Key: KYLIN-1656
> URL: https://issues.apache.org/jira/browse/KYLIN-1656
> Project: Kylin
>  Issue Type: Improvement
>  Components: Job Engine
>Affects Versions: v1.5.0, v1.5.1
>Reporter: Dayue Gao
>Assignee: Dayue Gao
> Attachments: KYLIN-1656.patch
>
>
> In the current version of the MRv2 build engine, each mapper handles one 
> block of the flat Hive table (stored as sequence files). This has two major 
> problems:
> # It's difficult for users to control the parallelism of mappers for each 
> cube. Users can change "dfs.block.size" in kylin_hive_conf.xml, however it's 
> a global configuration and cannot be overridden using 
> "override_kylin_properties" introduced in 
> [KYLIN-1534|https://issues.apache.org/jira/browse/KYLIN-1534].
> # Mapper execution may be skewed due to a skewed distribution of record 
> counts across blocks. This is the more severe problem, since the 
> FactDistinctColumn and InMemCubing steps of MRv2 are very CPU intensive in 
> the map task. To give you a sense of how bad it is, one of our cubes' 
> FactDistinctColumnStep takes ~100min in total with an average mapper time of 
> only 11min. This is because several skewed map tasks handled 10x more 
> records than the average map task. And the InMemCubing step failed because 
> the skewed mapper tasks hit "mapred.task.timeout".
> To avoid skew, *we'd better make each mapper handle a configurable number of 
> records instead of one sequence file block.* The way we achieved this is to 
> add a `RedistributeFlatHiveTableStep` right after "FlatHiveTableStep".
> Here's what RedistributeFlatHiveTableStep does:
> 1. Run a {{select count(1) from intermediate_table}} to determine the 
> `input_rowcount` of this build.
> 2. Run an {{insert overwrite table intermediate_table select * from 
> intermediate_table distribute by rand()}} to evenly distribute records to 
> reducers.
> The number of reducers is specified as "input_rowcount / mapper_input_rows", 
> where `mapper_input_rows` is a new parameter that lets users specify how 
> many records each mapper should handle. Since each reducer writes its 
> records into one file, we're guaranteed that after 
> RedistributeFlatHiveTableStep, each sequence file of the FlatHiveTable 
> contains around mapper_input_rows records. And since the follow-up job's 
> mappers handle one block of each sequence file, they won't handle more than 
> mapper_input_rows.
> The added RedistributeFlatHiveTableStep usually takes a small amount of time 
> compared to other steps, but the benefit it brings is remarkable. Here's the 
> performance improvement we saw:
> || cube || FactDistinctColumn before || RedistributeFlatHiveTableStep || 
> FactDistinctColumn after ||
> | case#1 | 51.78min | 8.40min | 13.06min |
> | case#2 | 95.65min | 2.46min | 26.37min |
> And since mapper_input_rows is a Kylin configuration, users can override it 
> for each cube.





[jira] [Commented] (KYLIN-1898) Upgrade to Avatica 1.8 or higher

2016-07-20 Thread Dayue Gao (JIRA)

[ 
https://issues.apache.org/jira/browse/KYLIN-1898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15385511#comment-15385511
 ] 

Dayue Gao commented on KYLIN-1898:
--

I've relocated kylin-jdbc dependencies in KYLIN-1846; hopefully that will solve 
the problem mentioned.

> Upgrade to Avatica 1.8 or higher
> 
>
> Key: KYLIN-1898
> URL: https://issues.apache.org/jira/browse/KYLIN-1898
> Project: Kylin
>  Issue Type: Bug
>Reporter: Julian Hyde
> Attachments: KYLIN-1898.patch
>
>
> A [stackoverflow 
> question|http://stackoverflow.com/questions/38369871/how-to-install-two-different-version-of-a-specific-package-in-maven]
>  reports problems when mixing Avatica 1.6 (used by Kylin) and Avatica 1.8 
> (used by some unspecified other database). It appears that 1.6 and 1.8 are 
> not compatible, probably due to CALCITE-836 or CALCITE-1213. The solution is 
> for Kylin to upgrade to 1.8 or higher.





[jira] [Created] (KYLIN-1849) add basic search capability at model UI

2016-07-04 Thread Dayue Gao (JIRA)
Dayue Gao created KYLIN-1849:


 Summary: add basic search capability at model UI
 Key: KYLIN-1849
 URL: https://issues.apache.org/jira/browse/KYLIN-1849
 Project: Kylin
  Issue Type: New Feature
  Components: Web 
Affects Versions: v1.5.2
Reporter: Dayue Gao
Assignee: Zhong,Jason


In order to work with dozens of cubes, could we add a search box on the "Model" 
page, just like the one on the "Monitor" page?





[jira] [Created] (KYLIN-1848) Can't sort cubes by any field in Web UI

2016-07-04 Thread Dayue Gao (JIRA)
Dayue Gao created KYLIN-1848:


 Summary: Can't sort cubes by any field in Web UI
 Key: KYLIN-1848
 URL: https://issues.apache.org/jira/browse/KYLIN-1848
 Project: Kylin
  Issue Type: Bug
  Components: Web 
Affects Versions: v1.5.2
Reporter: Dayue Gao
Assignee: Zhong,Jason


In a project containing dozens of cubes, it's helpful to sort cubes by fields 
like "Create Time", "Status", and so on. I tried it today but found it's not 
working. Could we fix it?





[jira] [Updated] (KYLIN-1846) minimize dependencies of JDBC driver

2016-07-04 Thread Dayue Gao (JIRA)

 [ 
https://issues.apache.org/jira/browse/KYLIN-1846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dayue Gao updated KYLIN-1846:
-
Attachment: KYLIN-1846.patch

Uploaded a patch, which was tested and solved several classloading problems 
encountered in our environment.

> minimize dependencies of JDBC driver
> 
>
> Key: KYLIN-1846
> URL: https://issues.apache.org/jira/browse/KYLIN-1846
> Project: Kylin
>  Issue Type: Improvement
>  Components: Driver - JDBC
>Affects Versions: v1.5.2
>Reporter: Dayue Gao
>Assignee: Dayue Gao
> Attachments: KYLIN-1846.patch
>
>
> kylin-jdbc packages many dependencies (calcite-core, guava, jackson, etc.) 
> into an uber jar, which could cause problems when users try to integrate 
> kylin-jdbc into their own applications.
> I suggest making the following changes to packaging:
> # remove calcite-core dependency
> calcite-avatica is sufficient as far as I know.
> # remove guava dependency
> The only place kylin-jdbc uses guava is {{ImmutableList.of(metaResultSet)}} 
> in KylinMeta.java, which can be simply replaced with 
> {{Collections.singletonList(metaResultSet)}}.
> # remove log4j, slf4j-log4j12 dependencies
> As a library, kylin-jdbc [should only depend on 
> slf4j-api|http://slf4j.org/manual.html#libraries]. Which underlying logging 
> framework to use should be a deployment-time choice made by the user. This 
> means we should revert https://issues.apache.org/jira/browse/KYLIN-1160
> # relocate all dependencies to "org.apache.kylin.jdbc.shaded" using 
> maven-shade-plugin
> This includes calcite-avatica, jackson, commons-httpclient and commons-codec. 
> Relocating should help to avoid class version conflicts.
> I'll submit a patch for this; discussions are welcome~





[jira] [Commented] (KYLIN-1846) minimize dependencies of JDBC driver

2016-07-10 Thread Dayue Gao (JIRA)

[ 
https://issues.apache.org/jira/browse/KYLIN-1846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15370115#comment-15370115
 ] 

Dayue Gao commented on KYLIN-1846:
--

[~Shaofengshi]

I committed 9c578e7 to shade httpcomponents.

> minimize dependencies of JDBC driver
> 
>
> Key: KYLIN-1846
> URL: https://issues.apache.org/jira/browse/KYLIN-1846
> Project: Kylin
>  Issue Type: Improvement
>  Components: Driver - JDBC
>Affects Versions: v1.5.2
>Reporter: Dayue Gao
>Assignee: Dayue Gao
> Attachments: KYLIN-1846.patch
>
>
> kylin-jdbc packages many dependencies (calcite-core, guava, jackson, etc.) 
> into an uber jar, which could cause problems when users try to integrate 
> kylin-jdbc into their own applications.
> I suggest making the following changes to packaging:
> # remove calcite-core dependency
> calcite-avatica is sufficient as far as I know.
> # remove guava dependency
> The only place kylin-jdbc uses guava is {{ImmutableList.of(metaResultSet)}} 
> in KylinMeta.java, which can be simply replaced with 
> {{Collections.singletonList(metaResultSet)}}.
> # remove log4j, slf4j-log4j12 dependencies
> As a library, kylin-jdbc [should only depend on 
> slf4j-api|http://slf4j.org/manual.html#libraries]. Which underlying logging 
> framework to use should be a deployment-time choice made by the user. This 
> means we should revert https://issues.apache.org/jira/browse/KYLIN-1160
> # relocate all dependencies to "org.apache.kylin.jdbc.shaded" using 
> maven-shade-plugin
> This includes calcite-avatica, jackson, commons-httpclient and commons-codec. 
> Relocating should help to avoid class version conflicts.
> I'll submit a patch for this; discussions are welcome~





[jira] [Created] (KYLIN-1846) minimize dependencies of JDBC driver

2016-07-04 Thread Dayue Gao (JIRA)
Dayue Gao created KYLIN-1846:


 Summary: minimize dependencies of JDBC driver
 Key: KYLIN-1846
 URL: https://issues.apache.org/jira/browse/KYLIN-1846
 Project: Kylin
  Issue Type: Improvement
  Components: Driver - JDBC
Affects Versions: v1.5.2
Reporter: Dayue Gao
Assignee: Dayue Gao


kylin-jdbc packages many dependencies (calcite-core, guava, jackson, etc.) into 
an uber jar, which could cause problems when users try to integrate kylin-jdbc 
into their own applications.

I suggest making the following changes to packaging:

# remove calcite-core dependency
calcite-avatica is sufficient as far as I know.
# remove guava dependency
The only place kylin-jdbc uses guava is {{ImmutableList.of(metaResultSet)}} in 
KylinMeta.java, which can be simply replaced with 
{{Collections.singletonList(metaResultSet)}}.
# remove log4j, slf4j-log4j12 dependencies
As a library, kylin-jdbc [should only depend on 
slf4j-api|http://slf4j.org/manual.html#libraries]. Which underlying logging 
framework to use should be a deployment-time choice made by the user. This 
means we should revert https://issues.apache.org/jira/browse/KYLIN-1160
# relocate all dependencies to "org.apache.kylin.jdbc.shaded" using 
maven-shade-plugin
This includes calcite-avatica, jackson, commons-httpclient and commons-codec. 
Relocating should help to avoid class version conflicts.

I'll submit a patch for this; discussions are welcome~
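The relocation step (item 4) could look roughly like this maven-shade-plugin fragment. This is a sketch only: the relocation patterns shown are illustrative, and the real patch would cover the full artifact set listed above.

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <configuration>
    <relocations>
      <!-- Illustrative relocations; the actual patch covers avatica,
           jackson, commons-httpclient and commons-codec. -->
      <relocation>
        <pattern>org.apache.calcite.avatica</pattern>
        <shadedPattern>org.apache.kylin.jdbc.shaded.org.apache.calcite.avatica</shadedPattern>
      </relocation>
      <relocation>
        <pattern>com.fasterxml.jackson</pattern>
        <shadedPattern>org.apache.kylin.jdbc.shaded.com.fasterxml.jackson</shadedPattern>
      </relocation>
    </relocations>
  </configuration>
</plugin>
```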





[jira] [Created] (KYLIN-2437) collect number of bytes scanned to query metrics

2017-02-08 Thread Dayue Gao (JIRA)
Dayue Gao created KYLIN-2437:


 Summary: collect number of bytes scanned to query metrics
 Key: KYLIN-2437
 URL: https://issues.apache.org/jira/browse/KYLIN-2437
 Project: Kylin
  Issue Type: Improvement
  Components: Storage - HBase
Affects Versions: v1.6.0
Reporter: Dayue Gao
Assignee: Dayue Gao


Besides scanned row count, it's useful to know how many bytes are scanned from 
HBase to fulfil a query. It is perhaps a better indicator than row count of how 
much pressure a query puts on HBase.





[jira] [Created] (KYLIN-2436) add a configuration knob to disable spilling of aggregation cache

2017-02-07 Thread Dayue Gao (JIRA)
Dayue Gao created KYLIN-2436:


 Summary: add a configuration knob to disable spilling of 
aggregation cache
 Key: KYLIN-2436
 URL: https://issues.apache.org/jira/browse/KYLIN-2436
 Project: Kylin
  Issue Type: Improvement
  Components: Storage - HBase
Affects Versions: v1.6.0
Reporter: Dayue Gao
Assignee: Dayue Gao


Kylin's aggregation operator can spill intermediate results to disk when its 
estimated memory usage exceeds some threshold (kylin.query.coprocessor.mem.gb 
to be specific). While it's a useful feature in general to prevent a 
RegionServer from OOM, there are times when aborting this kind of memory-hungry 
query immediately is a more suitable choice for users.

To accommodate this requirement, I suggest adding a new configuration named 
"kylin.storage.hbase.coprocessor-spill-enabled". The default value would be 
true, which keeps the same behavior as before. If changed to false, a query 
that uses more aggregation memory than the threshold will fail immediately.
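The proposed behavior can be summarized with this small decision sketch (function and argument names are hypothetical; only the configuration name comes from the proposal above):

```python
def aggregation_cache_action(estimated_mem_gb, threshold_gb, spill_enabled=True):
    # Below the threshold, keep aggregating in memory. Above it, either
    # spill to disk (current behavior, spill enabled) or fail the query
    # immediately (proposed behavior when the knob is set to false).
    if estimated_mem_gb <= threshold_gb:
        return "in-memory"
    return "spill" if spill_enabled else "abort"
```

Failing fast frees the RegionServer from servicing a query the user would rather cancel than wait on.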





[jira] [Resolved] (KYLIN-2058) Make Kylin more resilient to bad queries

2017-02-07 Thread Dayue Gao (JIRA)

 [ 
https://issues.apache.org/jira/browse/KYLIN-2058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dayue Gao resolved KYLIN-2058.
--
   Resolution: Fixed
Fix Version/s: v1.6.0

There is still work to be done to defend Kylin against bad queries. However, 
since 1.6.0 has been released, I'll file new JIRAs to continue.

> Make Kylin more resilient to bad queries
> 
>
> Key: KYLIN-2058
> URL: https://issues.apache.org/jira/browse/KYLIN-2058
> Project: Kylin
>  Issue Type: Improvement
>  Components: Query Engine, Storage - HBase
>Affects Versions: v1.6.0
>Reporter: Dayue Gao
>Assignee: Dayue Gao
> Fix For: v1.6.0
>
>
> Bad/big queries are a huge threat to the overall performance and stability 
> of Kylin. We occasionally saw some of these queries either causing heavy GC 
> activity or crashing regionservers. I'd like to start a series of work to 
> make Kylin more resilient to bad queries.
> This is an umbrella JIRA for the related work.





[jira] [Closed] (KYLIN-1455) HBase ScanMetrics are not properly logged in query log

2017-02-07 Thread Dayue Gao (JIRA)

 [ 
https://issues.apache.org/jira/browse/KYLIN-1455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dayue Gao closed KYLIN-1455.

Resolution: Won't Fix

> HBase ScanMetrics are not properly logged in query log
> --
>
> Key: KYLIN-1455
> URL: https://issues.apache.org/jira/browse/KYLIN-1455
> Project: Kylin
>  Issue Type: Improvement
>  Components: Storage - HBase
>Affects Versions: v1.2
>Reporter: Dayue Gao
>Assignee: Dayue Gao
>Priority: Minor
> Attachments: KYLIN-1455-1.x-staging.patch
>
>
> HBase's ScanMetrics provide users valuable information when troubleshooting 
> query performance issues. But I found it was not properly logged, sometimes 
> missing from the log, sometimes duplicated.
> Below is an example of a duplicated scan log. This is because the 
> {{CubeSegmentTupleIterator#closeScanner()}} method is invoked twice, 
> first in hasNext(), then in close().
> {noformat}
> [http-bio-8080-exec-8]:[2016-02-26 
> 17:31:50,227][DEBUG][org.apache.kylin.storage.hbase.CubeSegmentTupleIterator.closeScanner(CubeSegmentTupleIterator.java:146)]
>  - Scan 
> {"loadColumnFamiliesOnDemand":null,"filter":"FuzzyRowFilter{fuzzyKeysData={\\x00\\x00\\x00\\x00\\x00\\x00\\x09\\xC7\\x00\\x00\\x00\\x17\\x00\\x00\\x17\\x00:\\xFF\\xFF\\xFF\\xFF\\xFF\\xFF\\xFF\\xFF\\x00\\x00\\x00\\xFF\\x00\\x00\\xFF\\x00}},
>  
> ","startRow":"\\x00\\x00\\x00\\x00\\x00\\x00\\x09\\xC7\\x00\\x00\\x00\\x17\\x00\\x00\\x17\\x00","stopRow":"\\x00\\x00\\x00\\x00\\x00\\x00\\x09\\xC7\\x08\\x8F\\xFF\\x17\\xFF\\xFF\\x17\\xFF\\x00","batch":-1,"cacheBlocks":true,"totalColumns":1,"maxResultSize":5242880,"families":{"F1":["M"]},"caching":1024,"maxVersions":1,"timeRange":[0,9223372036854775807]}
> [http-bio-8080-exec-8]:[2016-02-26 
> 17:31:50,229][DEBUG][org.apache.kylin.storage.hbase.CubeSegmentTupleIterator.closeScanner(CubeSegmentTupleIterator.java:150)]
>  - HBase Metrics: count=17357, ms=3194, bytes=905594, remote_bytes=905594, 
> regions=1, not_serving_region=0, rpc=19, rpc_retries=0, remote_rpc=19, 
> remote_rpc_retries=0
> [http-bio-8080-exec-8]:[2016-02-26 
> 17:32:58,016][DEBUG][org.apache.kylin.storage.hbase.CubeSegmentTupleIterator.closeScanner(CubeSegmentTupleIterator.java:146)]
>  - Scan 
> {"loadColumnFamiliesOnDemand":null,"filter":"FuzzyRowFilter{fuzzyKeysData={\\x00\\x00\\x00\\x00\\x00\\x00\\x09\\xC7\\x00\\x00\\x00\\x17\\x00\\x00\\x17\\x00:\\xFF\\xFF\\xFF\\xFF\\xFF\\xFF\\xFF\\xFF\\x00\\x00\\x00\\xFF\\x00\\x00\\xFF\\x00}},
>  
> ","startRow":"\\x00\\x00\\x00\\x00\\x00\\x00\\x09\\xC7\\x00\\x00\\x00\\x17\\x00\\x00\\x17\\x00","stopRow":"\\x00\\x00\\x00\\x00\\x00\\x00\\x09\\xC7\\x08\\x8F\\xFF\\x17\\xFF\\xFF\\x17\\xFF\\x00","batch":-1,"cacheBlocks":true,"totalColumns":1,"maxResultSize":5242880,"families":{"F1":["M"]},"caching":1024,"maxVersions":1,"timeRange":[0,9223372036854775807]}
> [http-bio-8080-exec-8]:[2016-02-26 
> 17:33:04,443][DEBUG][org.apache.kylin.storage.hbase.CubeSegmentTupleIterator.closeScanner(CubeSegmentTupleIterator.java:150)]
>  - HBase Metrics: count=17357, ms=3194, bytes=905594, remote_bytes=905594, 
> regions=1, not_serving_region=0, rpc=19, rpc_retries=0, remote_rpc=19, 
> remote_rpc_retries=0
> {noformat}
> And sometimes ScanMetrics is missing from the log, as shown below. I think 
> this is because {{CubeSegmentTupleIterator#closeScanner()}} tries to get 
> ScanMetrics before closing the current ResultScanner. After looking into the 
> HBase client source, I found that ScanMetrics is not written out until the 
> scanner is closed or exhausted (no cache entries). So it'd be better to get 
> ScanMetrics after closing the scanner.
> {noformat}
> [http-bio-8080-exec-2]:[2016-02-26 
> 17:18:43,928][DEBUG][org.apache.kylin.storage.hbase.CubeSegmentTupleIterator.closeScanner(CubeSegmentTupleIterator.java:146)]
>  - Scan 
> {"loadColumnFamiliesOnDemand":null,"filter":"FuzzyRowFilter{fuzzyKeysData={\\x00\\x00\\x00\\x00\\x00\\x00\\x09\\xC7\\x00\\x00\\x00\\x17\\x00\\x00\\x17\\x00:\\xFF\\xFF\\xFF\\xFF\\xFF\\xFF\\xFF\\xFF\\x00\\x00\\x00\\xFF\\x00\\x00\\xFF\\x00}},
>  
> ","startRow":"\\x00\\x00\\x00\\x00\\x00\\x00\\x09\\xC7\\x01\\x1C\\x04\\x0Cx\\x03\\x08Y","stopRow":"\\x00\\x00\\x00\\x00\\x00\\x00\\x09\\xC7\\x08\\x8F\\xFF\\x17\\xFF\\xFF\\x17\\xFF\\x00","batch":-1,"cacheBlocks":true,"totalColumns":1,"maxResultSize":5242880,"families":{"F1":["M"]},"caching":1024,"maxVersions":1,"timeRange":[0,9223372036854775807]}
> [http-bio-8080-exec-2]:[2016-02-26 
> 17:19:38,228][INFO][org.apache.kylin.rest.service.QueryService.logQuery(QueryService.java:242)]
>  -
> ==[QUERY]===
> {noformat}
> This should be easy to fix, I will submit a patch for this.
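The intended fix — close the scanner first, then read the metrics — can be sketched as follows. The HBase client types are simplified to a plain interface here; MetricsScanner and metricsSnapshot() are illustrative stand-ins, not real HBase API.

```java
// Simplified model of the ordering fix: metrics are only available after
// the scanner has been closed (or exhausted), so close before reading.
interface MetricsScanner {
    void close();
    String metricsSnapshot(); // stand-in for reading the serialized ScanMetrics
}

public class ScannerCloser {
    // Close first, then fetch and return the metrics exactly once.
    public static String closeAndLog(MetricsScanner scanner) {
        scanner.close();
        return scanner.metricsSnapshot();
    }
}
```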



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (KYLIN-2438) replace scan threshold with max scan bytes

2017-02-08 Thread Dayue Gao (JIRA)
Dayue Gao created KYLIN-2438:


 Summary: replace scan threshold with max scan bytes
 Key: KYLIN-2438
 URL: https://issues.apache.org/jira/browse/KYLIN-2438
 Project: Kylin
  Issue Type: Improvement
  Components: Query Engine, Storage - HBase
Affects Versions: v1.6.0
Reporter: Dayue Gao
Assignee: Dayue Gao


In order to guard against bad queries that can consume too much memory and then 
crash the kylin / hbase server, kylin limits the maximum number of rows a query 
can scan. The maximum value is determined by two configs:
# *kylin.query.scan.threshold* is used if the query doesn't contain 
memory-hungry metrics
# otherwise, *kylin.query.mem.budget* / estimated_row_size is used as the 
per-region maximum.

This approach however has several deficiencies:
* It doesn't work well with complex, variable-length metrics. The estimated 
threshold could be either too small or too large. If it's too small, good 
queries are killed. If it's too large, bad queries are not banned.
* Row count doesn't correspond to memory consumption, so it's difficult to 
determine how large the scan threshold should be.
* kylin.query.scan.threshold can't be overridden at cube level.

In this JIRA, I propose to replace the current row-count-based threshold with a 
more intuitive size-based threshold:
* KYLIN-2437 will collect the number of bytes scanned at both region and query 
level
* A new configuration *kylin.query.max-scan-bytes* will be added to limit the 
maximum number of bytes a query can scan in total
* *kylin.query.mem.budget* will be renamed to 
*kylin.storage.hbase.coprocessor-max-scan-bytes*, which applies the limit at 
region level
* the old *kylin.query.scan.threshold* will be deprecated
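The size-based guard proposed above can be sketched as a simple byte accumulator. This is an illustrative sketch only — ScanByteGuard and onRow are hypothetical names, not Kylin's implementation.

```java
// Illustrative size-based scan guard: accumulate bytes as results stream
// in and abort once the configured maximum is exceeded, instead of
// counting rows against an estimated threshold.
public class ScanByteGuard {
    private final long maxScanBytes; // stand-in for kylin.query.max-scan-bytes
    private long scannedBytes;

    public ScanByteGuard(long maxScanBytes) {
        this.maxScanBytes = maxScanBytes;
    }

    // Returns total bytes scanned so far; throws when over budget.
    public long onRow(long rowBytes) {
        scannedBytes += rowBytes;
        if (scannedBytes > maxScanBytes) {
            throw new IllegalStateException("Query scanned " + scannedBytes
                    + " bytes, exceeding limit " + maxScanBytes);
        }
        return scannedBytes;
    }
}
```

Because the limit is expressed in bytes, it tracks actual memory pressure regardless of how wide or variable-length the rows are.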



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (KYLIN-2438) replace scan threshold with max scan bytes

2017-02-08 Thread Dayue Gao (JIRA)

 [ 
https://issues.apache.org/jira/browse/KYLIN-2438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dayue Gao updated KYLIN-2438:
-
Description: 
In order to guard against bad queries that can consume lots of memory and 
potentially crash the kylin / hbase server, kylin limits the maximum number of 
rows a query can scan. The maximum value is chosen based on two configs:
# *kylin.query.scan.threshold* is used if the query doesn't contain 
memory-hungry metrics
# *kylin.query.mem.budget* / estimated_row_size is used otherwise as the 
per-region maximum.

This approach however has several deficiencies:
* It doesn't work well with complex, variable-length metrics. The estimated 
threshold could be either too small or too large. If it's too small, good 
queries are killed. If it's too large, bad queries are not banned.
* Row count doesn't correspond to memory consumption, so it's difficult to 
determine how large the scan threshold should be.
* kylin.query.scan.threshold can't be overridden at cube level.

In this JIRA, I propose to replace the current row-count-based threshold with a 
more intuitive size-based threshold:
* KYLIN-2437 will collect the number of bytes scanned at both region and query 
level
* A new configuration *kylin.query.max-scan-bytes* will be added to limit the 
maximum number of bytes a query can scan
* *kylin.query.mem.budget* will be renamed to 
*kylin.storage.hbase.coprocessor-max-scan-bytes*, which applies the limit at 
region level
* The above two configs can be overridden at cube level
* the old *kylin.query.scan.threshold* will be deprecated

  was:
In order to guard against bad queries that can consume too much memory and then 
crash the kylin / hbase server, kylin limits the maximum number of rows a query 
can scan. The maximum value is determined by two configs:
# *kylin.query.scan.threshold* is used if the query doesn't contain 
memory-hungry metrics
# otherwise, *kylin.query.mem.budget* / estimated_row_size is used as the 
per-region maximum.

This approach however has several deficiencies:
* It doesn't work well with complex, variable-length metrics. The estimated 
threshold could be either too small or too large. If it's too small, good 
queries are killed. If it's too large, bad queries are not banned.
* Row count doesn't correspond to memory consumption, so it's difficult to 
determine how large the scan threshold should be.
* kylin.query.scan.threshold can't be overridden at cube level.

In this JIRA, I propose to replace the current row-count-based threshold with a 
more intuitive size-based threshold:
* KYLIN-2437 will collect the number of bytes scanned at both region and query 
level
* A new configuration *kylin.query.max-scan-bytes* will be added to limit the 
maximum number of bytes a query can scan in total
* *kylin.query.mem.budget* will be renamed to 
*kylin.storage.hbase.coprocessor-max-scan-bytes*, which applies the limit at 
region level
* the old *kylin.query.scan.threshold* will be deprecated


> replace scan threshold with max scan bytes
> --
>
> Key: KYLIN-2438
> URL: https://issues.apache.org/jira/browse/KYLIN-2438
> Project: Kylin
>  Issue Type: Improvement
>  Components: Query Engine, Storage - HBase
>Affects Versions: v1.6.0
>Reporter: Dayue Gao
>Assignee: Dayue Gao
>
> In order to guard against bad queries that can consume lots of memory and 
> potentially crash the kylin / hbase server, kylin limits the maximum number 
> of rows a query can scan. The maximum value is chosen based on two configs
> # *kylin.query.scan.threshold* is used if the query doesn't contain 
> memory-hungry metrics
> # *kylin.query.mem.budget* / estimated_row_size is used otherwise as the 
> per-region maximum.
> This approach however has several deficiencies:
> * It doesn't work well with complex, variable-length metrics. The estimated 
> threshold could be either too small or too large. If it's too small, good 
> queries are killed. If it's too large, bad queries are not banned.
> * Row count doesn't correspond to memory consumption, so it's difficult to 
> determine how large the scan threshold should be.
> * kylin.query.scan.threshold can't be overridden at cube level.
> In this JIRA, I propose to replace the current row-count-based threshold with 
> a more intuitive size-based threshold
> * KYLIN-2437 will collect the number of bytes scanned at both region and 
> query level
> * A new configuration *kylin.query.max-scan-bytes* will be added to limit 
> the maximum number of bytes a query can scan
> * *kylin.query.mem.budget* will be renamed to 
> *kylin.storage.hbase.coprocessor-max-scan-bytes*, which applies the limit at 
> region level
> * The above two configs can be overridden at cube level
> * the old *kylin.query.scan.threshold* will be deprecated



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (KYLIN-2438) replace scan threshold with max scan bytes

2017-02-08 Thread Dayue Gao (JIRA)

 [ 
https://issues.apache.org/jira/browse/KYLIN-2438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dayue Gao updated KYLIN-2438:
-
Description: 
In order to guard against bad queries that can consume lots of memory and 
potentially crash the kylin / hbase server, kylin limits the maximum number of 
rows a query can scan. The maximum value is chosen based on two configs:
# *kylin.query.scan.threshold* is used if the query doesn't contain 
memory-hungry metrics
# *kylin.query.mem.budget* / estimated_row_size is used otherwise as the 
per-region maximum.

This approach however has several deficiencies:
* It doesn't work well with complex, variable-length metrics. The estimated 
threshold could be either too small or too large. If it's too small, good 
queries are killed. If it's too large, bad queries are not banned.
* Row count doesn't correspond to memory consumption, so it's difficult to 
determine how large the scan threshold should be.
* kylin.query.scan.threshold can't be overridden at cube level.

In this JIRA, I propose to replace the current row-count-based threshold with a 
more intuitive size-based threshold:
* KYLIN-2437 will collect the number of bytes scanned at both region and query 
level
* A new configuration *kylin.query.max-scan-bytes* will be added to limit the 
maximum number of bytes a query can scan
* *kylin.query.mem.budget* will be renamed to 
*kylin.storage.hbase.coprocessor-max-scan-bytes*, which applies the limit at 
region level. We don't need to rely on estimations about row size any more.
* The above two configs can be overridden at cube level
* the old *kylin.query.scan.threshold* will be deprecated

  was:
In order to guard against bad queries that can consume lots of memory and 
potentially crash the kylin / hbase server, kylin limits the maximum number of 
rows a query can scan. The maximum value is chosen based on two configs:
# *kylin.query.scan.threshold* is used if the query doesn't contain 
memory-hungry metrics
# *kylin.query.mem.budget* / estimated_row_size is used otherwise as the 
per-region maximum.

This approach however has several deficiencies:
* It doesn't work well with complex, variable-length metrics. The estimated 
threshold could be either too small or too large. If it's too small, good 
queries are killed. If it's too large, bad queries are not banned.
* Row count doesn't correspond to memory consumption, so it's difficult to 
determine how large the scan threshold should be.
* kylin.query.scan.threshold can't be overridden at cube level.

In this JIRA, I propose to replace the current row-count-based threshold with a 
more intuitive size-based threshold:
* KYLIN-2437 will collect the number of bytes scanned at both region and query 
level
* A new configuration *kylin.query.max-scan-bytes* will be added to limit the 
maximum number of bytes a query can scan
* *kylin.query.mem.budget* will be renamed to 
*kylin.storage.hbase.coprocessor-max-scan-bytes*, which applies the limit at 
region level
* The above two configs can be overridden at cube level
* the old *kylin.query.scan.threshold* will be deprecated


> replace scan threshold with max scan bytes
> --
>
> Key: KYLIN-2438
> URL: https://issues.apache.org/jira/browse/KYLIN-2438
> Project: Kylin
>  Issue Type: Improvement
>  Components: Query Engine, Storage - HBase
>Affects Versions: v1.6.0
>Reporter: Dayue Gao
>Assignee: Dayue Gao
>
> In order to guard against bad queries that can consume lots of memory and 
> potentially crash the kylin / hbase server, kylin limits the maximum number 
> of rows a query can scan. The maximum value is chosen based on two configs
> # *kylin.query.scan.threshold* is used if the query doesn't contain 
> memory-hungry metrics
> # *kylin.query.mem.budget* / estimated_row_size is used otherwise as the 
> per-region maximum.
> This approach however has several deficiencies:
> * It doesn't work well with complex, variable-length metrics. The estimated 
> threshold could be either too small or too large. If it's too small, good 
> queries are killed. If it's too large, bad queries are not banned.
> * Row count doesn't correspond to memory consumption, so it's difficult to 
> determine how large the scan threshold should be.
> * kylin.query.scan.threshold can't be overridden at cube level.
> In this JIRA, I propose to replace the current row-count-based threshold with 
> a more intuitive size-based threshold
> * KYLIN-2437 will collect the number of bytes scanned at both region and 
> query level
> * A new configuration *kylin.query.max-scan-bytes* will be added to limit 
> the maximum number of bytes a query can scan
> * *kylin.query.mem.budget* will be renamed to 
> *kylin.storage.hbase.coprocessor-max-scan-bytes*, which applies the limit at 
> region level. We don't need to rely on estimations about row size any more.
> * The above two configs can be 

[jira] [Updated] (KYLIN-2438) replace scan threshold with max scan bytes

2017-02-08 Thread Dayue Gao (JIRA)

 [ 
https://issues.apache.org/jira/browse/KYLIN-2438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dayue Gao updated KYLIN-2438:
-
Description: 
In order to guard against bad queries that can consume lots of memory and 
potentially crash the kylin / hbase server, kylin limits the maximum number of 
rows a query can scan. The maximum value is chosen based on two configs:
# *kylin.query.scan.threshold* is used if the query doesn't contain 
memory-hungry metrics
# *kylin.query.mem.budget* / estimated_row_size is used otherwise as the 
per-region maximum.

This approach however has several deficiencies:
* It doesn't work well with complex, variable-length metrics. The estimated 
threshold could be either too small or too large. If it's too small, good 
queries are killed. If it's too large, bad queries are not banned.
* Row count doesn't correspond to memory consumption, so it's difficult to 
determine how large the scan threshold should be.
* kylin.query.scan.threshold can't be overridden at cube level.

In this JIRA, I propose to replace the current row-count-based threshold with a 
more intuitive size-based threshold:
* KYLIN-2437 will collect the number of bytes scanned at both region and query 
level
* A new configuration *kylin.query.max-scan-bytes* will be added to limit the 
maximum number of bytes a query can scan
* *kylin.query.mem.budget* will be renamed to 
*kylin.storage.hbase.coprocessor-max-scan-bytes*, which applies the limit at 
region level. No need to rely on estimations about row size any more.
* The above two configs can be overridden at cube level
* the old *kylin.query.scan.threshold* will be deprecated

  was:
In order to guard against bad queries that can consume lots of memory and 
potentially crash the kylin / hbase server, kylin limits the maximum number of 
rows a query can scan. The maximum value is chosen based on two configs:
# *kylin.query.scan.threshold* is used if the query doesn't contain 
memory-hungry metrics
# *kylin.query.mem.budget* / estimated_row_size is used otherwise as the 
per-region maximum.

This approach however has several deficiencies:
* It doesn't work well with complex, variable-length metrics. The estimated 
threshold could be either too small or too large. If it's too small, good 
queries are killed. If it's too large, bad queries are not banned.
* Row count doesn't correspond to memory consumption, so it's difficult to 
determine how large the scan threshold should be.
* kylin.query.scan.threshold can't be overridden at cube level.

In this JIRA, I propose to replace the current row-count-based threshold with a 
more intuitive size-based threshold:
* KYLIN-2437 will collect the number of bytes scanned at both region and query 
level
* A new configuration *kylin.query.max-scan-bytes* will be added to limit the 
maximum number of bytes a query can scan
* *kylin.query.mem.budget* will be renamed to 
*kylin.storage.hbase.coprocessor-max-scan-bytes*, which applies the limit at 
region level. We don't need to rely on estimations about row size any more.
* The above two configs can be overridden at cube level
* the old *kylin.query.scan.threshold* will be deprecated


> replace scan threshold with max scan bytes
> --
>
> Key: KYLIN-2438
> URL: https://issues.apache.org/jira/browse/KYLIN-2438
> Project: Kylin
>  Issue Type: Improvement
>  Components: Query Engine, Storage - HBase
>Affects Versions: v1.6.0
>Reporter: Dayue Gao
>Assignee: Dayue Gao
>
> In order to guard against bad queries that can consume lots of memory and 
> potentially crash the kylin / hbase server, kylin limits the maximum number 
> of rows a query can scan. The maximum value is chosen based on two configs
> # *kylin.query.scan.threshold* is used if the query doesn't contain 
> memory-hungry metrics
> # *kylin.query.mem.budget* / estimated_row_size is used otherwise as the 
> per-region maximum.
> This approach however has several deficiencies:
> * It doesn't work well with complex, variable-length metrics. The estimated 
> threshold could be either too small or too large. If it's too small, good 
> queries are killed. If it's too large, bad queries are not banned.
> * Row count doesn't correspond to memory consumption, so it's difficult to 
> determine how large the scan threshold should be.
> * kylin.query.scan.threshold can't be overridden at cube level.
> In this JIRA, I propose to replace the current row-count-based threshold with 
> a more intuitive size-based threshold
> * KYLIN-2437 will collect the number of bytes scanned at both region and 
> query level
> * A new configuration *kylin.query.max-scan-bytes* will be added to limit 
> the maximum number of bytes a query can scan
> * *kylin.query.mem.budget* will be renamed to 
> *kylin.storage.hbase.coprocessor-max-scan-bytes*, which applies the limit at 
> region level. No need to rely on estimations about 

[jira] [Resolved] (KYLIN-2412) Unclosed DataOutputStream in RoaringBitmapCounter#write()

2017-01-21 Thread Dayue Gao (JIRA)

 [ 
https://issues.apache.org/jira/browse/KYLIN-2412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dayue Gao resolved KYLIN-2412.
--
Resolution: Fixed

commit 
https://github.com/apache/kylin/commit/d264339b1c16c195ffafc2217b793d81bdbd6434

> Unclosed DataOutputStream in RoaringBitmapCounter#write()
> -
>
> Key: KYLIN-2412
> URL: https://issues.apache.org/jira/browse/KYLIN-2412
> Project: Kylin
>  Issue Type: Bug
>Reporter: Ted Yu
>Assignee: Dayue Gao
>Priority: Minor
>
> {code}
> bitmap.serialize(new DataOutputStream(new 
> ByteBufferOutputStream(out)));
> {code}
> Upon return from the method, DataOutputStream should be closed.
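A minimal sketch of such a fix is to wrap the stream in try-with-resources so it is flushed and closed on return. The class and method below (BitmapWriteExample, serializeTo) are illustrative stand-ins, not the actual RoaringBitmapCounter code.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class BitmapWriteExample {
    // Serialize values through a DataOutputStream that is guaranteed to be
    // closed (and therefore flushed) when the method returns.
    public static byte[] serializeTo(int[] values) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (DataOutputStream dos = new DataOutputStream(out)) {
            for (int v : values) {
                dos.writeInt(v); // stand-in for bitmap.serialize(dos)
            }
        } // dos (and the wrapping it adds) is closed here, even on exception
        return out.toByteArray();
    }
}
```

Closing the DataOutputStream also flushes any buffered bytes, which is why leaving it unclosed can silently truncate the serialized output.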



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KYLIN-2457) Should copy the latest dictionaries on dimension tables in a batch merge job

2017-02-22 Thread Dayue Gao (JIRA)

[ 
https://issues.apache.org/jira/browse/KYLIN-2457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15878043#comment-15878043
 ] 

Dayue Gao commented on KYLIN-2457:
--

+1. Hi [~zhengd], it would be better if you also updated the comments of 
{{makeDictForNewSegment}} and {{makeSnapshotForNewSegment}}.

> Should copy the latest dictionaries on dimension tables in a batch merge job
> 
>
> Key: KYLIN-2457
> URL: https://issues.apache.org/jira/browse/KYLIN-2457
> Project: Kylin
>  Issue Type: Bug
>Reporter: zhengdong
>Priority: Critical
> Attachments: 
> KYLIN-2457-Should-copy-the-latest-dictionaries-on-di.patch
>
>
> In a batch merge job, we need to create dictionaries for all dimensions for 
> the new segment. For dictionaries on dimension tables, we currently just 
> copy them from the earliest of the merging segments. 
> However, we should select the newest dictionary for the new segment, since 
> incremental dimension tables are allowed. The older dictionary can't find 
> the records corresponding to new keys added to a dimension table, which 
> leads to wrong query results.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (KYLIN-2451) Set HBASE_RPC_TIMEOUT according to kylin.storage.hbase.coprocessor-timeout-seconds

2017-02-20 Thread Dayue Gao (JIRA)

[ 
https://issues.apache.org/jira/browse/KYLIN-2451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15874279#comment-15874279
 ] 

Dayue Gao commented on KYLIN-2451:
--

LGTM 

> Set HBASE_RPC_TIMEOUT according to 
> kylin.storage.hbase.coprocessor-timeout-seconds
> --
>
> Key: KYLIN-2451
> URL: https://issues.apache.org/jira/browse/KYLIN-2451
> Project: Kylin
>  Issue Type: Improvement
>Reporter: liyang
>Assignee: liyang
> Fix For: v2.0.0
>
>
> Currently if HBASE_RPC_TIMEOUT is shorter than 
> "kylin.storage.hbase.coprocessor-timeout-seconds", the HBase RPC call will 
> timeout before coprocessor gives up. Shall let RPC wait longer.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (KYLIN-2451) Set HBASE_RPC_TIMEOUT according to kylin.storage.hbase.coprocessor-timeout-seconds

2017-02-16 Thread Dayue Gao (JIRA)

[ 
https://issues.apache.org/jira/browse/KYLIN-2451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15871127#comment-15871127
 ] 

Dayue Gao commented on KYLIN-2451:
--

Hi [~liyang.g...@gmail.com], how is this possible? Haven't we already upper 
bounded coprocessor timeout to 0.9 x HBASE_RPC_TIMEOUT? Please take a look at 
CubeHBaseRPC.getCoprocessorTimeoutMillis.
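The upper-bounding logic referred to above can be sketched as follows. The method name mirrors the one mentioned (getCoprocessorTimeoutMillis), but the body is an illustrative assumption, not the actual CubeHBaseRPC code.

```java
// Hedged sketch: cap the coprocessor timeout at 90% of the HBase RPC
// timeout, so the coprocessor gives up before the RPC call does.
public class CoprocessorTimeout {
    public static long getCoprocessorTimeoutMillis(long configuredMillis,
                                                   long hbaseRpcTimeoutMillis) {
        long upperBound = hbaseRpcTimeoutMillis * 9 / 10; // 0.9 x HBASE_RPC_TIMEOUT
        return Math.min(configuredMillis, upperBound);
    }
}
```

Under this scheme a coprocessor-timeout-seconds larger than the RPC timeout is silently clamped, which is the behavior the comment questions.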

> Set HBASE_RPC_TIMEOUT according to 
> kylin.storage.hbase.coprocessor-timeout-seconds
> --
>
> Key: KYLIN-2451
> URL: https://issues.apache.org/jira/browse/KYLIN-2451
> Project: Kylin
>  Issue Type: Improvement
>Reporter: liyang
>Assignee: liyang
> Fix For: v2.0.0
>
>
> Currently if HBASE_RPC_TIMEOUT is shorter than 
> "kylin.storage.hbase.coprocessor-timeout-seconds", the HBase RPC call will 
> timeout before coprocessor gives up. Shall let RPC wait longer.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (KYLIN-2451) Set HBASE_RPC_TIMEOUT according to kylin.storage.hbase.coprocessor-timeout-seconds

2017-02-20 Thread Dayue Gao (JIRA)

[ 
https://issues.apache.org/jira/browse/KYLIN-2451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15874211#comment-15874211
 ] 

Dayue Gao commented on KYLIN-2451:
--

Hi [~liyang.g...@gmail.com],

I see the difference now. Our goal is the same: make the rpc timeout longer 
than coprocessor-timeout-seconds. The difference is 
that previously we overrode coprocessor-timeout-seconds according to 
HBASE_RPC_TIMEOUT (please see the comment about coprocessor-timeout-seconds in 
kylin.properties), whereas now you want to do the opposite and set 
HBASE_RPC_TIMEOUT according to coprocessor-timeout-seconds, right?

But I would prefer the previous approach, because coprocessor-timeout-seconds 
is a cube-level config while HBASE_RPC_TIMEOUT is a global one. With your 
approach, the user can't choose a larger value at cube level.

> Set HBASE_RPC_TIMEOUT according to 
> kylin.storage.hbase.coprocessor-timeout-seconds
> --
>
> Key: KYLIN-2451
> URL: https://issues.apache.org/jira/browse/KYLIN-2451
> Project: Kylin
>  Issue Type: Improvement
>Reporter: liyang
>Assignee: liyang
> Fix For: v2.0.0
>
>
> Currently if HBASE_RPC_TIMEOUT is shorter than 
> "kylin.storage.hbase.coprocessor-timeout-seconds", the HBase RPC call will 
> timeout before coprocessor gives up. Shall let RPC wait longer.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (KYLIN-2443) Report coprocessor error information back to client

2017-02-11 Thread Dayue Gao (JIRA)

[ 
https://issues.apache.org/jira/browse/KYLIN-2443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15862616#comment-15862616
 ] 

Dayue Gao commented on KYLIN-2443:
--

Commit 
https://github.com/apache/kylin/commit/43c0566728092d537201d751d3e8f6e3c0d5f051

Change highlights:
* Updated the CubeVisitResponse message with ErrorInfo and report the error 
message back to the end user
* Renamed GTScanTimeoutException to KylinTimeoutException and 
GTScanExceedThresholdException to ResourceLimitExceededException. Deleted 
GTScanSelfTerminatedException.
* Made SQLResponse#totalScanCount reflect the hbase scan count rather than the 
query server scan count. Renamed StorageContext#totalScanCount to 
processedRowCount
[~mahongbin], could you peer review the commit?

> Report coprocessor error information back to client
> ---
>
> Key: KYLIN-2443
> URL: https://issues.apache.org/jira/browse/KYLIN-2443
> Project: Kylin
>  Issue Type: Improvement
>  Components: Storage - HBase
>Affects Versions: v1.6.0
>Reporter: Dayue Gao
>Assignee: Dayue Gao
>
> When a query aborts in the coprocessor, the current error message (listed 
> below) doesn't carry any concrete reason. The user has to check the 
> regionserver's log to figure out what's happening, which is tedious 
> and not always possible in a cloud environment. 
> {noformat}
>  4d65f9bf>The coprocessor thread stopped itself due to scan timeout or scan 
> threshold(check region server log), failing current query...
> {noformat}
> It would be better to report error message to client.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Resolved] (KYLIN-2436) add a configuration knob to disable spilling of aggregation cache

2017-02-09 Thread Dayue Gao (JIRA)

 [ 
https://issues.apache.org/jira/browse/KYLIN-2436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dayue Gao resolved KYLIN-2436.
--
   Resolution: Fixed
Fix Version/s: v2.0.0

> add a configuration knob to disable spilling of aggregation cache
> -
>
> Key: KYLIN-2436
> URL: https://issues.apache.org/jira/browse/KYLIN-2436
> Project: Kylin
>  Issue Type: Improvement
>  Components: Storage - HBase
>Affects Versions: v1.6.0
>Reporter: Dayue Gao
>Assignee: Dayue Gao
> Fix For: v2.0.0
>
>
> Kylin's aggregation operator can spill intermediate results to disk when its 
> estimated memory usage exceeds some threshold (kylin.query.coprocessor.mem.gb 
> to be specific). While it's generally a useful feature to prevent the 
> RegionServer from OOM, there are times when aborting this kind of 
> memory-hungry query immediately is a more suitable choice for users.
> To accommodate this requirement, I suggest adding a new configuration named 
> *kylin.storage.hbase.coprocessor-spill-enabled*. The default value would be 
> true, which keeps the same behavior as before. If changed to false, queries 
> that use more aggregation memory than the threshold will fail immediately.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (KYLIN-2436) add a configuration knob to disable spilling of aggregation cache

2017-02-09 Thread Dayue Gao (JIRA)

[ 
https://issues.apache.org/jira/browse/KYLIN-2436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15860703#comment-15860703
 ] 

Dayue Gao commented on KYLIN-2436:
--

commit 
https://github.com/apache/kylin/commit/ecf6a69fece7cbda3a9bd8d678c928224ce677aa

> add a configuration knob to disable spilling of aggregation cache
> -
>
> Key: KYLIN-2436
> URL: https://issues.apache.org/jira/browse/KYLIN-2436
> Project: Kylin
>  Issue Type: Improvement
>  Components: Storage - HBase
>Affects Versions: v1.6.0
>Reporter: Dayue Gao
>Assignee: Dayue Gao
> Fix For: v2.0.0
>
>
> Kylin's aggregation operator can spill intermediate results to disk when its 
> estimated memory usage exceeds some threshold (kylin.query.coprocessor.mem.gb 
> to be specific). While it's generally a useful feature to prevent the 
> RegionServer from OOM, there are times when aborting this kind of 
> memory-hungry query immediately is a more suitable choice for users.
> To accommodate this requirement, I suggest adding a new configuration named 
> *kylin.storage.hbase.coprocessor-spill-enabled*. The default value would be 
> true, which keeps the same behavior as before. If changed to false, queries 
> that use more aggregation memory than the threshold will fail immediately.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Resolved] (KYLIN-2443) Report coprocessor error information back to client

2017-02-14 Thread Dayue Gao (JIRA)

 [ 
https://issues.apache.org/jira/browse/KYLIN-2443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dayue Gao resolved KYLIN-2443.
--
   Resolution: Fixed
Fix Version/s: v2.0.0

> Report coprocessor error information back to client
> ---
>
> Key: KYLIN-2443
> URL: https://issues.apache.org/jira/browse/KYLIN-2443
> Project: Kylin
>  Issue Type: Improvement
>  Components: Storage - HBase
>Affects Versions: v1.6.0
>Reporter: Dayue Gao
>Assignee: Dayue Gao
> Fix For: v2.0.0
>
>
> When a query aborts in the coprocessor, the current error message (listed 
> below) doesn't carry any concrete reason. The user has to check the 
> regionserver's log to figure out what's happening, which is tedious 
> and not always possible in a cloud environment. 
> {noformat}
>  4d65f9bf>The coprocessor thread stopped itself due to scan timeout or scan 
> threshold(check region server log), failing current query...
> {noformat}
> It would be better to report error message to client.





[jira] [Resolved] (KYLIN-2437) collect number of bytes scanned to query metrics

2017-02-09 Thread Dayue Gao (JIRA)

 [ 
https://issues.apache.org/jira/browse/KYLIN-2437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dayue Gao resolved KYLIN-2437.
--
Resolution: Fixed

commit 
https://github.com/apache/kylin/commit/e09338b34c0b07a7167096e45bf9185aa0d0cbd5

> collect number of bytes scanned to query metrics
> 
>
> Key: KYLIN-2437
> URL: https://issues.apache.org/jira/browse/KYLIN-2437
> Project: Kylin
>  Issue Type: Improvement
>  Components: Storage - HBase
>Affects Versions: v1.6.0
>Reporter: Dayue Gao
>Assignee: Dayue Gao
>
> Besides scanned row count, it's useful to know how many bytes are scanned 
> from HBase to fulfil a query. It is perhaps a better indicator than row count 
> of how much pressure a query puts on HBase. 





[jira] [Updated] (KYLIN-2437) collect number of bytes scanned to query metrics

2017-02-09 Thread Dayue Gao (JIRA)

 [ 
https://issues.apache.org/jira/browse/KYLIN-2437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dayue Gao updated KYLIN-2437:
-
Fix Version/s: v2.0.0

> collect number of bytes scanned to query metrics
> 
>
> Key: KYLIN-2437
> URL: https://issues.apache.org/jira/browse/KYLIN-2437
> Project: Kylin
>  Issue Type: Improvement
>  Components: Storage - HBase
>Affects Versions: v1.6.0
>Reporter: Dayue Gao
>Assignee: Dayue Gao
> Fix For: v2.0.0
>
>
> Besides scanned row count, it's useful to know how many bytes are scanned 
> from HBase to fulfil a query. It is perhaps a better indicator than row count 
> of how much pressure a query puts on HBase. 





[jira] [Commented] (KYLIN-2387) A new BitmapCounter with better performance

2017-01-18 Thread Dayue Gao (JIRA)

[ 
https://issues.apache.org/jira/browse/KYLIN-2387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15829485#comment-15829485
 ] 

Dayue Gao commented on KYLIN-2387:
--

Choosing mutable or immutable bitmaps is an implementation detail and shouldn't 
affect the way clients use BitmapCounter. Hence I added the mutate operations 
back to the BitmapCounter interface, and merged the two subclasses into one 
RoaringBitmapCounter.

Commit here 
https://github.com/apache/kylin/commit/38c3e7bf691ecdfd0f8d42fcc97065a0596be018 



> A new BitmapCounter with better performance
> ---
>
> Key: KYLIN-2387
> URL: https://issues.apache.org/jira/browse/KYLIN-2387
> Project: Kylin
>  Issue Type: Improvement
>  Components: Metadata, Query Engine, Storage - HBase
>Affects Versions: v2.0.0
>Reporter: Dayue Gao
>Assignee: Dayue Gao
>
> We found the old BitmapCounter does not perform very well on very large 
> bitmaps. The inefficiency comes from:
> * Poor serialize implementation: instead of serializing the bitmap directly 
> to a ByteBuffer, it uses a ByteArrayOutputStream as temporary storage, which 
> causes superfluous memory allocations
> * Poor peekLength implementation: the whole bitmap is deserialized in order 
> to retrieve its serialized size
> * Extra deserialize cost: even if only cardinality info is needed to answer a 
> query, the whole bitmap is deserialized into a MutableRoaringBitmap
> A new BitmapCounter is designed to solve these problems:
> * It comes in two flavors, mutable and immutable, based on 
> MutableRoaringBitmap and ImmutableRoaringBitmap respectively
> * ImmutableBitmapCounter has a lower deserialize cost, as it just maps to a 
> copied buffer. So we always deserialize to an ImmutableBitmapCounter at first, 
> and convert it to a MutableBitmapCounter only when necessary
> * peekLength is implemented using 
> ImmutableRoaringBitmap.serializedSizeInBytes, which is very fast since only 
> the header of the roaring format is examined
> * It serializes directly to a ByteBuffer; no intermediate buffer is 
> allocated
> * The wire format is the same as before 
> ([RoaringFormatSpec|https://github.com/RoaringBitmap/RoaringFormatSpec/]). 
> Therefore no cube rebuild is needed
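The two serialization points above can be illustrated with a stdlib-only sketch. The payload format here is hypothetical; the real counter delegates to RoaringBitmap's own serialization and to ImmutableRoaringBitmap.serializedSizeInBytes:

```java
import java.nio.ByteBuffer;

// Stdlib-only sketch: serialize straight into the destination ByteBuffer,
// and peek the serialized length from the header alone. The real counter
// wraps RoaringBitmap; this length-prefixed int payload is hypothetical.
public class CounterSketch {
    private final int[] values; // stand-in for the bitmap payload

    public CounterSketch(int... values) {
        this.values = values;
    }

    // Write a 4-byte element-count header followed by the payload directly
    // into the caller's buffer -- no intermediate ByteArrayOutputStream.
    public void writeTo(ByteBuffer out) {
        out.putInt(values.length);
        for (int v : values) {
            out.putInt(v);
        }
    }

    // peekLength analogue: read only the header (an absolute get, so the
    // buffer's position is untouched) to learn the full serialized size.
    public static int peekLength(ByteBuffer in) {
        return 4 + 4 * in.getInt(in.position());
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocate(64);
        new CounterSketch(1, 2, 3).writeTo(buf);
        buf.flip();
        System.out.println(CounterSketch.peekLength(buf)); // header + 3 ints = 16
        System.out.println(buf.position()); // still 0: peeking did not consume
    }
}
```

The same pattern applies to the real counter: the consumer can skip over a serialized bitmap, or defer deserializing it, knowing only its length.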





[jira] [Closed] (KYLIN-2398) CubeSegmentScanner generated inaccurate

2017-01-15 Thread Dayue Gao (JIRA)

 [ 
https://issues.apache.org/jira/browse/KYLIN-2398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dayue Gao closed KYLIN-2398.

   Resolution: Duplicate
Fix Version/s: (was: Future)

> CubeSegmentScanner generated inaccurate
> ---
>
> Key: KYLIN-2398
> URL: https://issues.apache.org/jira/browse/KYLIN-2398
> Project: Kylin
>  Issue Type: Improvement
>  Components: Query Engine
>Affects Versions: v1.5.4.1
>Reporter: WangSheng
>Assignee: liyang
>
> My project has three segments:
> 2016060100_2016060200,
> 2016060200_2016060300,
> 2016060300_2016060400
> When I used a filter condition like this: day>='2016-06-01' and 
> day<'2016-06-02'
> Kylin generated three CubeSegmentScanners, and each CubeSegmentScanner's 
> GTScanRequest was not empty!
> When I changed the filter condition to: day>='2016-06-01' and 
> day<='2016-06-02'
> Kylin also generated three CubeSegmentScanners, but the last 
> CubeSegmentScanner's GTScanRequest was empty!
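The behavior the reporter expected can be sketched as a half-open-interval overlap check, since Kylin segments cover date ranges of the form [start, end). This is a stdlib-only illustration, not Kylin's actual CubeSegmentScanner pruning code:

```java
import java.util.ArrayList;
import java.util.List;

public class SegmentPruning {
    // Each segment covers the half-open date range [seg[0], seg[1]).
    static List<String> touchedSegments(String[][] segments,
                                        String fromInclusive, String toExclusive) {
        List<String> touched = new ArrayList<>();
        for (String[] seg : segments) {
            String start = seg[0], end = seg[1];
            // Overlap iff start < toExclusive and end > fromInclusive.
            if (start.compareTo(toExclusive) < 0 && end.compareTo(fromInclusive) > 0) {
                touched.add(start + "_" + end);
            }
        }
        return touched;
    }

    public static void main(String[] args) {
        String[][] segs = {
            {"2016-06-01", "2016-06-02"},
            {"2016-06-02", "2016-06-03"},
            {"2016-06-03", "2016-06-04"},
        };
        // day >= '2016-06-01' and day < '2016-06-02': only the first segment overlaps
        System.out.println(touchedSegments(segs, "2016-06-01", "2016-06-02"));
        // day >= '2016-06-01' and day <= '2016-06-02': the first two segments overlap
        System.out.println(touchedSegments(segs, "2016-06-01", "2016-06-03"));
    }
}
```

Under this model, creating a scanner for the third segment in either case is unnecessary, which matches the report that its GTScanRequest should be empty.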





[jira] [Closed] (KYLIN-2399) CubeSegmentScanner generated inaccurate

2017-01-15 Thread Dayue Gao (JIRA)

 [ 
https://issues.apache.org/jira/browse/KYLIN-2399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dayue Gao closed KYLIN-2399.

Resolution: Duplicate

> CubeSegmentScanner generated inaccurate
> ---
>
> Key: KYLIN-2399
> URL: https://issues.apache.org/jira/browse/KYLIN-2399
> Project: Kylin
>  Issue Type: Improvement
>  Components: Query Engine
>Affects Versions: v1.5.4.1
>Reporter: WangSheng
>Assignee: liyang
> Fix For: Future
>
>
> My project has three segments:
> 2016060100_2016060200,
> 2016060200_2016060300,
> 2016060300_2016060400
> When I used a filter condition like this: day>='2016-06-01' and 
> day<'2016-06-02'
> Kylin generated three CubeSegmentScanners, and each CubeSegmentScanner's 
> GTScanRequest was not empty!
> When I changed the filter condition to: day>='2016-06-01' and 
> day<='2016-06-02'
> Kylin also generated three CubeSegmentScanners, but the last 
> CubeSegmentScanner's GTScanRequest was empty!





[jira] [Commented] (KYLIN-2387) A new BitmapCounter with better performance

2017-01-17 Thread Dayue Gao (JIRA)

[ 
https://issues.apache.org/jira/browse/KYLIN-2387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15825703#comment-15825703
 ] 

Dayue Gao commented on KYLIN-2387:
--

ImmutableRoaringBitmap.bitmapOf is only used in tests, so it's possible to 
remove the usage of it.

But my question is, why does Kylin load the RoaringBitmap class from Spark? Is 
it a classpath issue?

> A new BitmapCounter with better performance
> ---
>
> Key: KYLIN-2387
> URL: https://issues.apache.org/jira/browse/KYLIN-2387
> Project: Kylin
>  Issue Type: Improvement
>  Components: Metadata, Query Engine, Storage - HBase
>Affects Versions: v2.0.0
>Reporter: Dayue Gao
>Assignee: Dayue Gao
>
> We found the old BitmapCounter does not perform very well on very large 
> bitmaps. The inefficiency comes from:
> * Poor serialize implementation: instead of serializing the bitmap directly 
> to a ByteBuffer, it uses a ByteArrayOutputStream as temporary storage, which 
> causes superfluous memory allocations
> * Poor peekLength implementation: the whole bitmap is deserialized in order 
> to retrieve its serialized size
> * Extra deserialize cost: even if only cardinality info is needed to answer a 
> query, the whole bitmap is deserialized into a MutableRoaringBitmap
> A new BitmapCounter is designed to solve these problems:
> * It comes in two flavors, mutable and immutable, based on 
> MutableRoaringBitmap and ImmutableRoaringBitmap respectively
> * ImmutableBitmapCounter has a lower deserialize cost, as it just maps to a 
> copied buffer. So we always deserialize to an ImmutableBitmapCounter at first, 
> and convert it to a MutableBitmapCounter only when necessary
> * peekLength is implemented using 
> ImmutableRoaringBitmap.serializedSizeInBytes, which is very fast since only 
> the header of the roaring format is examined
> * It serializes directly to a ByteBuffer; no intermediate buffer is 
> allocated
> * The wire format is the same as before 
> ([RoaringFormatSpec|https://github.com/RoaringBitmap/RoaringFormatSpec/]). 
> Therefore no cube rebuild is needed





[jira] [Commented] (KYLIN-2387) A new BitmapCounter with better performance

2017-01-17 Thread Dayue Gao (JIRA)

[ 
https://issues.apache.org/jira/browse/KYLIN-2387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15825814#comment-15825814
 ] 

Dayue Gao commented on KYLIN-2387:
--

Commit 
https://github.com/apache/kylin/commit/e894465007f422d619ddeab2acd87e38fa093fd9 
removes the usage of ImmutableRoaringBitmap.bitmapOf.

> A new BitmapCounter with better performance
> ---
>
> Key: KYLIN-2387
> URL: https://issues.apache.org/jira/browse/KYLIN-2387
> Project: Kylin
>  Issue Type: Improvement
>  Components: Metadata, Query Engine, Storage - HBase
>Affects Versions: v2.0.0
>Reporter: Dayue Gao
>Assignee: Dayue Gao
>
> We found the old BitmapCounter does not perform very well on very large 
> bitmaps. The inefficiency comes from:
> * Poor serialize implementation: instead of serializing the bitmap directly 
> to a ByteBuffer, it uses a ByteArrayOutputStream as temporary storage, which 
> causes superfluous memory allocations
> * Poor peekLength implementation: the whole bitmap is deserialized in order 
> to retrieve its serialized size
> * Extra deserialize cost: even if only cardinality info is needed to answer a 
> query, the whole bitmap is deserialized into a MutableRoaringBitmap
> A new BitmapCounter is designed to solve these problems:
> * It comes in two flavors, mutable and immutable, based on 
> MutableRoaringBitmap and ImmutableRoaringBitmap respectively
> * ImmutableBitmapCounter has a lower deserialize cost, as it just maps to a 
> copied buffer. So we always deserialize to an ImmutableBitmapCounter at first, 
> and convert it to a MutableBitmapCounter only when necessary
> * peekLength is implemented using 
> ImmutableRoaringBitmap.serializedSizeInBytes, which is very fast since only 
> the header of the roaring format is examined
> * It serializes directly to a ByteBuffer; no intermediate buffer is 
> allocated
> * The wire format is the same as before 
> ([RoaringFormatSpec|https://github.com/RoaringBitmap/RoaringFormatSpec/]). 
> Therefore no cube rebuild is needed





[jira] [Commented] (KYLIN-2387) A new BitmapCounter with better performance

2017-01-17 Thread Dayue Gao (JIRA)

[ 
https://issues.apache.org/jira/browse/KYLIN-2387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15825749#comment-15825749
 ] 

Dayue Gao commented on KYLIN-2387:
--

OK, I'll remove the usage of ImmutableRoaringBitmap.bitmapOf. But I'm not sure 
if there are any other incompatible methods.

> A new BitmapCounter with better performance
> ---
>
> Key: KYLIN-2387
> URL: https://issues.apache.org/jira/browse/KYLIN-2387
> Project: Kylin
>  Issue Type: Improvement
>  Components: Metadata, Query Engine, Storage - HBase
>Affects Versions: v2.0.0
>Reporter: Dayue Gao
>Assignee: Dayue Gao
>
> We found the old BitmapCounter does not perform very well on very large 
> bitmaps. The inefficiency comes from:
> * Poor serialize implementation: instead of serializing the bitmap directly 
> to a ByteBuffer, it uses a ByteArrayOutputStream as temporary storage, which 
> causes superfluous memory allocations
> * Poor peekLength implementation: the whole bitmap is deserialized in order 
> to retrieve its serialized size
> * Extra deserialize cost: even if only cardinality info is needed to answer a 
> query, the whole bitmap is deserialized into a MutableRoaringBitmap
> A new BitmapCounter is designed to solve these problems:
> * It comes in two flavors, mutable and immutable, based on 
> MutableRoaringBitmap and ImmutableRoaringBitmap respectively
> * ImmutableBitmapCounter has a lower deserialize cost, as it just maps to a 
> copied buffer. So we always deserialize to an ImmutableBitmapCounter at first, 
> and convert it to a MutableBitmapCounter only when necessary
> * peekLength is implemented using 
> ImmutableRoaringBitmap.serializedSizeInBytes, which is very fast since only 
> the header of the roaring format is examined
> * It serializes directly to a ByteBuffer; no intermediate buffer is 
> allocated
> * The wire format is the same as before 
> ([RoaringFormatSpec|https://github.com/RoaringBitmap/RoaringFormatSpec/]). 
> Therefore no cube rebuild is needed





[jira] [Resolved] (KYLIN-2457) Should copy the latest dictionaries on dimension tables in a batch merge job

2017-02-28 Thread Dayue Gao (JIRA)

 [ 
https://issues.apache.org/jira/browse/KYLIN-2457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dayue Gao resolved KYLIN-2457.
--
Resolution: Fixed
  Assignee: zhengdong

> Should copy the latest dictionaries on dimension tables in a batch merge job
> 
>
> Key: KYLIN-2457
> URL: https://issues.apache.org/jira/browse/KYLIN-2457
> Project: Kylin
>  Issue Type: Bug
>Affects Versions: v1.6.0
>Reporter: zhengdong
>Assignee: zhengdong
>Priority: Critical
> Fix For: v2.0.0
>
> Attachments: 
> 0001-KYLIN-2457-Should-copy-the-latest-dictionaries-on-di.patch
>
>
> In a batch merge job, we need to create dictionaries for all dimensions of 
> the new segment. For the dictionaries on dimension tables, we currently just 
> copy them from the earliest of the merging segments. 
> However, we should select the newest dictionary for the new segment, since 
> incremental dimension tables are allowed. The older dictionary can't find 
> records corresponding to new keys added to a dimension table, which leads to 
> wrong query results.





[jira] [Commented] (KYLIN-2457) Should copy the latest dictionaries on dimension tables in a batch merge job

2017-02-28 Thread Dayue Gao (JIRA)

[ 
https://issues.apache.org/jira/browse/KYLIN-2457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15887881#comment-15887881
 ] 

Dayue Gao commented on KYLIN-2457:
--

Merged to master 
https://github.com/apache/kylin/commit/a8001226b2a07cd553e680b7e14de9bf8c9981f3

[~zhengd], nice work! Thank you for your contribution!

> Should copy the latest dictionaries on dimension tables in a batch merge job
> 
>
> Key: KYLIN-2457
> URL: https://issues.apache.org/jira/browse/KYLIN-2457
> Project: Kylin
>  Issue Type: Bug
>Reporter: zhengdong
>Priority: Critical
> Attachments: 
> 0001-KYLIN-2457-Should-copy-the-latest-dictionaries-on-di.patch
>
>
> In a batch merge job, we need to create dictionaries for all dimensions of 
> the new segment. For the dictionaries on dimension tables, we currently just 
> copy them from the earliest of the merging segments. 
> However, we should select the newest dictionary for the new segment, since 
> incremental dimension tables are allowed. The older dictionary can't find 
> records corresponding to new keys added to a dimension table, which leads to 
> wrong query results.





[jira] [Updated] (KYLIN-2457) Should copy the latest dictionaries on dimension tables in a batch merge job

2017-02-28 Thread Dayue Gao (JIRA)

 [ 
https://issues.apache.org/jira/browse/KYLIN-2457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dayue Gao updated KYLIN-2457:
-
Fix Version/s: v2.0.0

> Should copy the latest dictionaries on dimension tables in a batch merge job
> 
>
> Key: KYLIN-2457
> URL: https://issues.apache.org/jira/browse/KYLIN-2457
> Project: Kylin
>  Issue Type: Bug
>Affects Versions: v1.6.0
>Reporter: zhengdong
>Priority: Critical
> Fix For: v2.0.0
>
> Attachments: 
> 0001-KYLIN-2457-Should-copy-the-latest-dictionaries-on-di.patch
>
>
> In a batch merge job, we need to create dictionaries for all dimensions of 
> the new segment. For the dictionaries on dimension tables, we currently just 
> copy them from the earliest of the merging segments. 
> However, we should select the newest dictionary for the new segment, since 
> incremental dimension tables are allowed. The older dictionary can't find 
> records corresponding to new keys added to a dimension table, which leads to 
> wrong query results.





[jira] [Updated] (KYLIN-2457) Should copy the latest dictionaries on dimension tables in a batch merge job

2017-02-28 Thread Dayue Gao (JIRA)

 [ 
https://issues.apache.org/jira/browse/KYLIN-2457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dayue Gao updated KYLIN-2457:
-
Affects Version/s: v1.6.0

> Should copy the latest dictionaries on dimension tables in a batch merge job
> 
>
> Key: KYLIN-2457
> URL: https://issues.apache.org/jira/browse/KYLIN-2457
> Project: Kylin
>  Issue Type: Bug
>Affects Versions: v1.6.0
>Reporter: zhengdong
>Priority: Critical
> Fix For: v2.0.0
>
> Attachments: 
> 0001-KYLIN-2457-Should-copy-the-latest-dictionaries-on-di.patch
>
>
> In a batch merge job, we need to create dictionaries for all dimensions of 
> the new segment. For the dictionaries on dimension tables, we currently just 
> copy them from the earliest of the merging segments. 
> However, we should select the newest dictionary for the new segment, since 
> incremental dimension tables are allowed. The older dictionary can't find 
> records corresponding to new keys added to a dimension table, which leads to 
> wrong query results.





[jira] [Updated] (KYLIN-2007) CUBOID_CACHE is not cleared when rebuilding ALL cache

2016-09-11 Thread Dayue Gao (JIRA)

 [ 
https://issues.apache.org/jira/browse/KYLIN-2007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dayue Gao updated KYLIN-2007:
-
Attachment: KYLIN-2007.patch

patch uploaded

> CUBOID_CACHE is not cleared when rebuilding ALL cache
> -
>
> Key: KYLIN-2007
> URL: https://issues.apache.org/jira/browse/KYLIN-2007
> Project: Kylin
>  Issue Type: Bug
>  Components: Query Engine
>Affects Versions: v1.5.3
>Reporter: Dayue Gao
>Assignee: Dayue Gao
>Priority: Minor
> Attachments: KYLIN-2007.patch
>
>
> The CubeMigrationCLI tool sends requests to wipe ALL caches of Kylin 
> instances to invalidate possibly stale caches. However, we forgot to clear 
> Cuboid.CUBOID_CACHE in CacheService#rebuildCache, which can lead to incorrect 
> query results.





[jira] [Created] (KYLIN-2007) CUBOID_CACHE is not cleared when rebuilding ALL cache

2016-09-11 Thread Dayue Gao (JIRA)
Dayue Gao created KYLIN-2007:


 Summary: CUBOID_CACHE is not cleared when rebuilding ALL cache
 Key: KYLIN-2007
 URL: https://issues.apache.org/jira/browse/KYLIN-2007
 Project: Kylin
  Issue Type: Bug
  Components: Query Engine
Affects Versions: v1.5.3
Reporter: Dayue Gao
Assignee: Dayue Gao
Priority: Minor


The CubeMigrationCLI tool sends requests to wipe ALL caches of Kylin instances 
to invalidate possibly stale caches. However, we forgot to clear 
Cuboid.CUBOID_CACHE in CacheService#rebuildCache, which can lead to incorrect 
query results.





[jira] [Created] (KYLIN-2013) more robust approach to hive schema changes

2016-09-13 Thread Dayue Gao (JIRA)
Dayue Gao created KYLIN-2013:


 Summary: more robust approach to hive schema changes
 Key: KYLIN-2013
 URL: https://issues.apache.org/jira/browse/KYLIN-2013
 Project: Kylin
  Issue Type: Bug
  Components: Metadata, REST Service, Web 
Affects Versions: v1.5.3
Reporter: Dayue Gao
Assignee: Dayue Gao


Our users occasionally want to change an existing cube, such as 
adding/renaming/removing a dimension. Some of these changes require 
modifications to its source hive table. When a user changes the table schema 
and reloads its metadata in Kylin, several issues can happen depending on what 
was changed.

I did some schema-change tests based on 1.5.3; the results after reloading the 
table are listed below:

|| type of changes || fact table || lookup table ||
| *minor* | both query and build still work | query can fail or return a wrong 
answer |
| *major* | fails to load related cube | fails to load related cube |

{{minor}} changes are those that don't change columns used in cubes, such as 
inserting/appending a new column, or removing/changing an unused column.

{{major}} changes are the opposite, like removing/renaming/changing the type 
of a used column.

Clearly from the table, reloading a changed table is problematic in certain 
cases. KYLIN-1536 reports a similar problem.

So what can we do to support this kind of iterative development process (load 
-> define cube -> build -> reload -> change cube -> rebuild)?

My first thought was simply to detect and prohibit reloading a used table. The 
user should be able to see which cube is preventing the reload, and could then 
drop and recreate the cube after reloading. However, defining a cube is not 
an easy task (consider editing 100 measures). Forcing users to recreate their 
cubes over and over again will certainly not make them happy.

A better idea is to keep a cube editable even if it's broken because some of 
its columns changed after reloading. A broken cube can't be built or queried; 
it can only be edited or dropped. In fact, there is a cube status called 
{{RealizationStatusEnum.DESCBROKEN}} in the code, but it was never used. We 
should take advantage of it.

An enabled cube shouldn't allow schema changes, otherwise an unintentional 
reload could make it unavailable. Similarly, a disabled but unpurged cube 
shouldn't allow schema changes since it still has data in it.








[jira] [Created] (KYLIN-2012) more robust approach to hive schema changes

2016-09-13 Thread Dayue Gao (JIRA)
Dayue Gao created KYLIN-2012:


 Summary: more robust approach to hive schema changes
 Key: KYLIN-2012
 URL: https://issues.apache.org/jira/browse/KYLIN-2012
 Project: Kylin
  Issue Type: Bug
  Components: Metadata, REST Service, Web 
Affects Versions: v1.5.3
Reporter: Dayue Gao
Assignee: Dayue Gao


Our users occasionally want to change an existing cube, such as 
adding/renaming/removing a dimension. Some of these changes require 
modifications to its source hive table. When a user changes the table schema 
and reloads its metadata in Kylin, several issues can happen depending on what 
was changed.

I did some schema-change tests based on 1.5.3; the results after reloading the 
table are listed below:

|| type of changes || fact table || lookup table ||
| *minor* | both query and build still work | query can fail or return a wrong 
answer |
| *major* | fails to load related cube | fails to load related cube |

{{minor}} changes are those that don't change columns used in cubes, such as 
inserting/appending a new column, or removing/changing an unused column.

{{major}} changes are the opposite, like removing/renaming/changing the type 
of a used column.

Clearly from the table, reloading a changed table is problematic in certain 
cases. KYLIN-1536 reports a similar problem.

So what can we do to support this kind of iterative development process (load 
-> define cube -> build -> reload -> change cube -> rebuild)?

My first thought was simply to detect and prohibit reloading a used table. The 
user should be able to see which cube is preventing the reload, and could then 
drop and recreate the cube after reloading. However, defining a cube is not 
an easy task (consider editing 100 measures). Forcing users to recreate their 
cubes over and over again will certainly not make them happy.

A better idea is to keep a cube editable even if it's broken because some of 
its columns changed after reloading. A broken cube can't be built or queried; 
it can only be edited or dropped. In fact, there is a cube status called 
{{RealizationStatusEnum.DESCBROKEN}} in the code, but it was never used. We 
should take advantage of it.

An enabled cube shouldn't allow schema changes, otherwise an unintentional 
reload could make it unavailable. Similarly, a disabled but unpurged cube 
shouldn't allow schema changes since it still has data in it.








[jira] [Commented] (KYLIN-2013) more robust approach to hive schema changes

2016-09-13 Thread Dayue Gao (JIRA)

[ 
https://issues.apache.org/jira/browse/KYLIN-2013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15489140#comment-15489140
 ] 

Dayue Gao commented on KYLIN-2013:
--

Ah... I have no idea why it got created twice.

> more robust approach to hive schema changes
> ---
>
> Key: KYLIN-2013
> URL: https://issues.apache.org/jira/browse/KYLIN-2013
> Project: Kylin
>  Issue Type: Bug
>  Components: Metadata, REST Service, Web 
>Affects Versions: v1.5.3
>Reporter: Dayue Gao
>Assignee: Dayue Gao
>
> Our users occasionally want to change an existing cube, such as 
> adding/renaming/removing a dimension. Some of these changes require 
> modifications to its source hive table. When a user changes the table schema 
> and reloads its metadata in Kylin, several issues can happen depending on 
> what was changed.
> I did some schema-change tests based on 1.5.3; the results after reloading 
> the table are listed below:
> || type of changes || fact table || lookup table ||
> | *minor* | both query and build still work | query can fail or return a 
> wrong answer |
> | *major* | fails to load related cube | fails to load related cube |
> {{minor}} changes are those that don't change columns used in cubes, such 
> as inserting/appending a new column, or removing/changing an unused column.
> {{major}} changes are the opposite, like removing/renaming/changing the type 
> of a used column.
> Clearly from the table, reloading a changed table is problematic in certain 
> cases. KYLIN-1536 reports a similar problem.
> So what can we do to support this kind of iterative development process (load 
> -> define cube -> build -> reload -> change cube -> rebuild)?
> My first thought was simply to detect and prohibit reloading a used table. 
> The user should be able to see which cube is preventing the reload, and could 
> then drop and recreate the cube after reloading. However, defining a cube is 
> not an easy task (consider editing 100 measures). Forcing users to recreate 
> their cubes over and over again will certainly not make them happy.
> A better idea is to keep a cube editable even if it's broken because some of 
> its columns changed after reloading. A broken cube can't be built or queried; 
> it can only be edited or dropped. In fact, there is a cube status called 
> {{RealizationStatusEnum.DESCBROKEN}} in the code, but it was never used. We 
> should take advantage of it.
> An enabled cube shouldn't allow schema changes, otherwise an unintentional 
> reload could make it unavailable. Similarly, a disabled but unpurged cube 
> shouldn't allow schema changes since it still has data in it.





[jira] [Commented] (KYLIN-2013) more robust approach to hive schema changes

2016-09-13 Thread Dayue Gao (JIRA)

[ 
https://issues.apache.org/jira/browse/KYLIN-2013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15489097#comment-15489097
 ] 

Dayue Gao commented on KYLIN-2013:
--

Hi [~yimingliu], could you point me to the JIRA this duplicates? Has this 
issue already been fixed? I was just about to submit a patch for it.

> more robust approach to hive schema changes
> ---
>
> Key: KYLIN-2013
> URL: https://issues.apache.org/jira/browse/KYLIN-2013
> Project: Kylin
>  Issue Type: Bug
>  Components: Metadata, REST Service, Web 
>Affects Versions: v1.5.3
>Reporter: Dayue Gao
>Assignee: Dayue Gao
>
> Our users occasionally want to change an existing cube, such as 
> adding/renaming/removing a dimension. Some of these changes require 
> modifications to its source hive table. When a user changes the table schema 
> and reloads its metadata in Kylin, several issues can happen depending on 
> what was changed.
> I did some schema-change tests based on 1.5.3; the results after reloading 
> the table are listed below:
> || type of changes || fact table || lookup table ||
> | *minor* | both query and build still work | query can fail or return a 
> wrong answer |
> | *major* | fails to load related cube | fails to load related cube |
> {{minor}} changes are those that don't change columns used in cubes, such 
> as inserting/appending a new column, or removing/changing an unused column.
> {{major}} changes are the opposite, like removing/renaming/changing the type 
> of a used column.
> Clearly from the table, reloading a changed table is problematic in certain 
> cases. KYLIN-1536 reports a similar problem.
> So what can we do to support this kind of iterative development process (load 
> -> define cube -> build -> reload -> change cube -> rebuild)?
> My first thought was simply to detect and prohibit reloading a used table. 
> The user should be able to see which cube is preventing the reload, and could 
> then drop and recreate the cube after reloading. However, defining a cube is 
> not an easy task (consider editing 100 measures). Forcing users to recreate 
> their cubes over and over again will certainly not make them happy.
> A better idea is to keep a cube editable even if it's broken because some of 
> its columns changed after reloading. A broken cube can't be built or queried; 
> it can only be edited or dropped. In fact, there is a cube status called 
> {{RealizationStatusEnum.DESCBROKEN}} in the code, but it was never used. We 
> should take advantage of it.
> An enabled cube shouldn't allow schema changes, otherwise an unintentional 
> reload could make it unavailable. Similarly, a disabled but unpurged cube 
> shouldn't allow schema changes since it still has data in it.





[jira] [Commented] (KYLIN-2007) CUBOID_CACHE is not cleared when rebuilding ALL cache

2016-09-13 Thread Dayue Gao (JIRA)

[ 
https://issues.apache.org/jira/browse/KYLIN-2007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15489189#comment-15489189
 ] 

Dayue Gao commented on KYLIN-2007:
--

committed to master

> CUBOID_CACHE is not cleared when rebuilding ALL cache
> -
>
> Key: KYLIN-2007
> URL: https://issues.apache.org/jira/browse/KYLIN-2007
> Project: Kylin
>  Issue Type: Bug
>  Components: Query Engine
>Affects Versions: v1.5.3
>Reporter: Dayue Gao
>Assignee: Dayue Gao
>Priority: Minor
> Attachments: KYLIN-2007.patch
>
>
> The CubeMigrationCLI tool sends requests to wipe ALL caches of Kylin 
> instances to invalidate possibly stale caches. However, we forgot to clear 
> Cuboid.CUBOID_CACHE in CacheService#rebuildCache, which can lead to incorrect 
> query results.





[jira] [Commented] (KYLIN-2012) more robust approach to hive schema changes

2016-09-14 Thread Dayue Gao (JIRA)

[ 
https://issues.apache.org/jira/browse/KYLIN-2012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15489712#comment-15489712
 ] 

Dayue Gao commented on KYLIN-2012:
--

commit 17569f6 to master.

SchemaChecker is the main workhorse; it prevents dangerous reloads according to 
the following rules:
* if a table has been used as a fact table, the columns used in cubes can't be 
changed. This means:
** removing/renaming a used column is not allowed
** type changes of used columns are generally not allowed, except 
{{float<=>double}} and {{tinyint<=>smallint<=>integer<=>bigint}}
** adding/removing/changing unused columns is ok
* if a table has been used as a lookup table, the old and new schemas should be 
the same, except for the type changes {{float<=>double}} and 
{{tinyint<=>smallint<=>integer<=>bigint}}. This means:
** adding/removing/renaming/reordering columns is not allowed

(PS: I'm aware that KYLIN-1985 could allow some degree of schema changes on 
lookup tables, so the above rule for lookup tables may be too strict)
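
The type-compatibility rule above can be sketched roughly as follows. This is a minimal illustration; the class and method names are hypothetical, not the actual SchemaChecker code:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class TypeCompatSketch {
    // Groups of hive types treated as mutually compatible per the rules above.
    private static final List<Set<String>> COMPATIBLE_GROUPS = Arrays.asList(
            new HashSet<>(Arrays.asList("float", "double")),
            new HashSet<>(Arrays.asList("tinyint", "smallint", "integer", "bigint")));

    public static boolean isCompatibleChange(String oldType, String newType) {
        if (oldType.equals(newType)) {
            return true; // no change at all
        }
        for (Set<String> group : COMPATIBLE_GROUPS) {
            if (group.contains(oldType) && group.contains(newType)) {
                return true; // change within a compatible group
            }
        }
        return false; // any other type change of a used column is rejected
    }
}
```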

When a non-empty cube violates these rules, no reloading is performed. An 
error message containing details about the violation is shown.

When only empty cubes violate these rules, reloading succeeds. All 
violating cubes are changed to {{DESCBROKEN}} status (see 
CubeManager#reloadCubeLocalAt). The status is shown in an orange warning label 
at the front-end so that users can easily find all broken cubes.

Users can edit or drop a broken cube, but can't disable/enable/build/copy it. 
After the user fixes all the problems in the cube (and model), the cube goes 
back to DISABLED status. As always, trying to save a broken cube won't succeed. 
Therefore, DESCBROKEN status can only appear after reloading a changed table.

[~Shaofengshi] [~yimingliu] Do you have time to review the code?

[~zhongjian] I'm not an expert on front-end, could you also review the 
front-end changes?

> more robust approach to hive schema changes
> ---
>
> Key: KYLIN-2012
> URL: https://issues.apache.org/jira/browse/KYLIN-2012
> Project: Kylin
>  Issue Type: Bug
>  Components: Metadata, REST Service, Web 
>Affects Versions: v1.5.3
>Reporter: Dayue Gao
>Assignee: Dayue Gao
>
> Our users occasionally want to change their existing cube, such as 
> adding/renaming/removing a dimension. Some of these changes require 
> modifications to its source hive table. So the user changes the table schema 
> and reloads its metadata in Kylin; then several issues can happen depending on 
> what was changed.
> I did some schema-changing tests based on 1.5.3; the results after reloading 
> the table are listed below
> || type of changes || fact table || lookup table ||
> | *minor* | both query and build still work | query can fail or return a wrong 
> answer |
> | *major* | fail to load related cube | fail to load related cube |
> {{minor}} changes refer to those that don't change columns used in cubes, such 
> as inserting/appending a new column, or removing/changing an unused column.
> {{major}} changes are the opposite, like removing/renaming/changing the type 
> of a used column.
> Clearly from the table, reloading a changed table is problematic in certain 
> cases. KYLIN-1536 reports a similar problem.
> So what can we do to support this kind of iterative development process (load 
> -> define cube -> build -> reload -> change cube -> rebuild)?
> My first thought was simply to detect and prohibit reloading a used table. The 
> user should be able to know which cube is preventing the reload, and could 
> then drop and recreate the cube after reloading. However, defining a cube is 
> not an easy task (consider editing 100 measures). Forcing users to recreate 
> their cube over and over again will certainly not make them happy.
> A better idea is to allow a cube to be editable even if it's broken because some 
> columns changed after reloading. A broken cube can't be built or queried; it 
> can only be edited or dropped. In fact, there is a cube status called 
> {{RealizationStatusEnum.DESCBROKEN}} in the code, but it was never used. We should 
> take advantage of it.
> An enabled cube shouldn't allow schema changes, otherwise an unintentional 
> reload could make it unavailable. Similarly, a disabled but unpurged cube 
> shouldn't allow schema changes since it still has data in it.





[jira] [Comment Edited] (KYLIN-2012) more robust approach to hive schema changes

2016-09-14 Thread Dayue Gao (JIRA)

[ 
https://issues.apache.org/jira/browse/KYLIN-2012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15489712#comment-15489712
 ] 

Dayue Gao edited comment on KYLIN-2012 at 9/14/16 7:52 AM:
---

commit 17569f6 to master.

SchemaChecker is the main workhorse; it prevents dangerous reloads according to 
the following rules:
* if a table has been used as a fact table, the columns used in cubes can't be 
changed. This means:
** removing/renaming a used column is not allowed
** type changes of used columns are generally not allowed, except 
{{float<=>double}} and {{tinyint<=>smallint<=>integer<=>bigint}}
** adding/removing/changing unused columns is ok
* if a table has been used as a lookup table, the old and new schemas should be 
the same, except for the type changes above. This means:
** adding/removing/renaming/reordering columns is not allowed

(PS: I'm aware that after KYLIN-1985, we could allow some schema changes on 
lookup tables, so the above rule for lookup tables may be too strict)

{color:red}When a non-empty cube violates these rules, no reloading is 
performed{color}. An error message containing details about the violation is 
shown.

{color:blue}When only empty cubes violate these rules, reloading 
succeeds{color}. All violating cubes are then changed to {{DESCBROKEN}} status 
(see CubeManager#reloadCubeLocalAt). The status is shown in an orange warning 
label at the front-end so that users can easily find all broken cubes.

*Users can edit or drop a broken cube, but can't disable/enable/build/copy it.* 
After the user fixes all the problems in the cube (and model), the cube goes 
back to DISABLED status. As always, trying to save a broken cube won't succeed. 
Therefore, DESCBROKEN status can only appear after reloading a changed table.

[~Shaofengshi] [~yimingliu] Do you have time to review the code?

[~zhongjian] I'm not an expert on front-end, could you also review the 
front-end changes?


was (Author: gaodayue):
commit 17569f6 to master.

SchemaChecker is the main workhorse; it prevents dangerous reloads according to 
the following rules:
* if a table has been used as a fact table, the columns used in cubes can't be 
changed. This means:
** removing/renaming a used column is not allowed
** type changes of used columns are generally not allowed, except 
{{float<=>double}} and {{tinyint<=>smallint<=>integer<=>bigint}}
** adding/removing/changing unused columns is ok
* if a table has been used as a lookup table, the old and new schemas should be 
the same, except for the type changes {{float<=>double}} and 
{{tinyint<=>smallint<=>integer<=>bigint}}. This means:
** adding/removing/renaming/reordering columns is not allowed

(PS: I'm aware that KYLIN-1985 could allow some degree of schema changes on 
lookup tables, so the above rule for lookup tables may be too strict)

When a non-empty cube violates these rules, no reloading is performed. An 
error message containing details about the violation is shown.

When only empty cubes violate these rules, reloading succeeds. All 
violating cubes are changed to {{DESCBROKEN}} status (see 
CubeManager#reloadCubeLocalAt). The status is shown in an orange warning label 
at the front-end so that users can easily find all broken cubes.

Users can edit or drop a broken cube, but can't disable/enable/build/copy it. 
After the user fixes all the problems in the cube (and model), the cube goes 
back to DISABLED status. As always, trying to save a broken cube won't succeed. 
Therefore, DESCBROKEN status can only appear after reloading a changed table.

[~Shaofengshi] [~yimingliu] Do you have time to review the code?

[~zhongjian] I'm not an expert on front-end, could you also review the 
front-end changes?

> more robust approach to hive schema changes
> ---
>
> Key: KYLIN-2012
> URL: https://issues.apache.org/jira/browse/KYLIN-2012
> Project: Kylin
>  Issue Type: Bug
>  Components: Metadata, REST Service, Web 
>Affects Versions: v1.5.3
>Reporter: Dayue Gao
>Assignee: Dayue Gao
>
> Our users occasionally want to change their existing cube, such as 
> adding/renaming/removing a dimension. Some of these changes require 
> modifications to its source hive table. So the user changes the table schema 
> and reloads its metadata in Kylin; then several issues can happen depending on 
> what was changed.
> I did some schema-changing tests based on 1.5.3; the results after reloading 
> the table are listed below
> || type of changes || fact table || lookup table ||
> | *minor* | both query and build still work | query can fail or return a wrong 
> answer |
> | *major* | fail to load related cube | fail to load related cube |
> {{minor}} changes refer to those that don't change columns used in cubes, such 
> as inserting/appending a new column, or removing/changing an unused column.
> {{major}} changes are the opposite, like removing/renaming/changing the type 
> of a used column.
> Clearly from the table, 

[jira] [Closed] (KYLIN-2007) CUBOID_CACHE is not cleared when rebuilding ALL cache

2016-09-17 Thread Dayue Gao (JIRA)

 [ 
https://issues.apache.org/jira/browse/KYLIN-2007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dayue Gao closed KYLIN-2007.

   Resolution: Fixed
Fix Version/s: v1.6.0

> CUBOID_CACHE is not cleared when rebuilding ALL cache
> -
>
> Key: KYLIN-2007
> URL: https://issues.apache.org/jira/browse/KYLIN-2007
> Project: Kylin
>  Issue Type: Bug
>  Components: Query Engine
>Affects Versions: v1.5.3
>Reporter: Dayue Gao
>Assignee: Dayue Gao
>Priority: Minor
> Fix For: v1.6.0
>
> Attachments: KYLIN-2007.patch
>
>
> CubeMigrationCLI tool sends requests to wipe ALL cache of Kylin instances to 
> invalidate possibly stale caches. However, we forgot to clear 
> Cuboid.CUBOID_CACHE in CacheService#rebuildCache, which can lead to incorrect 
> query results.





[jira] [Created] (KYLIN-2058) Make Kylin more resilient to bad queries

2016-09-28 Thread Dayue Gao (JIRA)
Dayue Gao created KYLIN-2058:


 Summary: Make Kylin more resilient to bad queries
 Key: KYLIN-2058
 URL: https://issues.apache.org/jira/browse/KYLIN-2058
 Project: Kylin
  Issue Type: Improvement
  Components: Query Engine, Storage - HBase
Affects Versions: v1.6.0
Reporter: Dayue Gao
Assignee: Dayue Gao


Bad/big queries are a huge threat to the overall performance and stability of 
Kylin. We occasionally saw some of these queries either causing heavy GC 
activity or crashing region servers. I'd like to start a series of work to 
make Kylin more resilient to bad queries.

This is an umbrella JIRA for the related work.





[jira] [Created] (KYLIN-2173) push down limit leads to wrong answer when filter is loosened

2016-11-09 Thread Dayue Gao (JIRA)
Dayue Gao created KYLIN-2173:


 Summary: push down limit leads to wrong answer when filter is 
loosened
 Key: KYLIN-2173
 URL: https://issues.apache.org/jira/browse/KYLIN-2173
 Project: Kylin
  Issue Type: Bug
  Components: Storage - HBase
Affects Versions: v1.5.4.1
Reporter: Dayue Gao
Assignee: Dayue Gao


To reproduce:
{noformat}
select
 test_kylin_fact.cal_dt
 ,sum(test_kylin_fact.price) as GMV
 FROM test_kylin_fact 
 left JOIN edw.test_cal_dt as test_cal_dt 
 ON test_kylin_fact.cal_dt = test_cal_dt.cal_dt 
 where test_cal_dt.week_beg_dt in ('2012-01-01', '2012-01-20')
 group by test_kylin_fact.cal_dt 
 limit 12
{noformat}

Kylin returns 5 rows; 12 rows are expected.

Root cause: the filter condition may be loosened when we translate a derived 
filter in DerivedFilterTranslator. If we push down the limit, the query server 
won't get enough valid records from storage. In the above example, 24 rows are 
returned from storage, but only 5 are valid.
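
A toy illustration of this root cause, using integers as rows and divisibility tests as filters (not Kylin's actual storage code): when the limit is applied below the loosened filter, the exact filter is starved of candidates and the final result has fewer rows than requested.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class PushDownLimitSketch {
    // Storage evaluates a loosened filter (multiples of 2) while the query
    // server applies the exact filter (multiples of 4). Pushing the limit
    // into storage cuts off rows the exact filter still needs.
    static List<Integer> query(List<Integer> rows, int limit, boolean pushDownLimit) {
        Stream<Integer> storage = rows.stream().filter(r -> r % 2 == 0); // loosened filter
        if (pushDownLimit) {
            storage = storage.limit(limit); // limit applied in storage
        }
        return storage.filter(r -> r % 4 == 0) // exact filter at query server
                .limit(limit)
                .collect(Collectors.toList());
    }
}
```

With rows 0..23 and limit 5, pushing down the limit yields only 3 valid rows instead of 5.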






[jira] [Commented] (KYLIN-1609) Push down undefined Count Distinct aggregation to storage

2016-11-09 Thread Dayue Gao (JIRA)

[ 
https://issues.apache.org/jira/browse/KYLIN-1609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15650757#comment-15650757
 ] 

Dayue Gao commented on KYLIN-1609:
--

Hi [~lidong_sjtu], is there any further plan on this? What's the motivation for 
pushing down count(distinct dim)?

> Push down undefined Count Distinct aggregation to storage
> -
>
> Key: KYLIN-1609
> URL: https://issues.apache.org/jira/browse/KYLIN-1609
> Project: Kylin
>  Issue Type: New Feature
>  Components: Query Engine
>Affects Versions: v1.5.1
>Reporter: Dong Li
>Assignee: Dong Li
>Priority: Minor
>
> KYLIN-1016 already enabled count distinct aggregation on dimension which are 
> not defined as COUNT_DISTINCT measures. But it's only in query engine level. 
> This JIRA will push it deeper.





[jira] [Commented] (KYLIN-2159) Redistribution Hive Table Step always requires row_count filename as 000000_0

2016-11-05 Thread Dayue Gao (JIRA)

[ 
https://issues.apache.org/jira/browse/KYLIN-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15639870#comment-15639870
 ] 

Dayue Gao commented on KYLIN-2159:
--

We also ran into this problem once; we should find the target file by the 
pattern "000000_*".
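
A minimal sketch of locating the row_count file by glob rather than assuming the exact name `000000_0` from the issue title. Plain java.nio is used for illustration; Kylin would use the Hadoop FileSystem equivalent, and the class and method names here are hypothetical:

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class RowCountFileSketch {
    // A retried MR attempt can produce a different suffix, so match by
    // pattern instead of hardcoding the filename.
    public static Path findRowCountFile(Path dir) throws IOException {
        try (DirectoryStream<Path> ds = Files.newDirectoryStream(dir, "000000_*")) {
            for (Path p : ds) {
                return p; // first match
            }
        }
        return null; // no row_count output found
    }
}
```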

> Redistribution Hive Table Step always requires row_count filename as 000000_0 
> --
>
> Key: KYLIN-2159
> URL: https://issues.apache.org/jira/browse/KYLIN-2159
> Project: Kylin
>  Issue Type: Bug
>Reporter: Dong Li
>
> In some cases, the filename is not 000000_0.
> For example, the output of a second attempt of the MR job might become 00_01000.
> java.io.FileNotFoundException: File does not exist: 
> /kylin/kylin_metadata/kylin-xxx/row_count/000000_0





[jira] [Created] (KYLIN-2115) some extended column query returns wrong answer

2016-10-20 Thread Dayue Gao (JIRA)
Dayue Gao created KYLIN-2115:


 Summary: some extended column query returns wrong answer
 Key: KYLIN-2115
 URL: https://issues.apache.org/jira/browse/KYLIN-2115
 Project: Kylin
  Issue Type: Bug
  Components: General
Affects Versions: v1.5.4, v1.5.4.1
Reporter: Dayue Gao
Assignee: Dayue Gao
Priority: Critical


KYLIN-1979 introduced a bug, which can cause extended column queries to return 
wrong results if the user defines more than one extended column metric.

{noformat}
Example: let's define two extended columns
1. metricA(host=h1, extend=e1)
2. metricB(host=h2, extend=e2).

"select h1, e1 ... group by h1,e1" correct.
"select h1, e1, h2, e2 ... group by h1,e1, h2, e2" correct.
"select h2, e2 ... group by h2, e2" wrong. (column e2 is empty)

{noformat}





[jira] [Resolved] (KYLIN-2115) some extended column query returns wrong answer

2016-10-20 Thread Dayue Gao (JIRA)

 [ 
https://issues.apache.org/jira/browse/KYLIN-2115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dayue Gao resolved KYLIN-2115.
--
   Resolution: Fixed
Fix Version/s: v1.6.0

fixed in 
https://github.com/apache/kylin/commit/bec4a888cbacf1db1dca81a50919c935b7cb1d96

> some extended column query returns wrong answer
> ---
>
> Key: KYLIN-2115
> URL: https://issues.apache.org/jira/browse/KYLIN-2115
> Project: Kylin
>  Issue Type: Bug
>  Components: General
>Affects Versions: v1.5.4, v1.5.4.1
>Reporter: Dayue Gao
>Assignee: Dayue Gao
>Priority: Critical
> Fix For: v1.6.0
>
>
> KYLIN-1979 introduced a bug, which can cause extended column queries to return 
> wrong results if the user defines more than one extended column metric.
> {noformat}
> Example: let's define two extended columns
> 1. metricA(host=h1, extend=e1)
> 2. metricB(host=h2, extend=e2).
> "select h1, e1 ... group by h1,e1" correct.
> "select h1, e1, h2, e2 ... group by h1,e1, h2, e2" correct.
> "select h2, e2 ... group by h2, e2" wrong. (column e2 is empty)
> {noformat}





[jira] [Commented] (KYLIN-2105) add QueryId

2016-10-20 Thread Dayue Gao (JIRA)

[ 
https://issues.apache.org/jira/browse/KYLIN-2105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15591776#comment-15591776
 ] 

Dayue Gao commented on KYLIN-2105:
--

[~liyang.g...@gmail.com], thank you for your suggestions.

The motivation for adding the project to the query ID (and the thread name) is 
that users can quickly tell which project is putting the most load on HBase (by 
tailing the regionserver log). The timestamp is redundant; I included it mainly 
to improve the uniqueness of the query ID.

I just did a quick search on query ID format used by other projects:
* Presto includes timestamp in query ID
* Druid and Drill use UUID. Druid also includes datasource name (similar to 
Kylin cube name) in query thread name.

For code simplicity and log cleanliness, UUID seems a good choice. What do you 
think?



> add QueryId
> ---
>
> Key: KYLIN-2105
> URL: https://issues.apache.org/jira/browse/KYLIN-2105
> Project: Kylin
>  Issue Type: Sub-task
>  Components: Query Engine, Storage - HBase
>Affects Versions: v1.6.0
>Reporter: Dayue Gao
>Assignee: Dayue Gao
>  Labels: patch
> Fix For: v1.6.0
>
>
> * for each query, generate a unique id.
> * the id could describe some context information about the query, like start 
> time, project name, etc.
> * for the query thread, we could use the query id as the name of the thread. 
> As long as the thread's name is logged, the query log can be grepped by query 
> id afterwards.
> * pass the query id to the coprocessor, so that the query id gets logged both 
> in the query server and the region server.
> * BadQueryDetector should also log the query id





[jira] [Created] (KYLIN-2105) add QueryId

2016-10-18 Thread Dayue Gao (JIRA)
Dayue Gao created KYLIN-2105:


 Summary: add QueryId
 Key: KYLIN-2105
 URL: https://issues.apache.org/jira/browse/KYLIN-2105
 Project: Kylin
  Issue Type: Sub-task
  Components: Query Engine, Storage - HBase
Affects Versions: v1.6.0
Reporter: Dayue Gao
Assignee: Dayue Gao
 Fix For: v1.6.0


* for each query, generate a unique id.
* the id could describe some context information about the query, like start 
time, project name, etc.
* for the query thread, we could use the query id as the name of the thread. As 
long as the thread's name is logged, the query log can be grepped by query id 
afterwards.
* pass the query id to the coprocessor, so that the query id gets logged both in 
the query server and the region server.
* BadQueryDetector should also log the query id





[jira] [Commented] (KYLIN-2105) add QueryId

2016-10-18 Thread Dayue Gao (JIRA)

[ 
https://issues.apache.org/jira/browse/KYLIN-2105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15587203#comment-15587203
 ] 

Dayue Gao commented on KYLIN-2105:
--

commit 
https://github.com/apache/kylin/commit/db09f5f9cae5a3d3ff731221cbb1c026da4f4e41 
to master.

QueryID format: "MMdd_HHmmss_project_xx"
* "MMdd_HHmmss": query submit time
* "project": the project the query was submitted to
* "xx": six randomly generated base26 (a-z) characters
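
A minimal sketch of generating an ID in this format; the class and method names are assumptions for illustration, not Kylin's actual implementation:

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.concurrent.ThreadLocalRandom;

public class QueryIdSketch {
    // Builds an id like "1018_093015_myproject_kqzxwa": submit time,
    // project name, then six random base26 (a-z) characters.
    public static String newQueryId(String project, Date submitTime) {
        String ts = new SimpleDateFormat("MMdd_HHmmss").format(submitTime);
        StringBuilder suffix = new StringBuilder(6);
        for (int i = 0; i < 6; i++) {
            suffix.append((char) ('a' + ThreadLocalRandom.current().nextInt(26)));
        }
        return ts + "_" + project + "_" + suffix;
    }
}
```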

We need a thread-local context to pass the QueryID to the storage layer; use 
BackdoorToggles for now. Maybe we should rename BackdoorToggles to QueryContext?

Added a field to the coprocessor protobuf interface, which is backward 
compatible with 1.5.4.


> add QueryId
> ---
>
> Key: KYLIN-2105
> URL: https://issues.apache.org/jira/browse/KYLIN-2105
> Project: Kylin
>  Issue Type: Sub-task
>  Components: Query Engine, Storage - HBase
>Affects Versions: v1.6.0
>Reporter: Dayue Gao
>Assignee: Dayue Gao
>  Labels: patch
> Fix For: v1.6.0
>
>
> * for each query, generate a unique id.
> * the id could describe some context information about the query, like start 
> time, project name, etc.
> * for the query thread, we could use the query id as the name of the thread. 
> As long as the thread's name is logged, the query log can be grepped by query 
> id afterwards.
> * pass the query id to the coprocessor, so that the query id gets logged both 
> in the query server and the region server.
> * BadQueryDetector should also log the query id





[jira] [Commented] (KYLIN-2173) push down limit leads to wrong answer when filter is loosened

2016-11-14 Thread Dayue Gao (JIRA)

[ 
https://issues.apache.org/jira/browse/KYLIN-2173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15663899#comment-15663899
 ] 

Dayue Gao commented on KYLIN-2173:
--

Committed to master and v1.6.0-rc2. The UT seems to be broken by a previous 
commit; will add a test case later.

> push down limit leads to wrong answer when filter is loosened
> -
>
> Key: KYLIN-2173
> URL: https://issues.apache.org/jira/browse/KYLIN-2173
> Project: Kylin
>  Issue Type: Bug
>  Components: Storage - HBase
>Affects Versions: v1.5.4.1
>Reporter: Dayue Gao
>Assignee: Dayue Gao
>
> To reproduce:
> {noformat}
> select
>  test_kylin_fact.cal_dt
>  ,sum(test_kylin_fact.price) as GMV
>  FROM test_kylin_fact 
>  left JOIN edw.test_cal_dt as test_cal_dt 
>  ON test_kylin_fact.cal_dt = test_cal_dt.cal_dt 
>  where test_cal_dt.week_beg_dt in ('2012-01-01', '2012-01-20')
>  group by test_kylin_fact.cal_dt 
>  limit 12
> {noformat}
> Kylin returns 5 rows; 12 rows are expected.
> Root cause: the filter condition may be loosened when we translate a derived 
> filter in DerivedFilterTranslator. If we push down the limit, the query server 
> won't get enough valid records from storage. In the above example, 24 rows are 
> returned from storage, but only 5 are valid.





[jira] [Commented] (KYLIN-2221) rethink on KYLIN-1684

2016-11-21 Thread Dayue Gao (JIRA)

[ 
https://issues.apache.org/jira/browse/KYLIN-2221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15685855#comment-15685855
 ] 

Dayue Gao commented on KYLIN-2221:
--

+1.

If we can improve the way empty segments are determined, even 
"kylin.query.skip-empty-segments" becomes superfluous.

> rethink on KYLIN-1684
> -
>
> Key: KYLIN-2221
> URL: https://issues.apache.org/jira/browse/KYLIN-2221
> Project: Kylin
>  Issue Type: Improvement
>Reporter: hongbin ma
>Assignee: hongbin ma
>






[jira] [Updated] (KYLIN-2200) CompileException on UNION ALL query when result only contains one column

2016-11-17 Thread Dayue Gao (JIRA)

 [ 
https://issues.apache.org/jira/browse/KYLIN-2200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dayue Gao updated KYLIN-2200:
-
Attachment: KYLIN-2200.patch

patch uploaded. 

There are no ITs for UNION/UNION ALL right now. Let me add test cases once my 
sandbox env is fixed.

> CompileException on UNION ALL query when result only contains one column
> 
>
> Key: KYLIN-2200
> URL: https://issues.apache.org/jira/browse/KYLIN-2200
> Project: Kylin
>  Issue Type: Bug
>  Components: Query Engine
>Affects Versions: v1.5.4.1
>Reporter: Dayue Gao
>Assignee: Dayue Gao
> Attachments: KYLIN-2200.patch
>
>
> {code:sql}
> select count(*) from kylin_sales
> union all
> select count(*) from kylin_sales
> {code}
> got following exception
> {noformat}
> Caused by: org.codehaus.commons.compiler.CompileException: Line 82, Column 
> 32: Cannot determine simple type name "Record11_1"
> at 
> org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:10092)
> at 
> org.codehaus.janino.UnitCompiler.getReferenceType(UnitCompiler.java:5375)
> at 
> org.codehaus.janino.UnitCompiler.getReferenceType(UnitCompiler.java:5184)
> at org.codehaus.janino.UnitCompiler.getType2(UnitCompiler.java:5165)
> at 
> org.codehaus.janino.UnitCompiler.access$12600(UnitCompiler.java:183)
> at 
> org.codehaus.janino.UnitCompiler$16.visitReferenceType(UnitCompiler.java:5096)
> at org.codehaus.janino.Java$ReferenceType.accept(Java.java:2880)
> at org.codehaus.janino.UnitCompiler.getType(UnitCompiler.java:5136)
> at org.codehaus.janino.UnitCompiler.getType2(UnitCompiler.java:5598)
> at 
> org.codehaus.janino.UnitCompiler.access$13300(UnitCompiler.java:183)
> at 
> org.codehaus.janino.UnitCompiler$16.visitCast(UnitCompiler.java:5104)
> {noformat}





[jira] [Commented] (KYLIN-2200) CompileException on UNION ALL query when result only contains one column

2016-11-16 Thread Dayue Gao (JIRA)

[ 
https://issues.apache.org/jira/browse/KYLIN-2200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15673043#comment-15673043
 ] 

Dayue Gao commented on KYLIN-2200:
--

Union also fails to remove duplicates.
{code:sql}
select count(*), sum(price) from kylin_sales
union 
select count(*), sum(price) from kylin_sales
{code}

When the input row format is Array, EnumerableUnion should use 
ExtendedEnumerable.union(source, Functions.arrayComparer()) instead of 
ExtendedEnumerable.union(source).
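
The need for a content-based comparer can be seen with plain Java: Object[] inherits identity-based equals/hashCode, so set-based deduplication of array rows silently fails. This sketch uses java.util rather than Calcite's linq4j, and its names are illustrative only:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ArrayRowDedupSketch {
    // A plain HashSet never merges two Object[] rows with equal contents,
    // because arrays compare by reference identity.
    public static int distinctByIdentity(List<Object[]> rows) {
        return new HashSet<Object[]>(rows).size();
    }

    // Comparing by contents (what Functions.arrayComparer() provides for
    // linq4j) merges them correctly; List.equals is content-based.
    public static int distinctByContent(List<Object[]> rows) {
        Set<List<Object>> seen = new HashSet<>();
        for (Object[] row : rows) {
            seen.add(Arrays.asList(row));
        }
        return seen.size();
    }
}
```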



> CompileException on UNION ALL query when result only contains one column
> 
>
> Key: KYLIN-2200
> URL: https://issues.apache.org/jira/browse/KYLIN-2200
> Project: Kylin
>  Issue Type: Bug
>  Components: Query Engine
>Affects Versions: v1.5.4.1
>Reporter: Dayue Gao
>Assignee: Dayue Gao
>
> {code:sql}
> select count(*) from kylin_sales
> union all
> select count(*) from kylin_sales
> {code}
> got following exception
> {noformat}
> Caused by: org.codehaus.commons.compiler.CompileException: Line 82, Column 
> 32: Cannot determine simple type name "Record11_1"
> at 
> org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:10092)
> at 
> org.codehaus.janino.UnitCompiler.getReferenceType(UnitCompiler.java:5375)
> at 
> org.codehaus.janino.UnitCompiler.getReferenceType(UnitCompiler.java:5184)
> at org.codehaus.janino.UnitCompiler.getType2(UnitCompiler.java:5165)
> at 
> org.codehaus.janino.UnitCompiler.access$12600(UnitCompiler.java:183)
> at 
> org.codehaus.janino.UnitCompiler$16.visitReferenceType(UnitCompiler.java:5096)
> at org.codehaus.janino.Java$ReferenceType.accept(Java.java:2880)
> at org.codehaus.janino.UnitCompiler.getType(UnitCompiler.java:5136)
> at org.codehaus.janino.UnitCompiler.getType2(UnitCompiler.java:5598)
> at 
> org.codehaus.janino.UnitCompiler.access$13300(UnitCompiler.java:183)
> at 
> org.codehaus.janino.UnitCompiler$16.visitCast(UnitCompiler.java:5104)
> {noformat}





[jira] [Commented] (KYLIN-2200) CompileException on UNION ALL query when result only contains one column

2016-11-16 Thread Dayue Gao (JIRA)

[ 
https://issues.apache.org/jira/browse/KYLIN-2200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15672620#comment-15672620
 ] 

Dayue Gao commented on KYLIN-2200:
--

Still got a problem with the following query
{code:sql}
select count(*) from kylin_sales where lstg_format_name='FP-GTC'
union all 
select count(*) from kylin_sales where lstg_format_name='FP-GTC'
{code}

Exception:
{noformat}
Caused by: org.codehaus.commons.compiler.CompileException: Line 138, Column 23: 
No applicable constructor/method found for actual parameters "int, long"; 
candidates are: "Baz$Record2_1()"
{noformat}

Generated Code:
{code:java}
/* 137 */   public Object current() {
/* 138 */ return new Record2_1(
/* 139 */ 0,
/* 140 */ 
org.apache.calcite.runtime.SqlFunctions.toLong(((Object[]) 
inputEnumerator.current())[8]));
/* 141 */   }
{code}

Interestingly, EnumerableCalc uses a constructor to initialize the 
SyntheticRecordType, but EnumerableRelImplementor doesn't generate a with-args 
constructor for it.

[~julianhyde] Is it a Calcite bug, or is Kylin not using it in the right way?


> CompileException on UNION ALL query when result only contains one column
> 
>
> Key: KYLIN-2200
> URL: https://issues.apache.org/jira/browse/KYLIN-2200
> Project: Kylin
>  Issue Type: Bug
>  Components: Query Engine
>Affects Versions: v1.5.4.1
>Reporter: Dayue Gao
>Assignee: Dayue Gao
>
> {code:sql}
> select count(*) from kylin_sales
> union all
> select count(*) from kylin_sales
> {code}
> got following exception
> {noformat}
> Caused by: org.codehaus.commons.compiler.CompileException: Line 82, Column 
> 32: Cannot determine simple type name "Record11_1"
> at 
> org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:10092)
> at 
> org.codehaus.janino.UnitCompiler.getReferenceType(UnitCompiler.java:5375)
> at 
> org.codehaus.janino.UnitCompiler.getReferenceType(UnitCompiler.java:5184)
> at org.codehaus.janino.UnitCompiler.getType2(UnitCompiler.java:5165)
> at 
> org.codehaus.janino.UnitCompiler.access$12600(UnitCompiler.java:183)
> at 
> org.codehaus.janino.UnitCompiler$16.visitReferenceType(UnitCompiler.java:5096)
> at org.codehaus.janino.Java$ReferenceType.accept(Java.java:2880)
> at org.codehaus.janino.UnitCompiler.getType(UnitCompiler.java:5136)
> at org.codehaus.janino.UnitCompiler.getType2(UnitCompiler.java:5598)
> at 
> org.codehaus.janino.UnitCompiler.access$13300(UnitCompiler.java:183)
> at 
> org.codehaus.janino.UnitCompiler$16.visitCast(UnitCompiler.java:5104)
> {noformat}





[jira] [Commented] (KYLIN-2200) CompileException on UNION ALL query when result only contains one column

2016-11-16 Thread Dayue Gao (JIRA)

[ 
https://issues.apache.org/jira/browse/KYLIN-2200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15669839#comment-15669839
 ] 

Dayue Gao commented on KYLIN-2200:
--

I believe it's not a bug in Calcite. Because OLAPTableScan returns an 
Enumerable of Object[] rows, it's an error to cast Object[] to the custom class 
Record11_1.

> CompileException on UNION ALL query when result only contains one column
> 
>
> Key: KYLIN-2200
> URL: https://issues.apache.org/jira/browse/KYLIN-2200
> Project: Kylin
>  Issue Type: Bug
>  Components: Query Engine
>Affects Versions: v1.5.4.1
>Reporter: Dayue Gao
>Assignee: Dayue Gao
>
> {code:sql}
> select count(*) from kylin_sales
> union all
> select count(*) from kylin_sales
> {code}
> got following exception
> {noformat}
> Caused by: org.codehaus.commons.compiler.CompileException: Line 82, Column 
> 32: Cannot determine simple type name "Record11_1"
> at 
> org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:10092)
> at 
> org.codehaus.janino.UnitCompiler.getReferenceType(UnitCompiler.java:5375)
> at 
> org.codehaus.janino.UnitCompiler.getReferenceType(UnitCompiler.java:5184)
> at org.codehaus.janino.UnitCompiler.getType2(UnitCompiler.java:5165)
> at 
> org.codehaus.janino.UnitCompiler.access$12600(UnitCompiler.java:183)
> at 
> org.codehaus.janino.UnitCompiler$16.visitReferenceType(UnitCompiler.java:5096)
> at org.codehaus.janino.Java$ReferenceType.accept(Java.java:2880)
> at org.codehaus.janino.UnitCompiler.getType(UnitCompiler.java:5136)
> at org.codehaus.janino.UnitCompiler.getType2(UnitCompiler.java:5598)
> at 
> org.codehaus.janino.UnitCompiler.access$13300(UnitCompiler.java:183)
> at 
> org.codehaus.janino.UnitCompiler$16.visitCast(UnitCompiler.java:5104)
> {noformat}





[jira] [Commented] (KYLIN-2200) CompileException on UNION ALL query when result only contains one column

2016-11-16 Thread Dayue Gao (JIRA)

[ 
https://issues.apache.org/jira/browse/KYLIN-2200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15669833#comment-15669833
 ] 

Dayue Gao commented on KYLIN-2200:
--

Here is calcite generated code
{code:java}
/*   1 */ public static class Record1_0 implements java.io.Serializable {
/*   2 */   public long f0;
/*   3 */   public Record1_0() {}
/*   4 */   public boolean equals(Object o) {
/*   5 */ if (this == o) {
/*   6 */   return true;
/*   7 */ }
/*   8 */ if (!(o instanceof Record1_0)) {
/*   9 */   return false;
/*  10 */ }
/*  11 */ return this.f0 == ((Record1_0) o).f0;
/*  12 */   }
/*  13 */ 
/*  14 */   public int hashCode() {
/*  15 */ int h = 0;
/*  16 */ h = org.apache.calcite.runtime.Utilities.hash(h, this.f0);
/*  17 */ return h;
/*  18 */   }
/*  19 */ 
/*  20 */   public int compareTo(Record1_0 that) {
/*  21 */ final int c;
/*  22 */ c = org.apache.calcite.runtime.Utilities.compare(this.f0, 
that.f0);
/*  23 */ if (c != 0) {
/*  24 */   return c;
/*  25 */ }
/*  26 */ return 0;
/*  27 */   }
/*  28 */ 
/*  29 */   public String toString() {
/*  30 */ return "{f0=" + this.f0 + "}";
/*  31 */   }
/*  32 */ 
/*  33 */ }
/*  34 */ 
/*  35 */ org.apache.calcite.DataContext root;
/*  36 */ 
/*  37 */ public org.apache.calcite.linq4j.Enumerable bind(final 
org.apache.calcite.DataContext root0) {
/*  38 */   root = root0;
/*  39 */   final org.apache.calcite.linq4j.Enumerable _inputEnumerable = 
((org.apache.kylin.query.schema.OLAPTable) 
root.getRootSchema().getSubSchema("DEFAULT").getTable("KYLIN_SALES")).executeOLAPQuery(root,
 0);
/*  40 */   final org.apache.calcite.linq4j.AbstractEnumerable child = new 
org.apache.calcite.linq4j.AbstractEnumerable(){
/*  41 */ public org.apache.calcite.linq4j.Enumerator enumerator() {
/*  42 */   return new org.apache.calcite.linq4j.Enumerator(){
/*  43 */   public final org.apache.calcite.linq4j.Enumerator 
inputEnumerator = _inputEnumerable.enumerator();
/*  44 */   public void reset() {
/*  45 */ inputEnumerator.reset();
/*  46 */   }
/*  47 */ 
/*  48 */   public boolean moveNext() {
/*  49 */ return inputEnumerator.moveNext();
/*  50 */   }
/*  51 */ 
/*  52 */   public void close() {
/*  53 */ inputEnumerator.close();
/*  54 */   }
/*  55 */ 
/*  56 */   public Object current() {
/*  57 */ return 
org.apache.calcite.runtime.SqlFunctions.toLong(((Object[]) 
inputEnumerator.current())[8]);
/*  58 */   }
/*  59 */ 
/*  60 */ };
/*  61 */ }
/*  62 */ 
/*  63 */   };
/*  64 */   final org.apache.calcite.linq4j.Enumerable _inputEnumerable0 = 
((org.apache.kylin.query.schema.OLAPTable) 
root.getRootSchema().getSubSchema("DEFAULT").getTable("KYLIN_SALES")).executeOLAPQuery(root,
 1);
/*  65 */   final org.apache.calcite.linq4j.AbstractEnumerable child1 = new 
org.apache.calcite.linq4j.AbstractEnumerable(){
/*  66 */ public org.apache.calcite.linq4j.Enumerator enumerator() {
/*  67 */   return new org.apache.calcite.linq4j.Enumerator(){
/*  68 */   public final org.apache.calcite.linq4j.Enumerator 
inputEnumerator = _inputEnumerable0.enumerator();
/*  69 */   public void reset() {
/*  70 */ inputEnumerator.reset();
/*  71 */   }
/*  72 */ 
/*  73 */   public boolean moveNext() {
/*  74 */ return inputEnumerator.moveNext();
/*  75 */   }
/*  76 */ 
/*  77 */   public void close() {
/*  78 */ inputEnumerator.close();
/*  79 */   }
/*  80 */ 
/*  81 */   public Object current() {
/*  82 */ return ((Record11_1) inputEnumerator.current()).COUNT__;
/*  83 */   }
/*  84 */ 
/*  85 */ };
/*  86 */ }
/*  87 */ 
/*  88 */   };
/*  89 */   return 
org.apache.calcite.linq4j.Linq4j.singletonEnumerable(child.aggregate(new 
org.apache.calcite.linq4j.function.Function0() {
/*  90 */   public Object apply() {
/*  91 */ long $SUM0a0s0;
/*  92 */ $SUM0a0s0 = 0;
/*  93 */ Record1_0 record0;
/*  94 */ record0 = new Record1_0();
/*  95 */ record0.f0 = $SUM0a0s0;
/*  96 */ return record0;
/*  97 */   }
/*  98 */ }
/*  99 */ .apply(), new org.apache.calcite.linq4j.function.Function2() {
/* 100 */   public Record1_0 apply(Record1_0 acc, long in) {
/* 101 */ acc.f0 = acc.f0 + in;
/* 102 */ return acc;
/* 103 */   }
/* 104 */   public Record1_0 apply(Record1_0 acc, Long in) {
/* 105 */ return apply(
/* 106 */   acc,
/* 107 */   in.longValue());
/* 108 */   }
/* 109 */   public Record1_0 apply(Object acc, Object in) {
/* 110 */ return apply(
/* 111 */   (Record1_0) acc,
/* 112 */   (Long) in);
/* 113 */   }
/* 114 */ }
/* 115 */   
{code}

[jira] [Created] (KYLIN-2200) CompileException on UNION ALL query when result only contains one column

2016-11-16 Thread Dayue Gao (JIRA)
Dayue Gao created KYLIN-2200:


 Summary: CompileException on UNION ALL query when result only 
contains one column
 Key: KYLIN-2200
 URL: https://issues.apache.org/jira/browse/KYLIN-2200
 Project: Kylin
  Issue Type: Bug
  Components: Query Engine
Affects Versions: v1.5.4.1
Reporter: Dayue Gao
Assignee: Dayue Gao


{code:sql}
select count(*) from kylin_sales
union all
select count(*) from kylin_sales
{code}

got following exception
{noformat}
Caused by: org.codehaus.commons.compiler.CompileException: Line 82, Column 32: 
Cannot determine simple type name "Record11_1"
at 
org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:10092)
at 
org.codehaus.janino.UnitCompiler.getReferenceType(UnitCompiler.java:5375)
at 
org.codehaus.janino.UnitCompiler.getReferenceType(UnitCompiler.java:5184)
at org.codehaus.janino.UnitCompiler.getType2(UnitCompiler.java:5165)
at org.codehaus.janino.UnitCompiler.access$12600(UnitCompiler.java:183)
at 
org.codehaus.janino.UnitCompiler$16.visitReferenceType(UnitCompiler.java:5096)
at org.codehaus.janino.Java$ReferenceType.accept(Java.java:2880)
at org.codehaus.janino.UnitCompiler.getType(UnitCompiler.java:5136)
at org.codehaus.janino.UnitCompiler.getType2(UnitCompiler.java:5598)
at org.codehaus.janino.UnitCompiler.access$13300(UnitCompiler.java:183)
at org.codehaus.janino.UnitCompiler$16.visitCast(UnitCompiler.java:5104)
{noformat}





[jira] [Commented] (KYLIN-2173) push down limit leads to wrong answer when filter is loosened

2016-11-19 Thread Dayue Gao (JIRA)

[ 
https://issues.apache.org/jira/browse/KYLIN-2173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15678992#comment-15678992
 ] 

Dayue Gao commented on KYLIN-2173:
--

{quote}
how about the test result?
{quote}
The problem should be fixed. To avoid regressions, let me add a few test cases 
to the ITs.

{quote}
Besides I viewed the commit, you removed the check for empty segment, was that 
related with this JIRA? The skip for empty segment was for another JIRA, are 
you sure it is safe to remove that?
{quote}
Are you talking about KYLIN-1967? It had been fixed before my commit. I just 
removed several duplicated log statements (since skipZeroInputSegment always 
returns false). Please correct me if I'm wrong.

{quote}
BTW, when I run "mvn clean package -DskipTests", checkstyle reported there is 
unused import (see below). Please check and ensure you have checkstyle plugin 
installed in IDEA:
{quote}
What a stupid mistake! Sorry for that.

> push down limit leads to wrong answer when filter is loosened
> -
>
> Key: KYLIN-2173
> URL: https://issues.apache.org/jira/browse/KYLIN-2173
> Project: Kylin
>  Issue Type: Bug
>  Components: Storage - HBase
>Affects Versions: v1.5.4.1
>Reporter: Dayue Gao
>Assignee: Dayue Gao
> Fix For: v1.6.0
>
>
> To reproduce:
> {noformat}
> select
>  test_kylin_fact.cal_dt
>  ,sum(test_kylin_fact.price) as GMV
>  FROM test_kylin_fact 
>  left JOIN edw.test_cal_dt as test_cal_dt 
>  ON test_kylin_fact.cal_dt = test_cal_dt.cal_dt 
>  where test_cal_dt.week_beg_dt in ('2012-01-01', '2012-01-20')
>  group by test_kylin_fact.cal_dt 
>  limit 12
> {noformat}
> Kylin returns 5 rows, expect 12 rows.
> Root cause: filter condition may be loosened when we translate derived filter 
> in DerivedFilterTranslator. If we push down limit, query server won't get 
> enough valid records from storage. In the above example, 24 rows returned 
> from storage, only 5 are valid.
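
The root-cause description above can be illustrated with a toy simulation 
(hypothetical data and filters, not Kylin code): storage evaluates a loosened 
filter and applies the pushed-down limit, so the query server cannot recover 
the rows the exact filter would have kept.

{code:java}
import java.util.List;
import java.util.function.IntPredicate;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class LoosenedLimitDemo {
    // Storage-side scan: applies a loosened filter, then the pushed-down limit.
    static List<Integer> scanWithPushedLimit(IntPredicate loosened, int limit) {
        return IntStream.range(0, 100)
                .filter(loosened)
                .limit(limit)
                .boxed()
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        IntPredicate exact = v -> v % 10 == 0;    // the real filter
        IntPredicate loosened = v -> v % 5 == 0;  // loosened by derived-filter translation

        // Limit 12 pushed down to storage: storage returns 12 rows passing the
        // loosened filter, but after the query server re-applies the exact
        // filter, fewer than 12 valid rows remain.
        List<Integer> fromStorage = scanWithPushedLimit(loosened, 12);
        long valid = fromStorage.stream().filter(v -> exact.test(v)).count();
        System.out.println("rows from storage: " + fromStorage.size()); // 12
        System.out.println("valid after exact filter: " + valid);       // 6
    }
}
{code}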





[jira] [Commented] (KYLIN-2200) CompileException on UNION ALL query when result only contains one column

2016-11-19 Thread Dayue Gao (JIRA)

[ 
https://issues.apache.org/jira/browse/KYLIN-2200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15678953#comment-15678953
 ] 

Dayue Gao commented on KYLIN-2200:
--

Thanks Julian. I'll figure out whether it's a Calcite bug or not, and if it is, 
submit a patch for it.

> CompileException on UNION ALL query when result only contains one column
> 
>
> Key: KYLIN-2200
> URL: https://issues.apache.org/jira/browse/KYLIN-2200
> Project: Kylin
>  Issue Type: Bug
>  Components: Query Engine
>Affects Versions: v1.5.4.1
>Reporter: Dayue Gao
>Assignee: Dayue Gao
> Attachments: KYLIN-2200.patch
>
>
> {code:sql}
> select count(*) from kylin_sales
> union all
> select count(*) from kylin_sales
> {code}
> got following exception
> {noformat}
> Caused by: org.codehaus.commons.compiler.CompileException: Line 82, Column 
> 32: Cannot determine simple type name "Record11_1"
> at 
> org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:10092)
> at 
> org.codehaus.janino.UnitCompiler.getReferenceType(UnitCompiler.java:5375)
> at 
> org.codehaus.janino.UnitCompiler.getReferenceType(UnitCompiler.java:5184)
> at org.codehaus.janino.UnitCompiler.getType2(UnitCompiler.java:5165)
> at 
> org.codehaus.janino.UnitCompiler.access$12600(UnitCompiler.java:183)
> at 
> org.codehaus.janino.UnitCompiler$16.visitReferenceType(UnitCompiler.java:5096)
> at org.codehaus.janino.Java$ReferenceType.accept(Java.java:2880)
> at org.codehaus.janino.UnitCompiler.getType(UnitCompiler.java:5136)
> at org.codehaus.janino.UnitCompiler.getType2(UnitCompiler.java:5598)
> at 
> org.codehaus.janino.UnitCompiler.access$13300(UnitCompiler.java:183)
> at 
> org.codehaus.janino.UnitCompiler$16.visitCast(UnitCompiler.java:5104)
> {noformat}





[jira] [Commented] (KYLIN-2173) push down limit leads to wrong answer when filter is loosened

2016-11-19 Thread Dayue Gao (JIRA)

[ 
https://issues.apache.org/jira/browse/KYLIN-2173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15678999#comment-15678999
 ] 

Dayue Gao commented on KYLIN-2173:
--

BTW, could we build a shared CI infrastructure? Running the ITs in my local 
sandbox is super slow.

> push down limit leads to wrong answer when filter is loosened
> -
>
> Key: KYLIN-2173
> URL: https://issues.apache.org/jira/browse/KYLIN-2173
> Project: Kylin
>  Issue Type: Bug
>  Components: Storage - HBase
>Affects Versions: v1.5.4.1
>Reporter: Dayue Gao
>Assignee: Dayue Gao
> Fix For: v1.6.0
>
>
> To reproduce:
> {noformat}
> select
>  test_kylin_fact.cal_dt
>  ,sum(test_kylin_fact.price) as GMV
>  FROM test_kylin_fact 
>  left JOIN edw.test_cal_dt as test_cal_dt 
>  ON test_kylin_fact.cal_dt = test_cal_dt.cal_dt 
>  where test_cal_dt.week_beg_dt in ('2012-01-01', '2012-01-20')
>  group by test_kylin_fact.cal_dt 
>  limit 12
> {noformat}
> Kylin returns 5 rows, expect 12 rows.
> Root cause: filter condition may be loosened when we translate derived filter 
> in DerivedFilterTranslator. If we push down limit, query server won't get 
> enough valid records from storage. In the above example, 24 rows returned 
> from storage, only 5 are valid.





[jira] [Commented] (KYLIN-2200) CompileException on UNION ALL query when result only contains one column

2016-11-16 Thread Dayue Gao (JIRA)

[ 
https://issues.apache.org/jira/browse/KYLIN-2200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15669874#comment-15669874
 ] 

Dayue Gao commented on KYLIN-2200:
--

The root cause is that the second OLAPTableScan mistakenly returns CUSTOM as 
its JavaRowFormat.

Here's a dump of the row formats for each operator.
{noformat}
OLAPToEnumerableConverter
  EnumerableLimit(format=SCALA)
EnumerableUnion(format=SCALA)
  EnumerableAggregate(format=SCALA)
EnumerableCalc(format=SCALA)
  OLAPTableScan(format=ARRAY)
  EnumerableAggregate(format=SCALA)
EnumerableCalc(format=SCALA)
  OLAPTableScan(format=CUSTOM)
{noformat}

Because EnumerableAggregate returns SCALA, EnumerableUnion changes the Prefer 
from ARRAY to CUSTOM for the second half. OLAPTableScan then honors the 
preference in the following code:

{code:java}
public Result implement(EnumerableRelImplementor implementor, Prefer pref) {
// 
PhysType physType = PhysTypeImpl.of(implementor.getTypeFactory(), 
this.rowType, pref.preferArray());
// 
}
{code}

To fix it, we should use ARRAY for OLAPTableScan regardless of pref.
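
A sketch of the fix (hedged: this mirrors the shape of the implement method 
shown above, not the exact committed patch) is to pass ARRAY instead of the 
caller's preference:

{code:java}
public Result implement(EnumerableRelImplementor implementor, Prefer pref) {
    // Always materialize rows as Object[] so downstream operators (e.g. the
    // second branch of a UNION ALL) never see an unresolvable CUSTOM record type.
    PhysType physType = PhysTypeImpl.of(implementor.getTypeFactory(),
            this.rowType, JavaRowFormat.ARRAY);
    // ...
}
{code}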

> CompileException on UNION ALL query when result only contains one column
> 
>
> Key: KYLIN-2200
> URL: https://issues.apache.org/jira/browse/KYLIN-2200
> Project: Kylin
>  Issue Type: Bug
>  Components: Query Engine
>Affects Versions: v1.5.4.1
>Reporter: Dayue Gao
>Assignee: Dayue Gao
>
> {code:sql}
> select count(*) from kylin_sales
> union all
> select count(*) from kylin_sales
> {code}
> got following exception
> {noformat}
> Caused by: org.codehaus.commons.compiler.CompileException: Line 82, Column 
> 32: Cannot determine simple type name "Record11_1"
> at 
> org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:10092)
> at 
> org.codehaus.janino.UnitCompiler.getReferenceType(UnitCompiler.java:5375)
> at 
> org.codehaus.janino.UnitCompiler.getReferenceType(UnitCompiler.java:5184)
> at org.codehaus.janino.UnitCompiler.getType2(UnitCompiler.java:5165)
> at 
> org.codehaus.janino.UnitCompiler.access$12600(UnitCompiler.java:183)
> at 
> org.codehaus.janino.UnitCompiler$16.visitReferenceType(UnitCompiler.java:5096)
> at org.codehaus.janino.Java$ReferenceType.accept(Java.java:2880)
> at org.codehaus.janino.UnitCompiler.getType(UnitCompiler.java:5136)
> at org.codehaus.janino.UnitCompiler.getType2(UnitCompiler.java:5598)
> at 
> org.codehaus.janino.UnitCompiler.access$13300(UnitCompiler.java:183)
> at 
> org.codehaus.janino.UnitCompiler$16.visitCast(UnitCompiler.java:5104)
> {noformat}





[jira] [Commented] (KYLIN-2085) PrepareStatement return incorrect result in some cases

2016-11-01 Thread Dayue Gao (JIRA)

[ 
https://issues.apache.org/jira/browse/KYLIN-2085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15625281#comment-15625281
 ] 

Dayue Gao commented on KYLIN-2085:
--

Encountered the same problem after upgrading to 1.5.4.1. Found this JIRA when I 
was about to submit a patch for it, so I just committed UTs for this; see 
cd9423fb5b6e88f7451d0684f7d9598b7e06c381.

> PrepareStatement return incorrect result in some cases
> --
>
> Key: KYLIN-2085
> URL: https://issues.apache.org/jira/browse/KYLIN-2085
> Project: Kylin
>  Issue Type: Bug
>  Components: Query Engine
>Reporter: Dong Li
>Assignee: Dong Li
> Fix For: v1.6.0
>
>
> With Kylin sample data,execute following SQL can get result: 
> select count(*) from kylin_sales where lstg_format_name>='ABIN' and 
> lstg_format_name<=''
> Result: 4054
> Send post with prestate:
> POST http://localhost:7070/kylin/api/query/prestate
> {"sql":"select count(*) from kylin_sales where lstg_format_name>=? and 
> lstg_format_name<=?","offset":0,"limit":5,"acceptPartial":true,"project":"learn_kylin","params":[{"className":"java.lang.String",
>  "value":"ABIN"},{"className":"java.lang.String", "value":""}]}
> Result: 0





[jira] [Commented] (KYLIN-2079) add explicit configuration knob for coprocessor timeout

2016-10-30 Thread Dayue Gao (JIRA)

[ 
https://issues.apache.org/jira/browse/KYLIN-2079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15619993#comment-15619993
 ] 

Dayue Gao commented on KYLIN-2079:
--

[~mahongbin], the way Kylin avoids retrying the coprocessor call in these cases 
is by responding successfully with the flag normalComplete set to false, not by 
throwing DoNotRetryException. We just have to respond before hbase.rpc.timeout, 
which is why I made the upper bound of kylin.query.coprocessor.timeout.seconds 
hbase.rpc.timeout x 0.9. I had tried using DoNotRetryException before I 
realized this fact.

> add explicit configuration knob for coprocessor timeout
> ---
>
> Key: KYLIN-2079
> URL: https://issues.apache.org/jira/browse/KYLIN-2079
> Project: Kylin
>  Issue Type: Sub-task
>  Components: Storage - HBase
>Affects Versions: v1.5.4.1
>Reporter: Dayue Gao
>Assignee: Dayue Gao
> Fix For: v1.6.0
>
> Attachments: KYLIN-2079.patch
>
>
> Current self-termination timeout for CubeVisitService is calculated as the 
> product of three parameters:
> * hbase.rpc.timeout
> * hbase.client.retries.number (hardcode to 5)
> * kylin.query.cube.visit.timeout.times
> It has a few problems:
> # due to this timeout being longer than hbase.rpc.timeout, user sees "Error 
> in coprocessor" instead of more descriptive GTScanSelfTerminatedException. 
> moreover, the request (probably a bad query) will be retried 5 times, 
> increasing pressure on regionserver
> # it's not intuitive to set coprocessor timeout by adjusting 
> kylin.query.cube.visit.timeout.times
> I propose the following changes:
> # add a new kylin configuration "kylin.query.coprocessor.timeout.seconds" to 
> explicitly set coprocessor timeout. It defaults to 0, which means no value, 
> use hbase.rpc.timeout x 0.9 instead. When user sets it to a positive number, 
> kylin will use min(hbase.rpc.timeout x 0.9, 
> kylin.query.coprocessor.timeout.seconds) as coprocessor timeout
> # remove "kylin.query.cube.visit.timeout.times". For cube visit timeout 
> (ExpectedSizeIterator), it's really a last resort, in case coprocessor didn't 
> terminate itself. I don't see too much needs for user to control it, set it 
> to coprocessor timeout x 10 should be a large enough.





[jira] [Resolved] (KYLIN-2079) add explicit configuration knob for coprocessor timeout

2016-10-30 Thread Dayue Gao (JIRA)

 [ 
https://issues.apache.org/jira/browse/KYLIN-2079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dayue Gao resolved KYLIN-2079.
--
Resolution: Fixed

Committed to master and v1.6.0-rc1.

The default coprocessor timeout is set to (hbase.rpc.timeout * 0.9) / 1000 
seconds. You can use "kylin.query.coprocessor.timeout.seconds" to set a lower 
value; 0 means the default behavior. The older configuration 
"kylin.query.cube.visit.timeout.times" is removed in favor of the new one.
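
The resulting timeout selection can be sketched as follows (a minimal model of 
the rules described above; the method and names are illustrative, not Kylin's 
actual code):

{code:java}
public class CoprocessorTimeoutDemo {
    /**
     * Effective coprocessor timeout in milliseconds, per the rules above:
     * the cap is hbase.rpc.timeout * 0.9; a positive user setting (in seconds)
     * can only lower it, and 0 means "use the cap".
     */
    static long effectiveTimeoutMs(long hbaseRpcTimeoutMs, int userTimeoutSeconds) {
        long capMs = (long) (hbaseRpcTimeoutMs * 0.9);
        if (userTimeoutSeconds <= 0) {
            return capMs;
        }
        return Math.min(capMs, userTimeoutSeconds * 1000L);
    }

    public static void main(String[] args) {
        // With HBase's default hbase.rpc.timeout of 60000 ms, the cap is 54000 ms.
        System.out.println(effectiveTimeoutMs(60000, 0));   // 54000
        System.out.println(effectiveTimeoutMs(60000, 30));  // 30000
        System.out.println(effectiveTimeoutMs(60000, 120)); // 54000 (capped)
    }
}
{code}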

> add explicit configuration knob for coprocessor timeout
> ---
>
> Key: KYLIN-2079
> URL: https://issues.apache.org/jira/browse/KYLIN-2079
> Project: Kylin
>  Issue Type: Sub-task
>  Components: Storage - HBase
>Affects Versions: v1.5.4.1
>Reporter: Dayue Gao
>Assignee: Dayue Gao
> Fix For: v1.6.0
>
> Attachments: KYLIN-2079.patch
>
>
> Current self-termination timeout for CubeVisitService is calculated as the 
> product of three parameters:
> * hbase.rpc.timeout
> * hbase.client.retries.number (hardcode to 5)
> * kylin.query.cube.visit.timeout.times
> It has a few problems:
> # due to this timeout being longer than hbase.rpc.timeout, user sees "Error 
> in coprocessor" instead of more descriptive GTScanSelfTerminatedException. 
> moreover, the request (probably a bad query) will be retried 5 times, 
> increasing pressure on regionserver
> # it's not intuitive to set coprocessor timeout by adjusting 
> kylin.query.cube.visit.timeout.times
> I propose the following changes:
> # add a new kylin configuration "kylin.query.coprocessor.timeout.seconds" to 
> explicitly set coprocessor timeout. It defaults to 0, which means no value, 
> use hbase.rpc.timeout x 0.9 instead. When user sets it to a positive number, 
> kylin will use min(hbase.rpc.timeout x 0.9, 
> kylin.query.coprocessor.timeout.seconds) as coprocessor timeout
> # remove "kylin.query.cube.visit.timeout.times". For cube visit timeout 
> (ExpectedSizeIterator), it's really a last resort, in case coprocessor didn't 
> terminate itself. I don't see too much needs for user to control it, set it 
> to coprocessor timeout x 10 should be a large enough.





[jira] [Created] (KYLIN-2083) more RAM estimation test for MeasureAggregator and GTAggregateScanner

2016-10-11 Thread Dayue Gao (JIRA)
Dayue Gao created KYLIN-2083:


 Summary: more RAM estimation test for MeasureAggregator and 
GTAggregateScanner
 Key: KYLIN-2083
 URL: https://issues.apache.org/jira/browse/KYLIN-2083
 Project: Kylin
  Issue Type: Sub-task
  Components: Tools, Build and Test
Affects Versions: v1.5.4.1
Reporter: Dayue Gao
Assignee: Dayue Gao
 Fix For: v1.6.0


Current RAM estimations of MeasureAggregator and GTAggregateScanner are based 
on test results from AggregationCacheMemSizeTest. I'd like to see if there is 
room for improvement, and if there is, how much.

Points I'm considering are:
# CompressedOops on vs off: when CompressedOops is off on large heap, each 
reference takes 8 bytes. I was wondering how much it will affect size of 
AggregationCache.
# variable length aggregator: does the current estimation works well on var-len 
aggregator like BitmapAggregator
# heap usage count via GC vs Instrumentation: the current approach to obtain 
the actual heap usage of objects seems fine, however, I was wondering if using 
Java instrumentation agent will give us more precise number.
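
Point 1 can be made concrete with back-of-the-envelope arithmetic (the 
per-entry layout numbers below are purely hypothetical, not measured Kylin 
objects): an aggregation-cache entry holding several object references grows by 
4 bytes per reference when CompressedOops is off.

{code:java}
public class OopsEstimateDemo {
    // Rough per-entry size: fixed primitive payload plus one slot per reference.
    static long entryBytes(int primitiveBytes, int refs, boolean compressedOops) {
        int refSize = compressedOops ? 4 : 8; // compressed vs full-width oops
        return primitiveBytes + (long) refs * refSize;
    }

    public static void main(String[] args) {
        long entries = 1_000_000L;
        // Hypothetical entry: 48 bytes of primitives + 5 references.
        long on  = entries * entryBytes(48, 5, true);
        long off = entries * entryBytes(48, 5, false);
        System.out.println("CompressedOops on:  " + on);  // 68000000
        System.out.println("CompressedOops off: " + off); // 88000000
    }
}
{code}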





[jira] [Updated] (KYLIN-2083) more RAM estimation test for MeasureAggregator and GTAggregateScanner

2016-10-11 Thread Dayue Gao (JIRA)

 [ 
https://issues.apache.org/jira/browse/KYLIN-2083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dayue Gao updated KYLIN-2083:
-
Description: 
Current RAM estimations for MeasureAggregator and GTAggregateScanner are based 
on test results from AggregationCacheMemSizeTest. I'd like to see if there is 
room for improvement, and if there is, how much we can improve.

Points I'm interested in are:
# *CompressedOops ON vs. OFF*: when CompressedOops is off on a large heap, each 
reference takes 8 bytes. I was wondering how much it will affect the RAM of the 
AggregationCache.
# *Variable Length Aggregator*: does the current estimation work well on a 
varlen aggregator like BitmapAggregator?
# *Real Heap Usage Count via Instrumentation*: the current approach to 
obtaining the actual heap usage of objects looks fine; however, I was wondering 
if using a Java instrumentation agent would give us a more precise number.

  was:
Current RAM estimations of MeasureAggregator and GTAggregateScanner are based 
on test results from AggregationCacheMemSizeTest. I'd like to see if there is 
room for improvement, and if there is, how much.

Points I'm considering are:
# CompressedOops on vs off: when CompressedOops is off on large heap, each 
reference takes 8 bytes. I was wondering how much it will affect size of 
AggregationCache.
# variable length aggregator: does the current estimation works well on var-len 
aggregator like BitmapAggregator
# heap usage count via GC vs Instrumentation: the current approach to obtain 
the actual heap usage of objects seems fine, however, I was wondering if using 
Java instrumentation agent will give us more precise number.


> more RAM estimation test for MeasureAggregator and GTAggregateScanner
> -
>
> Key: KYLIN-2083
> URL: https://issues.apache.org/jira/browse/KYLIN-2083
> Project: Kylin
>  Issue Type: Sub-task
>  Components: Tools, Build and Test
>Affects Versions: v1.5.4.1
>Reporter: Dayue Gao
>Assignee: Dayue Gao
> Fix For: v1.6.0
>
>
> Current RAM estimations for MeasureAggregator and GTAggregateScanner are 
> based on test results from AggregationCacheMemSizeTest. I'd like to see if 
> there is room for improvement, and if there is, how much we can improve.
> Points I'm interested in are:
> # *CompressedOops ON vs. OFF*: when CompressedOops is off on a large heap, 
> each reference takes 8 bytes. I was wondering how much it will affect the RAM 
> of the AggregationCache.
> # *Variable Length Aggregator*: does the current estimation work well on a 
> varlen aggregator like BitmapAggregator?
> # *Real Heap Usage Count via Instrumentation*: the current approach to 
> obtaining the actual heap usage of objects looks fine; however, I was 
> wondering if using a Java instrumentation agent would give us a more precise 
> number.





[jira] [Commented] (KYLIN-2083) more RAM estimation test for MeasureAggregator and GTAggregateScanner

2016-10-11 Thread Dayue Gao (JIRA)

[ 
https://issues.apache.org/jira/browse/KYLIN-2083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15565404#comment-15565404
 ] 

Dayue Gao commented on KYLIN-2083:
--

Played with it for a while and refactored AggregationCacheMemSizeTest:
* use [jamm|https://github.com/jbellis/jamm] to replace the previous way of 
obtaining an object's actual heap usage
* move estimation tests for individual aggregators to AggregatorMemEstimateTest
* test different setups for the aggregation cache
* test different setups for the bitmap aggregator
* test both +UseCompressedOops and -UseCompressedOops

Below is how to run the test and what I've found.

Group 1: CompressedOops Enabled
--
{noformat}
$ mvn test -Dtest=AggregationCacheMemSizeTest#testEstimateMemSize -pl 
'core-cube' -DargLine='-Xms2g -Xmx2g' -Dscale=10
{noformat}

1) WITHOUT_MEM_HUNGRY: contains three basic aggregators: longSum, doubleSum and 
bigdecimalSum
{noformat}
   Size Estimate(bytes)   Actual(bytes)Estimate(ms)  Actual(ms)
100,000  32,400,000  31,200,080   1   1,174
200,000  64,800,000  62,400,080   1   2,899
300,000  97,200,000  93,600,080   1   5,779
400,000 129,600,000 124,800,080   1   9,338
500,000 162,000,000 156,000,080   1  13,547
600,000 194,400,000 187,200,080   1  19,555
700,000 226,800,000 218,400,080   1  26,240
800,000 259,200,000 249,600,080   1  33,895
900,000 291,600,000 280,800,080   1  42,416
  1,000,000 324,000,000 312,000,080   1  50,853
{noformat}

2) WITH_HLLC: contains three basic aggregators and one HyperLogLog(14) aggregator
{noformat}
   Size Estimate(bytes)   Actual(bytes)Estimate(ms)  Actual(ms)
  5,000  83,840,000  83,840,096   0  51
 10,000 167,680,000 167,680,096   0 148
 15,000 251,520,000 251,520,096   0 303
 20,000 335,360,000 335,360,096   0 486
 25,000 419,200,000 419,200,096   0 717
 30,000 503,040,000 503,040,096   0   1,008
 35,000 586,880,000 586,880,096   0   1,334
 40,000 670,720,000 670,720,096   0   1,711
 45,000 754,560,000 754,560,096   0   2,120
 50,000 838,400,000 838,400,096   0   2,648
{noformat}

3) WITH_LOW_CARD_BITMAP: contains three basic aggregators and one sparse bitmap 
aggregator (1 million bits, but only 100 bits on).
{noformat}
   Size Estimate(bytes)   Actual(bytes)Estimate(ms)  Actual(ms)
 10,000   5,920,000  23,200,080   1 452
 20,000  11,840,000  46,400,080   1   1,330
 30,000  17,760,000  69,600,080   1   2,716
 40,000  23,680,000  92,800,080   1   4,531
 50,000  29,600,000 116,000,080   1   6,973
 60,000  35,520,000 139,200,080   1   9,915
 70,000  41,440,000 162,400,080   1  13,289
 80,000  47,360,000 185,600,080   1  17,037
 90,000  53,280,000 208,800,080   1  21,923
100,000  59,200,000 232,000,080   1  28,140
{noformat}

4) WITH_HIGH_CARD_BITMAP: contains three basic aggregators and one dense bitmap 
aggregator (1 million bits, 99.99% on)
{noformat}
   Size Estimate(bytes)   Actual(bytes)Estimate(ms)  Actual(ms)
  1,000 131,464,000 133,096,080   0  49
  2,000 262,928,000 266,192,080   0 138
  3,000 394,392,000 399,288,080   0 319
  4,000 525,856,000 532,384,080   0 503
  5,000 657,320,000 665,480,080   0 739
  6,000 788,784,000 798,576,080   0   1,101
  7,000 920,248,000 931,672,080   0   1,473
  8,000   1,051,712,000   1,064,768,080   0   1,895
  9,000   1,183,176,000   1,197,864,080   0   2,311
 10,000   1,314,640,000   1,330,960,080   0   2,969
{noformat}

Group 2: CompressedOops Disabled

[jira] [Commented] (KYLIN-2012) more robust approach to hive schema changes

2016-10-13 Thread Dayue Gao (JIRA)

[ 
https://issues.apache.org/jira/browse/KYLIN-2012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15571375#comment-15571375
 ] 

Dayue Gao commented on KYLIN-2012:
--

https://github.com/apache/kylin/commit/36cf99ef77486c1361a31f3e1f748bb277eca217 
refines the rules on lookup tables.

https://github.com/apache/kylin/commit/5974fc0870be17a8801c55c7496093d42dbb7c4f 
renames RealizationStatusEnum.DESCBROKEN to RealizationStatusEnum.BROKEN. It's 
difficult for users to understand what DESCBROKEN means.

> more robust approach to hive schema changes
> ---
>
> Key: KYLIN-2012
> URL: https://issues.apache.org/jira/browse/KYLIN-2012
> Project: Kylin
>  Issue Type: Bug
>  Components: Metadata, REST Service, Web 
>Affects Versions: v1.5.3
>Reporter: Dayue Gao
>Assignee: Dayue Gao
> Fix For: v1.6.0
>
>
> Our users occasionally want to change their existing cube, such as 
> adding/renaming/removing a dimension. Some of these changes require 
> modifications to its source hive table. So our user changed the table schema 
> and reloaded its metadata in Kylin, then several issues can happen depends on 
> what he changed.
> I did some schema changing tests based on 1.5.3, the results after reloading 
> table are listed below
> || type of changes || fact table || lookup table ||
> | *minor* | both query and build still works | query can fail or return wrong 
> answer |
> | *major* | fail to load related cube | fail to load related cube |
> {{minor}} changes refer to those doesn't change columns used in cubes, such 
> as insert/append new column, remove/change unused column.
> {{major}} changes are the opposite, like remove/rename/change type of used 
> column.
> Clearly from the table, reload a changed table is problematic in certain 
> cases. KYLIN-1536 reports a similar problem.
> So what can we do to support this kind of iterative development process (load 
> -> define cube -> build -> reload -> change cube -> rebuild)?
> My first thought is simply detect-and-prohibit reloading used table. User 
> should be able to know which cube is preventing him from reloading, and then 
> he could drop and recreate cube after reloading. However, defining a cube is 
> not an easy task (consider editing 100 measures). Force users to recreate 
> their cube over and over again will certainly not make them happy.
> A better idea is to allow cube to be editable even if it's broken due to some 
> columns changed after reloading. Broken cube can't be built or queried, it 
> can only be edit or dropped. In fact, there is a cube status called 
> {{RealizationStatusEnum.DESCBROKEN}} in code, but was never used. We should 
> take advantage of it.
> An enabled cube shouldn't allow schema changes, otherwise an unintentional 
> reload could make it unavailable. Similarly, a disabled but unpurged cube 
> shouldn't allow schema changes since it still has data in it.





[jira] [Resolved] (KYLIN-2012) more robust approach to hive schema changes

2016-10-13 Thread Dayue Gao (JIRA)

 [ 
https://issues.apache.org/jira/browse/KYLIN-2012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dayue Gao resolved KYLIN-2012.
--
Resolution: Fixed

> more robust approach to hive schema changes
> ---
>
> Key: KYLIN-2012
> URL: https://issues.apache.org/jira/browse/KYLIN-2012
> Project: Kylin
>  Issue Type: Bug
>  Components: Metadata, REST Service, Web 
>Affects Versions: v1.5.3
>Reporter: Dayue Gao
>Assignee: Dayue Gao
> Fix For: v1.6.0
>
>
> Our users occasionally want to change their existing cube, such as 
> adding/renaming/removing a dimension. Some of these changes require 
> modifications to its source hive table. So our user changed the table schema 
> and reloaded its metadata in Kylin, then several issues can happen depends on 
> what he changed.
> I did some schema changing tests based on 1.5.3, the results after reloading 
> table are listed below
> || type of changes || fact table || lookup table ||
> | *minor* | both query and build still works | query can fail or return wrong 
> answer |
> | *major* | fail to load related cube | fail to load related cube |
> {{minor}} changes refer to those doesn't change columns used in cubes, such 
> as insert/append new column, remove/change unused column.
> {{major}} changes are the opposite, like remove/rename/change type of used 
> column.
> Clearly from the table, reload a changed table is problematic in certain 
> cases. KYLIN-1536 reports a similar problem.
> So what can we do to support this kind of iterative development process (load 
> -> define cube -> build -> reload -> change cube -> rebuild)?
> My first thought was to simply detect and prohibit reloading a used table. 
> The user should be able to see which cube is preventing the reload, and 
> could then drop and recreate the cube after reloading. However, defining a 
> cube is not an easy task (consider editing 100 measures). Forcing users to 
> recreate their cubes over and over again will certainly not make them happy.
> A better idea is to keep a cube editable even if it's broken because some 
> columns changed after reloading. A broken cube can't be built or queried; it 
> can only be edited or dropped. In fact, there is a cube status called 
> {{RealizationStatusEnum.DESCBROKEN}} in the code that was never used. We 
> should take advantage of it.
> An enabled cube shouldn't allow schema changes, otherwise an unintentional 
> reload could make it unavailable. Similarly, a disabled but unpurged cube 
> shouldn't allow schema changes since it still has data in it.
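The status rules proposed above can be sketched as a small predicate. This is a hypothetical illustration, not actual Kylin code; the class, method, and the simplified `Status` enum are assumptions made for clarity.

```java
// Hypothetical sketch (not actual Kylin code): the schema-change rules
// proposed in the description, expressed as a small status check.
// Names and signatures are illustrative assumptions.
public class CubeSchemaChangeRules {
    enum Status { READY, DISABLED, DESCBROKEN }

    // Schema changes are allowed only when the cube cannot silently lose data:
    // an enabled (READY) cube must be protected from an unintentional reload,
    // and a disabled-but-unpurged cube still has data in it.
    static boolean canChangeSchema(Status status, boolean hasData) {
        if (status == Status.READY) {
            return false;                // enabled: a reload could break it
        }
        if (status == Status.DISABLED && hasData) {
            return false;                // unpurged: still has data in it
        }
        return true;                     // disabled+purged, or already DESCBROKEN
    }

    public static void main(String[] args) {
        System.out.println(canChangeSchema(Status.READY, true));     // false
        System.out.println(canChangeSchema(Status.DISABLED, true));  // false
        System.out.println(canChangeSchema(Status.DISABLED, false)); // true
    }
}
```

A DESCBROKEN cube passes the check because, as the description says, editing (or dropping) is the only way to repair it.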





[jira] [Commented] (KYLIN-2012) more robust approach to hive schema changes

2016-10-13 Thread Dayue Gao (JIRA)

[ 
https://issues.apache.org/jira/browse/KYLIN-2012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15571341#comment-15571341
 ] 

Dayue Gao commented on KYLIN-2012:
--

I found that even after KYLIN-1985, we can only allow users to append columns 
to a lookup table. The reasons are:
* LookupTable uses ColumnDesc's zero-based index to find key columns in 
SnapshotTable; if users insert/drop a column in the middle of the hive table, 
the indexes of ColumnDesc are no longer aligned with hive.
* If users drop a trailing unused column of a lookup table, queries can fail 
with an ArrayIndexOutOfBoundsException at LookupStringTable#convertRow, 
because the number of columns in SnapshotTable is larger than 
length(LookupStringTable.colIsDateTime).
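The first point can be illustrated with a minimal sketch. This is not Kylin code; the class and column names are hypothetical, standing in for ColumnDesc's stored zero-based index and the hive table's reported column order.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch (not actual Kylin code): why a stored zero-based
// column index breaks when a column is inserted in the middle of a hive
// table, but stays valid when columns are only appended.
public class IndexMisalignmentDemo {
    public static void main(String[] args) {
        // Snapshot taken when the lookup table was (id, name, city);
        // the key column "city" was recorded at zero-based index 2.
        List<String> snapshotColumns = Arrays.asList("id", "name", "city");
        int keyColIndex = snapshotColumns.indexOf("city");  // 2

        // User inserts "region" in the middle; hive now reports:
        List<String> afterInsert = Arrays.asList("id", "region", "name", "city");
        System.out.println(afterInsert.get(keyColIndex));   // "name" -- wrong column

        // Appending instead keeps all existing indexes aligned:
        List<String> afterAppend = Arrays.asList("id", "name", "city", "region");
        System.out.println(afterAppend.get(keyColIndex));   // "city" -- still correct
    }
}
```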






[jira] [Comment Edited] (KYLIN-2012) more robust approach to hive schema changes

2016-10-13 Thread Dayue Gao (JIRA)

[ 
https://issues.apache.org/jira/browse/KYLIN-2012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15571511#comment-15571511
 ] 

Dayue Gao edited comment on KYLIN-2012 at 10/13/16 10:17 AM:
-

oops, didn't think about the migration. Please revert it and keep the old name.


was (Author: gaodayue):
oops, didn't think about the migration. Please revert it.





