你好,少峰:
kylin cube中,无论是使用 aggregation group还是其他cube优化策略,最终得到的都是一系列组合(如:<day_time,
gender>),而这些组合实际上是与 cuboid 唯一对应的。
在使用sql查询的时候,如果没有对应的 cuobid,那么查询是失败的(排除 extend、derived的维度组合)。
下图是,apache kylin官网对 aggregation
group的解析。按同样的规则,在上封邮件中定义的Cube,应该只会产生10种维度组合,即
<day_time, gender> 576
<day_time, age> 544
<day_time, brand> 528
<day_time, model> 520
<day_time, resolution> 516
<day_time, os_version> 514
<day_time, ntt> 513
<ts_minute> 256
<ts_hour> 128
<day_time> 512
对应的 cuboid 为后面的数字。从cube_statistics任务中看,最后只有
1023,516,576,513,528,514,544,520 这些组合(查询Hbase Meta表也是这种情况)。
在 《Apache Kylin 权威指南》一书中,有介绍在一些极端情况下(如:precisely define the
cuboids/combinations) aggregation group 的使用方法。
所以,我以为目前 kylin 是支持这种定义方法的。
[email protected]
From: ShaoFeng Shi
Date: 2017-04-02 22:17
To: user
Subject: Re: How Kylin Cuboid Scheduler Work With Aggregation Groups ?
Hi Bing,
An aggregation group is a dimension group, or say a sub-cube; it is NOT a
cuboid.
I guess you want to precisely define the cuboids/combinations, that isn't
supported as in many cases user couldn't list all the combinations they use.
But you can describe them with the agg group / mandatory / joint as close as
possible.
2017-03-31 15:49 GMT+08:00 [email protected] <[email protected]>:
Hi,all
I have a Cube, the desc is :
{
"uuid": "bcf11be2-83e4-497e-9e35-a402460a6446",
"last_modified": 1490860973892,
"version": "1.6.0",
"name": "adx_flow_insight",
"model_name": "adx_operator",
"description": "",
"null_string": null,
"dimensions": [
{
"name": "GENDER",
"table": "FLOW_INSIGHT.VIEW_FLOW_INSIGHT",
"column": "GENDER",
"derived": null
},
{
"name": "AGE",
"table": "FLOW_INSIGHT.VIEW_FLOW_INSIGHT",
"column": "AGE",
"derived": null
},
{
"name": "BRAND",
"table": "FLOW_INSIGHT.VIEW_FLOW_INSIGHT",
"column": "BRAND",
"derived": null
},
{
"name": "MODEL",
"table": "FLOW_INSIGHT.VIEW_FLOW_INSIGHT",
"column": "MODEL",
"derived": null
},
{
"name": "RESOLUTION",
"table": "FLOW_INSIGHT.VIEW_FLOW_INSIGHT",
"column": "RESOLUTION",
"derived": null
},
{
"name": "OS_VERSION",
"table": "FLOW_INSIGHT.VIEW_FLOW_INSIGHT",
"column": "OS_VERSION",
"derived": null
},
{
"name": "NTT",
"table": "FLOW_INSIGHT.VIEW_FLOW_INSIGHT",
"column": "NTT",
"derived": null
},
{
"name": "TS_MINUTE",
"table": "FLOW_INSIGHT.VIEW_FLOW_INSIGHT",
"column": "TS_MINUTE",
"derived": null
},
{
"name": "TS_HOUR",
"table": "FLOW_INSIGHT.VIEW_FLOW_INSIGHT",
"column": "TS_HOUR",
"derived": null
},
{
"name": "DAY_TIME",
"table": "FLOW_INSIGHT.VIEW_FLOW_INSIGHT",
"column": "DAY_TIME",
"derived": null
}
],
"measures": [
{
"name": "_COUNT_",
"function": {
"expression": "COUNT",
"parameter": {
"type": "constant",
"value": "1",
"next_parameter": null
},
"returntype": "bigint"
},
"dependent_measure_ref": null
},
{
"name": "REQUEST_PV",
"function": {
"expression": "SUM",
"parameter": {
"type": "column",
"value": "REQUEST",
"next_parameter": null
},
"returntype": "bigint"
},
"dependent_measure_ref": null
},
{
"name": "IMPRESS_PV",
"function": {
"expression": "SUM",
"parameter": {
"type": "column",
"value": "IMPRESS",
"next_parameter": null
},
"returntype": "bigint"
},
"dependent_measure_ref": null
},
{
"name": "CLICK_PV",
"function": {
"expression": "SUM",
"parameter": {
"type": "column",
"value": "CLICK",
"next_parameter": null
},
"returntype": "bigint"
},
"dependent_measure_ref": null
},
{
"name": "FILL_PV",
"function": {
"expression": "SUM",
"parameter": {
"type": "column",
"value": "FILL",
"next_parameter": null
},
"returntype": "bigint"
},
"dependent_measure_ref": null
},
{
"name": "UV_DID",
"function": {
"expression": "COUNT_DISTINCT",
"parameter": {
"type": "column",
"value": "DID",
"next_parameter": null
},
"returntype": "hllc(10)"
},
"dependent_measure_ref": null
}
],
"dictionaries": [],
"rowkey": {
"rowkey_columns": [
{
"column": "DAY_TIME",
"encoding": "date",
"isShardBy": false
},
{
"column": "TS_MINUTE",
"encoding": "integer:4",
"isShardBy": false
},
{
"column": "TS_HOUR",
"encoding": "integer:4",
"isShardBy": false
},
{
"column": "GENDER",
"encoding": "dict",
"isShardBy": false
},
{
"column": "AGE",
"encoding": "dict",
"isShardBy": false
},
{
"column": "BRAND",
"encoding": "dict",
"isShardBy": false
},
{
"column": "MODEL",
"encoding": "dict",
"isShardBy": false
},
{
"column": "RESOLUTION",
"encoding": "dict",
"isShardBy": false
},
{
"column": "OS_VERSION",
"encoding": "dict",
"isShardBy": false
},
{
"column": "NTT",
"encoding": "dict",
"isShardBy": false
}
]
},
"hbase_mapping": {
"column_family": [
{
"name": "F1",
"columns": [
{
"qualifier": "M",
"measure_refs": [
"_COUNT_",
"REQUEST_PV",
"IMPRESS_PV",
"CLICK_PV",
"FILL_PV"
]
}
]
},
{
"name": "F2",
"columns": [
{
"qualifier": "M",
"measure_refs": [
"UV_DID"
]
}
]
}
]
},
"aggregation_groups": [
{
"includes": [
"DAY_TIME",
"GENDER"
],
"select_rule": {
"hierarchy_dims": [],
"mandatory_dims": [
"DAY_TIME"
],
"joint_dims": []
}
},
{
"includes": [
"DAY_TIME",
"AGE"
],
"select_rule": {
"hierarchy_dims": [],
"mandatory_dims": [
"DAY_TIME"
],
"joint_dims": []
}
},
{
"includes": [
"DAY_TIME",
"BRAND"
],
"select_rule": {
"hierarchy_dims": [],
"mandatory_dims": [
"DAY_TIME"
],
"joint_dims": []
}
},
{
"includes": [
"DAY_TIME",
"MODEL"
],
"select_rule": {
"hierarchy_dims": [],
"mandatory_dims": [
"DAY_TIME"
],
"joint_dims": []
}
},
{
"includes": [
"DAY_TIME",
"RESOLUTION"
],
"select_rule": {
"hierarchy_dims": [],
"mandatory_dims": [
"RESOLUTION"
],
"joint_dims": []
}
},
{
"includes": [
"DAY_TIME",
"OS_VERSION"
],
"select_rule": {
"hierarchy_dims": [],
"mandatory_dims": [
"DAY_TIME"
],
"joint_dims": []
}
},
{
"includes": [
"DAY_TIME",
"NTT"
],
"select_rule": {
"hierarchy_dims": [],
"mandatory_dims": [
"DAY_TIME"
],
"joint_dims": []
}
},
{
"includes": [
"TS_MINUTE"
],
"select_rule": {
"hierarchy_dims": [],
"mandatory_dims": [
"TS_MINUTE"
],
"joint_dims": []
}
},
{
"includes": [
"TS_HOUR"
],
"select_rule": {
"hierarchy_dims": [],
"mandatory_dims": [
"TS_HOUR"
],
"joint_dims": []
}
}
],
"signature": "DSSmByHn2sATiETlBdjANQ==",
"notify_list": [],
"status_need_notify": [
"ERROR",
"DISCARDED",
"SUCCEED"
],
"partition_date_start": 1488326400000,
"partition_date_end": 3153600000000,
"auto_merge_time_ranges": [
604800000,
2419200000
],
"retention_range": 0,
"engine_type": 2,
"storage_type": 2,
"override_kylin_properties": {
"kylin.job.mr.config.override.mapreduce.job.queuename": "ad"
}
}
There have 10 dims, and use aggregation groups. I want Cube only contains 10
combs:
<day_time, gender> 576
<day_time, age> 544
<day_time, brand> 528
<day_time, model> 520
<day_time, resolution> 516
<day_time, os_version> 514
<day_time, ntt> 513
<ts_minute> 256
<ts_hour> 128
<day_time> 512
But the Cuboid Scheduler parse as follower:
2017-03-31 15:32:47,735 (main) [INFO -
org..apache.kylin.cube.CubeManager.loadAllCubeInstance(CubeManager.java:908)]
Loaded 4 cubes, fail on 0 cubes
1023
516
576
513
528
514
544
520
2017-03-31 15:32:47,742 (Thread-0) [INFO -
org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.closeMasterService(ConnectionManager.java:2259)]
Closing master protocol: MasterService
Question:
How Aggregation Groups Work? I can not set single dim in aggregation?
Thanks for you suggestion~~~
[email protected]
--
Best regards,
Shaofeng Shi 史少锋