[jira] [Updated] (KYLIN-3122) Partition elimination algorithm seems to be inefficient and have serious issues with handling date/time ranges, can lead to very slow queries and OOM/Java heap dump conditions

2018-05-04 Thread Shaofeng SHI (JIRA)

 [ 
https://issues.apache.org/jira/browse/KYLIN-3122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shaofeng SHI updated KYLIN-3122:

Fix Version/s: v2.4.0

> Partition elimination algorithm seems to be inefficient and have serious 
> issues with handling date/time ranges, can lead to very slow queries and 
> OOM/Java heap dump conditions
> ---
>
> Key: KYLIN-3122
> URL: https://issues.apache.org/jira/browse/KYLIN-3122
> Project: Kylin
>  Issue Type: Bug
>  Components: Query Engine
>Affects Versions: v2.2.0
> Environment: HDP 2.5.6, Kylin 2.2.0
>Reporter: Vsevolod Ostapenko
>Assignee: hongbin ma
>Priority: Critical
> Fix For: v2.4.0
>
> Attachments: partition_elimination_bug_single_column_test.log
>
>
> The current cube segment elimination algorithm seems to be rather inefficient.
>  We are using a model where cubes are partitioned by date and time:
>  "partition_desc": {
>    "partition_date_column": "A_VL_HOURLY_V.THEDATE",
>    "partition_time_column": "A_VL_HOURLY_V.THEHOUR",
>    "partition_date_start": 0,
>    "partition_date_format": "MMdd",
>    "partition_time_format": "HH",
>    "partition_type": "APPEND",
>    "partition_condition_builder": "org.apache.kylin.metadata.model.PartitionDesc$DefaultPartitionConditionBuilder"
>  },
> Cubes contain partitions for multiple days and 24 hours for each day. Each
> cube segment corresponds to just one hour.
> When a query specifies both the date and the hour with equality conditions
> (e.g. thedate = '20171011' and thehour = '10'), Kylin sequentially iterates
> over all the cube segments (hundreds of them) only to skip every segment
> except the one that actually needs to be scanned (which can be observed in
> the logs).
>  The expectation is that Kylin would use the existing information on the
> partitioning columns (date and time) and the known hierarchical relation
> between date and time to locate the required partition much more efficiently
> than a linear scan through all the cube partitions.
> Now, if the filtering condition is a range of hours, the partition pruning
> and scanning behavior becomes illogical, which suggests bugs in the logic.
> If the filtering condition is a specific date and a closed-open range of
> hours (e.g. thedate = '20171011' and thehour >= '10' and thehour < '11'),
> then in addition to sequentially scanning all the cube partitions (as
> described above), Kylin scans the HBase tables for all the hours from the
> specified starting hour to the last hour of the day (e.g. hours 10 through
> 24, instead of just hour 10).
>  As a result the query runs much longer than necessary and may run out of
> memory, causing a JVM heap dump and a Kylin server crash.
> If the filtering condition is a specific date but the hour interval is
> specified as open-closed (e.g. thedate = '20171011' and thehour > '09' and
> thehour <= '10'), Kylin scans the HBase tables for all the later dates and
> hours (e.g. from hour 10 to the most recent hour of the most recent day,
> which can be hundreds of tables and thousands of regions).
>  As a result query execution time increases dramatically and in most cases
> the Kylin server is terminated with an OOM error and a JVM heap dump.
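
For illustration, the pruning behavior expected in the quoted report can be sketched with a small, self-contained Java example. This is not Kylin's actual implementation; the class and method names are hypothetical, and it assumes each hourly segment can be keyed by a combined "yyyyMMddHH" value. Each segment is modelled as a half-open [start, end) interval, and a date/hour filter is translated into one such interval before intersecting it with the segment list. With this construction, both the closed-open range (thehour >= '10' and thehour < '11') and the open-closed range (thehour > '09' and thehour <= '10') select exactly the hour-10 segment, instead of every hour to the end of the day or to the end of the data.

{code:java}
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only, not Kylin code: hourly cube segments modelled as
// half-open [start, end) intervals on a combined "yyyyMMddHH" key, with a
// date/hour filter translated into one such interval before pruning.
public class SegmentPruningSketch {

    // One cube segment covering [start, end).
    static final class Segment {
        final long start; // inclusive, e.g. 2017101110L = 20171011, hour 10
        final long end;   // exclusive, e.g. 2017101111L
        Segment(long start, long end) { this.start = start; this.end = end; }
        @Override public String toString() { return "[" + start + ", " + end + ")"; }
    }

    // Translate "thedate = <date> and thehour <op> <low> and thehour <op> <high>"
    // into a single half-open range. Adding 1 to step past an hour boundary is a
    // simplification, but it keeps the ordering needed for pruning.
    static long[] filterToRange(String date, String lowHour, boolean lowInclusive,
                                String highHour, boolean highInclusive) {
        long lo = Long.parseLong(date + lowHour);
        long hi = Long.parseLong(date + highHour);
        if (!lowInclusive) lo += 1;  // thehour >  '09' -> start at hour 10
        if (highInclusive) hi += 1;  // thehour <= '10' -> exclusive end before hour 11
        return new long[] { lo, hi };
    }

    // Keep only the segments whose [start, end) interval intersects the filter range.
    static List<Segment> prune(List<Segment> segments, long[] range) {
        List<Segment> hits = new ArrayList<>();
        for (Segment s : segments) {
            if (s.start < range[1] && s.end > range[0]) {
                hits.add(s);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<Segment> segments = new ArrayList<>();
        for (int h = 0; h < 24; h++) { // 24 hourly segments for 20171011
            segments.add(new Segment(2017101100L + h, 2017101100L + h + 1));
        }
        // thedate = '20171011' and thehour >= '10' and thehour < '11'  -> only hour 10
        System.out.println(prune(segments, filterToRange("20171011", "10", true, "11", false)));
        // thedate = '20171011' and thehour > '09' and thehour <= '10'  -> also only hour 10
        System.out.println(prune(segments, filterToRange("20171011", "09", false, "10", true)));
    }
}
{code}

Running the sketch prints the single hour-10 segment for both filters, which is the behavior the report argues the partition pruning should exhibit.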



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KYLIN-3122) Partition elimination algorithm seems to be inefficient and have serious issues with handling date/time ranges, can lead to very slow queries and OOM/Java heap dump conditions

2018-01-29 Thread Vsevolod Ostapenko (JIRA)

 [ 
https://issues.apache.org/jira/browse/KYLIN-3122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vsevolod Ostapenko updated KYLIN-3122:
--
Attachment: partition_elimination_bug_single_column_test.log

> Partition elimination algorithm seems to be inefficient and have serious 
> issues with handling date/time ranges, can lead to very slow queries and 
> OOM/Java heap dump conditions
> ---
>
> Key: KYLIN-3122
> URL: https://issues.apache.org/jira/browse/KYLIN-3122
> Project: Kylin
>  Issue Type: Bug
>  Components: Query Engine
>Affects Versions: v2.2.0
> Environment: HDP 2.5.6, Kylin 2.2.0
>Reporter: Vsevolod Ostapenko
>Assignee: hongbin ma
>Priority: Critical
> Attachments: partition_elimination_bug_single_column_test.log
>
>
> The current cube segment elimination algorithm seems to be rather inefficient.
>  We are using a model where cubes are partitioned by date and time:
>  "partition_desc": {
>    "partition_date_column": "A_VL_HOURLY_V.THEDATE",
>    "partition_time_column": "A_VL_HOURLY_V.THEHOUR",
>    "partition_date_start": 0,
>    "partition_date_format": "MMdd",
>    "partition_time_format": "HH",
>    "partition_type": "APPEND",
>    "partition_condition_builder": "org.apache.kylin.metadata.model.PartitionDesc$DefaultPartitionConditionBuilder"
>  },
> Cubes contain partitions for multiple days and 24 hours for each day. Each
> cube segment corresponds to just one hour.
> When a query specifies both the date and the hour with equality conditions
> (e.g. thedate = '20171011' and thehour = '10'), Kylin sequentially iterates
> over all the cube segments (hundreds of them) only to skip every segment
> except the one that actually needs to be scanned (which can be observed in
> the logs).
>  The expectation is that Kylin would use the existing information on the
> partitioning columns (date and time) and the known hierarchical relation
> between date and time to locate the required partition much more efficiently
> than a linear scan through all the cube partitions.
> Now, if the filtering condition is a range of hours, the partition pruning
> and scanning behavior becomes illogical, which suggests bugs in the logic.
> If the filtering condition is a specific date and a closed-open range of
> hours (e.g. thedate = '20171011' and thehour >= '10' and thehour < '11'),
> then in addition to sequentially scanning all the cube partitions (as
> described above), Kylin scans the HBase tables for all the hours from the
> specified starting hour to the last hour of the day (e.g. hours 10 through
> 24, instead of just hour 10).
>  As a result the query runs much longer than necessary and may run out of
> memory, causing a JVM heap dump and a Kylin server crash.
> If the filtering condition is a specific date but the hour interval is
> specified as open-closed (e.g. thedate = '20171011' and thehour > '09' and
> thehour <= '10'), Kylin scans the HBase tables for all the later dates and
> hours (e.g. from hour 10 to the most recent hour of the most recent day,
> which can be hundreds of tables and thousands of regions).
>  As a result query execution time increases dramatically and in most cases
> the Kylin server is terminated with an OOM error and a JVM heap dump.
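
The quoted expectation, namely that a point filter on the partition columns should be resolved without a linear scan of all segments, can be illustrated with a minimal sketch as well. Again the names are hypothetical and this is not Kylin's pruning code: if the segment start keys are kept in sorted order, the segment covering thedate = '20171011' and thehour = '10' can be located with a binary search in O(log n) comparisons instead of iterating over hundreds of segments.

{code:java}
import java.util.Arrays;

// Illustrative sketch only, not Kylin code: with segment start keys kept in
// sorted order, the segment covering an equality filter such as
// thedate = '20171011' and thehour = '10' is found by binary search instead
// of a full iteration over every segment.
public class SegmentLookupSketch {

    // starts[] holds sorted, non-overlapping segment start keys ("yyyyMMddHH");
    // segment i is assumed to cover [starts[i], starts[i + 1]).
    static int findSegment(long[] starts, long key) {
        int idx = Arrays.binarySearch(starts, key);
        if (idx >= 0) {
            return idx;                  // key is exactly a segment start
        }
        int insertionPoint = -idx - 1;   // index of the first start greater than key
        return insertionPoint - 1;       // preceding segment, or -1 if key is before all segments
    }

    public static void main(String[] args) {
        long[] starts = new long[48];    // two days of hourly segments
        for (int h = 0; h < 24; h++) {
            starts[h] = 2017101100L + h;
            starts[24 + h] = 2017101200L + h;
        }
        long key = 2017101110L;          // thedate = '20171011', thehour = '10'
        System.out.println("matching segment index: " + findSegment(starts, key)); // prints 10
    }
}
{code}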



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KYLIN-3122) Partition elimination algorithm seems to be inefficient and have serious issues with handling date/time ranges, can lead to very slow queries and OOM/Java heap dump conditions

2018-01-28 Thread Shaofeng SHI (JIRA)

 [ 
https://issues.apache.org/jira/browse/KYLIN-3122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shaofeng SHI updated KYLIN-3122:

Component/s: (was: Storage - HBase)
 Query Engine

> Partition elimination algorithm seems to be inefficient and have serious 
> issues with handling date/time ranges, can lead to very slow queries and 
> OOM/Java heap dump conditions
> ---
>
> Key: KYLIN-3122
> URL: https://issues.apache.org/jira/browse/KYLIN-3122
> Project: Kylin
>  Issue Type: Bug
>  Components: Query Engine
>Affects Versions: v2.2.0
> Environment: HDP 2.5.6, Kylin 2.2.0
>Reporter: Vsevolod Ostapenko
>Assignee: hongbin ma
>Priority: Critical
>
> The current cube segment elimination algorithm seems to be rather inefficient.
>  We are using a model where cubes are partitioned by date and time:
>  "partition_desc": {
>    "partition_date_column": "A_VL_HOURLY_V.THEDATE",
>    "partition_time_column": "A_VL_HOURLY_V.THEHOUR",
>    "partition_date_start": 0,
>    "partition_date_format": "MMdd",
>    "partition_time_format": "HH",
>    "partition_type": "APPEND",
>    "partition_condition_builder": "org.apache.kylin.metadata.model.PartitionDesc$DefaultPartitionConditionBuilder"
>  },
> Cubes contain partitions for multiple days and 24 hours for each day. Each
> cube segment corresponds to just one hour.
> When a query specifies both the date and the hour with equality conditions
> (e.g. thedate = '20171011' and thehour = '10'), Kylin sequentially iterates
> over all the cube segments (hundreds of them) only to skip every segment
> except the one that actually needs to be scanned (which can be observed in
> the logs).
>  The expectation is that Kylin would use the existing information on the
> partitioning columns (date and time) and the known hierarchical relation
> between date and time to locate the required partition much more efficiently
> than a linear scan through all the cube partitions.
> Now, if the filtering condition is a range of hours, the partition pruning
> and scanning behavior becomes illogical, which suggests bugs in the logic.
> If the filtering condition is a specific date and a closed-open range of
> hours (e.g. thedate = '20171011' and thehour >= '10' and thehour < '11'),
> then in addition to sequentially scanning all the cube partitions (as
> described above), Kylin scans the HBase tables for all the hours from the
> specified starting hour to the last hour of the day (e.g. hours 10 through
> 24, instead of just hour 10).
>  As a result the query runs much longer than necessary and may run out of
> memory, causing a JVM heap dump and a Kylin server crash.
> If the filtering condition is a specific date but the hour interval is
> specified as open-closed (e.g. thedate = '20171011' and thehour > '09' and
> thehour <= '10'), Kylin scans the HBase tables for all the later dates and
> hours (e.g. from hour 10 to the most recent hour of the most recent day,
> which can be hundreds of tables and thousands of regions).
>  As a result query execution time increases dramatically and in most cases
> the Kylin server is terminated with an OOM error and a JVM heap dump.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KYLIN-3122) Partition elimination algorithm seems to be inefficient and have serious issues with handling date/time ranges, can lead to very slow queries and OOM/Java heap dump conditions

2018-01-25 Thread Vsevolod Ostapenko (JIRA)

 [ 
https://issues.apache.org/jira/browse/KYLIN-3122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vsevolod Ostapenko updated KYLIN-3122:
--
Priority: Critical  (was: Major)

> Partition elimination algorithm seems to be inefficient and have serious 
> issues with handling date/time ranges, can lead to very slow queries and 
> OOM/Java heap dump conditions
> ---
>
> Key: KYLIN-3122
> URL: https://issues.apache.org/jira/browse/KYLIN-3122
> Project: Kylin
>  Issue Type: Bug
>  Components: Storage - HBase
>Affects Versions: v2.2.0
> Environment: HDP 2.5.6, Kylin 2.2.0
>Reporter: Vsevolod Ostapenko
>Assignee: hongbin ma
>Priority: Critical
>
> The current cube segment elimination algorithm seems to be rather inefficient.
>  We are using a model where cubes are partitioned by date and time:
>  "partition_desc": {
>    "partition_date_column": "A_VL_HOURLY_V.THEDATE",
>    "partition_time_column": "A_VL_HOURLY_V.THEHOUR",
>    "partition_date_start": 0,
>    "partition_date_format": "MMdd",
>    "partition_time_format": "HH",
>    "partition_type": "APPEND",
>    "partition_condition_builder": "org.apache.kylin.metadata.model.PartitionDesc$DefaultPartitionConditionBuilder"
>  },
> Cubes contain partitions for multiple days and 24 hours for each day. Each
> cube segment corresponds to just one hour.
> When a query specifies both the date and the hour with equality conditions
> (e.g. thedate = '20171011' and thehour = '10'), Kylin sequentially iterates
> over all the cube segments (hundreds of them) only to skip every segment
> except the one that actually needs to be scanned (which can be observed in
> the logs).
>  The expectation is that Kylin would use the existing information on the
> partitioning columns (date and time) and the known hierarchical relation
> between date and time to locate the required partition much more efficiently
> than a linear scan through all the cube partitions.
> Now, if the filtering condition is a range of hours, the partition pruning
> and scanning behavior becomes illogical, which suggests bugs in the logic.
> If the filtering condition is a specific date and a closed-open range of
> hours (e.g. thedate = '20171011' and thehour >= '10' and thehour < '11'),
> then in addition to sequentially scanning all the cube partitions (as
> described above), Kylin scans the HBase tables for all the hours from the
> specified starting hour to the last hour of the day (e.g. hours 10 through
> 24, instead of just hour 10).
>  As a result the query runs much longer than necessary and may run out of
> memory, causing a JVM heap dump and a Kylin server crash.
> If the filtering condition is a specific date but the hour interval is
> specified as open-closed (e.g. thedate = '20171011' and thehour > '09' and
> thehour <= '10'), Kylin scans the HBase tables for all the later dates and
> hours (e.g. from hour 10 to the most recent hour of the most recent day,
> which can be hundreds of tables and thousands of regions).
>  As a result query execution time increases dramatically and in most cases
> the Kylin server is terminated with an OOM error and a JVM heap dump.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KYLIN-3122) Partition elimination algorithm seems to be inefficient and have serious issues with handling date/time ranges, can lead to very slow queries and OOM/Java heap dump conditions

2018-01-25 Thread Vsevolod Ostapenko (JIRA)

 [ 
https://issues.apache.org/jira/browse/KYLIN-3122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vsevolod Ostapenko updated KYLIN-3122:
--
Description: 
The current cube segment elimination algorithm seems to be rather inefficient.
We are using a model where cubes are partitioned by date and time:
"partition_desc": {
 "partition_date_column": "A_VL_HOURLY_V.THEDATE",
 "partition_time_column": "A_VL_HOURLY_V.THEHOUR",
 "partition_date_start": 0,
 "partition_date_format": "MMdd",
 "partition_time_format": "HH",
 "partition_type": "APPEND",
 "partition_condition_builder": 
"org.apache.kylin.metadata.model.PartitionDesc$DefaultPartitionConditionBuilder"
},

Cubes contain partitions for multiple days and 24 hours for each day. Each cube 
segment corresponds to just one hour.

When a query specifies both the date and the hour with equality conditions 
(e.g. thedate = '20171011' and thehour = '10'), Kylin sequentially iterates 
over all the cube segments (hundreds of them) only to skip every segment 
except the one that actually needs to be scanned (which can be observed in 
the logs).
The expectation is that Kylin would use the existing information on the 
partitioning columns (date and time) and the known hierarchical relation 
between date and time to locate the required partition much more efficiently 
than a linear scan through all the cube partitions.

Now, if the filtering condition is a range of hours, the partition pruning and 
scanning behavior becomes illogical, which suggests bugs in the logic.

If the filtering condition is a specific date and a closed-open range of hours 
(e.g. thedate = '20171011' and thehour >= '10' and thehour < '11'), then in 
addition to sequentially scanning all the cube partitions (as described above), 
Kylin scans the HBase tables for all the hours from the specified starting hour 
to the last hour of the day (e.g. hours 10 through 24, instead of just hour 10).
As a result the query runs much longer than necessary and may run out of 
memory, causing a JVM heap dump and a Kylin server crash.

If the filtering condition is a specific date but the hour interval is 
specified as open-closed (e.g. thedate = '20171011' and thehour > '09' and 
thehour <= '10'), Kylin scans the HBase tables for all the later dates and 
hours (e.g. from hour 10 to the most recent hour of the most recent day, which 
can be hundreds of tables and thousands of regions).
As a result query execution time increases dramatically and in most cases the 
Kylin server is terminated with an OOM error and a JVM heap dump.

  was:
Current algorithm of cube segment elimination seems to be rather inefficient.
We are using a model where cubes are partitioned by date and time:
"partition_desc": {
 "partition_date_column": "A_VL_HOURLY_V.THEDATE",
 "partition_time_column": "A_VL_HOURLY_V.THEHOUR",
 "partition_date_start": 0,
 "partition_date_format": "MMdd",
 "partition_time_format": "HH",
 "partition_type": "APPEND",
 "partition_condition_builder": 
"org.apache.kylin.metadata.model.PartitionDesc$DefaultPartitionConditionBuilder"
},

Cubes contain partitions for multiple days and 24 hours for each day. Each cube 
segment corresponds to just one hour.

When a query is issued where both date and hour are specified using equality 
condition (e.g. thedate = '20171011' and thehour = '10') Kylin sequentially 
integrates over all the segment cubes (hundreds of them) only to skip all 
except for the one that needs to be scanned (which can be observed by looking 
in the logs).
The expectation is that Kylin would use existing info on the partitioning 
columns (date and time) and known hierarchical relations between date and time 
to locate required partition much more efficiently that linear scan through all 
the cube partitions.

Now, if filtering condition is on the range of hours, behavior of the partition 
pruning and scanning becomes not very logical, which suggests bugs in the logic.

If filtering condition is on specific date and closed-open range of hours (e.g. 
thedate = '20171011' and thehour >= '10' and thehour < '11'), in addition to 
sequentially scanning all the cube partitions (as described above), Kylin will 
scan HBase tables for all the hours from the specified starting hour and till 
the last hour of the day (e.g. from hour 10 to 24, instead of just hour 10).
As the result query will run much longer that necessary, and might run out of 
memory, causing JVM heap dump and Kylin server crash.


If filtering condition is on specific date by hour interval is specified as 
open-closed (e.g. thedate = '20171011' and thehour > '09' and thehour <= '10'), 
Kylin will scan all HBase tables for all the later dates and hours (e.g. from 
hour 10 and till the most recent hour on the most recent day, which can be 
hundreds of r).
As the result query execution will dramatically increase and in most cases 
Kylin server will be terminated with OOM error and JVM heap dump.

[jira] [Updated] (KYLIN-3122) Partition elimination algorithm seems to be inefficient and have serious issues with handling date/time ranges, can lead to very slow queries and OOM/Java heap dump conditions

2017-12-20 Thread Vsevolod Ostapenko (JIRA)

 [ 
https://issues.apache.org/jira/browse/KYLIN-3122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vsevolod Ostapenko updated KYLIN-3122:
--
Description: 
Current algorithm of cube segment elimination seems to be rather inefficient.
We are using a model where cubes are partitioned by date and time:
"partition_desc": {
 "partition_date_column": "A_VL_HOURLY_V.THEDATE",
 "partition_time_column": "A_VL_HOURLY_V.THEHOUR",
 "partition_date_start": 0,
 "partition_date_format": "MMdd",
 "partition_time_format": "HH",
 "partition_type": "APPEND",
 "partition_condition_builder": 
"org.apache.kylin.metadata.model.PartitionDesc$DefaultPartitionConditionBuilder"
},

Cubes contain partitions for multiple days and 24 hours for each day. Each cube 
segment corresponds to just one hour.

When a query is issued where both date and hour are specified using equality 
condition (e.g. thedate = '20171011' and thehour = '10') Kylin sequentially 
integrates over all the segment cubes (hundreds of them) only to skip all 
except for the one that needs to be scanned (which can be observed by looking 
in the logs).
The expectation is that Kylin would use existing info on the partitioning 
columns (date and time) and known hierarchical relations between date and time 
to locate required partition much more efficiently that linear scan through all 
the cube partitions.

Now, if filtering condition is on the range of hours, behavior of the partition 
pruning and scanning becomes not very logical, which suggests bugs in the logic.

If filtering condition is on specific date and closed-open range of hours (e.g. 
thedate = '20171011' and thehour >= '10' and thehour < '11'), in addition to 
sequentially scanning all the cube partitions (as described above), Kylin will 
scan HBase tables for all the hours from the specified starting hour and till 
the last hour of the day (e.g. from hour 10 to 24, instead of just hour 10).
As the result query will run much longer that necessary, and might run out of 
memory, causing JVM heap dump and Kylin server crash.


If filtering condition is on specific date by hour interval is specified as 
open-closed (e.g. thedate = '20171011' and thehour > '09' and thehour <= '10'), 
Kylin will scan all HBase tables for all the later dates and hours (e.g. from 
hour 10 and till the most recent hour on the most recent day, which can be 
hundreds of r).
As the result query execution will dramatically increase and in most cases 
Kylin server will be terminated with OOM error and JVM heap dump.

  was:
Current algorithm of cube segment elimination seems to be rather inefficient.
We are using a model where cubes are partitioned by date and time:
bq. "partition_desc": {
bq. "partition_date_column": "A_VL_HOURLY_V.THEDATE",
bq. "partition_time_column": "A_VL_HOURLY_V.THEHOUR",
bq. "partition_date_start": 0,
bq. "partition_date_format": "MMdd",
bq. "partition_time_format": "HH",
bq. "partition_type": "APPEND",
bq. "partition_condition_builder": 
"org.apache.kylin.metadata.model.PartitionDesc$DefaultPartitionConditionBuilder"
bq. },

Cubes contain partitions for multiple days and 24 hours for each day. Each cube 
segment corresponds to just one hour.

When a query is issued where both date and hour are specified using equality 
condition (e.g. thedate = '20171011' and thehour = '10') Kylin sequentially 
integrates over all the segment cubes (hundreds of them) only to skip all 
except for the one that needs to be scanned (which can be observed by looking 
in the logs).
The expectation is that Kylin would use existing info on the partitioning 
columns (date and time) and known hierarchical relations between date and time 
to locate required partition much more efficiently that linear scan through all 
the cube partitions.

Now, if filtering condition is on the range of hours, behavior of the partition 
pruning and scanning becomes not very logical, which suggests bugs in the logic.

If filtering condition is on specific date and closed-open range of hours (e.g. 
thedate = '20171011' and thehour >= '10' and thehour < '11'), in addition to 
sequentially scanning all the cube partitions (as described above), Kylin will 
scan HBase tables for all the hours from the specified starting hour and till 
the last hour of the day (e.g. from hour 10 to 24, instead of just hour 10).
As the result query will run much longer that necessary, and might run out of 
memory, causing JVM heap dump and Kylin server crash.


If filtering condition is on specific date by hour interval is specified as 
open-closed (e.g. thedate = '20171011' and thehour > '09' and thehour <= '10'), 
Kylin will scan all HBase tables for all the later dates and hours (e.g. from 
hour 10 and till the most recent hour on the most recent day, which can be 
hundreds of r).
As the result query execution will dramatically increase and in most cases 
Kylin server will be terminated with OOM error and JVM heap dump.

[jira] [Updated] (KYLIN-3122) Partition elimination algorithm seems to be inefficient and have serious issues with handling date/time ranges, can lead to very slow queries and OOM/Java heap dump conditions

2017-12-20 Thread Vsevolod Ostapenko (JIRA)

 [ 
https://issues.apache.org/jira/browse/KYLIN-3122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vsevolod Ostapenko updated KYLIN-3122:
--
Description: 
Current algorithm of cube segment elimination seems to be rather inefficient.
We are using a model where cubes are partitioned by date and time:
bq. "partition_desc": {
bq. "partition_date_column": "A_VL_HOURLY_V.THEDATE",
bq. "partition_time_column": "A_VL_HOURLY_V.THEHOUR",
bq. "partition_date_start": 0,
bq. "partition_date_format": "MMdd",
bq. "partition_time_format": "HH",
bq. "partition_type": "APPEND",
bq. "partition_condition_builder": 
"org.apache.kylin.metadata.model.PartitionDesc$DefaultPartitionConditionBuilder"
bq. },

Cubes contain partitions for multiple days and 24 hours for each day. Each cube 
segment corresponds to just one hour.

When a query is issued where both date and hour are specified using equality 
condition (e.g. thedate = '20171011' and thehour = '10') Kylin sequentially 
integrates over all the segment cubes (hundreds of them) only to skip all 
except for the one that needs to be scanned (which can be observed by looking 
in the logs).
The expectation is that Kylin would use existing info on the partitioning 
columns (date and time) and known hierarchical relations between date and time 
to locate required partition much more efficiently that linear scan through all 
the cube partitions.

Now, if filtering condition is on the range of hours, behavior of the partition 
pruning and scanning becomes not very logical, which suggests bugs in the logic.

If filtering condition is on specific date and closed-open range of hours (e.g. 
thedate = '20171011' and thehour >= '10' and thehour < '11'), in addition to 
sequentially scanning all the cube partitions (as described above), Kylin will 
scan HBase tables for all the hours from the specified starting hour and till 
the last hour of the day (e.g. from hour 10 to 24, instead of just hour 10).
As the result query will run much longer that necessary, and might run out of 
memory, causing JVM heap dump and Kylin server crash.


If filtering condition is on specific date by hour interval is specified as 
open-closed (e.g. thedate = '20171011' and thehour > '09' and thehour <= '10'), 
Kylin will scan all HBase tables for all the later dates and hours (e.g. from 
hour 10 and till the most recent hour on the most recent day, which can be 
hundreds of r).
As the result query execution will dramatically increase and in most cases 
Kylin server will be terminated with OOM error and JVM heap dump.

  was:
Current algorithm of cube segment elimination seems to be rather inefficient.
We are using a model where cubes are partitioned by date and time:
{{"partition_desc": {
"partition_date_column": "A_VL_HOURLY_V.THEDATE",
"partition_time_column": "A_VL_HOURLY_V.THEHOUR",
"partition_date_start": 0,
"partition_date_format": "MMdd",
"partition_time_format": "HH",
"partition_type": "APPEND",
"partition_condition_builder": 
"org.apache.kylin.metadata.model.PartitionDesc$DefaultPartitionConditionBuilder"
  },}}

Cubes contain partitions for multiple days and 24 hours for each day. Each cube 
segment corresponds to just one hour.

When a query is issued where both date and hour are specified using equality 
condition (e.g. thedate = '20171011' and thehour = '10') Kylin sequentially 
integrates over all the segment cubes (hundreds of them) only to skip all 
except for the one that needs to be scanned (which can be observed by looking 
in the logs).
The expectation is that Kylin would use existing info on the partitioning 
columns (date and time) and known hierarchical relations between date and time 
to locate required partition much more efficiently that linear scan through all 
the cube partitions.

Now, if filtering condition is on the range of hours, behavior of the partition 
pruning and scanning becomes not very logical, which suggests bugs in the logic.

If filtering condition is on specific date and closed-open range of hours (e.g. 
thedate = '20171011' and thehour >= '10' and thehour < '11'), in addition to 
sequentially scanning all the cube partitions (as described above), Kylin will 
scan HBase tables for all the hours from the specified starting hour and till 
the last hour of the day (e.g. from hour 10 to 24, instead of just hour 10).
As the result query will run much longer that necessary, and might run out of 
memory, causing JVM heap dump and Kylin server crash.


If filtering condition is on specific date by hour interval is specified as 
open-closed (e.g. thedate = '20171011' and thehour > '09' and thehour <= '10'), 
Kylin will scan all HBase tables for all the later dates and hours (e.g. from 
hour 10 and till the most recent hour on the most recent day, which can be 
hundreds of r).
As the result query execution will dramatically increase and in most cases 
Kylin server will be terminated with OOM error and JVM heap dump.

[jira] [Updated] (KYLIN-3122) Partition elimination algorithm seems to be inefficient and have serious issues with handling date/time ranges, can lead to very slow queries and OOM/Java heap dump conditions

2017-12-20 Thread Vsevolod Ostapenko (JIRA)

 [ 
https://issues.apache.org/jira/browse/KYLIN-3122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vsevolod Ostapenko updated KYLIN-3122:
--
Description: 
Current algorithm of cube segment elimination seems to be rather inefficient.
We are using a model where cubes are partitioned by date and time:
{{"partition_desc": {
"partition_date_column": "A_VL_HOURLY_V.THEDATE",
"partition_time_column": "A_VL_HOURLY_V.THEHOUR",
"partition_date_start": 0,
"partition_date_format": "MMdd",
"partition_time_format": "HH",
"partition_type": "APPEND",
"partition_condition_builder": 
"org.apache.kylin.metadata.model.PartitionDesc$DefaultPartitionConditionBuilder"
  },}}

Cubes contain partitions for multiple days and 24 hours for each day. Each cube 
segment corresponds to just one hour.

When a query is issued where both date and hour are specified using equality 
condition (e.g. thedate = '20171011' and thehour = '10') Kylin sequentially 
integrates over all the segment cubes (hundreds of them) only to skip all 
except for the one that needs to be scanned (which can be observed by looking 
in the logs).
The expectation is that Kylin would use existing info on the partitioning 
columns (date and time) and known hierarchical relations between date and time 
to locate required partition much more efficiently that linear scan through all 
the cube partitions.

Now, if filtering condition is on the range of hours, behavior of the partition 
pruning and scanning becomes not very logical, which suggests bugs in the logic.

If filtering condition is on specific date and closed-open range of hours (e.g. 
thedate = '20171011' and thehour >= '10' and thehour < '11'), in addition to 
sequentially scanning all the cube partitions (as described above), Kylin will 
scan HBase tables for all the hours from the specified starting hour and till 
the last hour of the day (e.g. from hour 10 to 24, instead of just hour 10).
As the result query will run much longer that necessary, and might run out of 
memory, causing JVM heap dump and Kylin server crash.


If filtering condition is on specific date by hour interval is specified as 
open-closed (e.g. thedate = '20171011' and thehour > '09' and thehour <= '10'), 
Kylin will scan all HBase tables for all the later dates and hours (e.g. from 
hour 10 and till the most recent hour on the most recent day, which can be 
hundreds of r).
As the result query execution will dramatically increase and in most cases 
Kylin server will be terminated with OOM error and JVM heap dump.

  was:
Current algorithm of cube segment elimination seems to be rather inefficient.
We are using a model where cubes are partitioned by date and time:
"partition_desc": {
"partition_date_column": "A_VL_HOURLY_V.THEDATE",
"partition_time_column": "A_VL_HOURLY_V.THEHOUR",
"partition_date_start": 0,
"partition_date_format": "MMdd",
"partition_time_format": "HH",
"partition_type": "APPEND",
"partition_condition_builder": 
"org.apache.kylin.metadata.model.PartitionDesc$DefaultPartitionConditionBuilder"
  },

Cubes contain partitions for multiple days and 24 hours for each day. Each cube 
segment corresponds to just one hour.

When a query is issued where both date and hour are specified using equality 
condition (e.g. thedate = '20171011' and thehour = '10') Kylin sequentially 
integrates over all the segment cubes (hundreds of them) only to skip all 
except for the one that needs to be scanned (which can be observed by looking 
in the logs).
The expectation is that Kylin would use existing info on the partitioning 
columns (date and time) and known hierarchical relations between date and time 
to locate required partition much more efficiently that linear scan through all 
the cube partitions.

Now, if filtering condition is on the range of hours, behavior of the partition 
pruning and scanning becomes not very logical, which suggests bugs in the logic.

If filtering condition is on specific date and closed-open range of hours (e.g. 
thedate = '20171011' and thehour >= '10' and thehour < '11'), in addition to 
sequentially scanning all the cube partitions (as described above), Kylin will 
scan HBase tables for all the hours from the specified starting hour and till 
the last hour of the day (e.g. from hour 10 to 24, instead of just hour 10).
As the result query will run much longer that necessary, and might run out of 
memory, causing JVM heap dump and Kylin server crash.


If filtering condition is on specific date by hour interval is specified as 
open-closed (e.g. thedate = '20171011' and thehour > '09' and thehour <= '10'), 
Kylin will scan all HBase tables for all the later dates and hours (e.g. from 
hour 10 and till the most recent hour on the most recent day, which can be 
hundreds of r).
As the result query execution will dramatically increase and in most cases 
Kylin server will be terminated with OOM error and JVM heap dump.

[jira] [Updated] (KYLIN-3122) Partition elimination algorithm seems to be inefficient and have serious issues with handling date/time ranges, can lead to very slow queries and OOM/Java heap dump conditions

2017-12-20 Thread Vsevolod Ostapenko (JIRA)

 [ 
https://issues.apache.org/jira/browse/KYLIN-3122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vsevolod Ostapenko updated KYLIN-3122:
--
Description: 
Current algorithm of cube segment elimination seems to be rather inefficient.
We are using a model where cubes are partitioned by date and time:
"partition_desc": {
"partition_date_column": "A_VL_HOURLY_V.THEDATE",
"partition_time_column": "A_VL_HOURLY_V.THEHOUR",
"partition_date_start": 0,
"partition_date_format": "MMdd",
"partition_time_format": "HH",
"partition_type": "APPEND",
"partition_condition_builder": 
"org.apache.kylin.metadata.model.PartitionDesc$DefaultPartitionConditionBuilder"
  },

Cubes contain partitions for multiple days and 24 hours for each day. Each cube 
segment corresponds to just one hour.

When a query is issued where both date and hour are specified using equality 
condition (e.g. thedate = '20171011' and thehour = '10') Kylin sequentially 
integrates over all the segment cubes (hundreds of them) only to skip all 
except for the one that needs to be scanned (which can be observed by looking 
in the logs).
The expectation is that Kylin would use existing info on the partitioning 
columns (date and time) and known hierarchical relations between date and time 
to locate required partition much more efficiently that linear scan through all 
the cube partitions.

Now, if filtering condition is on the range of hours, behavior of the partition 
pruning and scanning becomes not very logical, which suggests bugs in the logic.

If filtering condition is on specific date and closed-open range of hours (e.g. 
thedate = '20171011' and thehour >= '10' and thehour < '11'), in addition to 
sequentially scanning all the cube partitions (as described above), Kylin will 
scan HBase tables for all the hours from the specified starting hour and till 
the last hour of the day (e.g. from hour 10 to 24, instead of just hour 10).
As the result query will run much longer that necessary, and might run out of 
memory, causing JVM heap dump and Kylin server crash.


If filtering condition is on specific date by hour interval is specified as 
open-closed (e.g. thedate = '20171011' and thehour > '09' and thehour <= '10'), 
Kylin will scan all HBase tables for all the later dates and hours (e.g. from 
hour 10 and till the most recent hour on the most recent day, which can be 
hundreds of r).
As the result query execution will dramatically increase and in most cases 
Kylin server will be terminated with OOM error and JVM heap dump.

  was:
Current algorithm of cube segment elimination seems to be rather inefficient.
We are using a model where cubes are partitioned by date and time:
"partition_desc": {
"partition_date_column": "A_VL_HOURLY_V.THEDATE",
"partition_time_column": "A_VL_HOURLY_V.THEHOUR",
"partition_date_start": 0,
"partition_date_format": "MMdd",
"partition_time_format": "HH",
"partition_type": "APPEND",
"partition_condition_builder": 
"org.apache.kylin.metadata.model.PartitionDesc$DefaultPartitionConditionBuilder"
  },

Cubes contain partitions for multiple days and 24 hours for each day. Each cube 
segment corresponds to just one hour.

When a query is issued where both date and hour are specified using equality 
condition (e.g. thedate = '20171011' and thehour = '00') Kylin sequentially 
integrates over all the segment cubes (hundreds of them) only to skip all 
except for the one that needs to be scanned (which can be observed by looking 
in the logs).
The expectation is that Kylin would use existing info on the partitioning 
columns (date and time) and known hierarchical relations between date and time 
to locate required partition much more efficiently that linear scan through all 
the cube partitions.

Now, if filtering condition is on the range of hours, behavior of the partition 
pruning and scanning becomes not very logical, which suggest bugs in the logic.

If condition is on specific date and closed-open range of hours (e.g. thedate = 
'20171011' and thehour >= '10' and thehour < '11'), in addition to sequentially 
scanning all the cube partitions (as described above), Kylin will scan HBase 
regions for all the hours from the starting hour and till the last hour of the 
day (e.g. from hour 10 to 24).
As the result query will run much longer that necessary, and might run out of 
memory, causing JVM heap dump and Kylin server crash.


If condition is on specific date by hour interval is specified as open-closed 
(e.g. thedate = '20171011' and thehour > '10' and thehour <= '11'), Kylin will 
scan all HBase regions for all the later dates and hours (e.g. from hour 10 and 
till the most recent hour on the most recent day).
As the result query execution will dramatically increase and in most cases 
Kylin server will be terminated with OOM error and JVM heap dump.


> Partition elimination algorithm seems to be inefficient and have serious 
> issues with handling date/time ranges, can lead to very slow queries and 
> OOM/Java heap dump conditions

[jira] [Updated] (KYLIN-3122) Partition elimination algorithm seems to be inefficient and have serious issues with handling date/time ranges, can lead to very slow queries and OOM/Java heap dump conditions

2017-12-20 Thread Vsevolod Ostapenko (JIRA)

 [ 
https://issues.apache.org/jira/browse/KYLIN-3122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vsevolod Ostapenko updated KYLIN-3122:
--
Description: 
Current algorithm of cube segment elimination seems to be rather inefficient.
We are using a model where cubes are partitioned by date and time:
"partition_desc": {
"partition_date_column": "A_VL_HOURLY_V.THEDATE",
"partition_time_column": "A_VL_HOURLY_V.THEHOUR",
"partition_date_start": 0,
"partition_date_format": "MMdd",
"partition_time_format": "HH",
"partition_type": "APPEND",
"partition_condition_builder": 
"org.apache.kylin.metadata.model.PartitionDesc$DefaultPartitionConditionBuilder"
  },

Cubes contain partitions for multiple days and 24 hours for each day. Each cube 
segment corresponds to just one hour.

When a query is issued where both date and hour are specified using equality 
condition (e.g. thedate = '20171011' and thehour = '00') Kylin sequentially 
integrates over all the segment cubes (hundreds of them) only to skip all 
except for the one that needs to be scanned (which can be observed by looking 
in the logs).
The expectation is that Kylin would use existing info on the partitioning 
columns (date and time) and known hierarchical relations between date and time 
to locate required partition much more efficiently that linear scan through all 
the cube partitions.

Now, if filtering condition is on the range of hours, behavior of the partition 
pruning and scanning becomes not very logical, which suggest bugs in the logic.

If condition is on specific date and closed-open range of hours (e.g. thedate = 
'20171011' and thehour >= '10' and thehour < '11'), in addition to sequentially 
scanning all the cube partitions (as described above), Kylin will scan HBase 
regions for all the hours from the starting hour and till the last hour of the 
day (e.g. from hour 10 to 24).
As the result query will run much longer that necessary, and might run out of 
memory, causing JVM heap dump and Kylin server crash.


If condition is on specific date by hour interval is specified as open-closed 
(e.g. thedate = '20171011' and thehour > '10' and thehour <= '11'), Kylin will 
scan all HBase regions for all the later dates and hours (e.g. from hour 10 and 
till the most recent hour on the most recent day).
As the result query execution will dramatically increase and in most cases 
Kylin server will be terminated with OOM error and JVM heap dump.

  was:
Current algorithm of cube segment elimination seems to be rather inefficient.
We are using a model where cubes are partitioned by date and time:
"partition_desc": {
"partition_date_column": "A_VL_HOURLY_V.THEDATE",
"partition_time_column": "A_VL_HOURLY_V.THEHOUR",
"partition_date_start": 0,
"partition_date_format": "MMdd",
"partition_time_format": "HH",
"partition_type": "APPEND",
"partition_condition_builder": 
"org.apache.kylin.metadata.model.PartitionDesc$DefaultPartitionConditionBuilder"
  },

Cubes contain partitions for multiple days and 24 hours for each day. Each cube 
segment corresponds to just one hour.

When a query is issued where both date and hour are specified using equality 
condition (e.g. thedate = '20171011' and thehour = '00') Kylin sequentially 
integrates over all the segment cubes (hundreds of them) only to skip all 
except for the one that needs to be scanned (which can be observed by looking 
in the logs).
The expectation is that Kylin would use existing info on the partitioning 
columns (date and time) and known hierarchical relations between date and time 
to locate required partition much more efficiently that linear scan through all 
the cube partitions.

Now, if filtering condition is on the range of hours, behavior of the partition 
pruning and scanning becomes not very logical, which suggest bugs in the logic.

If condition is on specific date and closed-open range of hours (e.g. thedate = 
'20171011' and thehour >= '10' and thehour < '11'), in addition to sequentially 
scanning all the cube partitions (as described above), Kylin will scan HBase 
regions for all the hours from the starting hour and till the last hour of the 
day (e.g. from hour 10 to 24).
As the result query will run much longer that necessary, and might run out of 
memory.


If condition is on specific date by hour interval is specified as open-closed 
(e.g. thedate = '20171011' and thehour > '10' and thehour <= '11'), Kylin will 
scan all HBase regions for all the later dates and hours (e.g. from hour 10 and 
till the most recent hour on the most recent day).
As the result query execution will dramatically increase and in most cases 
Kylin server will be terminated with OOM error and JVM heap dump.


> Partition elimination algorithm seems to be inefficient and have serious 
> issues with handling date/time ranges, can lead to very slow queries and 
> OOM/Java heap dump conditions