[jira] [Commented] (FLINK-15004) Choose two-phase Aggregate if the statistics is unknown
[ https://issues.apache.org/jira/browse/FLINK-15004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17333839#comment-17333839 ]

Flink Jira Bot commented on FLINK-15004:

This issue was marked "stale-assigned" and has not received an update in 7 days. It is now automatically unassigned. If you are still working on it, you can assign it to yourself again. Please also give an update about the status of the work.

> Choose two-phase Aggregate if the statistics is unknown
> ---
>
> Key: FLINK-15004
> URL: https://issues.apache.org/jira/browse/FLINK-15004
> Project: Flink
> Issue Type: Improvement
> Components: Table SQL / Planner
> Affects Versions: 1.9.1, 1.10.0
> Reporter: godfrey he
> Assignee: godfrey he
> Priority: Major
> Labels: pull-request-available, stale-assigned
> Time Spent: 10m
> Remaining Estimate: 0h
>
> Currently, the blink planner uses the default rowCount value (defined in
> {{FlinkPreparingTableBase#DEFAULT_ROWCOUNT}}) when the statistics are
> unknown, and may choose one-phase Aggregate. The job will hang if the data
> is skewed, so it is better to use two-phase Aggregate for execution stability
> when the statistics are unknown.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
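To illustrate why the issue description ties one-phase Aggregate to skew, here is a minimal sketch that counts how many records the busiest final-aggregation task receives under each strategy. It is a toy model and not Flink code: the class `SkewSketch`, its method names, and the simplification that each key is handled by its own final task are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;

/** Toy model (hypothetical, not Flink code): load on the busiest final-aggregation task. */
final class SkewSketch {

    /** One-phase: every input record is shuffled by key to the final task for that key. */
    static int maxRecordsOnePhase(String[][] partitions) {
        Map<String, Integer> received = new HashMap<>();
        for (String[] partition : partitions) {
            for (String key : partition) {
                received.merge(key, 1, Integer::sum);
            }
        }
        return busiest(received);
    }

    /** Two-phase: each partition pre-aggregates locally and sends one partial result per key. */
    static int maxRecordsTwoPhase(String[][] partitions) {
        Map<String, Integer> received = new HashMap<>();
        for (String[] partition : partitions) {
            for (String key : new HashSet<>(Arrays.asList(partition))) {
                received.merge(key, 1, Integer::sum);
            }
        }
        return busiest(received);
    }

    /** Largest number of records received for any single key (0 if there is no input). */
    private static int busiest(Map<String, Integer> received) {
        return received.values().stream().mapToInt(Integer::intValue).max().orElse(0);
    }

    public static void main(String[] args) {
        // Three upstream partitions, all dominated by the same "hot" key.
        String[][] partitions = {
            {"hot", "hot", "hot", "hot", "a"},
            {"hot", "hot", "hot", "hot", "b"},
            {"hot", "hot", "hot", "hot", "c"},
        };
        System.out.println("one-phase busiest task: " + maxRecordsOnePhase(partitions)); // 12
        System.out.println("two-phase busiest task: " + maxRecordsTwoPhase(partitions)); // 3
    }
}
```

In this model the hot key's task receives every raw record under one-phase aggregation, but at most one partial result per upstream partition under two-phase aggregation, which is why the latter is the safer default when the planner cannot trust its statistics.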
[jira] [Commented] (FLINK-15004) Choose two-phase Aggregate if the statistics is unknown
[ https://issues.apache.org/jira/browse/FLINK-15004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17323147#comment-17323147 ]

Flink Jira Bot commented on FLINK-15004:

This issue is assigned but has not received an update in 7 days so it has been labeled "stale-assigned". If you are still working on the issue, please give an update and remove the label. If you are no longer working on the issue, please unassign so someone else may work on it. In 7 days the issue will be automatically unassigned.
[jira] [Commented] (FLINK-15004) Choose two-phase Aggregate if the statistics is unknown
[ https://issues.apache.org/jira/browse/FLINK-15004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16987434#comment-16987434 ]

godfrey he commented on FLINK-15004:

yes, you are right

> Choose two-phase Aggregate if the statistics is unknown
> ---
>
> Key: FLINK-15004
> URL: https://issues.apache.org/jira/browse/FLINK-15004
> Project: Flink
> Issue Type: Sub-task
> Components: Table SQL / Planner
> Affects Versions: 1.9.0, 1.9.1
> Reporter: godfrey he
> Assignee: godfrey he
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.10.0
>
> Time Spent: 10m
> Remaining Estimate: 0h
[jira] [Commented] (FLINK-15004) Choose two-phase Aggregate if the statistics is unknown
[ https://issues.apache.org/jira/browse/FLINK-15004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16986768#comment-16986768 ]

Kurt Young commented on FLINK-15004:

So eventually we need both ndv and row count to determine the aggregation ratio, right? That sounds reasonable to me.
[jira] [Commented] (FLINK-15004) Choose two-phase Aggregate if the statistics is unknown
[ https://issues.apache.org/jira/browse/FLINK-15004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16986742#comment-16986742 ]

godfrey he commented on FLINK-15004:

[~ykt836], yes, `ndv` was already considered: if `ndv` is unknown, the planner chooses two-phase aggregate. The `DistinctRowCount` metadata handler can return null, which means unknown, whereas the `RowCount` metadata handler always returns a primitive type, so the planner cannot tell whether the inputs have a real row count or are just using the default value. This issue mainly addresses the following scenario: `ndv` is known, while the row count is unknown.
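The decision discussed in this comment can be sketched as follows. This is only an illustration under stated assumptions, not the planner's actual code: the class `AggPhaseSketch`, the method `choose`, the threshold value, and the ratio rule are all hypothetical. The one point taken from the thread is that an unknown statistic (modelled here as `null`) should lead to two-phase aggregate, and that both `ndv` and row count are needed to compute an aggregation ratio.

```java
/** Hypothetical sketch of the phase decision; names and threshold are assumptions. */
final class AggPhaseSketch {

    enum Phase { ONE_PHASE, TWO_PHASE }

    // Assumed cut-off: below this ndv/rowCount ratio, local pre-aggregation
    // is considered worthwhile. The value is made up for illustration.
    static final double RATIO_THRESHOLD = 0.5;

    /**
     * @param ndv      number of distinct group keys, or null if unknown
     * @param rowCount input row count, or null if unknown (the case this issue
     *                 targets, where only a default value would be available)
     */
    static Phase choose(Double ndv, Double rowCount) {
        // Any unknown statistic: prefer two-phase for execution stability.
        if (ndv == null || rowCount == null || rowCount <= 0) {
            return Phase.TWO_PHASE;
        }
        // Both known: a small ratio means many rows collapse into few groups.
        double ratio = ndv / rowCount;
        return ratio < RATIO_THRESHOLD ? Phase.TWO_PHASE : Phase.ONE_PHASE;
    }

    public static void main(String[] args) {
        System.out.println(choose(100.0, null));            // TWO_PHASE: row count unknown
        System.out.println(choose(null, 1_000_000.0));      // TWO_PHASE: ndv unknown
        System.out.println(choose(100.0, 1_000_000.0));     // TWO_PHASE: ratio 0.0001
        System.out.println(choose(900_000.0, 1_000_000.0)); // ONE_PHASE: ratio 0.9
    }
}
```

Using a nullable `Double` for the row count mirrors the point made above: a primitive return value cannot distinguish a real row count from a default, so the caller needs some way to signal "unknown".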
[jira] [Commented] (FLINK-15004) Choose two-phase Aggregate if the statistics is unknown
[ https://issues.apache.org/jira/browse/FLINK-15004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16986719#comment-16986719 ]

Kurt Young commented on FLINK-15004:

Is row count sufficient for us to decide whether we want to have one or two phase aggregation? I think the key's ndv will be much more important here.