[jira] [Updated] (FLINK-5394) the estimateRowCount method of DataSetCalc didn't work in TableAPI

2017-03-14 Thread jingzhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-5394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jingzhang updated FLINK-5394:
-
Description: 
The estimateRowCount method of DataSetCalc currently has no effect. 
If I run the following code,

{code}
Table table = tableEnv
  .fromDataSet(data, "a, b, c")
  .where("a == 1")
  .groupBy("a")
  .select("a, a.avg, b.sum, c.count");
{code}

the cost of every node in the optimized node tree is:

{code}
DataSetAggregate(groupBy=[a], select=[a, AVG(a) AS TMP_0, SUM(b) AS TMP_1, 
COUNT(c) AS TMP_2]): rowcount = 1000.0, cumulative cost = {3000.0 rows, 5000.0 
cpu, 28000.0 io}
  DataSetCalc(select=[a, b, c], where=[=(a, 1)]): rowcount = 1000.0, cumulative 
cost = {2000.0 rows, 2000.0 cpu, 0.0 io}
  DataSetScan(table=[[_DataSetTable_0]]): rowcount = 1000.0, cumulative 
cost = {1000.0 rows, 1000.0 cpu, 0.0 io}
{code}

We expect the input rowcount of DataSetAggregate to be less than 1000; however, 
the actual input rowcount is still 1000 because the estimateRowCount method of 
DataSetCalc is not used. 

There are two reasons for this:
1. No custom metadataProvider is provided yet, so when DataSetAggregate calls 
RelMetadataQuery.getRowCount(DataSetCalc) to estimate its input rowcount, the 
call is dispatched to RelMdRowCount.
2. DataSetCalc is a subclass of SingleRel, so that call matches 
getRowCount(SingleRel rel, RelMetadataQuery mq), which never uses 
DataSetCalc.estimateRowCount.
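
The effect of these two points can be illustrated with a small toy model. 
This is only a sketch: the classes Node, Scan, SingleInputNode and FilterCalc 
are hypothetical stand-ins for RelNode, DataSetScan, SingleRel and DataSetCalc, 
the selectivity of 0.5 is an arbitrary assumed value, and the instanceof chain 
only imitates the handler matching; it is not Calcite or Flink code.

```java
// Toy model only: none of these types are Calcite or Flink classes.
final class RowCountDispatchDemo {

    /** Stand-in for a relational node with its own row-count estimate. */
    abstract static class Node {
        abstract double estimateRowCount();
    }

    /** Stand-in for a scan with a fixed row count. */
    static final class Scan extends Node {
        final double rows;
        Scan(double rows) { this.rows = rows; }
        @Override double estimateRowCount() { return rows; }
    }

    /** Stand-in for SingleRel: a node with exactly one input. */
    abstract static class SingleInputNode extends Node {
        final Node input;
        SingleInputNode(Node input) { this.input = input; }
    }

    /** Stand-in for DataSetCalc: a filter with an assumed selectivity. */
    static final class FilterCalc extends SingleInputNode {
        final double selectivity;
        FilterCalc(Node input, double selectivity) {
            super(input);
            this.selectivity = selectivity;
        }
        @Override double estimateRowCount() {
            return input.estimateRowCount() * selectivity;
        }
    }

    /** Imitates a default handler that matches the single-input base class first. */
    static double getRowCount(Node node) {
        if (node instanceof SingleInputNode) {
            // Base-class handler: forwards the input's row count unchanged,
            // so the node's own estimate is never consulted.
            return getRowCount(((SingleInputNode) node).input);
        }
        // Catch-all handler: the only path that uses the node's own estimate.
        return node.estimateRowCount();
    }

    public static void main(String[] args) {
        Node calc = new FilterCalc(new Scan(1000.0), 0.5);
        System.out.println(getRowCount(calc));       // prints 1000.0
        System.out.println(calc.estimateRowCount()); // prints 500.0
    }
}
```

In this model the filter node reports 1000 rows through the handler even 
though its own estimate is 500, which mirrors the rowcount shown for 
DataSetCalc in the node tree above.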

The same problem affects all Flink RelNodes that are subclasses of 
SingleRel.

I plan to resolve this problem by adding a FlinkRelMdRowCount that provides 
specific getRowCount handlers for Flink RelNodes.
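
The idea behind such a handler can be sketched with a toy registry. This is 
not the planned FlinkRelMdRowCount and does not use Calcite's metadata API: 
RowCountQuery, Handler, Node, Scan, SingleInputNode and FilterCalc are 
hypothetical names, the lookup by walking up the class hierarchy is an 
assumption made for illustration, and the selectivity of 0.5 is arbitrary.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only: none of these types are Calcite or Flink classes.
final class CustomRowCountHandlerSketch {

    abstract static class Node {
        abstract double estimateRowCount(RowCountQuery query);
    }

    static final class Scan extends Node {
        final double rows;
        Scan(double rows) { this.rows = rows; }
        @Override double estimateRowCount(RowCountQuery query) { return rows; }
    }

    abstract static class SingleInputNode extends Node {
        final Node input;
        SingleInputNode(Node input) { this.input = input; }
    }

    static final class FilterCalc extends SingleInputNode {
        final double selectivity;
        FilterCalc(Node input, double selectivity) {
            super(input);
            this.selectivity = selectivity;
        }
        @Override double estimateRowCount(RowCountQuery query) {
            return query.getRowCount(input) * selectivity;
        }
    }

    /** A row-count handler for one node class. */
    interface Handler {
        double rowCount(Node node, RowCountQuery query);
    }

    /** Picks the handler registered for the most specific matching class. */
    static final class RowCountQuery {
        private final Map<Class<?>, Handler> handlers = new HashMap<>();

        RowCountQuery register(Class<? extends Node> type, Handler handler) {
            handlers.put(type, handler);
            return this;
        }

        double getRowCount(Node node) {
            // Walk from the concrete class up to Node; the first match wins.
            for (Class<?> c = node.getClass(); c != null; c = c.getSuperclass()) {
                Handler handler = handlers.get(c);
                if (handler != null) {
                    return handler.rowCount(node, this);
                }
            }
            throw new IllegalStateException("no handler for " + node.getClass());
        }
    }

    /** Default handlers: single-input nodes forward their input's row count. */
    static RowCountQuery defaultQuery() {
        return new RowCountQuery()
            .register(Node.class, (n, q) -> n.estimateRowCount(q))
            .register(SingleInputNode.class,
                (n, q) -> q.getRowCount(((SingleInputNode) n).input));
    }

    /** Adds a node-specific handler that delegates to the node's own estimate. */
    static RowCountQuery customQuery() {
        return defaultQuery()
            .register(FilterCalc.class, (n, q) -> n.estimateRowCount(q));
    }

    public static void main(String[] args) {
        Node calc = new FilterCalc(new Scan(1000.0), 0.5);
        System.out.println(defaultQuery().getRowCount(calc)); // prints 1000.0
        System.out.println(customQuery().getRowCount(calc));  // prints 500.0
    }
}
```

Once a handler is registered for the concrete node class, it is found before 
the base-class handler, so the node's own estimate is used.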

  was:
The estimateRowCount method of DataSetCalc didn't work now. 
If I run the following code,

{code}
Table table = tableEnv
  .fromDataSet(data, "a, b, c")
  .groupBy("a")
  .select("a, a.avg, b.sum, c.count")
  .where("a == 1");
{code}

the cost of every node in Optimized node tree is :

{code}
DataSetAggregate(groupBy=[a], select=[a, AVG(a) AS TMP_0, SUM(b) AS TMP_1, 
COUNT(c) AS TMP_2]): rowcount = 1000.0, cumulative cost = {3000.0 rows, 5000.0 
cpu, 28000.0 io}
  DataSetCalc(select=[a, b, c], where=[=(a, 1)]): rowcount = 1000.0, cumulative 
cost = {2000.0 rows, 2000.0 cpu, 0.0 io}
  DataSetScan(table=[[_DataSetTable_0]]): rowcount = 1000.0, cumulative 
cost = {1000.0 rows, 1000.0 cpu, 0.0 io}
{code}

We expect the input rowcount of DataSetAggregate less than 1000, however the 
actual input rowcount is still 1000 because the the estimateRowCount method of 
DataSetCalc didn't work. 

There are two reasons caused to this:
1. Didn't provide custom metadataProvider yet. So when DataSetAggregate calls 
RelMetadataQuery.getRowCount(DataSetCalc) to estimate its input rowcount which 
would dispatch to RelMdRowCount.
2. DataSetCalc is subclass of SingleRel. So previous function call would match 
getRowCount(SingleRel rel, RelMetadataQuery mq) which would never use 
DataSetCalc.estimateRowCount.

The question would also appear to all Flink RelNodes which are subclass of 
SingleRel.

I plan to resolve this problem by adding a FlinkRelMdRowCount which contains 
specific getRowCount of Flink RelNodes.


> the estimateRowCount method of DataSetCalc didn't work in TableAPI
> --
>
> Key: FLINK-5394
> URL: https://issues.apache.org/jira/browse/FLINK-5394
> Project: Flink
> Issue Type: Bug
> Components: Table API & SQL
> Reporter: jingzhang
> Assignee: jingzhang
> Fix For: 1.2.0
>
>

[jira] [Updated] (FLINK-5394) the estimateRowCount method of DataSetCalc didn't work in TableAPI

2017-03-14 Thread jingzhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-5394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jingzhang updated FLINK-5394:
-
Summary: the estimateRowCount method of DataSetCalc didn't work in TableAPI 
 (was: the estimateRowCount method of DataSetCalc didn't work)

> the estimateRowCount method of DataSetCalc didn't work in TableAPI
> --
>
> Key: FLINK-5394
> URL: https://issues.apache.org/jira/browse/FLINK-5394
> Project: Flink
> Issue Type: Bug
> Components: Table API & SQL
> Reporter: jingzhang
> Assignee: jingzhang
> Fix For: 1.2.0
>
>



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)