Re: Review Request 68261: HIVE-20332
> On Aug. 11, 2018, 7:45 p.m., Ashutosh Chauhan wrote:
> > Isn't incremental rebuild always cheaper for Project-Filter-Join MVs since they are always insert-only? If so, we don't need a cost-based decision there.
> > Also, can you remind me of an example of an MV containing an aggregate where incremental rebuild via MERGE can be costlier?

Jesús Camacho Rodríguez wrote:

bq. Isn't incremental rebuild always cheaper for Project-Filter-Join MVs since they are always insert-only?

Yes, it will always be cheaper.

bq. If so, we don't need a cost-based decision there.

I just thought we preferred to make rewriting decisions cost-based instead of using Hep.

bq. Also, can you remind me of an example of an MV containing an aggregate where incremental rebuild via MERGE can be costlier?

When there are many new rows and the NDV of the grouping columns is high: the GBy does not reduce the number of rows, and the MERGE may end up doing a lot of work with OUTER JOIN + INSERT/UPDATE.

We can use HepPlanner for incremental rebuild (it needs a minor extension in Calcite and it should mostly work). Then, if a rewriting is produced, 1) for Project-Filter-Join MVs we always use it, and 2) for Project-Filter-Join-Aggregate MVs we make use of the heuristic. However, note that we will still need to introduce a parameter to be able to tune the heuristic, right? If that is the case, we may introduce Hep for Project-Filter-Join MVs in a follow-up?

Ashutosh Chauhan wrote:

From the changes in the q.out files, it looks like before this patch the rewriting wasn't triggered even for PFJ cases. Why would that be? In those cases there are two candidate plans: one for full rebuild + overwrite, and another for full build with an additional predicate on writeId + insert into. This second plan should be cheaper because of the additional predicates. Why didn't we pick it before this patch?
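The trade-off described above (incremental rebuild always wins for insert-only Project-Filter-Join MVs, while MERGE can lose for aggregate MVs when the GroupBy barely reduces the new rows) can be sketched as a small decision function. This is an illustrative model only, not Hive's actual code; the function name and the reduction threshold are hypothetical.

```python
def choose_rebuild_strategy(has_aggregate, new_rows, ndv_grouping_cols):
    """Hypothetical sketch of the rebuild decision discussed above.

    Project-Filter-Join MVs are insert-only, so the incremental rebuild
    is always cheaper. For MVs with an Aggregate, the MERGE path (OUTER
    JOIN + INSERT/UPDATE) can be costlier when the NDV of the grouping
    columns is high, i.e. the GroupBy barely reduces the new rows.
    """
    if not has_aggregate:
        return "incremental"  # insert-only: always the cheaper option
    # Hypothetical heuristic: how much does the GroupBy shrink the new data?
    reduction_factor = new_rows / max(ndv_grouping_cols, 1)
    return "incremental" if reduction_factor > 1.5 else "full"
```

With many new rows and a high NDV (reduction factor near 1), the sketch falls back to a full rebuild, matching the example above; any real implementation would need a tunable threshold, which is exactly the parameter the discussion mentions.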
Jesús Camacho Rodríguez wrote:

The incremental rebuild works in two steps: 1) produce the partial rewriting using the MV, and 2) transform the rewriting into INSERT/MERGE depending on whether the MV contains an Aggregate or not. The costing is done over the partial rewriting. That is Union(MV contents, PFJ of new data), and in the case of containing an Aggregate it is Agg(Union(MV contents, PFJA of new data)).

The cost of the union input using the MV is already reduced using heuristics (we favour plans containing materialized views). However, the other input to the union is costed as usual. In both cases (with and without Aggregate), we may end up overestimating the number of rows coming through that input. If we estimate that the Filter condition over ROWID barely reduced the input number of rows, then it is easy to estimate that the Union rewriting will be more expensive, as new operators in the tree (e.g. an additional Project to remove that ROWID column, or a separate Filter operator for ROWID) will add to the total cost because they need to process those rows.
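The costing issue described above can be illustrated with a toy cumulative-cost model (hypothetical row estimates, not Hive's actual metadata providers): every operator contributes the rows it processes, so when the writeId filter is estimated to remove almost nothing, the extra Filter and Project in the union branch each reprocess nearly the whole input.

```python
def cumulative_row_cost(rows_per_operator):
    """Toy cumulative cost: each operator contributes the number of rows
    it is estimated to process (cpu and io ignored)."""
    return sum(rows_per_operator)

# Full rebuild: Project <- Filter <- TableScan (hypothetical estimates).
full_rebuild = cumulative_row_cost([250, 250, 500])

# Union rewriting: the writeId Filter is estimated to keep ~99% of the
# rows, so the extra Filter and the Project that drops the ROWID column
# each reprocess nearly the full input, plus the scan of the MV itself.
union_rewriting = cumulative_row_cost([250, 250, 495, 500, 250, 250])

# The overestimated branch makes the union plan look more expensive,
# so the optimizer sticks with the full rebuild.
assert union_rewriting > full_rebuild
```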
Without this patch, here are the two plans for the simple MV that you mentioned (ignore the cpu cost, as that is only taken into account in case of a draw on the cardinality):

- FULL REBUILD:

HiveProject(key=[$0], value=[$1])
  HiveFilter(subset=[rel#2044:Subset#1.HIVE.[]], condition=[AND(>(CAST($0):DOUBLE, 200), <(CAST($0):DOUBLE, 250))])
    HiveTableScan(subset=[rel#2042:Subset#0.HIVE.[]], table=[[default, src_txn]], table:alias=[src_txn])
Total cost: {751.5 rows, 1253.5 cpu, 0.0 io}

- PARTIAL REWRITING (INC REBUILD):

HiveUnion(all=[true])
  HiveProject(subset=[rel#2071:Subset#6.HIVE.[]], key=[$0], value=[$1])
    HiveFilter(subset=[rel#2069:Subset#5.HIVE.[]], condition=[AND(>(CAST($0):DOUBLE, 200), <(CAST($0):DOUBLE, 250))])
      HiveFilter(subset=[rel#2067:Subset#4.HIVE.[]], condition=[<(1, $4.writeid)])
        HiveTableScan(subset=[rel#2042:Subset#0.HIVE.[]], table=[[default, src_txn]], table:alias=[src_txn])
  HiveProject(subset=[rel#2074:Subset#8.HIVE.[]], key=[$1], value=[$0])
    HiveTableScan(subset=[rel#2072:Subset#7.HIVE.[]], table=[[default, partition_mv_1]], table:alias=[default.partition_mv_1])
Total cost: {876.752276249 rows, 1378.75283625 cpu, 0.0 io}

(Btw, I can enable the FilterMerge rule in the same loop as the MV rewriting, but that will still not change the outcome in many cases, since the Project for ROWID will still add overhead, and it will add to the optimization time.)
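The remark that cpu cost only matters on a draw can be made concrete with a small comparison over the two totals printed above. This is a sketch of the comparison order only, not Calcite's actual RelOptCost implementation:

```python
def cheaper(cost_a, cost_b):
    """Compare (rows, cpu) cost pairs: cardinality decides, and cpu is
    only consulted to break a tie (io ignored here)."""
    if cost_a[0] != cost_b[0]:
        return "a" if cost_a[0] < cost_b[0] else "b"
    return "a" if cost_a[1] <= cost_b[1] else "b"

full_rebuild = (751.5, 1253.5)                      # FULL REBUILD totals
partial_rewriting = (876.752276249, 1378.75283625)  # PARTIAL REWRITING totals
assert cheaper(full_rebuild, partial_rewriting) == "a"  # full rebuild wins
```

Because the partial rewriting has the higher estimated cardinality, the comparison never reaches the cpu tie-break, which is why the full rebuild was being picked before this patch.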
- Jesús

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/68261/#review207113
---
On Aug. 8, 2018, 3:39 p.m., Jesús Camacho Rodríguez wrote:

> (Updated Aug. 8, 2018, 3:39 p.m.)
>
> Review request for hive and Ashutosh Chauhan.
>
> Bugs: HIVE-20332
>     https://issues.apache.org/jira/browse/HIVE-20332
>
> Repository: hive-git
>
> Description
> -----------
>
> HIVE-20332
>
> Diffs
> -----
>
>   common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 5bdcac88d0015d2410da050524e6697a22d83eb9
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/HiveDefaultRelMetadataProvider.java 635d27e723dc1d260574723296f3484c26106a9c
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/HiveMaterializedViewsRelMetadataProvider.java PRE-CREATION
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/FilterSelectivityEstimator.java 43f8508ffbf4ba3cc46016e1d300d6ca9c2e8ccb
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/HiveRelMdCumulativeCost.java PRE-CREATION
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/HiveRelMdDistinctRowCount.java 80b939a9f65142baa149b79460b753ddf469aacf
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/HiveRelMdSelectivity.java 575902d78de2a7f95585c23a3c2fc03b9ce89478
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/HiveRelMdSize.java 97097381d9619e67bcab8a268d571d2a392485b3
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/HiveRelMdUniqueKeys.java 3bf62c535cec1e7a3eac43f0ce40879dbfc89799
>   ql/src/java/org/apache/hadoop/hive/ql/parse/CalcitePlanner.java 361f150193a155d45eb64266f88eb88f0a881ad3
>   ql/src/test/results/clientpositive/llap/materialized_view_partitioned.q.out b12df11a98e55c00c8b77e8292666373f3509364
>   ql/src/test/results/clientpositive/llap/materialized_view_rebuild.q.out 4d37d82b6e1f3d4ab8b76c391fa94176356093c2
>
> Diff: https://reviews.apache.org/r/68261/diff/2/
>
> Thanks,
>
> Jesús Camacho Rodríguez
Re: Review Request 68261: HIVE-20332
> On Aug. 11, 2018, 7:45 p.m., Ashutosh Chauhan wrote:
> > Isn't incremental rebuild always cheaper for Project-Filter-Join MVs since
> > they are always insert only? If so, we don't need a cost-based decision
> > there.
> > Also, can you remind me of an example for an MV containing an aggregate
> > where incremental rebuild via merge can be costlier?

bq. Isn't incremental rebuild always cheaper for Project-Filter-Join MVs since they are always insert only?

Yes, it will always be cheaper.

bq. If so, we don't need a cost-based decision there.

I just thought we preferred to make rewriting decisions cost-based instead of using Hep.

bq. Also, can you remind me of an example for an MV containing an aggregate where incremental rebuild via merge can be costlier?

When there are many new rows and the NDV of the grouping columns is high: the GroupBy does not reduce the number of rows, and MERGE may end up doing a lot of work with the OUTER JOIN + INSERT/UPDATE.

We can use HepPlanner for incremental rebuild (it needs a minor extension in Calcite and it should mostly work). Then, if a rewriting is produced, 1) for Project-Filter-Join MVs we always use it, and 2) for Project-Filter-Join-Aggregate MVs we make use of the heuristic. However, note that we will still need to introduce a parameter to be able to tune the heuristic, right? If that is the case, we may introduce Hep for Project-Filter-Join MVs in a follow-up.

- Jesús

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/68261/#review207113
---
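The trade-off described above (MERGE losing to a full rebuild when the delta has many rows and a high grouping-column NDV) can be sketched with a toy cost model. Everything here is an illustrative assumption for exposition: the function, the cost units, and the `merge_row_weight` knob (standing in for the tunable heuristic parameter mentioned above) are not Hive's actual costing code.

```python
# Toy sketch of the intuition: incremental rebuild of an aggregate MV via
# MERGE is only cheaper when the GROUP BY condenses the delta before the
# OUTER JOIN + INSERT/UPDATE. All names and weights are hypothetical.

def estimate_rebuild_costs(mv_rows, new_rows, group_ndv, merge_row_weight=3.0):
    """Return (full_rebuild_cost, incremental_merge_cost) in abstract units.

    merge_row_weight models the per-group overhead of the MERGE path
    (outer-join probe plus insert/update) -- the kind of parameter the
    heuristic would need to expose for tuning.
    """
    # Full rebuild: reprocess the existing contents plus the new data.
    full_cost = mv_rows + new_rows

    # Incremental: aggregate the delta first; it cannot yield more groups
    # than min(new_rows, group_ndv), then merge those groups into the MV.
    delta_groups = min(new_rows, group_ndv)
    incremental_cost = new_rows + merge_row_weight * delta_groups
    return full_cost, incremental_cost


# Low NDV: the GROUP BY collapses the delta, so MERGE wins.
full, incr = estimate_rebuild_costs(mv_rows=100_000, new_rows=100_000,
                                    group_ndv=50)
assert incr < full

# High NDV: nearly every new row is its own group, so the MERGE pays its
# per-group overhead on the whole delta and loses to a full rebuild.
full, incr = estimate_rebuild_costs(mv_rows=100_000, new_rows=100_000,
                                    group_ndv=1_000_000)
assert incr > full
```

For a Project-Filter-Join MV the `delta_groups` term disappears (the rebuild is a plain INSERT of the delta), which is why the incremental plan is always cheaper there and the cost-based decision only matters for the aggregate case.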
---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/68261/#review207113
---

Isn't incremental rebuild always cheaper for Project-Filter-Join MVs since they are always insert only? If so, we don't need a cost-based decision there.

Also, can you remind me of an example for an MV containing an aggregate where incremental rebuild via merge can be costlier?

- Ashutosh Chauhan
---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/68261/
---

(Updated Aug. 8, 2018, 3:39 p.m.)

Review request for hive and Ashutosh Chauhan.

Bugs: HIVE-20332
    https://issues.apache.org/jira/browse/HIVE-20332

Repository: hive-git

Description
---
HIVE-20332

Diffs (updated)
---
  common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 5bdcac88d0015d2410da050524e6697a22d83eb9
  ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/HiveDefaultRelMetadataProvider.java 635d27e723dc1d260574723296f3484c26106a9c
  ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/HiveMaterializedViewsRelMetadataProvider.java PRE-CREATION
  ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/FilterSelectivityEstimator.java 43f8508ffbf4ba3cc46016e1d300d6ca9c2e8ccb
  ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/HiveRelMdCumulativeCost.java PRE-CREATION
  ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/HiveRelMdDistinctRowCount.java 80b939a9f65142baa149b79460b753ddf469aacf
  ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/HiveRelMdSelectivity.java 575902d78de2a7f95585c23a3c2fc03b9ce89478
  ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/HiveRelMdSize.java 97097381d9619e67bcab8a268d571d2a392485b3
  ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/HiveRelMdUniqueKeys.java 3bf62c535cec1e7a3eac43f0ce40879dbfc89799
  ql/src/java/org/apache/hadoop/hive/ql/parse/CalcitePlanner.java 361f150193a155d45eb64266f88eb88f0a881ad3
  ql/src/test/results/clientpositive/llap/materialized_view_partitioned.q.out b12df11a98e55c00c8b77e8292666373f3509364
  ql/src/test/results/clientpositive/llap/materialized_view_rebuild.q.out 4d37d82b6e1f3d4ab8b76c391fa94176356093c2

Diff: https://reviews.apache.org/r/68261/diff/2/

Changes: https://reviews.apache.org/r/68261/diff/1-2/

Testing
---

Thanks,

Jesús Camacho Rodríguez