Julian Hyde created CALCITE-6188:
------------------------------------

             Summary: Multi-query optimization
                 Key: CALCITE-6188
                 URL: https://issues.apache.org/jira/browse/CALCITE-6188
             Project: Calcite
          Issue Type: Improvement
            Reporter: Julian Hyde


Devise extensions to SQL so that queries with multiple input tables, multiple 
intermediate tables, and multiple output data sets (result sets and DML 
operations) can be defined in one SQL statement, optimized, and executed 
atomically.

There are many flavors of multi-query optimization, depending on whether each 
occurrence of "multiple" in the previous paragraph is replaced with 0, 1 or 
"several". Our goal, here, is to allow them all to be expressed. We can then 
devise planning strategies that work for particular flavors.

Examples of multi-queries:
 * {*}Multiple DML outputs{*}. An INSERT statement that writes into a table but 
also updates an index,
 * {*}Multiple DML outputs, complex intermediate tables{*}. A DAG that 
represents an ETL/ELT job;
 * {*}Multiple query outputs{*}. A query that produces several data sets (say a 
list of invoices for
orders and a list of products that need to be restocked);
 * {*}DAG query{*}. A query that uses intermediate results more than once.

See discussion in the [Multi-query optimization email 
thread|https://lists.apache.org/thread/mcdqwrtpx0os54t2nn9vtk17spkp5o5k].

Here are some SQL examples.

We add a new keyword {{MULTI}} that represents a statement whose output 
contains multiple data sets and DML operations, each with a unique name. For 
intermediate results, we use the existing {{WITH}} clause.

h3. 1. Multi DML

Read from one or more tables, write to one or more tables.
An example is inserting into a table and also an index on that table 
(represented as a sorted table).
{code:sql}
WITH
  empDelta AS (
      SELECT * FROM emp2
      EXCEPT
      SELECT * FROM emp)
MULTI
  insertEmp AS (
    INSERT INTO emp
    TABLE empDelta),
  insertEmpDeptno AS (
    MERGE empDeptno AS e
    USING TABLE empDelta AS d
    ON e.deptno = d.deptno
    WHEN NOT MATCHED THEN INSERT VALUES (deptno));
{code}
h3. 2. Query that creates temporary table and uses it more than once
{code:sql}
WITH
  temp AS (
    SELECT *
    FROM emp AS e
    JOIN dept USING (deptno)
    WHERE e.job = 'MANAGER'
    OR d.location = 'CHICAGO')
SELECT deptno,
  (SELECT AVG(sal)
    FROM temp AS t
    WHERE t.deptno = e.deptno) AS deptAvgSal,
  (SELECT AVG(sal)
    FROM temp AS t
    WHERE t.job = e.job) AS jobAvgSal
FROM e
WHERE e.deptno IN (10, 20);
{code}

h3. 3. Query that should create temporary table

This query produces the same result as the previous query. There is a common 
relational expression, so the optimizer should consider a DAG plan.
{code:sql}
SELECT deptno,
  (SELECT AVG(e2.sal)
    FROM emp AS e2
    JOIN dept AS d USING (deptno)
    WHERE (e2.job = 'MANAGER'
      OR d.location = 'CHICAGO')
    AND e2.deptno = e.deptno) AS deptAvgSal,
  (SELECT AVG(e3.sal)
    FROM emp AS e3
    JOIN dept AS d USING (deptno)
    WHERE (e3.job = 'MANAGER'
      OR d.location = 'CHICAGO')
    AND e3.job = e.job) AS jobAvgSal
FROM e
WHERE e.deptno IN (10, 20);
{code}
h3. 4. Query that produces several data sets
{code:sql}
WITH
  newOrders AS (
    SELECT *
    FROM orders
    WHERE orderDate > DATE '2023-01-25')
MULTI
  invoices AS (
    SELECT customerName, SUM(amount)
    FROM newOrders
    GROUP BY customerName),
  restock AS (
    SELECT productId
    FROM inventory
    WHERE productId IN (
      SELECT productId FROM newOrders)
    AND itemsOnHand < 10);
{code}
h3. 5. Query with a complex DAG, multiple output data sets and one DML
{code:sql}
WITH
  t0 AS (
    SELECT * FROM t WHERE x > 5),
  t00 AS (
    SELECT * FROM t0 WHERE y < 10),
  t000 AS (
    SELECT x, MIN(y) FROM t00 GROUP BY x),
  t1 AS (
    SELECT * FROM t WHERE y > 3),
  t10 AS (
    SELECT * FROM t1 WHERE x < 8),
  t2 AS (
    SELECT * FROM t000
    INTERSECT
    SELECT * FROM t10),
  t3 AS (
    SELECT * FROM u
    INTERSECT
    SELECT * FROM t00)
MULTI (
  q0 AS (SELECT * FROM t2),
  q1 AS (SELECT * FROM t3),
  d0 AS (DELETE FROM v WHERE v.x IN (SELECT x FROM t1))
{code}
The data flow is the following DAG:
{noformat}
u --------------------+
                      +--------------> t3 (q0)
    +--> t0 --> t00 --+
    |                 +--> t000 --+
t --+                             +--> t2 (q1)
    |         +--> t10 -----------+
    +--> t1 --+
              +----------------------> d0
{noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to