Re: Index usage on Joins

Ilya Kasnacheev Mon, 08 Jun 2020 06:15:20 -0700

Hello!

This sounds like a good application point of *enforceJoinOrder=true*.
Consider:
3: jdbc:ignite:thin://localhost> !connect
jdbc:ignite:thin://localhost?enforceJoinOrder=true
4: jdbc:ignite:thin://localhost>
4: jdbc:ignite:thin://localhost> EXPLAIN SELECT f.Date_key,
. . . . . . . . . . . . . . . .> loc.Location_name,
. . . . . . . . . . . . . . . .> SUM(f.Revenue)
. . . . . . . . . . . . . . . .>
. . . . . . . . . . . . . . . .> FROM DimensionProduct pr,
DimensionLocation loc, FactTableRevenue f
. . . . . . . . . . . . . . . .>
. . . . . . . . . . . . . . . .> WHERE pr._key = f.Product_Key AND loc._key
= f.Location_Key
. . . . . . . . . . . . . . . .> AND f.Date_Key = 20200604
. . . . . . . . . . . . . . . .> AND pr.Product_Name = 'Product 1'
. . . . . . . . . . . . . . . .> AND loc.Location_Name IN ('London',
'Paris')
. . . . . . . . . . . . . . . .>
. . . . . . . . . . . . . . . .> GROUP BY f.Date_Key, loc.Location_name;
PLAN  SELECT
    F__Z2.DATE_KEY AS __C0_0,
    LOC__Z1.LOCATION_NAME AS __C0_1,
    SUM(F__Z2.REVENUE) AS __C0_2
FROM PUBLIC.DIMENSIONPRODUCT PR__Z0
    /* PUBLIC.IX_PRODUCT_PRODUCT_NAME: PRODUCT_NAME = 'Product 1' */
    /* WHERE PR__Z0.PRODUCT_NAME = 'Product 1'
    */
INNER JOIN PUBLIC.DIMENSIONLOCATION LOC__Z1
    /* PUBLIC.IX_LOCATION_LOCATION_NAME: LOCATION_NAME IN('London',
'Paris') */
    ON 1=1
    /* WHERE LOC__Z1.LOCATION_NAME IN('London', 'Paris')
    */
INNER JOIN PUBLIC.FACTTABLEREVENUE F__Z2
    /* PUBLIC.IX_REVENUE_DATE_PRODUCT_LOCATION: DATE_KEY = 20200604
        AND PRODUCT_KEY = PR__Z0._KEY
        AND LOCATION_KEY = LOC__Z1._KEY
     */
    ON 1=1
WHERE (LOC__Z1.LOCATION_NAME IN('London', 'Paris'))
    AND ((PR__Z0.PRODUCT_NAME = 'Product 1')
    AND ((F__Z2.DATE_KEY = 20200604)
    AND ((PR__Z0._KEY = F__Z2.PRODUCT_KEY)
    AND (LOC__Z1._KEY = F__Z2.LOCATION_KEY))))
GROUP BY F__Z2.DATE_KEY, LOC__Z1.LOCATION_NAME


PLAN  SELECT
    __C0_0 AS DATE_KEY,
    __C0_1 AS LOCATION_NAME,
    CAST(CAST(SUM(__C0_2) AS DOUBLE) AS DOUBLE) AS __C0_2
FROM PUBLIC.__T0
    /* PUBLIC."merge_scan" */
GROUP BY __C0_0, __C0_1

2 rows selected (0,011 seconds)

Is this what you wanted? First we filter pr and loc by varchar, then join
this small result to facts using secondary index.

Regards,
-- 
Ilya Kasnacheev


чт, 4 июн. 2020 г. в 16:58, njcstreet <[email protected]>:

> Hi,
>
> I am evaluating Ignite for a data warehouse style system which would have a
> central very large "fact" table with potentially billions of records, and
> several "dimensions" that describe the data. The fact table would be
> partitioned as it is large, and the dimensions would be replicated across
> all nodes. I am using the latest version 2.8.
>
> My question is about index usage and joins. I need to join between the fact
> table (which has the numerical transaction values), and the dimensions
> which
> describe the data (such as product / location). However it seems that
> indexes on the fact table won't be used when joining. I understand that you
> can only use one index per table in a query, so I was hoping to use a group
> index for the query against the fact table, since there are a few
> attributes
> that users will always filter on. Here is an example schema (heavily
> simplified and with little data, but enough to demonstrate that the Explain
> plan is not using the index on the Fact table)
>
> CREATE TABLE IF NOT EXISTS FactTableRevenue (
>
> id int PRIMARY KEY,
>
> date_key int,
> product_key int,
> location_key int,
> revenue float
>
> ) WITH "template=partitioned,backups=0";
>
>
>
> CREATE TABLE IF NOT EXISTS DimensionProduct (
>
> id int PRIMARY KEY,
> product_name varchar
>
> ) WITH "TEMPLATE=REPLICATED";
>
>
>
> CREATE TABLE IF NOT EXISTS DimensionLocation (
>
> id int PRIMARY KEY,
> location_name varchar
>
> )WITH "TEMPLATE=REPLICATED";
>
>
>
> CREATE INDEX ix_product_product_name ON DimensionProduct(product_name);
> CREATE INDEX ix_location_location_name ON DimensionLocation(location_name);
> CREATE INDEX ix_revenue_date_product_location ON FactTableRevenue(date_key,
> product_key, location_key);
>
>
> INSERT INTO DimensionProduct (id, product_name) VALUES (1, 'Product 1');
> INSERT INTO DimensionProduct (id, product_name) VALUES (2, 'Product 2');
> INSERT INTO DimensionProduct (id, product_name) VALUES (3, 'Product 3');
>
> INSERT INTO DimensionLocation (id, location_name) VALUES (1, 'London');
> INSERT INTO DimensionLocation (id, location_name) VALUES (2, 'Paris');
> INSERT INTO DimensionLocation (id, location_name) VALUES (3, 'New York');
>
> INSERT INTO FactTableRevenue (id, date_key, product_key, location_key,
> revenue) VALUES
> (1, 20200604, 1, 1, 500);
>
> INSERT INTO FactTableRevenue (id, date_key, product_key, location_key,
> revenue) VALUES
> (2, 20200604, 1, 2, 700);
>
> INSERT INTO FactTableRevenue (id, date_key, product_key, location_key,
> revenue) VALUES
> (3, 20200604, 1, 3, 90);
>
> INSERT INTO FactTableRevenue (id, date_key, product_key, location_key,
> revenue) VALUES
> (4, 20200604, 2, 1, 267);
>
> INSERT INTO FactTableRevenue (id, date_key, product_key, location_key,
> revenue) VALUES
> (5, 20200604, 2, 3, 755);
>
>
>
> /* 1st example query with no joins but filling in the dimension keys up
> front. This utilities the full group index which is great, but not how we
> want to query the system as you would have to first run some queries to
> determine the keys */
>
> /* PUBLIC.IX_REVENUE_DATE_PRODUCT_LOCATION: LOCATION_KEY IN(1, 2)
>
>         AND DATE_KEY = 20200604
>
>         AND PRODUCT_KEY = 1
>
>      */
>
> EXPLAIN SELECT f.Date_key,
>
> loc.Location_name,
> SUM(f.Revenue)
> FROM FactTableRevenue f
>
> INNER JOIN DimensionProduct pr
> ON pr._key = f.Product_Key
>
> INNER JOIN DimensionLocation loc
> ON loc._key = f.Location_Key
>
>
> WHERE f.Date_Key = 20200604
> AND f.Product_key = 1 -- Filter specifically on key
> AND f.Location_Key IN (1, 2) -- Filter specifically on key
> GROUP BY f.Date_Key, loc.Location_name
>
>
>
> /*  Second example - this is how we actually want to query by filtering on
> specific dimension attributes. Here, only the date_key is used in the index
> because there is no join on it. But product_key and location_key are not
> included in the index usage because they are in joins */
>
> -- key which is directly on Fact table
>
> /* PUBLIC.IX_REVENUE_DATE_PRODUCT_LOCATION: DATE_KEY = 20200604 */
>
> EXPLAIN SELECT f.Date_key,
> loc.Location_name,
> SUM(f.Revenue)
> FROM FactTableRevenue f
>
> INNER JOIN DimensionProduct pr
> ON pr._key = f.Product_Key
>
> INNER JOIN DimensionLocation loc
> ON loc._key = f.Location_Key
>
> WHERE f.Date_Key = 20200604
> AND pr.Product_Name = 'Product 1'
> AND loc.Location_Name IN ('London', 'Paris')
>
> GROUP BY f.Date_Key, loc.Location_name
>
>
> Now I know that I could put product_name and location_name directly on the
> fact table, however these are not the only attributes that the user may
> want
> to filter on. Also it will result in a much higher data usage because of
> having to store strings across billions of rows rather than integers.
>
> Thanks very much for your help!
>
>
>
>
> --
> Sent from: http://apache-ignite-users.70518.x6.nabble.com/
>

Re: Index usage on Joins

Reply via email to