Hi

Actually, first query uses index on affinity key which looks more efficient
than index on category_id column.
The first query can process groups one by one and stream partial results
from map phase to reduce phase as it use sorted index lookup,
while second query should process full dataset on map phase before pass it
for reducing.

Try to use composite index (customer_id, category_id).

Also, SqlQueryFields.setCollocated(true) flag can help Ignite to build more
efficient plan when group by on collocated column is used.

On Sun, Sep 2, 2018 at 2:02 AM eugene miretsky <[email protected]>
wrote:

> Hello,
>
> Schema:
>
>    -
>
>    PUBLIC.GATABLE2.CUSTOMER_ID
>
>    PUBLIC.GATABLE2.DT
>
>    PUBLIC.GATABLE2.CATEGORY_ID
>
>    PUBLIC.GATABLE2.VERTICAL_ID
>
>    PUBLIC.GATABLE2.SERVICE
>
>    PUBLIC.GATABLE2.PRODUCT_VIEWS_APP
>
>    PUBLIC.GATABLE2.PRODUCT_CLICKS_APP
>
>    PUBLIC.GATABLE2.PRODUCT_VIEWS_WEB
>
>    PUBLIC.GATABLE2.PRODUCT_CLICKS_WEB
>
>    PUBLIC.GATABLE2.PDP_SESSIONS_APP
>
>    PUBLIC.GATABLE2.PDP_SESSIONS_WEB
>    - pkey = customer_id,dt
>    - affinityKey = customer
>
> Query:
>
>    - select COUNT(*) FROM( Select customer_id from GATABLE2 where
>    category_id in (175925, 101450, 9005, 175930, 175930, 175940,175945,101450,
>    6453) group by customer_id having SUM(product_views_app) > 2 OR
>    SUM(product_clicks_app) > 1 )
>
> The table has 600M rows.
> At first, the query took 1m, when we added an index on category_id the
> query started taking 3m.
>
> The SQL execution plan for both queries is attached.
>
> We are using a single x1.16xlarge insntace with query parallelism set to
> 32
>
> Cheers,
> Eugene
>
>

-- 
Best regards,
Andrey V. Mashenkov

Reply via email to