Thanks! Tried joining with an inlined table instead of IN as per the second suggestion, and it didn't quite work.
Query1: - Select COUNT(*) FROM( Select customer_id from GATABLE3 use Index( ) where category_id in (9005, 175930, 175930, 175940,175945,101450, 6453) group by customer_id having SUM(product_views_app) > 2 OR SUM(product_clicks_app) > 1 ) - exec time = 17s - *Result: 3105868* - Same exec time if using AFFINITY_KEY index or "_key_PK_hash or customer_id index - Using an index on category_id increases the query time 33s Query2: - Select COUNT(*) FROM( Select customer_id from GATABLE3 ga use index (PUBLIC."_key_PK") inner join table(category_id int = (9005, 175930, 175930, 175940,175945,101450, 6453)) cats on cats.category_id = ga.category_id group by customer_id having SUM(product_views_app) > 2 OR SUM(product_clicks_app) > 1 ) - exec time = 38s - *Result: 3113921* - Same exec time if using AFFINITY_KEY index or "_key_PK_hash or customer_id index or category_id index - Using an index on category_id doesnt change the run time Query plans are attached. 3 questions: 1. Why is the result differnt for the 2 queries - this is quite concerning. 2. Why is the 2nd query taking longer 3. Why category_id index doesn't work in case of query 2. On Wed, Sep 5, 2018 at 8:31 AM Ilya Kasnacheev <[email protected]> wrote: > Hello! > > I don't think that we're able to use index with IN () clauses. Please > convert it into OR clauses. > > Please see > https://apacheignite-sql.readme.io/docs/performance-and-debugging#section-sql-performance-and-usability-considerations > > Regards, > -- > Ilya Kasnacheev > > > пн, 3 сент. 2018 г. в 12:46, Andrey Mashenkov <[email protected] > >: > >> Hi >> >> Actually, first query uses index on affinity key which looks more >> efficient than index on category_id column. >> The first query can process groups one by one and stream partial results >> from map phase to reduce phase as it use sorted index lookup, >> while second query should process full dataset on map phase before pass >> it for reducing. >> >> Try to use composite index (customer_id, category_id). >> >> Also, SqlQueryFields.setCollocated(true) flag can help Ignite to build >> more efficient plan when group by on collocated column is used. >> >> On Sun, Sep 2, 2018 at 2:02 AM eugene miretsky <[email protected]> >> wrote: >> >>> Hello, >>> >>> Schema: >>> >>> - >>> >>> PUBLIC.GATABLE2.CUSTOMER_ID >>> >>> PUBLIC.GATABLE2.DT >>> >>> PUBLIC.GATABLE2.CATEGORY_ID >>> >>> PUBLIC.GATABLE2.VERTICAL_ID >>> >>> PUBLIC.GATABLE2.SERVICE >>> >>> PUBLIC.GATABLE2.PRODUCT_VIEWS_APP >>> >>> PUBLIC.GATABLE2.PRODUCT_CLICKS_APP >>> >>> PUBLIC.GATABLE2.PRODUCT_VIEWS_WEB >>> >>> PUBLIC.GATABLE2.PRODUCT_CLICKS_WEB >>> >>> PUBLIC.GATABLE2.PDP_SESSIONS_APP >>> >>> PUBLIC.GATABLE2.PDP_SESSIONS_WEB >>> - pkey = customer_id,dt >>> - affinityKey = customer >>> >>> Query: >>> >>> - select COUNT(*) FROM( Select customer_id from GATABLE2 where >>> category_id in (175925, 101450, 9005, 175930, 175930, >>> 175940,175945,101450, >>> 6453) group by customer_id having SUM(product_views_app) > 2 OR >>> SUM(product_clicks_app) > 1 ) >>> >>> The table has 600M rows. >>> At first, the query took 1m, when we added an index on category_id the >>> query started taking 3m. >>> >>> The SQL execution plan for both queries is attached. >>> >>> We are using a single x1.16xlarge insntace with query parallelism set >>> to 32 >>> >>> Cheers, >>> Eugene >>> >>> >> >> -- >> Best regards, >> Andrey V. Mashenkov >> >
Query1_pKeyIdx
Description: Binary data
Query1_categoryIdIdx
Description: Binary data
Query2_categoryIdx
Description: Binary data
Query2_pKeyIdx
Description: Binary data
