Hello, Just wanted to see if anybody had time to look into this.
Cheers, Eugene On Wed, Sep 12, 2018 at 6:29 PM eugene miretsky <[email protected]> wrote: > Thanks! > > Tried joining with an inlined table instead of IN as per the second > suggestion, and it didn't quite work. > > Query1: > > - Select COUNT(*) FROM( Select customer_id from GATABLE3 use Index( ) > where category_id in (9005, 175930, 175930, 175940,175945,101450, 6453) > group by customer_id having SUM(product_views_app) > 2 OR > SUM(product_clicks_app) > 1 ) > - exec time = 17s > - *Result: 3105868* > - Same exec time if using AFFINITY_KEY index or "_key_PK_hash or > customer_id index > - Using an index on category_id increases the query time 33s > > Query2: > > - Select COUNT(*) FROM( Select customer_id from GATABLE3 ga use index > (PUBLIC."_key_PK") inner join table(category_id int = (9005, 175930, > 175930, 175940,175945,101450, 6453)) cats on cats.category_id = > ga.category_id group by customer_id having SUM(product_views_app) > 2 OR > SUM(product_clicks_app) > 1 ) > - exec time = 38s > - *Result: 3113921* > - Same exec time if using AFFINITY_KEY index or "_key_PK_hash or > customer_id index or category_id index > - Using an index on category_id doesnt change the run time > > Query plans are attached. > > 3 questions: > > 1. Why is the result differnt for the 2 queries - this is quite > concerning. > 2. Why is the 2nd query taking longer > 3. Why category_id index doesn't work in case of query 2. > > > On Wed, Sep 5, 2018 at 8:31 AM Ilya Kasnacheev <[email protected]> > wrote: > >> Hello! >> >> I don't think that we're able to use index with IN () clauses. Please >> convert it into OR clauses. >> >> Please see >> https://apacheignite-sql.readme.io/docs/performance-and-debugging#section-sql-performance-and-usability-considerations >> >> Regards, >> -- >> Ilya Kasnacheev >> >> >> пн, 3 сент. 2018 г. в 12:46, Andrey Mashenkov <[email protected] >> >: >> >>> Hi >>> >>> Actually, first query uses index on affinity key which looks more >>> efficient than index on category_id column. >>> The first query can process groups one by one and stream partial results >>> from map phase to reduce phase as it use sorted index lookup, >>> while second query should process full dataset on map phase before pass >>> it for reducing. >>> >>> Try to use composite index (customer_id, category_id). >>> >>> Also, SqlQueryFields.setCollocated(true) flag can help Ignite to build >>> more efficient plan when group by on collocated column is used. >>> >>> On Sun, Sep 2, 2018 at 2:02 AM eugene miretsky < >>> [email protected]> wrote: >>> >>>> Hello, >>>> >>>> Schema: >>>> >>>> - >>>> >>>> PUBLIC.GATABLE2.CUSTOMER_ID >>>> >>>> PUBLIC.GATABLE2.DT >>>> >>>> PUBLIC.GATABLE2.CATEGORY_ID >>>> >>>> PUBLIC.GATABLE2.VERTICAL_ID >>>> >>>> PUBLIC.GATABLE2.SERVICE >>>> >>>> PUBLIC.GATABLE2.PRODUCT_VIEWS_APP >>>> >>>> PUBLIC.GATABLE2.PRODUCT_CLICKS_APP >>>> >>>> PUBLIC.GATABLE2.PRODUCT_VIEWS_WEB >>>> >>>> PUBLIC.GATABLE2.PRODUCT_CLICKS_WEB >>>> >>>> PUBLIC.GATABLE2.PDP_SESSIONS_APP >>>> >>>> PUBLIC.GATABLE2.PDP_SESSIONS_WEB >>>> - pkey = customer_id,dt >>>> - affinityKey = customer >>>> >>>> Query: >>>> >>>> - select COUNT(*) FROM( Select customer_id from GATABLE2 where >>>> category_id in (175925, 101450, 9005, 175930, 175930, >>>> 175940,175945,101450, >>>> 6453) group by customer_id having SUM(product_views_app) > 2 OR >>>> SUM(product_clicks_app) > 1 ) >>>> >>>> The table has 600M rows. >>>> At first, the query took 1m, when we added an index on category_id the >>>> query started taking 3m. >>>> >>>> The SQL execution plan for both queries is attached. >>>> >>>> We are using a single x1.16xlarge insntace with query parallelism set >>>> to 32 >>>> >>>> Cheers, >>>> Eugene >>>> >>>> >>> >>> -- >>> Best regards, >>> Andrey V. Mashenkov >>> >>
