Hello! > 2) ga_customer_and_category_id: on customer_id and category_id
Have you tried to do an index on category_id first, customer_id second? Note that Ignite will use only one index when joining two tables and that in your case it should start with category_id. You can also try adding affinity key to this index in various places, see if it helps further. Regards, -- Ilya Kasnacheev ср, 19 сент. 2018 г. в 21:27, eugene miretsky <[email protected]>: > Hi Ilya, > > I created 4 indexs on the table: > 1) ga_pKey: on customer_id, dt, category_id (that's our primary key > columns) > 2) ga_customer_and_category_id: on customer_id and category_id > 2) ga_customer_id: on customer_id > 4) ga_category_id: on category_id > > > For the first query (category in ()), the execution plan when using the > first 3 index is exactly the same - using /* PUBLIC.AFFINITY_KEY */ > When using #4 (alone or in combination with any of the other 3) > > 1. /* PUBLIC.AFFINITY_KEY */ is replaced with /* > PUBLIC.GA_CATEGORY_ID: CATEGORY_ID IN(117930, 175930, 175940, 175945, > 101450) */ > 2. The query runs slower. > > For the second query (join on an inlined table) the behaviour is very > similar. Using the first 3 indexes results in the same plan - using /* > PUBLIC.AFFINITY_KEY */ and /* function: CATEGORY_ID = GA__Z0.CATEGORY_ID > */. > When using #4 (alone or in combination with any of the other 3) > > 1. /* function */ and /* PUBLIC.GA_CATEGORY_ID: CATEGORY_ID = > CATS__Z1.CATEGORY_ID */ are used > 2. The query is much slower. > > > Theoretically the query seems pretty simple > > 1. Use affinity key to make sure the query runs in parallel and there > are no shuffles > 2. Filter rows that match category_id using the category_id index > 3. Used customer_id index for the group_by (not sure if this step > makes sense) > > But I cannot get it to work. > > Cheers, > Eugene > > > > > On Tue, Sep 18, 2018 at 10:56 AM Ilya Kasnacheev < > [email protected]> wrote: > >> Hello! >> >> I can see you try to use _key_PK as index. If your primary key is >> composite, it won't work properly for you. I recommend creating an explicit >> (category_id, customer_id) index. >> >> Regards, >> -- >> Ilya Kasnacheev >> >> >> вт, 18 сент. 2018 г. в 17:47, eugene miretsky <[email protected] >> >: >> >>> Hi Ilya, >>> >>> The different query result was my mistake - one of the categoy_ids was >>> duplicate, so in the query that used join, it counted rows for that >>> category twice. My apologies. >>> >>> However, we are still having an issue with query time, and the index not >>> being applied to category_id. Would appreciate if you could take a look. >>> >>> Cheers, >>> Eugene >>> >>> On Mon, Sep 17, 2018 at 9:15 AM Ilya Kasnacheev < >>> [email protected]> wrote: >>> >>>> Hello! >>>> >>>> Why don't you diff the results of those two queries, tell us what the >>>> difference is? >>>> >>>> Regards, >>>> -- >>>> Ilya Kasnacheev >>>> >>>> >>>> пн, 17 сент. 2018 г. в 16:08, eugene miretsky < >>>> [email protected]>: >>>> >>>>> Hello, >>>>> >>>>> Just wanted to see if anybody had time to look into this. >>>>> >>>>> Cheers, >>>>> Eugene >>>>> >>>>> On Wed, Sep 12, 2018 at 6:29 PM eugene miretsky < >>>>> [email protected]> wrote: >>>>> >>>>>> Thanks! >>>>>> >>>>>> Tried joining with an inlined table instead of IN as per the second >>>>>> suggestion, and it didn't quite work. >>>>>> >>>>>> Query1: >>>>>> >>>>>> - Select COUNT(*) FROM( Select customer_id from GATABLE3 use >>>>>> Index( ) where category_id in (9005, 175930, 175930, >>>>>> 175940,175945,101450, >>>>>> 6453) group by customer_id having SUM(product_views_app) > 2 OR >>>>>> SUM(product_clicks_app) > 1 ) >>>>>> - exec time = 17s >>>>>> - *Result: 3105868* >>>>>> - Same exec time if using AFFINITY_KEY index or "_key_PK_hash or >>>>>> customer_id index >>>>>> - Using an index on category_id increases the query time 33s >>>>>> >>>>>> Query2: >>>>>> >>>>>> - Select COUNT(*) FROM( Select customer_id from GATABLE3 ga use >>>>>> index (PUBLIC."_key_PK") inner join table(category_id int = (9005, >>>>>> 175930, >>>>>> 175930, 175940,175945,101450, 6453)) cats on cats.category_id = >>>>>> ga.category_id group by customer_id having SUM(product_views_app) > >>>>>> 2 OR >>>>>> SUM(product_clicks_app) > 1 ) >>>>>> - exec time = 38s >>>>>> - *Result: 3113921* >>>>>> - Same exec time if using AFFINITY_KEY index or "_key_PK_hash or >>>>>> customer_id index or category_id index >>>>>> - Using an index on category_id doesnt change the run time >>>>>> >>>>>> Query plans are attached. >>>>>> >>>>>> 3 questions: >>>>>> >>>>>> 1. Why is the result differnt for the 2 queries - this is quite >>>>>> concerning. >>>>>> 2. Why is the 2nd query taking longer >>>>>> 3. Why category_id index doesn't work in case of query 2. >>>>>> >>>>>> >>>>>> On Wed, Sep 5, 2018 at 8:31 AM Ilya Kasnacheev < >>>>>> [email protected]> wrote: >>>>>> >>>>>>> Hello! >>>>>>> >>>>>>> I don't think that we're able to use index with IN () clauses. >>>>>>> Please convert it into OR clauses. >>>>>>> >>>>>>> Please see >>>>>>> https://apacheignite-sql.readme.io/docs/performance-and-debugging#section-sql-performance-and-usability-considerations >>>>>>> >>>>>>> Regards, >>>>>>> -- >>>>>>> Ilya Kasnacheev >>>>>>> >>>>>>> >>>>>>> пн, 3 сент. 2018 г. в 12:46, Andrey Mashenkov < >>>>>>> [email protected]>: >>>>>>> >>>>>>>> Hi >>>>>>>> >>>>>>>> Actually, first query uses index on affinity key which looks more >>>>>>>> efficient than index on category_id column. >>>>>>>> The first query can process groups one by one and stream partial >>>>>>>> results from map phase to reduce phase as it use sorted index lookup, >>>>>>>> while second query should process full dataset on map phase before >>>>>>>> pass it for reducing. >>>>>>>> >>>>>>>> Try to use composite index (customer_id, category_id). >>>>>>>> >>>>>>>> Also, SqlQueryFields.setCollocated(true) flag can help Ignite to >>>>>>>> build more efficient plan when group by on collocated column is used. >>>>>>>> >>>>>>>> On Sun, Sep 2, 2018 at 2:02 AM eugene miretsky < >>>>>>>> [email protected]> wrote: >>>>>>>> >>>>>>>>> Hello, >>>>>>>>> >>>>>>>>> Schema: >>>>>>>>> >>>>>>>>> - >>>>>>>>> >>>>>>>>> PUBLIC.GATABLE2.CUSTOMER_ID >>>>>>>>> >>>>>>>>> PUBLIC.GATABLE2.DT >>>>>>>>> >>>>>>>>> PUBLIC.GATABLE2.CATEGORY_ID >>>>>>>>> >>>>>>>>> PUBLIC.GATABLE2.VERTICAL_ID >>>>>>>>> >>>>>>>>> PUBLIC.GATABLE2.SERVICE >>>>>>>>> >>>>>>>>> PUBLIC.GATABLE2.PRODUCT_VIEWS_APP >>>>>>>>> >>>>>>>>> PUBLIC.GATABLE2.PRODUCT_CLICKS_APP >>>>>>>>> >>>>>>>>> PUBLIC.GATABLE2.PRODUCT_VIEWS_WEB >>>>>>>>> >>>>>>>>> PUBLIC.GATABLE2.PRODUCT_CLICKS_WEB >>>>>>>>> >>>>>>>>> PUBLIC.GATABLE2.PDP_SESSIONS_APP >>>>>>>>> >>>>>>>>> PUBLIC.GATABLE2.PDP_SESSIONS_WEB >>>>>>>>> - pkey = customer_id,dt >>>>>>>>> - affinityKey = customer >>>>>>>>> >>>>>>>>> Query: >>>>>>>>> >>>>>>>>> - select COUNT(*) FROM( Select customer_id from GATABLE2 where >>>>>>>>> category_id in (175925, 101450, 9005, 175930, 175930, >>>>>>>>> 175940,175945,101450, >>>>>>>>> 6453) group by customer_id having SUM(product_views_app) > 2 OR >>>>>>>>> SUM(product_clicks_app) > 1 ) >>>>>>>>> >>>>>>>>> The table has 600M rows. >>>>>>>>> At first, the query took 1m, when we added an index on category_id >>>>>>>>> the query started taking 3m. >>>>>>>>> >>>>>>>>> The SQL execution plan for both queries is attached. >>>>>>>>> >>>>>>>>> We are using a single x1.16xlarge insntace with query parallelism >>>>>>>>> set to 32 >>>>>>>>> >>>>>>>>> Cheers, >>>>>>>>> Eugene >>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>>> -- >>>>>>>> Best regards, >>>>>>>> Andrey V. Mashenkov >>>>>>>> >>>>>>>
