Hello! Can you share a reproducer project which loads (or generates) data for caches and then queries them? I could try and debug it if I had the reproducer.
Regards. -- Ilya Kasnacheev чт, 20 сент. 2018 г. в 21:05, eugene miretsky <[email protected]>: > Thanks Ilya, > > Tried it, no luck. It performs the same as when using category_id index > alone (slow). > Any combindation I try either uses AFFINITY_KEY or category index. When > it uses category index it runs slowers. > > Also, when AFFINITY_KEY key is used, the jobs runs on 32 threads (my query > parallelism settings ) when category_id is used, the jobs runs on one > thread most of the time (first few seconds it looks like more threads are > doing work). > > Please help on this. It seems like a very simple use case (using affinity > key and another index), either I am doing something extremly silly, or I > stumbled on a bug in Ignite that's effecting a lot of people. > > Cheers, > Eugene > > On Thu, Sep 20, 2018 at 6:22 AM Ilya Kasnacheev <[email protected]> > wrote: > >> Hello! >> >> > 2) ga_customer_and_category_id: on customer_id and category_id >> >> Have you tried to do an index on category_id first, customer_id second? >> Note that Ignite will use only one index when joining two tables and that >> in your case it should start with category_id. >> >> You can also try adding affinity key to this index in various places, see >> if it helps further. >> >> Regards, >> -- >> Ilya Kasnacheev >> >> >> ср, 19 сент. 2018 г. в 21:27, eugene miretsky <[email protected] >> >: >> >>> Hi Ilya, >>> >>> I created 4 indexs on the table: >>> 1) ga_pKey: on customer_id, dt, category_id (that's our primary key >>> columns) >>> 2) ga_customer_and_category_id: on customer_id and category_id >>> 2) ga_customer_id: on customer_id >>> 4) ga_category_id: on category_id >>> >>> >>> For the first query (category in ()), the execution plan when using the >>> first 3 index is exactly the same - using /* PUBLIC.AFFINITY_KEY */ >>> When using #4 (alone or in combination with any of the other 3) >>> >>> 1. /* PUBLIC.AFFINITY_KEY */ is replaced with /* >>> PUBLIC.GA_CATEGORY_ID: CATEGORY_ID IN(117930, 175930, 175940, 175945, >>> 101450) */ >>> 2. The query runs slower. >>> >>> For the second query (join on an inlined table) the behaviour is very >>> similar. Using the first 3 indexes results in the same plan - using /* >>> PUBLIC.AFFINITY_KEY */ and /* function: CATEGORY_ID = GA__Z0.CATEGORY_ID >>> */. >>> When using #4 (alone or in combination with any of the other 3) >>> >>> 1. /* function */ and /* PUBLIC.GA_CATEGORY_ID: CATEGORY_ID = >>> CATS__Z1.CATEGORY_ID */ are used >>> 2. The query is much slower. >>> >>> >>> Theoretically the query seems pretty simple >>> >>> 1. Use affinity key to make sure the query runs in parallel and >>> there are no shuffles >>> 2. Filter rows that match category_id using the category_id index >>> 3. Used customer_id index for the group_by (not sure if this step >>> makes sense) >>> >>> But I cannot get it to work. >>> >>> Cheers, >>> Eugene >>> >>> >>> >>> >>> On Tue, Sep 18, 2018 at 10:56 AM Ilya Kasnacheev < >>> [email protected]> wrote: >>> >>>> Hello! >>>> >>>> I can see you try to use _key_PK as index. If your primary key is >>>> composite, it won't work properly for you. I recommend creating an explicit >>>> (category_id, customer_id) index. >>>> >>>> Regards, >>>> -- >>>> Ilya Kasnacheev >>>> >>>> >>>> вт, 18 сент. 2018 г. в 17:47, eugene miretsky < >>>> [email protected]>: >>>> >>>>> Hi Ilya, >>>>> >>>>> The different query result was my mistake - one of the categoy_ids was >>>>> duplicate, so in the query that used join, it counted rows for that >>>>> category twice. My apologies. >>>>> >>>>> However, we are still having an issue with query time, and the index >>>>> not being applied to category_id. Would appreciate if you could take a >>>>> look. >>>>> >>>>> Cheers, >>>>> Eugene >>>>> >>>>> On Mon, Sep 17, 2018 at 9:15 AM Ilya Kasnacheev < >>>>> [email protected]> wrote: >>>>> >>>>>> Hello! >>>>>> >>>>>> Why don't you diff the results of those two queries, tell us what the >>>>>> difference is? >>>>>> >>>>>> Regards, >>>>>> -- >>>>>> Ilya Kasnacheev >>>>>> >>>>>> >>>>>> пн, 17 сент. 2018 г. в 16:08, eugene miretsky < >>>>>> [email protected]>: >>>>>> >>>>>>> Hello, >>>>>>> >>>>>>> Just wanted to see if anybody had time to look into this. >>>>>>> >>>>>>> Cheers, >>>>>>> Eugene >>>>>>> >>>>>>> On Wed, Sep 12, 2018 at 6:29 PM eugene miretsky < >>>>>>> [email protected]> wrote: >>>>>>> >>>>>>>> Thanks! >>>>>>>> >>>>>>>> Tried joining with an inlined table instead of IN as per the second >>>>>>>> suggestion, and it didn't quite work. >>>>>>>> >>>>>>>> Query1: >>>>>>>> >>>>>>>> - Select COUNT(*) FROM( Select customer_id from GATABLE3 use >>>>>>>> Index( ) where category_id in (9005, 175930, 175930, >>>>>>>> 175940,175945,101450, >>>>>>>> 6453) group by customer_id having SUM(product_views_app) > 2 OR >>>>>>>> SUM(product_clicks_app) > 1 ) >>>>>>>> - exec time = 17s >>>>>>>> - *Result: 3105868* >>>>>>>> - Same exec time if using AFFINITY_KEY index or "_key_PK_hash >>>>>>>> or customer_id index >>>>>>>> - Using an index on category_id increases the query time 33s >>>>>>>> >>>>>>>> Query2: >>>>>>>> >>>>>>>> - Select COUNT(*) FROM( Select customer_id from GATABLE3 ga >>>>>>>> use index (PUBLIC."_key_PK") inner join table(category_id int = >>>>>>>> (9005, >>>>>>>> 175930, 175930, 175940,175945,101450, 6453)) cats on >>>>>>>> cats.category_id = >>>>>>>> ga.category_id group by customer_id having SUM(product_views_app) >>>>>>>> > 2 OR >>>>>>>> SUM(product_clicks_app) > 1 ) >>>>>>>> - exec time = 38s >>>>>>>> - *Result: 3113921* >>>>>>>> - Same exec time if using AFFINITY_KEY index or "_key_PK_hash >>>>>>>> or customer_id index or category_id index >>>>>>>> - Using an index on category_id doesnt change the run time >>>>>>>> >>>>>>>> Query plans are attached. >>>>>>>> >>>>>>>> 3 questions: >>>>>>>> >>>>>>>> 1. Why is the result differnt for the 2 queries - this is quite >>>>>>>> concerning. >>>>>>>> 2. Why is the 2nd query taking longer >>>>>>>> 3. Why category_id index doesn't work in case of query 2. >>>>>>>> >>>>>>>> >>>>>>>> On Wed, Sep 5, 2018 at 8:31 AM Ilya Kasnacheev < >>>>>>>> [email protected]> wrote: >>>>>>>> >>>>>>>>> Hello! >>>>>>>>> >>>>>>>>> I don't think that we're able to use index with IN () clauses. >>>>>>>>> Please convert it into OR clauses. >>>>>>>>> >>>>>>>>> Please see >>>>>>>>> https://apacheignite-sql.readme.io/docs/performance-and-debugging#section-sql-performance-and-usability-considerations >>>>>>>>> >>>>>>>>> Regards, >>>>>>>>> -- >>>>>>>>> Ilya Kasnacheev >>>>>>>>> >>>>>>>>> >>>>>>>>> пн, 3 сент. 2018 г. в 12:46, Andrey Mashenkov < >>>>>>>>> [email protected]>: >>>>>>>>> >>>>>>>>>> Hi >>>>>>>>>> >>>>>>>>>> Actually, first query uses index on affinity key which looks more >>>>>>>>>> efficient than index on category_id column. >>>>>>>>>> The first query can process groups one by one and stream partial >>>>>>>>>> results from map phase to reduce phase as it use sorted index lookup, >>>>>>>>>> while second query should process full dataset on map phase >>>>>>>>>> before pass it for reducing. >>>>>>>>>> >>>>>>>>>> Try to use composite index (customer_id, category_id). >>>>>>>>>> >>>>>>>>>> Also, SqlQueryFields.setCollocated(true) flag can help Ignite to >>>>>>>>>> build more efficient plan when group by on collocated column is used. >>>>>>>>>> >>>>>>>>>> On Sun, Sep 2, 2018 at 2:02 AM eugene miretsky < >>>>>>>>>> [email protected]> wrote: >>>>>>>>>> >>>>>>>>>>> Hello, >>>>>>>>>>> >>>>>>>>>>> Schema: >>>>>>>>>>> >>>>>>>>>>> - >>>>>>>>>>> >>>>>>>>>>> PUBLIC.GATABLE2.CUSTOMER_ID >>>>>>>>>>> >>>>>>>>>>> PUBLIC.GATABLE2.DT >>>>>>>>>>> >>>>>>>>>>> PUBLIC.GATABLE2.CATEGORY_ID >>>>>>>>>>> >>>>>>>>>>> PUBLIC.GATABLE2.VERTICAL_ID >>>>>>>>>>> >>>>>>>>>>> PUBLIC.GATABLE2.SERVICE >>>>>>>>>>> >>>>>>>>>>> PUBLIC.GATABLE2.PRODUCT_VIEWS_APP >>>>>>>>>>> >>>>>>>>>>> PUBLIC.GATABLE2.PRODUCT_CLICKS_APP >>>>>>>>>>> >>>>>>>>>>> PUBLIC.GATABLE2.PRODUCT_VIEWS_WEB >>>>>>>>>>> >>>>>>>>>>> PUBLIC.GATABLE2.PRODUCT_CLICKS_WEB >>>>>>>>>>> >>>>>>>>>>> PUBLIC.GATABLE2.PDP_SESSIONS_APP >>>>>>>>>>> >>>>>>>>>>> PUBLIC.GATABLE2.PDP_SESSIONS_WEB >>>>>>>>>>> - pkey = customer_id,dt >>>>>>>>>>> - affinityKey = customer >>>>>>>>>>> >>>>>>>>>>> Query: >>>>>>>>>>> >>>>>>>>>>> - select COUNT(*) FROM( Select customer_id from GATABLE2 >>>>>>>>>>> where category_id in (175925, 101450, 9005, 175930, 175930, >>>>>>>>>>> 175940,175945,101450, 6453) group by customer_id having >>>>>>>>>>> SUM(product_views_app) > 2 OR SUM(product_clicks_app) > 1 ) >>>>>>>>>>> >>>>>>>>>>> The table has 600M rows. >>>>>>>>>>> At first, the query took 1m, when we added an index on >>>>>>>>>>> category_id the query started taking 3m. >>>>>>>>>>> >>>>>>>>>>> The SQL execution plan for both queries is attached. >>>>>>>>>>> >>>>>>>>>>> We are using a single x1.16xlarge insntace with query >>>>>>>>>>> parallelism set to 32 >>>>>>>>>>> >>>>>>>>>>> Cheers, >>>>>>>>>>> Eugene >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>> >>>>>>>>>> -- >>>>>>>>>> Best regards, >>>>>>>>>> Andrey V. Mashenkov >>>>>>>>>> >>>>>>>>>
