Hi Ilya, I have tried it, and got the same performance as forcing using category index in my initial benchmark - query is 3x slowers and uses only one thread.
>From my experiments so far it seems like Ignite can either (a) use affinity key and run queries in parallel, (b) use index but run the query on only one thread. Has anybody been able to run OLAP like queries in while using an index? Cheers, Eugene On Mon, Sep 24, 2018 at 10:55 AM Ilya Kasnacheev <[email protected]> wrote: > Hello! > > I guess that using AFFINITY_KEY as index have something to do with the > fact that GROUP BY really wants to work per-partition. > > I have the following query for you: > > 1: jdbc:ignite:thin://localhost> explain Select count(*) FROM( Select > customer_id from (Select customer_id, product_views_app, product_clict_app > from GA_DATA ga join table(category_id int = ( 117930, 175930, > 175940,175945,101450)) cats on cats.category_id = ga.category_id) data > group by customer_id having SUM(product_views_app) > 2 OR > SUM(product_clict_app) > 1); > PLAN SELECT > DATA__Z2.CUSTOMER_ID AS __C0_0, > SUM(DATA__Z2.PRODUCT_VIEWS_APP) AS __C0_1, > SUM(DATA__Z2.PRODUCT_CLICT_APP) AS __C0_2 > FROM ( > SELECT > GA__Z0.CUSTOMER_ID, > GA__Z0.PRODUCT_VIEWS_APP, > GA__Z0.PRODUCT_CLICT_APP > FROM TABLE(CATEGORY_ID INTEGER=(117930, 175930, 175940, 175945, > 101450)) CATS__Z1 > INNER JOIN PUBLIC.GA_DATA GA__Z0 > ON 1=1 > WHERE CATS__Z1.CATEGORY_ID = GA__Z0.CATEGORY_ID > ) DATA__Z2 > /* SELECT > GA__Z0.CUSTOMER_ID, > GA__Z0.PRODUCT_VIEWS_APP, > GA__Z0.PRODUCT_CLICT_APP > FROM TABLE(CATEGORY_ID INTEGER=(117930, 175930, 175940, 175945, > 101450)) CATS__Z1 > /++ function ++/ > INNER JOIN PUBLIC.GA_DATA GA__Z0 > /++ PUBLIC.GA_CATEGORY_ID: CATEGORY_ID = CATS__Z1.CATEGORY_ID ++/ > ON 1=1 > WHERE CATS__Z1.CATEGORY_ID = GA__Z0.CATEGORY_ID > */ > GROUP BY DATA__Z2.CUSTOMER_ID > > PLAN SELECT > COUNT(*) > FROM ( > SELECT > __C0_0 AS CUSTOMER_ID > FROM PUBLIC.__T0 > GROUP BY __C0_0 > HAVING (SUM(__C0_1) > 2) > OR (SUM(__C0_2) > 1) > ) _18__Z3 > /* SELECT > __C0_0 AS CUSTOMER_ID > FROM PUBLIC.__T0 > /++ PUBLIC."merge_scan" ++/ > GROUP BY __C0_0 > HAVING (SUM(__C0_1) > 2) > OR (SUM(__C0_2) > 1) > */ > > However, I'm not sure it is "optimal" or not since I have no idea if it > will perform better or worse on real data. That's why I need a subset of > data which will make query execution speed readily visible. Unfortunately, > I can't deduce that from query plan alone. > > Regards, > -- > Ilya Kasnacheev > > > пн, 24 сент. 2018 г. в 16:14, eugene miretsky <[email protected]>: > >> An easy way to reproduce would be to >> >> 1. Create table >> >> CREATE TABLE GA_DATA ( >> customer_id bigint, >> dt timestamp, >> category_id int, >> product_views_app int, >> product_clict_app int, >> product_clict_web int, >> product_clict_web int, >> PRIMARY KEY (customer_id, dt, category_id) >> ) WITH "template=ga_template, backups=0, affinityKey=customer_id"; >> >> 2. Create indexes >> >> - CREATE INDEX ga_customer_id ON GA_Data (customer_id) >> - CREATE INDEX ga_pKey ON GA_Data (customer_id, dt, category_id) >> - CREATE INDEX ga_category_and_customer_id ON GA_Data (category_id, >> customer_id) >> - CREATE INDEX ga_category_id ON GA_Data (category_id) >> >> 3. Run Explain on the following queries while trying forcing using >> different indexes >> >> - Select count(*) FROM( >> >> Select customer_id from GA_DATA use index (ga_category_id) >> where category_id in (117930, 175930, 175940,175945,101450) >> group by customer_id having SUM(product_views_app) > 2 OR >> SUM(product_clicks_app) > 1 ) >> >> >> - Select count(*) FROM( >> >> Select customer_id from GA_DATA ga use index (ga_pKey) >> join table(category_id int = ( 117930, 175930, 175940,175945,101450)) >> cats on cats.category_id = ga.category_id >> group by customer_id having SUM(product_views_app) > 2 OR >> SUM(product_clicks_app) > 1 >> ) >> >> The execution plans will be similar to what I have posted earler. In >> particular, only on of (a) affinty key index, (b) category_id index will be >> used. >> >> On Fri, Sep 21, 2018 at 8:49 AM Ilya Kasnacheev < >> [email protected]> wrote: >> >>> Hello! >>> >>> Can you share a reproducer project which loads (or generates) data for >>> caches and then queries them? I could try and debug it if I had the >>> reproducer. >>> >>> Regards. >>> -- >>> Ilya Kasnacheev >>> >>> >>> чт, 20 сент. 2018 г. в 21:05, eugene miretsky <[email protected] >>> >: >>> >>>> Thanks Ilya, >>>> >>>> Tried it, no luck. It performs the same as when using category_id index >>>> alone (slow). >>>> Any combindation I try either uses AFFINITY_KEY or category index. >>>> When it uses category index it runs slowers. >>>> >>>> Also, when AFFINITY_KEY key is used, the jobs runs on 32 threads (my >>>> query parallelism settings ) when category_id is used, the jobs runs on one >>>> thread most of the time (first few seconds it looks like more threads are >>>> doing work). >>>> >>>> Please help on this. It seems like a very simple use case (using >>>> affinity key and another index), either I am doing something extremly >>>> silly, or I stumbled on a bug in Ignite that's effecting a lot of people. >>>> >>>> Cheers, >>>> Eugene >>>> >>>> On Thu, Sep 20, 2018 at 6:22 AM Ilya Kasnacheev < >>>> [email protected]> wrote: >>>> >>>>> Hello! >>>>> >>>>> > 2) ga_customer_and_category_id: on customer_id and category_id >>>>> >>>>> Have you tried to do an index on category_id first, customer_id >>>>> second? Note that Ignite will use only one index when joining two tables >>>>> and that in your case it should start with category_id. >>>>> >>>>> You can also try adding affinity key to this index in various places, >>>>> see if it helps further. >>>>> >>>>> Regards, >>>>> -- >>>>> Ilya Kasnacheev >>>>> >>>>> >>>>> ср, 19 сент. 2018 г. в 21:27, eugene miretsky < >>>>> [email protected]>: >>>>> >>>>>> Hi Ilya, >>>>>> >>>>>> I created 4 indexs on the table: >>>>>> 1) ga_pKey: on customer_id, dt, category_id (that's our primary key >>>>>> columns) >>>>>> 2) ga_customer_and_category_id: on customer_id and category_id >>>>>> 2) ga_customer_id: on customer_id >>>>>> 4) ga_category_id: on category_id >>>>>> >>>>>> >>>>>> For the first query (category in ()), the execution plan when using >>>>>> the first 3 index is exactly the same - using /* PUBLIC.AFFINITY_KEY */ >>>>>> When using #4 (alone or in combination with any of the other 3) >>>>>> >>>>>> 1. /* PUBLIC.AFFINITY_KEY */ is replaced with /* >>>>>> PUBLIC.GA_CATEGORY_ID: CATEGORY_ID IN(117930, 175930, 175940, 175945, >>>>>> 101450) */ >>>>>> 2. The query runs slower. >>>>>> >>>>>> For the second query (join on an inlined table) the behaviour is very >>>>>> similar. Using the first 3 indexes results in the same plan - using /* >>>>>> PUBLIC.AFFINITY_KEY */ and /* function: CATEGORY_ID = GA__Z0.CATEGORY_ID >>>>>> */. >>>>>> When using #4 (alone or in combination with any of the other 3) >>>>>> >>>>>> 1. /* function */ and /* PUBLIC.GA_CATEGORY_ID: CATEGORY_ID = >>>>>> CATS__Z1.CATEGORY_ID */ are used >>>>>> 2. The query is much slower. >>>>>> >>>>>> >>>>>> Theoretically the query seems pretty simple >>>>>> >>>>>> 1. Use affinity key to make sure the query runs in parallel and >>>>>> there are no shuffles >>>>>> 2. Filter rows that match category_id using the category_id index >>>>>> 3. Used customer_id index for the group_by (not sure if this step >>>>>> makes sense) >>>>>> >>>>>> But I cannot get it to work. >>>>>> >>>>>> Cheers, >>>>>> Eugene >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> On Tue, Sep 18, 2018 at 10:56 AM Ilya Kasnacheev < >>>>>> [email protected]> wrote: >>>>>> >>>>>>> Hello! >>>>>>> >>>>>>> I can see you try to use _key_PK as index. If your primary key is >>>>>>> composite, it won't work properly for you. I recommend creating an >>>>>>> explicit >>>>>>> (category_id, customer_id) index. >>>>>>> >>>>>>> Regards, >>>>>>> -- >>>>>>> Ilya Kasnacheev >>>>>>> >>>>>>> >>>>>>> вт, 18 сент. 2018 г. в 17:47, eugene miretsky < >>>>>>> [email protected]>: >>>>>>> >>>>>>>> Hi Ilya, >>>>>>>> >>>>>>>> The different query result was my mistake - one of the categoy_ids >>>>>>>> was duplicate, so in the query that used join, it counted rows for that >>>>>>>> category twice. My apologies. >>>>>>>> >>>>>>>> However, we are still having an issue with query time, and the >>>>>>>> index not being applied to category_id. Would appreciate if you could >>>>>>>> take >>>>>>>> a look. >>>>>>>> >>>>>>>> Cheers, >>>>>>>> Eugene >>>>>>>> >>>>>>>> On Mon, Sep 17, 2018 at 9:15 AM Ilya Kasnacheev < >>>>>>>> [email protected]> wrote: >>>>>>>> >>>>>>>>> Hello! >>>>>>>>> >>>>>>>>> Why don't you diff the results of those two queries, tell us what >>>>>>>>> the difference is? >>>>>>>>> >>>>>>>>> Regards, >>>>>>>>> -- >>>>>>>>> Ilya Kasnacheev >>>>>>>>> >>>>>>>>> >>>>>>>>> пн, 17 сент. 2018 г. в 16:08, eugene miretsky < >>>>>>>>> [email protected]>: >>>>>>>>> >>>>>>>>>> Hello, >>>>>>>>>> >>>>>>>>>> Just wanted to see if anybody had time to look into this. >>>>>>>>>> >>>>>>>>>> Cheers, >>>>>>>>>> Eugene >>>>>>>>>> >>>>>>>>>> On Wed, Sep 12, 2018 at 6:29 PM eugene miretsky < >>>>>>>>>> [email protected]> wrote: >>>>>>>>>> >>>>>>>>>>> Thanks! >>>>>>>>>>> >>>>>>>>>>> Tried joining with an inlined table instead of IN as per the >>>>>>>>>>> second suggestion, and it didn't quite work. >>>>>>>>>>> >>>>>>>>>>> Query1: >>>>>>>>>>> >>>>>>>>>>> - Select COUNT(*) FROM( Select customer_id from GATABLE3 >>>>>>>>>>> use Index( ) where category_id in (9005, 175930, 175930, >>>>>>>>>>> 175940,175945,101450, 6453) group by customer_id having >>>>>>>>>>> SUM(product_views_app) > 2 OR SUM(product_clicks_app) > 1 ) >>>>>>>>>>> - exec time = 17s >>>>>>>>>>> - *Result: 3105868* >>>>>>>>>>> - Same exec time if using AFFINITY_KEY index or >>>>>>>>>>> "_key_PK_hash or customer_id index >>>>>>>>>>> - Using an index on category_id increases the query time 33s >>>>>>>>>>> >>>>>>>>>>> Query2: >>>>>>>>>>> >>>>>>>>>>> - Select COUNT(*) FROM( Select customer_id from GATABLE3 ga >>>>>>>>>>> use index (PUBLIC."_key_PK") inner join table(category_id int = >>>>>>>>>>> (9005, >>>>>>>>>>> 175930, 175930, 175940,175945,101450, 6453)) cats on >>>>>>>>>>> cats.category_id = >>>>>>>>>>> ga.category_id group by customer_id having >>>>>>>>>>> SUM(product_views_app) > 2 OR >>>>>>>>>>> SUM(product_clicks_app) > 1 ) >>>>>>>>>>> - exec time = 38s >>>>>>>>>>> - *Result: 3113921* >>>>>>>>>>> - Same exec time if using AFFINITY_KEY index or >>>>>>>>>>> "_key_PK_hash or customer_id index or category_id index >>>>>>>>>>> - Using an index on category_id doesnt change the run time >>>>>>>>>>> >>>>>>>>>>> Query plans are attached. >>>>>>>>>>> >>>>>>>>>>> 3 questions: >>>>>>>>>>> >>>>>>>>>>> 1. Why is the result differnt for the 2 queries - this is >>>>>>>>>>> quite concerning. >>>>>>>>>>> 2. Why is the 2nd query taking longer >>>>>>>>>>> 3. Why category_id index doesn't work in case of query 2. >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> On Wed, Sep 5, 2018 at 8:31 AM Ilya Kasnacheev < >>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>> >>>>>>>>>>>> Hello! >>>>>>>>>>>> >>>>>>>>>>>> I don't think that we're able to use index with IN () clauses. >>>>>>>>>>>> Please convert it into OR clauses. >>>>>>>>>>>> >>>>>>>>>>>> Please see >>>>>>>>>>>> https://apacheignite-sql.readme.io/docs/performance-and-debugging#section-sql-performance-and-usability-considerations >>>>>>>>>>>> >>>>>>>>>>>> Regards, >>>>>>>>>>>> -- >>>>>>>>>>>> Ilya Kasnacheev >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> пн, 3 сент. 2018 г. в 12:46, Andrey Mashenkov < >>>>>>>>>>>> [email protected]>: >>>>>>>>>>>> >>>>>>>>>>>>> Hi >>>>>>>>>>>>> >>>>>>>>>>>>> Actually, first query uses index on affinity key which looks >>>>>>>>>>>>> more efficient than index on category_id column. >>>>>>>>>>>>> The first query can process groups one by one and stream >>>>>>>>>>>>> partial results from map phase to reduce phase as it use sorted >>>>>>>>>>>>> index >>>>>>>>>>>>> lookup, >>>>>>>>>>>>> while second query should process full dataset on map phase >>>>>>>>>>>>> before pass it for reducing. >>>>>>>>>>>>> >>>>>>>>>>>>> Try to use composite index (customer_id, category_id). >>>>>>>>>>>>> >>>>>>>>>>>>> Also, SqlQueryFields.setCollocated(true) flag can help Ignite >>>>>>>>>>>>> to build more efficient plan when group by on collocated column >>>>>>>>>>>>> is used. >>>>>>>>>>>>> >>>>>>>>>>>>> On Sun, Sep 2, 2018 at 2:02 AM eugene miretsky < >>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>>> Hello, >>>>>>>>>>>>>> >>>>>>>>>>>>>> Schema: >>>>>>>>>>>>>> >>>>>>>>>>>>>> - >>>>>>>>>>>>>> >>>>>>>>>>>>>> PUBLIC.GATABLE2.CUSTOMER_ID >>>>>>>>>>>>>> >>>>>>>>>>>>>> PUBLIC.GATABLE2.DT >>>>>>>>>>>>>> >>>>>>>>>>>>>> PUBLIC.GATABLE2.CATEGORY_ID >>>>>>>>>>>>>> >>>>>>>>>>>>>> PUBLIC.GATABLE2.VERTICAL_ID >>>>>>>>>>>>>> >>>>>>>>>>>>>> PUBLIC.GATABLE2.SERVICE >>>>>>>>>>>>>> >>>>>>>>>>>>>> PUBLIC.GATABLE2.PRODUCT_VIEWS_APP >>>>>>>>>>>>>> >>>>>>>>>>>>>> PUBLIC.GATABLE2.PRODUCT_CLICKS_APP >>>>>>>>>>>>>> >>>>>>>>>>>>>> PUBLIC.GATABLE2.PRODUCT_VIEWS_WEB >>>>>>>>>>>>>> >>>>>>>>>>>>>> PUBLIC.GATABLE2.PRODUCT_CLICKS_WEB >>>>>>>>>>>>>> >>>>>>>>>>>>>> PUBLIC.GATABLE2.PDP_SESSIONS_APP >>>>>>>>>>>>>> >>>>>>>>>>>>>> PUBLIC.GATABLE2.PDP_SESSIONS_WEB >>>>>>>>>>>>>> - pkey = customer_id,dt >>>>>>>>>>>>>> - affinityKey = customer >>>>>>>>>>>>>> >>>>>>>>>>>>>> Query: >>>>>>>>>>>>>> >>>>>>>>>>>>>> - select COUNT(*) FROM( Select customer_id from GATABLE2 >>>>>>>>>>>>>> where category_id in (175925, 101450, 9005, 175930, 175930, >>>>>>>>>>>>>> 175940,175945,101450, 6453) group by customer_id having >>>>>>>>>>>>>> SUM(product_views_app) > 2 OR SUM(product_clicks_app) > 1 ) >>>>>>>>>>>>>> >>>>>>>>>>>>>> The table has 600M rows. >>>>>>>>>>>>>> At first, the query took 1m, when we added an index on >>>>>>>>>>>>>> category_id the query started taking 3m. >>>>>>>>>>>>>> >>>>>>>>>>>>>> The SQL execution plan for both queries is attached. >>>>>>>>>>>>>> >>>>>>>>>>>>>> We are using a single x1.16xlarge insntace with query >>>>>>>>>>>>>> parallelism set to 32 >>>>>>>>>>>>>> >>>>>>>>>>>>>> Cheers, >>>>>>>>>>>>>> Eugene >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> -- >>>>>>>>>>>>> Best regards, >>>>>>>>>>>>> Andrey V. Mashenkov >>>>>>>>>>>>> >>>>>>>>>>>>
