Hello!
I guess that using AFFINITY_KEY as index have something to do with the fact
that GROUP BY really wants to work per-partition.
I have the following query for you:
1: jdbc:ignite:thin://localhost> explain Select count(*) FROM( Select
customer_id from (Select customer_id, product_views_app, product_clict_app
from GA_DATA ga join table(category_id int = ( 117930, 175930,
175940,175945,101450)) cats on cats.category_id = ga.category_id) data
group by customer_id having SUM(product_views_app) > 2 OR
SUM(product_clict_app) > 1);
PLAN SELECT
DATA__Z2.CUSTOMER_ID AS __C0_0,
SUM(DATA__Z2.PRODUCT_VIEWS_APP) AS __C0_1,
SUM(DATA__Z2.PRODUCT_CLICT_APP) AS __C0_2
FROM (
SELECT
GA__Z0.CUSTOMER_ID,
GA__Z0.PRODUCT_VIEWS_APP,
GA__Z0.PRODUCT_CLICT_APP
FROM TABLE(CATEGORY_ID INTEGER=(117930, 175930, 175940, 175945,
101450)) CATS__Z1
INNER JOIN PUBLIC.GA_DATA GA__Z0
ON 1=1
WHERE CATS__Z1.CATEGORY_ID = GA__Z0.CATEGORY_ID
) DATA__Z2
/* SELECT
GA__Z0.CUSTOMER_ID,
GA__Z0.PRODUCT_VIEWS_APP,
GA__Z0.PRODUCT_CLICT_APP
FROM TABLE(CATEGORY_ID INTEGER=(117930, 175930, 175940, 175945,
101450)) CATS__Z1
/++ function ++/
INNER JOIN PUBLIC.GA_DATA GA__Z0
/++ PUBLIC.GA_CATEGORY_ID: CATEGORY_ID = CATS__Z1.CATEGORY_ID ++/
ON 1=1
WHERE CATS__Z1.CATEGORY_ID = GA__Z0.CATEGORY_ID
*/
GROUP BY DATA__Z2.CUSTOMER_ID
PLAN SELECT
COUNT(*)
FROM (
SELECT
__C0_0 AS CUSTOMER_ID
FROM PUBLIC.__T0
GROUP BY __C0_0
HAVING (SUM(__C0_1) > 2)
OR (SUM(__C0_2) > 1)
) _18__Z3
/* SELECT
__C0_0 AS CUSTOMER_ID
FROM PUBLIC.__T0
/++ PUBLIC."merge_scan" ++/
GROUP BY __C0_0
HAVING (SUM(__C0_1) > 2)
OR (SUM(__C0_2) > 1)
*/
However, I'm not sure it is "optimal" or not since I have no idea if it
will perform better or worse on real data. That's why I need a subset of
data which will make query execution speed readily visible. Unfortunately,
I can't deduce that from query plan alone.
Regards,
--
Ilya Kasnacheev
пн, 24 сент. 2018 г. в 16:14, eugene miretsky <[email protected]>:
> An easy way to reproduce would be to
>
> 1. Create table
>
> CREATE TABLE GA_DATA (
> customer_id bigint,
> dt timestamp,
> category_id int,
> product_views_app int,
> product_clict_app int,
> product_clict_web int,
> product_clict_web int,
> PRIMARY KEY (customer_id, dt, category_id)
> ) WITH "template=ga_template, backups=0, affinityKey=customer_id";
>
> 2. Create indexes
>
> - CREATE INDEX ga_customer_id ON GA_Data (customer_id)
> - CREATE INDEX ga_pKey ON GA_Data (customer_id, dt, category_id)
> - CREATE INDEX ga_category_and_customer_id ON GA_Data (category_id,
> customer_id)
> - CREATE INDEX ga_category_id ON GA_Data (category_id)
>
> 3. Run Explain on the following queries while trying forcing using
> different indexes
>
> - Select count(*) FROM(
>
> Select customer_id from GA_DATA use index (ga_category_id)
> where category_id in (117930, 175930, 175940,175945,101450)
> group by customer_id having SUM(product_views_app) > 2 OR
> SUM(product_clicks_app) > 1 )
>
>
> - Select count(*) FROM(
>
> Select customer_id from GA_DATA ga use index (ga_pKey)
> join table(category_id int = ( 117930, 175930, 175940,175945,101450))
> cats on cats.category_id = ga.category_id
> group by customer_id having SUM(product_views_app) > 2 OR
> SUM(product_clicks_app) > 1
> )
>
> The execution plans will be similar to what I have posted earler. In
> particular, only on of (a) affinty key index, (b) category_id index will be
> used.
>
> On Fri, Sep 21, 2018 at 8:49 AM Ilya Kasnacheev <[email protected]>
> wrote:
>
>> Hello!
>>
>> Can you share a reproducer project which loads (or generates) data for
>> caches and then queries them? I could try and debug it if I had the
>> reproducer.
>>
>> Regards.
>> --
>> Ilya Kasnacheev
>>
>>
>> чт, 20 сент. 2018 г. в 21:05, eugene miretsky <[email protected]
>> >:
>>
>>> Thanks Ilya,
>>>
>>> Tried it, no luck. It performs the same as when using category_id index
>>> alone (slow).
>>> Any combindation I try either uses AFFINITY_KEY or category index.
>>> When it uses category index it runs slowers.
>>>
>>> Also, when AFFINITY_KEY key is used, the jobs runs on 32 threads (my
>>> query parallelism settings ) when category_id is used, the jobs runs on one
>>> thread most of the time (first few seconds it looks like more threads are
>>> doing work).
>>>
>>> Please help on this. It seems like a very simple use case (using
>>> affinity key and another index), either I am doing something extremly
>>> silly, or I stumbled on a bug in Ignite that's effecting a lot of people.
>>>
>>> Cheers,
>>> Eugene
>>>
>>> On Thu, Sep 20, 2018 at 6:22 AM Ilya Kasnacheev <
>>> [email protected]> wrote:
>>>
>>>> Hello!
>>>>
>>>> > 2) ga_customer_and_category_id: on customer_id and category_id
>>>>
>>>> Have you tried to do an index on category_id first, customer_id second?
>>>> Note that Ignite will use only one index when joining two tables and that
>>>> in your case it should start with category_id.
>>>>
>>>> You can also try adding affinity key to this index in various places,
>>>> see if it helps further.
>>>>
>>>> Regards,
>>>> --
>>>> Ilya Kasnacheev
>>>>
>>>>
>>>> ср, 19 сент. 2018 г. в 21:27, eugene miretsky <
>>>> [email protected]>:
>>>>
>>>>> Hi Ilya,
>>>>>
>>>>> I created 4 indexs on the table:
>>>>> 1) ga_pKey: on customer_id, dt, category_id (that's our primary key
>>>>> columns)
>>>>> 2) ga_customer_and_category_id: on customer_id and category_id
>>>>> 2) ga_customer_id: on customer_id
>>>>> 4) ga_category_id: on category_id
>>>>>
>>>>>
>>>>> For the first query (category in ()), the execution plan when using
>>>>> the first 3 index is exactly the same - using /* PUBLIC.AFFINITY_KEY */
>>>>> When using #4 (alone or in combination with any of the other 3)
>>>>>
>>>>> 1. /* PUBLIC.AFFINITY_KEY */ is replaced with /*
>>>>> PUBLIC.GA_CATEGORY_ID: CATEGORY_ID IN(117930, 175930, 175940, 175945,
>>>>> 101450) */
>>>>> 2. The query runs slower.
>>>>>
>>>>> For the second query (join on an inlined table) the behaviour is very
>>>>> similar. Using the first 3 indexes results in the same plan - using /*
>>>>> PUBLIC.AFFINITY_KEY */ and /* function: CATEGORY_ID = GA__Z0.CATEGORY_ID
>>>>> */.
>>>>> When using #4 (alone or in combination with any of the other 3)
>>>>>
>>>>> 1. /* function */ and /* PUBLIC.GA_CATEGORY_ID: CATEGORY_ID =
>>>>> CATS__Z1.CATEGORY_ID */ are used
>>>>> 2. The query is much slower.
>>>>>
>>>>>
>>>>> Theoretically the query seems pretty simple
>>>>>
>>>>> 1. Use affinity key to make sure the query runs in parallel and
>>>>> there are no shuffles
>>>>> 2. Filter rows that match category_id using the category_id index
>>>>> 3. Used customer_id index for the group_by (not sure if this step
>>>>> makes sense)
>>>>>
>>>>> But I cannot get it to work.
>>>>>
>>>>> Cheers,
>>>>> Eugene
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Sep 18, 2018 at 10:56 AM Ilya Kasnacheev <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Hello!
>>>>>>
>>>>>> I can see you try to use _key_PK as index. If your primary key is
>>>>>> composite, it won't work properly for you. I recommend creating an
>>>>>> explicit
>>>>>> (category_id, customer_id) index.
>>>>>>
>>>>>> Regards,
>>>>>> --
>>>>>> Ilya Kasnacheev
>>>>>>
>>>>>>
>>>>>> вт, 18 сент. 2018 г. в 17:47, eugene miretsky <
>>>>>> [email protected]>:
>>>>>>
>>>>>>> Hi Ilya,
>>>>>>>
>>>>>>> The different query result was my mistake - one of the categoy_ids
>>>>>>> was duplicate, so in the query that used join, it counted rows for that
>>>>>>> category twice. My apologies.
>>>>>>>
>>>>>>> However, we are still having an issue with query time, and the index
>>>>>>> not being applied to category_id. Would appreciate if you could take a
>>>>>>> look.
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Eugene
>>>>>>>
>>>>>>> On Mon, Sep 17, 2018 at 9:15 AM Ilya Kasnacheev <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> Hello!
>>>>>>>>
>>>>>>>> Why don't you diff the results of those two queries, tell us what
>>>>>>>> the difference is?
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> --
>>>>>>>> Ilya Kasnacheev
>>>>>>>>
>>>>>>>>
>>>>>>>> пн, 17 сент. 2018 г. в 16:08, eugene miretsky <
>>>>>>>> [email protected]>:
>>>>>>>>
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> Just wanted to see if anybody had time to look into this.
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> Eugene
>>>>>>>>>
>>>>>>>>> On Wed, Sep 12, 2018 at 6:29 PM eugene miretsky <
>>>>>>>>> [email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks!
>>>>>>>>>>
>>>>>>>>>> Tried joining with an inlined table instead of IN as per the
>>>>>>>>>> second suggestion, and it didn't quite work.
>>>>>>>>>>
>>>>>>>>>> Query1:
>>>>>>>>>>
>>>>>>>>>> - Select COUNT(*) FROM( Select customer_id from GATABLE3 use
>>>>>>>>>> Index( ) where category_id in (9005, 175930, 175930,
>>>>>>>>>> 175940,175945,101450,
>>>>>>>>>> 6453) group by customer_id having SUM(product_views_app) > 2 OR
>>>>>>>>>> SUM(product_clicks_app) > 1 )
>>>>>>>>>> - exec time = 17s
>>>>>>>>>> - *Result: 3105868*
>>>>>>>>>> - Same exec time if using AFFINITY_KEY index or "_key_PK_hash
>>>>>>>>>> or customer_id index
>>>>>>>>>> - Using an index on category_id increases the query time 33s
>>>>>>>>>>
>>>>>>>>>> Query2:
>>>>>>>>>>
>>>>>>>>>> - Select COUNT(*) FROM( Select customer_id from GATABLE3 ga
>>>>>>>>>> use index (PUBLIC."_key_PK") inner join table(category_id int =
>>>>>>>>>> (9005,
>>>>>>>>>> 175930, 175930, 175940,175945,101450, 6453)) cats on
>>>>>>>>>> cats.category_id =
>>>>>>>>>> ga.category_id group by customer_id having
>>>>>>>>>> SUM(product_views_app) > 2 OR
>>>>>>>>>> SUM(product_clicks_app) > 1 )
>>>>>>>>>> - exec time = 38s
>>>>>>>>>> - *Result: 3113921*
>>>>>>>>>> - Same exec time if using AFFINITY_KEY index or "_key_PK_hash
>>>>>>>>>> or customer_id index or category_id index
>>>>>>>>>> - Using an index on category_id doesnt change the run time
>>>>>>>>>>
>>>>>>>>>> Query plans are attached.
>>>>>>>>>>
>>>>>>>>>> 3 questions:
>>>>>>>>>>
>>>>>>>>>> 1. Why is the result differnt for the 2 queries - this is
>>>>>>>>>> quite concerning.
>>>>>>>>>> 2. Why is the 2nd query taking longer
>>>>>>>>>> 3. Why category_id index doesn't work in case of query 2.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Wed, Sep 5, 2018 at 8:31 AM Ilya Kasnacheev <
>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hello!
>>>>>>>>>>>
>>>>>>>>>>> I don't think that we're able to use index with IN () clauses.
>>>>>>>>>>> Please convert it into OR clauses.
>>>>>>>>>>>
>>>>>>>>>>> Please see
>>>>>>>>>>> https://apacheignite-sql.readme.io/docs/performance-and-debugging#section-sql-performance-and-usability-considerations
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> --
>>>>>>>>>>> Ilya Kasnacheev
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> пн, 3 сент. 2018 г. в 12:46, Andrey Mashenkov <
>>>>>>>>>>> [email protected]>:
>>>>>>>>>>>
>>>>>>>>>>>> Hi
>>>>>>>>>>>>
>>>>>>>>>>>> Actually, first query uses index on affinity key which looks
>>>>>>>>>>>> more efficient than index on category_id column.
>>>>>>>>>>>> The first query can process groups one by one and stream
>>>>>>>>>>>> partial results from map phase to reduce phase as it use sorted
>>>>>>>>>>>> index
>>>>>>>>>>>> lookup,
>>>>>>>>>>>> while second query should process full dataset on map phase
>>>>>>>>>>>> before pass it for reducing.
>>>>>>>>>>>>
>>>>>>>>>>>> Try to use composite index (customer_id, category_id).
>>>>>>>>>>>>
>>>>>>>>>>>> Also, SqlQueryFields.setCollocated(true) flag can help Ignite
>>>>>>>>>>>> to build more efficient plan when group by on collocated column is
>>>>>>>>>>>> used.
>>>>>>>>>>>>
>>>>>>>>>>>> On Sun, Sep 2, 2018 at 2:02 AM eugene miretsky <
>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Schema:
>>>>>>>>>>>>>
>>>>>>>>>>>>> -
>>>>>>>>>>>>>
>>>>>>>>>>>>> PUBLIC.GATABLE2.CUSTOMER_ID
>>>>>>>>>>>>>
>>>>>>>>>>>>> PUBLIC.GATABLE2.DT
>>>>>>>>>>>>>
>>>>>>>>>>>>> PUBLIC.GATABLE2.CATEGORY_ID
>>>>>>>>>>>>>
>>>>>>>>>>>>> PUBLIC.GATABLE2.VERTICAL_ID
>>>>>>>>>>>>>
>>>>>>>>>>>>> PUBLIC.GATABLE2.SERVICE
>>>>>>>>>>>>>
>>>>>>>>>>>>> PUBLIC.GATABLE2.PRODUCT_VIEWS_APP
>>>>>>>>>>>>>
>>>>>>>>>>>>> PUBLIC.GATABLE2.PRODUCT_CLICKS_APP
>>>>>>>>>>>>>
>>>>>>>>>>>>> PUBLIC.GATABLE2.PRODUCT_VIEWS_WEB
>>>>>>>>>>>>>
>>>>>>>>>>>>> PUBLIC.GATABLE2.PRODUCT_CLICKS_WEB
>>>>>>>>>>>>>
>>>>>>>>>>>>> PUBLIC.GATABLE2.PDP_SESSIONS_APP
>>>>>>>>>>>>>
>>>>>>>>>>>>> PUBLIC.GATABLE2.PDP_SESSIONS_WEB
>>>>>>>>>>>>> - pkey = customer_id,dt
>>>>>>>>>>>>> - affinityKey = customer
>>>>>>>>>>>>>
>>>>>>>>>>>>> Query:
>>>>>>>>>>>>>
>>>>>>>>>>>>> - select COUNT(*) FROM( Select customer_id from GATABLE2
>>>>>>>>>>>>> where category_id in (175925, 101450, 9005, 175930, 175930,
>>>>>>>>>>>>> 175940,175945,101450, 6453) group by customer_id having
>>>>>>>>>>>>> SUM(product_views_app) > 2 OR SUM(product_clicks_app) > 1 )
>>>>>>>>>>>>>
>>>>>>>>>>>>> The table has 600M rows.
>>>>>>>>>>>>> At first, the query took 1m, when we added an index on
>>>>>>>>>>>>> category_id the query started taking 3m.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The SQL execution plan for both queries is attached.
>>>>>>>>>>>>>
>>>>>>>>>>>>> We are using a single x1.16xlarge insntace with query
>>>>>>>>>>>>> parallelism set to 32
>>>>>>>>>>>>>
>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>> Eugene
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Best regards,
>>>>>>>>>>>> Andrey V. Mashenkov
>>>>>>>>>>>>
>>>>>>>>>>>