An easy way to reproduce would be to
1. Create table
CREATE TABLE GA_DATA (
customer_id bigint,
dt timestamp,
category_id int,
product_views_app int,
product_clict_app int,
product_clict_web int,
product_clict_web int,
PRIMARY KEY (customer_id, dt, category_id)
) WITH "template=ga_template, backups=0, affinityKey=customer_id";
2. Create indexes
- CREATE INDEX ga_customer_id ON GA_Data (customer_id)
- CREATE INDEX ga_pKey ON GA_Data (customer_id, dt, category_id)
- CREATE INDEX ga_category_and_customer_id ON GA_Data (category_id,
customer_id)
- CREATE INDEX ga_category_id ON GA_Data (category_id)
3. Run Explain on the following queries while trying forcing using
different indexes
- Select count(*) FROM(
Select customer_id from GA_DATA use index (ga_category_id)
where category_id in (117930, 175930, 175940,175945,101450)
group by customer_id having SUM(product_views_app) > 2 OR
SUM(product_clicks_app) > 1 )
- Select count(*) FROM(
Select customer_id from GA_DATA ga use index (ga_pKey)
join table(category_id int = ( 117930, 175930, 175940,175945,101450))
cats on cats.category_id = ga.category_id
group by customer_id having SUM(product_views_app) > 2 OR
SUM(product_clicks_app) > 1
)
The execution plans will be similar to what I have posted earler. In
particular, only on of (a) affinty key index, (b) category_id index will be
used.
On Fri, Sep 21, 2018 at 8:49 AM Ilya Kasnacheev <[email protected]>
wrote:
> Hello!
>
> Can you share a reproducer project which loads (or generates) data for
> caches and then queries them? I could try and debug it if I had the
> reproducer.
>
> Regards.
> --
> Ilya Kasnacheev
>
>
> чт, 20 сент. 2018 г. в 21:05, eugene miretsky <[email protected]>:
>
>> Thanks Ilya,
>>
>> Tried it, no luck. It performs the same as when using category_id index
>> alone (slow).
>> Any combindation I try either uses AFFINITY_KEY or category index. When
>> it uses category index it runs slowers.
>>
>> Also, when AFFINITY_KEY key is used, the jobs runs on 32 threads (my
>> query parallelism settings ) when category_id is used, the jobs runs on one
>> thread most of the time (first few seconds it looks like more threads are
>> doing work).
>>
>> Please help on this. It seems like a very simple use case (using affinity
>> key and another index), either I am doing something extremly silly, or I
>> stumbled on a bug in Ignite that's effecting a lot of people.
>>
>> Cheers,
>> Eugene
>>
>> On Thu, Sep 20, 2018 at 6:22 AM Ilya Kasnacheev <
>> [email protected]> wrote:
>>
>>> Hello!
>>>
>>> > 2) ga_customer_and_category_id: on customer_id and category_id
>>>
>>> Have you tried to do an index on category_id first, customer_id second?
>>> Note that Ignite will use only one index when joining two tables and that
>>> in your case it should start with category_id.
>>>
>>> You can also try adding affinity key to this index in various places,
>>> see if it helps further.
>>>
>>> Regards,
>>> --
>>> Ilya Kasnacheev
>>>
>>>
>>> ср, 19 сент. 2018 г. в 21:27, eugene miretsky <[email protected]
>>> >:
>>>
>>>> Hi Ilya,
>>>>
>>>> I created 4 indexs on the table:
>>>> 1) ga_pKey: on customer_id, dt, category_id (that's our primary key
>>>> columns)
>>>> 2) ga_customer_and_category_id: on customer_id and category_id
>>>> 2) ga_customer_id: on customer_id
>>>> 4) ga_category_id: on category_id
>>>>
>>>>
>>>> For the first query (category in ()), the execution plan when using the
>>>> first 3 index is exactly the same - using /* PUBLIC.AFFINITY_KEY */
>>>> When using #4 (alone or in combination with any of the other 3)
>>>>
>>>> 1. /* PUBLIC.AFFINITY_KEY */ is replaced with /*
>>>> PUBLIC.GA_CATEGORY_ID: CATEGORY_ID IN(117930, 175930, 175940, 175945,
>>>> 101450) */
>>>> 2. The query runs slower.
>>>>
>>>> For the second query (join on an inlined table) the behaviour is very
>>>> similar. Using the first 3 indexes results in the same plan - using /*
>>>> PUBLIC.AFFINITY_KEY */ and /* function: CATEGORY_ID = GA__Z0.CATEGORY_ID
>>>> */.
>>>> When using #4 (alone or in combination with any of the other 3)
>>>>
>>>> 1. /* function */ and /* PUBLIC.GA_CATEGORY_ID: CATEGORY_ID =
>>>> CATS__Z1.CATEGORY_ID */ are used
>>>> 2. The query is much slower.
>>>>
>>>>
>>>> Theoretically the query seems pretty simple
>>>>
>>>> 1. Use affinity key to make sure the query runs in parallel and
>>>> there are no shuffles
>>>> 2. Filter rows that match category_id using the category_id index
>>>> 3. Used customer_id index for the group_by (not sure if this step
>>>> makes sense)
>>>>
>>>> But I cannot get it to work.
>>>>
>>>> Cheers,
>>>> Eugene
>>>>
>>>>
>>>>
>>>>
>>>> On Tue, Sep 18, 2018 at 10:56 AM Ilya Kasnacheev <
>>>> [email protected]> wrote:
>>>>
>>>>> Hello!
>>>>>
>>>>> I can see you try to use _key_PK as index. If your primary key is
>>>>> composite, it won't work properly for you. I recommend creating an
>>>>> explicit
>>>>> (category_id, customer_id) index.
>>>>>
>>>>> Regards,
>>>>> --
>>>>> Ilya Kasnacheev
>>>>>
>>>>>
>>>>> вт, 18 сент. 2018 г. в 17:47, eugene miretsky <
>>>>> [email protected]>:
>>>>>
>>>>>> Hi Ilya,
>>>>>>
>>>>>> The different query result was my mistake - one of the categoy_ids
>>>>>> was duplicate, so in the query that used join, it counted rows for that
>>>>>> category twice. My apologies.
>>>>>>
>>>>>> However, we are still having an issue with query time, and the index
>>>>>> not being applied to category_id. Would appreciate if you could take a
>>>>>> look.
>>>>>>
>>>>>> Cheers,
>>>>>> Eugene
>>>>>>
>>>>>> On Mon, Sep 17, 2018 at 9:15 AM Ilya Kasnacheev <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> Hello!
>>>>>>>
>>>>>>> Why don't you diff the results of those two queries, tell us what
>>>>>>> the difference is?
>>>>>>>
>>>>>>> Regards,
>>>>>>> --
>>>>>>> Ilya Kasnacheev
>>>>>>>
>>>>>>>
>>>>>>> пн, 17 сент. 2018 г. в 16:08, eugene miretsky <
>>>>>>> [email protected]>:
>>>>>>>
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> Just wanted to see if anybody had time to look into this.
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Eugene
>>>>>>>>
>>>>>>>> On Wed, Sep 12, 2018 at 6:29 PM eugene miretsky <
>>>>>>>> [email protected]> wrote:
>>>>>>>>
>>>>>>>>> Thanks!
>>>>>>>>>
>>>>>>>>> Tried joining with an inlined table instead of IN as per the
>>>>>>>>> second suggestion, and it didn't quite work.
>>>>>>>>>
>>>>>>>>> Query1:
>>>>>>>>>
>>>>>>>>> - Select COUNT(*) FROM( Select customer_id from GATABLE3 use
>>>>>>>>> Index( ) where category_id in (9005, 175930, 175930,
>>>>>>>>> 175940,175945,101450,
>>>>>>>>> 6453) group by customer_id having SUM(product_views_app) > 2 OR
>>>>>>>>> SUM(product_clicks_app) > 1 )
>>>>>>>>> - exec time = 17s
>>>>>>>>> - *Result: 3105868*
>>>>>>>>> - Same exec time if using AFFINITY_KEY index or "_key_PK_hash
>>>>>>>>> or customer_id index
>>>>>>>>> - Using an index on category_id increases the query time 33s
>>>>>>>>>
>>>>>>>>> Query2:
>>>>>>>>>
>>>>>>>>> - Select COUNT(*) FROM( Select customer_id from GATABLE3 ga
>>>>>>>>> use index (PUBLIC."_key_PK") inner join table(category_id int =
>>>>>>>>> (9005,
>>>>>>>>> 175930, 175930, 175940,175945,101450, 6453)) cats on
>>>>>>>>> cats.category_id =
>>>>>>>>> ga.category_id group by customer_id having
>>>>>>>>> SUM(product_views_app) > 2 OR
>>>>>>>>> SUM(product_clicks_app) > 1 )
>>>>>>>>> - exec time = 38s
>>>>>>>>> - *Result: 3113921*
>>>>>>>>> - Same exec time if using AFFINITY_KEY index or "_key_PK_hash
>>>>>>>>> or customer_id index or category_id index
>>>>>>>>> - Using an index on category_id doesnt change the run time
>>>>>>>>>
>>>>>>>>> Query plans are attached.
>>>>>>>>>
>>>>>>>>> 3 questions:
>>>>>>>>>
>>>>>>>>> 1. Why is the result differnt for the 2 queries - this is
>>>>>>>>> quite concerning.
>>>>>>>>> 2. Why is the 2nd query taking longer
>>>>>>>>> 3. Why category_id index doesn't work in case of query 2.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Sep 5, 2018 at 8:31 AM Ilya Kasnacheev <
>>>>>>>>> [email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hello!
>>>>>>>>>>
>>>>>>>>>> I don't think that we're able to use index with IN () clauses.
>>>>>>>>>> Please convert it into OR clauses.
>>>>>>>>>>
>>>>>>>>>> Please see
>>>>>>>>>> https://apacheignite-sql.readme.io/docs/performance-and-debugging#section-sql-performance-and-usability-considerations
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> --
>>>>>>>>>> Ilya Kasnacheev
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> пн, 3 сент. 2018 г. в 12:46, Andrey Mashenkov <
>>>>>>>>>> [email protected]>:
>>>>>>>>>>
>>>>>>>>>>> Hi
>>>>>>>>>>>
>>>>>>>>>>> Actually, first query uses index on affinity key which looks
>>>>>>>>>>> more efficient than index on category_id column.
>>>>>>>>>>> The first query can process groups one by one and stream partial
>>>>>>>>>>> results from map phase to reduce phase as it use sorted index
>>>>>>>>>>> lookup,
>>>>>>>>>>> while second query should process full dataset on map phase
>>>>>>>>>>> before pass it for reducing.
>>>>>>>>>>>
>>>>>>>>>>> Try to use composite index (customer_id, category_id).
>>>>>>>>>>>
>>>>>>>>>>> Also, SqlQueryFields.setCollocated(true) flag can help Ignite to
>>>>>>>>>>> build more efficient plan when group by on collocated column is
>>>>>>>>>>> used.
>>>>>>>>>>>
>>>>>>>>>>> On Sun, Sep 2, 2018 at 2:02 AM eugene miretsky <
>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hello,
>>>>>>>>>>>>
>>>>>>>>>>>> Schema:
>>>>>>>>>>>>
>>>>>>>>>>>> -
>>>>>>>>>>>>
>>>>>>>>>>>> PUBLIC.GATABLE2.CUSTOMER_ID
>>>>>>>>>>>>
>>>>>>>>>>>> PUBLIC.GATABLE2.DT
>>>>>>>>>>>>
>>>>>>>>>>>> PUBLIC.GATABLE2.CATEGORY_ID
>>>>>>>>>>>>
>>>>>>>>>>>> PUBLIC.GATABLE2.VERTICAL_ID
>>>>>>>>>>>>
>>>>>>>>>>>> PUBLIC.GATABLE2.SERVICE
>>>>>>>>>>>>
>>>>>>>>>>>> PUBLIC.GATABLE2.PRODUCT_VIEWS_APP
>>>>>>>>>>>>
>>>>>>>>>>>> PUBLIC.GATABLE2.PRODUCT_CLICKS_APP
>>>>>>>>>>>>
>>>>>>>>>>>> PUBLIC.GATABLE2.PRODUCT_VIEWS_WEB
>>>>>>>>>>>>
>>>>>>>>>>>> PUBLIC.GATABLE2.PRODUCT_CLICKS_WEB
>>>>>>>>>>>>
>>>>>>>>>>>> PUBLIC.GATABLE2.PDP_SESSIONS_APP
>>>>>>>>>>>>
>>>>>>>>>>>> PUBLIC.GATABLE2.PDP_SESSIONS_WEB
>>>>>>>>>>>> - pkey = customer_id,dt
>>>>>>>>>>>> - affinityKey = customer
>>>>>>>>>>>>
>>>>>>>>>>>> Query:
>>>>>>>>>>>>
>>>>>>>>>>>> - select COUNT(*) FROM( Select customer_id from GATABLE2
>>>>>>>>>>>> where category_id in (175925, 101450, 9005, 175930, 175930,
>>>>>>>>>>>> 175940,175945,101450, 6453) group by customer_id having
>>>>>>>>>>>> SUM(product_views_app) > 2 OR SUM(product_clicks_app) > 1 )
>>>>>>>>>>>>
>>>>>>>>>>>> The table has 600M rows.
>>>>>>>>>>>> At first, the query took 1m, when we added an index on
>>>>>>>>>>>> category_id the query started taking 3m.
>>>>>>>>>>>>
>>>>>>>>>>>> The SQL execution plan for both queries is attached.
>>>>>>>>>>>>
>>>>>>>>>>>> We are using a single x1.16xlarge insntace with query
>>>>>>>>>>>> parallelism set to 32
>>>>>>>>>>>>
>>>>>>>>>>>> Cheers,
>>>>>>>>>>>> Eugene
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Best regards,
>>>>>>>>>>> Andrey V. Mashenkov
>>>>>>>>>>>
>>>>>>>>>>