Hello!

> 2) ga_customer_and_category_id: on customer_id and category_id

Have you tried to do an index on category_id first, customer_id second?
Note that Ignite will use only one index when joining two tables and that
in your case it should start with category_id.

You can also try adding affinity key to this index in various places, see
if it helps further.

Regards,
-- 
Ilya Kasnacheev


ср, 19 сент. 2018 г. в 21:27, eugene miretsky <[email protected]>:

> Hi Ilya,
>
> I created 4 indexs on the table:
> 1) ga_pKey: on customer_id, dt, category_id (that's our primary key
> columns)
> 2) ga_customer_and_category_id: on customer_id and category_id
> 2) ga_customer_id: on customer_id
> 4) ga_category_id: on category_id
>
>
> For the first query (category in ()), the execution plan when using the
> first 3 index is exactly the same  - using /* PUBLIC.AFFINITY_KEY */
> When using #4 (alone or in combination with any of the other 3)
>
>    1. /* PUBLIC.AFFINITY_KEY */ is replaced with  /*
>    PUBLIC.GA_CATEGORY_ID: CATEGORY_ID IN(117930, 175930, 175940, 175945,
>    101450) */
>    2. The query runs slower.
>
> For the second query (join on an inlined table) the behaviour is very
> similar. Using the first 3 indexes results in the same plan - using  /*
> PUBLIC.AFFINITY_KEY */ and  /* function: CATEGORY_ID = GA__Z0.CATEGORY_ID
> */.
> When using #4 (alone or in combination with any of the other 3)
>
>    1. /* function */ and /* PUBLIC.GA_CATEGORY_ID: CATEGORY_ID =
>    CATS__Z1.CATEGORY_ID */ are used
>    2. The query is much slower.
>
>
> Theoretically the query seems pretty simple
>
>    1. Use affinity key  to make sure the query runs in parallel and there
>    are no shuffles
>    2. Filter rows that match category_id using the category_id index
>    3. Used customer_id index for the group_by (not sure if this step
>    makes sense)
>
> But I cannot get it to work.
>
> Cheers,
> Eugene
>
>
>
>
> On Tue, Sep 18, 2018 at 10:56 AM Ilya Kasnacheev <
> [email protected]> wrote:
>
>> Hello!
>>
>> I can see you try to use _key_PK as index. If your primary key is
>> composite, it won't work properly for you. I recommend creating an explicit
>> (category_id, customer_id) index.
>>
>> Regards,
>> --
>> Ilya Kasnacheev
>>
>>
>> вт, 18 сент. 2018 г. в 17:47, eugene miretsky <[email protected]
>> >:
>>
>>> Hi Ilya,
>>>
>>> The different query result was my mistake - one of the categoy_ids was
>>> duplicate, so in the query that used join, it counted rows for that
>>> category twice. My apologies.
>>>
>>> However, we are still having an issue with query time, and the index not
>>> being applied to category_id. Would appreciate if you could take a look.
>>>
>>> Cheers,
>>> Eugene
>>>
>>> On Mon, Sep 17, 2018 at 9:15 AM Ilya Kasnacheev <
>>> [email protected]> wrote:
>>>
>>>> Hello!
>>>>
>>>> Why don't you diff the results of those two queries, tell us what the
>>>> difference is?
>>>>
>>>> Regards,
>>>> --
>>>> Ilya Kasnacheev
>>>>
>>>>
>>>> пн, 17 сент. 2018 г. в 16:08, eugene miretsky <
>>>> [email protected]>:
>>>>
>>>>> Hello,
>>>>>
>>>>> Just wanted to see if anybody had time to look into this.
>>>>>
>>>>> Cheers,
>>>>> Eugene
>>>>>
>>>>> On Wed, Sep 12, 2018 at 6:29 PM eugene miretsky <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>> Tried joining with an inlined table instead of IN as per the second
>>>>>> suggestion, and it didn't quite work.
>>>>>>
>>>>>> Query1:
>>>>>>
>>>>>>    - Select COUNT(*) FROM( Select customer_id from GATABLE3  use
>>>>>>    Index( ) where category_id in (9005, 175930, 175930, 
>>>>>> 175940,175945,101450,
>>>>>>    6453) group by customer_id having SUM(product_views_app) > 2 OR
>>>>>>    SUM(product_clicks_app) > 1 )
>>>>>>    - exec time = 17s
>>>>>>    - *Result: 3105868*
>>>>>>    - Same exec time if using AFFINITY_KEY index or "_key_PK_hash or
>>>>>>    customer_id index
>>>>>>    - Using an index on category_id increases the query time 33s
>>>>>>
>>>>>> Query2:
>>>>>>
>>>>>>    - Select COUNT(*) FROM( Select customer_id from GATABLE3 ga  use
>>>>>>    index (PUBLIC."_key_PK") inner join table(category_id int = (9005, 
>>>>>> 175930,
>>>>>>    175930, 175940,175945,101450, 6453)) cats on cats.category_id =
>>>>>>    ga.category_id   group by customer_id having SUM(product_views_app) > 
>>>>>> 2 OR
>>>>>>    SUM(product_clicks_app) > 1 )
>>>>>>    - exec time = 38s
>>>>>>    - *Result: 3113921*
>>>>>>    - Same exec time if using AFFINITY_KEY index or "_key_PK_hash or
>>>>>>    customer_id index or category_id index
>>>>>>    - Using an index on category_id doesnt change the run time
>>>>>>
>>>>>> Query plans are attached.
>>>>>>
>>>>>> 3 questions:
>>>>>>
>>>>>>    1. Why is the result differnt for the 2 queries - this is quite
>>>>>>    concerning.
>>>>>>    2. Why is the 2nd query taking longer
>>>>>>    3. Why  category_id index doesn't work in case of query 2.
>>>>>>
>>>>>>
>>>>>> On Wed, Sep 5, 2018 at 8:31 AM Ilya Kasnacheev <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> Hello!
>>>>>>>
>>>>>>> I don't think that we're able to use index with IN () clauses.
>>>>>>> Please convert it into OR clauses.
>>>>>>>
>>>>>>> Please see
>>>>>>> https://apacheignite-sql.readme.io/docs/performance-and-debugging#section-sql-performance-and-usability-considerations
>>>>>>>
>>>>>>> Regards,
>>>>>>> --
>>>>>>> Ilya Kasnacheev
>>>>>>>
>>>>>>>
>>>>>>> пн, 3 сент. 2018 г. в 12:46, Andrey Mashenkov <
>>>>>>> [email protected]>:
>>>>>>>
>>>>>>>> Hi
>>>>>>>>
>>>>>>>> Actually, first query uses index on affinity key which looks more
>>>>>>>> efficient than index on category_id column.
>>>>>>>> The first query can process groups one by one and stream partial
>>>>>>>> results from map phase to reduce phase as it use sorted index lookup,
>>>>>>>> while second query should process full dataset on map phase before
>>>>>>>> pass it for reducing.
>>>>>>>>
>>>>>>>> Try to use composite index (customer_id, category_id).
>>>>>>>>
>>>>>>>> Also, SqlQueryFields.setCollocated(true) flag can help Ignite to
>>>>>>>> build more efficient plan when group by on collocated column is used.
>>>>>>>>
>>>>>>>> On Sun, Sep 2, 2018 at 2:02 AM eugene miretsky <
>>>>>>>> [email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> Schema:
>>>>>>>>>
>>>>>>>>>    -
>>>>>>>>>
>>>>>>>>>    PUBLIC.GATABLE2.CUSTOMER_ID
>>>>>>>>>
>>>>>>>>>    PUBLIC.GATABLE2.DT
>>>>>>>>>
>>>>>>>>>    PUBLIC.GATABLE2.CATEGORY_ID
>>>>>>>>>
>>>>>>>>>    PUBLIC.GATABLE2.VERTICAL_ID
>>>>>>>>>
>>>>>>>>>    PUBLIC.GATABLE2.SERVICE
>>>>>>>>>
>>>>>>>>>    PUBLIC.GATABLE2.PRODUCT_VIEWS_APP
>>>>>>>>>
>>>>>>>>>    PUBLIC.GATABLE2.PRODUCT_CLICKS_APP
>>>>>>>>>
>>>>>>>>>    PUBLIC.GATABLE2.PRODUCT_VIEWS_WEB
>>>>>>>>>
>>>>>>>>>    PUBLIC.GATABLE2.PRODUCT_CLICKS_WEB
>>>>>>>>>
>>>>>>>>>    PUBLIC.GATABLE2.PDP_SESSIONS_APP
>>>>>>>>>
>>>>>>>>>    PUBLIC.GATABLE2.PDP_SESSIONS_WEB
>>>>>>>>>    - pkey = customer_id,dt
>>>>>>>>>    - affinityKey = customer
>>>>>>>>>
>>>>>>>>> Query:
>>>>>>>>>
>>>>>>>>>    - select COUNT(*) FROM( Select customer_id from GATABLE2 where
>>>>>>>>>    category_id in (175925, 101450, 9005, 175930, 175930, 
>>>>>>>>> 175940,175945,101450,
>>>>>>>>>    6453) group by customer_id having SUM(product_views_app) > 2 OR
>>>>>>>>>    SUM(product_clicks_app) > 1 )
>>>>>>>>>
>>>>>>>>> The table has 600M rows.
>>>>>>>>> At first, the query took 1m, when we added an index on category_id
>>>>>>>>> the query started taking 3m.
>>>>>>>>>
>>>>>>>>> The SQL execution plan for both queries is attached.
>>>>>>>>>
>>>>>>>>> We are using a single x1.16xlarge insntace with query parallelism
>>>>>>>>> set to 32
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> Eugene
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Best regards,
>>>>>>>> Andrey V. Mashenkov
>>>>>>>>
>>>>>>>

Reply via email to