Re: Query 3x slower with index

Ilya Kasnacheev Fri, 21 Sep 2018 05:50:30 -0700

Hello!

Can you share a reproducer project which loads (or generates) data for
caches and then queries them? I could try and debug it if I had the
reproducer.


Regards.
-- 
Ilya Kasnacheev


чт, 20 сент. 2018 г. в 21:05, eugene miretsky <[email protected]>:

> Thanks Ilya,
>
> Tried it, no luck. It performs the same as when using category_id index
> alone (slow).
>   Any combindation I try either uses AFFINITY_KEY or category index. When
> it uses category index it runs slowers.
>
> Also, when AFFINITY_KEY key is used, the jobs runs on 32 threads (my query
> parallelism settings ) when category_id is used, the jobs runs on one
> thread most of the time (first few seconds it looks like more threads are
> doing work).
>
> Please help on this. It seems like a very simple use case (using affinity
> key and another index), either I am doing something extremly silly, or I
> stumbled on a bug in Ignite that's effecting a lot of people.
>
> Cheers,
> Eugene
>
> On Thu, Sep 20, 2018 at 6:22 AM Ilya Kasnacheev <[email protected]>
> wrote:
>
>> Hello!
>>
>> > 2) ga_customer_and_category_id: on customer_id and category_id
>>
>> Have you tried to do an index on category_id first, customer_id second?
>> Note that Ignite will use only one index when joining two tables and that
>> in your case it should start with category_id.
>>
>> You can also try adding affinity key to this index in various places, see
>> if it helps further.
>>
>> Regards,
>> --
>> Ilya Kasnacheev
>>
>>
>> ср, 19 сент. 2018 г. в 21:27, eugene miretsky <[email protected]
>> >:
>>
>>> Hi Ilya,
>>>
>>> I created 4 indexs on the table:
>>> 1) ga_pKey: on customer_id, dt, category_id (that's our primary key
>>> columns)
>>> 2) ga_customer_and_category_id: on customer_id and category_id
>>> 2) ga_customer_id: on customer_id
>>> 4) ga_category_id: on category_id
>>>
>>>
>>> For the first query (category in ()), the execution plan when using the
>>> first 3 index is exactly the same  - using /* PUBLIC.AFFINITY_KEY */
>>> When using #4 (alone or in combination with any of the other 3)
>>>
>>>    1. /* PUBLIC.AFFINITY_KEY */ is replaced with  /*
>>>    PUBLIC.GA_CATEGORY_ID: CATEGORY_ID IN(117930, 175930, 175940, 175945,
>>>    101450) */
>>>    2. The query runs slower.
>>>
>>> For the second query (join on an inlined table) the behaviour is very
>>> similar. Using the first 3 indexes results in the same plan - using  /*
>>> PUBLIC.AFFINITY_KEY */ and  /* function: CATEGORY_ID = GA__Z0.CATEGORY_ID
>>> */.
>>> When using #4 (alone or in combination with any of the other 3)
>>>
>>>    1. /* function */ and /* PUBLIC.GA_CATEGORY_ID: CATEGORY_ID =
>>>    CATS__Z1.CATEGORY_ID */ are used
>>>    2. The query is much slower.
>>>
>>>
>>> Theoretically the query seems pretty simple
>>>
>>>    1. Use affinity key  to make sure the query runs in parallel and
>>>    there are no shuffles
>>>    2. Filter rows that match category_id using the category_id index
>>>    3. Used customer_id index for the group_by (not sure if this step
>>>    makes sense)
>>>
>>> But I cannot get it to work.
>>>
>>> Cheers,
>>> Eugene
>>>
>>>
>>>
>>>
>>> On Tue, Sep 18, 2018 at 10:56 AM Ilya Kasnacheev <
>>> [email protected]> wrote:
>>>
>>>> Hello!
>>>>
>>>> I can see you try to use _key_PK as index. If your primary key is
>>>> composite, it won't work properly for you. I recommend creating an explicit
>>>> (category_id, customer_id) index.
>>>>
>>>> Regards,
>>>> --
>>>> Ilya Kasnacheev
>>>>
>>>>
>>>> вт, 18 сент. 2018 г. в 17:47, eugene miretsky <
>>>> [email protected]>:
>>>>
>>>>> Hi Ilya,
>>>>>
>>>>> The different query result was my mistake - one of the categoy_ids was
>>>>> duplicate, so in the query that used join, it counted rows for that
>>>>> category twice. My apologies.
>>>>>
>>>>> However, we are still having an issue with query time, and the index
>>>>> not being applied to category_id. Would appreciate if you could take a
>>>>> look.
>>>>>
>>>>> Cheers,
>>>>> Eugene
>>>>>
>>>>> On Mon, Sep 17, 2018 at 9:15 AM Ilya Kasnacheev <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Hello!
>>>>>>
>>>>>> Why don't you diff the results of those two queries, tell us what the
>>>>>> difference is?
>>>>>>
>>>>>> Regards,
>>>>>> --
>>>>>> Ilya Kasnacheev
>>>>>>
>>>>>>
>>>>>> пн, 17 сент. 2018 г. в 16:08, eugene miretsky <
>>>>>> [email protected]>:
>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> Just wanted to see if anybody had time to look into this.
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Eugene
>>>>>>>
>>>>>>> On Wed, Sep 12, 2018 at 6:29 PM eugene miretsky <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> Thanks!
>>>>>>>>
>>>>>>>> Tried joining with an inlined table instead of IN as per the second
>>>>>>>> suggestion, and it didn't quite work.
>>>>>>>>
>>>>>>>> Query1:
>>>>>>>>
>>>>>>>>    - Select COUNT(*) FROM( Select customer_id from GATABLE3  use
>>>>>>>>    Index( ) where category_id in (9005, 175930, 175930, 
>>>>>>>> 175940,175945,101450,
>>>>>>>>    6453) group by customer_id having SUM(product_views_app) > 2 OR
>>>>>>>>    SUM(product_clicks_app) > 1 )
>>>>>>>>    - exec time = 17s
>>>>>>>>    - *Result: 3105868*
>>>>>>>>    - Same exec time if using AFFINITY_KEY index or "_key_PK_hash
>>>>>>>>    or customer_id index
>>>>>>>>    - Using an index on category_id increases the query time 33s
>>>>>>>>
>>>>>>>> Query2:
>>>>>>>>
>>>>>>>>    - Select COUNT(*) FROM( Select customer_id from GATABLE3 ga
>>>>>>>>    use index (PUBLIC."_key_PK") inner join table(category_id int = 
>>>>>>>> (9005,
>>>>>>>>    175930, 175930, 175940,175945,101450, 6453)) cats on 
>>>>>>>> cats.category_id =
>>>>>>>>    ga.category_id   group by customer_id having SUM(product_views_app) 
>>>>>>>> > 2 OR
>>>>>>>>    SUM(product_clicks_app) > 1 )
>>>>>>>>    - exec time = 38s
>>>>>>>>    - *Result: 3113921*
>>>>>>>>    - Same exec time if using AFFINITY_KEY index or "_key_PK_hash
>>>>>>>>    or customer_id index or category_id index
>>>>>>>>    - Using an index on category_id doesnt change the run time
>>>>>>>>
>>>>>>>> Query plans are attached.
>>>>>>>>
>>>>>>>> 3 questions:
>>>>>>>>
>>>>>>>>    1. Why is the result differnt for the 2 queries - this is quite
>>>>>>>>    concerning.
>>>>>>>>    2. Why is the 2nd query taking longer
>>>>>>>>    3. Why  category_id index doesn't work in case of query 2.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Sep 5, 2018 at 8:31 AM Ilya Kasnacheev <
>>>>>>>> [email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hello!
>>>>>>>>>
>>>>>>>>> I don't think that we're able to use index with IN () clauses.
>>>>>>>>> Please convert it into OR clauses.
>>>>>>>>>
>>>>>>>>> Please see
>>>>>>>>> https://apacheignite-sql.readme.io/docs/performance-and-debugging#section-sql-performance-and-usability-considerations
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> --
>>>>>>>>> Ilya Kasnacheev
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> пн, 3 сент. 2018 г. в 12:46, Andrey Mashenkov <
>>>>>>>>> [email protected]>:
>>>>>>>>>
>>>>>>>>>> Hi
>>>>>>>>>>
>>>>>>>>>> Actually, first query uses index on affinity key which looks more
>>>>>>>>>> efficient than index on category_id column.
>>>>>>>>>> The first query can process groups one by one and stream partial
>>>>>>>>>> results from map phase to reduce phase as it use sorted index lookup,
>>>>>>>>>> while second query should process full dataset on map phase
>>>>>>>>>> before pass it for reducing.
>>>>>>>>>>
>>>>>>>>>> Try to use composite index (customer_id, category_id).
>>>>>>>>>>
>>>>>>>>>> Also, SqlQueryFields.setCollocated(true) flag can help Ignite to
>>>>>>>>>> build more efficient plan when group by on collocated column is used.
>>>>>>>>>>
>>>>>>>>>> On Sun, Sep 2, 2018 at 2:02 AM eugene miretsky <
>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hello,
>>>>>>>>>>>
>>>>>>>>>>> Schema:
>>>>>>>>>>>
>>>>>>>>>>>    -
>>>>>>>>>>>
>>>>>>>>>>>    PUBLIC.GATABLE2.CUSTOMER_ID
>>>>>>>>>>>
>>>>>>>>>>>    PUBLIC.GATABLE2.DT
>>>>>>>>>>>
>>>>>>>>>>>    PUBLIC.GATABLE2.CATEGORY_ID
>>>>>>>>>>>
>>>>>>>>>>>    PUBLIC.GATABLE2.VERTICAL_ID
>>>>>>>>>>>
>>>>>>>>>>>    PUBLIC.GATABLE2.SERVICE
>>>>>>>>>>>
>>>>>>>>>>>    PUBLIC.GATABLE2.PRODUCT_VIEWS_APP
>>>>>>>>>>>
>>>>>>>>>>>    PUBLIC.GATABLE2.PRODUCT_CLICKS_APP
>>>>>>>>>>>
>>>>>>>>>>>    PUBLIC.GATABLE2.PRODUCT_VIEWS_WEB
>>>>>>>>>>>
>>>>>>>>>>>    PUBLIC.GATABLE2.PRODUCT_CLICKS_WEB
>>>>>>>>>>>
>>>>>>>>>>>    PUBLIC.GATABLE2.PDP_SESSIONS_APP
>>>>>>>>>>>
>>>>>>>>>>>    PUBLIC.GATABLE2.PDP_SESSIONS_WEB
>>>>>>>>>>>    - pkey = customer_id,dt
>>>>>>>>>>>    - affinityKey = customer
>>>>>>>>>>>
>>>>>>>>>>> Query:
>>>>>>>>>>>
>>>>>>>>>>>    - select COUNT(*) FROM( Select customer_id from GATABLE2
>>>>>>>>>>>    where category_id in (175925, 101450, 9005, 175930, 175930,
>>>>>>>>>>>    175940,175945,101450, 6453) group by customer_id having
>>>>>>>>>>>    SUM(product_views_app) > 2 OR  SUM(product_clicks_app) > 1 )
>>>>>>>>>>>
>>>>>>>>>>> The table has 600M rows.
>>>>>>>>>>> At first, the query took 1m, when we added an index on
>>>>>>>>>>> category_id the query started taking 3m.
>>>>>>>>>>>
>>>>>>>>>>> The SQL execution plan for both queries is attached.
>>>>>>>>>>>
>>>>>>>>>>> We are using a single x1.16xlarge insntace with query
>>>>>>>>>>> parallelism set to 32
>>>>>>>>>>>
>>>>>>>>>>> Cheers,
>>>>>>>>>>> Eugene
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Best regards,
>>>>>>>>>> Andrey V. Mashenkov
>>>>>>>>>>
>>>>>>>>>

Re: Query 3x slower with index

Reply via email to