Thanks!

Tried joining with an inlined table instead of IN as per the second
suggestion, and it didn't quite work.

Query1:

   - Select COUNT(*) FROM( Select customer_id from GATABLE3  use Index( )
   where category_id in (9005, 175930, 175930, 175940,175945,101450, 6453)
   group by customer_id having SUM(product_views_app) > 2 OR
   SUM(product_clicks_app) > 1 )
   - exec time = 17s
   - *Result: 3105868*
   - Same exec time if using AFFINITY_KEY index or "_key_PK_hash or
   customer_id index
   - Using an index on category_id increases the query time 33s

Query2:

   - Select COUNT(*) FROM( Select customer_id from GATABLE3 ga  use index
   (PUBLIC."_key_PK") inner join table(category_id int = (9005, 175930,
   175930, 175940,175945,101450, 6453)) cats on cats.category_id =
   ga.category_id   group by customer_id having SUM(product_views_app) > 2 OR
   SUM(product_clicks_app) > 1 )
   - exec time = 38s
   - *Result: 3113921*
   - Same exec time if using AFFINITY_KEY index or "_key_PK_hash or
   customer_id index or category_id index
   - Using an index on category_id doesnt change the run time

Query plans are attached.

3 questions:

   1. Why is the result differnt for the 2 queries - this is quite
   concerning.
   2. Why is the 2nd query taking longer
   3. Why  category_id index doesn't work in case of query 2.


On Wed, Sep 5, 2018 at 8:31 AM Ilya Kasnacheev <[email protected]>
wrote:

> Hello!
>
> I don't think that we're able to use index with IN () clauses. Please
> convert it into OR clauses.
>
> Please see
> https://apacheignite-sql.readme.io/docs/performance-and-debugging#section-sql-performance-and-usability-considerations
>
> Regards,
> --
> Ilya Kasnacheev
>
>
> пн, 3 сент. 2018 г. в 12:46, Andrey Mashenkov <[email protected]
> >:
>
>> Hi
>>
>> Actually, first query uses index on affinity key which looks more
>> efficient than index on category_id column.
>> The first query can process groups one by one and stream partial results
>> from map phase to reduce phase as it use sorted index lookup,
>> while second query should process full dataset on map phase before pass
>> it for reducing.
>>
>> Try to use composite index (customer_id, category_id).
>>
>> Also, SqlQueryFields.setCollocated(true) flag can help Ignite to build
>> more efficient plan when group by on collocated column is used.
>>
>> On Sun, Sep 2, 2018 at 2:02 AM eugene miretsky <[email protected]>
>> wrote:
>>
>>> Hello,
>>>
>>> Schema:
>>>
>>>    -
>>>
>>>    PUBLIC.GATABLE2.CUSTOMER_ID
>>>
>>>    PUBLIC.GATABLE2.DT
>>>
>>>    PUBLIC.GATABLE2.CATEGORY_ID
>>>
>>>    PUBLIC.GATABLE2.VERTICAL_ID
>>>
>>>    PUBLIC.GATABLE2.SERVICE
>>>
>>>    PUBLIC.GATABLE2.PRODUCT_VIEWS_APP
>>>
>>>    PUBLIC.GATABLE2.PRODUCT_CLICKS_APP
>>>
>>>    PUBLIC.GATABLE2.PRODUCT_VIEWS_WEB
>>>
>>>    PUBLIC.GATABLE2.PRODUCT_CLICKS_WEB
>>>
>>>    PUBLIC.GATABLE2.PDP_SESSIONS_APP
>>>
>>>    PUBLIC.GATABLE2.PDP_SESSIONS_WEB
>>>    - pkey = customer_id,dt
>>>    - affinityKey = customer
>>>
>>> Query:
>>>
>>>    - select COUNT(*) FROM( Select customer_id from GATABLE2 where
>>>    category_id in (175925, 101450, 9005, 175930, 175930, 
>>> 175940,175945,101450,
>>>    6453) group by customer_id having SUM(product_views_app) > 2 OR
>>>    SUM(product_clicks_app) > 1 )
>>>
>>> The table has 600M rows.
>>> At first, the query took 1m, when we added an index on category_id the
>>> query started taking 3m.
>>>
>>> The SQL execution plan for both queries is attached.
>>>
>>> We are using a single x1.16xlarge insntace with query parallelism set
>>> to 32
>>>
>>> Cheers,
>>> Eugene
>>>
>>>
>>
>> --
>> Best regards,
>> Andrey V. Mashenkov
>>
>

Attachment: Query1_pKeyIdx
Description: Binary data

Attachment: Query1_categoryIdIdx
Description: Binary data

Attachment: Query2_categoryIdx
Description: Binary data

Attachment: Query2_pKeyIdx
Description: Binary data

Reply via email to