1. /* PUBLIC.AFFINITY_KEY */ means the index on the affinity column is used.
The full index will be scanned, with the date condition applied to each row.
As I wrote, you can create a composite index to speed up the index scan.
2. "group sorted" means the index is used for grouping. It looks like H2 has
an optimization for this, so grouping can be applied on the fly.
Unsorted grouping would mean that we have to fetch the full dataset and only
then group it.
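
For example, a minimal sketch of such a composite index created through the
SQL API. The index name GAL2RU_CUST_DT_IDX is my own, and the "cache" handle
is an assumption; adjust names to your schema:

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.query.SqlFieldsQuery;

Ignite ignite = Ignition.ignite();                 // your already-running node
IgniteCache<?, ?> cache = ignite.cache("GAL2RU");  // cache name is an assumption

// CUSTOMER_ID first keeps rows in group order (so "group sorted" survives),
// DT second lets H2 narrow the range by date inside each group.
cache.query(new SqlFieldsQuery(
    "CREATE INDEX GAL2RU_CUST_DT_IDX ON GAL2RU (CUSTOMER_ID, DT)")).getAll();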

On Wed, Aug 22, 2018 at 5:21 PM eugene miretsky <eugene.miret...@gmail.com>
wrote:

> Just as a reference, below are 2 execution plans, with and without the
> index, on a very similar table.
>
> Adding the index removes /* PUBLIC.AFFINITY_KEY */ and /* group sorted */.
> 1) Does PUBLIC.AFFINITY_KEY mean that DT is the affinity key? We are
> setting customer_id as the affinity key. Is there a way to verify that?
> 2) Is it possible that the removal of /* group sorted */ indicates that
> the result of the GROUP BY must be sorted (hence taking a long time)?
>
> *Query*
> Select COUNT (*) FROM (SELECT customer_id FROM GAL2RU where dt >
> '2018-06-12' GROUP BY customer_id having SUM(ru_total_app_sessions_count) >
> 2 AND MAX(ru_total_web_sessions_count) < 1)
>
> *Without an index*
>
> SELECT
> __Z0.CUSTOMER_ID AS __C0_0,
> SUM(__Z0.RU_TOTAL_APP_SESSIONS_COUNT) AS __C0_1,
> MAX(__Z0.RU_TOTAL_WEB_SESSIONS_COUNT) AS __C0_2
> FROM PUBLIC.GAL2RU __Z0
> /* PUBLIC.AFFINITY_KEY */
> WHERE __Z0.DT > '2018-06-12'
> GROUP BY __Z0.CUSTOMER_ID
> /* group sorted */
>
> SELECT
> COUNT(*)
> FROM (
> SELECT
> __C0_0 AS CUSTOMER_ID
> FROM PUBLIC.__T0
> GROUP BY __C0_0
> HAVING (SUM(__C0_1) > 2)
> AND (MAX(__C0_2) < 1)
> ) _0__Z1
>
>
> *With an index*
>
> SELECT
> __Z0.CUSTOMER_ID AS __C0_0,
> SUM(__Z0.RU_TOTAL_APP_SESSIONS_COUNT) AS __C0_1,
> MAX(__Z0.RU_TOTAL_WEB_SESSIONS_COUNT) AS __C0_2
> FROM PUBLIC.GAL2RU __Z0
> /* PUBLIC.DT_IDX2: DT > '2018-06-12' */
> WHERE __Z0.DT > '2018-06-12'
> GROUP BY __Z0.CUSTOMER_ID
>
> SELECT
> COUNT(*)
> FROM (
> SELECT
> __C0_0 AS CUSTOMER_ID
> FROM PUBLIC.__T0
> GROUP BY __C0_0
> HAVING (SUM(__C0_1) > 2)
> AND (MAX(__C0_2) < 1)
> ) _0__Z1
>
>
> On Wed, Aug 22, 2018 at 9:43 AM, eugene miretsky <
> eugene.miret...@gmail.com> wrote:
>
>> Thanks Andrey,
>>
>> We are using the Ignite notebook; any idea if there is a way to provide
>> these flags and hints directly from SQL?
>>
>> From your description, it seems like the query is executed in the
>> following order:
>> 1) Group by customer_id
>> 2) For each group, filter on date using the index and compute the
>> aggregates
>>
>> My impression was that the order is:
>> 1) On each node, filter the rows by date (using the index)
>> 2) On each node, group the remaining rows by customer_id, and then
>> compute the aggregates
>>
>> That's why we created the index on the dt field, as opposed to the
>> customer_id field.
>>
>> Cheers,
>> Eugene
>>
>>
>> On Wed, Aug 22, 2018 at 8:44 AM, Andrey Mashenkov <
>> andrey.mashen...@gmail.com> wrote:
>>
>>> Eugene,
>>>
>>> 1. Note that queryParallelism splits the indices, so Ignite works in a
>>> similar way as if the index data resided on several nodes. These index
>>> parts can be looked up in parallel threads.
>>> 2. It is not a simple query, as your data is distributed among
>>> partitions, is not collocated, and aggregate functions are used.
>>> The HAVING clause is the reason: Ignite can apply it only on the reduce
>>> phase, as HAVING requires aggregate values from all index parts.
>>> 3. If your data is already collocated on customer_id, then you can hint
>>> Ignite with SqlFieldsQuery.setCollocated(true). This should force Ignite
>>> to optimize the grouping and push the aggregates down to the map phase
>>> (see the sketch after the links below).
>>> 4. In the query plan you attached you can see H2 uses DT_IDX:
>>> /* PUBLIC.DT_IDX: DT > '2018-05-12' */
>>> This is not effective: with this index H2 has to process all the data
>>> before it can calculate the aggregate for a group. An index on the
>>> affinity field may be more effective, as data can be processed group by
>>> group; once all of a group's data is processed, the result can be passed
>>> to the reducer. Hopefully H2 is smart enough to do such streaming.
>>>
>>> Also, you can try a composite index on the (customer_id, date) columns.
>>> Most likely an index hint will be needed [2].
>>>
>>> See also the collocated flag [1] and index hints [2]:
>>>
>>> [1]
>>> https://ignite.apache.org/releases/latest/javadoc/org/apache/ignite/cache/query/SqlFieldsQuery.html#setCollocated-boolean-
>>> [2]
>>> https://apacheignite.readme.io/v2.0/docs/sql-performance-and-debugging#index-hints
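>>>
>>> A rough combined sketch of point 3 plus the index hint, assuming a
>>> composite index named GAL2RU_CUST_DT_IDX already exists (the index name
>>> and the cache lookup are my assumptions, not your code):
>>>
>>> import org.apache.ignite.IgniteCache;
>>> import org.apache.ignite.Ignition;
>>> import org.apache.ignite.cache.query.SqlFieldsQuery;
>>>
>>> IgniteCache<?, ?> cache = Ignition.ignite().cache("GAL2RU");
>>>
>>> // setCollocated(true) tells Ignite the GROUP BY column matches the
>>> // affinity key, so grouping can be pushed down to the map phase;
>>> // USE INDEX(...) is the index hint from [2].
>>> SqlFieldsQuery qry = new SqlFieldsQuery(
>>>     "SELECT COUNT(*) FROM (" +
>>>     " SELECT customer_id FROM GAL2RU USE INDEX(GAL2RU_CUST_DT_IDX)" +
>>>     " WHERE dt > '2018-06-12'" +
>>>     " GROUP BY customer_id" +
>>>     " HAVING SUM(ru_total_app_sessions_count) > 2" +
>>>     " AND MAX(ru_total_web_sessions_count) < 1)")
>>>     .setCollocated(true);
>>>
>>> cache.query(qry).getAll();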
>>>
>>>
>>> On Wed, Aug 22, 2018 at 3:10 PM eugene miretsky <
>>> eugene.miret...@gmail.com> wrote:
>>>
>>>> Thanks Andrey,
>>>>
>>>> Right now we are testing with only one big node, so the reduce step
>>>> should not take any time.
>>>>
>>>> 1) We have already set parallelism to 32, and I can still see only 1
>>>> core working. What else could be preventing multiple cores from working
>>>> on the job?
>>>> 2) Why would the reduce phase need to look at all the data? It seems
>>>> like a fairly simple query.
>>>> 3) We are already collocating data by customer_id (though, as I
>>>> mentioned, right now there is only one node).
>>>> 4) We are already using collocation and have tried using an index; any
>>>> other advice? Is there a way to check what Ignite is actually doing?
>>>> How are indexes used (by Ignite or by H2)?
>>>>
>>>> Cheers,
>>>> Eugene
>>>>
>>>> On Wed, Aug 22, 2018 at 3:54 AM, Andrey Mashenkov <
>>>> andrey.mashen...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> 1. Possibly there is too much data to be looked up for the query.
>>>>> With a single node and parallelism=1, the query will always run in a
>>>>> single thread.
>>>>> You can try to add more nodes or increase the query parallelism to
>>>>> utilize more CPU cores (a configuration sketch follows at the end of
>>>>> this message).
>>>>>
>>>>> 2. An index on the date field may not be effective, as the reduce
>>>>> phase has to look at all the data for further grouping.
>>>>> Try adding an index on customer_id, or use collocation on customer_id
>>>>> (usually the preferable way).
>>>>>
>>>>> It is also possible that the bottleneck is the reduce phase.
>>>>> Is it possible to collocate the data by the GROUP BY column
>>>>> (customer_id)? This collocation will allow you to use the collocated
>>>>> flag [1], and Ignite will use a more optimal plan.
>>>>>
>>>>> 4. The main technique is to reduce the amount of data to be looked up
>>>>> on every phase, using data collocation and indices.
>>>>> Ignite produces 2 plans for distributed queries: map and reduce. You
>>>>> can analyse these queries separately to understand how much data is
>>>>> processed on the map phase and on the reduce phase.
>>>>> The map query processes node-local data (unless distributed joins are
>>>>> on), while the reduce query fetches data from remote nodes, which may
>>>>> be costly.
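>>>>>
>>>>> For reference, a minimal configuration sketch of both suggestions; the
>>>>> cache name is taken from your message, everything else (key class,
>>>>> field names) is illustrative, not your actual code:
>>>>>
>>>>> import org.apache.ignite.Ignite;
>>>>> import org.apache.ignite.Ignition;
>>>>> import org.apache.ignite.cache.affinity.AffinityKeyMapped;
>>>>> import org.apache.ignite.configuration.CacheConfiguration;
>>>>>
>>>>> // Key class: @AffinityKeyMapped collocates all rows of one customer
>>>>> // on the same node, so GROUP BY customer_id can complete map-side.
>>>>> class Gal3Ec1Key {
>>>>>     @AffinityKeyMapped
>>>>>     long customerId;
>>>>>     String dt;
>>>>> }
>>>>>
>>>>> CacheConfiguration<Gal3Ec1Key, Object> ccfg =
>>>>>     new CacheConfiguration<>("GAL3EC1");
>>>>> ccfg.setQueryParallelism(32); // index split into 32 segments, scanned
>>>>>                               // by up to 32 threads per query
>>>>> Ignite ignite = Ignition.start();
>>>>> ignite.getOrCreateCache(ccfg);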
>>>>>
>>>>>
>>>>> On Wed, Aug 22, 2018 at 6:07 AM eugene miretsky <
>>>>> eugene.miret...@gmail.com> wrote:
>>>>>
>>>>>> Here is the result of EXPLAIN for the aforementioned query:
>>>>>>
>>>>>> SELECT
>>>>>> __Z0.CUSTOMER_ID AS __C0_0,
>>>>>> SUM(__Z0.EC1_BKNT_TOTAL_PRODUCT_VIEWS_APP) AS __C0_1,
>>>>>> MAX(__Z0.EC1_HNK_TOTAL_PRODUCT_CLICKS_APP) AS __C0_2
>>>>>> FROM PUBLIC.GAL3EC1 __Z0
>>>>>> /* PUBLIC.DT_IDX: DT > '2018-05-12' */
>>>>>> WHERE __Z0.DT > '2018-05-12'
>>>>>> GROUP BY __Z0.CUSTOMER_ID
>>>>>> SELECT
>>>>>> COUNT(*)
>>>>>> FROM (
>>>>>> SELECT
>>>>>> __C0_0 AS CUSTOMER_ID
>>>>>> FROM PUBLIC.__T0
>>>>>> GROUP BY __C0_0
>>>>>> HAVING (SUM(__C0_1) > 2)
>>>>>> AND (MAX(__C0_2) < 1)
>>>>>> ) _0__Z1
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Aug 21, 2018 at 8:18 PM, eugene miretsky <
>>>>>> eugene.miret...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> We have a cache called GAL3EC1; it has:
>>>>>>>
>>>>>>>    1. A composite pKey consisting of customer_id and date
>>>>>>>    2. An index on the date column
>>>>>>>    3. 300 sparse columns
>>>>>>>
>>>>>>> We are running a single EC2 4x8xlarge node.
>>>>>>>
>>>>>>> The following query takes 8 min to finish:
>>>>>>> Select COUNT (*) FROM (SELECT customer_id FROM GAl3ec1 where dt >
>>>>>>> '2018-05-12' GROUP BY customer_id having
>>>>>>> SUM(ec1_bknt_total_product_views_app) > 2 AND
>>>>>>> MAX(ec1_hnk_total_product_clicks_app) < 1)
>>>>>>>
>>>>>>> I have a few questions:
>>>>>>>
>>>>>>>    1. The 'top' command shows 100% CPU utilization (i.e. only one
>>>>>>>    of the 32 CPUs is used). How can I get the query to use all 32
>>>>>>>    CPUs? I have tried setting query parallelism to 32, but it
>>>>>>>    didn't help.
>>>>>>>    2. Adding the index on the date column seems to have slowed down
>>>>>>>    the query. The 8 min time from above was without the index; with
>>>>>>>    the index the query doesn't finish (I gave up after 30 min). A
>>>>>>>    similar query on a smaller date range showed a 10x slowdown with
>>>>>>>    the index. Why?
>>>>>>>    3. Our loads from Spark are very slow as well, and also seem to
>>>>>>>    not use the system resources properly; could that be related?
>>>>>>>    4. What are some good tools and techniques to troubleshoot these
>>>>>>>    problems in Ignite?
>>>>>>>
>>>>>>>
>>>>>>> All the relevant info is attached (configs, cache stats, node stats,
>>>>>>> etc.).
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Eugene
>>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> Best regards,
>>>>> Andrey V. Mashenkov
>>>>>
>>>>
>>>>
>>>
>>> --
>>> Best regards,
>>> Andrey V. Mashenkov
>>>
>>
>>
>

-- 
Best regards,
Andrey V. Mashenkov
