Hi Mostafa,
I can't run COMPUTE STATS on the dv tables, and I think the reason is that
these tables' data was written using Spark SQL.
- r-00000-5752904c-b197-48c0-8f9b-b629e727a5db.snappy.parquet' has an
incompatible Parquet schema for column
'default.dv_engagementendevent_prq_local.dsptlocation'. Column type:
STRING, Parquet schema: optional struct EngagementEndEvent [i:-1 d:1 r:0] {
optional int64 engagement_end_time [i:201 d:2 r:0] optional int64
end_reason [i:202 d:2 r:0] optional byte_array agent_id [i:203 d:2 r:0]
optional byte_array service_queue_id [i:204 d:2 r:0] optional int64
skill_id [i:205 d:2 r:0] optional int32 interactive_indication [i:206 d:2
r:0] optional int64 engagement_set_sequence [i:207 d:2 r:0] optional int64
engagement_sequence [i:208 d:2 r:0] optional int32 old_channel [i:209 d:2
r:0] optional int64 millis_in_queue [i:210 d:2 r:0] optional int64
target_skill_id [i:211 d:2 r:0] optional byte_array target_agent_id [i:212
d:2 r:0] optional byte_array transcript_identifier [i:213 d:2 r:0] optional
int64 call_connect_time [i:214 d:2 r:0] optional int64
conversation_start_time [i:215 d:2 r:0] optional int64
precall_ivr_connect_time [i:216 d:2 r:0] optional int64 s_call_request_time
[i:217 d:2 r:0] optional int64 s_call_scheduled_time [i:218 d:2 r:0]
optional byte_array units [i:219 d:2 r:0] optional byte_array
from_phone_number [i:220 d:2 r:0] optional byte_array to_phone_number
[i:221 d:2 r:0] optional byte_array did_number [i:222 d:2 r:0] optional
int32 from_participant_kind [i:223 d:2 r:0] optional int32
to_participant_kind [i:224 d:2 r:0] optional byte_array
technical_call_failed_reason [i:225 d:2 r:0] optional byte_array
call_disconnector [i:226 d:2 r:0] optional int32 last_call_type [i:227 d:2
r:0] optional int64 call_end_reason [i:228 d:2 r:0] optional byte_array
agent_session_id [i:229 d:2 r:0] optional int32 channel [i:230 d:2 r:0]
optional byte_array skill_name [i:231 d:2 r:0] optional byte_array
target_skill_name [i:232 d:2 r:0] optional byte_array service_queue_name
[i:233 d:2 r:0] optional int64 engagement_start_time [i:234 d:2 r:0]
optional int64 interaction_start_time [i:235 d:2 r:0] optional int64
agent_group_id [i:236 d:2 r:0] optional int64 chat_center_id [i:237 d:2
r:0] }
The other tables aren't partitioned.
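
A minimal way to confirm the mismatch from impala-shell (just a sketch,
using the table name taken from the error above):

    -- The metastore declares the dsptlocation column as STRING
    -- (per the "Column type: STRING" part of the error message):
    DESCRIBE default.dv_engagementendevent_prq_local;

    -- COMPUTE STATS has to scan the Parquet files, so it hits the same
    -- incompatible-schema error, since the files written by Spark SQL
    -- hold a nested EngagementEndEvent struct instead of a STRING:
    COMPUTE STATS default.dv_engagementendevent_prq_local;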
On Sat, May 19, 2018 at 8:36 PM, Mostafa Mokhtar <[email protected]>
wrote:
> Please run “compute stats table_name” against the tables missing stats in
> the warnings section.
> Then rerun the query and send the new profiles.
> With stats, the planner should produce more efficient query plans with
> lower peak memory.
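>
> For example, a minimal sketch (table_name is a placeholder for each table
> flagged in the warnings section):
>
>     -- Gather table and column statistics for the planner:
>     COMPUTE STATS table_name;
>
>     -- Verify the stats are populated (#Rows should no longer be -1):
>     SHOW TABLE STATS table_name;
>     SHOW COLUMN STATS table_name;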
>
> Thanks
> Mostafa
>
> On May 19, 2018, at 9:38 AM, Fawze Abujaber <[email protected]> wrote:
>
> Attached the query profile
>
> On Sat, May 19, 2018 at 6:52 PM, Mostafa Mokhtar <[email protected]>
> wrote:
>
>> Please include the full query profile in your original email.
>>
>> Thanks
>> Mostafa
>>
>> On May 19, 2018, at 6:47 AM, Fawze Abujaber <[email protected]> wrote:
>>
>> Hi Community,
>>
>> I'm investigating the peak times when the Impala daemons' memory was
>> consumed, so I can distribute my queries in the right way.
>>
>> While looking into one scenario, I see one query with the stats below:
>>
>> *** The query has several joins, which is why it takes so much
>> time.
>>
>> Duration: 4.1m
>> Rows Produced: 30102
>> Aggregate Peak Memory Usage: 6.5 GiB
>> Per Node Peak Memory Usage: 6.5 GiB
>> Total number of bytes read from HDFS: 727 MiB
>> Memory Accrual: 86228522168 byte seconds
>> Pool: default-pool
>> Query State: FINISHED
>> Threads: CPU Time: 8m
>>
>> ====================
>>
>> I'm interested to understand why Aggregate Peak Memory Usage and Per Node
>> Peak Memory Usage are identical, while looking at the query profile I see
>> it ran several fragments on different nodes.
>>
>> Also, I see that every time this query ran, it had these 2 parameters
>> with identical values (it's a daily scheduled query).
>>
>> What i'm trying to understand:
>>
>> 1) If the query has several fragments, shouldn't the 2 parameters be
>> different?
>>
>> 2) Can this scenario happen because the HDFS bytes read are small, which
>> may cause all the data to be read from a single node?
>>
>> 3) Since I see that the node with the peak memory is the coordinator,
>> and the 2 parameters are identical, does that mean the coordinator also
>> executed most of the query?
>>
>> 4) Thinking about when these 2 metrics can have the same value: my
>> assumption is that at a particular time only one node was participating
>> in the query coordination and execution, which would surely be the
>> coordinator node, and at that point the query had its highest aggregate
>> memory consumption.
>>
>> Is my assumption true?
>>
>> 5) If my previous assumption is true, is there any way to prevent the
>> coordinator from participating in query execution? (Since having 2-3
>> queries with such a scenario running at the same time will fail.)
>>
>> 6) While looking at the fragments, I see the following on one of the
>> fragments:
>>
>> PeakMemoryUsage: 141.1 MiB
>> PerHostPeakMemUsage: 6.5 GiB
>>
>> Does PeakMemoryUsage refer to the node that ran the specific fragment,
>> while PerHostPeakMemUsage refers to the node with the peak memory across
>> the cluster for this specific query?
>>
>>
>> --
>> Take Care
>> Fawze Abujaber
>>
>>
>
>
> --
> Take Care
> Fawze Abujaber
>
> <profile.txt>
>
>
--
Take Care
Fawze Abujaber