Hi Mostafa,
I can't run COMPUTE STATS on the dv tables, and I think the reason is that
these tables' data was written using Spark SQL.
- r-00000-5752904c-b197-48c0-8f9b-b629e727a5db.snappy.parquet' has an
incompatible Parquet schema for column
'default.dv_engagementendevent_prq_local.dsptlocation'. Column type:
STRING, Parquet schema: optional struct EngagementEndEvent [i:-1 d:1 r:0] {
optional int64 engagement_end_time [i:201 d:2 r:0] optional int64
end_reason [i:202 d:2 r:0] optional byte_array agent_id [i:203 d:2 r:0]
optional byte_array service_queue_id [i:204 d:2 r:0] optional int64
skill_id [i:205 d:2 r:0] optional int32 interactive_indication [i:206 d:2
r:0] optional int64 engagement_set_sequence [i:207 d:2 r:0] optional int64
engagement_sequence [i:208 d:2 r:0] optional int32 old_channel [i:209 d:2
r:0] optional int64 millis_in_queue [i:210 d:2 r:0] optional int64
target_skill_id [i:211 d:2 r:0] optional byte_array target_agent_id [i:212
d:2 r:0] optional byte_array transcript_identifier [i:213 d:2 r:0] optional
int64 call_connect_time [i:214 d:2 r:0] optional int64
conversation_start_time [i:215 d:2 r:0] optional int64
precall_ivr_connect_time [i:216 d:2 r:0] optional int64 s_call_request_time
[i:217 d:2 r:0] optional int64 s_call_scheduled_time [i:218 d:2 r:0]
optional byte_array units [i:219 d:2 r:0] optional byte_array
from_phone_number [i:220 d:2 r:0] optional byte_array to_phone_number
[i:221 d:2 r:0] optional byte_array did_number [i:222 d:2 r:0] optional
int32 from_participant_kind [i:223 d:2 r:0] optional int32
to_participant_kind [i:224 d:2 r:0] optional byte_array
technical_call_failed_reason [i:225 d:2 r:0] optional byte_array
call_disconnector [i:226 d:2 r:0] optional int32 last_call_type [i:227 d:2
r:0] optional int64 call_end_reason [i:228 d:2 r:0] optional byte_array
agent_session_id [i:229 d:2 r:0] optional int32 channel [i:230 d:2 r:0]
optional byte_array skill_name [i:231 d:2 r:0] optional byte_array
target_skill_name [i:232 d:2 r:0] optional byte_array service_queue_name
[i:233 d:2 r:0] optional int64 engagement_start_time [i:234 d:2 r:0]
optional int64 interaction_start_time [i:235 d:2 r:0] optional int64
agent_group_id [i:236 d:2 r:0] optional int64 chat_center_id [i:237 d:2
r:0] }
The other tables aren't partitioned.
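
A minimal way to confirm the mismatch from impala-shell (just a sketch,
using the table name taken from the error above):

    -- The metastore declares the dsptlocation column as STRING
    -- (per the "Column type: STRING" part of the error message):
    DESCRIBE default.dv_engagementendevent_prq_local;

    -- COMPUTE STATS has to scan the Parquet files, so it hits the same
    -- incompatible-schema error, since the files written by Spark SQL
    -- hold a nested EngagementEndEvent struct instead of a STRING:
    COMPUTE STATS default.dv_engagementendevent_prq_local;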
On Sat, May 19, 2018 at 8:36 PM, Mostafa Mokhtar <[email protected]>
wrote:
> Please run “compute stats table_name” against the tables missing stats in
> the warnings section.
> Then rerun the query and send the new profiles.
> With stats, the planner should produce more efficient query plans with
> lower peak memory.
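>
> For example, a minimal sketch (table_name is a placeholder for each table
> flagged in the warnings section):
>
>     -- Gather table and column statistics for the planner:
>     COMPUTE STATS table_name;
>
>     -- Verify the stats are populated (#Rows should no longer be -1):
>     SHOW TABLE STATS table_name;
>     SHOW COLUMN STATS table_name;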
>
> Thanks
> Mostafa
>
> On May 19, 2018, at 9:38 AM, Fawze Abujaber <[email protected]> wrote:
>
> Attached the query profile
>
> On Sat, May 19, 2018 at 6:52 PM, Mostafa Mokhtar <[email protected]>
> wrote:
>
>> Please include the full query profile in your original email.
>>
>> Thanks
>> Mostafa
>>
>> On May 19, 2018, at 6:47 AM, Fawze Abujaber <[email protected]> wrote:
>>
>> Hi Community,
>>
>> I'm investigating the peak times when the Impala daemons' memory was
>> consumed, so I can distribute my queries in the right way.
>>
>> While looking into one scenario, I see one query with the stats below:
>>
>> *** The query has several joins, which is why it takes so much
>> time.
>>
>> Duration: 4.1m
>> Rows Produced: 30102
>> Aggregate Peak Memory Usage: 6.5 GiB
>> Per Node Peak Memory Usage: 6.5 GiB
>> Total number of bytes read from HDFS: 727 MiB
>> Memory Accrual: 86228522168 byte seconds
>> Pool: default-pool
>> Query State: FINISHED
>> Threads: CPU Time: 8m
>>
>> ====================
>>
>> I'm interested to understand why Aggregate Peak Memory Usage and Per Node
>> Peak Memory Usage are identical, while looking at the query profile I see
>> it ran several fragments on different nodes.
>>
>> Also, I see that every time this query ran, it had these 2 parameters
>> with identical values (it's a daily scheduled query).
>>
>> What i'm trying to understand:
>>
>> 1) If the query has several fragments, shouldn't the 2 parameters be
>> different?
>>
>> 2) Can this scenario happen because the HDFS bytes read are small, which
>> may cause all the data to be read from a single node?
>>
>> 3) Since I see that the node with the peak memory is the coordinator,
>> and the 2 parameters are identical, does that mean the coordinator also
>> executed most of the query?
>>
>> 4) Thinking about when these 2 metrics can have the same value: my
>> assumption is that at a particular time only one node was participating
>> in the query coordination and execution, which would surely be the
>> coordinator node, and at that point the query had its highest aggregate
>> memory consumption.
>>
>> Is my assumption true?
>>
>> 5) If my previous assumption is true, is there any way to prevent the
>> coordinator from participating in query execution? (Since having 2-3
>> queries with such a scenario running at the same time will fail.)
>>
>> 6) While looking at the fragments, I see the following on one of the
>> fragments:
>>
>> PeakMemoryUsage: 141.1 MiB
>> PerHostPeakMemUsage: 6.5 GiB
>>
>> Does PeakMemoryUsage refer to the node that ran the specific fragment,
>> while PerHostPeakMemUsage refers to the node with the peak memory across
>> the cluster for this specific query?
>>
>>
>> --
>> Take Care
>> Fawze Abujaber
>>
>>
>
>
> --
> Take Care
> Fawze Abujaber
>
> <profile.txt>
>
>
--
Take Care
Fawze Abujaber