Hi Mostafa,

Thanks for your response.

We are using Spark to create HDFS Parquet files, and we defined external
tables on top of these Parquet files.
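
For reference, a minimal sketch of how our Spark write side could apply the
spark.sql.parquet.writeLegacyFormat setting suggested below (PySpark; the app
name, input and output paths are placeholders, not our real job):

    # Minimal PySpark sketch, assuming placeholder paths; the only point of
    # the example is the writeLegacyFormat setting.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("write-parquet-for-impala")  # placeholder app name
        # Write Parquet in the older layout that Hive/Impala expect
        # (mainly affects decimals and nested list/map encoding).
        .config("spark.sql.parquet.writeLegacyFormat", "true")
        .getOrCreate()
    )

    df = spark.read.json("/data/raw/engagement_events")  # placeholder input
    df.write.mode("overwrite").parquet("/data/parquet/engagement_end_event")  # placeholder output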

Since we are running hundreds of scheduled queries on these tables, I'm not
sure that running compute stats will change the scenario here.

Maybe I should separate these issues: first understand what happened and
where the numbers I see come from, and then try to enhance the analyzer.

Lastly, I need to find a way to run compute stats on the tables.
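
If it helps, this is roughly what I have in mind (only the table and column
names come from the error below; the row count and NDV numbers are
placeholders I would still need to fill in, following the Cloudera stats page
linked below):

    # Rough Python sketch: push the stats statements through impala-shell.
    import subprocess

    TABLE = "default.dv_engagementendevent_prq_local"  # table from the error below

    # Preferred path once the schema issue is resolved:
    #   impala-shell -q "COMPUTE STATS default.dv_engagementendevent_prq_local"
    # Manual fallback in the meantime, with placeholder numbers:
    statements = [
        f"ALTER TABLE {TABLE} SET TBLPROPERTIES "
        "('numRows'='1000000', 'STATS_GENERATED_VIA_STATS_TASK'='true')",
        f"ALTER TABLE {TABLE} SET COLUMN STATS dsptlocation ('numDVs'='100000')",
    ]

    for stmt in statements:
        subprocess.run(["impala-shell", "-q", stmt], check=True)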


On Sun, May 20, 2018 at 2:11 AM, Mostafa Mokhtar <mmokh...@cloudera.com>
wrote:

> If you are using Spark to load data to be queried by Impala, please use
> spark.sql.parquet.writeLegacyFormat.
>
> Please set the number of rows on the tables and the distinct-value counts
> on the columns, as described in
> https://www.cloudera.com/documentation/enterprise/5-8-x/topics/impala_perf_stats.html
>
>
>
> On Sat, May 19, 2018 at 1:56 PM, Fawze Abujaber <fawz...@gmail.com> wrote:
>
>> Hi Mostafa,
>>
>> I can't run compute stats on the dv tables, and I think the reason is that
>> these tables' data was written using Spark SQL.
>>
>>
>>    - r-00000-5752904c-b197-48c0-8f9b-b629e727a5db.snappy.parquet' has an
>>    incompatible Parquet schema for column 
>> 'default.dv_engagementendevent_prq_local.dsptlocation'.
>>    Column type: STRING, Parquet schema: optional struct EngagementEndEvent
>>    [i:-1 d:1 r:0] { optional int64 engagement_end_time [i:201 d:2 r:0]
>>    optional int64 end_reason [i:202 d:2 r:0] optional byte_array agent_id
>>    [i:203 d:2 r:0] optional byte_array service_queue_id [i:204 d:2 r:0]
>>    optional int64 skill_id [i:205 d:2 r:0] optional int32
>>    interactive_indication [i:206 d:2 r:0] optional int64
>>    engagement_set_sequence [i:207 d:2 r:0] optional int64 engagement_sequence
>>    [i:208 d:2 r:0] optional int32 old_channel [i:209 d:2 r:0] optional int64
>>    millis_in_queue [i:210 d:2 r:0] optional int64 target_skill_id [i:211 d:2
>>    r:0] optional byte_array target_agent_id [i:212 d:2 r:0] optional
>>    byte_array transcript_identifier [i:213 d:2 r:0] optional int64
>>    call_connect_time [i:214 d:2 r:0] optional int64 conversation_start_time
>>    [i:215 d:2 r:0] optional int64 precall_ivr_connect_time [i:216 d:2 r:0]
>>    optional int64 s_call_request_time [i:217 d:2 r:0] optional int64
>>    s_call_scheduled_time [i:218 d:2 r:0] optional byte_array units [i:219 d:2
>>    r:0] optional byte_array from_phone_number [i:220 d:2 r:0] optional
>>    byte_array to_phone_number [i:221 d:2 r:0] optional byte_array did_number
>>    [i:222 d:2 r:0] optional int32 from_participant_kind [i:223 d:2 r:0]
>>    optional int32 to_participant_kind [i:224 d:2 r:0] optional byte_array
>>    technical_call_failed_reason [i:225 d:2 r:0] optional byte_array
>>    call_disconnector [i:226 d:2 r:0] optional int32 last_call_type [i:227 d:2
>>    r:0] optional int64 call_end_reason [i:228 d:2 r:0] optional byte_array
>>    agent_session_id [i:229 d:2 r:0] optional int32 channel [i:230 d:2 r:0]
>>    optional byte_array skill_name [i:231 d:2 r:0] optional byte_array
>>    target_skill_name [i:232 d:2 r:0] optional byte_array service_queue_name
>>    [i:233 d:2 r:0] optional int64 engagement_start_time [i:234 d:2 r:0]
>>    optional int64 interaction_start_time [i:235 d:2 r:0] optional int64
>>    agent_group_id [i:236 d:2 r:0] optional int64 chat_center_id [i:237 d:2
>>    r:0] }
>>
>>
>> The other tables aren't partitioned.
>>
>> On Sat, May 19, 2018 at 8:36 PM, Mostafa Mokhtar <mmokh...@cloudera.com>
>> wrote:
>>
>>> Please run “compute stats table_name” against the tables missing stats
>>> in the warnings section.
>>> Then rerun the query and send the new profiles.
>>> With stats the planner should produce more efficient query plans with
>>> less peak memory.
>>>
>>> Thanks
>>> Mostafa
>>>
>>> On May 19, 2018, at 9:38 AM, Fawze Abujaber <fawz...@gmail.com> wrote:
>>>
>>> Attached is the query profile.
>>>
>>> On Sat, May 19, 2018 at 6:52 PM, Mostafa Mokhtar <mmokh...@cloudera.com>
>>> wrote:
>>>
>>>> Please include the full query profile in your original email.
>>>>
>>>> Thanks
>>>> Mostafa
>>>>
>>>> On May 19, 2018, at 6:47 AM, Fawze Abujaber <fawz...@gmail.com> wrote:
>>>>
>>>> Hi Community,
>>>>
>>>> I'm investigating the peak times when the Impala daemons' memory was
>>>> consumed, so I can distribute my queries in the right way.
>>>>
>>>> While looking into one scenario, I see one query with the stats below:
>>>>
>>>> *** The query has several joins, which is why it takes so long.
>>>>
>>>> Duration: 4.1m
>>>> Rows Produced: 30102
>>>> Aggregate Peak Memory Usage: 6.5 GiB
>>>> Per Node Peak Memory Usage: 6.5 GiB
>>>> Total bytes read from HDFS: 727 MiB
>>>> Memory Accrual: 86228522168 byte seconds
>>>> Pool: default-pool
>>>> Query State: FINISHED
>>>> Threads: CPU Time: 8m
>>>>
>>>> ====================
>>>>
>>>> I'm interested in understanding why Aggregate Peak Memory Usage and Per
>>>> Node Peak Memory Usage are identical, while in the query profile I see it
>>>> ran several fragments on different nodes.
>>>>
>>>> Also, every time this query ran, the two mentioned parameters had
>>>> identical values (it is a daily scheduled query).
>>>>
>>>> What I'm trying to understand:
>>>>
>>>> 1) If the query has several fragments, shouldn't the two parameters be
>>>> different?
>>>>
>>>> 2) Can this scenario happen because the number of bytes read from HDFS is
>>>> small, which may cause all the data to be read from a single node?
>>>>
>>>> 3) Since I see that the node with the peak memory is the coordinator, and
>>>> the two parameters are identical, does that mean the coordinator also
>>>> executed most of the query?
>>>>
>>>> 4) Thinking about when these two metrics could have the same value: maybe
>>>> at a particular point in time there was only one node participating in the
>>>> query coordination and execution, which would certainly be the coordinator
>>>> node, and at that point the query had its highest aggregate memory
>>>> consumption.
>>>>
>>>> Is my assumption true?
>>>>
>>>> 5) If my previous assumption is true, is there any way to prevent the
>>>> coordinator from participating in the query execution? (Having 2-3 queries
>>>> running at the same time in such a scenario will fail.)
>>>>
>>>> 6) While looking at the fragments, I see the following on one of the
>>>> fragments:
>>>>
>>>> PeakMemoryUsage: 141.1 MiB
>>>> PerHostPeakMemUsage: 6.5 GiB
>>>>
>>>> Does PeakMemoryUsage refer to the node that ran the specific fragment,
>>>> and PerHostPeakMemUsage to the node with the peak memory across the
>>>> cluster for this specific query?
>>>>
>>>>
>>>> --
>>>> Take Care
>>>> Fawze Abujaber
>>>>
>>>>
>>>
>>>
>>> --
>>> Take Care
>>> Fawze Abujaber
>>>
>>> <profile.txt>
>>>
>>>
>>
>>
>> --
>> Take Care
>> Fawze Abujaber
>>
>
>


-- 
Take Care
Fawze Abujaber
