Point taken! Thanks Todd! You saved us from days of frustration, so we're really thankful for that!
Boris

On Wed, Feb 28, 2018 at 2:07 PM, Todd Lipcon <[email protected]> wrote:

> Hi Boris,
>
> Thanks for the feedback and for sharing your experience. Like you said,
> this is more of an issue with downstream documentation, so it's probably
> not appropriate to continue discussing it in the context of the Apache
> Kudu open source project. Feedback on CDH is best directed to the
> associated vendor (Cloudera). Even though I happen to be employed by
> them, I and other employees participate here in the Apache Kudu project
> as individuals, and it's important to keep that distinction clear. Kudu
> is a product of the ASF non-profit organization, not a product of any
> commercial vendor.
>
> -Todd
>
> On Wed, Feb 28, 2018 at 6:17 AM, Boris Tyukin <[email protected]> wrote:
>
>> First of all, thanks for the very quick response. It means a lot to
>> know you guys stand behind Kudu and work with users like myself. In
>> fact, we get better support here than through the official channels :)
>>
>> That said, I needed to vent a bit :) Granted, this should probably be
>> directed at the Impala devs, but I think Kudu has been impacted the
>> most.
>>
>> We ran a cluster health check command and saw some warnings in there,
>> but we proceeded with running the insert statement with the hints Todd
>> suggested, and now we are back to the times we used to get with Kudu
>> 1.3 / Impala 2.8.
>>
>> It is a bit frustrating, I must say, that these changes impacted
>> performance so dramatically. In my opinion, the Kudu and CDH release
>> notes should call out changes like this in BOLD RED so they don't catch
>> users by surprise. And not everyone is as organized as we are - we knew
>> exactly how much time it took with Kudu 1.3 / Impala 2.8.
>>
>> I've tracked down all the related JIRAs and did not see a word about a
>> dramatic performance degradation. I did see words like 'optimized' and
>> 'improved'.
>>
>> Since it is part of the official CDH distro, I would expect a little
>> more proactive warning.
>>
>> If you want me to open a JIRA on this, I would be happy to do it. We
>> see this degradation across all our tables, and as you can see from my
>> example query, it is a really straightforward select from a table - no
>> joins, no predicates, and no complex calculations.
>>
>> Thanks again,
>> Boris
>>
>> On Thu, Feb 22, 2018 at 2:44 PM, Todd Lipcon <[email protected]> wrote:
>>
>>> In addition to what Hao suggests, I think it's worth noting that the
>>> insert query plan created by Impala has changed a bit over time.
>>>
>>> It sounds like you upgraded from Impala 2.8 (in CDH 5.11), which used
>>> a very straightforward insert plan - each node separately inserted
>>> rows in whatever order the rows were consumed. This plan worked well
>>> for smaller inserts but could cause timeouts with larger workloads.
>>>
>>> In Impala 2.9, the plan was changed so that Impala performs some
>>> shuffling and sorting before inserting into Kudu. This makes the Kudu
>>> insert pattern more reliable and efficient, but it could cause a
>>> degradation for some workloads, since Impala's sorts are
>>> single-threaded.
>>>
>>> Impala 2.10 (which I guess you are running) improved a bit over 2.9 by
>>> ensuring that the sorts can be "partial", which resolved some of the
>>> performance degradation, but it's possible your workload is still
>>> affected negatively.
>>>
>>> To disable the new behavior, you can use the insert hints 'noshuffle'
>>> and/or 'noclustered', such as:
>>>
>>>   upsert into my_table /* +noclustered,noshuffle */
>>>   select * from my_other_table;
>>>
>>> Hope that helps
>>> -Todd
>>>
>>> On Thu, Feb 22, 2018 at 11:02 AM, Hao Hao <[email protected]> wrote:
>>>
>>>> Did you happen to check the health of the cluster after the upgrade
>>>> with 'kudu cluster ksck'?
>>>>
>>>> Best,
>>>> Hao
>>>>
>>>> On Thu, Feb 22, 2018 at 6:31 AM, Boris Tyukin <[email protected]> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> We just upgraded our dev cluster from Kudu 1.3 to Kudu
>>>>> 1.5.0-cdh5.13.1 and noticed quite severe performance degradation. We
>>>>> did a CTAS to Kudu from an Impala parquet table, which has not
>>>>> changed a bit since the upgrade (even the same # of rows), using the
>>>>> query below.
>>>>>
>>>>> It used to take 11-11.5 hours on Kudu 1.3 and is now taking 50-55
>>>>> hours.
>>>>>
>>>>> Of course, the Impala version was also bumped with CDH 5.13.
>>>>>
>>>>> Any clue why it takes so much time now?
>>>>>
>>>>> The table has 5.5B rows.
>>>>>
>>>>> create TABLE kudutest_ts.clinical_event_nots
>>>>> PRIMARY KEY (clinical_event_id)
>>>>> PARTITION BY HASH(clinical_event_id) PARTITIONS 120
>>>>> STORED AS KUDU
>>>>> AS
>>>>> SELECT
>>>>>   clinical_event_id,
>>>>>   encntr_id,
>>>>>   person_id,
>>>>>   encntr_financial_id,
>>>>>   event_id,
>>>>>   event_title_text,
>>>>>   CAST(view_level as STRING) as view_level,
>>>>>   order_id,
>>>>>   catalog_cd,
>>>>>   series_ref_nbr,
>>>>>   accession_nbr,
>>>>>   contributor_system_cd,
>>>>>   reference_nbr,
>>>>>   parent_event_id,
>>>>>   event_reltn_cd,
>>>>>   event_class_cd,
>>>>>   event_cd,
>>>>>   event_tag,
>>>>>   CAST(event_end_dt_tm_os as BIGINT) as event_end_dt_tm_os,
>>>>>   result_val,
>>>>>   result_units_cd,
>>>>>   result_time_units_cd,
>>>>>   task_assay_cd,
>>>>>   record_status_cd,
>>>>>   result_status_cd,
>>>>>   CAST(authentic_flag as STRING) authentic_flag,
>>>>>   CAST(publish_flag as STRING) publish_flag,
>>>>>   qc_review_cd,
>>>>>   normalcy_cd,
>>>>>   normalcy_method_cd,
>>>>>   inquire_security_cd,
>>>>>   resource_group_cd,
>>>>>   resource_cd,
>>>>>   CAST(subtable_bit_map as STRING) subtable_bit_map,
>>>>>   collating_seq,
>>>>>   verified_prsnl_id,
>>>>>   performed_prsnl_id,
>>>>>   updt_id,
>>>>>   CAST(updt_task as STRING) updt_task,
>>>>>   updt_cnt,
>>>>>   CAST(updt_applctx as STRING) updt_applctx,
>>>>>   normal_low,
>>>>>   normal_high,
>>>>>   critical_low,
>>>>>   critical_high,
>>>>>   CAST(event_tag_set_flag as STRING) event_tag_set_flag,
>>>>>   CAST(note_importance_bit_map as STRING) note_importance_bit_map,
>>>>>   CAST(order_action_sequence as STRING) order_action_sequence,
>>>>>   entry_mode_cd,
>>>>>   source_cd,
>>>>>   clinical_seq,
>>>>>   CAST(event_end_tz as STRING) event_end_tz,
>>>>>   CAST(event_start_tz as STRING) event_start_tz,
>>>>>   CAST(performed_tz as STRING) performed_tz,
>>>>>   CAST(verified_tz as STRING) verified_tz,
>>>>>   task_assay_version_nbr,
>>>>>   modifier_long_text_id,
>>>>>   ce_dynamic_label_id,
>>>>>   CAST(nomen_string_flag as STRING) nomen_string_flag,
>>>>>   src_event_id,
>>>>>   CAST(last_utc_ts as BIGINT) last_utc_ts,
>>>>>   device_free_txt,
>>>>>   CAST(trait_bit_map as STRING) trait_bit_map,
>>>>>   CAST(clu_subkey1_flag as STRING) clu_subkey1_flag,
>>>>>   CAST(clinsig_updt_dt_tm as BIGINT) clinsig_updt_dt_tm,
>>>>>   CAST(event_end_dt_tm as BIGINT) event_end_dt_tm,
>>>>>   CAST(event_start_dt_tm as BIGINT) event_start_dt_tm,
>>>>>   CAST(expiration_dt_tm as BIGINT) expiration_dt_tm,
>>>>>   CAST(verified_dt_tm as BIGINT) verified_dt_tm,
>>>>>   CAST(src_clinsig_updt_dt_tm as BIGINT) src_clinsig_updt_dt_tm,
>>>>>   CAST(updt_dt_tm as BIGINT) updt_dt_tm,
>>>>>   CAST(valid_from_dt_tm as BIGINT) valid_from_dt_tm,
>>>>>   CAST(valid_until_dt_tm as BIGINT) valid_until_dt_tm,
>>>>>   CAST(performed_dt_tm as BIGINT) performed_dt_tm,
>>>>>   txn_id_text,
>>>>>   CAST(ingest_dt_tm as BIGINT) ingest_dt_tm
>>>>> FROM v500.clinical_event
>>>
>>> --
>>> Todd Lipcon
>>> Software Engineer, Cloudera
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
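A note for readers who hit the same slowdown: one way to confirm which insert plan Impala is using is to EXPLAIN the statement with and without the hints Todd mentions. The sketch below is not from the thread: 'impalad-host' is a hypothetical coordinator hostname and the table names are the placeholders from Todd's example. If the clustered plan is active, the plan output should show a sort step ahead of the Kudu write; with the hints applied, it should not.

    # Minimal sketch; 'impalad-host', my_table, and my_other_table are
    # placeholders. Compare this plan against the same statement without
    # the hint comment.
    impala-shell -i impalad-host -q "
      explain upsert into my_table /* +noclustered,noshuffle */
      select * from my_other_table"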
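Similarly, the health check Hao mentioned takes the cluster's master addresses as a comma-separated argument. A minimal sketch, with hypothetical hostnames:

    # Hypothetical master hostnames; substitute your cluster's masters.
    kudu cluster ksck master-1.example.com,master-2.example.com,master-3.example.com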
