Hi Boris,

Thanks for the feedback and for sharing your experience. Like you said, this
is more of an issue with downstream documentation, so it's probably not
appropriate to continue discussing it in the context of the Apache Kudu open
source project. For feedback on CDH, it's best to direct it to the associated
vendor (Cloudera). Even though I happen to be employed by them, I and other
employees participate here in the Apache Kudu project as individuals, and it's
important to keep that distinction clear. Kudu is a product of the ASF
non-profit organization, not a product of any commercial vendor.

-Todd

On Wed, Feb 28, 2018 at 6:17 AM, Boris Tyukin <bo...@boristyukin.com> wrote:

> First of all, thanks for the very quick response. It means a lot to know you
> guys stand behind Kudu and work with users like me. In fact, we get better
> support here than through the official channels :)
>
> That said, I needed to vent a bit :) Granted, this should probably be
> directed at the Impala devs, but I think Kudu has been impacted the most.
>
> We ran the cluster health check command and saw some warnings in there. But
> we proceeded with running the insert statement with the hints Todd suggested,
> and now we are back to the times we used to get with Kudu 1.3 / Impala 2.8.
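>
> In case it helps anyone else hitting this, the statement we ran was roughly
> of the following shape (just a sketch - the hint placement follows Todd's
> example below, and the column list with all the CASTs is the same as in the
> CTAS at the bottom of this thread):
>
> insert into kudutest_ts.clinical_event_nots /* +noclustered,noshuffle */
> select
>   clinical_event_id,
>   -- ... remaining columns / CASTs exactly as in the CTAS below ...
>   CAST(ingest_dt_tm as BIGINT) ingest_dt_tm
> from v500.clinical_event;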
>
> It is a bit frustrating, I must say, that these changes impacted performance
> so dramatically. In my opinion, Kudu and CDH release notes should have stuff
> like that in BOLD RED colors so it does not catch users by surprise. And not
> everyone is as organized as we are - we knew exactly how much time it took
> with Kudu 1.3 / Impala 2.8.
>
> I've tracked down all the related JIRAs and I did not see a word about a
> dramatic performance degradation. I did see words like 'optimized' and
> 'improved'.
>
> Since it is part of the official CDH distro, I would expect a bit more of a
> proactive warning.
>
> If you want me to open a JIRA on this, I would be happy to do it... We see
> this degradation across all our tables, and as you can see from my example
> query, it is a really straightforward select from a table - no joins, no
> predicates and no complex calculations.
>
> Thanks again,
> Boris
>
> On Thu, Feb 22, 2018 at 2:44 PM, Todd Lipcon <t...@cloudera.com> wrote:
>
>> In addition to what Hao suggests, I think it's worth noting that the
>> insert query plan created by Impala changed a bit over time.
>>
>> It sounds like you upgraded from Impala 2.8 (in CDH 5.11), which used a
>> very straightforward insert plan - each node separately inserted rows in
>> whatever order the rows were consumed. This plan worked well for smaller
>> inserts but could cause timeouts with larger workloads.
>>
>> In Impala 2.9, the plan was changed so that Impala performs some
>> shuffling and sorting before inserting into Kudu. This makes the Kudu
>> insert pattern more reliable and efficient, but could cause a degradation
>> for some workloads since Impala's sorts are single-threaded.
>>
>> Impala 2.10 (which I guess you are running) improved a bit over 2.9 by
>> ensuring that the sorts can be "partial", which resolved some of the
>> performance degradation, but it's possible your workload is still negatively
>> affected.
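>>
>> If you want to confirm which plan you are getting, you can run EXPLAIN on
>> the insert and check whether it shows an exchange and a (partial) sort step
>> above the Kudu write. A sketch, using placeholder table names:
>>
>> explain insert into my_table select * from my_other_table;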
>>
>> To disable the new behavior, you can use the insert hints 'noshuffle' and/or
>> 'noclustered', such as:
>>
>> upsert into my_table /* +noclustered,noshuffle */ select * from my_other_table;
>>
>>
>> Hope that helps
>> -Todd
>>
>> On Thu, Feb 22, 2018 at 11:02 AM, Hao Hao <hao....@cloudera.com> wrote:
>>
>>> Did you happen to check the health of the cluster after the upgrade with
>>> 'kudu cluster ksck'?
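>>>
>>> For example (the master addresses below are placeholders for your cluster's
>>> Kudu masters):
>>>
>>> kudu cluster ksck master1.example.com,master2.example.com,master3.example.com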
>>>
>>> Best,
>>> Hao
>>>
>>> On Thu, Feb 22, 2018 at 6:31 AM, Boris Tyukin <bo...@boristyukin.com>
>>> wrote:
>>>
>>>> Hello,
>>>>
>>>> We just upgraded our dev cluster from Kudu 1.3 to Kudu 1.5.0-cdh5.13.1
>>>> and noticed quite severe performance degradation. We did a CTAS to Kudu
>>>> from an Impala Parquet table, which has not changed a bit since the
>>>> upgrade (even the same # of rows), using the query below.
>>>>
>>>> It used to take 11-11.5 hours on Kudu 1.3 and is now taking 50-55 hours.
>>>>
>>>> Of course, the Impala version was also bumped with CDH 5.13.
>>>>
>>>> Any clue why it takes so much time now?
>>>>
>>>> The table has 5.5B rows.
>>>>
>>>> create TABLE kudutest_ts.clinical_event_nots
>>>> PRIMARY KEY (clinical_event_id)
>>>> PARTITION BY HASH(clinical_event_id) PARTITIONS 120
>>>> STORED AS KUDU
>>>> AS
>>>> SELECT
>>>>   clinical_event_id,
>>>>   encntr_id,
>>>>   person_id,
>>>>   encntr_financial_id,
>>>>   event_id,
>>>>   event_title_text,
>>>>   CAST(view_level as string) as view_level,
>>>>   order_id,
>>>>   catalog_cd,
>>>>   series_ref_nbr,
>>>>   accession_nbr,
>>>>   contributor_system_cd,
>>>>   reference_nbr,
>>>>   parent_event_id,
>>>>   event_reltn_cd,
>>>>   event_class_cd,
>>>>   event_cd,
>>>>   event_tag,
>>>>   CAST(event_end_dt_tm_os as BIGINT) as event_end_dt_tm_os,
>>>>   result_val,
>>>>   result_units_cd,
>>>>   result_time_units_cd,
>>>>   task_assay_cd,
>>>>   record_status_cd,
>>>>   result_status_cd,
>>>>   CAST(authentic_flag as STRING) authentic_flag,
>>>>   CAST(publish_flag as STRING) publish_flag,
>>>>   qc_review_cd,
>>>>   normalcy_cd,
>>>>   normalcy_method_cd,
>>>>   inquire_security_cd,
>>>>   resource_group_cd,
>>>>   resource_cd,
>>>>   CAST(subtable_bit_map as STRING) subtable_bit_map,
>>>>   collating_seq,
>>>>   verified_prsnl_id,
>>>>   performed_prsnl_id,
>>>>   updt_id,
>>>>   CAST(updt_task as STRING) updt_task,
>>>>   updt_cnt,
>>>>   CAST(updt_applctx as STRING) updt_applctx,
>>>>   normal_low,
>>>>   normal_high,
>>>>   critical_low,
>>>>   critical_high,
>>>>   CAST(event_tag_set_flag as STRING) event_tag_set_flag,
>>>>   CAST(note_importance_bit_map as STRING) note_importance_bit_map,
>>>>   CAST(order_action_sequence as STRING) order_action_sequence,
>>>>   entry_mode_cd,
>>>>   source_cd,
>>>>   clinical_seq,
>>>>   CAST(event_end_tz as STRING) event_end_tz,
>>>>   CAST(event_start_tz as STRING) event_start_tz,
>>>>   CAST(performed_tz as STRING) performed_tz,
>>>>   CAST(verified_tz as STRING) verified_tz,
>>>>   task_assay_version_nbr,
>>>>   modifier_long_text_id,
>>>>   ce_dynamic_label_id,
>>>>   CAST(nomen_string_flag as STRING) nomen_string_flag,
>>>>   src_event_id,
>>>>   CAST(last_utc_ts as BIGINT) last_utc_ts,
>>>>   device_free_txt,
>>>>   CAST(trait_bit_map as STRING) trait_bit_map,
>>>>   CAST(clu_subkey1_flag as STRING) clu_subkey1_flag,
>>>>   CAST(clinsig_updt_dt_tm as BIGINT) clinsig_updt_dt_tm,
>>>>   CAST(event_end_dt_tm as BIGINT) event_end_dt_tm,
>>>>   CAST(event_start_dt_tm as BIGINT) event_start_dt_tm,
>>>>   CAST(expiration_dt_tm as BIGINT) expiration_dt_tm,
>>>>   CAST(verified_dt_tm as BIGINT) verified_dt_tm,
>>>>   CAST(src_clinsig_updt_dt_tm as BIGINT) src_clinsig_updt_dt_tm,
>>>>   CAST(updt_dt_tm as BIGINT) updt_dt_tm,
>>>>   CAST(valid_from_dt_tm as BIGINT) valid_from_dt_tm,
>>>>   CAST(valid_until_dt_tm as BIGINT) valid_until_dt_tm,
>>>>   CAST(performed_dt_tm as BIGINT) performed_dt_tm,
>>>>   txn_id_text,
>>>>   CAST(ingest_dt_tm as BIGINT) ingest_dt_tm
>>>> FROM v500.clinical_event
>>>>
>>>
>>>
>>
>>
>> --
>> Todd Lipcon
>> Software Engineer, Cloudera
>>
>
>


-- 
Todd Lipcon
Software Engineer, Cloudera
