Thanks Alot,

Just to close the loop for my self for SYN_DDL i can add  "SYNC_DDL=1" in
"Impala Daemon Query Options Advanced Configuration Snippet in Cloudera
manager and i shouldn't add it anywhere in the queries ( even the DDL was
done in hive or impala) , right?

For the invalidate and REFRESH, i think i got the point and it much clear
for me.

Quanlong, so much appreciate your help here.



On Wed, Jul 4, 2018 at 5:34 PM, Quanlong Huang <[email protected]>
wrote:

> > For your 2nd job, you don't need to run INVALIDATE METADATA statements
> since DDL and DML in impala will refresh the metadata automatically.
> > -- For the invalidate i almost sure i don't need, but what about
> the SET SYNC_DDL=1, should i use it in the 2nd case, and should i add it
> after the ALTER statement?
>
> You mentioned that you have load balancing for impalads. So you should add
> "SET SYNC_DDL=1" *before* the ALTER statement and in the same session.
> For example, in your 2nd job:
> impala-shell -i "$LPIMPALA" -q "*SET SYNC_DDL=1;* ALTER TABLE table_name2
> DROP IF EXISTS PARTITION (dt<'$RetentionDate');"
>
> > For your 1st job, again, you don't need to run the INVALIDATE METADATA
> statement after your ALTER TABLE statement in Impala. After the INSERT
> OVERWRITE statement in hive, you just need to run a REFRESH statement
> instead of the INVALIDATE METADATA statement.
> > -- So you mean i don't need after dropping some partitions to run
> refresh statement even? after the insert i need the refresh.
>
> Yes. The REFRESH statement is for modifications outside Impala (e.g. in
> hive). If you run ALTER TABLE inside Impala, you don't need the REFRESH
> statement.
>
> > What about  SET SYNC_DDL=1? where i should set it, is it something that
> i can do this globally at impala and not bu query?
>
> If you're accessing Impala through a load-balancing proxy, you should SET
> SYNC_DDL=1 before any DDL or DML statements. There's a more detail doc
> explaining this: https://www.cloudera.com/documentation/enterprise/
> 5-14-x/topics/impala_sync_ddl.html
> You can set it globally by adding --default_query_options=SYNC_DDL=1 in
> the startup options. If you're using Cloudera Manager, you can add
> "SYNC_DDL=1" in "Impala Daemon Query Options Advanced Configuration Snippet
> (Safety Valve)".
>
> > Do you recommend me to use REFRSH or ALTER table x Recover partitions?
>
> I'm not quite sure about this. But I'm sure that if you're just adding new
> partitions into a table, you can use ALTER TABLE x Recover partitions.
> Otherwise, I believe you should not use it.
>
> At 2018-07-04 21:39:29,"Fawze Abujaber" <[email protected]> wrote:
>
> Hi Quanlong,
>
> Thanks for your quick response.
>
> This is an ETLs that i inherent from other engineers, so trying to make
> sure the changes i'm planning to do will not impact the current work.
>
> Please see my responses inline.
>
>
>
> On Wed, Jul 4, 2018 at 4:23 PM, Quanlong Huang <[email protected]>
> wrote:
>
>> Hi Fawze,
>>
>> As you need a quick consult, let me answer your questions first. The
>> official Cloudera folks will correct me if something wrong.
>>
>> For your 2nd job, you don't need to run INVALIDATE METADATA statements
>> since DDL and DML in impala will refresh the metadata automatically.
>> -- For the invalidate i almost sure i don't need, but what about the SET
>> SYNC_DDL=1, should i use it in the 2nd case, and should i add it after the
>> ALTER statement?
>> For your 1st job, again, you don't need to run the INVALIDATE METADATA
>> statement after your ALTER TABLE statement in Impala. After the INSERT
>> OVERWRITE statement in hive, you just need to run a REFRESH statement
>> instead of the INVALIDATE METADATA statement.
>> -- So you mean i don't need after dropping some partitions to run refresh
>> statement even? after the insert i need the refresh.
>>
> What about  SET SYNC_DDL=1? where i should set it, is it something that i
> can do this globally at impala and not bu query?
> Do you recommend me to use REFRSH or ALTER table x Recover partitions?
>
>> BTW, what do you mean by "set DDL=1"? If you're mentioning the SYNC_DDL
>> option, it's recommended to set it to true if you're using load balancer
>> above of the impala daemons.
>>
>> If you're interested in the difference between INVALIDATE METADATA and
>> REFRESH, there's a discussion thread:
>> https://lists.apache.org/thread.html/44b4f223b49e8d7cb07865d
>> 5dfba50378d6e41fc173369b56c36b652@%3Cdev.impala.apache.org%3E
>>
>> In a nutshell, there're only two cases you should use the INVALIDATE
>> METADATA statement:
>> - you create a table in hive and want to use it in Impala
>> - the block locations of the underlying hive table files changed by HDFS
>> balancer
>> You can ignore the second case since it's just for performance, not for
>> correctness. In other cases like adding/dropping files/partitions in hive,
>> you just need the REFRESH statement.
>>
>> Regards,
>> Quanlong
>>
>>
>> At 2018-07-04 18:11:41, "Fawze Abujaber" <[email protected]> wrote:
>>
>> Hi Community,
>>
>> I need a quick consult for using the SET DDL while i'm using impala VIP
>> for the load balancing.
>>
>> Below are 2 ETL jobs, the first one is using hive and the second is
>> impala and both have DDL operations.
>>
>> I will start with the 2nd as it should stright forward since the data
>> inserted into the table_name2 by impala, so i think i don't need the
>> invalidate statement but still need the DDL statement, is it right?
>> should i add the set DDL after ALTER table statement? do you really need
>> it as the insert isn't based on the ALTER table statement since it's
>> dropping old partitions.
>>
>> For the 1st one where i'm using hive, I know i have to use invalidate
>> metadata, i was thinking to use the set DDL=1 after each invalidate
>> metadata statement, should i use after both statements, or it's enough to
>> be used only in the last one?
>> I read the documntation about this but wasn't sure if get this point
>> right, so this is why i'm asking for consultation here, note a side i still
>> think that i should use  refresh statement after the insert while using
>> invalidate after the ALTER statement.
>>
>> Thanks for your help in advance.
>>
>>
>>
>> RetentionDate=$(date --date="183 days ago" +%Y%m%d)
>> impala-shell -i "$LPIMPALA" -q "ALTER TABLE table_name DROP IF EXISTS
>> PARTITION (concat(year,month,day)<'$RetentionDate');"
>> impala-shell -i "$LPIMPALA" -q "INVALIDATE METADATA table_name;"
>>
>> hive -e"
>> set parquet.compression=SNAPPY;
>> set hive.merge.mapredfiles=true;
>> set hive.mapred.mode=nonstrict;
>> set hive.exec.dynamic.partition.mode=nonstrict;
>>
>>
>> INSERT OVERWRITE TABLE table_name PARTITION (year, month, day)
>>
>> ....
>> ....
>> ....
>> ....
>>
>> ;"
>> impala-shell -i "$LPIMPALA" -q "INVALIDATE METADATA table_name;"
>> impala-shell -i "$LPIMPALA" -q "compute INCREMENTAL STATS table_name;"
>>
>>
>>
>>
>> =======================================================
>>
>> 2nd one.
>>
>> impala-shell -i "$LPIMPALA" -q "ALTER TABLE table_name2 DROP IF EXISTS
>> PARTITION (dt<'$RetentionDate');"
>> impala-shell -i "$LPIMPALA" -q "INVALIDATE METADATA table_name2;"
>>
>> impala-shell -B -i "$LPIMPALA" -q"
>>
>>
>> INSERT OVERWRITE TABLE table_name2 PARTITION (dt)
>> ....
>> ....
>> ....
>> ....
>> "
>>
>> impala-shell -i "$LPIMPALA" -q "INVALIDATE METADATA table_name2;"
>>
>> --
>> Take Care
>> Fawze Abujaber
>>
>>
>
>
> --
> Take Care
> Fawze Abujaber
>
>


-- 
Take Care
Fawze Abujaber

Reply via email to