Thanks Alot, Just to close the loop for my self for SYN_DDL i can add "SYNC_DDL=1" in "Impala Daemon Query Options Advanced Configuration Snippet in Cloudera manager and i shouldn't add it anywhere in the queries ( even the DDL was done in hive or impala) , right?
For the invalidate and REFRESH, i think i got the point and it much clear for me. Quanlong, so much appreciate your help here. On Wed, Jul 4, 2018 at 5:34 PM, Quanlong Huang <[email protected]> wrote: > > For your 2nd job, you don't need to run INVALIDATE METADATA statements > since DDL and DML in impala will refresh the metadata automatically. > > -- For the invalidate i almost sure i don't need, but what about > the SET SYNC_DDL=1, should i use it in the 2nd case, and should i add it > after the ALTER statement? > > You mentioned that you have load balancing for impalads. So you should add > "SET SYNC_DDL=1" *before* the ALTER statement and in the same session. > For example, in your 2nd job: > impala-shell -i "$LPIMPALA" -q "*SET SYNC_DDL=1;* ALTER TABLE table_name2 > DROP IF EXISTS PARTITION (dt<'$RetentionDate');" > > > For your 1st job, again, you don't need to run the INVALIDATE METADATA > statement after your ALTER TABLE statement in Impala. After the INSERT > OVERWRITE statement in hive, you just need to run a REFRESH statement > instead of the INVALIDATE METADATA statement. > > -- So you mean i don't need after dropping some partitions to run > refresh statement even? after the insert i need the refresh. > > Yes. The REFRESH statement is for modifications outside Impala (e.g. in > hive). If you run ALTER TABLE inside Impala, you don't need the REFRESH > statement. > > > What about SET SYNC_DDL=1? where i should set it, is it something that > i can do this globally at impala and not bu query? > > If you're accessing Impala through a load-balancing proxy, you should SET > SYNC_DDL=1 before any DDL or DML statements. There's a more detail doc > explaining this: https://www.cloudera.com/documentation/enterprise/ > 5-14-x/topics/impala_sync_ddl.html > You can set it globally by adding --default_query_options=SYNC_DDL=1 in > the startup options. If you're using Cloudera Manager, you can add > "SYNC_DDL=1" in "Impala Daemon Query Options Advanced Configuration Snippet > (Safety Valve)". > > > Do you recommend me to use REFRSH or ALTER table x Recover partitions? > > I'm not quite sure about this. But I'm sure that if you're just adding new > partitions into a table, you can use ALTER TABLE x Recover partitions. > Otherwise, I believe you should not use it. > > At 2018-07-04 21:39:29,"Fawze Abujaber" <[email protected]> wrote: > > Hi Quanlong, > > Thanks for your quick response. > > This is an ETLs that i inherent from other engineers, so trying to make > sure the changes i'm planning to do will not impact the current work. > > Please see my responses inline. > > > > On Wed, Jul 4, 2018 at 4:23 PM, Quanlong Huang <[email protected]> > wrote: > >> Hi Fawze, >> >> As you need a quick consult, let me answer your questions first. The >> official Cloudera folks will correct me if something wrong. >> >> For your 2nd job, you don't need to run INVALIDATE METADATA statements >> since DDL and DML in impala will refresh the metadata automatically. >> -- For the invalidate i almost sure i don't need, but what about the SET >> SYNC_DDL=1, should i use it in the 2nd case, and should i add it after the >> ALTER statement? >> For your 1st job, again, you don't need to run the INVALIDATE METADATA >> statement after your ALTER TABLE statement in Impala. After the INSERT >> OVERWRITE statement in hive, you just need to run a REFRESH statement >> instead of the INVALIDATE METADATA statement. >> -- So you mean i don't need after dropping some partitions to run refresh >> statement even? after the insert i need the refresh. >> > What about SET SYNC_DDL=1? where i should set it, is it something that i > can do this globally at impala and not bu query? > Do you recommend me to use REFRSH or ALTER table x Recover partitions? > >> BTW, what do you mean by "set DDL=1"? If you're mentioning the SYNC_DDL >> option, it's recommended to set it to true if you're using load balancer >> above of the impala daemons. >> >> If you're interested in the difference between INVALIDATE METADATA and >> REFRESH, there's a discussion thread: >> https://lists.apache.org/thread.html/44b4f223b49e8d7cb07865d >> 5dfba50378d6e41fc173369b56c36b652@%3Cdev.impala.apache.org%3E >> >> In a nutshell, there're only two cases you should use the INVALIDATE >> METADATA statement: >> - you create a table in hive and want to use it in Impala >> - the block locations of the underlying hive table files changed by HDFS >> balancer >> You can ignore the second case since it's just for performance, not for >> correctness. In other cases like adding/dropping files/partitions in hive, >> you just need the REFRESH statement. >> >> Regards, >> Quanlong >> >> >> At 2018-07-04 18:11:41, "Fawze Abujaber" <[email protected]> wrote: >> >> Hi Community, >> >> I need a quick consult for using the SET DDL while i'm using impala VIP >> for the load balancing. >> >> Below are 2 ETL jobs, the first one is using hive and the second is >> impala and both have DDL operations. >> >> I will start with the 2nd as it should stright forward since the data >> inserted into the table_name2 by impala, so i think i don't need the >> invalidate statement but still need the DDL statement, is it right? >> should i add the set DDL after ALTER table statement? do you really need >> it as the insert isn't based on the ALTER table statement since it's >> dropping old partitions. >> >> For the 1st one where i'm using hive, I know i have to use invalidate >> metadata, i was thinking to use the set DDL=1 after each invalidate >> metadata statement, should i use after both statements, or it's enough to >> be used only in the last one? >> I read the documntation about this but wasn't sure if get this point >> right, so this is why i'm asking for consultation here, note a side i still >> think that i should use refresh statement after the insert while using >> invalidate after the ALTER statement. >> >> Thanks for your help in advance. >> >> >> >> RetentionDate=$(date --date="183 days ago" +%Y%m%d) >> impala-shell -i "$LPIMPALA" -q "ALTER TABLE table_name DROP IF EXISTS >> PARTITION (concat(year,month,day)<'$RetentionDate');" >> impala-shell -i "$LPIMPALA" -q "INVALIDATE METADATA table_name;" >> >> hive -e" >> set parquet.compression=SNAPPY; >> set hive.merge.mapredfiles=true; >> set hive.mapred.mode=nonstrict; >> set hive.exec.dynamic.partition.mode=nonstrict; >> >> >> INSERT OVERWRITE TABLE table_name PARTITION (year, month, day) >> >> .... >> .... >> .... >> .... >> >> ;" >> impala-shell -i "$LPIMPALA" -q "INVALIDATE METADATA table_name;" >> impala-shell -i "$LPIMPALA" -q "compute INCREMENTAL STATS table_name;" >> >> >> >> >> ======================================================= >> >> 2nd one. >> >> impala-shell -i "$LPIMPALA" -q "ALTER TABLE table_name2 DROP IF EXISTS >> PARTITION (dt<'$RetentionDate');" >> impala-shell -i "$LPIMPALA" -q "INVALIDATE METADATA table_name2;" >> >> impala-shell -B -i "$LPIMPALA" -q" >> >> >> INSERT OVERWRITE TABLE table_name2 PARTITION (dt) >> .... >> .... >> .... >> .... >> " >> >> impala-shell -i "$LPIMPALA" -q "INVALIDATE METADATA table_name2;" >> >> -- >> Take Care >> Fawze Abujaber >> >> > > > -- > Take Care > Fawze Abujaber > > -- Take Care Fawze Abujaber
