I already checked this: the data is no longer in HDFS, and yes, the partition is shown with its location in SHOW PARTITIONS.
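For reference, a minimal sketch of the Impala statements discussed in the thread below (the table name events is just a placeholder, and the partition predicate should match your own partitioning scheme):

  -- List partitions; the last column shows each partition's HDFS location.
  SHOW PARTITIONS events;

  -- Drop stale partition metadata; an expression can cover several partitions at once.
  ALTER TABLE events DROP IF EXISTS PARTITION (year <= 2018);

  -- Re-scan HDFS for partition directories, then reload file metadata.
  ALTER TABLE events RECOVER PARTITIONS;
  REFRESH events;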
On Tue, May 26, 2020 at 10:01 AM Quanlong Huang <huangquanl...@gmail.com> wrote:

>> For example, if I have a table with 3 partitions, year=2018, 2019 and 2020,
>> and HDFS files only for 2019 and 2020, then when running DROP
>> PARTITION year<=2018 I see only the partitions for 2019 and 2020, but
>> if I run RECOVER PARTITIONS I will see 2018 again, and I assume this is
>> related to the cache.
>>
> This should not happen if you have deleted the partition directory of
> year=2018. Could you make sure that the location of partition (year=2018)
> no longer exists? The locations are shown in the last column of SHOW
> PARTITIONS.
>
> On Tue, May 26, 2020 at 11:46 AM Fawze Abujaber <fawz...@gmail.com> wrote:
>
>> Hi Quanlong,
>>
>> Thanks for your response.
>>
>> Yes, we are using DROP PARTITION with an expression on a few Impala
>> internal tables.
>>
>> We have more than 200 external tables, all partitioned by year, month
>> and day, and each has a different retention policy; we manage the
>> retention on the HDFS side by deleting the files.
>> It seems there is no straightforward solution to my issue. I also
>> noticed that running DROP PARTITION and then RECOVER PARTITIONS gets me
>> back to the same situation, even though the HDFS files no longer exist.
>>
>> For example, if I have a table with 3 partitions, year=2018, 2019 and 2020,
>> and HDFS files only for 2019 and 2020, then when running DROP
>> PARTITION year<=2018 I see only the partitions for 2019 and 2020, but
>> if I run RECOVER PARTITIONS I will see 2018 again, and I assume this is
>> related to the cache.
>>
>> On Mon, May 25, 2020 at 5:55 PM Quanlong Huang <huangquanl...@gmail.com>
>> wrote:
>>
>>> Hi Fawze,
>>>
>>>> 1) Are these partitions impacting the Hive metastore performance and
>>>> the memory used by the metastore?
>>>
>>> Do you drop the partitions after you clean the parquet files? If not,
>>> the partition metadata still exists in the Hive metastore, which may
>>> impact its performance when the partition count piles up.
>>>
>>>> 2) Is there a way to drop these partitions without running ALTER TABLE
>>>> xxx DROP IF EXISTS PARTITION? I believe DROP and CREATE TABLE can also
>>>> make this change, but that is a heavy solution.
>>>>
>>> I think explicitly dropping these partitions is the simplest way. If
>>> your requirement is doing this in one statement, you can use expressions
>>> in the partition clause to specify several partitions at once, e.g. ALTER
>>> TABLE xxx DROP IF EXISTS PARTITION (year = 2018 AND month = 4 AND day < 15).
>>>
>>>> 3) Is ALTER TABLE xxxx RECOVER PARTITIONS looking at the newly created
>>>> partitions?
>>>>
>>> Yes, it compares the partitions in the cached metadata with all partition
>>> directories on HDFS to get the diff, then creates those partitions in Hive
>>> and loads their metadata.
>>>
>>> Quanlong
>>>
>>> On Fri, May 22, 2020 at 5:09 PM Fawze Abujaber <fawz...@gmail.com>
>>> wrote:
>>>
>>>> Hello community,
>>>>
>>>> I have a Spark job that writes parquet files partitioned by year, month
>>>> and day. On top of these parquet files I'm creating different Impala
>>>> external tables. I have a retention job that cleans the parquet files on
>>>> a daily basis. On the Impala side I'm running a daily ALTER TABLE xxxxxx
>>>> RECOVER PARTITIONS and an hourly REFRESH TABLE xxxx. When I run SHOW
>>>> PARTITIONS I see old partitions, and they show up with zero size and
>>>> zero files.
>>>>
>>>> 1) Are these partitions impacting the Hive metastore performance and
>>>> the memory used by the metastore?
>>>>
>>>> 2) Is there a way to drop these partitions without running ALTER TABLE
>>>> xxx DROP IF EXISTS PARTITION? I believe DROP and CREATE TABLE can also
>>>> make this change, but that is a heavy solution.
>>>>
>>>> 3) Is ALTER TABLE xxxx RECOVER PARTITIONS looking at the newly created
>>>> partitions?
>>>>
>>>> --
>>>> Take Care
>>>> Fawze Abujaber
>>>
>>
>> --
>> Take Care
>> Fawze Abujaber
>

--
Take Care
Fawze Abujaber