Yes, I'm using the master branch, so the version you're running may matter. Here are my steps:
[localhost:21000] default> create external table my_tbl (id int) partitioned by (year int) stored as parquet;
Query: create external table my_tbl (id int) partitioned by (year int) stored as parquet
+-------------------------+
| summary                 |
+-------------------------+
| Table has been created. |
+-------------------------+
Fetched 1 row(s) in 0.08s
[localhost:21000] default> insert into my_tbl partition(year) values (0, 2018), (1, 2019), (2, 2020);
Query: insert into my_tbl partition(year) values (0, 2018), (1, 2019), (2, 2020)
Query submitted at: 2020-05-26 22:15:53 (Coordinator: http://quanlong-OptiPlex-BJ:25000)
Query progress can be monitored at: http://quanlong-OptiPlex-BJ:25000/query_plan?query_id=08452567e478c88a:b8a5509c00000000
Modified 3 row(s) in 4.32s
[localhost:21000] default> show partitions my_tbl;
Query: show partitions my_tbl
+-------+-------+--------+--------+--------------+-------------------+---------+-------------------+--------------------------------------------------------+
| year  | #Rows | #Files | Size   | Bytes Cached | Cache Replication | Format  | Incremental stats | Location                                               |
+-------+-------+--------+--------+--------------+-------------------+---------+-------------------+--------------------------------------------------------+
| 2018  | -1    | 1      | 352B   | NOT CACHED   | NOT CACHED        | PARQUET | false             | hdfs://localhost:20500/test-warehouse/my_tbl/year=2018 |
| 2019  | -1    | 1      | 352B   | NOT CACHED   | NOT CACHED        | PARQUET | false             | hdfs://localhost:20500/test-warehouse/my_tbl/year=2019 |
| 2020  | -1    | 1      | 352B   | NOT CACHED   | NOT CACHED        | PARQUET | false             | hdfs://localhost:20500/test-warehouse/my_tbl/year=2020 |
| Total | -1    | 3      | 1.03KB | 0B           |                   |         |                   |                                                        |
+-------+-------+--------+--------+--------------+-------------------+---------+-------------------+--------------------------------------------------------+
Fetched 4 row(s) in 0.01s
[localhost:21000] default> alter table my_tbl drop partition (year <= 2018);
Query: alter table my_tbl drop partition (year <= 2018)
+-------------------------+
| summary                 |
+-------------------------+
| Dropped 1 partition(s). |
+-------------------------+
Fetched 1 row(s) in 0.10s
[localhost:21000] default> show partitions my_tbl;
Query: show partitions my_tbl
+-------+-------+--------+------+--------------+-------------------+---------+-------------------+--------------------------------------------------------+
| year  | #Rows | #Files | Size | Bytes Cached | Cache Replication | Format  | Incremental stats | Location                                               |
+-------+-------+--------+------+--------------+-------------------+---------+-------------------+--------------------------------------------------------+
| 2019  | -1    | 1      | 352B | NOT CACHED   | NOT CACHED        | PARQUET | false             | hdfs://localhost:20500/test-warehouse/my_tbl/year=2019 |
| 2020  | -1    | 1      | 352B | NOT CACHED   | NOT CACHED        | PARQUET | false             | hdfs://localhost:20500/test-warehouse/my_tbl/year=2020 |
| Total | -1    | 2      | 704B | 0B           |                   |         |                   |                                                        |
+-------+-------+--------+------+--------------+-------------------+---------+-------------------+--------------------------------------------------------+
Fetched 3 row(s) in 0.02s
[localhost:21000] default>
Goodbye quanlong
quanlong@quanlong-OptiPlex-BJ:~/workspace/Impala (master)$ hdfs dfs -ls -R hdfs://localhost:20500/test-warehouse/my_tbl
drwxr-xr-x   - quanlong supergroup          0 2020-05-26 22:15 hdfs://localhost:20500/test-warehouse/my_tbl/_impala_insert_staging
drwxr-xr-x   - quanlong supergroup          0 2020-05-26 22:15 hdfs://localhost:20500/test-warehouse/my_tbl/year=2018
-rw-r--r--   3 quanlong supergroup        352 2020-05-26 22:15 hdfs://localhost:20500/test-warehouse/my_tbl/year=2018/08452567e478c88a-b8a5509c00000000_2146257843_data.0.parq
drwxr-xr-x   - quanlong supergroup          0 2020-05-26 22:15 hdfs://localhost:20500/test-warehouse/my_tbl/year=2019
-rw-r--r--   3 quanlong supergroup        352 2020-05-26 22:15 hdfs://localhost:20500/test-warehouse/my_tbl/year=2019/08452567e478c88a-b8a5509c00000000_627277683_data.0.parq
drwxr-xr-x   - quanlong supergroup          0 2020-05-26 22:15 hdfs://localhost:20500/test-warehouse/my_tbl/year=2020
-rw-r--r--   3 quanlong supergroup        352 2020-05-26 22:15 hdfs://localhost:20500/test-warehouse/my_tbl/year=2020/08452567e478c88a-b8a5509c00000000_1856363806_data.0.parq
quanlong@quanlong-OptiPlex-BJ:~/workspace/Impala (master)$ hdfs dfs -rm -r hdfs://localhost:20500/test-warehouse/my_tbl/year=2018
20/05/26 22:17:18 INFO fs.TrashPolicyDefault: Moved: 'hdfs://localhost:20500/test-warehouse/my_tbl/year=2018' to trash at: hdfs://localhost:20500/user/quanlong/.Trash/Current/test-warehouse/my_tbl/year=20181590502638841
quanlong@quanlong-OptiPlex-BJ:~/workspace/Impala (master)$ bin/impala-shell.sh
Starting Impala Shell with no authentication using Python 2.7.16
WARNING: Unable to import readline module (disabling impala-shell command history): No module named readline
Opened TCP connection to localhost:21000
Connected to localhost:21000
Server version: impalad version 4.0.0-SNAPSHOT DEBUG (build 50c7ced71109a224463e7dcb03c9177edd577658)
***********************************************************************************
Welcome to the Impala shell.
(impala shell build version not available)
Every command must be terminated by a ';'.
***********************************************************************************
[localhost:21000] default> alter table my_tbl recover partitions;
Query: alter table my_tbl recover partitions
+---------------------------------+
| summary                         |
+---------------------------------+
| Partitions have been recovered. |
+---------------------------------+
Fetched 1 row(s) in 0.06s
[localhost:21000] default> show partitions my_tbl;
Query: show partitions my_tbl
+-------+-------+--------+------+--------------+-------------------+---------+-------------------+--------------------------------------------------------+
| year  | #Rows | #Files | Size | Bytes Cached | Cache Replication | Format  | Incremental stats | Location                                               |
+-------+-------+--------+------+--------------+-------------------+---------+-------------------+--------------------------------------------------------+
| 2019  | -1    | 1      | 352B | NOT CACHED   | NOT CACHED        | PARQUET | false             | hdfs://localhost:20500/test-warehouse/my_tbl/year=2019 |
| 2020  | -1    | 1      | 352B | NOT CACHED   | NOT CACHED        | PARQUET | false             | hdfs://localhost:20500/test-warehouse/my_tbl/year=2020 |
| Total | -1    | 2      | 704B | 0B           |                   |         |                   |                                                        |
+-------+-------+--------+------+--------------+-------------------+---------+-------------------+--------------------------------------------------------+
Fetched 3 row(s) in 0.02s

On Tue, May 26, 2020 at 5:34 PM Fawze Abujaber <fawz...@gmail.com> wrote:

> Did you test it on an external table?
>
> On Tue, 26 May 2020 at 10:56 Quanlong Huang <huangquanl...@gmail.com>
> wrote:
>
>> Hmm... I can't reproduce this in my local env. Could you create a JIRA
>> with your steps and version?
>>
>> On Tue, May 26, 2020 at 3:49 PM Fawze Abujaber <fawz...@gmail.com> wrote:
>>
>>> I already checked this and the data is no longer in HDFS, and yes, the
>>> location is shown in the show partitions output.
>>>
>>> On Tue, May 26, 2020 at 10:01 AM Quanlong Huang <huangquanl...@gmail.com>
>>> wrote:
>>>
>>>>> For example, if I have a table with 3 partitions year=2018, 2019 and
>>>>> 2020, and HDFS files only for 2019 and 2020, when running drop
>>>>> partition year<=2018, then yes, I see only the partitions for 2019 and
>>>>> 2020, but if I run recover partitions I will see 2018 again, and I
>>>>> assume this is related to the cache.
>>>>>
>>>> This should not happen if you have deleted the partition directory of
>>>> year=2018. Could you make sure that the location of partition
>>>> (year=2018) no longer exists? The locations are shown in the last
>>>> column of SHOW PARTITIONS.
>>>>
>>>> On Tue, May 26, 2020 at 11:46 AM Fawze Abujaber <fawz...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Quanlong,
>>>>>
>>>>> Thanks for your response.
>>>>>
>>>>> Yes, we are using DROP PARTITION with an expression on a few Impala
>>>>> internal tables.
>>>>>
>>>>> We have more than 200 external tables, all partitioned by year, month
>>>>> and day, each with a different retention policy, and we manage the
>>>>> retention on the HDFS side by deleting the files.
>>>>> It seems there is no straightforward solution to my issue. I also
>>>>> noticed that running DROP PARTITION and then recover partitions gets
>>>>> me back to the same situation, even though the HDFS files no longer
>>>>> exist.
>>>>>
>>>>> For example, if I have a table with 3 partitions year=2018, 2019 and
>>>>> 2020, and HDFS files only for 2019 and 2020, when running drop
>>>>> partition year<=2018, then yes, I see only the partitions for 2019 and
>>>>> 2020, but if I run recover partitions I will see 2018 again, and I
>>>>> assume this is related to the cache.
>>>>>
>>>>> On Mon, May 25, 2020 at 5:55 PM Quanlong Huang <
>>>>> huangquanl...@gmail.com> wrote:
>>>>>
>>>>>> Hi Fawze,
>>>>>>
>>>>>>> 1) Are these partitions impacting the Hive metastore performance and
>>>>>>> memory used by the metastore?
>>>>>>
>>>>>> Do you drop the partitions after you clean the parquet files? If not,
>>>>>> the partition metadata still exists in the Hive metastore, which may
>>>>>> impact its performance when the partition count piles up.
>>>>>>
>>>>>>> 2) Is there a way to drop these partitions without running ALTER
>>>>>>> TABLE xxx DROP IF EXISTS PARTITION? I believe DROP and CREATE TABLE
>>>>>>> can also make this change, but that is a heavy solution.
>>>>>>>
>>>>>> I think explicitly dropping these partitions is the simplest way. If
>>>>>> you need to do this in one statement, you can use expressions in the
>>>>>> partition clause to specify several partitions at once, e.g. ALTER
>>>>>> TABLE xxx DROP IF EXISTS PARTITION (year = 2018 AND month = 4 AND
>>>>>> day < 15).
>>>>>>
>>>>>>> 3) Is ALTER TABLE xxxx recover partitions looking at the newly
>>>>>>> created partitions?
>>>>>>>
>>>>>> Yes, it compares the cached partition metadata with all partition
>>>>>> directories on HDFS to get the diff, then creates those partitions in
>>>>>> Hive and loads their metadata.
>>>>>>
>>>>>> Quanlong
>>>>>>
>>>>>> On Fri, May 22, 2020 at 5:09 PM Fawze Abujaber <fawz...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hello community,
>>>>>>>
>>>>>>> I have a Spark job that writes parquet files partitioned by year,
>>>>>>> month and day, and on top of these parquet files I'm creating
>>>>>>> different Impala external tables. I have a retention job that cleans
>>>>>>> the parquet files on a daily basis. On the Impala side I'm running a
>>>>>>> daily ALTER TABLE xxxxxx recover partitions and an hourly REFRESH
>>>>>>> TABLE xxxx. When I run show partitions I see old partitions, and
>>>>>>> they show up with zero size and zero files.
>>>>>>>
>>>>>>> 1) Are these partitions impacting the Hive metastore performance and
>>>>>>> memory used by the metastore?
>>>>>>>
>>>>>>> 2) Is there a way to drop these partitions without running ALTER
>>>>>>> TABLE xxx DROP IF EXISTS PARTITION? I believe DROP and CREATE TABLE
>>>>>>> can also make this change, but that is a heavy solution.
>>>>>>>
>>>>>>> 3) Is ALTER TABLE xxxx recover partitions looking at the newly
>>>>>>> created partitions?
>>>>>>>
>>>>>>> --
>>>>>>> Take Care
>>>>>>> Fawze Abujaber
>>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> Take Care
>>>>> Fawze Abujaber
>>>>
>>>
>>> --
>>> Take Care
>>> Fawze Abujaber
>>
>
> --
> Take Care
> Fawze Abujaber
>
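To recap the statements discussed in this thread, a minimal sketch of the retention flow could look like the following. The table name my_logs, the partition columns and the cutoff values are illustrative only, not taken from the thread:

-- Drop the metadata of partitions whose files the retention job has already
-- removed; an expression in the partition clause covers a whole range at once.
ALTER TABLE my_logs DROP IF EXISTS PARTITION (year < 2019);
ALTER TABLE my_logs DROP IF EXISTS PARTITION (year = 2019 AND month = 4 AND day < 15);

-- Pick up partition directories that were newly written on HDFS.
ALTER TABLE my_logs RECOVER PARTITIONS;

-- Reload the file metadata of existing partitions after their files change.
REFRESH my_logs;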