Yes, I'm using the master branch. So your version matters. Here are my
steps:

[localhost:21000] default> create external table my_tbl (id int)
partitioned by (year int) stored as parquet;
Query: create external table my_tbl (id int) partitioned by (year int)
stored as parquet
+-------------------------+
| summary                 |
+-------------------------+
| Table has been created. |
+-------------------------+
Fetched 1 row(s) in 0.08s
[localhost:21000] default> insert into my_tbl partition(year) values (0,
2018), (1, 2019), (2, 2020);
Query: insert into my_tbl partition(year) values (0, 2018), (1, 2019), (2,
2020)
Query submitted at: 2020-05-26 22:15:53 (Coordinator:
http://quanlong-OptiPlex-BJ:25000)
Query progress can be monitored at:
http://quanlong-OptiPlex-BJ:25000/query_plan?query_id=08452567e478c88a:b8a5509c00000000
Modified 3 row(s) in 4.32s
[localhost:21000] default> show partitions my_tbl;
Query: show partitions my_tbl
+-------+-------+--------+--------+--------------+-------------------+---------+-------------------+--------------------------------------------------------+
| year  | #Rows | #Files | Size   | Bytes Cached | Cache Replication |
Format  | Incremental stats | Location
          |
+-------+-------+--------+--------+--------------+-------------------+---------+-------------------+--------------------------------------------------------+
| 2018  | -1    | 1      | 352B   | NOT CACHED   | NOT CACHED        |
PARQUET | false             |
hdfs://localhost:20500/test-warehouse/my_tbl/year=2018 |
| 2019  | -1    | 1      | 352B   | NOT CACHED   | NOT CACHED        |
PARQUET | false             |
hdfs://localhost:20500/test-warehouse/my_tbl/year=2019 |
| 2020  | -1    | 1      | 352B   | NOT CACHED   | NOT CACHED        |
PARQUET | false             |
hdfs://localhost:20500/test-warehouse/my_tbl/year=2020 |
| Total | -1    | 3      | 1.03KB | 0B           |                   |
    |                   |
     |
+-------+-------+--------+--------+--------------+-------------------+---------+-------------------+--------------------------------------------------------+
Fetched 4 row(s) in 0.01s
[localhost:21000] default> alter table my_tbl drop partition (year <= 2018);
Query: alter table my_tbl drop partition (year <= 2018)
+-------------------------+
| summary                 |
+-------------------------+
| Dropped 1 partition(s). |
+-------------------------+
Fetched 1 row(s) in 0.10s
[localhost:21000] default> show partitions my_tbl;
Query: show partitions my_tbl
+-------+-------+--------+------+--------------+-------------------+---------+-------------------+--------------------------------------------------------+
| year  | #Rows | #Files | Size | Bytes Cached | Cache Replication | Format
 | Incremental stats | Location
  |
+-------+-------+--------+------+--------------+-------------------+---------+-------------------+--------------------------------------------------------+
| 2019  | -1    | 1      | 352B | NOT CACHED   | NOT CACHED        |
PARQUET | false             |
hdfs://localhost:20500/test-warehouse/my_tbl/year=2019 |
| 2020  | -1    | 1      | 352B | NOT CACHED   | NOT CACHED        |
PARQUET | false             |
hdfs://localhost:20500/test-warehouse/my_tbl/year=2020 |
| Total | -1    | 2      | 704B | 0B           |                   |
  |                   |
   |
+-------+-------+--------+------+--------------+-------------------+---------+-------------------+--------------------------------------------------------+
Fetched 3 row(s) in 0.02s
[localhost:21000] default> Goodbye quanlong
quanlong@quanlong-OptiPlex-BJ:~/workspace/Impala (master)$ hdfs dfs -ls -R
hdfs://localhost:20500/test-warehouse/my_tbl
drwxr-xr-x   - quanlong supergroup          0 2020-05-26 22:15
hdfs://localhost:20500/test-warehouse/my_tbl/_impala_insert_staging
drwxr-xr-x   - quanlong supergroup          0 2020-05-26 22:15
hdfs://localhost:20500/test-warehouse/my_tbl/year=2018
-rw-r--r--   3 quanlong supergroup        352 2020-05-26 22:15
hdfs://localhost:20500/test-warehouse/my_tbl/year=2018/08452567e478c88a-b8a5509c00000000_2146257843_data.0.parq
drwxr-xr-x   - quanlong supergroup          0 2020-05-26 22:15
hdfs://localhost:20500/test-warehouse/my_tbl/year=2019
-rw-r--r--   3 quanlong supergroup        352 2020-05-26 22:15
hdfs://localhost:20500/test-warehouse/my_tbl/year=2019/08452567e478c88a-b8a5509c00000000_627277683_data.0.parq
drwxr-xr-x   - quanlong supergroup          0 2020-05-26 22:15
hdfs://localhost:20500/test-warehouse/my_tbl/year=2020
-rw-r--r--   3 quanlong supergroup        352 2020-05-26 22:15
hdfs://localhost:20500/test-warehouse/my_tbl/year=2020/08452567e478c88a-b8a5509c00000000_1856363806_data.0.parq
quanlong@quanlong-OptiPlex-BJ:~/workspace/Impala (master)$ hdfs dfs -rm -r
hdfs://localhost:20500/test-warehouse/my_tbl/year=2018
20/05/26 22:17:18 INFO fs.TrashPolicyDefault: Moved:
'hdfs://localhost:20500/test-warehouse/my_tbl/year=2018' to trash at:
hdfs://localhost:20500/user/quanlong/.Trash/Current/test-warehouse/my_tbl/year=20181590502638841
quanlong@quanlong-OptiPlex-BJ:~/workspace/Impala (master)$
bin/impala-shell.sh
Starting Impala Shell with no authentication using Python 2.7.16
WARNING: Unable to import readline module (disabling impala-shell command
history): No module named readline
Opened TCP connection to localhost:21000
Connected to localhost:21000
Server version: impalad version 4.0.0-SNAPSHOT DEBUG (build
50c7ced71109a224463e7dcb03c9177edd577658)
***********************************************************************************
Welcome to the Impala shell.
(impala shell build version not available)

Every command must be terminated by a ';'.
***********************************************************************************
[localhost:21000] default> alter table my_tbl recover partitions;
Query: alter table my_tbl recover partitions
+---------------------------------+
| summary                         |
+---------------------------------+
| Partitions have been recovered. |
+---------------------------------+
Fetched 1 row(s) in 0.06s
[localhost:21000] default> show partitions my_tbl;
Query: show partitions my_tbl
+-------+-------+--------+------+--------------+-------------------+---------+-------------------+--------------------------------------------------------+
| year  | #Rows | #Files | Size | Bytes Cached | Cache Replication | Format
 | Incremental stats | Location
  |
+-------+-------+--------+------+--------------+-------------------+---------+-------------------+--------------------------------------------------------+
| 2019  | -1    | 1      | 352B | NOT CACHED   | NOT CACHED        |
PARQUET | false             |
hdfs://localhost:20500/test-warehouse/my_tbl/year=2019 |
| 2020  | -1    | 1      | 352B | NOT CACHED   | NOT CACHED        |
PARQUET | false             |
hdfs://localhost:20500/test-warehouse/my_tbl/year=2020 |
| Total | -1    | 2      | 704B | 0B           |                   |
  |                   |
   |
+-------+-------+--------+------+--------------+-------------------+---------+-------------------+--------------------------------------------------------+
Fetched 3 row(s) in 0.02s


On Tue, May 26, 2020 at 5:34 PM Fawze Abujaber <fawz...@gmail.com> wrote:

> Did you test it on external table?
>
> On Tue, 26 May 2020 at 10:56 Quanlong Huang <huangquanl...@gmail.com>
> wrote:
>
>> hmm... I can't reproduce this in my local env. Could you create a JIRA
>> with your steps and version?
>>
>> On Tue, May 26, 2020 at 3:49 PM Fawze Abujaber <fawz...@gmail.com> wrote:
>>
>>> I already checked this and the data is not longer in HDFS and in the
>>> show partitions yes it shown in the location
>>>
>>> On Tue, May 26, 2020 at 10:01 AM Quanlong Huang <huangquanl...@gmail.com>
>>> wrote:
>>>
>>>> for example if i have a table with 3 partitions year=2018,2019 and
>>>>> 2020, and hdfs files only for 2019 and 2020, when running drop
>>>>> partition year<=2018, yes then i see only  partitions of 2019 and 2020 but
>>>>> if i run recover partitions i will see 2018, and i assume this related to
>>>>> the cache.
>>>>>
>>>> This should not happen if you have deleted the partition directory of
>>>> year=2018. Could you make sure that the location of partition (year=2018)
>>>> is no longer exists? The locations are shown in the last column of SHOW
>>>> PARTITIONS.
>>>>
>>>> On Tue, May 26, 2020 at 11:46 AM Fawze Abujaber <fawz...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Quanlong,
>>>>>
>>>>> Thanks for your response,
>>>>>
>>>>> Yes we are using the DROP partition with expression on few impala
>>>>> internal tables.
>>>>>
>>>>> We have more than 200 external tables all are partitions by year,month
>>>>> and day and each has different retention policy and we are managing the
>>>>> retention in the HDFS side by deleting the files.
>>>>> seems like no straight forward solution for solving my issue, also i
>>>>> noticed that running DROP PARTITION and then recover partitions will get 
>>>>> me
>>>>> back to the same situation even the hdfs files are not exist.
>>>>>
>>>>> for example if i have a table with 3 partitions year=2018,2019 and
>>>>> 2020, and hdfs files only for 2019 and 2020, when running drop
>>>>> partition year<=2018, yes then i see only  partitions of 2019 and 2020 but
>>>>> if i run recover partitions i will see 2018, and i assume this related to
>>>>> the cache.
>>>>>
>>>>> On Mon, May 25, 2020 at 5:55 PM Quanlong Huang <
>>>>> huangquanl...@gmail.com> wrote:
>>>>>
>>>>>> Hi Fawze,
>>>>>>
>>>>>> 1) are these partitions impacting the hive metastore performance and
>>>>>>> memory used by the metastore?
>>>>>>
>>>>>> Do you drop the partitions after you clean the parquet files? If not,
>>>>>> the partition metadata still exists in hive metastore, which may impact 
>>>>>> its
>>>>>> performance when the partition number piles up.
>>>>>>
>>>>>> 2) Is there a way to drop these partitions without running ALTER
>>>>>>> TABLE xxx DROP IF EXISTS PARTITION, i believe also DROP and CREATE table
>>>>>>> can make this change but this is a heavy solution.
>>>>>>>
>>>>>> I think explicitly dropping these partitions is the simplest way. If
>>>>>> your requirement is doing this by one statement, you can use expressions 
>>>>>> in
>>>>>> the partition clause to specify several partitions at once, e.g. ALTER
>>>>>> TABLE xxx DROP IF EXISTS PARTITION (year = 2018 AND month = 4 AND day < 
>>>>>> 15).
>>>>>>
>>>>>> 3) Is ALTER TABLE xxxx recover partitions looking on the newly
>>>>>>> created partitions?
>>>>>>>
>>>>>> Yes, it compares partitions with cached meta and all partition
>>>>>> directories on HDFS to get the diff. Then create thoes partitions in Hive
>>>>>> and load their meta.
>>>>>>
>>>>>> Quanlong
>>>>>>
>>>>>> On Fri, May 22, 2020 at 5:09 PM Fawze Abujaber <fawz...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hello community,
>>>>>>>
>>>>>>> I have a spark job that writes parquet files partitioned by year,
>>>>>>> month and day, in top of these parquet files i'm creating different 
>>>>>>> impala
>>>>>>> external tables, i have a retention job that cleans the parquet files 
>>>>>>> in a
>>>>>>> daily basis, in the impala side i'm running a daily ALTER TABLE xxxxxx
>>>>>>> recover partitions and hourly REFRESH TABLE xxxx, when i'm running show
>>>>>>> partitions i see old partitions and they shows up as Zero size and 
>>>>>>> number
>>>>>>> of files.
>>>>>>>
>>>>>>> 1) are these partitions impacting the hive metastore performance and
>>>>>>> memory used by the metastore?
>>>>>>>
>>>>>>> 2) Is there a way to drop these partitions without running ALTER
>>>>>>> TABLE xxx DROP IF EXISTS PARTITION, i believe also DROP and CREATE table
>>>>>>> can make this change but this is a heavy solution.
>>>>>>>
>>>>>>> 3) Is ALTER TABLE xxxx recover partitions looking on the newly
>>>>>>> created partitions?
>>>>>>>
>>>>>>> --
>>>>>>> Take Care
>>>>>>> Fawze Abujaber
>>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> Take Care
>>>>> Fawze Abujaber
>>>>>
>>>>
>>>
>>> --
>>> Take Care
>>> Fawze Abujaber
>>>
>> --
> Take Care
> Fawze Abujaber
>

Reply via email to