Hi Fawze,

A Hive partition can have only one location, so partition
(year=2018/month=01/day=01) can't point to
both /tmp/account=aaaa/year=2018/month=01/day=01
and /tmp/account=bbbb/year=2018/month=01/day=01 at the same time.

For your problem, you need to reorganize the directory hierarchy to match
the partition definition: move all files under
/tmp/account=*/year=2018/month=01/day=01 into
/somewhere/year=2018/month=01/day=01.
Since you mentioned you have millions of accounts, you may want to do this
in parallel via a map-only MapReduce job or a Spark job.

For example, to write a MapReduce job for this:
(1) Create a text file listing all these directory names, one per line.
(2) Use NLineInputFormat as the InputFormat and that text file as the input.
(3) Each mapper then processes N directories, moving the files in
/tmp/account=*/year=YYYY/month=MM/day=DD into
/somewhere/year=YYYY/month=MM/day=DD (creating the directory if it
doesn't exist). A sketch follows below.
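
A minimal sketch of such a job, only as a starting point: the class name
MoveDirsJob, the /somewhere destination root, the 10-lines-per-mapper value,
and the account-name prefix (added here to avoid file-name collisions when
different accounts land in the same destination directory) are all my
assumptions, not part of your setup.

  import java.io.IOException;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.NullWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

  public class MoveDirsJob {

    // Each input line is one source directory, e.g.
    // /tmp/account=aaaa/year=2018/month=01/day=01
    public static class MoveMapper
        extends Mapper<LongWritable, Text, NullWritable, NullWritable> {
      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        Configuration conf = context.getConfiguration();
        Path srcDir = new Path(value.toString().trim());
        FileSystem fs = srcDir.getFileSystem(conf);

        // Strip the account=... component, keeping only year=/month=/day=:
        // /tmp/account=aaaa/year=2018/month=01/day=01
        //   -> /somewhere/year=2018/month=01/day=01
        Path dayDir = srcDir;                 // day=DD
        Path monthDir = dayDir.getParent();   // month=MM
        Path yearDir = monthDir.getParent();  // year=YYYY
        Path destDir = new Path("/somewhere/" + yearDir.getName() + "/"
            + monthDir.getName() + "/" + dayDir.getName());
        fs.mkdirs(destDir);                   // no-op if it already exists

        String account = yearDir.getParent().getName();  // account=aaaa
        for (FileStatus f : fs.listStatus(srcDir)) {
          // Prefix with the account name so files from different accounts
          // don't collide in the shared destination directory.
          Path dest = new Path(destDir, account + "-" + f.getPath().getName());
          if (!fs.rename(f.getPath(), dest)) {
            throw new IOException("Failed to move " + f.getPath() + " to " + dest);
          }
        }
      }
    }

    public static void main(String[] args) throws Exception {
      Job job = Job.getInstance(new Configuration(), "move-partition-dirs");
      job.setJarByClass(MoveDirsJob.class);
      job.setMapperClass(MoveMapper.class);
      job.setNumReduceTasks(0);  // map-only
      job.setInputFormatClass(NLineInputFormat.class);
      // args[0]: the text file listing the source directories, one per line.
      NLineInputFormat.addInputPath(job, new Path(args[0]));
      // N directories per mapper; tune for your cluster.
      NLineInputFormat.setNumLinesPerSplit(job, 10);
      job.setOutputFormatClass(NullOutputFormat.class);
      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  }

Note that rename() is a cheap metadata operation within the same HDFS, so
each mapper only touches the NameNode, not the data blocks.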

HTH
Quanlong

On Sat, Dec 15, 2018 at 3:36 PM Fawze Abujaber <fawz...@gmail.com> wrote:

> Hi Community.
>
> I would like to create an external table on top of these HDFS files with
> partitions year, month and day. Is it possible to create one table on
> top of these files?
>
> /tmp/account=aaaa/year=2018/month=01/day=01
> /tmp/account=aaaa/year=2018/month=01/day=02
> /tmp/account=bbbb/year=2018/month=01/day=01
> /tmp/account=bbbb/year=2018/month=01/day=02
>
> Creating a table with:
>
> PARTITIONED BY (
>   year INT,
>   month INT,
>   day INT
> )
> STORED AS PARQUET
> LOCATION '/tmp'
>
> is not working for me.
>
> Adding the account to the partitioning creates millions of partitions,
> and I want to avoid this; in the background I have a compaction job that
> compacts the small files under the day partition.
> --
> Take Care
> Fawze Abujaber
>
