How you partition the data depends on how you want to access it. If you can
foresee that most queries will filter on year and month, then go ahead and
partition the data on those two columns. You can do that with a CTAS like this:

create table partitioned_data partition by (yr, mnth) as
select extract(year from `date`) yr, extract(month from `date`) mnth, `date`,
       ........
from mydata;

For partitioning to have any benefit, your queries should filter on the
partition columns (yr and mnth in the statement above).
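
As a rough sketch of what such a query could look like, assuming the
partitioned_data table was created as above and that the application, status
and message columns from your sample were carried over in the elided part of
the select list (an assumption on my part):

```sql
-- Hypothetical query: filters directly on the partition columns yr and mnth,
-- so the planner has a chance to skip the files for other months.
select application, status, `date`, message
from partitioned_data
where yr = 2017
  and mnth = 3;
```

A query that filters only on `date` gives the planner no condition on yr or
mnth, so it would generally not get the same pruning benefit.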

- Rahul

On Wed, May 31, 2017 at 1:28 PM, Raz Baluchi wrote:

> Hi all,
>
> Trying to understand how Parquet partitioning works.
>
> What is the recommended partitioning scheme for event data that will be
> queried primarily by date? I assume that partitioning by year and month
> would be optimal?
>
> Let's say I have data that looks like:
>
> application,status,date,message
> kafka,down,2017-03-23 04:53,zookeeper is not available
>
>
> Would I have to create new columns for year and month?
>
> e.g.
> application,status,date,message,year,month
> kafka,down,2017-03-23 04:53,zookeeper is not available,2017,03
>
> and then perform a CTAS using the year and month columns as the 'partition
> by'?
>
> Thanks
>
