Hi all,

I'm trying to understand how Parquet partitioning works.

What is the recommended partitioning scheme for event data that will be
queried primarily by date? I assume that partitioning by year and month
would be optimal?

Let's say I have data that looks like:

application,status,date,message
kafka,down,2017-03-23 04:53,zookeeper is not available


Would I have to create new columns for year and month?

e.g.
application,status,date,message,year,month
kafka,down,2017-03-23 04:53,zookeeper is not available,2017,03

and then perform a CTAS using the year and month columns as the 'partition
by'?
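
To make the extra-column step concrete, here is a rough Python sketch of the
preprocessing I have in mind. It only illustrates deriving year and month
from the date field before any CTAS is run. The function name
add_partition_columns is made up for this example, and I am assuming the
timestamp is meant to read 2017-03-23 04:53 in "YYYY-MM-DD HH:MM" form.

```python
import csv
import io
from datetime import datetime

# Sample row from this message; the exact timestamp format is an assumption.
RAW = """application,status,date,message
kafka,down,2017-03-23 04:53,zookeeper is not available
"""


def add_partition_columns(rows):
    """Yield each row with 'year' and 'month' derived from its 'date' field."""
    for row in rows:
        ts = datetime.strptime(row["date"], "%Y-%m-%d %H:%M")
        # Zero-padded strings, so the month matches the "03" shown above.
        yield {**row, "year": f"{ts.year:04d}", "month": f"{ts.month:02d}"}


rows = list(add_partition_columns(csv.DictReader(io.StringIO(RAW))))

# Write the widened rows back out as CSV text.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=list(rows[0].keys()), lineterminator="\n")
writer.writeheader()
writer.writerows(rows)
print(out.getvalue())
```

If the query engine can compute these values from the date column inside the
CTAS itself, this preprocessing step may not be needed at all, which is
really what I am asking.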

Thanks
