RE: Partitioning for parquet

Lee, David Wed, 31 May 2017 14:13:21 -0700

In addition to partitioning, I would also create subdirectories by year and 
then month, if that is what you are partitioning on. This matters if you want 
to use your Parquet files on multiple platforms: Apache Spark doesn't use 
Parquet metadata for partitioning and depends on subdirectory names for its 
partitioning scheme.

http://spark.apache.org/docs/latest/sql-programming-guide.html#partition-discovery

Table partitioning is a common optimization approach used in systems like Hive. 
In a partitioned table, data are usually stored in different directories, with 
partitioning column values encoded in the path of each partition directory.
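
As a rough illustration of that layout, here is a minimal Python sketch that 
builds a Hive-style partition directory from a timestamp. The root path 
/data/events, the column names year and month, the sample timestamp, and the 
helper partition_dir are all assumptions made for the example; none of them 
come from Spark or Drill.

```python
from datetime import datetime
from pathlib import PurePosixPath


def partition_dir(root: str, ts: datetime) -> str:
    """Return a Hive-style partition directory: <root>/year=<Y>/month=<M>."""
    # Each directory level encodes one partition column as key=value.
    return str(PurePosixPath(root) / f"year={ts.year}" / f"month={ts.month}")


# Hypothetical root directory and event timestamp, for illustration only.
print(partition_dir("/data/events", datetime(2017, 3, 23, 4, 53)))
# prints /data/events/year=2017/month=3
```

Spark's partition discovery reads the column names and values from key=value 
directory names of this form.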

-----Original Message-----
From: rahul challapalli [mailto:[email protected]] 
Sent: Wednesday, May 31, 2017 1:50 PM
To: user <[email protected]>
Subject: Re: Partitioning for parquet

How to partition the data depends on how you want to access it. If you can 
foresee that most of the queries will filter on year and month, then go ahead 
and partition the data on those two columns. You can do that as shown below:

create table partitioned_data partition by (yr, mnth) as
select extract(year from `date`) yr, extract(month from `date`) mnth,
       `date`, ........
from mydata;

For partitioning to have any benefit, your queries should filter on the year 
and month columns (yr and mnth above).
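
To see why the filters matter, here is a small Python sketch of the idea behind 
partition pruning. It assumes one partition per (year, month) pair across two 
years. The prune helper and the partition list are hypothetical and only model 
the concept; this is not how Drill implements it.

```python
def prune(partitions, year=None, month=None):
    """Keep only the partitions that a filter on year and/or month can match."""
    return [
        p for p in partitions
        if (year is None or p["year"] == year)
        and (month is None or p["month"] == month)
    ]


# Hypothetical layout: one partition per month across 2016 and 2017.
partitions = [{"year": y, "month": m} for y in (2016, 2017) for m in range(1, 13)]

print(len(prune(partitions, year=2017, month=3)))  # prints 1
print(len(prune(partitions, year=2017)))           # prints 12
print(len(prune(partitions)))                      # prints 24 (no filter, nothing skipped)
```

With no filter on the partition columns, every partition still has to be read, 
so the partitioning buys nothing for that query.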

- Rahul

On Wed, May 31, 2017 at 1:28 PM, Raz Baluchi wrote:

> Hi all,
>
> Trying to understand how parquet partitioning works.
>
> What is the recommended partitioning scheme for event data that will 
> be queried primarily by date? I assume that partitioning by year and 
> month would be optimal?
>
> Let's say I have data that looks like:
>
> application,status,date,message
> kafka,down,2017-03-23 04:53,zookeeper is not available
>
>
> Would I have to create new columns for year and month?
>
> e.g.
> application,status,date,message,year,month
> kafka,down,2017-03-23 04:53,zookeeper is not available,2017,03
>
> and then perform a CTAS using the year and month columns as the 
> 'partition by'?
>
> Thanks
>
