RE: Partitioned Parquet based external table

Chandra Mohan, Ananda Vel Murugan Thu, 12 Nov 2015 20:29:47 -0800

My primary interface for accessing the data is going to be Hive. I am planning 
to use Spark to ingest the data (in future I will use Spark Streaming, but for 
now it is just Spark SQL). Another group will analyze this data using Hive 
queries. For this scenario, the earlier suggestion seems to work.

Regards,
Anand.C

From: Michael Armbrust [mailto:[email protected]]
Sent: Friday, November 13, 2015 2:25 AM
To: Chandra Mohan, Ananda Vel Murugan
Cc: Michal Klos; user
Subject: Re: Partitioned Parquet based external table

Note that if you read in the table using sqlContext.read.parquet(...) or if you 
use saveAsTable(...) the partitions will be auto-discovered.  However, this is 
not compatible with Hive if you also want to be able to read the data there.

On Thu, Nov 12, 2015 at 6:23 AM, Chandra Mohan, Ananda Vel Murugan 
<[email protected]> wrote:
Thank you. It works perfectly now. I enabled dynamic partitioning for my table 
and then ran “msck repair table your_table”.

Regards,
Anand.C

From: Michal Klos [mailto:[email protected]]
Sent: Thursday, November 12, 2015 6:32 PM
To: Chandra Mohan, Ananda Vel Murugan
Cc: user
Subject: Re: Partitioned Parquet based external table

You must add the partitions to the Hive table with something like "alter table 
your_table add if not exists partition (country='us');".

If you have dynamic partitioning turned on, you can do 'msck repair table 
your_table' to recover the partitions.
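
Spark's partitionBy writes one directory per partition value, named like 
country=us, so the "alter table ... add partition" statements can be generated 
from those directory names. The following is only a minimal sketch under some 
assumptions: the class and method names are hypothetical, there is a single 
string-typed partition column, and the values contain no quote characters. It 
only builds the statement text; running it against Hive is left out.

```java
// Hypothetical helper (not part of Spark or Hive): builds Hive DDL text
// from partition directory names of the form column=value.
class HivePartitionDdl {

    // Turns a directory name such as "country=us" into "country='us'".
    // Assumes one string-typed partition column and a value without quotes.
    static String toPartitionSpec(String dirName) {
        int eq = dirName.indexOf('=');
        if (eq <= 0 || eq == dirName.length() - 1) {
            throw new IllegalArgumentException(
                "Not a column=value partition directory: " + dirName);
        }
        String column = dirName.substring(0, eq);
        String value = dirName.substring(eq + 1);
        if (value.indexOf('\'') >= 0) {
            throw new IllegalArgumentException(
                "Quoted values are not handled in this sketch: " + dirName);
        }
        return column + "='" + value + "'";
    }

    // Builds the statement that registers one partition with the Hive table.
    static String addPartitionStatement(String table, String dirName) {
        return "ALTER TABLE " + table
            + " ADD IF NOT EXISTS PARTITION (" + toPartitionSpec(dirName) + ");";
    }

    public static void main(String[] args) {
        // Example directory names; in practice these would come from listing
        // the table location in HDFS.
        String[] dirs = {"country=us", "country=in"};
        for (String dir : dirs) {
            System.out.println(addPartitionStatement("person5", dir));
        }
    }
}
```

With many partitions, 'msck repair table' is likely the simpler route, since it 
scans the table location itself; generating explicit statements is mainly 
useful when you want to register only specific partitions.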

I would recommend reviewing the Hive documentation on partitions.

M

On Nov 12, 2015, at 6:38 AM, Chandra Mohan, Ananda Vel Murugan 
<[email protected]> wrote:
Hi,

I am using Spark 1.5.1.

https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/sql/JavaSparkSQL.java
I have slightly modified this example to create a partitioned Parquet file.

Instead of this line

schemaPeople.write().parquet("people.parquet");

I use this line

schemaPeople.write().partitionBy("country").parquet("/user/Ananda/people.parquet");

I have also updated the Person class to add a country attribute, and updated 
my input file accordingly.

When I run this code in Spark, it seems to work: I can see the partition 
folders, with Parquet files inside them, at the HDFS location where I store 
the data.

But when I create an external table in Hive, it does not work. When I run 
“select * from person5”, it returns no rows.

This is how I create the table:

CREATE EXTERNAL TABLE person5(name string, age int, city string)
PARTITIONED BY (country string)
STORED AS PARQUET
LOCATION '/user/ananda/people.parquet/';

When I create a non-partitioned table, it works fine.

Please help if you have any ideas.

Regards,
Anand.C
