My primary interface to the data is going to be Hive. I am planning to use Spark to ingest data (in the future I will use Spark Streaming, but for now it is just Spark SQL). Another group will analyze this data using Hive queries. For this scenario, the earlier suggestion seems to work.

Regards,
Anand.C

From: Michael Armbrust [mailto:[email protected]]
Sent: Friday, November 13, 2015 2:25 AM
To: Chandra Mohan, Ananda Vel Murugan
Cc: Michal Klos; user
Subject: Re: Partitioned Parquet based external table

Note that if you read in the table using sqlContext.read.parquet(...) or if you use saveAsTable(...) the partitions will be auto-discovered. However, this is not compatible with Hive if you also want to be able to read the data there.

On Thu, Nov 12, 2015 at 6:23 AM, Chandra Mohan, Ananda Vel Murugan <[email protected]> wrote:

Thank you. It works perfectly fine. I enabled dynamic partition in my table and then fired “msck repair table your_table” and it works now.

Regards,
Anand.C

From: Michal Klos [mailto:[email protected]]
Sent: Thursday, November 12, 2015 6:32 PM
To: Chandra Mohan, Ananda Vel Murugan
Cc: user
Subject: Re: Partitioned Parquet based external table

You must add the partitions to the Hive table with something like "alter table your_table add if not exists partition (country='us');".

If you have dynamic partitioning turned on, you can do 'msck repair table your_table' to recover the partitions.

I would recommend reviewing the Hive documentation on partitions.

M

On Nov 12, 2015, at 6:38 AM, Chandra Mohan, Ananda Vel Murugan <[email protected]> wrote:

Hi,

I am using Spark 1.5.1.

https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/sql/JavaSparkSQL.java

I have slightly modified this example to create a partitioned parquet file.

Instead of this line

schemaPeople.write().parquet("people.parquet");

I use this line

schemaPeople.write().partitionBy("country").parquet("/user/Ananda/people.parquet");

I have also updated the Person class and added a country attribute. I have also updated my input file accordingly.

When I run this code in Spark, it seems to work. I could see the partitioned folder and the parquet file inside it in HDFS, where I store this parquet file. But when I create an external table in Hive, it does not work. When I do “select * from person5”, it returns no rows.

This is how I create the table:

CREATE EXTERNAL TABLE person5(name string, age int,city string)
PARTITIONED BY (country string)
STORED AS PARQUET
LOCATION '/user/ananda/people.parquet/';

When I create a non-partitioned table, it works fine. Please help if you have any idea.

Regards,
Anand.C
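
For reference, the "alter table ... add if not exists partition" suggestion in this thread can be scripted when there are many partitions. The sketch below is plain Java with no Spark or Hive dependency. It relies on the fact that partitionBy("country") writes sub-directories named country=<value>, and it only builds the statement text. The class and method names (PartitionDdl, partitionValue, addPartition, statementsFor) are hypothetical, the directory list is hard-coded for illustration ("country=in" is a made-up value), and it assumes partition values are plain strings without quotes or other characters that would need escaping. Running the statements against Hive is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: builds Hive DDL text for partition directories written by Spark. */
final class PartitionDdl {

    /**
     * Returns the value part of a directory name such as "country=us",
     * or null when the name does not belong to the given partition column.
     */
    static String partitionValue(String dirName, String column) {
        String prefix = column + "=";
        if (!dirName.startsWith(prefix) || dirName.length() == prefix.length()) {
            return null;
        }
        return dirName.substring(prefix.length());
    }

    /** Builds one statement in the form suggested earlier in the thread. */
    static String addPartition(String table, String column, String value) {
        // Assumption: value contains no quotes, so no escaping is done here.
        return "ALTER TABLE " + table + " ADD IF NOT EXISTS PARTITION ("
                + column + "='" + value + "');";
    }

    /** Turns a directory listing into statements, skipping entries like _SUCCESS. */
    static List<String> statementsFor(String table, String column, List<String> dirNames) {
        List<String> statements = new ArrayList<>();
        for (String dirName : dirNames) {
            String value = partitionValue(dirName, column);
            if (value != null) {
                statements.add(addPartition(table, column, value));
            }
        }
        return statements;
    }

    public static void main(String[] args) {
        // Illustrative listing of /user/Ananda/people.parquet; not real data.
        List<String> dirs = Arrays.asList("_SUCCESS", "country=us", "country=in");
        for (String statement : statementsFor("person5", "country", dirs)) {
            System.out.println(statement);
        }
    }
}
```

The printed statements could then be pasted into the Hive CLI. With only a handful of partitions, the single "msck repair table" command reported to work above is probably simpler.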
