Thanks for your response Jorge and apologies for my delay in replying. I took your advice with case 5 and declared the column names explicitly instead of the wildcard. This did the trick and I can now add partitions to an existing table. I also tried removing the 'partitionBy("id")' call as suggested, by this gave me a NoSuchMethodError. I suspect this would work if I were running a newer version of the Hive metastore. Oddly I'm still unable to create a new partitioned table, although I have since found a somewhat confusing warning while running case 4:
Persisting partitioned data source relation `raboof` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive. Input path(s): If you have any thoughts, please let me know. Thanks - Elliot. On 13 January 2016 at 20:56, Jorge Machado <jom...@me.com> wrote: > Hi Elliot, > > I´m no Expert to but for the case 5 can it be that you changed the order > on the second insert ? > And why do you give the command partitionBy again if the table was > already create with partition ? > > > > insert into table foobar PARTITION (id) > > values ("xxx", 1), ("yyy", 2); > > > > hive (default)> insert into table new_record_source > > values (3, "zzz"); > > > > Regards > > > On 11/01/2016, at 13:36, Elliot West <tea...@gmail.com> wrote: > > Hello, > > I am in the process of evaluating Spark (1.5.2) for a wide range of use > cases. In particular I'm keen to understand the depth of the integration > with HCatalog (aka the Hive Metastore). I am very encouraged when browsing > the source contained within the org.apache.spark.sql.hive package. My goals > are to evaluate how effectively Spark handles the following scenarios: > > 1. Reading from an unpartitioned HCatalog table. > 2. Reading from a partitioned HCatalog table with partition pruning > from filter pushdown. > 3. Writing to a new unpartitioned HCatalog table. > 4. Writing to a new partitioned HCatalog table. > 5. Adding a partition to a partitioned HCatalog table. > > I found that the first three cases appear to function beautifully. > However, I cannot seem to effectively create new HCatalog aware partitions > either in a new table or on and existing table (cases 4 & 5). I suspect > this may be due to my inexperience with Spark so wonder if you could advise > me on what to try next. Here's what I have: > > *Case 4: Writing to a new partitioned HCatalog table* > > Create a source in Hive (could be plain data file also): > > > hive (default)> create table foobar ( id int, name string ); > hive (default)> insert into table foobar values (1, "xxx"), (2, "zzz"); > > Read the source with Spark, partition the data, and write to a new table: > > sqlContext.sql("select * > from foobar").write.format("orc").partitionBy("id").saveAsTable("raboof") > > > Check for the new table in Hive, it is partitioned correctly although the > formats and schema are unexpected: > > hive (default)> show table extended like 'raboof'; > OK > tab_name > tableName: raboof > location:hdfs://host:port/user/hive/warehouse/raboof > inputformat:org.apache.hadoop.mapred.SequenceFileInputFormat > outputformat:org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat > columns:struct columns { list<string> col} > partitioned:true > partitionColumns:struct partition_columns { i32 id} > > > Check for correctly partitioned data on HDFS, it appears to be there: > > [me@host]$ hdfs dfs -ls -R /user/hive/warehouse/raboof > /user/hive/warehouse/raboof/_SUCCESS > /user/hive/warehouse/raboof/id=1 > /user/hive/warehouse/raboof/id=1/part-r-00000-<uuid1>.orc > /user/hive/warehouse/raboof/id=2 > /user/hive/warehouse/raboof/id=2/part-r-00000-<uuid2>.orc > > > Something is wrong however, no data is returned from this query and the > column names appear incorrect: > > hive (default)> select * from default.raboof; > OK > col id > > HCatalog reports no partitions for the table: > > hive (default)> show partitions default.raboof; > OK > partition > > *Case 5: Adding a partition to a partitioned HCatalog table* > > Created partitioned source table in Hive: > > hive (default)> create table foobar ( name string ) > > partitioned by ( id int ) > > stored as orc; > hive (default)> insert into table foobar PARTITION (id) > > values ("xxx", 1), ("yyy", 2); > > > Created a source for a new record to add to new_record_source: > > hive (default)> create table new_record_source ( id int, name string ) > > stored as orc; > hive (default)> insert into table new_record_source > > values (3, "zzz"); > > > Trying to add a partition with: > > sqlContext.sql("select * > from > new_record_source").write.mode("append").partitionBy("id").saveAsTable("foobar") > > > This almost did what I wanted: > > hive (default)> show partitions default.foobar; > partition > id=1 > id=2 > id=__HIVE_DEFAULT_PARTITION__ > > hive (default)> select * from default.foobar; > name id > xxx 1 > yyy 2 > 3 NULL > > Any assistance would be greatly appreciated. > > Many thanks - Elliot. > > >