Is this perhaps not currently supported? // TODO: Support persisting partitioned data source relations in Hive compatible format
From: https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L328 On 25 January 2016 at 19:45, Elliot West <tea...@gmail.com> wrote: > Thanks for your response Jorge and apologies for my delay in replying. I > took your advice with case 5 and declared the column names explicitly > instead of the wildcard. This did the trick and I can now add partitions to > an existing table. I also tried removing the 'partitionBy("id")' call as > suggested, by this gave me a NoSuchMethodError. I suspect this would work > if I were running a newer version of the Hive metastore. Oddly I'm still > unable to create a new partitioned table, although I have since found a > somewhat confusing warning while running case 4: > > Persisting partitioned data source relation `raboof` into Hive metastore > in Spark SQL specific format, which is NOT compatible with Hive. Input > path(s): > > > If you have any thoughts, please let me know. > > Thanks - Elliot. > > > On 13 January 2016 at 20:56, Jorge Machado <jom...@me.com> wrote: > >> Hi Elliot, >> >> I´m no Expert to but for the case 5 can it be that you changed the order >> on the second insert ? >> And why do you give the command partitionBy again if the table was >> already create with partition ? >> >> >> >> insert into table foobar PARTITION (id) >> > values ("xxx", 1), ("yyy", 2); >> >> >> >> hive (default)> insert into table new_record_source >> > values (3, "zzz"); >> >> >> >> Regards >> >> >> On 11/01/2016, at 13:36, Elliot West <tea...@gmail.com> wrote: >> >> Hello, >> >> I am in the process of evaluating Spark (1.5.2) for a wide range of use >> cases. In particular I'm keen to understand the depth of the integration >> with HCatalog (aka the Hive Metastore). I am very encouraged when browsing >> the source contained within the org.apache.spark.sql.hive package. My goals >> are to evaluate how effectively Spark handles the following scenarios: >> >> 1. Reading from an unpartitioned HCatalog table. >> 2. Reading from a partitioned HCatalog table with partition pruning >> from filter pushdown. >> 3. Writing to a new unpartitioned HCatalog table. >> 4. Writing to a new partitioned HCatalog table. >> 5. Adding a partition to a partitioned HCatalog table. >> >> I found that the first three cases appear to function beautifully. >> However, I cannot seem to effectively create new HCatalog aware partitions >> either in a new table or on and existing table (cases 4 & 5). I suspect >> this may be due to my inexperience with Spark so wonder if you could advise >> me on what to try next. Here's what I have: >> >> *Case 4: Writing to a new partitioned HCatalog table* >> >> Create a source in Hive (could be plain data file also): >> >> >> hive (default)> create table foobar ( id int, name string ); >> hive (default)> insert into table foobar values (1, "xxx"), (2, "zzz"); >> >> Read the source with Spark, partition the data, and write to a new table: >> >> sqlContext.sql("select * >> from foobar").write.format("orc").partitionBy("id").saveAsTable("raboof") >> >> >> Check for the new table in Hive, it is partitioned correctly although the >> formats and schema are unexpected: >> >> hive (default)> show table extended like 'raboof'; >> OK >> tab_name >> tableName: raboof >> location:hdfs://host:port/user/hive/warehouse/raboof >> inputformat:org.apache.hadoop.mapred.SequenceFileInputFormat >> outputformat:org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat >> columns:struct columns { list<string> col} >> partitioned:true >> partitionColumns:struct partition_columns { i32 id} >> >> >> Check for correctly partitioned data on HDFS, it appears to be there: >> >> [me@host]$ hdfs dfs -ls -R /user/hive/warehouse/raboof >> /user/hive/warehouse/raboof/_SUCCESS >> /user/hive/warehouse/raboof/id=1 >> /user/hive/warehouse/raboof/id=1/part-r-00000-<uuid1>.orc >> /user/hive/warehouse/raboof/id=2 >> /user/hive/warehouse/raboof/id=2/part-r-00000-<uuid2>.orc >> >> >> Something is wrong however, no data is returned from this query and the >> column names appear incorrect: >> >> hive (default)> select * from default.raboof; >> OK >> col id >> >> HCatalog reports no partitions for the table: >> >> hive (default)> show partitions default.raboof; >> OK >> partition >> >> *Case 5: Adding a partition to a partitioned HCatalog table* >> >> Created partitioned source table in Hive: >> >> hive (default)> create table foobar ( name string ) >> > partitioned by ( id int ) >> > stored as orc; >> hive (default)> insert into table foobar PARTITION (id) >> > values ("xxx", 1), ("yyy", 2); >> >> >> Created a source for a new record to add to new_record_source: >> >> hive (default)> create table new_record_source ( id int, name string ) >> > stored as orc; >> hive (default)> insert into table new_record_source >> > values (3, "zzz"); >> >> >> Trying to add a partition with: >> >> sqlContext.sql("select * >> from >> new_record_source").write.mode("append").partitionBy("id").saveAsTable("foobar") >> >> >> This almost did what I wanted: >> >> hive (default)> show partitions default.foobar; >> partition >> id=1 >> id=2 >> id=__HIVE_DEFAULT_PARTITION__ >> >> hive (default)> select * from default.foobar; >> name id >> xxx 1 >> yyy 2 >> 3 NULL >> >> Any assistance would be greatly appreciated. >> >> Many thanks - Elliot. >> >> >> >