Re: Spark integration with HCatalog (specifically regarding partitions)

Elliot West Thu, 28 Jan 2016 07:48:17 -0800

Is this perhaps not currently supported?

// TODO: Support persisting partitioned data source relations in Hive
compatible format


From:
https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L328

On 25 January 2016 at 19:45, Elliot West <tea...@gmail.com> wrote:

> Thanks for your response Jorge and apologies for my delay in replying. I
> took your advice with case 5 and declared the column names explicitly
> instead of the wildcard. This did the trick and I can now add partitions to
> an existing table. I also tried removing the 'partitionBy("id")' call as
> suggested, by this gave me a NoSuchMethodError. I suspect this would work
> if I were running a newer version of the Hive metastore. Oddly I'm still
> unable to create a new partitioned table, although I have since found a
> somewhat confusing warning while running case 4:
>
> Persisting partitioned data source relation `raboof` into Hive metastore
> in Spark SQL specific format, which is NOT compatible with Hive. Input
> path(s):
>
>
> If you have any thoughts, please let me know.
>
> Thanks - Elliot.
>
>
> On 13 January 2016 at 20:56, Jorge Machado <jom...@me.com> wrote:
>
>> Hi Elliot,
>>
>> I´m no Expert to but for the case 5 can it be that you changed the order
>> on the second insert ?
>> And why do you give the command partitionBy again if the table was
>> already create with partition ?
>>
>>
>>
>> insert into table foobar PARTITION (id)
>>               > values ("xxx", 1), ("yyy", 2);
>>
>>
>>
>> hive (default)> insert into table new_record_source
>>               > values (3, "zzz");
>>
>>
>>
>> Regards
>>
>>
>> On 11/01/2016, at 13:36, Elliot West <tea...@gmail.com> wrote:
>>
>> Hello,
>>
>> I am in the process of evaluating Spark (1.5.2) for a wide range of use
>> cases. In particular I'm keen to understand the depth of the integration
>> with HCatalog (aka the Hive Metastore). I am very encouraged when browsing
>> the source contained within the org.apache.spark.sql.hive package. My goals
>> are to evaluate how effectively Spark handles the following scenarios:
>>
>>    1. Reading from an unpartitioned HCatalog table.
>>    2. Reading from a partitioned HCatalog table with partition pruning
>>    from filter pushdown.
>>    3. Writing to a new unpartitioned HCatalog table.
>>    4. Writing to a new partitioned HCatalog table.
>>    5. Adding a partition to a partitioned HCatalog table.
>>
>> I found that the first three cases appear to function beautifully.
>> However, I cannot seem to effectively create new HCatalog aware partitions
>> either in a new table or on and existing table (cases 4 & 5). I suspect
>> this may be due to my inexperience with Spark so wonder if you could advise
>> me on what to try next. Here's what I have:
>>
>> *Case 4: Writing to a new partitioned HCatalog table*
>>
>> Create a source in Hive (could be plain data file also):
>>
>>
>> hive (default)> create table foobar ( id int, name string );
>> hive (default)> insert into table foobar values (1, "xxx"), (2, "zzz");
>>
>> Read the source with Spark, partition the data, and write to a new table:
>>
>> sqlContext.sql("select *
>> from foobar").write.format("orc").partitionBy("id").saveAsTable("raboof")
>>
>>
>> Check for the new table in Hive, it is partitioned correctly although the
>> formats and schema are unexpected:
>>
>> hive (default)> show table extended like 'raboof';
>> OK
>> tab_name
>> tableName: raboof
>> location:hdfs://host:port/user/hive/warehouse/raboof
>> inputformat:org.apache.hadoop.mapred.SequenceFileInputFormat
>> outputformat:org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
>> columns:struct columns { list<string> col}
>> partitioned:true
>> partitionColumns:struct partition_columns { i32 id}
>>
>>
>> Check for correctly partitioned data on HDFS, it appears to be there:
>>
>> [me@host]$ hdfs dfs -ls -R /user/hive/warehouse/raboof
>> /user/hive/warehouse/raboof/_SUCCESS
>> /user/hive/warehouse/raboof/id=1
>> /user/hive/warehouse/raboof/id=1/part-r-00000-<uuid1>.orc
>> /user/hive/warehouse/raboof/id=2
>> /user/hive/warehouse/raboof/id=2/part-r-00000-<uuid2>.orc
>>
>>
>> Something is wrong however, no data is returned from this query and the
>> column names appear incorrect:
>>
>> hive (default)> select * from default.raboof;
>> OK
>> col id
>>
>> HCatalog reports no partitions for the table:
>>
>> hive (default)> show partitions default.raboof;
>> OK
>> partition
>>
>> *Case 5: Adding a partition to a partitioned HCatalog table*
>>
>> Created partitioned source table in Hive:
>>
>> hive (default)> create table foobar ( name string )
>>               > partitioned by ( id int )
>>               > stored as orc;
>> hive (default)> insert into table foobar PARTITION (id)
>>               > values ("xxx", 1), ("yyy", 2);
>>
>>
>> Created a source for a new record to add to new_record_source:
>>
>> hive (default)> create table new_record_source ( id int, name string )
>>               > stored as orc;
>> hive (default)> insert into table new_record_source
>>               > values (3, "zzz");
>>
>>
>> Trying to add a partition with:
>>
>> sqlContext.sql("select *
>> from 
>> new_record_source").write.mode("append").partitionBy("id").saveAsTable("foobar")
>>
>>
>> This almost did what I wanted:
>>
>> hive (default)> show partitions default.foobar;
>> partition
>> id=1
>> id=2
>> id=__HIVE_DEFAULT_PARTITION__
>>
>> hive (default)> select * from default.foobar;
>> name id
>> xxx 1
>> yyy 2
>> 3 NULL
>>
>> Any assistance would be greatly appreciated.
>>
>> Many thanks - Elliot.
>>
>>
>>
>

Re: Spark integration with HCatalog (specifically regarding partitions)

Reply via email to