Thanks for your response Jorge and apologies for my delay in replying. I
took your advice with case 5 and declared the column names explicitly
instead of the wildcard. This did the trick and I can now add partitions to
an existing table. I also tried removing the 'partitionBy("id")' call as
suggested, but this gave me a NoSuchMethodError. I suspect this would work
if I were running a newer version of the Hive metastore. Oddly I'm still
unable to create a new partitioned table, although I have since found a
somewhat confusing warning while running case 4:

Persisting partitioned data source relation `raboof` into Hive metastore in
Spark SQL specific format, which is NOT compatible with Hive. Input
path(s):
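
For reference, the case 5 write that now works for me is along these lines
(simplified; the key point is selecting the columns explicitly rather than
using the wildcard, so that the partition column comes last):

sqlContext.sql("select name, id from new_record_source")
  .write.mode("append").partitionBy("id").saveAsTable("foobar")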


If you have any thoughts, please let me know.

Thanks - Elliot.


On 13 January 2016 at 20:56, Jorge Machado <jom...@me.com> wrote:

> Hi Elliot,
>
> I'm no expert either, but for case 5, could it be that you changed the
> column order on the second insert?
> And why do you call partitionBy again if the table was already created
> with a partition?
>
>
>
> insert into table foobar PARTITION (id)
>               > values ("xxx", 1), ("yyy", 2);
>
>
>
> hive (default)> insert into table new_record_source
>               > values (3, "zzz");
>
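> In other words, I would try something like this (untested, just to
> illustrate the idea: match foobar's column order and drop the partitionBy
> call):
>
> sqlContext.sql("select name, id from new_record_source")
>   .write.mode("append").saveAsTable("foobar")
>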
>
>
> Regards
>
>
> On 11/01/2016, at 13:36, Elliot West <tea...@gmail.com> wrote:
>
> Hello,
>
> I am in the process of evaluating Spark (1.5.2) for a wide range of use
> cases. In particular I'm keen to understand the depth of the integration
> with HCatalog (aka the Hive Metastore). I am very encouraged when browsing
> the source contained within the org.apache.spark.sql.hive package. My goals
> are to evaluate how effectively Spark handles the following scenarios:
>
>    1. Reading from an unpartitioned HCatalog table.
>    2. Reading from a partitioned HCatalog table with partition pruning
>    from filter pushdown.
>    3. Writing to a new unpartitioned HCatalog table.
>    4. Writing to a new partitioned HCatalog table.
>    5. Adding a partition to a partitioned HCatalog table.
>
> I found that the first three cases appear to function beautifully.
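>
> For reference, my reads for cases 1 and 2 were roughly along these lines
> (simplified; the table names here are just placeholders):
>
> // Case 1: unpartitioned table
> sqlContext.sql("select * from my_unpartitioned_table").show()
>
> // Case 2: partitioned table; the filter on the partition column should
> // allow partition pruning via filter pushdown
> sqlContext.sql("select * from my_partitioned_table where id = 1").show()
>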
> However, I cannot seem to effectively create new HCatalog-aware partitions
> either in a new table or on an existing table (cases 4 & 5). I suspect
> this may be due to my inexperience with Spark so wonder if you could advise
> me on what to try next. Here's what I have:
>
> *Case 4: Writing to a new partitioned HCatalog table*
>
> Create a source in Hive (could be plain data file also):
>
>
> hive (default)> create table foobar ( id int, name string );
> hive (default)> insert into table foobar values (1, "xxx"), (2, "zzz");
>
> Read the source with Spark, partition the data, and write to a new table:
>
> sqlContext.sql("select *
> from foobar").write.format("orc").partitionBy("id").saveAsTable("raboof")
>
>
> Check for the new table in Hive, it is partitioned correctly although the
> formats and schema are unexpected:
>
> hive (default)> show table extended like 'raboof';
> OK
> tab_name
> tableName: raboof
> location:hdfs://host:port/user/hive/warehouse/raboof
> inputformat:org.apache.hadoop.mapred.SequenceFileInputFormat
> outputformat:org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
> columns:struct columns { list<string> col}
> partitioned:true
> partitionColumns:struct partition_columns { i32 id}
>
>
> Check for correctly partitioned data on HDFS, it appears to be there:
>
> [me@host]$ hdfs dfs -ls -R /user/hive/warehouse/raboof
> /user/hive/warehouse/raboof/_SUCCESS
> /user/hive/warehouse/raboof/id=1
> /user/hive/warehouse/raboof/id=1/part-r-00000-<uuid1>.orc
> /user/hive/warehouse/raboof/id=2
> /user/hive/warehouse/raboof/id=2/part-r-00000-<uuid2>.orc
>
>
> Something is wrong, however: no data is returned from this query and the
> column names appear incorrect:
>
> hive (default)> select * from default.raboof;
> OK
> col id
>
> HCatalog reports no partitions for the table:
>
> hive (default)> show partitions default.raboof;
> OK
> partition
>
> *Case 5: Adding a partition to a partitioned HCatalog table*
>
> Created partitioned source table in Hive:
>
> hive (default)> create table foobar ( name string )
>               > partitioned by ( id int )
>               > stored as orc;
> hive (default)> insert into table foobar PARTITION (id)
>               > values ("xxx", 1), ("yyy", 2);
>
>
> Created a source table, new_record_source, holding the new record to add:
>
> hive (default)> create table new_record_source ( id int, name string )
>               > stored as orc;
> hive (default)> insert into table new_record_source
>               > values (3, "zzz");
>
>
> Trying to add a partition with:
>
> sqlContext.sql("select *
> from 
> new_record_source").write.mode("append").partitionBy("id").saveAsTable("foobar")
>
>
> This almost did what I wanted:
>
> hive (default)> show partitions default.foobar;
> partition
> id=1
> id=2
> id=__HIVE_DEFAULT_PARTITION__
>
> hive (default)> select * from default.foobar;
> name id
> xxx 1
> yyy 2
> 3 NULL
>
> Any assistance would be greatly appreciated.
>
> Many thanks - Elliot.
>
>
>
