Hi Philip,

There are a few things you can do:

- If you want to avoid the data copy that CREATE TABLE plus LOAD DATA does, you 
can use CREATE EXTERNAL TABLE, which just points to an existing file or 
directory.
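
For example, something like this, where the table name, columns and HDFS path 
are all made up for illustration (assuming an RDD of string pairs):

    // In your Spark app: write the RDD out as comma-delimited text
    val lines = rdd.map { case (ts, url) => ts + "," + url }  // rdd: RDD[(String, String)]
    lines.saveAsTextFile("hdfs:///user/philip/cleaned_logs")

    // Then, one time, from Hive or Hue (no data is copied):
    // CREATE EXTERNAL TABLE cleaned_logs (ts STRING, url STRING)
    //   ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    //   LOCATION '/user/philip/cleaned_logs';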

- If you always reuse the same table, you could run CREATE TABLE only once and 
then simply place files in its directory, in whatever format Hive expects (for 
simplicity, make it comma-delimited or something like that).
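
One catch is that saveAsTextFile creates its own output directory, and Hive 
won't look inside subdirectories of the table's directory by default. Here's a 
sketch of one way around that (the table name, staging path and warehouse 
location are assumptions; check hive.metastore.warehouse.dir for the real one):

    import org.apache.hadoop.fs.{FileSystem, Path}

    // Write comma-delimited text to a staging directory first
    val staging = "/tmp/logs_staging"
    rdd.map { case (ts, url) => ts + "," + url }.saveAsTextFile(staging)

    // Move the part files into the table's directory so Hive picks them up;
    // the prefix avoids clashing with files from earlier runs
    val fs = FileSystem.get(sc.hadoopConfiguration)
    val tableDir = new Path("/user/hive/warehouse/logs")
    for (f <- fs.listStatus(new Path(staging)) if f.getPath.getName.startsWith("part-")) {
      fs.rename(f.getPath, new Path(tableDir, "run1-" + f.getPath.getName))
    }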

- In Shark 0.8.1, there will be an RDDTable class that lets you save an RDD 
directly as a table; it basically does both the file creation and the CREATE 
TABLE for you. However, it’s true that you’ll have to publish Shark to your 
local Maven repo (you can do this with sbt publish-local in Shark). We hope to 
put it in a public repo at some point too, but it’s not there yet.
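
To give an idea, it currently looks roughly like this in master. Treat it as a 
sketch: the exact package and signature may still change before the release.

    import shark.api._  // package location as of current master

    // Works on an RDD of tuples of primitive/string types (example data made
    // up); this writes the files and registers the table in the metastore
    val rdd = sc.parallelize(Seq((1, "a"), (2, "b"), (3, "c")))
    RDDTable(rdd).saveAsTable("my_table", Seq("key", "value"))

After that, the table can be queried from Shark like any other.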

Matei

On Dec 6, 2013, at 5:06 PM, Philip Ogren <[email protected]> wrote:

> I have a simple scenario that I'm struggling to implement.  I would like to 
> take a fairly simple RDD generated from a large log file, perform some 
> transformations on it, and write the results out such that I can perform a 
> Hive query either from Hive (via Hue) or Shark.  I'm having trouble with the 
> last step.  I am able to write my data out to HDFS and then execute a Hive 
> create table statement followed by a load data statement as a separate step.  
> I really dislike this separate manual step and would like to be able to have 
> it all accomplished in my Spark application.  To this end, I have 
> investigated two possible approaches as detailed below - it's probably too 
> much information so I'll ask my more basic question first:
> 
> Does anyone have a basic recipe/approach for loading data in an RDD to a Hive 
> table from a Spark application?
> 
> 1) Load it into HBase via PairRDDFunctions.saveAsHadoopDataset.  There is a 
> nice detailed email on how to do this here.  I didn't get very far, though, 
> because as soon as I added an HBase dependency (corresponding to the version 
> of HBase we are running) to my pom.xml file, I had an slf4j dependency 
> conflict that caused my current application to explode.  I tried the latest 
> released version and the slf4j dependency problem went away, but then the 
> deprecated class TableOutputFormat no longer exists.  Even if loading the 
> data into HBase were trivially easy (and the detailed email suggests 
> otherwise), I would then need to query HBase from Hive, which seems a little 
> clunky.
> 
> 2) So, I decided that Shark might be an easier option.  All the examples 
> provided in their documentation seem to assume that you are using Shark as an 
> interactive application from a shell.  Various threads I've seen seem to 
> indicate that Shark isn't really intended to be used as a dependency in your 
> Spark code (see this and that).  It follows, then, that one can't add a Shark 
> dependency to a pom.xml file because Shark isn't released via Maven Central 
> (as far as I can tell... perhaps it's in some other repo?)  Of course, there 
> are ways of creating a local dependency in Maven, but it starts to feel very 
> hacky.
> 
> I realize that I've given sufficient detail to expose my ignorance in a 
> myriad of ways.  Please feel free to shine light on any of my misconceptions!
> 
> Thanks,
> Philip
> 
