Writing an RDD to Hive

Philip Ogren Fri, 06 Dec 2013 17:07:01 -0800

I have a simple scenario that I'm struggling to implement. I would like to take a fairly simple RDD generated from a large log file, perform some transformations on it, and write the results out such that I can perform a Hive query either from Hive (via Hue) or Shark. I'm having trouble with the last step. I am able to write my data out to HDFS and then execute a Hive create table statement followed by a load data statement as a separate step. I really dislike this separate manual step and would like to be able to have it all accomplished in my Spark application. To this end, I have investigated two possible approaches as detailed below. It's probably too much information, so I'll ask my more basic question first:

Does anyone have a basic recipe/approach for loading data in an RDD to a Hive table from a Spark application?
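For concreteness, the separate manual step mentioned above (a Hive create table statement followed by a load data statement) might look roughly like the sketch below. Everything specific in it is an assumption made for illustration: the helper name hive_load_statements, the table name, the column list, the tab delimiter, and the HDFS path are all hypothetical. The helper only renders the two HiveQL strings; it does not talk to Hive or Spark.

```python
# Hypothetical helper (not part of Spark, Hive, or Shark): renders the two
# HiveQL statements for the manual "create table, then load data" step.

def hive_load_statements(table, columns, hdfs_path, delimiter="\\t"):
    """Return (create_stmt, load_stmt) for a delimited text table.

    columns is a list of (name, hive_type) pairs, e.g. [("ts", "STRING")].
    hdfs_path is assumed to be the directory the Spark job wrote to.
    """
    cols = ", ".join(f"{name} {hive_type}" for name, hive_type in columns)
    create_stmt = (
        f"CREATE TABLE IF NOT EXISTS {table} ({cols}) "
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{delimiter}' "
        f"STORED AS TEXTFILE"
    )
    # LOAD DATA INPATH moves the files already in HDFS into the table.
    load_stmt = f"LOAD DATA INPATH '{hdfs_path}' INTO TABLE {table}"
    return create_stmt, load_stmt


# Example with made-up names and paths.
create_stmt, load_stmt = hive_load_statements(
    "log_events",
    [("ts", "STRING"), ("level", "STRING"), ("message", "STRING")],
    "/tmp/log_events_out",
)
print(create_stmt)
print(load_stmt)
```

The goal of the question is to issue the equivalent of these two statements from inside the Spark application rather than by hand.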

1) Load it into HBase via PairRDDFunctions.saveAsHadoopDataset. There is a nice detailed email on how to do this here <http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201311.mbox/%3ccacyzca3askwd-tujhqi1805bn7sctguaoruhd5xtxcsul1a...@mail.gmail.com%3E>. I didn't get very far, though, because as soon as I added an HBase dependency (corresponding to the version of HBase we are running) to my pom.xml file, I had an slf4j dependency conflict that caused my current application to explode. I tried the latest released version and the slf4j dependency problem went away, but then the deprecated class TableOutputFormat no longer exists. Even if loading the data into HBase were trivially easy (and the detailed email suggests otherwise), I would then need to query HBase from Hive, which seems a little clunky.

2) So, I decided that Shark might be an easier option. All the examples provided in their documentation seem to assume that you are using Shark as an interactive application from a shell. Various threads I've seen seem to indicate that Shark isn't really intended to be used as a dependency in your Spark code (see this <https://groups.google.com/forum/#%21topic/shark-users/DHhslaOGPLg/discussion> and that <https://groups.google.com/forum/#%21topic/shark-users/2_Ww1xlIgvo/discussion>). It follows then that one can't add a Shark dependency to a pom.xml file because Shark isn't released via Maven Central (that I can tell... perhaps it's in some other repo?). Of course, there are ways of creating a local dependency in Maven, but it starts to feel very hacky.

I realize that I've given sufficient detail to expose my ignorance in a myriad of ways. Please feel free to shine light on any of my misconceptions!


Thanks,
Philip
