Philip, FWIW we do go with including Shark as a dependency for our needs, building a fat jar, and it works very well. It was quite a bit of pain, what with the Hadoop/Hive transitive dependencies, but for us it was worth it.
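In case a concrete shape helps, the core of what we do looks roughly
like the minimal sketch below. It assumes the Shark 0.8-style
programmatic API, where SharkEnv hands back a SharkContext that can run
HiveQL against the metastore; the paths, table name, and "parsing" here
are all made up, so adjust to taste:

import shark.{SharkContext, SharkEnv}

object LogsToHive {
  def main(args: Array[String]) {
    // A SharkContext is a SparkContext that can also execute HiveQL.
    val sc: SharkContext = SharkEnv.initWithSharkContext("LogsToHive")

    // Ordinary RDD work: read the raw log, transform, write back to HDFS.
    sc.textFile("hdfs:///logs/raw/2013-12-06.log")
      .map(_.split(" ", 3).mkString("\t"))   // placeholder transformation
      .saveAsTextFile("/warehouse/logs_transformed")

    // Then register the result with the metastore in the same application,
    // so it is immediately queryable from Hue or a Shark shell.
    sc.sql("""CREATE EXTERNAL TABLE IF NOT EXISTS logs_transformed
              (ts STRING, level STRING, msg STRING)
              ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
              LOCATION '/warehouse/logs_transformed'""")
  }
}

On the Maven question in your point 2: Shark isn't on Maven Central as
far as I can tell either, so we publish it into our own repository and
lean on exclusions for the conflicting slf4j and Hadoop/Hive transitive
artifacts when assembling the fat jar.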
I hope that serves as an existence proof that this Mt. Everest has been
climbed, likely by more than just ourselves. Going forward this should
only get easier. I've also appended, below your quoted message, a sketch
of the HBase route from your option 1.

--
Christopher T. Nguyen
Co-founder & CEO, Adatao <http://adatao.com>
linkedin.com/in/ctnguyen


On Fri, Dec 6, 2013 at 7:06 PM, Philip Ogren <[email protected]> wrote:

> I have a simple scenario that I'm struggling to implement. I would like
> to take a fairly simple RDD generated from a large log file, perform
> some transformations on it, and write the results out such that I can
> run a Hive query either from Hive (via Hue) or Shark. I'm having
> trouble with the last step. I am able to write my data out to HDFS and
> then execute a Hive create table statement followed by a load data
> statement as a separate step. I really dislike this separate manual
> step and would like to have it all accomplished in my Spark
> application. To this end, I have investigated two possible approaches,
> as detailed below - it's probably too much information, so I'll ask my
> more basic question first:
>
> Does anyone have a basic recipe/approach for loading data in an RDD
> into a Hive table from a Spark application?
>
> 1) Load it into HBase via PairRDDFunctions.saveAsHadoopDataset. There
> is a nice detailed email on how to do this here
> <http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201311.mbox/%3ccacyzca3askwd-tujhqi1805bn7sctguaoruhd5xtxcsul1a...@mail.gmail.com%3E>.
> I didn't get very far, though, because as soon as I added an hbase
> dependency (corresponding to the version of hbase we are running) to my
> pom.xml file, I had an slf4j dependency conflict that caused my current
> application to explode. I tried the latest released version, and the
> slf4j dependency problem went away, but then the deprecated class
> TableOutputFormat no longer exists. Even if loading the data into hbase
> were trivially easy (and the detailed email suggests otherwise), I
> would then need to query HBase from Hive, which seems a little clunky.
>
> 2) So, I decided that Shark might be an easier option. All the examples
> provided in its documentation seem to assume that you are using Shark
> as an interactive application from a shell. Various threads I've seen
> seem to indicate that Shark isn't really intended to be used as a
> dependency in your Spark code (see this
> <https://groups.google.com/forum/#%21topic/shark-users/DHhslaOGPLg/discussion>
> and that
> <https://groups.google.com/forum/#%21topic/shark-users/2_Ww1xlIgvo/discussion>).
> It follows, then, that one can't add a Shark dependency to a pom.xml
> file, because Shark isn't released via Maven Central (that I can
> tell... perhaps it's in some other repo?). Of course, there are ways of
> creating a local dependency in Maven, but it starts to feel very hacky.
>
> I realize that I've given sufficient detail to expose my ignorance in a
> myriad of ways. Please feel free to shine light on any of my
> misconceptions!
>
> Thanks,
> Philip
>
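P.S. For the archives, re your option 1: the recipe in the email you
linked boils down to something like the untested sketch below. It is
written against the older HBase "mapred" API (the deprecated
TableOutputFormat you mention); the table name, column family, and row
key scheme are made up, and it assumes an hbase-site.xml on the
classpath for the connection settings. As you say, you'd still have to
map the HBase table into Hive afterwards, which is indeed clunky -
hence our preference for the Shark route.

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapred.TableOutputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapred.JobConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._   // PairRDDFunctions implicits

object LogsToHBase {
  def main(args: Array[String]) {
    val spark = new SparkContext("local", "LogsToHBase")

    // The JobConf carries the HBase connection settings (picked up from
    // hbase-site.xml) plus the target table for TableOutputFormat.
    val jobConf = new JobConf(HBaseConfiguration.create())
    jobConf.setOutputFormat(classOf[TableOutputFormat])
    jobConf.set(TableOutputFormat.OUTPUT_TABLE, "logs") // hypothetical table

    // One Put per record: a row key plus a single column in family "cf".
    val puts = spark.textFile("hdfs:///logs/raw/2013-12-06.log").map { line =>
      val put = new Put(Bytes.toBytes(line.hashCode.toString)) // placeholder key
      put.add(Bytes.toBytes("cf"), Bytes.toBytes("line"), Bytes.toBytes(line))
      (new ImmutableBytesWritable, put)
    }

    // saveAsHadoopDataset writes each partition through TableOutputFormat.
    puts.saveAsHadoopDataset(jobConf)
  }
}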
