Any chance you could sketch out the Shark APIs that you use for this?
Matei's response suggests that the preferred API is coming in the next
release (i.e. RDDTable class in 0.8.1). Are you building Shark from the
latest in the repo and using that? Or have you figured out other API
calls that accomplish something similar?
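In case it helps frame the question, my (possibly wrong) reading of the
repo suggests usage along the following lines; the object and method
names below are just guesses from browsing the source, so please correct
me if I'm off:

    // Guessed usage of the 0.8.1 RDDTable API, assuming an existing
    // SparkContext sc; the table and column names are made up.
    import shark.api.RDDTable

    val events = sc.parallelize(Seq((1, "ERROR", "disk full"), (2, "INFO", "started")))
    RDDTable(events).saveAsTable("log_events", Seq("id", "level", "msg"))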
Thanks,
Philip
On 12/8/2013 2:44 AM, Christopher Nguyen wrote:
Philip, fwiw we do go with including Shark as a dependency for our
needs, making a fat jar, and it works very well. It was quite a bit of
pain what with the Hadoop/Hive transitive dependencies, but for us it
was worth it.
I hope that serves as an existence proof that says Mt Everest has been
climbed, likely by more than just ourselves. Going forward this should
be getting easier.
--
Christopher T. Nguyen
Co-founder & CEO, Adatao <http://adatao.com>
linkedin.com/in/ctnguyen <http://linkedin.com/in/ctnguyen>
On Fri, Dec 6, 2013 at 7:06 PM, Philip Ogren <[email protected]> wrote:
I have a simple scenario that I'm struggling to implement. I would
like to take a fairly simple RDD generated from a large log file,
perform some transformations on it, and write the results out such
that I can run a Hive query on them either from Hive (via Hue) or
from Shark. I'm having trouble with the last step. I am able to
write my data out to HDFS and then, as a separate step, execute a
Hive create table statement followed by a load data statement. I
really dislike this separate manual step and would like to have it
all accomplished in my Spark application. To this end, I have
investigated two possible approaches, detailed below - it's probably
too much information, so I'll ask my more basic question first:
Does anyone have a basic recipe/approach for loading data from an
RDD into a Hive table from a Spark application?
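For concreteness, the manual flow I have today looks roughly like the
following (the parsing, paths, and schema are made up for illustration):

    // Spark side: transform the log RDD and write tab-delimited text to HDFS.
    val logLines = sc.textFile("hdfs:///logs/app.log")
    val tabbed = logLines.map { line =>
      val (ts, level, msg) = parseLine(line)   // placeholder per-line transformation
      Seq(ts, level, msg).mkString("\t")
    }
    tabbed.saveAsTextFile("hdfs:///user/philip/staging/log_events")

    // Then, as a separate manual step run from Hue or the Hive CLI:
    //   CREATE TABLE IF NOT EXISTS log_events (ts STRING, level STRING, msg STRING)
    //     ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
    //   LOAD DATA INPATH '/user/philip/staging/log_events' INTO TABLE log_events;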
1) Load it into HBase via PairRDDFunctions.saveAsHadoopDataset.
There is a nice detailed email on how to do this here
<http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201311.mbox/%3ccacyzca3askwd-tujhqi1805bn7sctguaoruhd5xtxcsul1a...@mail.gmail.com%3E>.
I didn't get very far, though, because as soon as I added an HBase
dependency (corresponding to the version of HBase we are running)
to my pom.xml file, I had an slf4j dependency conflict that caused
my current application to explode. I tried the latest released
version and the slf4j dependency problem went away, but then the
deprecated class TableOutputFormat no longer existed. Even if
loading the data into HBase were trivially easy (and the detailed
email suggests otherwise), I would then need to query HBase from
Hive, which seems a little clunky.
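For reference, the pattern from that email, as I understand it, is
roughly the following (the table name, column family, and input pairs
are placeholders) - and it is exactly this TableOutputFormat usage that
falls apart against the newer HBase release:

    // Write (rowKey, message) pairs to HBase via the old (deprecated)
    // mapred TableOutputFormat, as in HBase 0.94-era examples.
    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.Put
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapred.TableOutputFormat
    import org.apache.hadoop.hbase.util.Bytes
    import org.apache.hadoop.mapred.JobConf
    import org.apache.spark.SparkContext._   // implicit conversion to PairRDDFunctions

    val jobConf = new JobConf(HBaseConfiguration.create())
    jobConf.setOutputFormat(classOf[TableOutputFormat])
    jobConf.set(TableOutputFormat.OUTPUT_TABLE, "log_events")

    // pairs: RDD[(String, String)] of (rowKey, message) - placeholder input
    pairs.map { case (rowKey, msg) =>
      val put = new Put(Bytes.toBytes(rowKey))
      put.add(Bytes.toBytes("cf"), Bytes.toBytes("msg"), Bytes.toBytes(msg))
      (new ImmutableBytesWritable(Bytes.toBytes(rowKey)), put)
    }.saveAsHadoopDataset(jobConf)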
2) So, I decided that Shark might be an easier option. All the
examples provided in their documentation seem to assume that you
are using Shark as an interactive application from a shell.
Various threads I've seen seem to indicate that Shark isn't really
intended to be used as a dependency in your Spark code (see this
<https://groups.google.com/forum/#%21topic/shark-users/DHhslaOGPLg/discussion>
and that
<https://groups.google.com/forum/#%21topic/shark-users/2_Ww1xlIgvo/discussion>).
It follows, then, that one can't add a Shark dependency to a pom.xml
file because Shark isn't released via Maven Central (as far as I can
tell... perhaps it's in some other repo?). Of course, there are ways
of creating a local dependency in Maven, but it starts to feel
very hacky.
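For what it's worth, what I was hoping to be able to write from my
Spark application (assuming a locally built Shark jar could be put on
the classpath) is something like the following; the SharkEnv/SharkContext
names come from skimming the source, so treat this as a guess rather
than working code:

    // Guessed names from the Shark source, assuming a locally built
    // Shark 0.8.x jar on the classpath - effectively pseudocode.
    import shark.{SharkContext, SharkEnv}

    val sc: SharkContext = SharkEnv.initWithSharkContext("log-loader")

    // Create the Hive table and load the data my Spark job wrote to HDFS.
    sc.sql("CREATE TABLE IF NOT EXISTS log_events (ts STRING, level STRING, msg STRING) " +
      "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'")
    sc.sql("LOAD DATA INPATH '/user/philip/staging/log_events' INTO TABLE log_events")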
I realize that I've given sufficient detail to expose my ignorance
in a myriad of ways. Please feel free to shine light on any of my
misconceptions!
Thanks,
Philip