Philip, FWIW we do go with including Shark as a dependency for our needs, building a fat jar, and it works very well. It was quite a bit of pain, what with the Hadoop/Hive transitive dependencies, but for us it was worth it.
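In case a concrete shape helps, the core of what we do looks roughly
like the minimal sketch below. It assumes the Shark 0.8-style
programmatic API, where SharkEnv hands back a SharkContext that can run
HiveQL against the metastore; the paths, table name, and "parsing" here
are all made up, so adjust to taste:

import shark.{SharkContext, SharkEnv}

object LogsToHive {
  def main(args: Array[String]) {
    // A SharkContext is a SparkContext that can also execute HiveQL.
    val sc: SharkContext = SharkEnv.initWithSharkContext("LogsToHive")

    // Ordinary RDD work: read the raw log, transform, write back to HDFS.
    sc.textFile("hdfs:///logs/raw/2013-12-06.log")
      .map(_.split(" ", 3).mkString("\t"))   // placeholder transformation
      .saveAsTextFile("/warehouse/logs_transformed")

    // Then register the result with the metastore in the same application,
    // so it is immediately queryable from Hue or a Shark shell.
    sc.sql("""CREATE EXTERNAL TABLE IF NOT EXISTS logs_transformed
              (ts STRING, level STRING, msg STRING)
              ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
              LOCATION '/warehouse/logs_transformed'""")
  }
}

On the Maven question in your point 2: Shark isn't on Maven Central as
far as I can tell either, so we publish it into our own repository and
lean on exclusions for the conflicting slf4j and Hadoop/Hive transitive
artifacts when assembling the fat jar.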
I hope that serves as an existence proof that this Mt. Everest has been
climbed, likely by more than just ourselves. Going forward this should
only get easier. I've also appended, below your quoted message, a sketch
of the HBase route from your option 1.

--
Christopher T. Nguyen
Co-founder & CEO, Adatao <http://adatao.com>
linkedin.com/in/ctnguyen


On Fri, Dec 6, 2013 at 7:06 PM, Philip Ogren <[email protected]> wrote:

> I have a simple scenario that I'm struggling to implement. I would like
> to take a fairly simple RDD generated from a large log file, perform
> some transformations on it, and write the results out such that I can
> run a Hive query either from Hive (via Hue) or Shark. I'm having
> trouble with the last step. I am able to write my data out to HDFS and
> then execute a Hive create table statement followed by a load data
> statement as a separate step. I really dislike this separate manual
> step and would like to have it all accomplished in my Spark
> application. To this end, I have investigated two possible approaches,
> as detailed below - it's probably too much information, so I'll ask my
> more basic question first:
>
> Does anyone have a basic recipe/approach for loading data in an RDD
> into a Hive table from a Spark application?
>
> 1) Load it into HBase via PairRDDFunctions.saveAsHadoopDataset. There
> is a nice detailed email on how to do this here
> <http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201311.mbox/%3ccacyzca3askwd-tujhqi1805bn7sctguaoruhd5xtxcsul1a...@mail.gmail.com%3E>.
> I didn't get very far, though, because as soon as I added an hbase
> dependency (corresponding to the version of hbase we are running) to my
> pom.xml file, I had an slf4j dependency conflict that caused my current
> application to explode. I tried the latest released version, and the
> slf4j dependency problem went away, but then the deprecated class
> TableOutputFormat no longer exists. Even if loading the data into hbase
> were trivially easy (and the detailed email suggests otherwise), I
> would then need to query HBase from Hive, which seems a little clunky.
>
> 2) So, I decided that Shark might be an easier option. All the examples
> provided in its documentation seem to assume that you are using Shark
> as an interactive application from a shell. Various threads I've seen
> seem to indicate that Shark isn't really intended to be used as a
> dependency in your Spark code (see this
> <https://groups.google.com/forum/#%21topic/shark-users/DHhslaOGPLg/discussion>
> and that
> <https://groups.google.com/forum/#%21topic/shark-users/2_Ww1xlIgvo/discussion>).
> It follows, then, that one can't add a Shark dependency to a pom.xml
> file, because Shark isn't released via Maven Central (that I can
> tell... perhaps it's in some other repo?). Of course, there are ways of
> creating a local dependency in Maven, but it starts to feel very hacky.
>
> I realize that I've given sufficient detail to expose my ignorance in a
> myriad of ways. Please feel free to shine light on any of my
> misconceptions!
>
> Thanks,
> Philip
>
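P.S. For the archives, re your option 1: the recipe in the email you
linked boils down to something like the untested sketch below. It is
written against the older HBase "mapred" API (the deprecated
TableOutputFormat you mention); the table name, column family, and row
key scheme are made up, and it assumes an hbase-site.xml on the
classpath for the connection settings. As you say, you'd still have to
map the HBase table into Hive afterwards, which is indeed clunky -
hence our preference for the Shark route.

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapred.TableOutputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapred.JobConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._   // PairRDDFunctions implicits

object LogsToHBase {
  def main(args: Array[String]) {
    val spark = new SparkContext("local", "LogsToHBase")

    // The JobConf carries the HBase connection settings (picked up from
    // hbase-site.xml) plus the target table for TableOutputFormat.
    val jobConf = new JobConf(HBaseConfiguration.create())
    jobConf.setOutputFormat(classOf[TableOutputFormat])
    jobConf.set(TableOutputFormat.OUTPUT_TABLE, "logs") // hypothetical table

    // One Put per record: a row key plus a single column in family "cf".
    val puts = spark.textFile("hdfs:///logs/raw/2013-12-06.log").map { line =>
      val put = new Put(Bytes.toBytes(line.hashCode.toString)) // placeholder key
      put.add(Bytes.toBytes("cf"), Bytes.toBytes("line"), Bytes.toBytes(line))
      (new ImmutableBytesWritable, put)
    }

    // saveAsHadoopDataset writes each partition through TableOutputFormat.
    puts.saveAsHadoopDataset(jobConf)
  }
}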
