I uncovered a fairly simple solution that I thought I would share for the curious. Hive provides a JDBC driver/client <https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients#HiveServer2Clients-JDBC> which can be used to execute Hive statements (in my case, to drop and create tables) from Java/Scala code. So, I execute a CREATE TABLE statement and then write my RDD in tab-delimited form to the HDFS directory specified in that statement. It was really easy to code up once I connected the dots (it seems obvious now!).

The only hiccup I ran into was caused by trying to use the wrong Hive dependency. In my case we have a CDH4 cluster, so it worked once I added the following to my pom file:

        <repository>
            <id>cloudera</id>
            <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
        </repository>
...
        <dependency>
            <groupId>org.apache.hive</groupId>
            <artifactId>hive-jdbc</artifactId>
            <version>0.10.0-cdh4.3.2</version>
        </dependency>
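
To make it concrete, here is a rough Scala sketch of the two pieces. This is only a sketch: the JDBC URL, table name, columns, and HDFS location are placeholders, and I happen to use the HiveServer2 driver class. I also write the data before creating the table so that saveAsTextFile does not complain about an existing output directory.

    import java.sql.DriverManager
    import org.apache.spark.SparkContext

    def writeToHive(sc: SparkContext): Unit = {
      // Write the transformed RDD as tab-delimited text to the directory
      // that the Hive table will point at.
      val rows = sc.textFile("hdfs:///logs/access.log")
        .map(_.split(" "))
        .map(fields => fields(0) + "\t" + fields(1))
      rows.saveAsTextFile("hdfs:///user/hive/warehouse/log_summary")

      // Drop and recreate the table over that directory via the Hive JDBC driver.
      Class.forName("org.apache.hive.jdbc.HiveDriver")
      val conn = DriverManager.getConnection(
        "jdbc:hive2://hive-host:10000/default", "", "")
      val stmt = conn.createStatement()
      try {
        stmt.execute("DROP TABLE IF EXISTS log_summary")
        stmt.execute(
          "CREATE EXTERNAL TABLE log_summary (host STRING, path STRING) " +
          "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t' " +
          "LOCATION '/user/hive/warehouse/log_summary'")
      } finally {
        stmt.close()
        conn.close()
      }
    }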


On 12/6/2013 6:06 PM, Philip Ogren wrote:
I have a simple scenario that I'm struggling to implement. I would like to take a fairly simple RDD generated from a large log file, perform some transformations on it, and write the results out such that I can perform a Hive query either from Hive (via Hue) or Shark. I'm having trouble with the last step. I am able to write my data out to HDFS and then execute a Hive CREATE TABLE statement followed by a LOAD DATA statement as a separate step. I really dislike this separate manual step and would like to have it all accomplished in my Spark application. To this end, I have investigated two possible approaches as detailed below - it's probably too much information, so I'll ask my more basic question first:
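
For concreteness, today this looks roughly like the sketch below (the table name, columns, and paths are just placeholders): the Spark side writes tab-delimited text to HDFS, and the Hive statements are then run by hand in Hue.

    // Spark side: write the transformed RDD as tab-delimited text to HDFS.
    // 'logLines' and 'parse' stand in for my actual log-parsing code.
    val rows = logLines.map(parse).map { case (host, hits) => host + "\t" + hits }
    rows.saveAsTextFile("hdfs:///tmp/log_summary")

    // Separate manual step, run afterwards in Hue/Hive (this is what I want to avoid):
    //   CREATE TABLE log_summary (host STRING, hits INT)
    //     ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
    //   LOAD DATA INPATH '/tmp/log_summary' INTO TABLE log_summary;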

Does anyone have a basic recipe/approach for loading data from an RDD into a Hive table from a Spark application?
