I uncovered a fairly simple solution that I thought I would share for
the curious. Hive provides a JDBC driver/client
<https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients#HiveServer2Clients-JDBC>
which can be used to execute Hive statements (in my case, to drop and
create tables) from Java/Scala code. So I execute a create table
statement and then write my RDD in tab-delimited form to the HDFS
directory specified in the create table statement. It was really easy
to code up once I connected the dots (it seems obvious now!). The only
hiccup I ran into was caused by pulling in the wrong Hive dependency.
Our cluster runs CDH4, so it worked once I added the following to my
POM file (a rough sketch of the code follows the dependency below):
<repository>
  <id>cloudera</id>
  <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
</repository>
...
<dependency>
  <groupId>org.apache.hive</groupId>
  <artifactId>hive-jdbc</artifactId>
  <version>0.10.0-cdh4.3.2</version>
</dependency>
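
For anyone who wants to see it end to end, here is a rough, minimal sketch of
the approach in Scala. The HiveServer2 URL, table name, column layout, and
HDFS paths are all made-up placeholders rather than values from my actual
job. One wrinkle: saveAsTextFile refuses to write into a directory that
already exists, so the sketch writes the RDD first and then points an
external table at that location (creating the table first also works, as
long as the target directory doesn't exist yet when the RDD is written).

import java.sql.DriverManager

import org.apache.spark.SparkContext

object RddToHiveSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[*]", "rdd-to-hive-sketch")

    // Transform the raw log into tab-delimited lines and write them to HDFS.
    // The output directory must match the LOCATION of the table created below.
    // Input path and field handling are placeholders -- substitute your own.
    sc.textFile("hdfs:///logs/input.log")
      .map(_.split("\\s+"))
      .map(fields => fields(0) + "\t" + fields(1))
      .saveAsTextFile("hdfs:///user/example/my_table")

    // Point a Hive table at that directory via the Hive JDBC driver
    // (org.apache.hive.jdbc.HiveDriver is the HiveServer2 driver class).
    Class.forName("org.apache.hive.jdbc.HiveDriver")
    val conn =
      DriverManager.getConnection("jdbc:hive2://hive-host:10000/default", "", "")
    try {
      val stmt = conn.createStatement()
      stmt.execute("DROP TABLE IF EXISTS my_table")
      // LOCATION must match the directory the RDD was written to above.
      stmt.execute(
        """CREATE EXTERNAL TABLE my_table (col1 STRING, col2 STRING)
          |ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
          |STORED AS TEXTFILE
          |LOCATION '/user/example/my_table'""".stripMargin)
      stmt.close()
    } finally {
      conn.close()
    }

    sc.stop()
  }
}

I made the table EXTERNAL in this sketch (my choice, not necessarily what you
need) so that dropping and recreating it never deletes the data the RDD
wrote; once the create statement runs, the table is queryable from Hive or
Shark without a separate load data step.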
On 12/6/2013 6:06 PM, Philip Ogren wrote:
I have a simple scenario that I'm struggling to implement. I would
like to take a fairly simple RDD generated from a large log file,
perform some transformations on it, and write the results out such
that I can perform a Hive query either from Hive (via Hue) or Shark.
I'm having trouble with the last step. I am able to write my data
out to HDFS and then execute a Hive create table statement followed by
a load data statement as a separate step. I really dislike this
separate manual step and would like to be able to have it all
accomplished in my Spark application. To this end, I have
investigated two possible approaches as detailed below - it's probably
too much information, so I'll ask my more basic question first:
Does anyone have a basic recipe/approach for loading data in an RDD to
a Hive table from a Spark application?