I uncovered a fairly simple solution that I thought I would share for
the curious. Hive provides a JDBC driver/client
<https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients#HiveServer2Clients-JDBC>
which can be used to execute Hive statements (in my case, to drop and
create tables) from Java/Scala code. So I execute a create table
statement and then write my RDD in tab-delimited form to the HDFS
directory specified in the create table statement. It was really easy
to code up once I connected the dots (it seems obvious now!). The only
hiccup I ran into was caused by pulling in the wrong Hive dependency.
Our cluster runs CDH4, so it worked once I added the following to my
POM file (a rough sketch of the code follows the dependency below):
<repository>
  <id>cloudera</id>
  <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
</repository>
...
<dependency>
  <groupId>org.apache.hive</groupId>
  <artifactId>hive-jdbc</artifactId>
  <version>0.10.0-cdh4.3.2</version>
</dependency>
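
For anyone who wants to see it end to end, here is a rough, minimal sketch of
the approach in Scala. The HiveServer2 URL, table name, column layout, and
HDFS paths are all made-up placeholders rather than values from my actual
job. One wrinkle: saveAsTextFile refuses to write into a directory that
already exists, so the sketch writes the RDD first and then points an
external table at that location (creating the table first also works, as
long as the target directory doesn't exist yet when the RDD is written).

import java.sql.DriverManager

import org.apache.spark.SparkContext

object RddToHiveSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[*]", "rdd-to-hive-sketch")

    // Transform the raw log into tab-delimited lines and write them to HDFS.
    // The output directory must match the LOCATION of the table created below.
    // Input path and field handling are placeholders -- substitute your own.
    sc.textFile("hdfs:///logs/input.log")
      .map(_.split("\\s+"))
      .map(fields => fields(0) + "\t" + fields(1))
      .saveAsTextFile("hdfs:///user/example/my_table")

    // Point a Hive table at that directory via the Hive JDBC driver
    // (org.apache.hive.jdbc.HiveDriver is the HiveServer2 driver class).
    Class.forName("org.apache.hive.jdbc.HiveDriver")
    val conn =
      DriverManager.getConnection("jdbc:hive2://hive-host:10000/default", "", "")
    try {
      val stmt = conn.createStatement()
      stmt.execute("DROP TABLE IF EXISTS my_table")
      // LOCATION must match the directory the RDD was written to above.
      stmt.execute(
        """CREATE EXTERNAL TABLE my_table (col1 STRING, col2 STRING)
          |ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
          |STORED AS TEXTFILE
          |LOCATION '/user/example/my_table'""".stripMargin)
      stmt.close()
    } finally {
      conn.close()
    }

    sc.stop()
  }
}

I made the table EXTERNAL in this sketch (my choice, not necessarily what you
need) so that dropping and recreating it never deletes the data the RDD
wrote; once the create statement runs, the table is queryable from Hive or
Shark without a separate load data step.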
On 12/6/2013 6:06 PM, Philip Ogren wrote:
I have a simple scenario that I'm struggling to implement. I would
like to take a fairly simple RDD generated from a large log file,
perform some transformations on it, and write the results out such
that I can perform a Hive query either from Hive (via Hue) or Shark.
I'm having trouble with the last step. I am able to write my data
out to HDFS and then execute a Hive create table statement followed by
a load data statement as a separate step. I really dislike this
separate manual step and would like to be able to have it all
accomplished in my Spark application. To this end, I have
investigated two possible approaches as detailed below - it's probably
too much information, so I'll ask my more basic question first:
Does anyone have a basic recipe/approach for loading data in an RDD to
a Hive table from a Spark application?