Hey Nicholas,

The best way to do this is to call rdd.mapPartitions() and pass a function that opens a JDBC connection to your database and writes out the records in each partition.
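Roughly like this (a minimal, untested sketch; it assumes your RDD is an RDD[(Int, String)], and the "events" table, column names, and JDBC URL are placeholders for your own):

import java.sql.DriverManager

// Placeholder URL; use your database's JDBC driver and connection string.
val url = "jdbc:postgresql://dbhost:5432/mydb?user=me&password=secret"

// rdd: RDD[(Int, String)], e.g. your cached dataset.
val rowsWritten = rdd.mapPartitions { rows =>
  // One connection per partition, opened on the executor that owns it.
  val conn = DriverManager.getConnection(url)
  val stmt = conn.prepareStatement("INSERT INTO events (id, payload) VALUES (?, ?)")
  var n = 0L
  try {
    rows.foreach { case (id, payload) =>
      stmt.setInt(1, id)
      stmt.setString(2, payload)
      stmt.addBatch()
      n += 1
      if (n % 10000 == 0) stmt.executeBatch()  // flush periodically
    }
    stmt.executeBatch()
  } finally {
    stmt.close()
    conn.close()
  }
  Iterator(n)  // mapPartitions must return an iterator
}.reduce(_ + _)  // running an action forces the writes to happen

The batching and the per-partition count are just illustrative; the key idea is one connection per partition rather than one per record.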
On the input path there is something called JdbcRDD that is relevant (a short usage sketch is at the end of this message):

http://spark.incubator.apache.org/docs/latest/api/core/index.html#org.apache.spark.rdd.JdbcRDD
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/JdbcRDD.scala#L73

- Patrick

On Thu, Mar 13, 2014 at 2:05 PM, Nicholas Chammas
<nicholas.cham...@gmail.com> wrote:
> My fellow welders,
>
> (Can we make that a thing? Let's make that a thing. :)
>
> I'm trying to wedge Spark into an existing model where we process and
> transform some data and then load it into an MPP database. I know that part
> of the sell of Spark and Shark is that you shouldn't have to copy data
> around like this, so please bear with me. :)
>
> Say I have an RDD of about 10GB in size that's cached in memory. What is the
> best/fastest way to push that data into an MPP database like Redshift? Has
> anyone done something like this?
>
> I'm assuming that pushing the data straight from memory into the database is
> much faster than writing the RDD to HDFS and then COPY-ing it from there
> into the database.
>
> Is there, for example, a way to perform a bulk load into the database that
> runs on each partition of the in-memory RDD in parallel?
>
> Nick
>
> ________________________________
> View this message in context: best practices for pushing an RDD into a
> database
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
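P.S. Here is the JdbcRDD usage sketch mentioned above, for the read side. It is untested, and the table, columns, key range, and URL are made up. JdbcRDD splits the query into partitions by binding the two '?' placeholders to sub-ranges of the key:

import java.sql.DriverManager
import org.apache.spark.rdd.JdbcRDD

// Hypothetical "events" table whose numeric key spans 1..1000000.
val events = new JdbcRDD(
  sc,
  () => DriverManager.getConnection("jdbc:postgresql://dbhost:5432/mydb"),
  "SELECT id, payload FROM events WHERE id >= ? AND id <= ?",
  1L, 1000000L,  // overall lower/upper bound of the key
  10,            // number of partitions (sub-ranges of the key)
  rs => (rs.getLong("id"), rs.getString("payload")))  // ResultSet -> record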