Thank you for the suggestions. I will look into both and report back. I'm also looking at a potential third option: Redshift's ability to COPY from SSH:
http://docs.aws.amazon.com/redshift/latest/dg/r_COPY.html

Is there some relatively straightforward way for a command sent via SSH to a worker node to yield all the data in the partition of an RDD that is resident on that node? (Sounds unlikely.)

Nick

On Thursday, March 13, 2014, Sandy Ryza <sandy.r...@cloudera.com> wrote:

> You can also call rdd.saveAsHadoopDataset and use the DBOutputFormat that
> Hadoop provides:
>
> http://hadoop.apache.org/docs/r2.3.0/api/org/apache/hadoop/mapred/lib/db/DBOutputFormat.html
>
> On Thu, Mar 13, 2014 at 4:17 PM, Patrick Wendell <pwend...@gmail.com> wrote:
>
>> Hey Nicholas,
>>
>> The best way to do this is to do rdd.mapPartitions() and pass a
>> function that will open a JDBC connection to your database and write
>> the range in each partition.
>>
>> On the input path there is something called JdbcRDD that is relevant:
>>
>> http://spark.incubator.apache.org/docs/latest/api/core/index.html#org.apache.spark.rdd.JdbcRDD
>> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/JdbcRDD.scala#L73
>>
>> - Patrick
>>
>> On Thu, Mar 13, 2014 at 2:05 PM, Nicholas Chammas
>> <nicholas.cham...@gmail.com> wrote:
>> > My fellow welders,
>> >
>> > (Can we make that a thing? Let's make that a thing. :)
>> >
>> > I'm trying to wedge Spark into an existing model where we process and
>> > transform some data and then load it into an MPP database. I know that
>> > part of the sell of Spark and Shark is that you shouldn't have to copy
>> > data around like this, so please bear with me. :)
>> >
>> > Say I have an RDD of about 10GB in size that's cached in memory. What
>> > is the best/fastest way to push that data into an MPP database like
>> > Redshift? Has anyone done something like this?
>> >
>> > I'm assuming that pushing the data straight from memory into the
>> > database is much faster than writing the RDD to HDFS and then COPY-ing
>> > it from there into the database.
>> >
>> > Is there, for example, a way to perform a bulk load into the database
>> > that runs on each partition of the in-memory RDD in parallel?
>> >
>> > Nick
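A minimal sketch of the per-partition JDBC write Patrick describes, written with foreachPartition since only the side effect is needed. The table name, columns, connection URL, and credentials are placeholders, and the RDD is assumed to be an RDD[(Int, String)]:

```scala
import java.sql.DriverManager

// Hypothetical connection details -- substitute your own. The JDBC driver
// jar must be on the executor classpath.
val jdbcUrl  = "jdbc:postgresql://my-redshift-host:5439/mydb"
val jdbcUser = "username"
val jdbcPass = "password"

rdd.foreachPartition { partition =>
  // One connection per partition; opening one per record would be far too slow.
  val conn = DriverManager.getConnection(jdbcUrl, jdbcUser, jdbcPass)
  conn.setAutoCommit(false)
  val stmt = conn.prepareStatement("INSERT INTO events (id, payload) VALUES (?, ?)")
  try {
    partition.foreach { case (id, payload) =>
      stmt.setInt(1, id)
      stmt.setString(2, payload)
      stmt.addBatch()
    }
    // Flush the batch and commit once per partition.
    stmt.executeBatch()
    conn.commit()
  } finally {
    stmt.close()
    conn.close()
  }
}
```

For Redshift specifically, each partition could instead write its rows to a file and issue a COPY, since Redshift handles bulk loads much faster than many small INSERTs.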
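On the input path, the JdbcRDD Patrick links to can be used roughly as follows. The table, query, bounds, and row mapping are hypothetical; the SQL must contain two '?' placeholders, which JdbcRDD fills with the lower and upper bound of each partition:

```scala
import java.sql.{DriverManager, ResultSet}
import org.apache.spark.rdd.JdbcRDD

// Reads rows with ids 1 to 1,000,000 across 10 partitions into an RDD[(Int, String)].
val loadedRdd = new JdbcRDD(
  sc,
  () => DriverManager.getConnection("jdbc:postgresql://my-host:5439/mydb", "username", "password"),
  "SELECT id, payload FROM events WHERE id >= ? AND id <= ?",
  1L,        // lowerBound
  1000000L,  // upperBound
  10,        // numPartitions
  (rs: ResultSet) => (rs.getInt("id"), rs.getString("payload"))
)
```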