Hmm, we actually read the CSV data from S3 now and were looking to avoid that. Unfortunately, we've experienced dreadful performance reading 100GB of text data for a job directly from S3 - our hope had been that connecting directly to Redshift would provide some boost.
We had been using 12 m3.xlarges, but increasing default parallelism (to 2x # of cpus across cluster) and increasing partitions during reading did not seem to help.

On Fri, Nov 14, 2014 at 6:51 PM, Xiangrui Meng <m...@databricks.com> wrote:

> Michael is correct. Using direct connection to dump data would be slow
> because there is only a single connection. Please use UNLOAD with ESCAPE
> option to dump the table to S3. See instructions at
> http://docs.aws.amazon.com/redshift/latest/dg/r_UNLOAD.html
>
> And then load them back using the redshift input format we wrote:
> https://github.com/databricks/spark-redshift (we moved the implementation
> to github/databricks). Right now all columns are loaded as string columns,
> and you need to do type casting manually. We plan to add a parser that can
> translate Redshift table schema directly to Spark SQL schema, but no ETA
> yet.
>
> -Xiangrui
>
> On Nov 14, 2014, at 3:46 PM, Michael Armbrust <mich...@databricks.com> wrote:
>
> I'd guess that its an s3n://key:secret_key@bucket/path from the UNLOAD
> command used to produce the data. Xiangrui can correct me if I'm wrong
> though.
>
> On Fri, Nov 14, 2014 at 2:19 PM, Gary Malouf <malouf.g...@gmail.com> wrote:
>
>> We have a bunch of data in RedShift tables that we'd like to pull in
>> during job runs to Spark. What is the path/url format one uses to pull
>> data from there? (This is in reference to using the
>> https://github.com/mengxr/redshift-input-format)
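For anyone following along, here is a rough sketch of the UNLOAD-then-load workflow Xiangrui describes. The RedshiftInputFormat package/class names are taken from the spark-redshift project and may differ from the current README, and the bucket, table, and column names are made up purely for illustration:

    import org.apache.spark.{SparkConf, SparkContext}
    // Assumed class from the spark-redshift repo; check its README for the
    // exact package name.
    import com.databricks.spark.redshift.RedshiftInputFormat

    object RedshiftUnloadExample {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("redshift-unload-load"))

        // Step 1 happens in Redshift itself, e.g. (names made up):
        //   UNLOAD ('select id, price from my_table')
        //   TO 's3://my-bucket/unload/my_table_'
        //   CREDENTIALS 'aws_access_key_id=...;aws_secret_access_key=...'
        //   ESCAPE;

        // Step 2: read the escaped UNLOAD files back with the Redshift input
        // format; keys are byte offsets, values are the parsed columns as strings.
        val records = sc.newAPIHadoopFile(
          "s3n://ACCESS_KEY:SECRET_KEY@my-bucket/unload/",  // placeholder path
          classOf[RedshiftInputFormat],
          classOf[java.lang.Long],
          classOf[Array[String]])

        // Step 3: every column comes back as a string for now, so cast by hand.
        val typed = records.map { case (_, cols) =>
          (cols(0).toLong, cols(1).toDouble)  // (id, price) in the made-up table
        }

        println(typed.count())
        sc.stop()
      }
    }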
