Hmm, we actually read the CSV data from S3 now and were looking to avoid that. Unfortunately, we've experienced dreadful performance reading 100GB of text data for a job directly from S3 - our hope had been that connecting directly to Redshift would provide some boost.
We had been using 12 m3.xlarges, but increasing default parallelism (to 2x # of cpus across cluster) and increasing partitions during reading did not seem to help.

On Fri, Nov 14, 2014 at 6:51 PM, Xiangrui Meng <m...@databricks.com> wrote:

> Michael is correct. Using direct connection to dump data would be slow
> because there is only a single connection. Please use UNLOAD with ESCAPE
> option to dump the table to S3. See instructions at
> http://docs.aws.amazon.com/redshift/latest/dg/r_UNLOAD.html
>
> And then load them back using the redshift input format we wrote:
> https://github.com/databricks/spark-redshift (we moved the implementation
> to github/databricks). Right now all columns are loaded as string columns,
> and you need to do type casting manually. We plan to add a parser that can
> translate Redshift table schema directly to Spark SQL schema, but no ETA
> yet.
>
> -Xiangrui
>
> On Nov 14, 2014, at 3:46 PM, Michael Armbrust <mich...@databricks.com> wrote:
>
> I'd guess that its an s3n://key:secret_key@bucket/path from the UNLOAD
> command used to produce the data. Xiangrui can correct me if I'm wrong
> though.
>
> On Fri, Nov 14, 2014 at 2:19 PM, Gary Malouf <malouf.g...@gmail.com> wrote:
>
>> We have a bunch of data in RedShift tables that we'd like to pull in
>> during job runs to Spark. What is the path/url format one uses to pull
>> data from there? (This is in reference to using the
>> https://github.com/mengxr/redshift-input-format)
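For anyone following along, here is a rough sketch of the UNLOAD-then-load workflow Xiangrui describes. The RedshiftInputFormat package/class names are taken from the spark-redshift project and may differ from the current README, and the bucket, table, and column names are made up purely for illustration:

    import org.apache.spark.{SparkConf, SparkContext}
    // Assumed class from the spark-redshift repo; check its README for the
    // exact package name.
    import com.databricks.spark.redshift.RedshiftInputFormat

    object RedshiftUnloadExample {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("redshift-unload-load"))

        // Step 1 happens in Redshift itself, e.g. (names made up):
        //   UNLOAD ('select id, price from my_table')
        //   TO 's3://my-bucket/unload/my_table_'
        //   CREDENTIALS 'aws_access_key_id=...;aws_secret_access_key=...'
        //   ESCAPE;

        // Step 2: read the escaped UNLOAD files back with the Redshift input
        // format; keys are byte offsets, values are the parsed columns as strings.
        val records = sc.newAPIHadoopFile(
          "s3n://ACCESS_KEY:SECRET_KEY@my-bucket/unload/",  // placeholder path
          classOf[RedshiftInputFormat],
          classOf[java.lang.Long],
          classOf[Array[String]])

        // Step 3: every column comes back as a string for now, so cast by hand.
        val typed = records.map { case (_, cols) =>
          (cols(0).toLong, cols(1).toDouble)  // (id, price) in the made-up table
        }

        println(typed.count())
        sc.stop()
      }
    }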
