Re: Sourcing data from RedShift

2014-11-18 Thread Gary Malouf
Hi guys,

We ultimately needed to add 8 EC2 xlarge nodes to get better performance.  As
was suspected, we could not fit all the data into RAM.

This worked great with files of around 100-350MB each, which is what our
initial export task produced.  Unfortunately, with the partition settings we
were able to get working with GraphX (we are unable to change the parallelism
because of a bug), we cannot keep writing files at that size - our output ends
up closer to 1GB per file.

As a result, our job struggles to work with 100GB worth of these files.  We
are in a rough spot: upgrading Spark right now is not reasonable for us, but
this bug prevents us from solving the issue.
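For reference, the step we would normally use to keep output files in the
100-350MB range is a repartition right before the save - a rough sketch below,
with made-up paths and a made-up partition count; because of the parallelism
bug we have not been able to apply the same idea inside the GraphX stage
itself.

import org.apache.spark.{SparkConf, SparkContext}

object ResizeOutput {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("resize-output"))

    // Illustrative paths only.
    val data = sc.textFile("s3n://our-bucket/graph-output/")

    // ~100GB at ~250MB per file works out to roughly 400 partitions;
    // repartition() shuffles to exactly that many, so saveAsTextFile writes
    // that many part files instead of ~1GB each.
    data.repartition(400).saveAsTextFile("s3n://our-bucket/graph-output-resized/")

    sc.stop()
  }
}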

Sourcing data from RedShift

2014-11-14 Thread Gary Malouf
We have a bunch of data in RedShift tables that we'd like to pull into Spark
during job runs.  What is the path/URL format one uses to pull data from
there?  (This is in reference to using
https://github.com/mengxr/redshift-input-format)


Re: Sourcing data from RedShift

2014-11-14 Thread Michael Armbrust
I'd guess that it's an s3n://key:secret_key@bucket/path URL - the S3 location
from the UNLOAD command used to produce the data.  Xiangrui can correct me if
I'm wrong, though.
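Something like the following should work for pointing Spark at the unloaded
files - the bucket and keys below are placeholders, and you can set the
credentials on the Hadoop configuration instead of embedding them in the URL
(secret keys containing '/' tend to break the embedded form):

import org.apache.spark.{SparkConf, SparkContext}

object ReadUnloadedFiles {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("read-unload"))

    // Option 1: credentials embedded in the s3n URL (placeholders, not real keys).
    val viaUrl = sc.textFile("s3n://ACCESS_KEY:SECRET_KEY@my-bucket/my_table/part_*")

    // Option 2: credentials on the Hadoop configuration, plain s3n path.
    sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "ACCESS_KEY")
    sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "SECRET_KEY")
    val viaConf = sc.textFile("s3n://my-bucket/my_table/part_*")

    // Use whichever of the two RDDs fits your setup.
    println("lines: " + viaConf.count())
    sc.stop()
  }
}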

Re: Sourcing data from RedShift

2014-11-14 Thread Gary Malouf
Hmm, we actually read the CSV data from S3 now and were looking to avoid
that.  Unfortunately, we've experienced dreadful performance reading 100GB of
text data for a job directly from S3 - our hope had been that connecting
directly to Redshift would provide some boost.

We had been using 12 m3.xlarge instances, but increasing the default
parallelism (to 2x the number of CPUs across the cluster) and increasing the
number of partitions during reading did not seem to help.

On Fri, Nov 14, 2014 at 6:51 PM, Xiangrui Meng m...@databricks.com wrote:

 Michael is correct. Using a direct connection to dump the data would be slow
 because there is only a single connection. Please use UNLOAD with the ESCAPE
 option to dump the table to S3. See the instructions at

 http://docs.aws.amazon.com/redshift/latest/dg/r_UNLOAD.html
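 For example, the UNLOAD can be issued over a plain JDBC connection - a rough
 sketch, with placeholder cluster endpoint, table, bucket, and credentials,
 assuming a PostgreSQL/Redshift JDBC driver is on the classpath:

 import java.sql.DriverManager

 object UnloadTable {
   def main(args: Array[String]): Unit = {
     // Placeholder endpoint and credentials.
     val conn = DriverManager.getConnection(
       "jdbc:postgresql://my-cluster.example.redshift.amazonaws.com:5439/mydb",
       "dbuser", "dbpassword")
     try {
       // ESCAPE makes Redshift backslash-escape delimiters and newlines inside
       // fields, so the files can be split safely when they are read back.
       conn.createStatement().execute(
         """UNLOAD ('select * from my_table')
           |TO 's3://my-bucket/my_table/part_'
           |CREDENTIALS 'aws_access_key_id=ACCESS_KEY;aws_secret_access_key=SECRET_KEY'
           |ESCAPE""".stripMargin)
     } finally {
       conn.close()
     }
   }
 }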

 And then load the files back using the Redshift input format we wrote:
 https://github.com/databricks/spark-redshift (we moved the implementation to
 github.com/databricks). Right now all columns are loaded as string columns,
 and you need to do the type casting manually. We plan to add a parser that
 can translate a Redshift table schema directly to a Spark SQL schema, but
 there is no ETA yet.
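 Until that parser exists, the manual casting looks roughly like the sketch
 below. This is a simplified stand-in that reads the files with plain textFile
 and a naive split on the default '|' delimiter (it ignores the ESCAPE
 handling, which is what the input format is for), and the Event schema is
 made up:

 import org.apache.spark.{SparkConf, SparkContext}

 // Made-up schema for illustration.
 case class Event(userId: Long, amount: Double, country: String)

 object LoadUnloadedTable {
   def main(args: Array[String]): Unit = {
     val sc = new SparkContext(new SparkConf().setAppName("load-unloaded"))

     // UNLOAD's default field delimiter is '|'; split(Char) treats it literally.
     val fields = sc.textFile("s3n://my-bucket/my_table/part_*").map(_.split('|'))

     // Every column arrives as a string, so cast each one by hand for now.
     val events = fields.map(f => Event(f(0).toLong, f(1).toDouble, f(2)))

     println("rows: " + events.count())
     sc.stop()
   }
 }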

 -Xiangrui

Re: Sourcing data from RedShift

2014-11-14 Thread Gary Malouf
I'll try this out and follow up with what I find.

On Fri, Nov 14, 2014 at 8:54 PM, Xiangrui Meng m...@databricks.com wrote:

 For each node, if the CSV reader is implemented efficiently, you should be
 able to hit at least half of the theoretical network bandwidth, which is
 about 60MB/second/node. So if you just do counting, the expected time should
 be within 3 minutes (100GB / (12 nodes * 60MB/s) is roughly 140 seconds).

 Note that your cluster has 15GB * 12 = 180GB of RAM in total. If you use the
 default spark.storage.memoryFraction (0.6, so roughly 108GB here), it can
 barely cache 100GB of data, and that is before accounting for overhead. So if
 your operation needs to cache the data to be efficient, you may need a larger
 cluster or need to change the storage level to MEMORY_AND_DISK.
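 If the job does need the cache, something like this works (placeholder path;
 MEMORY_AND_DISK keeps what fits within the storage fraction of the heap and
 spills the remaining partitions to local disk instead of dropping them):

 import org.apache.spark.{SparkConf, SparkContext}
 import org.apache.spark.storage.StorageLevel

 object CountWithSpill {
   def main(args: Array[String]): Unit = {
     val sc = new SparkContext(new SparkConf().setAppName("count-unloaded"))

     val data = sc.textFile("s3n://my-bucket/my_table/part_*")

     // Spill to local disk rather than recomputing from S3 when memory runs out.
     data.persist(StorageLevel.MEMORY_AND_DISK)

     println("rows: " + data.count())
     sc.stop()
   }
 }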

 -Xiangrui
