Re: Sourcing data from RedShift
Hi guys, we ultimately needed to add 8 EC2 XLs to get acceptable performance. As suspected, we could not fit all the data into RAM. This worked well with files around 100-350MB, as our initial export task produced. Unfortunately, with the partition settings we were able to get working with GraphX (we are unable to change parallelism due to a bug), we can no longer write files at that size; our output ends up closer to 1GB per file. As a result, our job struggles to work with 100GB worth of these files. We are in a rough spot, because upgrading Spark right now is not reasonable for us, but this bug prevents solving the issue.

On Fri, Nov 14, 2014 at 9:29 PM, Gary Malouf malouf.g...@gmail.com wrote: I'll try this out and follow up with what I find.

On Fri, Nov 14, 2014 at 8:54 PM, Xiangrui Meng m...@databricks.com wrote: For each node, if the CSV reader is implemented efficiently, you should be able to hit at least half of the theoretical network bandwidth, which is about 60MB/second/node. So if you just do counting, the expected time should be within 3 minutes. Note that your cluster has 15GB * 12 = 180GB RAM in total. If you use the default spark.storage.memoryFraction, it can barely cache 100GB of data, not counting overhead. So if your operations need the data cached to be efficient, you may need a larger cluster or change the storage level to MEMORY_AND_DISK. -Xiangrui

On Nov 14, 2014, at 5:32 PM, Gary Malouf malouf.g...@gmail.com wrote: Hmm, we actually read the CSV data in S3 now and were looking to avoid that. Unfortunately, we've experienced dreadful performance reading 100GB of text data for a job directly from S3; our hope had been that connecting directly to Redshift would provide some boost. We had been using 12 m3.xlarges, but increasing default parallelism (to 2x the number of CPUs across the cluster) and increasing partitions during reading did not seem to help.
On Fri, Nov 14, 2014 at 6:51 PM, Xiangrui Meng m...@databricks.com wrote: Michael is correct. Using a direct connection to dump data would be slow because there is only a single connection. Please use UNLOAD with the ESCAPE option to dump the table to S3. See the instructions at http://docs.aws.amazon.com/redshift/latest/dg/r_UNLOAD.html and then load the files back using the Redshift input format we wrote: https://github.com/databricks/spark-redshift (we moved the implementation to github/databricks). Right now all columns are loaded as string columns, and you need to do type casting manually. We plan to add a parser that can translate a Redshift table schema directly to a Spark SQL schema, but there is no ETA yet. -Xiangrui

On Nov 14, 2014, at 3:46 PM, Michael Armbrust mich...@databricks.com wrote: I'd guess that it's an s3n://key:secret_key@bucket/path from the UNLOAD command used to produce the data. Xiangrui can correct me if I'm wrong, though.

On Fri, Nov 14, 2014 at 2:19 PM, Gary Malouf malouf.g...@gmail.com wrote: We have a bunch of data in Redshift tables that we'd like to pull into Spark during job runs. What is the path/URL format one uses to pull data from there? (This is in reference to using https://github.com/mengxr/redshift-input-format.)
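Xiangrui's suggested export step can be sketched as a Redshift UNLOAD statement; the table name, bucket, and credentials below are placeholders, not values from this thread:

```sql
-- Hypothetical example: dump a table to S3 with ESCAPE so that
-- delimiters and newlines embedded in column values are escaped,
-- which the spark-redshift input format expects.
UNLOAD ('SELECT * FROM my_table')
TO 's3://my-bucket/my_table_dump/part_'
CREDENTIALS 'aws_access_key_id=<key>;aws_secret_access_key=<secret>'
ESCAPE
DELIMITER '|';
```

Without ESCAPE, a stray pipe or newline inside a text column silently corrupts rows when the files are parsed back.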
Sourcing data from RedShift
We have a bunch of data in Redshift tables that we'd like to pull into Spark during job runs. What is the path/URL format one uses to pull data from there? (This is in reference to using https://github.com/mengxr/redshift-input-format.)
Re: Sourcing data from RedShift
I'd guess that it's an s3n://key:secret_key@bucket/path from the UNLOAD command used to produce the data. Xiangrui can correct me if I'm wrong, though.
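Putting Michael's path format together with the input format from the repository above, the read side might look roughly like this in Spark's Scala API. This is a sketch, not a verified recipe: the package path of RedshiftInputFormat is a guess from the repository layout of that era, and the key/secret/bucket values are placeholders.

```scala
// Sketch only: the RedshiftInputFormat class and its key/value types are
// assumed from the spark-redshift input-format README and may differ.
import org.apache.spark.{SparkConf, SparkContext}
import com.databricks.examples.redshift.input.RedshiftInputFormat

val sc = new SparkContext(new SparkConf().setAppName("redshift-load"))

// s3n://<access_key>:<secret_key>@<bucket>/<prefix> -- the same prefix
// used in the UNLOAD command; the key and secret here are placeholders.
val path = "s3n://ACCESS_KEY:SECRET_KEY@my-bucket/my_table_dump"

val records = sc.newAPIHadoopFile(
  path,
  classOf[RedshiftInputFormat],
  classOf[java.lang.Long],    // record offset key
  classOf[Array[String]])     // one array of string columns per row

// All columns arrive as strings; cast manually for now, e.g.:
val firstColumnAsInt = records.map { case (_, cols) => cols(0).toInt }
```

Embedding credentials in the URL also means they can leak into logs and the Spark UI; setting fs.s3n.awsAccessKeyId/fs.s3n.awsSecretAccessKey in the Hadoop configuration is the usual alternative.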
Re: Sourcing data from RedShift
Hmm, we actually read the CSV data in S3 now and were looking to avoid that. Unfortunately, we've experienced dreadful performance reading 100GB of text data for a job directly from S3; our hope had been that connecting directly to Redshift would provide some boost. We had been using 12 m3.xlarges, but increasing default parallelism (to 2x the number of CPUs across the cluster) and increasing partitions during reading did not seem to help.
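The two knobs Gary mentions can be set roughly as below. The 12 nodes come from the thread and an m3.xlarge has 4 vCPUs, giving 48 cores and a 2x setting of 96; the path and app name are placeholders.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// 12 m3.xlarge nodes * 4 vCPUs = 48 cores; "2x the number of CPUs" = 96.
val conf = new SparkConf()
  .setAppName("s3-read")
  .set("spark.default.parallelism", "96")
val sc = new SparkContext(conf)

// Ask for more input partitions up front rather than relying on the
// default S3 split size (minPartitions is only a lower-bound hint).
val lines = sc.textFile("s3n://my-bucket/my_table_dump", minPartitions = 96)
```

Note that spark.default.parallelism mainly affects shuffle stages; the number of read partitions is driven by the input splits, which is why the two are tuned separately here.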
Re: Sourcing data from RedShift
I'll try this out and follow up with what I find.
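Xiangrui's sizing estimate earlier in the thread can be checked with quick arithmetic. The 0.6 and 0.9 fractions are the Spark 1.x defaults for spark.storage.memoryFraction and spark.storage.safetyFraction; assuming the cluster ran with defaults:

```scala
// Back-of-envelope: how much of the cluster's RAM is usable for
// cached RDD blocks under Spark 1.x default settings.
val nodes = 12
val ramPerNodeGB = 15.0        // m3.xlarge
val memoryFraction = 0.6       // spark.storage.memoryFraction default
val safetyFraction = 0.9       // spark.storage.safetyFraction default
val cacheableGB = nodes * ramPerNodeGB * memoryFraction * safetyFraction
// 12 * 15.0 * 0.6 * 0.9 = 97.2GB, just under the 100GB dataset and
// before Java object overhead -- hence "can barely cache 100GB of data".
```

When the data does not quite fit, persisting with StorageLevel.MEMORY_AND_DISK, as Xiangrui suggests, spills the overflow partitions to local disk instead of recomputing them on each access.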